Responsibilities
- Developing a chain-of-thought monitorability benchmark and using it to compare monitorability across frontier AI systems, leveraging AISI’s unique access to reasoning traces from multiple labs.
- Designing and running experiments on open-weight models to study alignment and oversight-relevant phenomena – such as reproducing emergent misalignment from reward hacking, or red-teaming techniques like inoculation prompting and character training.
- Using white-box and interpretability methods – such as activation oracles, sparse autoencoders, or probes – to detect misalignment that isn’t visible through behavioural evaluation alone.
- Building tooling and infrastructure for our research – including agent orchestration, large-scale RL pipelines, mechanistic interpretability workflows, and auditing agents.
- Reviewing frontier lab risk assessments and safety cases, providing independent analysis of alignment claims before deployment decisions.
- Conducting literature reviews and expert interviews to map the state of model transparency risks and inform AISI’s strategic priorities.
- Translating technical findings into actionable insights for AISI evaluation teams, UK government officials, and international partners.
Requirements
- A get-things-done mindset – you take ownership, move fast, and care about shipping work that matters.
- A combination of self-sufficiency and enthusiasm for teamwork – you’re equally happy defining your own agenda and contributing to shared goals.
- An enthusiasm for growing, for giving and receiving feedback, and for building something together.
- An ability to build, supervise, and orchestrate AI agents to complete tasks effectively, while verifying and maintaining the quality of their output.
- A demonstrated track record of relevant, high-quality work – whether technical publications, blog posts, or other publicly visible contributions.
Team
Structure: The Model Transparency team is a research team within AISI focused on ensuring that evaluations, assessments, and monitoring of frontier AI systems remain reliable as models become less transparent.
Additional Information
- The deadline for applying to this role is Sunday 24th May 2026, end of day, anywhere on Earth.