San Francisco, CA · Hybrid · $315,000 – $560,000 USD

Anthropic is hiring a Research Engineer, Interpretability

About the Role

As a Research Engineer on the Interpretability team, you will investigate the internal mechanisms of machine learning models to uncover how they process information and produce outputs, with the goal of making models safer and more trustworthy through better interpretability techniques.

Responsibilities

  • Develop tools and techniques to analyze the decision-making processes of neural networks
  • Collaborate with researchers to identify patterns in model activations and representations
  • Design experiments to probe model behavior across different inputs and contexts
  • Implement interpretability methods such as feature visualization and activation patching
  • Contribute to open-source projects related to model transparency
  • Publish findings in academic venues and internal reports
  • Work closely with engineering teams to integrate interpretability into model development
  • Evaluate the effectiveness of interpretability approaches on large-scale models
  • Help define best practices for auditing AI systems
  • Improve understanding of how models represent concepts internally
  • Support efforts to detect and mitigate unintended model behaviors
  • Translate research prototypes into scalable software tools
  • Maintain up-to-date knowledge of advances in interpretability literature
  • Assist in setting technical direction for interpretability initiatives
  • Communicate complex technical ideas to non-specialist stakeholders

Nice to Have

  • PhD in machine learning, neuroscience, cognitive science, or related field
  • Direct experience with interpretability methods such as circuit analysis or saliency maps
  • Track record of first-author publications at top-tier conferences
  • Experience working with large language models
  • Background in cognitive psychology or formal logic
  • Familiarity with causal inference techniques
  • Knowledge of reinforcement learning systems
  • Experience in software engineering at scale
  • Prior work in AI ethics or safety research
  • Demonstrated ability to lead research projects
  • Understanding of mechanistic interpretability frameworks
  • Proficiency with distributed computing environments
  • Experience mentoring junior researchers or engineers
  • History of collaboration across research and engineering functions
  • Familiarity with formal verification methods

Compensation

Competitive salary and benefits package

Work Arrangement

Hybrid or remote options available

Team

Part of a multidisciplinary research team focused on AI safety and alignment

Research Focus

The team investigates how neural networks form and use internal representations, aiming to map computations within models to human-understandable concepts.

Impact Goals

Work contributes to building safer AI systems by enabling detection of hidden behaviors and improving model accountability through transparent analysis methods.

Collaboration Style

Engineers work in close partnership with researchers and product teams, combining empirical investigation with engineering rigor to advance interpretability.

About Anthropic
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole.
Job Details
Department Interpretability