Hybrid (Remote-Friendly, Travel Required) | San Francisco, CA | New York City, NY | $320,000 – $485,000 USD

Anthropic is hiring a Research Engineer, Model Evaluations

About the Role

As a Research Engineer on the Model Evaluations team, you will build and maintain systems that test and analyze how well large language models perform on complex tasks, with a focus on safety, accuracy, and consistency.

Responsibilities

  • Develop automated systems to evaluate model performance across diverse tasks
  • Design benchmarks that measure reasoning, safety, and factual accuracy
  • Implement scalable testing infrastructure for large language models
  • Collaborate with researchers to translate theoretical evaluation concepts into code
  • Analyze model outputs to identify failure modes and biases
  • Iterate on evaluation methodologies based on empirical results
  • Maintain and improve existing evaluation pipelines
  • Ensure evaluations are reproducible and statistically sound
  • Integrate human feedback into automated assessment frameworks
  • Monitor model behavior across versions and updates
  • Contribute to documentation for evaluation protocols
  • Work with cross-functional teams to align evaluations with product goals
  • Optimize evaluation efficiency without sacrificing rigor
  • Stay current with advancements in AI evaluation techniques
  • Support red teaming exercises to uncover model vulnerabilities

Nice to Have

  • Advanced degree in a relevant technical discipline
  • Published research in machine learning or AI conferences
  • Direct experience evaluating language model outputs
  • Background in formal logic or reasoning benchmarks
  • Familiarity with adversarial testing methods
  • Experience with human-in-the-loop evaluation systems
  • Knowledge of reinforcement learning from human feedback
  • Track record of open-source contributions in AI evaluation
  • Experience working on AI alignment projects

Compensation

Competitive salary and benefits package

Work Arrangement

Full-time, on-site or hybrid availability

About the Team

  • You will join a collaborative research and engineering team focused on AI safety and model evaluation.
  • This role is part of a dedicated group focused on ensuring models behave as intended across a wide range of scenarios.
  • The team combines engineering rigor with research-driven inquiry to improve model reliability and trustworthiness.

What We Value

  • Rigorous empirical testing
  • Clear technical communication
  • Iterative improvement of evaluation methods
  • Collaboration between engineering and research roles


About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole.