About the Role
The engineer will build and maintain systems that evaluate how well large language models perform on complex tasks, with a focus on safety, accuracy, and consistency.
Responsibilities
- Develop automated systems to evaluate model performance across diverse tasks
- Design benchmarks that measure reasoning, safety, and factual accuracy
- Implement scalable testing infrastructure for large language models
- Collaborate with researchers to translate theoretical evaluation concepts into code
- Analyze model outputs to identify failure modes and biases
- Iterate on evaluation methodologies based on empirical results
- Maintain and improve existing evaluation pipelines
- Ensure evaluations are reproducible and statistically sound
- Integrate human feedback into automated assessment frameworks
- Monitor model behavior across versions and updates
- Contribute to documentation for evaluation protocols
- Work with cross-functional teams to align evaluations with product goals
- Optimize evaluation efficiency without sacrificing rigor
- Stay current with advancements in AI evaluation techniques
- Support red teaming exercises to uncover model vulnerabilities
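As one illustration of the reproducibility and statistical-soundness responsibilities above, here is a minimal sketch of scoring a batch of pass/fail evaluation results and attaching a bootstrap confidence interval to the pass rate. This is not the team's actual tooling; the function name and the sample data are hypothetical.

```python
import random

def pass_rate_with_ci(results, n_boot=1000, seed=0):
    """Pass rate over boolean eval results, plus a 95% bootstrap CI.

    A fixed seed keeps the interval reproducible across runs.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    rate = sum(results) / len(results)
    # Resample with replacement and collect the resampled pass rates.
    boots = sorted(
        sum(rng.choices(results, k=len(results))) / len(results)
        for _ in range(n_boot)
    )
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return rate, (lo, hi)

# Hypothetical per-task results from a single evaluation run.
results = [True] * 85 + [False] * 15
rate, (lo, hi) = pass_rate_with_ci(results)
print(f"pass rate {rate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Seeding the resampler is the kind of detail that makes an evaluation run repeatable end to end, which is what "reproducible and statistically sound" demands in practice.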
Nice to Have
- Advanced degree in a relevant technical discipline
- Published research in machine learning or AI conferences
- Direct experience evaluating language model outputs
- Background in formal logic or reasoning benchmarks
- Familiarity with adversarial testing methods
- Experience with human-in-the-loop evaluation systems
- Knowledge of reinforcement learning from human feedback
- Track record of open-source contributions in AI evaluation
- Experience working on AI alignment projects
Compensation
Competitive salary and benefits package
Work Arrangement
Full-time, on-site or hybrid availability
About the Team
- This role is part of a collaborative research and engineering group focused on AI safety and model evaluation, dedicated to ensuring models behave as intended across a wide range of scenarios.
- The team combines engineering rigor with research-driven inquiry to improve model reliability and trustworthiness.
What We Value
- Rigorous empirical testing
- Clear technical communication
- Iterative improvement of evaluation methods
- Collaboration between engineering and research roles