Responsibilities
- Build and scale automated evaluation pipelines (LLM-as-judge + human review) with clinical-grade benchmarks.
Requirements
- Proven experience designing agentic processes and LLM evaluation/benchmarking frameworks.
- Strong Python and ML background (PyTorch/TensorFlow, Hugging Face, LangChain/LlamaIndex).
- Demonstrated ability to design rigorous experiments and translate findings into production systems.
- Track record of published research or deep applied work in LLMs and agent evaluation.
- Strong communication and technical writing skills to articulate complex findings clearly.