Responsibilities
- Serve as the dedicated reliability owner for the Knowledge Work training environments, providing continuity of context and reducing the operational overhead of rotating ownership
- Own a clean, canonical set of evaluation tools and processes for Knowledge Work capabilities, including the process used for model releases
- Build and automate observability, dashboards, and operational tooling for our training environments and evaluation systems, with an emphasis on high signal-to-noise: a small set of trusted metrics and alerts rather than sprawling instrumentation
- Proactively harden environments and evaluation systems through load testing, fault injection, and stress testing at realistic scale, so failures surface early rather than during critical training work
- Act as the primary point of contact for partner training and infrastructure teams when issues in our environments arise, and drive incidents to resolution
- Reduce the operational burden on researchers so they can stay focused on research
Requirements
- Highly experienced Python engineer who ships reliable, well-instrumented code that teammates trust in production
- Demonstrated experience operating ML or distributed systems at scale, including significant on-call and incident-response experience
- Strong SRE or production-engineering mindset — reaching for SLOs, load tests, and failure injection before reaching for more dashboards
- Foundational ML knowledge sufficient to understand what a training environment or evaluation is actually measuring, and recognize when an evaluation has become stale or gameable
- Able to read research code and reason about evaluation integrity
Nice to Have
- 5+ years of experience operating ML or distributed systems at scale
- Experience building or operating RL environments, agent harnesses, or LLM evaluation frameworks
- Familiarity with reward modeling, evaluation design, or detecting and mitigating reward hacking
- Experience with observability stacks (metrics, tracing, structured logging) and operational dashboard tooling
- Background in chaos engineering, fault injection, or large-scale load testing
- Experience with data quality pipelines, drift detection, or evaluation-set curation and versioning
- Familiarity with large-scale training or inference infrastructure (schedulers, multi-agent orchestration, sandboxed execution)
- Prior experience as a dedicated reliability or operations owner embedded within a research team