Anthropic is seeking a Software Engineer to join our Observability team within the Infrastructure organization. You will own the monitoring and telemetry infrastructure that every engineer and researcher at Anthropic depends on, with a direct impact on the reliability and operational excellence of our research and product systems.
What You'll Do
- Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across Anthropic’s multi-cluster infrastructure.
- Own and evolve core observability platforms, driving migrations and architectural changes that improve reliability, reduce cost, and scale with organizational growth.
- Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services.
- Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal noise.
- Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling.
- Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions meet the unique needs of each organization.
What We're Looking For
- 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure.
- Deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics) and familiarity with the others.
- Understanding of high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale.
- Experience operating or building on top of observability platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems.
- Strong proficiency in at least one of Python, Rust, or Go.
- Excellent communication skills and enjoyment of partnering with internal teams to improve their operational visibility and incident response capabilities.
- Excitement about building foundational infrastructure and comfort working independently on ambiguous, high-impact technical challenges.
Nice to Have
- Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more).
- Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads.
- Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies.
- Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale.
- Experience with Kubernetes-native monitoring, eBPF-based observability, or continuous profiling.
- Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting.
Technical Stack
- Languages: Python, Rust, Go
- Platforms: Prometheus, Grafana, ClickHouse, OpenTelemetry, Kubernetes
Team & Environment
You will be part of the Observability team within the Infrastructure organization at Anthropic.
Benefits & Compensation
- Compensation range: $405,000-$485,000 USD
- Competitive compensation and benefits
- Optional equity donation matching
- Generous vacation and parental leave
- Flexible working hours
- Lovely office space
Work Mode
This role follows a hybrid work model and is based in San Francisco.
Anthropic is an equal opportunity employer.




