Optum Tech, part of UnitedHealth Group, is seeking a Principal Site Reliability Engineer (SRE) to lead the strategic design and implementation of resilient, observable, and high-performing systems across the organization. This role is for a hands-on technologist and strategic thinker passionate about reliability, automation, and innovation, particularly in applying AI to enhance SRE practices.
What You'll Do
- Lead the implementation and standardization of OpenTelemetry across services to enhance observability and traceability.
- Define and enforce SLIs, SLOs, and error budgets in collaboration with engineering teams.
- Design and execute resiliency tests, disaster recovery (DR) exercises, and chaos engineering game days to proactively identify system weaknesses.
- Develop automated failure injection and recovery validation tools.
- Enhance CI/CD pipelines with automated performance and load testing to ensure reliability and scalability before production deployment.
- Collaborate with DevOps and QA to integrate performance benchmarks into release gates.
- Drive cloud adoption strategies with a focus on resiliency patterns, multi-region failover, and cost-effective scaling.
- Partner with cloud architects to design fault-tolerant infrastructure and services.
- Explore and implement AI-driven solutions for anomaly detection, incident prediction, and intelligent alerting.
- Innovate with AI agents to automate routine SRE tasks and improve incident response efficiency.
- Serve as a thought leader and mentor for SRE best practices across the organization.
- Lead cross-functional initiatives to improve system reliability, developer productivity, and customer experience.
What We're Looking For
- 10+ years of experience in software engineering, DevOps, or SRE roles.
- At least 3 years in a principal or lead capacity.
- 5+ years of experience with CI/CD tooling (e.g., Jenkins, GitHub Actions, ArgoCD).
- 5+ years of experience with container orchestration on cloud platforms (Azure or AWS preferred).
- 3+ years of deep experience in observability and monitoring tools (e.g., OpenTelemetry, Prometheus, Grafana, Datadog).
- 3+ years of experience with chaos engineering, DR planning, and performance testing.
Nice to Have
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Hands-on experience with infrastructure as code (e.g., Terraform, Pulumi) and automation tools (e.g., Ansible, Helm).
- Experience with service mesh technologies (e.g., Istio, Linkerd).
- Familiarity with AI/ML concepts and experience applying them in operational contexts.
- Proven communication and leadership skills.
Technical Stack
- Observability: OpenTelemetry, Prometheus, Grafana, Datadog
- CI/CD: Jenkins, GitHub Actions, ArgoCD
- Cloud: Azure, AWS
- Infrastructure as Code: Terraform, Pulumi
- Automation: Ansible, Helm
- Service Mesh: Istio, Linkerd
Benefits & Compensation
- Compensation range: $134,600 to $230,800 annually.
- Comprehensive benefits package.
- Incentive and recognition programs.
- Equity stock purchase.
- 401(k) contribution.
Work Mode
This is a hybrid position open to candidates in the United States.
UnitedHealth Group is an Equal Employment Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or protected veteran status.



