Optum Tech, part of UnitedHealth Group, is seeking a Principal Site Reliability Engineer (SRE) to lead the strategic design and implementation of resilient, observable, and high-performing systems across the organization. This role is for a hands-on technologist and strategic thinker passionate about reliability, automation, and innovation, particularly in applying AI to enhance SRE practices.

What You'll Do

Lead the implementation and standardization of OpenTelemetry across services to enhance observability and traceability.
Define and enforce SLIs, SLOs, and error budgets in collaboration with engineering teams.
Design and execute resiliency tests, disaster recovery (DR) exercises, and chaos engineering game days to proactively identify system weaknesses.
Develop automated failure injection and recovery validation tools.
Enhance CI/CD pipelines with automated performance and load testing to ensure reliability and scalability before production deployment.
Collaborate with DevOps and QA to integrate performance benchmarks into release gates.
Drive cloud adoption strategies with a focus on resiliency patterns, multi-region failover, and cost-effective scaling.
Partner with cloud architects to design fault-tolerant infrastructure and services.
Explore and implement AI-driven solutions for anomaly detection, incident prediction, and intelligent alerting.
Innovate with AI agents to automate routine SRE tasks and improve incident response efficiency.
Serve as a thought leader and mentor for SRE best practices across the organization.
Lead cross-functional initiatives to improve system reliability, developer productivity, and customer experience.

What We're Looking For

10+ years of experience in software engineering, DevOps, or SRE roles.
At least 3 years in a principal or lead capacity.
5+ years of experience with CI/CD tooling (e.g., Jenkins, GitHub Actions, ArgoCD).
5+ years of experience with container orchestration on cloud platforms (Azure or AWS preferred).
3+ years of deep experience in observability and monitoring tools (e.g., OpenTelemetry, Prometheus, Grafana, Datadog).
3+ years of experience with chaos engineering, DR planning, and performance testing.

Nice to Have

Bachelor's degree in Computer Science, Information Technology, or a related field.
Hands-on experience with infrastructure as code (e.g., Terraform, Pulumi) and automation tools such as Ansible and Helm.
Experience with service mesh technologies (e.g., Istio, Linkerd).
Familiarity with AI/ML concepts and experience applying them in operational contexts.
Excellent communication and leadership skills.

Technical Stack

Observability: OpenTelemetry, Prometheus, Grafana, Datadog
CI/CD: Jenkins, GitHub Actions, ArgoCD
Cloud: Azure, AWS
Infrastructure as Code: Terraform, Pulumi
Automation & Orchestration: Ansible, Helm
Service Mesh: Istio, Linkerd

Benefits & Compensation

Compensation range: $134,600 to $230,800 annually.
Comprehensive benefits package.
Incentive and recognition programs.
Equity stock purchase.
401(k) contribution.

Work Mode

This is a hybrid position open to candidates in the United States.

UnitedHealth Group is an Equal Employment Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or protected veteran status.

