UnitedHealth Group / Optum is looking for a Lead Site Reliability Engineer to operate at the intersection of development and operations. You will ensure systems meet stringent production SLAs while empowering development teams to ship code faster and more safely. Our mission is to help people live healthier lives and make the health system work better for everyone, supported by a culture guided by inclusion, talented peers, and comprehensive benefits.
What You'll Do
- Champion the philosophy of 'automation first' to eliminate manual, repetitive operational tasks (toil).
- Design and implement robust automation solutions to allow engineers to focus on strategic projects.
- Implement and manage comprehensive monitoring, logging, and alerting systems to provide deep visibility into application performance and infrastructure health.
- Develop dashboards and tools that enable rapid detection and resolution of incidents.
- Act as a catalyst for DevOps culture and practices across development teams.
- Provide the tools, infrastructure, and guardrails necessary to accelerate the software delivery lifecycle securely and reliably.
- Lead the design and implementation of automated operational workflows for existing services and new service onboarding, including provisioning, deployment, scaling, and self-healing capabilities.
- Oversee incident response management, lead root cause analyses (post-mortems), and ensure action items are completed to prevent recurrence.
- Manage and optimize cloud infrastructure costs and efficiency using Infrastructure as Code (IaC) principles.
What We're Looking For
- Undergraduate degree or equivalent experience.
- 10+ years of experience in SRE, DevOps, Software Engineering, or a related operational capacity within a high-traffic production environment.
- Extensive experience in managing critical production systems, incident response, and leading post-mortem processes.
- Proven experience managing infrastructure and applications within a major public cloud environment (AWS, Azure, or GCP) at scale.
- Proven track record of automating complex, manual operational processes and improving engineering efficiency.
- Hands-on experience implementing and managing monitoring and logging stacks (e.g., Prometheus, Grafana, ELK stack/Elasticsearch, Datadog, Splunk).
- Solid experience with Infrastructure as Code tools such as Terraform or CloudFormation, and configuration management tools (Ansible, Chef, or Puppet).
- Proficiency in programming languages (e.g., Python, Go, Ruby, Java, or C#) used for automation, tooling development, and service management.
- Proven expertise with core cloud platform services (e.g., AWS services such as EC2, S3, RDS, Lambda, EKS/ECS).
- Proven expertise in Docker and Kubernetes for container orchestration and management (mandatory).
- Proven expertise in building and maintaining robust CI/CD pipelines (e.g., GitHub Actions, Jenkins, GitLab CI, Azure DevOps) and strong Git practices.
Technical Stack
- Monitoring & Logging: Prometheus, Grafana, ELK stack/Elasticsearch, Datadog, Splunk
- Infrastructure as Code: Terraform, CloudFormation
- Configuration Management: Ansible, Chef, Puppet
- Languages: Python, Go, Ruby, Java, C#
- AWS Services: EC2, S3, RDS, Lambda, EKS/ECS
- Containers & Orchestration: Docker, Kubernetes
- CI/CD: GitHub Actions, Jenkins, GitLab CI, Azure DevOps
- Version Control: Git
UnitedHealth Group is committed to mitigating environmental impact and enabling and delivering equitable care that addresses health disparities.



