The Lead Site Reliability Engineer plays a critical role in ensuring the reliability, scalability, and performance of our production systems. This position bridges the gap between development and operations by implementing robust automation, monitoring, and incident response practices. The ideal candidate will lead initiatives to reduce system downtime, improve service ownership, and drive a culture of continuous improvement. You will work closely with engineering teams to design resilient architectures, enforce SLOs, and optimize cloud infrastructure for both performance and cost-efficiency. This role requires deep technical expertise, strong leadership skills, and a commitment to operational excellence in complex, high-traffic environments.

Responsibilities

Lead efforts to improve system availability, performance, and scalability across production environments
Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) through proactive engineering practices
Promote an automation-first mindset to reduce manual toil and increase engineering efficiency
Design and deploy automated solutions for provisioning, deployment, scaling, and self-healing of services
Implement comprehensive monitoring, logging, and alerting systems for real-time system visibility
Create dashboards and diagnostic tools to accelerate incident detection and resolution
Foster a DevOps culture by equipping development teams with secure, reliable tools and infrastructure
Lead incident response efforts and conduct thorough root cause analyses to prevent future outages
Ensure continuous improvement by driving post-mortem action items to completion

Requirements

Bachelor's degree or equivalent practical experience
More than 10 years of experience in Site Reliability Engineering, DevOps, or software engineering in high-scale production environments
Extensive background in managing critical systems, handling incident response, and leading post-mortem reviews
Proven experience operating and scaling infrastructure in AWS, Azure, or GCP
Demonstrated success in automating complex operational workflows to improve efficiency
Hands-on experience with monitoring and logging technologies such as Prometheus, Grafana, ELK, Datadog, or Splunk
Strong proficiency with Infrastructure as Code tools like Terraform or CloudFormation and configuration management systems such as Ansible, Chef, or Puppet

Tech Stack

Prometheus, Grafana, ELK stack, Elasticsearch, Datadog, Splunk, Terraform, CloudFormation, Ansible, Chef, Puppet, Python, Go, Ruby, Java, C#, AWS, EC2, S3, RDS, Lambda, EKS, ECS, Docker, Kubernetes

Benefits

Comprehensive benefits package
Opportunities for career development and growth
Work environment rooted in inclusion

Compensation

Not specified

Work Arrangement

Not specified

Team

Engineer-focused environment emphasizing collaboration, reliability, and operational excellence across distributed systems

Inclusion
Caring
Connecting
Growing together

Additional Information

The ideal candidate has deep expertise in distributed systems and cloud computing, with a proven ability to scale SRE and DevOps practices in large production environments
Employees are expected to comply with employment contracts, company policies, and management directives, including potential changes in work location, team assignments, or work schedules
The company reserves the right to modify, update, or discontinue any policies or directives at its discretion

UnitedHealth Group / Optum is hiring a Lead Site Reliability Engineer
