North America

UnitedHealth Group / Optum is hiring a Lead Site Reliability Engineer

The Lead Site Reliability Engineer plays a critical role in ensuring the reliability, scalability, and performance of our production systems. This position bridges the gap between development and operations by implementing robust automation, monitoring, and incident response practices. The ideal candidate will lead initiatives to reduce system downtime, improve service ownership, and drive a culture of continuous improvement. You will work closely with engineering teams to design resilient architectures, enforce SLOs, and optimize cloud infrastructure for both performance and cost-efficiency. This role requires deep technical expertise, strong leadership skills, and a commitment to operational excellence in complex, high-traffic environments.

Responsibilities

  • Lead efforts to improve system availability, performance, and scalability across production environments
  • Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) through proactive engineering practices
  • Promote an automation-first mindset to reduce manual toil and increase engineering efficiency
  • Design and deploy automated solutions for provisioning, deployment, scaling, and self-healing of services
  • Implement comprehensive monitoring, logging, and alerting systems for real-time system visibility
  • Create dashboards and diagnostic tools to accelerate incident detection and resolution
  • Foster a DevOps culture by equipping development teams with secure, reliable tools and infrastructure
  • Lead incident response efforts and conduct thorough root cause analyses to prevent future outages
  • Ensure continuous improvement by driving post-mortem action items to completion

Requirements

  • Bachelor's degree or equivalent practical experience
  • Over 10 years of experience in Site Reliability Engineering, DevOps, or software engineering within high-scale production systems
  • Extensive background managing critical systems, incident response, and leading post-mortem reviews
  • Proven experience operating and scaling infrastructure in AWS, Azure, or GCP
  • Demonstrated success in automating complex operational workflows to improve efficiency
  • Hands-on experience with monitoring and logging technologies such as Prometheus, Grafana, ELK, Datadog, or Splunk
  • Strong proficiency with Infrastructure as Code tools like Terraform or CloudFormation and configuration management systems such as Ansible, Chef, or Puppet

Tech Stack

Prometheus, Grafana, ELK stack, Elasticsearch, Datadog, Splunk, Terraform, CloudFormation, Ansible, Chef, Puppet, Python, Go, Ruby, Java, C#, AWS, EC2, S3, RDS, Lambda, EKS, ECS, Docker, Kubernetes

Benefits

  • Comprehensive benefits package
  • Opportunities for career development and growth
  • Work environment rooted in inclusion

Compensation

Not specified

Work Arrangement

Not specified

Team

Engineering-focused environment emphasizing collaboration, reliability, and operational excellence across distributed systems

  • Inclusion
  • Caring
  • Connecting
  • Growing together

Additional Information

  • The ideal candidate has deep expertise in distributed systems and cloud computing, with a proven ability to scale SRE and DevOps practices in large production environments
  • Employees are expected to comply with employment contracts, company policies, and management directives, including potential changes in work location, team assignments, or work schedules
  • The company reserves the right to modify, update, or discontinue any policies or directives at its discretion

Required Skills

Prometheus, Grafana, ELK stack, Terraform, AWS, Azure, GCP, Ansible, Chef, Puppet, incident response, post-mortem, automation
About company
UnitedHealth Group / Optum
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. It connects people with care, pharmacy benefits, data, and resources.
Job Details
Department: Information Technology
Category: Infrastructure
Posted: 3 months ago