The Senior Site Reliability Engineer I plays a critical role in ensuring the reliability, scalability, and performance of our distributed systems. This position bridges the gap between development and operations by driving automation, improving system observability, and reducing manual intervention through proactive engineering solutions. The engineer will lead initiatives to define service level objectives, implement robust monitoring and alerting, and ensure rapid incident response and recovery. This role requires deep technical expertise, strong collaboration skills, and a relentless focus on system resilience and operational excellence.

Responsibilities

Develops monitoring queries and defines service level objectives to measure system reliability
Assists senior engineers during incident response and contributes to root cause analyses
Conducts post-incident reviews with detailed reporting on impact, timeline, and follow-up actions
Participates in disaster recovery testing to validate system resilience
Implements automation solutions and deploys code in production environments
Documents SRE practices and contributes to internal knowledge resources
Supports the design of infrastructure layouts and deployment processes
Tests system availability, reliability, and recovery capabilities in non-production settings
Benchmarks performance data to inform production readiness assessments
Applies advanced DevOps expertise in cloud infrastructure, CI/CD, containerization, and security
Participates in on-call rotations to support critical incident resolution
Validates failover mechanisms across geographic regions for production systems
Automates system recovery using Infrastructure-as-Code and configuration management tools
Leads scenario modeling for SLO breaches and designs responsive workflows
Writes advanced scripts for automated incident response, including rollbacks and failovers
Analyzes operational toil through ticket trends and recommends process improvements
Executes independent projects to eliminate repetitive manual work
Applies deep observability knowledge to diagnose complex system issues
Builds reusable observability dashboards and configurations via code templates
Guides the definition of appropriate SLOs and error budgets for services
Collaborates with cross-functional teams to migrate applications to standardized platforms
Provides technical guidance on implementing new platform features

Requirements

Proven experience across core SRE practices and principles
Understanding of monitoring and tracing in distributed systems with interdependencies
Ability to automate recovery processes to maintain service level agreements
Prior on-call experience supporting incident resolution
Track record of improving processes through practical contributions
Advanced hands-on skills in DevOps including monitoring, networking, cloud storage, containers, orchestration, CI/CD, and cloud security
Experience creating monitoring logic and setting performance baselines
History of supporting senior staff during major incidents
Active participation in post-mortem and RCA processes
Involvement in disaster recovery validation exercises
Direct experience deploying automation in production systems
Contributions to SRE documentation and knowledge repositories
Experience supporting the development of infrastructure diagrams and deployment workflows
Testing of system reliability and recoverability outside production
Documenting benchmark results for production readiness
On-call participation for major incident recovery
Testing of regional failover for systems and components
Automating recovery using Infrastructure-as-Code and configuration scripts
Producing comprehensive RCAs with executive summaries and risk assessments
Leading SLO breach scenario planning and response workflows

Tech Stack

Azure (including AKS), Terraform, GitHub, CI/CD pipelines, Java debugging, Helm charts, JFrog

Benefits

Comprehensive health, dental, and vision insurance
401(k) plan with company match
Generous paid time off and flexible work arrangements

Compensation

Competitive salary based on experience and qualifications

Additional Information

This role requires occasional on-call availability to support production systems
Candidates must be authorized to work in the United States without sponsorship

RELX is hiring a Senior Site Reliability Engineer I