Join a forward-thinking engineering team dedicated to building resilient, observable, and highly reliable cloud systems. In this role, you will drive site reliability engineering practices across infrastructure and applications, ensuring systems are robust, scalable, and well-monitored.

Key Responsibilities

Apply and promote the principles of the Well-Architected Framework, with a focus on system resiliency
Design and execute controlled chaos engineering tests to identify weaknesses and improve fault tolerance
Support cloud migration initiatives by evaluating workloads and minimizing operational disruption
Oversee migration progress to ensure smooth, reliable transitions to cloud environments
Improve observability through the design and implementation of monitoring, logging, and alerting solutions
Collaborate with IT teams to align observability strategies with business and technical requirements
Review cloud deployments for adherence to internal standards and reliability benchmarks
Identify and resolve gaps in system visibility and monitoring coverage
Stay current with emerging technologies and lead knowledge-sharing sessions across teams
Contribute to capacity planning, performance analysis, and system optimization efforts
Guide peers through technical mentorship and collaborative problem-solving
Evaluate and enhance the organization’s overall resilience posture
Participate in a rotating on-call schedule to support critical system reliability

Required Qualifications

Bachelor’s or Master’s degree in Computer Science or a related technical field
Minimum of 5 years of experience with cloud platforms, including at least 3 years focused on AWS
At least 3 years in a Site Reliability or similar infrastructure-focused role
Proven experience with monitoring, application performance monitoring (APM) tools, logging systems, and alerting platforms
Familiarity with incident, problem, and change management workflows
Deep understanding of SRE methodologies, including SLIs, SLOs, and error budgets
Strong diagnostic abilities and experience mentoring technical colleagues
Hands-on expertise with Kubernetes and containerized environments
Advanced skills in CI/CD pipelines and Infrastructure as Code tools such as Terraform (HCL) and AWS CloudFormation
Proficient with Git and version control best practices
Excellent organizational habits and documentation practices
Effective time management and research capabilities
Strong command of Linux systems, networking fundamentals, and scripting languages

Preferred Skills

Experience with message streaming platforms, particularly Kafka (MSK)
Working knowledge of relational databases including Postgres and MySQL
Proficiency in scripting or programming with Python or Go

Technology Environment

Our stack centers on AWS, Kubernetes, Terraform (HCL), AWS CloudFormation, Git, Linux, networking, scripting, APM tools, logging and notification systems, CI/CD pipelines, IaC, Kafka (MSK), Postgres, MySQL, Python, and Go.

What We Offer

Competitive compensation and benefits package
A stimulating technical environment that encourages innovation
Ongoing learning opportunities and access to international training programs

Work Environment

This role supports a culture centered on cloud resilience, continuous learning, peer mentorship, adherence to best practices, and driving organizational change toward greater system reliability.

XM Careers is hiring a Site Reliability Engineer
