As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our global gaming and betting platforms. You will bridge the gap between development and operations by applying engineering principles to operations challenges. This includes building and maintaining observability systems, enforcing SLOs, leading incident response, and driving continuous improvements in system resilience. You will work in a 24/7 environment with on-call responsibilities, ensuring minimal downtime and rapid recovery during incidents. Collaboration with development, operations, and service management teams is essential to foster a culture of reliability and operational excellence across the organization.

Responsibilities

Maintain near-zero downtime for observability systems that monitor platforms serving millions of users.
Design, implement, and manage monitoring, alerting, and observability infrastructure integrated with tools and cloud services such as Grafana, Splunk, and CloudWatch.
Perform capacity planning and performance tuning to handle traffic surges during major events.
Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services in collaboration with Service Management.
Work with Service Management to lead post-incident reviews, identify systemic issues, and implement corrective actions.
Help develop and maintain runbooks and incident response protocols for efficient issue resolution.
Build and maintain real-time dashboards for system health and performance across all platforms.
Create customized visual analytics for technical and business stakeholders to support data-driven decisions.
Optimize ingestion of data from time-series databases, log systems, cloud APIs, and custom sources.
Refine alerting rules and notification workflows to minimize noise while ensuring timely escalation of critical issues.
Implement and manage Application Performance Monitoring (APM) tools within the observability ecosystem.
Collaborate with development teams to integrate custom telemetry and metrics for actionable insights.
Lead and maintain a chaos testing framework, define failure scenarios, and support teams in executing resilience tests.
Conduct disaster recovery drills in isolated environments to validate recovery procedures and system robustness.
Apply chaos engineering principles to uncover system weaknesses and improve platform resilience.
Partner with engineering teams to enhance application reliability and deployment practices.
Mentor junior engineers and help mature SRE practices across the organization.
Contribute to architecture reviews by providing reliability input for new system designs.
Document system architecture, troubleshooting procedures, and operational knowledge for team use.
Build strong, trust-based relationships with stakeholders and support enterprise technology strategy by influencing without direct authority.
Make impartial, objective decisions based on clear criteria and ensure fair treatment across teams.
Work collaboratively toward shared goals aligned with organizational strategy, stepping into leadership when needed.
Adapt communication and approach to accommodate diverse perspectives and achieve effective outcomes.
Think strategically to support agility, faster delivery, and improved customer experience across the business.

Requirements

Proven experience with observability tools such as Prometheus, Grafana, ELK stack, or similar in high-availability production environments.
Hands-on experience with cloud platforms like AWS, Azure, or Google Cloud Platform, including deep knowledge of services and architectural patterns.
Extensive background in implementing reliability engineering practices in 24/7/365 operational environments.
Experience operating systems in highly regulated, security-compliant settings.
Strong scripting and programming skills in Python, Go, Bash, TypeScript, or Terraform for automation and infrastructure as code.
Demonstrated experience with CI/CD tools such as Jenkins, GitLab CI, Azure DevOps, or GitHub Actions.
Familiarity

Tech Stack

Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, CloudWatch, AWS, Azure, Google Cloud Platform, Docker, Kubernetes, Terraform, Ansible, Jenkins, GitLab CI

Benefits

Competitive salary and performance-based bonuses
Comprehensive health, dental, and vision insurance
Flexible working hours and remote work options
Annual learning and development budget for courses, certifications, and conferences
On-site gym and wellness programs
Free daily meals and snacks in the office
Generous paid time off and parental leave policies
Employee stock purchase plan and retirement savings matching

Work Arrangement

Hybrid (office and remote options available)

Team

You will join a global, cross-functional team of Site Reliability Engineers, DevOps specialists, and platform engineers who are passionate about system reliability, automation, and operational excellence. The team operates in an agile environment with a strong focus on collaboration, continuous improvement, and knowledge sharing across regions and time zones.

Additional Information

This role requires on-call availability on a rotating basis.
Candidates must be comfortable working in a fast-paced, high-pressure environment during critical incidents.
We value a blameless post-mortem culture focused on learning and systemic improvement.
The position supports 24/7 platform operations, requiring occasional weekend or evening work during incidents.
We are committed to diversity, inclusion, and creating an environment where all engineers can thrive.

Betfair Romania Development / Flutter Entertainment is hiring a Site Reliability Engineer

