Cluj-Napoca, Romania (Hybrid)

Betfair Romania Development / Flutter Entertainment is hiring a Site Reliability Engineer

As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our global gaming and betting platforms. You will bridge the gap between development and operations by applying engineering principles to operations challenges. This includes building and maintaining observability systems, enforcing SLOs, leading incident response, and driving continuous improvements in system resilience. You will work in a 24/7 environment with on-call responsibilities, ensuring minimal downtime and rapid recovery during incidents. Collaboration with development, operations, and service management teams is essential to foster a culture of reliability and operational excellence across the organization.

Responsibilities

  • Maintain near-zero downtime for observability systems that monitor platforms serving millions of users.
  • Design, implement, and manage monitoring, alerting, and observability infrastructure integrated with tools and cloud services such as Grafana, Splunk, and CloudWatch.
  • Perform capacity planning and performance tuning to handle traffic surges during major events.
  • Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services in collaboration with Service Management.
  • Work with Service Management to lead post-incident reviews, identify systemic issues, and implement corrective actions.
  • Help develop and maintain runbooks and incident response protocols for efficient issue resolution.
  • Build and maintain real-time dashboards for system health and performance across all platforms.
  • Create customized visual analytics for technical and business stakeholders to support data-driven decisions.
  • Optimize ingestion of data from time-series databases, log systems, cloud APIs, and custom sources.
  • Refine alerting rules and notification workflows to minimize noise while ensuring timely escalation of critical issues.
  • Implement and manage Application Performance Monitoring (APM) tools within the observability ecosystem.
  • Collaborate with development teams to integrate custom telemetry and metrics for actionable insights.
  • Lead and maintain a chaos testing framework, define failure scenarios, and support teams in executing resilience tests.
  • Conduct disaster recovery drills in isolated environments to validate recovery procedures and system robustness.
  • Apply chaos engineering principles to uncover system weaknesses and improve platform resilience.
  • Partner with engineering teams to enhance application reliability and deployment practices.
  • Mentor junior engineers and help mature SRE practices across the organization.
  • Contribute to architecture reviews by providing reliability input for new system designs.
  • Document system architecture, troubleshooting procedures, and operational knowledge for team use.
  • Build strong, trust-based relationships with stakeholders while supporting enterprise technology strategy without direct authority.
  • Make impartial, objective decisions based on clear criteria and ensure fair treatment across teams.
  • Work collaboratively toward shared goals aligned with organizational strategy, stepping into leadership when needed.
  • Adapt communication and approach to accommodate diverse perspectives and achieve effective outcomes.
  • Think strategically to support agility, faster delivery, and improved customer experience across the business.

Requirements

  • Proven experience with observability tools such as Prometheus, Grafana, ELK stack, or similar in high-availability production environments.
  • Hands-on experience with cloud platforms like AWS, Azure, or Google Cloud Platform, including deep knowledge of services and architectural patterns.
  • Extensive background in implementing reliability engineering practices in 24/7/365 operational environments.
  • Experience operating systems in highly regulated, security-compliant settings.
  • Strong scripting and programming skills in Python, Go, Bash, TypeScript, or Terraform for automation and infrastructure as code.
  • Demonstrated experience with CI/CD tools such as Jenkins, GitLab CI, Azure DevOps, or GitHub Actions.
  • Familiarity

Tech Stack

Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, CloudWatch, AWS, Azure, Google Cloud Platform, Docker, Kubernetes, Terraform, Ansible, Jenkins, GitLab CI

Benefits

  • Competitive salary and performance-based bonuses
  • Comprehensive health, dental, and vision insurance
  • Flexible working hours and remote work options
  • Annual learning and development budget for courses, certifications, and conferences
  • On-site gym and wellness programs
  • Free daily meals and snacks in the office
  • Generous paid time off and parental leave policies
  • Employee stock purchase plan and retirement savings matching

Work Arrangement

Hybrid (office and remote options available)

Team

You will join a global, cross-functional team of Site Reliability Engineers, DevOps specialists, and platform engineers who are passionate about system reliability, automation, and operational excellence. The team operates in an agile environment with a strong focus on collaboration, continuous improvement, and knowledge sharing across regions and time zones.

Additional Information

  • This role requires on-call availability on a rotating basis.
  • Candidates must be comfortable working in a fast-paced, high-pressure environment during critical incidents.
  • We value a blameless post-mortem culture focused on learning and systemic improvement.
  • The position supports 24/7 platform operations, requiring occasional weekend or evening work during incidents.
  • We are committed to diversity, inclusion, and creating an environment where all engineers can thrive.
Required Skills

Prometheus, Grafana, ELK Stack, CloudWatch, AWS, Microsoft Azure, GCP, Python, Go, Bash, TypeScript, Terraform, Jenkins, GitLab CI
About the company
Betfair Romania Development / Flutter Entertainment
Betfair Romania Development is the largest technology hub of Flutter Entertainment, powering the world’s leading sports betting and iGaming brands such as FanDuel, PokerStars, SportsBet, Betfair, Paddy Power, and Sky Betting & Gaming, delivering experiences to over 18 million customers worldwide.
Job Details

Department: Information Technology
Category: Infrastructure
Posted: 3 months ago