Cluj-Napoca, Cluj, Romania Hybrid Employment

Betfair Romania Development / Flutter Entertainment is hiring a Site Reliability Engineer

About the Role

Flutter is seeking a Site Reliability Engineer to ensure the reliability, availability, and performance of our critical global gaming and betting platforms. This role combines engineering and operational excellence to maintain 24/7/365 service availability for millions of customers. You will be responsible for enterprise-grade observability, disaster recovery, and business continuity capabilities across our AWS Cloud tenancy.

What You'll Do

  • Maintain 99.9%+ uptime for the Observability platform, which monitors systems serving millions of concurrent users.
  • Implement and maintain comprehensive monitoring, alerting, and observability solutions, and own the supporting tooling infrastructure, such as Grafana, Splunk, and CloudWatch.
  • Conduct capacity planning and performance optimization for peak loads during major sporting events.
  • Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services.
  • Partner with Service Management to drive continuous improvement through blameless post-mortems and implement resilience improvements.
  • Assist in developing and maintaining comprehensive runbooks and incident response procedures.
  • Deploy and maintain comprehensive monitoring dashboards and visualization solutions for real-time system visibility.
  • Create custom dashboards and visual analytics for business metrics, technical KPIs, and operational insights.
  • Configure and optimize data ingestion from diverse sources including time-series databases, log aggregation systems, and cloud monitoring services.
  • Implement and refine alerting rules and notification workflows to reduce alert fatigue.
  • Establish and maintain APM capabilities, integrating instrumentation and telemetry collection.
  • Collaborate with development teams to define, implement, and instrument custom business and technical metrics.
  • Own and maintain the chaos testing framework and tooling, define standard failure scenarios, and support product teams in executing tests.
  • Conduct disaster recovery fire drills in collaboration with product teams, coordinating complex testing scenarios in isolated environments.
  • Apply chaos engineering principles to proactively identify system weaknesses and vulnerabilities.
  • Partner with development teams to improve application reliability and deployment practices.
  • Mentor junior team members and contribute to the development of SRE practices across Flutter.
  • Participate in architecture reviews and provide reliability expertise for new system designs.
  • Document procedures, troubleshooting guides, and system architecture for knowledge sharing.

What We're Looking For

  • Extensive experience with monitoring and observability tools like Prometheus, Grafana, and the ELK stack at enterprise scale.
  • Demonstrated ability with cloud platforms including AWS, Azure, or Google Cloud Platform, with deep understanding of cloud services and architecture patterns.
  • Extensive experience implementing and maintaining reliability engineering practices in 24/7/365 production environments.
  • Experience delivering and operating systems in stringent security-compliant and highly regulated environments.
  • Strong scripting and programming abilities in Python, Go, Bash, TypeScript, or Terraform for automation and infrastructure as code.
  • Proven experience with CI/CD pipelines and tools like Jenkins, GitLab CI, Azure DevOps, GitHub Actions, or similar platforms.
  • Working knowledge of database technologies including SQL databases (PostgreSQL, MySQL) and NoSQL solutions.
  • Experience producing comprehensive, clear, and actionable technical documentation for operational procedures and runbooks.
  • Experience working in an agile environment with cross-functional teams.
  • Proficiency with containerization technologies including Docker and Kubernetes.

Nice to Have

  • Bonus points for previous software engineering experience, AWS certifications, or experience in highly regulated industries such as gaming, financial services, or healthcare.

Technical Stack

  • Monitoring & Observability: Prometheus, Grafana, ELK stack
  • Cloud Platforms: AWS, Azure, Google Cloud Platform
  • Languages & Automation: Python, Go, Bash, TypeScript, Terraform
  • CI/CD & DevOps: Jenkins, GitLab CI, Azure DevOps, GitHub Actions
  • Databases: PostgreSQL, MySQL
  • Containerization: Docker, Kubernetes

Team & Environment

You will be part of the Flutter Functions division, collaborating closely with development teams, infrastructure specialists, and business stakeholders.

Benefits & Compensation

  • Hybrid & remote working options
  • €1,000 per year for self-development
  • Company share scheme
  • 25 days of annual leave per year
  • 20 days per year to work abroad
  • 5 personal days/year
  • Flexible benefits: travel, sports, hobbies
  • Extended health, dental, and travel insurance
  • Customized well-being programmes
  • Career growth sessions
  • Thousands of online courses through Udemy
  • A variety of engaging office events

Work Mode

This is a hybrid role based in Cluj-Napoca, Romania.

Flutter is an equal opportunity employer. Our culture is built on winning together, raising the bar, having each other's backs, owning our work, and making a positive impact.

Required Skills
Prometheus, Grafana, ELK stack, AWS, Azure, Google Cloud Platform, Python, Go, Bash, TypeScript, Terraform, monitoring, observability, cloud architecture, automation, infrastructure as code
About company
Betfair Romania Development / Flutter Entertainment

Betfair Romania Development is the largest technology hub of Flutter Entertainment, powering the world’s leading sports betting and iGaming brands such as FanDuel, PokerStars, SportsBet, Betfair, Paddy Power, and Sky Betting & Gaming, delivering experiences to over 18 million customers worldwide.

Job Details
Department: Information Technology
Category: Infrastructure
Posted 14 days ago