Flutterbe is seeking a Site Reliability Engineer to ensure the reliability, availability, and performance of our critical global gaming and betting platforms. This role combines engineering and operational excellence to maintain 24/7/365 service availability for millions of customers. You will be responsible for enterprise-grade observability, disaster recovery, and business continuity capabilities across our AWS Cloud tenancy.
What You'll Do
- Maintain 99.9%+ uptime for the observability platform that monitors systems serving millions of concurrent users.
- Implement and maintain comprehensive monitoring, alerting, and observability solutions, owning tooling infrastructure such as Grafana, Splunk, and CloudWatch.
- Conduct capacity planning and performance optimization for peak loads during major sporting events.
- Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services.
- Partner with Service Management to drive continuous improvement by running blameless post-mortems and implementing resilience improvements.
- Assist in developing and maintaining comprehensive runbooks and incident response procedures.
- Deploy and maintain comprehensive monitoring dashboards and visualization solutions for real-time system visibility.
- Create custom dashboards and visual analytics for business metrics, technical KPIs, and operational insights.
- Configure and optimize data ingestion from diverse sources including time-series databases, log aggregation systems, and cloud monitoring services.
- Implement and refine alerting rules and notification workflows to reduce alert fatigue.
- Establish and maintain APM capabilities, integrating instrumentation and telemetry collection.
- Collaborate with development teams to define, implement, and instrument custom business and technical metrics.
- Own and maintain the chaos testing framework and tooling, define standard failure scenarios, and support product teams in executing tests.
- Conduct disaster recovery fire drills in collaboration with product teams, coordinating complex testing scenarios in isolated environments.
- Apply chaos engineering principles to proactively identify system weaknesses and vulnerabilities.
- Partner with development teams to improve application reliability and deployment practices.
- Mentor junior team members and contribute to the development of SRE practices across Flutter.
- Participate in architecture reviews and provide reliability expertise for new system designs.
- Document procedures, troubleshooting guides, and system architecture for knowledge sharing.
What We're Looking For
- Extensive experience with monitoring and observability tools like Prometheus, Grafana, and the ELK stack at enterprise scale.
- Demonstrated proficiency with cloud platforms such as AWS, Azure, or Google Cloud Platform, with a deep understanding of cloud services and architecture patterns.
- Extensive experience implementing and maintaining reliability engineering practices in 24/7/365 production environments.
- Experience delivering and operating systems in stringent security-compliant and highly regulated environments.
- Strong scripting and programming abilities in Python, Go, Bash, TypeScript, or Terraform for automation and infrastructure as code.
- Proven experience with CI/CD pipelines and tools like Jenkins, GitLab CI, Azure DevOps, GitHub Actions, or similar platforms.
- Working knowledge of database technologies including SQL databases (PostgreSQL, MySQL) and NoSQL solutions.
- Experience producing comprehensive, clear, and actionable technical documentation for operational procedures and runbooks.
- Experience working in an agile environment with cross-functional teams.
- Proficiency with containerization technologies including Docker and Kubernetes.
Nice to Have
- Bonus points for previous software engineering experience, AWS certifications, or experience in highly regulated industries such as gaming, financial services, or healthcare.
Technical Stack
- Monitoring & Observability: Prometheus, Grafana, ELK stack
- Cloud Platforms: AWS, Azure, Google Cloud Platform
- Languages & Automation: Python, Go, Bash, TypeScript, Terraform
- CI/CD & DevOps: Jenkins, GitLab CI, Azure DevOps, GitHub Actions
- Databases: PostgreSQL, MySQL
- Containerization: Docker, Kubernetes
Team & Environment
You will be part of the Flutter Functions division, collaborating closely with development teams, infrastructure specialists, and business stakeholders.
Benefits & Compensation
- Hybrid & remote working options
- €1,000 per year for self-development
- Company share scheme
- 25 days of annual leave
- 20 days per year to work abroad
- 5 personal days per year
- Flexible benefits: travel, sports, hobbies
- Extended health, dental, and travel insurance
- Customized well-being programmes
- Career growth sessions
- Thousands of online courses through Udemy
- A variety of engaging office events
Work Mode
This is a hybrid role based in Cluj-Napoca, Romania.
Flutterbe is an equal opportunity employer. Our culture is built on winning together, raising the bar, having each other's backs, owning our work, and making a positive impact.




