Investigate system incidents, drive Root Cause Analyses (RCAs), and execute long-term remedial fixes. Proactively reduce the number of incidents caused by system changes.
Define and enforce Service Level Agreements (SLAs), Service Level Objectives (SLOs), and success metrics for new initiatives. Build and maintain comprehensive dashboards to achieve observability excellence.
Identify and help resolve performance bottlenecks. Optimize infrastructure and code to keep services fast, and conduct capacity planning to forecast future hardware or cloud resource requirements.
Ensure the Platform components remain highly available and functional for users. Oversee deployments to ensure new code does not disrupt the existing system.

Deep experience building dashboards and tracking SLAs/SLOs using tools like Prometheus, Grafana, Coralogix, Splunk, or Loki.
Proficiency in scripting and coding to automate manual tasks (eliminate 'toil') and build reliability tools.
Experience provisioning and managing infrastructure using Terraform or Ansible, along with a solid understanding of cloud platforms (AWS, GCP, or Azure).
Hands-on experience scaling and managing distributed systems using Kubernetes (K8s) and Docker.
Familiarity with deployment pipelines (GitLab CI, GitHub Actions, TeamCity, Octopus Deploy) to ensure safe, automated rollouts that don't cause incidents.
Strong analytical skills for Root Cause Analysis (RCA), a calm approach to incident response, and the ability to lead blameless post-mortems.

Strong skills in .NET, Python, PowerShell, or Bash
Experience with Microsoft SQL Server, PostgreSQL, and Couchbase databases
Experience with AWS cloud infrastructure, CDNs, and various other systems running across multiple data centres and environments
Experience with cloud application load balancers, preferably AWS ALB
Experience supporting cloud DNS services such as AWS Route 53, GCP Cloud DNS, or Azure DNS

Betsson Group is hiring a Site Reliability Engineer