Responsibilities
- Investigate system incidents, drive Root Cause Analyses (RCAs), and implement long-term fixes. Proactively reduce the number of incidents caused by system changes.
- Define and enforce Service Level Agreements (SLAs), Service Level Objectives (SLOs), and success metrics for new initiatives. Build and maintain dashboards that give clear visibility into system health and performance.
- Identify and help resolve performance bottlenecks. Optimize infrastructure and code to maintain fast service, and conduct capacity planning to forecast future hardware or cloud resource requirements.
- Ensure the Platform components remain highly available and functional for users. Oversee deployments so that new code does not disrupt the existing system.
Requirements
- Deep experience building dashboards and tracking SLAs/SLOs using tools like Prometheus, Grafana, Coralogix, Splunk, or Loki.
- Proficiency in scripting and coding to automate manual tasks (eliminate 'toil') and build reliability tools.
- Experience provisioning and managing infrastructure using Terraform or Ansible, along with a solid understanding of cloud platforms (AWS, GCP, or Azure).
- Hands-on experience scaling and managing distributed systems using Kubernetes (K8s) and Docker.
- Familiarity with deployment pipelines (GitLab CI, GitHub Actions, TeamCity, Octopus Deploy) to ensure safe, automated rollouts that don't cause incidents.
- Strong analytical skills for Root Cause Analysis (RCA), a calm approach to incident response, and the ability to lead blameless post-mortems.
Nice to Have
- Strong skills in .NET, Python, PowerShell, or Bash
- Experience with Microsoft SQL Server, PostgreSQL, and Couchbase
- Experience with AWS cloud infrastructure, CDNs, and other systems running across multiple data centres and environments
- Experience with cloud application load balancers, preferably AWS ALB
- Experience supporting cloud DNS services such as AWS Route 53, GCP Cloud DNS, or Azure DNS