What You'll Do
Design and deploy reliability-focused solutions that strengthen system resilience and reduce operational burden. Champion automation-first strategies to streamline incident response, capacity planning, and service monitoring across distributed platforms.
Work closely with engineering and product teams to refine Service Level Objectives (SLOs) and Service Level Indicators (SLIs), ensuring realistic error budgets and measurable reliability standards. Enhance CI/CD workflows to support robust DevOps practices and faster, safer deployments.
Lead root cause analyses following incidents, promoting a blameless culture where insights drive systemic improvements. Participate in on-call rotations to maintain continuous service availability and contribute to disaster recovery planning.
Mentor engineers across teams, sharing SRE principles and helping embed reliability thinking into daily workflows. Stay informed on emerging technologies and practices, evaluating tools that advance observability, scalability, and system health.
Requirements
- Demonstrate a strong automation mindset, using scripting to eliminate repetitive tasks and reduce toil
- Possess deep knowledge of SLOs, SLIs, error budgets, and architectures built for high availability
- Have experience managing production incidents and leading post-incident reviews with a focus on continuous improvement
- Apply best practices in logging, monitoring, and alerting to ensure full system observability
- Show practical understanding of data structures and modern data processing engines
- Communicate effectively with technical and non-technical stakeholders to advocate for reliability initiatives
- Display a commitment to coaching others and fostering a culture of operational excellence
Preferred Qualifications
- Five or more years in software engineering, site reliability, or cloud infrastructure roles
- Hands-on experience with DevOps platforms such as GitHub, Azure DevOps, GitLab, or Jenkins
- Proficiency in building cloud-native, service-oriented systems at scale
- Strong programming skills in Python, Go, Java, or C#/.NET
- Familiarity with observability tools like Prometheus, Grafana, or OpenTelemetry
- Experience improving CI/CD pipelines and automating deployment workflows
- Background in global SaaS environments requiring 24/7 uptime
- Knowledge of redundancy, failover, and disaster recovery strategies
- Ability to collaborate across technical and business functions
- Experience with Agile methodologies and delivering complex technical projects
- Skill in problem-solving, analysis, and clear communication
- Exposure to chaos engineering or AIOps concepts is a plus
Benefits
- Comprehensive health, dental, and vision insurance
- Parental leave for primary and secondary caregivers
- Flexible work arrangements
- Two week-long, company-wide breaks each year
- Additional time off beyond standard vacation
- Long-term incentive program
- Annual training investment for professional growth

