SITA Switzerland Sarl is looking for a Lead Site Reliability Engineer to provide proactive support for our products, ensuring high performance and continuous improvement. In this role, you will focus on identifying root causes of operational incidents, implementing solutions to improve stability, and managing operational automation and integration.
What You'll Do
- Define, build, and maintain support systems to ensure high availability and performance.
- Handle complex cases for the PSO and perform incident response and root cause analysis (RCA) for critical system failures.
- Implement automation for system provisioning, self-healing, auto-recovery, deployment, and monitoring.
- Monitor system performance and establish Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs).
- Collaborate with Development and Operations to integrate reliability best practices, including zero-downtime architecture.
- Proactively identify and remediate performance issues.
- Work closely with Product T&E, ICE, and Service Architects on the productization of new products, acting as the SGS technical expert.
- Coordinate with internal and external stakeholders to improve service performance and ensure high availability.
- Ensure Operations is ready to support new products, and be accountable within SGS for the availability and performance of in-scope products.
- Conduct thorough problem investigations and root cause analyses to diagnose recurring incidents and service disruptions.
- Coordinate with Incident Management teams and collaborate with PSOs and Engineering/Product teams to implement permanent solutions.
- Monitor the effectiveness of problem resolution activities and report regularly to support continuous improvement.
- Define, build, and maintain an event catalog specifying active events, thresholds, and remediation actions; optimize it for efficiency.
- Develop event response protocols, provide training, and ensure efficient incident handling.
- Collaborate with Customer Success Managers to implement initiatives that enhance customer satisfaction and retention.
- Prepare reports, documentation, and communication materials covering customer metrics, updates, and product changes.
- Identify and implement improvements in internal processes and workflows and contribute to knowledge management resources.
- Implement data governance policies defined by the Data Owner and ensure adherence to standards.
- Monitor data quality, consistency, and compliance on an ongoing basis.
- Act as a Subject Matter Expert (SME) for data within the assigned area, providing guidance and answering queries.
What We're Looking For
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
- 10+ years of experience in IT operations, service management, or infrastructure management, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Manager.
- Proven experience managing high-availability systems and ensuring operational reliability.
- Extensive experience in root cause analysis (RCA), incident management, and developing permanent solutions for recurring service disruptions.
- Hands-on experience with CI/CD pipelines, automation, system performance monitoring, and infrastructure as code (IaC).
- Strong background in collaborating with cross-functional teams (Development, Operations, Engineering, etc.) to improve operational processes and service delivery.
- Experience managing deployments, conducting risk assessments, and optimizing event and problem management processes.
- Familiarity with cloud technologies, containerization, and scalable architectures, including zero-downtime deployment strategies.
- Strong skills and experience with AKS and on-prem Kubernetes (K8s).
- Scripting experience with Ansible, Bash, and Python.
- Automation experience.
- CI/CD pipeline experience.
- Terraform exposure.
- Azure or AWS skills.
- Basic DB skills.
- Strong problem-solving skills and the ability to learn quickly.
- SRE mindset.
Technical Stack
- AKS, on-prem K8s
- Ansible, Bash, Python
- Terraform
- Azure, AWS
Benefits & Compensation
- Flex Week: Work from home up to 2 days/week (depending on your team's needs).
- Flex Day: Make your workday suit your life and plans.
- Flex-Location: Take up to 30 days a year to work from any location in the world.
- Employee Wellbeing: Employee Assistance Program (EAP) available to you and your dependents 24/7, 365 days a year, plus access to the Champion Health platform.
- Professional Development: Access to LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Airports Council International, Pluralsight, Harvard Business Publishing, Stanford, and other learning platforms.
- Competitive Benefits: Aligned with the local market and your employment status.
Work Mode
This role follows a hybrid work model.
SITA is an Equal Opportunity Employer. We value a diverse workforce. In support of our Employment Equity Program, we encourage women, Aboriginal people, members of visible minorities, and/or persons with disabilities to apply and self-identify in the application process.

