The Site Reliability Engineer plays a critical role in ensuring the stability, performance, and scalability of our global infrastructure. This position bridges the gap between development and operations by applying engineering principles to operational challenges. The role focuses on building resilient systems, automating operational workflows, and driving continuous improvement through data-driven insights. You will work closely with cross-functional teams to establish SRE best practices, define reliability metrics, and implement proactive monitoring and alerting strategies. With a strong emphasis on incident management and postmortem analysis, you will help reduce system downtime and improve overall service quality. This role is essential to supporting high-availability services and enabling the organization to deliver reliable, scalable solutions to users worldwide.
Responsibilities
- Design and manage infrastructure that is both scalable and highly reliable to support global operations.
- Create automated solutions to reduce repetitive manual tasks and operational toil.
- Work with engineering teams to define service-level indicators, objectives, and error budgets.
- Use monitoring tools such as Prometheus and Grafana to track system performance and uptime.
- Lead incident postmortems and conduct root cause analysis to improve system resilience.
- Support AI and machine learning workloads by providing robust and efficient infrastructure.
- Promote and implement SRE best practices across development and operations teams.
- Take part in on-call rotations and implement proactive measures to prevent system outages.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- Minimum of nine years of professional experience in site reliability, DevOps, or backend engineering.
- Strong programming and automation skills, with proficiency in Python and Ansible.
- Hands-on experience operating in at least one major cloud environment such as AWS, GCP, or Azure.
- Demonstrated expertise with containerization and orchestration tools such as Docker and Kubernetes.
- Solid understanding of distributed systems and principles of reliability engineering.
- Experience with monitoring, logging, and alerting platforms to ensure system health.
Nice to Have
- Familiarity with AI or machine learning infrastructure and workflows is a strong advantage.
Tech Stack
Python, Ansible, AWS, GCP, Azure, Kubernetes, Docker, Prometheus, Grafana
Team
Part of the Technology Operations Platform team, which focuses on enhancing user experience and enabling SRE teams. The team collaborates closely with the Observability team while maintaining distinct responsibilities.
- The company is committed to diversity and inclusion in the workplace.
- It values varied perspectives and ways of thinking.
- It provides equal opportunities regardless of race, colour, gender, sex, age, religion, creed, national origin, ancestry, citizenship, marital status, sexual orientation, physical or mental disability, medical condition, pregnancy or parental leave, veteran status, gender identity, genetic information, or other legally protected characteristics.
Additional Information
- Applicants with criminal histories will be evaluated in accordance with applicable legal requirements.
- Accommodations are available during the application and hiring process; individuals may contact accommodationrequests@maersk.com for support.


