Build and manage a global Site Reliability Engineering team focused on monitoring, maintaining, and enhancing the reliability of cloud infrastructure. Lead incident response, promote automation, and integrate operational best practices across development teams to ensure high system availability and performance.
Responsibilities
- Lead the SRE team by defining goals and guiding efforts to achieve strong system reliability while managing cost and performance commitments.
- Work closely with platform and product engineering teams to integrate reliability and operational standards into the development lifecycle.
- Establish and implement SRE frameworks, including service level objectives, service level indicators, and error budgeting.
- Promote automation across operations to minimize manual tasks, improve system efficiency, and support scalable growth.
- Manage incident response processes, conduct post-mortem reviews, and lead root cause analysis to prevent recurring issues.
- Lead capacity planning and scalability initiatives to support business growth and optimize resource usage.
- Oversee disaster recovery planning and testing to ensure uninterrupted service for customer webstores.
- Foster a culture of continuous learning by mentoring team members and encouraging innovation.
- Stay current with advancements in SRE practices and advocate for the adoption of relevant technologies and methodologies.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- Minimum of 5 years of experience in Site Reliability Engineering, including at least 2 years in a leadership capacity.
- Extensive hands-on experience with Microsoft Azure, including deploying and managing cloud-native systems; working knowledge of Kubernetes is required.
- Solid understanding of network protocols, load balancing, and high availability setups.
- Experience applying software engineering to SRE challenges, with proficiency in languages such as PowerShell, C#, Python, Go, or Java.
- Proven experience with automation tools and infrastructure-as-code platforms like Terraform and Ansible.
- Skilled in using monitoring and logging tools such as Prometheus, Grafana, and the ELK Stack to build comprehensive observability solutions.
- Demonstrated ability to solve complex technical problems under pressure.
- Proven leadership experience, including mentoring and growing high-performing engineering teams.
- Strong communication and collaboration skills with a history of effective cross-team coordination.
Nice to Have
- Familiarity with Dynatrace.
Tech Stack
Microsoft Azure, Kubernetes, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Dynatrace, PowerShell, C#, Python, Go, Java
Benefits
- Opportunity to drive impact within a rapidly growing SaaS scale-up.
- Eligibility for up to 5 weeks of 'work from anywhere' annually.
- Customized global onboarding program, highly rated by new hires.
- Hybrid work model with 3 days in office and 2 days remote per week.
- Weekly company-sponsored lunch.
Work Arrangement
Hybrid: 3 days in the office and 2 days from home per week, plus up to 5 weeks of 'work from anywhere' per year.
Team
A global SRE team that manages and monitors all systems, environments, and infrastructure.
Values
- We deliver lasting success by balancing immediate results with long-term value.
- We empower customers by transforming B2B commerce and enabling their leadership.
- We embrace challenges and continuously raise the bar for ourselves and the industry.
- We act boldly, supported by trust and mutual accountability within the team.
Additional Information
- Even if you don’t meet every listed qualification, we encourage you to apply if you align with our vision and are eager to grow with us.
- This role includes a hybrid work model with remote flexibility and a 'work from anywhere' benefit.


