Platform.sh is seeking a Site Reliability Engineer to join our Upsun team. As a key addition to the team, you will help us transition from traditional Cloud Operations to an automation-driven SRE model. Your focus will be on improving infrastructure, automating operational tasks, and streamlining processes to enhance system reliability, scalability, and efficiency.
What You'll Do
- Refine monitoring and observability using tools like Prometheus, Grafana, and the ELK Stack to ensure system visibility aligns with business objectives.
- Automate deployments and workflows by transitioning manual processes to automated solutions with IaC tools like Terraform and Ansible.
- Optimize CI/CD pipelines and their architecture for fast, reliable, and scalable releases.
- Manage and scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt.
- Support incident response and lead post-mortem analysis to ensure continuous improvement and knowledge sharing.
- Collaborate with cross-functional engineering and product teams to integrate reliability practices into the development lifecycle.
- Drive technical innovation by introducing new tools, technologies, and practices that improve system reliability, performance, and scalability.
What We're Looking For
- A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
- Advanced hands-on experience with Linux internals, including performance tuning, kernel configurations, and troubleshooting.
- Proficiency in Go (preferred) or Python for building tools and automating processes.
- Strong scripting skills in Bash, Python, or Go to automate workflows and manage infrastructure.
- Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
- Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
Nice to Have
- Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications.
Technical Stack
- Monitoring/Observability: Prometheus, Grafana, ELK Stack
- Infrastructure as Code: Terraform, Ansible
- Cloud Platforms: AWS, GCP, Azure
- Languages: Go, Python, Bash
- Containerization: Docker, Kubernetes
Team & Environment
You will report to the Director, Site Reliability Engineering.
Benefits & Compensation
- Flexible PTO
- Comprehensive healthcare coverage (UK, France, Spain)
- Company stock options
- Professional development budget
- Office equipment budget
- Wellness budget
- Annual team gatherings
- Internet reimbursement
- Inclusive parental leave
- Remote work travel program
Work Mode
This is a fully remote position open to candidates based in France, Germany, Spain, or the United Kingdom.
Platform.sh is an equal opportunity employer.



