What You'll Do
Design and manage production systems with a focus on scalability, reliability, and security. You'll ensure platforms operate efficiently by implementing observability practices and proactive monitoring solutions.
- Develop and maintain automated deployment pipelines to safely roll out changes in a hybrid cloud environment.
- Reduce manual effort by building tools and scripts that streamline operations and improve developer workflows.
- Respond to incidents, lead triage efforts, and create detailed postmortems to prevent recurring issues.
- Develop and update runbooks for incident response and coordinate maintenance windows with minimal disruption.
- Collaborate with engineering teams to identify infrastructure bottlenecks and implement scalable solutions.
- Integrate monitoring and alerting systems to detect problems early and automate responses where possible.
- Work with third-party vendors to resolve hardware and software issues affecting platform stability.
- Study the architecture of open-source systems to troubleshoot them more effectively.
Requirements
You bring hands-on experience with infrastructure automation and production operations in cloud environments. A strong grasp of system design and developer experience is essential.
- Experience with tools such as Ansible, Jenkins, Kubernetes, Grafana, Spinnaker, and MySQL.
- Familiarity with code review and version control systems such as Gerrit and Perforce.
- Knowledge of artifact management with Artifactory and of distributed storage solutions.
- Understanding of search platforms like Elasticsearch and caching layers such as Varnish.
- Experience running production systems on Google Cloud Platform.
- Ability to analyze system behavior and improve performance, security, and fault tolerance.
- Strong communication skills for cross-team collaboration and technical documentation.
Benefits
This role supports a critical function in maintaining platform health and empowering engineering teams. You'll work across technologies and teams, driving improvements in infrastructure resilience and developer productivity.


