Responsibilities
- Architect and govern the design, deployment, and operation of high-scale, multi-region VM and Kubernetes infrastructure on GCP and AWS, ensuring maximum resilience and performance across all environments.
- Drive cross-functional technical alignment with Engineering, Product, Compliance, and Security teams, serving as the architectural consultant and leader for major initiatives involving capacity planning, disaster recovery, and cloud-native application design.
- Define and enforce organizational best practices and standards for Infrastructure as Code (IaC) using Terraform and Spacelift, ensuring consistency and security across all provisioned cloud resources (GCP/AWS).
- Design and manage complex, multi-layer configuration management and deployment workflows that optimize reliability and operational efficiency across the entire platform.
- Set the technical direction for, and implement, comprehensive observability solutions (Grafana Cloud, Prometheus/Mimir, OTel collectors), establishing organization-wide standards for system visibility, metrics, and alerting.
- Define the strategic architecture and lifecycle management of core platform services, including certificate management, DNS automation, ingress controllers, and service mesh networking (Cilium).
- Proactively identify and lead large-scale strategic efforts to eliminate technical toil and improve operational efficiency through tool development, strategic automation, and advanced CI/CD pipelines.
- Mentor and provide deep technical guidance to both junior and senior engineers within Platform Infrastructure Engineering.
- Participate in a 24x7 on-call rotation as part of a globally distributed team, responding to incidents and driving post-incident reviews to ensure long-term solutions and process improvements.
Requirements
- Bachelor's degree in Computer Science or a similar technical field, or equivalent practical experience.
- Proficiency in common programming and scripting languages. We use a lot of Python, Bash, and Go.
- Understanding of network topologies, communication protocols (e.g., TCP/IP, HTTP/S, UDP, TLS), and enterprise-grade connectivity solutions.
- Kubernetes expertise including cluster administration, RBAC, networking, workload management, and troubleshooting across production environments.
- Proven experience with Terraform for infrastructure provisioning and management.
- Knowledge of Google Cloud Platform services including GKE, VPC networking, Cloud DNS, Artifact Registry, Secret Manager, IAM, Gemini Code Assist, and Workload Identity.
- Prior experience and success mentoring both junior and senior engineers.
- Experience with GitOps methodologies and tools.
- Clear understanding of how to use LLM code-assist tools to build software effectively.
Work Arrangement
Remote (Worldwide)
Additional Information
- Participate in a 24x7 on-call rotation as part of a globally distributed team.
- All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability.