Responsibilities
- Lead the design, implementation, and ongoing management of a multi-cloud environment spanning Google Cloud, Azure, and AWS.
- Develop a scalable, self-service platform layer that empowers engineering teams to deploy and manage infrastructure efficiently.
- Design and implement networking solutions connecting cloud and edge environments, ensuring secure, reliable, and scalable communication.
- Enforce security best practices and compliance requirements, including ISO 27001, zero-trust architecture, and software supply chain protections.
- Implement comprehensive observability with metrics, logging, distributed tracing, and proactive alerting across all systems.
- Build and refine the MLOps pipeline, supporting data workflows, GPU-intensive workloads, and model deployment for AI systems.
- Ensure consistent reliability and performance of distributed systems deployed across multiple geographic regions.
- Collaborate directly with executive and technical leadership on infrastructure strategy, architectural direction, and long-term planning.
Requirements
- Minimum of five years in DevOps, infrastructure, or platform engineering with direct responsibility for production environments.
- Demonstrated experience designing and managing complex cloud or hybrid infrastructure across AWS, GCP, or Azure.
- Proficiency with Kubernetes and Infrastructure-as-Code tools such as Terraform for automated provisioning.
- Solid knowledge of networking principles and cloud networking including routing, DNS, VPNs, and security protocols.
- Experience with identity and access management systems including IAM, RBAC, and federated identity via OIDC or SAML.
- Strong programming ability in Python or Go, with a focus on building robust, maintainable systems.
- Production experience with CI/CD pipelines and GitOps methodologies for infrastructure and application delivery.
- In-depth understanding of Linux operating systems and distributed computing concepts.
- Familiarity with event-driven architectures and API technologies such as Kafka, AMQP, REST, or gRPC.
- Knowledge of modern operational practices including Site Reliability Engineering, DevSecOps, and platform engineering principles.
- Hands-on experience with MLOps tools or a strong drive to rapidly develop expertise in machine learning operations.
Nice to Have
- Experience using configuration management tools such as Ansible.
- Understanding of distributed storage and data systems.
- Background working with on-premises or edge computing infrastructure.
- Track record in incident management, defining SLOs, and improving system reliability.
Benefits
- Significant ownership over system design and architectural decisions, not just maintenance.
- Rapid professional growth through exposure to multi-cloud platforms, edge computing, and AI infrastructure.
- Access to cutting-edge technologies including MLOps pipelines, GPU orchestration, and distributed edge systems.
- Work in a high-impact environment where infrastructure choices directly affect large-scale production systems.
- Collaborate with seasoned engineers from leading technology companies who embrace DevOps culture.
- Clear trajectory for advancement into senior technical roles such as Staff Engineer, Architect, or leadership positions.
Required
- 5+ years of experience in DevOps / infrastructure / platform engineering, with clear ownership of production systems.
- Proven ability to design and operate complex cloud or hybrid architectures (AWS, GCP, Azure).
- Strong Kubernetes and Infrastructure-as-Code expertise (Terraform or similar is expected).
- Deep understanding of networking fundamentals and cloud networking (routing, VPNs, DNS, security layers).
- Solid grasp of identity and access systems (IAM, OIDC/SAML, RBAC).
- Strong programming skills (Python and/or Go): you build real systems, not just glue scripts.
- Experience with CI/CD and GitOps workflows in production environments.
- Strong Linux and distributed systems knowledge.
- Exposure to event-driven systems and APIs (Kafka, AMQP, REST, gRPC).
- Good understanding of modern infrastructure disciplines (SRE, DevSecOps, platform engineering).
- Working knowledge of MLOps or strong motivation to grow into it quickly.
Preferred
- Experience with configuration management (Ansible or similar).
- Familiarity with storage and distributed data systems.
- Exposure to on-prem / edge infrastructure.
- Experience leading incident response, defining SLOs, and driving reliability practices.
Benefits
- Real ownership: You won’t just maintain systems; you’ll design and shape them.
- Steep growth curve: Work across multi-cloud, edge computing, and modern AI infrastructure.
- Cutting-edge tech: MLOps, GPU orchestration, distributed edge systems, and emerging AI workflows.
- High-impact environment: Your decisions directly influence production systems at scale.
- Strong team: Work with experienced engineers from top-tier companies who live the DevOps philosophy.
- Clear growth path: Opportunity to grow into Staff Engineer, Architect, or technical leadership roles.