Responsibilities
- Own the architecture, design, and evolution of our Kubernetes-based platform on AWS, ensuring scalability, resilience, and operational excellence
- Develop, maintain, and optimize Kubernetes infrastructure using Infrastructure as Code (CDK), enforcing best practices and architectural standards
- Serve as the domain expert for Kubernetes architecture, working closely with the Architecture team to translate high-level strategy into implementation and giving engineering teams clear patterns, reference architectures, and best-practice recommendations
- Lead technical decision-making across multiple teams, providing mentorship, design reviews, and hands-on support to drive platform consistency and quality
- Drive proactive improvements across the platform, identifying scaling issues, reliability gaps, or operational inefficiencies before they become problems
- Design and implement secure, compliant, and highly observable Kubernetes environments, integrating monitoring, logging, and alerting systems
- Lead and support the migration of non-cloud-native applications to the new infrastructure, assessing readiness and implementing best practices for scalability and maintainability
- Champion automation and GitOps practices to reduce manual work, eliminate drift, and improve release velocity
- Lead cross-team initiatives and influence technical direction outside your immediate organization, driving adoption of cloud-native best practices
- Participate in on-call rotations and contribute to incident response and root cause analysis
Requirements
- Expertise in Kubernetes cluster architecture and operations, including designing and maintaining multi-node clusters for performance, scalability, and fault tolerance
- Proficiency with service mesh technologies for observability, security, and traffic routing
- Experience implementing auto-scaling, optimizing resource utilization, and planning rolling upgrades and disaster recovery with zero downtime
- Solid knowledge of and experience with AWS and EKS
- Ability to assess application readiness and provide migration strategies for different application types while ensuring minimal disruption during transitions
- Automation mindset with the drive to identify and automate repetitive tasks and manual processes
- Experience with observability tools such as Datadog to ensure high uptime and fast issue resolution
- Strong problem-solving and troubleshooting abilities, with a collaborative, team-oriented approach and a can-do attitude
- Effective communication and collaboration skills across stakeholder groups of varied technical backgrounds, with strong written and verbal English proficiency
- Fluency with AI tools and large language models to support platform engineering, automation, and operational efficiency
Nice to Have
- General familiarity with AI tools and large language models (e.g., Claude by Anthropic, Amazon Bedrock)
- Experience with multiple IaC languages (CDK with TypeScript, CloudFormation)
- Experience with GitOps practices, including ArgoCD
- Multi-cloud experience (AWS required, Azure nice to have)