About the Role
Own end-to-end reliability of cloud infrastructure across AWS and GCP, including Kubernetes, databases, and CI/CD pipelines. Drive incident response, system resilience, and performance at scale while supporting the foundation of AI-driven operations for leading enterprise customers.
Responsibilities
- Ensure high availability and performance of cloud systems running on AWS (ECS, Aurora, CloudWatch) and Google Cloud Platform
- Operate and fine-tune Kubernetes environments for efficient container orchestration and workload management
- Lead database reliability initiatives, focusing on performance optimization, query tuning, and horizontal scaling
- Develop, secure, and maintain CI/CD pipelines to enable fast, reliable, and automated software deployments
- Lead incident management processes, including on-call rotations and post-incident reviews
- Collaborate with engineering teams to design systems that are scalable, fault-tolerant, and production-ready
Tech Stack
- Amazon Web Services (ECS, Aurora, CloudWatch)
- Google Cloud Platform
- Kubernetes and container orchestration systems
- Relational databases and database performance tuning
- CI/CD pipeline tools and practices
- Incident response and monitoring frameworks
Benefits
- 4 weeks of paid vacation annually
- Paid sick leave
- Paid parental leave
- Annual professional development fund for courses, certifications, and conferences related to AI, enterprise software, and customer success
- Provision of top-tier hardware with choice of laptop and peripherals
- Comprehensive health, dental, and vision insurance coverage
- Direct engagement with customers to understand needs and observe real-world impact of technical solutions
Compensation
Competitive salary
Company Culture
- Focus on building foundational infrastructure for AI-driven enterprise applications
- Collaborative environment with direct influence on product and platform evolution
- Strong alignment with customer success through hands-on technical support and system design
Additional Information
- End-to-end ownership of system reliability, spanning database performance, incident management, and deployment pipelines
- Core contributor to the development and adoption of AI operations infrastructure
- Customers include recognized innovators such as Gusto, Instacart, and Opendoor
- Backed by $11M in funding from Khosla Ventures and Felicis
- Strategic involvement with MCP, co-created by an investor on the cap table
- Pioneered the MCP protocol; now building the enterprise platform that enables its real-world implementation