Responsibilities
- Define and manage service level objectives (SLOs) and supporting reliability metrics for AI-driven API services and agentic AI platform features
- Implement comprehensive observability and monitoring systems for real-time performance and fault detection
- Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure
- Optimize resource utilization, particularly GPU/accelerator efficiency, ensuring cost-effective AI system operation
- Collaborate closely with engineering, platform, and product teams to align reliability efforts with broader organizational goals
- Lead efforts to build internal tooling and automation focused on AI system stability and operational excellence
- Drive continuous improvement in deployment practices, monitoring approaches, and incident management processes
Requirements
- Strong background in AI reliability engineering, SRE, or DevOps for distributed systems
- Understanding of the unique challenges of maintaining large-scale AI systems and integrating AI-specific metrics into reliability frameworks
- Experience with cloud platforms, monitoring tools, and incident response automation
- Comfort collaborating across teams to influence best practices for AI system reliability and operational health
- Ability to thrive in dynamic, fast-paced environments focused on delivering reliable, safe AI-powered services
Nice to Have
- Hands-on experience with AI/ML infrastructure, including GPU/xPU optimization and scaling
- Familiarity with API platform operations and large-scale distributed services
- Prior experience building or operating observability tools tailored for AI and agentic systems
- Contributions to open-source projects or thought leadership in reliability engineering