Responsibilities
- Architect and manage infrastructure that supports live AI agent workflows
- Ensure high availability, strong performance, and effective monitoring for agent-based systems used in internal and external products
- Create platform-level services, APIs, SDKs, and self-service tools to simplify AI infrastructure adoption for engineering teams
- Operate and maintain compute resources, orchestration systems, and model serving infrastructure for AI agents
- Establish monitoring, alerting, and incident response protocols tailored to AI and machine learning workloads
- Use Infrastructure as Code tools such as Terraform to deploy and manage cloud-based components on AWS
- Develop and maintain CI/CD pipelines enabling fast and stable deployment of AI-powered services and agent logic
- Define resilience strategies, failure handling mechanisms, and recovery patterns for systems using LLMs and autonomous agents
- Work closely with AI and data engineering teams to transition experimental agent prototypes into robust production deployments
- Orchestrate containerized applications using Kubernetes to ensure efficient scaling and management of AI services
- Enforce security policies, access controls, and compliance standards across AI infrastructure environments
- Maintain comprehensive documentation, including system architecture, operational runbooks, and engineering best practices