Architect and manage infrastructure that supports live AI agent workflows
Ensure high availability, strong performance, and effective monitoring for agent-based systems used in internal and external products
Create platform-level services, APIs, SDKs, and self-service tools to simplify AI infrastructure adoption for engineering teams
Operate and maintain compute resources, orchestration systems, and model serving infrastructure for AI agents
Establish monitoring, alerting, and incident response protocols tailored to AI and machine learning workloads
Use Infrastructure as Code tools like Terraform to deploy and manage cloud-based components on AWS
Develop and maintain CI/CD pipelines enabling fast and stable deployment of AI-powered services and agent logic
Define resilience strategies, failure handling mechanisms, and recovery patterns for systems using LLMs and autonomous agents
Work closely with AI and data engineering teams to transition experimental agent prototypes into robust production deployments
Orchestrate containerized applications using Kubernetes to ensure efficient scaling and management of AI services
Enforce security policies, access controls, and compliance standards across AI infrastructure environments
Maintain comprehensive documentation including system architecture, operational runbooks, and engineering best practices

Kraken is hiring a Site Reliability Engineer - AI Agents