London, United Kingdom | Remote (Global) | Full-time

Kraken is hiring a Site Reliability Engineer - AI Agents

Responsibilities

  • Architect and manage infrastructure that supports live AI agent workflows
  • Ensure high availability, strong performance, and thorough monitoring for agent-based systems used in internal and external products
  • Create platform-level services, APIs, SDKs, and self-service tools to simplify AI infrastructure adoption for engineering teams
  • Operate and maintain compute resources, orchestration systems, and model serving infrastructure for AI agents
  • Establish monitoring, alerting, and incident response protocols customized for AI and machine learning workloads
  • Use Infrastructure as Code tools like Terraform to deploy and manage cloud-based components on AWS
  • Develop and maintain CI/CD pipelines enabling fast and stable deployment of AI-powered services and agent logic
  • Define resilience strategies, failure handling mechanisms, and recovery patterns for systems using LLMs and autonomous agents
  • Work closely with AI and data engineering teams to transition experimental agent prototypes into robust production deployments
  • Orchestrate containerized applications using Kubernetes to ensure efficient scaling and management of AI services
  • Enforce security policies, access controls, and compliance standards across AI infrastructure environments
  • Maintain comprehensive documentation including system architecture, operational runbooks, and engineering best practices

About the company
Kraken
Kraken is a cryptocurrency exchange building premium crypto products for experienced traders, institutions, and newcomers. The company is committed to industry-leading security, crypto education, and world-class client support through products like Kraken Pro, Desktop, Wallet, and Kraken Futures.
Job Details
Department: Engineering, SRE / DevOps
Category: Infrastructure
Posted 7 days ago