About the Role

This role builds and maintains core infrastructure for artificial intelligence workloads, keeping systems robust, scalable, and performant across distributed environments.

Responsibilities

Develop scalable backend systems to support AI model training and inference
Design distributed computing frameworks for efficient resource utilization
Optimize data pipelines for high-throughput machine learning workflows
Collaborate with research teams to operationalize AI models
Ensure infrastructure reliability under heavy computational loads
Implement monitoring and observability for AI systems
Troubleshoot performance bottlenecks in large-scale environments
Contribute to capacity planning for GPU and CPU clusters
Maintain secure and compliant computing environments
Integrate new hardware accelerators into existing infrastructure
Automate deployment and scaling of AI services
Work closely with data engineers to streamline data access
Define best practices for infrastructure as code
Support reproducibility and versioning of AI experiments
Improve fault tolerance in distributed training jobs
Evaluate new technologies for AI compute efficiency
Document system architecture and operational procedures
Respond to incidents affecting AI platform availability
Participate in code and design reviews
Drive improvements in system latency and throughput
Ensure compatibility across software and hardware stacks
Collaborate on disaster recovery planning
Enhance developer tooling for machine learning engineers
Contribute to technical roadmaps for infrastructure evolution
Mentor engineers working on AI platform components

Nice to Have

Master’s or PhD in computer science or a related field
Experience with high-performance computing environments
Contributions to open-source AI or infrastructure projects
Prior work in machine learning platform development
Familiarity with model serving frameworks
Experience with large-scale data processing systems
Knowledge of formal verification methods
Background in systems programming
Published research in systems or AI conferences

Compensation

Competitive salary with equity and benefits

Work Arrangement

Hybrid work model with flexible remote options

Team

Part of a high-performing engineering team focused on AI systems

About the AI Team

The AI team builds foundational systems that enable rapid experimentation and deployment of machine learning models. Engineers work on low-latency inference, distributed training, and scalable data infrastructure.

Technology Stack

Primary languages are Python and Go. Infrastructure runs on Kubernetes hosted on cloud providers. Tooling includes Prometheus, Grafana, Docker, and custom internal platforms for model management.

Available for qualified candidates

Kraken is hiring a Senior Software Engineer – AI Infrastructure