About the Role

This role builds and optimizes the infrastructure that powers AI training and inference pipelines, in close collaboration with research and engineering teams, to deliver scalable solutions.

Responsibilities

Design scalable systems for AI model training and deployment
Optimize compute resource utilization across distributed environments
Collaborate with machine learning teams to understand infrastructure needs
Improve system reliability and fault tolerance
Develop automation tools for provisioning and monitoring infrastructure
Troubleshoot performance bottlenecks in compute clusters
Implement efficient data pipelines for AI workflows
Ensure security and compliance across infrastructure layers
Evaluate new hardware and software technologies for AI workloads
Maintain documentation for systems and processes
Support deployment of large language models in production
Drive initiatives to reduce operational costs
Monitor cluster health and respond to incidents
Integrate feedback from research teams into infrastructure design
Scale systems to accommodate growing model sizes
Work with containerization and orchestration platforms
Ensure low-latency communication between compute nodes
Optimize storage solutions for high-throughput access
Manage GPU resource allocation and scheduling
Contribute to capacity planning and forecasting

Nice to Have

Experience with large-scale transformer model training
Background in computer science or a related technical field
Prior work with AI research teams
Familiarity with Slurm or similar workload managers
Knowledge of RDMA or high-speed interconnects
Experience with hardware provisioning at scale
Contributions to open-source infrastructure projects
Understanding of energy-efficient computing
Exposure to formal incident response procedures
Experience mentoring junior engineers

Compensation

Competitive salary with equity and benefits package

Work Arrangement

Hybrid work model with flexibility for remote or on-site work

Team

Collaborative engineering team focused on AI infrastructure scalability

About the AI Infrastructure Team

The team operates at the intersection of machine learning and systems engineering, building platforms that enable rapid experimentation and deployment of AI models.
Focus areas include cluster management, resource scheduling, and performance optimization for GPU-intensive workloads.

Impact

Engineers directly influence the speed and efficiency of AI development by enabling faster training cycles and reliable inference systems.
Their work helps reduce time-to-market for new AI capabilities.

Available for qualified candidates

Kraken is hiring a Senior AI Compute Infrastructure Engineer