About the Role
The role involves building and optimizing the infrastructure that powers advanced AI training and inference pipelines, and working closely with research and engineering teams to deliver scalable solutions.
Responsibilities
- Design scalable systems for AI model training and deployment
- Optimize compute resource utilization across distributed environments
- Collaborate with machine learning teams to understand infrastructure needs
- Improve system reliability and fault tolerance
- Develop automation tools for provisioning and monitoring infrastructure
- Troubleshoot performance bottlenecks in compute clusters
- Implement efficient data pipelines for AI workflows
- Ensure security and compliance across infrastructure layers
- Evaluate new hardware and software technologies for AI workloads
- Maintain documentation for systems and processes
- Support deployment of large language models in production
- Drive initiatives to reduce operational costs
- Monitor cluster health and respond to incidents
- Integrate feedback from research teams into infrastructure design
- Scale systems to accommodate growing model sizes
- Work with containerization and orchestration platforms
- Ensure low-latency communication between compute nodes
- Optimize storage solutions for high-throughput access
- Manage GPU resource allocation and scheduling
- Contribute to capacity planning and forecasting
Nice to Have
- Experience with large-scale transformer model training
- Background in computer science or a related technical field
- Prior work with AI research teams
- Familiarity with Slurm or similar workload managers
- Knowledge of RDMA or high-speed interconnects
- Experience with hardware provisioning at scale
- Contributions to open-source infrastructure projects
- Understanding of energy-efficient computing
- Exposure to formal incident response procedures
- Experience mentoring junior engineers
Compensation
Competitive salary, equity, and a benefits package
Work Arrangement
Hybrid work model with flexibility for remote or on-site work
Team
Collaborative engineering team focused on AI infrastructure scalability
About the AI Infrastructure Team
- The team operates at the intersection of machine learning and systems engineering, building platforms that enable rapid experimentation and deployment of AI models.
- Focus areas include cluster management, resource scheduling, and performance optimization for GPU-intensive workloads.
Impact
- Engineers directly influence the speed and efficiency of AI development by enabling faster training cycles and reliable inference systems.
- This work helps reduce time-to-market for new AI capabilities.
Available for qualified candidates