About the Role
In this role, you will build and maintain core infrastructure for artificial intelligence workloads, keeping systems robust, scalable, and performant across distributed environments.
Responsibilities
- Develop scalable backend systems to support AI model training and inference
- Design distributed computing frameworks for efficient resource utilization
- Optimize data pipelines for high-throughput machine learning workflows
- Collaborate with research teams to operationalize AI models
- Ensure infrastructure reliability under heavy computational loads
- Implement monitoring and observability for AI systems
- Troubleshoot performance bottlenecks in large-scale environments
- Contribute to capacity planning for GPU and CPU clusters
- Maintain secure and compliant computing environments
- Integrate new hardware accelerators into existing infrastructure
- Automate deployment and scaling of AI services
- Work closely with data engineers to streamline data access
- Define best practices for infrastructure as code
- Support reproducibility and versioning of AI experiments
- Improve fault tolerance in distributed training jobs
- Evaluate new technologies for AI compute efficiency
- Document system architecture and operational procedures
- Respond to incidents affecting AI platform availability
- Participate in code and design reviews
- Drive improvements in system latency and throughput
- Ensure compatibility across software and hardware stacks
- Collaborate on disaster recovery planning
- Enhance developer tooling for machine learning engineers
- Contribute to technical roadmaps for infrastructure evolution
- Mentor engineers working on AI platform components
Nice to Have
- Master’s or PhD in computer science or related field
- Experience with high-performance computing environments
- Contributions to open-source AI or infrastructure projects
- Prior work in machine learning platform development
- Familiarity with model serving frameworks
- Experience with large-scale data processing systems
- Knowledge of formal verification methods
- Background in systems programming
- Published research in systems or AI conferences
Compensation
Competitive salary with equity and benefits
Work Arrangement
Hybrid work model with flexible remote options
Team
You will be part of a high-performing engineering team focused on AI systems.
About the AI Team
The AI team builds foundational systems that enable rapid experimentation and deployment of machine learning models. Engineers work on low-latency inference, distributed training, and scalable data infrastructure.
Technology Stack
The primary languages are Python and Go. Infrastructure runs on Kubernetes, hosted with cloud providers. Tooling includes Prometheus, Grafana, Docker, and custom internal platforms for model management.
Available for qualified candidates