Responsibilities
- Build and maintain a low-latency model inference system that supports high availability, efficient resource use, and scalable performance for advanced world models.
- Develop and scale core data infrastructure platforms, including Flyte and Ray on Kubernetes, to manage petabyte-scale data processing.
- Construct and operate large GPU-powered training clusters optimized for deep learning, emphasizing throughput, reliability, and ease of use.
- Implement automated provisioning, configuration management, monitoring, and alerting systems using Infrastructure as Code methodologies.
- Lead initiatives to improve system performance, reduce operating costs, and strengthen infrastructure resilience across every layer of the stack.
- Work directly with research and product teams to gather requirements, refine workflows, and enhance platform accessibility and functionality.


