About the Role
This role owns the ML platform end to end, from distributed training to production inference; details below.
Responsibilities
- Design, build, and own the end-to-end platform that supports the entire lifecycle of ML models, from massive-scale distributed training to low-latency, highly available inference.
- Implement and scale LLM inference stacks using frameworks such as vLLM, TensorRT-LLM, or SGLang (see the sketch after this list).
- Solve complex challenges in throughput, latency, token streaming, and automated scaling to deliver a seamless user experience.
- Act as a strategic partner to AI Research and Data Science teams.
- Create a developer experience that lets teams experiment, fine-tune, and deploy models quickly and with confidence.
- Develop robust CI/CD/CT (Continuous Training) pipelines using tools like Argo Workflows, ArgoCD, and GitHub Actions to automate model validation, deployment, and lifecycle management.
- Ensure systems stay flexible as workloads evolve while remaining reliable in production.
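
To give a flavor of the serving work above, here is a minimal sketch of batch LLM inference using vLLM's offline Python API. The model ID is a placeholder, not a prescribed checkpoint:

```python
# Minimal sketch of offline LLM inference with vLLM.
from vllm import LLM, SamplingParams

# Load a model; vLLM handles batching and KV-cache paging internally.
# The model ID below is an assumption for illustration.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Sampling settings: temperature and a cap on generated tokens.
params = SamplingParams(temperature=0.7, max_tokens=128)

# Generate completions for a batch of prompts.
outputs = llm.generate(["Explain KV-cache paging in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```

In practice the same engine is typically run as an OpenAI-compatible server behind an autoscaler, which is where the throughput, latency, and token-streaming challenges above come in.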
Requirements
- 5+ years in infrastructure or software engineering, with at least 2 years focused on MLOps or ML infrastructure for large-scale distributed systems.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Deep, hands-on expertise with Kubernetes in production.
- Fluency in the cloud-native ecosystem, including Helm, ArgoCD, and Argo Workflows.
- Ability to optimize the platform’s performance and scalability, considering factors such as GPU resource utilization, data ingestion, model training, and deployment.
- Hands-on experience with modern LLM inference serving frameworks (e.g., vLLM, SGLang, Triton Inference Server, Ray Serve).
- Understanding of the challenges specific to serving generative models, such as KV-cache memory management, variable-length outputs, and token streaming.
- Strong programming proficiency in Python or Go.
- Experience with ML frameworks such as PyTorch, JAX, or TensorFlow.
- Passion for building observable, resilient systems using modern monitoring tools (e.g., Prometheus, Grafana, OpenTelemetry); a minimal instrumentation sketch follows this list.
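
As a rough illustration of the observability bar we have in mind, a minimal sketch that exposes an inference-latency histogram via prometheus_client; the metric name and port are illustrative, not a prescribed convention:

```python
# Minimal sketch: exposing an inference-latency histogram with prometheus_client.
import time
from prometheus_client import Histogram, start_http_server

# Illustrative metric name; real deployments follow a team-wide naming scheme.
REQUEST_LATENCY = Histogram(
    "inference_request_latency_seconds",
    "End-to-end latency of inference requests",
)

def handle_request(prompt: str) -> str:
    # The context manager records elapsed time into the histogram.
    with REQUEST_LATENCY.time():
        time.sleep(0.05)  # stand-in for real model inference
        return f"echo: {prompt}"

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes :8000/metrics
    while True:
        handle_request("ping")
```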
Nice to Have
- Deep performance optimization skills, including writing custom inference kernels in CUDA or Triton to accelerate model performance beyond what off-the-shelf frameworks provide.
- Experience with model optimization techniques such as quantization, distillation, and speculative decoding (see the sketch after this list).
- Exposure to training and serving multi-modal models (e.g., text-to-image, vision-language).
- Knowledge of AI safety and evaluation frameworks for monitoring model performance for things like bias, toxicity, and hallucinations.
Additional Information
- Shortlisted candidates will undergo a Background Verification (BGV).
- By applying, you consent to sharing the personal information required for the Background Verification process.