NVIDIA is looking for a Senior MLOps Engineer to build and manage the continuous integration pipelines and release processes for our Generative AI Frameworks, including Megatron-LM and NeMo Framework. You will architect DevOps solutions that enable our fast-growing team to release scalable, high-performance software for Large Language Models and multimodal generation.
What You'll Do
- Architect and manage continuous integration pipelines and release processes for our Generative AI framework and libraries.
- Design and implement scalable DevOps solutions to increase software release frequency while maintaining high quality and performance.
- Work with tools like Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, and Jira in hybrid on-premise and cloud environments.
- Assist with cluster operations and system administration, including managing servers, team accounts, and clusters.
- Accelerate research and development cycles by automating tasks like accuracy and performance regression detection.
- Develop new quality control measures, such as code analysis, backwards compatibility, and regression testing.
- Work closely with DL frameworks and libraries teams and other engineering groups providing infrastructure.
What We're Looking For
- BS or MS degree in Computer Science, Computer Architecture, or a related technical field (or equivalent experience) and 3+ years of industry experience in DevOps and infrastructure engineering.
- Strong system-level programming skills in languages like Python and shell scripting.
- Extensive understanding of build/release systems and CI/CD, with experience in solutions like GitLab, GitHub, or Jenkins.
- Experience with Linux system administration.
- Proficiency with containerization and cluster management technologies like Docker and Kubernetes.
- Experience with build tools including Make and CMake.
- A strong background in source code management solutions such as GitLab, GitHub, or Perforce.
- Strong problem-solving and debugging skills.
- Ability to collaborate and influence others in a dynamic environment.
- Excellent interpersonal and written communication skills.
Nice to Have
- Proven track record with GPU-accelerated systems at scale.
- Well-versed in DL frameworks such as PyTorch, JAX, or TensorFlow.
- Expertise in cluster and cloud compute technologies, like Slurm, Lustre, or Kubernetes.
- Experience in software and hardware benchmarking on high-performance computing systems.
Technical Stack
- Infrastructure: Kubernetes, Docker, Slurm, Ansible
- CI/CD & Tools: GitLab, GitHub Actions, Jenkins, Artifactory, Jira
- Languages & Scripting: Python, Shell scripting
- Build Tools: Make, CMake
- SCM: Git, Perforce
- DL Frameworks & Libraries: CUDA, cuDNN, cuBLAS, PyTorch, JAX, TensorFlow
- Operating System: Linux
Team & Environment
You will join a technically diverse team of DL algorithm engineers and performance optimization specialists.
Benefits & Compensation
- Compensation: $148,000 - $235,750 for Level 3, and $184,000 - $287,500 for Level 4.
- Eligible for equity.
- Comprehensive benefits.
NVIDIA is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.