Full-time

NVIDIA is hiring a Senior MLOps Engineer, GenAI Framework

About the Role

NVIDIA is looking for a Senior MLOps Engineer to build and manage the continuous integration pipelines and release processes for our Generative AI Frameworks, including Megatron-LM and NeMo Framework. You will architect DevOps solutions that enable our fast-growing team to release scalable, high-performance software for Large Language Models and multimodal generation.

What You'll Do

  • Architect and manage continuous integration pipelines and release processes for our Generative AI framework and libraries.
  • Design and implement scalable DevOps solutions to increase software release frequency while maintaining high quality and performance.
  • Work with tools like Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, and Jira in hybrid on-premises and cloud environments.
  • Assist with cluster operations and system administration, including managing servers, team accounts, and clusters.
  • Accelerate research and development cycles by automating tasks like accuracy and performance regression detection.
  • Develop new quality control measures, such as code analysis, backward-compatibility checks, and regression testing.
  • Work closely with the DL framework and library teams and with other engineering groups that provide infrastructure.

What We're Looking For

  • BS or MS degree in Computer Science, Computer Architecture, or a related technical field (or equivalent experience) and 3+ years of industry experience in DevOps and infrastructure engineering.
  • Strong system-level programming skills in languages such as Python, plus shell scripting.
  • Extensive understanding of build/release systems and CI/CD, with experience in solutions like GitLab, GitHub, or Jenkins.
  • Experience with Linux system administration.
  • Proficiency with containerization and cluster management technologies like Docker and Kubernetes.
  • Experience with build tools including Make and CMake.
  • A strong background in source code management solutions such as GitLab, GitHub, or Perforce.
  • Strong problem-solving and debugging skills.
  • Ability to collaborate and influence others in a dynamic environment.
  • Excellent interpersonal and written communication skills.

Nice to Have

  • Proven track record with GPU-accelerated systems at scale.
  • Well-versed in DL frameworks such as PyTorch, JAX, or TensorFlow.
  • Expertise in cluster and cloud compute technologies, such as Slurm, Lustre, or Kubernetes.
  • Experience in software and hardware benchmarking on high-performance computing systems.

Technical Stack

  • Infrastructure: Kubernetes, Docker, Slurm, Ansible
  • CI/CD & Tools: GitLab, GitHub Actions, Jenkins, Artifactory, Jira
  • Languages & Scripting: Python, Shell scripting
  • Build Tools: Make, CMake
  • SCM: Git, Perforce
  • DL Frameworks & Libraries: CUDA, cuDNN, cuBLAS, PyTorch, JAX, TensorFlow
  • Operating System: Linux

Team & Environment

You will join a technically diverse team of DL algorithm engineers and performance optimization specialists.

Benefits & Compensation

  • Compensation: $148,000 - $235,750 for Level 3, and $184,000 - $287,500 for Level 4.
  • Eligible for equity.
  • Comprehensive benefits.

NVIDIA is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Required Skills
Kubernetes, Docker, Slurm, Ansible, Python, GitLab, GitHub Actions, Jenkins, Artifactory, Jira, MLOps, GenAI, CI/CD, Infrastructure as Code, Cloud Platforms
About company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category: Infrastructure
Posted: 4 months ago