NVIDIA is looking for a Senior MLOps Engineer to build and manage the continuous integration pipelines and release processes for our Generative AI Frameworks, including Megatron-LM and NeMo Framework. You will architect DevOps solutions that enable our fast-growing team to release scalable, high-performance software for Large Language Models and multimodal generation.
What You'll Do
- Architect and manage continuous integration pipelines and release processes for our Generative AI framework and libraries.
- Design and implement scalable DevOps solutions to increase software release frequency while maintaining high quality and performance.
- Work with tools like Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, and Jira in hybrid on-premise and cloud environments.
- Assist with cluster operations and system administration, including managing servers, team accounts, and clusters.
- Accelerate research and development cycles by automating tasks like accuracy and performance regression detection.
- Develop new quality control measures, such as code analysis, backwards compatibility, and regression testing.
- Work closely with DL frameworks and libraries teams and other engineering groups providing infrastructure.
What We're Looking For
- BS or MS degree in Computer Science, Computer Architecture, or a related technical field (or equivalent experience) and 3+ years of industry experience in DevOps and infrastructure engineering.
- Strong system-level programming skills in languages like Python and shell scripting.
- Extensive understanding of build/release systems and CI/CD, with experience in solutions like GitLab, GitHub, or Jenkins.
- Experience with Linux system administration.
- Proficiency with containerization and cluster management technologies like Docker and Kubernetes.
- Experience with build tools including Make and CMake.
- A strong background in source code management solutions such as GitLab, GitHub, or Perforce.
- Strong problem-solving and debugging skills.
- Ability to collaborate and influence others in a dynamic environment.
- Excellent interpersonal and written communication skills.
Nice to Have
- Proven track record with GPU-accelerated systems at scale.
- Well-versed in DL frameworks such as PyTorch, JAX, or TensorFlow.
- Expertise in cluster and cloud compute technologies, like Slurm, Lustre, or Kubernetes.
- Experience in software and hardware benchmarking on high-performance computing systems.
Technical Stack
- Infrastructure: Kubernetes, Docker, Slurm, Ansible
- CI/CD & Tools: GitLab, GitHub Actions, Jenkins, Artifactory, Jira
- Languages & Scripting: Python, Shell scripting
- Build Tools: Make, CMake
- SCM: Git, Perforce
- DL Frameworks & Libraries: CUDA, cuDNN, cuBLAS, PyTorch, JAX, TensorFlow
- Operating System: Linux
Team & Environment
You will join a technically diverse team of DL algorithm engineers and performance optimization specialists.
Benefits & Compensation
- Compensation: $148,000 - $235,750 for Level 3, and $184,000 - $287,500 for Level 4.
- Eligible for equity.
- Comprehensive benefits.
NVIDIA is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.