USD 184,000 – 356,500 / year

NVIDIA is hiring a Senior ML Platform Engineer - Lepton

Responsibilities

Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices.
Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation.
Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.
Participate in the on-call rotation to support platform services and infrastructure running critical ML jobs, driving root cause analysis and implementing preventative measures.
Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes.
Drive the adoption of modern GPU technologies (e.g., GB200, NVLink) and ensure smooth integration of next-generation hardware into ML pipelines.

Benefits

Eligible for equity
Benefits (specifics not detailed)

Additional Information

Applications accepted until November 8, 2025.
NVIDIA is an equal opportunity employer and does not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Required Skills

Ansible, Terraform, Python, Go, Kubernetes, Docker, PyTorch, TensorFlow, MLOps, Cloud Infrastructure, Distributed Systems, CI/CD

About company

NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.


Job Details

Category: infrastructure

Posted: 8 months ago
