NVIDIA is seeking a Principal Site Reliability Engineer, AI Infrastructure, to architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms. You’ll define platform-wide reliability metrics and collaborate across organizations to establish long-term strategies in a diverse, encouraging environment focused on defining the next era of AI computing.

What You'll Do

Architect and scale globally distributed production systems supporting AI/ML and HPC platforms across hybrid and multi-cloud environments.
Design and lead implementation of automation frameworks that reduce manual tasks, promote resilience, and uphold standard methodologies for system health and change safety.
Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing for sophisticated distributed systems.
Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies.
Pioneer initiatives influencing NVIDIA’s AI platform roadmap, participate in co-development with internal and external partners, and stay ahead of academic and industry advances.
Publish technical insights through papers, patents, or whitepapers and drive innovation in production engineering and system design.
Lead and mentor global teams in a technical capacity, participating in recruitment, design reviews, and developing standard methodologies for incident response and observability.

What We're Looking For

15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure with a track record of leading platform-scale efforts.
Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).
Expert-level programming in Python and one or more languages such as C++, Go, or Rust.
Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management.
Hands-on expertise with observability frameworks (Prometheus, Grafana, ELK, Loki) and Infrastructure as Code (Terraform, CDK, Pulumi).
Proficiency in Site Reliability Engineering concepts like error budgets, SLOs, distributed tracing, and architectural fault tolerance.
Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.
Proven track record of delivering long-term, forward-looking platform strategies.
Degree in Computer Science or related field, or equivalent experience.

Nice to Have

Hands-on experience building platforms for large-scale AI training, inference, and data movement pipelines.
Familiarity with deep learning frameworks (PyTorch, TensorFlow, JAX) and orchestration frameworks (Ray, Kubeflow).
Expertise in hardware fleet observability, predictive failure analysis, and power/resource-aware scheduling.
Experience leading operational readiness efforts and reliability engineering in GPU-heavy environments.
Track record of driving cultural improvements in incident management, root cause analysis, and postmortem processes across large teams.

Technical Stack

Platforms: Linux/Unix, AWS, GCP, Azure, OCI
Languages: Python, C++, Go, Rust
Orchestration & Observability: Kubernetes, Prometheus, Grafana, ELK, Loki
Infrastructure as Code: Terraform, CDK, Pulumi
AI Frameworks: PyTorch, TensorFlow, JAX, Ray, Kubeflow

Team & Environment

You will lead and mentor global teams in a technical capacity, participating in design reviews and developing standard methodologies.

Benefits & Compensation

Compensation: $272,000 USD - $425,500 USD + equity
Comprehensive benefits package

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.