Full-time

NVIDIA is hiring a Principal Site Reliability Engineer, AI Infrastructure

About the Role

NVIDIA is seeking a Principal Site Reliability Engineer, AI Infrastructure to architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms. You’ll define platform-wide reliability metrics and collaborate across organizations to establish long-term strategies in a diverse, encouraging environment focused on defining the next era of AI computing.

What You'll Do

  • Architect and scale globally distributed production systems supporting AI/ML and HPC platforms across hybrid and multi-cloud environments.
  • Design and lead implementation of automation frameworks that reduce manual tasks, promote resilience, and uphold standard methodologies for system health and change safety.
  • Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing for sophisticated distributed systems.
  • Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies.
  • Pioneer initiatives influencing NVIDIA’s AI platform roadmap, participate in co-development with internal and external partners, and stay ahead of academic and industry advances.
  • Publish technical insights through papers, patents, or whitepapers and drive innovation in production engineering and system design.
  • Lead and mentor global teams in a technical capacity, participating in recruitment, design reviews, and developing standard methodologies for incident response and observability.

What We're Looking For

  • 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure with a track record of leading platform-scale efforts.
  • Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).
  • Expert-level programming in Python and one or more languages such as C++, Go, or Rust.
  • Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management.
  • Hands-on expertise with observability frameworks (Prometheus, Grafana, ELK, Loki) and Infrastructure as Code (Terraform, CDK, Pulumi).
  • Proficiency in Site Reliability Engineering concepts like error budgets, SLOs, distributed tracing, and architectural fault tolerance.
  • Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.
  • Proven track record of delivering long-term, forward-looking platform strategies.
  • Degree in Computer Science or related field, or equivalent experience.

Nice to Have

  • Hands-on experience building platforms for large-scale AI training, inferencing, and data movement pipelines.
  • Familiarity with deep learning frameworks (PyTorch, TensorFlow, JAX) and orchestration frameworks (Ray, Kubeflow).
  • Expertise in hardware fleet observability, predictive failure analysis, and power/resource-aware scheduling.
  • Experience leading operational readiness efforts and reliability engineering in GPU-heavy environments.
  • Track record of driving cultural improvements in incident management, root cause analysis, and postmortem processes across large teams.

Technical Stack

  • Platforms: Linux/Unix, AWS, GCP, Azure, OCI
  • Languages: Python, C++, Go, Rust
  • Orchestration & Observability: Kubernetes, Prometheus, Grafana, ELK, Loki
  • Infrastructure as Code: Terraform, CDK, Pulumi
  • AI Frameworks: PyTorch, TensorFlow, JAX, Ray, Kubeflow

Team & Environment

You will lead and mentor global teams in a technical capacity, participating in design reviews and developing standard methodologies.

Benefits & Compensation

  • Compensation: $272,000 - $425,500 USD, plus equity
  • Comprehensive benefits package

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Required Skills
Linux/Unix, AWS, GCP, Azure, OCI, Python, C++, Go, Rust, Kubernetes, AI Infrastructure, Site Reliability Engineering, Distributed Systems, Networking, Performance Optimization
About company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category: Infrastructure
Posted 8 months ago