NVIDIA is hiring a Senior Site Reliability Engineer, AI Infrastructure

Responsibilities

Lead strategic efforts in managing large-scale high-performance computing systems, covering compute, networking, and storage deployment
Enhance the ecosystem for GPU-powered computing by designing and refining scalable automation tools
Design, deploy, and manage heterogeneous AI and machine learning clusters across on-premises and cloud environments
Foster strong relationships with internal teams and users to ensure cluster reliability and adaptability to changing requirements
Assist research teams with workload execution, including performance evaluation and optimization
Perform in-depth root cause investigations and recommend effective remediation steps
Identify potential system issues proactively and implement preventive solutions

Other

The company supports a diverse and inclusive workplace and is an equal opportunity employer
Employment decisions are made without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, veteran status, disability, or other legally protected characteristics

Required Skills

SlurmKubernetesRTDALSFCentos/RHELUbuntuAnsiblePuppetSaltDockerAI InfrastructureSite Reliability EngineeringHigh Performance ComputingAutomationMonitoring SlurmKubernetesRTDALSFCentos/RHELUbuntuAnsiblePuppetSaltDockerAI InfrastructureSite Reliability EngineeringHigh Performance ComputingAutomationMonitoring

About company

NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.

All jobs at NVIDIA Visit website

Job Details

Category infrastructure

Posted 10 months ago

Similar Jobs

Other opportunities you might be interested in

Software Engineers Python / Devops

Coface

Bois-Colombes Hybrid

Containerization Cloud Consulting

Kyndryl

Medellin Hybrid

Senior DevOps Engineer (m/w/d)

Adfinis AG

Switzerland Remote (Global)

DevOps & Site Reliability Engineer

Oowlish Technology

Brasília Remote (Global)

Software Engineer / DevOps

Applied Intuition

Sunnyvale On-site

Founding Support Engineer

Nuclearn

Related Articles

Insights related to this role

Outdoor café in San Francisco with patrons dining, illustrating leisure and hospitality job growth amid AI-driven tech sector changes.

Industry Trends

AI Boom Job Impact: Tech Decline vs. Service Growth in SF

3 min a month ago

A remote developer working in a well-lit, modern workspace, illustrating a productive environment enabled by a developer experience platform.

Developer Experience Platform: Lessons from Europe

5 min a month ago

Home office setup with dual monitors showing Kubernetes dashboards, representing the rise of Kubernetes remote jobs in AI and cloud-native careers 2026.

Kubernetes Remote Jobs: AI & Cloud-Native Careers in 2026

5 min 2 months ago