NVIDIA is hiring a Senior Site Reliability Engineer, AI Infrastructure

Responsibilities

  • Lead strategic efforts in managing large-scale high-performance computing systems, covering compute, networking, and storage deployment
  • Enhance the ecosystem for GPU-powered computing by designing and refining scalable automation tools
  • Design, deploy, and manage heterogeneous AI and machine learning clusters across on-premises and cloud environments
  • Foster strong relationships with internal teams and users to ensure cluster reliability and adaptability to changing requirements
  • Assist research teams with workload execution, including performance evaluation and optimization
  • Perform in-depth root cause investigations and recommend effective remediation steps
  • Identify potential system issues proactively and implement preventive solutions

Other

  • The company supports a diverse and inclusive workplace and is an equal opportunity employer
  • Employment decisions are made without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, veteran status, disability, or other legally protected characteristics
Required Skills
Slurm, Kubernetes, RTDA, LSF, CentOS/RHEL, Ubuntu, Ansible, Puppet, Salt, Docker, AI Infrastructure, Site Reliability Engineering, High Performance Computing, Automation, Monitoring
About company
NVIDIA
NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.
Job Details
Category: Infrastructure
Posted: 10 months ago