NVIDIA is hiring a Senior Site Reliability Engineer, AI Infrastructure

About the Role

Design and implement scalable systems to support AI workloads, ensuring high availability and performance through proactive monitoring, incident response, and infrastructure automation.

Responsibilities

  • Develop automation tools to streamline operations and reduce manual intervention
  • Monitor system performance and troubleshoot production issues
  • Collaborate with development teams to improve service reliability
  • Implement and maintain observability solutions including logging and alerting
  • Drive incident response and post-mortem analysis for critical outages
  • Optimize infrastructure for scalability and efficiency
  • Support deployment pipelines and continuous integration workflows
  • Ensure systems meet security and compliance standards
  • Contribute to capacity planning and resource forecasting
  • Maintain documentation for systems and operational procedures
  • Participate in on-call rotations for critical services
  • Evaluate new technologies for infrastructure improvements
  • Enhance disaster recovery and failover mechanisms
  • Work closely with software engineers to refine system design
  • Improve system uptime and reduce mean time to recovery
  • Integrate infrastructure changes with minimal service disruption
  • Apply software engineering principles to operations challenges
  • Promote best practices in configuration management
  • Support global infrastructure deployments
  • Ensure alignment with long-term platform architecture goals

Nice to Have

  • Master's degree in a technical discipline
  • Experience supporting AI or machine learning infrastructure
  • Familiarity with GPU-accelerated computing environments
  • Knowledge of Kubernetes in production settings
  • Experience with large-scale data processing systems
  • Background in performance benchmarking and optimization
  • Exposure to hardware-software co-design principles
  • Contributions to open-source infrastructure projects
  • Certifications in cloud or systems administration

Compensation

Competitive salary and comprehensive benefits package

Work Arrangement

Hybrid work model available

Team

Part of a high-performance engineering team focused on AI systems

Why This Role Matters

This position plays a critical role in maintaining the backbone of AI infrastructure, enabling cutting-edge research and development by ensuring systems are resilient, efficient, and scalable.

What We Value

We prioritize technical excellence, proactive problem solving, collaboration across teams, and a commitment to continuous improvement in system design and operations.

Available for qualified candidates

Required Skills
Python, C/C++, Go, Linux, AWS, Azure, OCI, Kubernetes, Terraform, Ansible, CI/CD, Networking, Monitoring, Distributed Systems, AI Infrastructure
About the Company
NVIDIA
NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.
Job Details
Category: Infrastructure
Posted 9 months ago