About the Role

Design and implement scalable systems to support AI workloads, ensuring high availability and performance through proactive monitoring, incident response, and infrastructure automation.

Responsibilities

Develop automation tools to streamline operations and reduce manual intervention
Monitor system performance and troubleshoot production issues
Collaborate with development teams to improve service reliability
Implement and maintain observability solutions including logging and alerting
Drive incident response and post-mortem analysis for critical outages
Optimize infrastructure for scalability and efficiency
Support deployment pipelines and continuous integration workflows
Ensure systems meet security and compliance standards
Contribute to capacity planning and resource forecasting
Maintain documentation for systems and operational procedures
Participate in on-call rotations for critical services
Evaluate new technologies for infrastructure improvements
Enhance disaster recovery and failover mechanisms
Work closely with software engineers to refine system design
Improve system uptime and reduce mean time to recovery
Integrate infrastructure changes with minimal service disruption
Apply software engineering principles to operations challenges
Promote best practices in configuration management
Support global infrastructure deployments
Ensure alignment with long-term platform architecture goals

Nice to Have

Master's degree in a technical discipline
Experience supporting AI or machine learning infrastructure
Familiarity with GPU-accelerated computing environments
Knowledge of Kubernetes in production settings
Experience with large-scale data processing systems
Background in performance benchmarking and optimization
Exposure to hardware-software co-design principles
Contributions to open-source infrastructure projects
Certifications in cloud or systems administration

Compensation

Competitive salary and comprehensive benefits package

Work Arrangement

Hybrid work model available

Team

Part of a high-performance engineering team focused on AI systems

Why This Role Matters

This position plays a critical role in maintaining the backbone of AI infrastructure, enabling cutting-edge research and development by ensuring systems are resilient, efficient, and scalable.

What We Value

We prioritize technical excellence, proactive problem solving, collaboration across teams, and a commitment to continuous improvement in system design and operations.

Available for qualified candidates

NVIDIA is hiring a Senior Site Reliability Engineer, AI Infrastructure
