NVIDIA is hiring a Senior Site Reliability Engineer, DGX Cloud

Responsibilities

Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Support services before they launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
Scale systems sustainably through mechanisms like automation, and evolve them by pushing for changes that improve reliability and velocity
Lead triage and root-cause analysis of high-severity incidents
Practice balanced incident response and blameless postmortems
Participate in the on-call rotation to support production services

Required Skills

Kubernetes, AWS, GCP, Microsoft Azure, Terraform, Ansible, Puppet, Python, Distributed Systems, Infrastructure as Code, Cloud Computing, Automation

About the company

NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.

Job Details

Category: infrastructure

Posted: 10 months ago