NVIDIA is hiring a Senior Site Reliability Engineer, DGX Cloud

Responsibilities

  • Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging and alerting
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Support services before they launch through system design consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Lead triage and root-cause analysis of high-severity incidents
  • Practice balanced incident response and blameless postmortems
  • Participate in an on-call rotation to support production services
Required Skills
Kubernetes, AWS, GCP, Azure, OCI, Terraform, Ansible, Chef, Puppet, Python, Site Reliability Engineering, Distributed Systems, Infrastructure as Code, Cloud Computing, Automation
About the company
NVIDIA
NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.
Job Details
Category: Infrastructure
Posted: 9 months ago