Full-time

NVIDIA is hiring a Senior Site Reliability Engineer, DGX Cloud

About the Role

NVIDIA is looking for a Senior Site Reliability Engineer to join our DGX Cloud team. You will be responsible for maintaining the high-performance DGX Cloud clusters used by AI researchers and enterprise clients worldwide, focusing on the operational and reliability aspects of large-scale Kubernetes environments.

What You'll Do

  • Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Support services before they launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Lead triage and root-cause analysis of high-severity incidents
  • Practice balanced incident response and blameless postmortems
  • Participate in an on-call rotation to support production services

What We're Looking For

  • BS in Computer Science or related technical field, or equivalent experience
  • 10+ years of experience operating production services
  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
  • Proficiency in at least one high-level programming language (e.g., Python, Go)
  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
  • Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
  • Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools such as OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, and Splunk

Nice to Have

  • Operating GPU-accelerated clusters with KubeVirt in production
  • Applying generative-AI techniques to reduce operational toil
  • Automating incident response with Shoreline or StackStorm

Technical Stack

  • Cloud: AWS, GCP, Azure, OCI
  • Infrastructure & Orchestration: Kubernetes, Terraform, Ansible, Chef, Puppet, KubeVirt
  • Languages: Python, Go
  • Platform: Linux
  • Observability: OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk
  • Incident Automation: Shoreline, StackStorm

Team & Environment

You will be part of a diverse, supportive environment where everyone is inspired to do their best work.

Required Skills
Kubernetes, AWS, GCP, Azure, OCI, Terraform, Ansible, Chef, Puppet, Python, Site Reliability Engineering, Distributed Systems, Infrastructure as Code, Cloud Computing, Automation

About company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category: Infrastructure
Posted: 7 months ago