Full-time

NVIDIA is hiring a Senior System Software Engineer, Cloud Services

About the Role

NVIDIA is looking for a Senior System Software Engineer, Cloud Services to build, operate, and maintain cloud-hosted services for user and service authentication across the company. You will ensure the reliability, performance, and scalability of these critical services through robust observability systems and automation.

What You'll Do

  • Architect, implement, and maintain observability systems at scale for monitoring, alerting, logging, and tracing of cloud-based services.
  • Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets with service owners and product teams.
  • Build and maintain actionable dashboards displaying key metrics, SLIs/SLOs, and system health for distributed services.
  • Collaborate with software, platform, and networking teams to integrate observability throughout the application lifecycle.
  • Drive automation to reduce manual toil in monitoring, telemetry, and incident response workflows; build self-service observability tooling.
  • Address performance and reliability issues using root cause analysis, distributed tracing, and log correlation.
  • Participate in on-call rotations, contribute to post-incident reviews, and drive solutions to improve system resilience.
  • Develop expertise in team offerings and assist in managing support channels for other NVIDIA teams.

What We're Looking For

  • Bachelor’s or master’s degree in computer science or engineering, or equivalent experience.
  • 8+ years in large-scale systems engineering roles with live service development, deployment, observability, and on-call experience.
  • Hands-on experience with modern monitoring and observability tools such as Prometheus, Grafana, Datadog, or OpenTelemetry in a production environment.
  • Advanced coding skills in Python, Go, or similar languages for automation and observability integration.
  • Proficiency in cloud platforms (AWS, GCP, Azure) and containerized environments (Kubernetes, Docker).
  • Experience with configuration-as-code tools (Terraform, Helm, Ansible).
  • Strong communication and collaboration skills for global, cross-disciplinary teams.
  • Detail-oriented, analytical approach to problem solving and high standards for operational excellence.
  • Experience with incident management and postmortem processes.

Nice to Have

  • Familiarity with the Java Spring Boot framework.
  • Hands-on experience with Apache Cassandra and HashiCorp Vault.
  • Coding experience with React and Next.js for supporting front-end admin services.

Technical Stack

  • Languages: Python, Go, JavaScript
  • Frameworks & Frontend: React, Next.js, Java Spring Boot
  • Cloud & Infrastructure: AWS, GCP, Azure, Kubernetes, Docker, Terraform, Helm, Ansible
  • Observability: Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, OpenTelemetry
  • Databases & Security: Apache Cassandra, HashiCorp Vault

Team & Environment

You will collaborate closely with software, platform, and networking teams across NVIDIA.

Benefits & Compensation

  • Compensation range: $184,000 to $287,500 USD, plus equity
  • Comprehensive benefits package

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. NVIDIA does not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Required Skills
Python, Go, Kubernetes, Docker, AWS, GCP, Azure, React, Next.js, JavaScript, Cloud Services, System Software
About company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category infrastructure
Posted 7 months ago