NVIDIA is looking for a Senior System Software Engineer, Cloud Services, to build, operate, and maintain cloud-hosted services for user and service authentication across the company. You will ensure the reliability, performance, and scalability of these critical services through robust observability systems and automation.
What You'll Do
- Architect, implement, and maintain observability systems at scale for monitoring, alerting, logging, and tracing of cloud-based services.
- Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets with service owners and product teams.
- Build and maintain actionable dashboards displaying key metrics, SLIs and SLOs, and system health for distributed services.
- Collaborate with software, platform, and networking teams to integrate observability throughout the application lifecycle.
- Drive automation to reduce manual toil in monitoring, telemetry, and incident response workflows; build self-service observability tooling.
- Address performance and reliability issues using root cause analysis, distributed tracing, and log correlation.
- Participate in on-call rotations, contribute to post-incident reviews, and drive solutions to improve system resilience.
- Develop expertise in team offerings and assist in managing support channels for other NVIDIA teams.
What We're Looking For
- Bachelor’s or master’s degree in computer science or engineering, or equivalent experience.
- 8+ years in large-scale systems engineering roles, including live service development, deployment, observability, and on-call experience.
- Hands-on experience with modern monitoring systems like Prometheus, Grafana, Datadog, or OpenTelemetry in a production environment.
- Advanced coding skills in Python, Go, or similar languages for automation and observability integration.
- Proficiency with cloud platforms (AWS, GCP, Azure) and containerized environments (Kubernetes, Docker).
- Experience with configuration-as-code tools (Terraform, Helm, Ansible).
- Strong communication and collaboration skills for global, cross-disciplinary teams.
- Detail-oriented, analytical approach to problem solving and high standards for operational excellence.
- Experience with incident management and postmortem processes.
Nice to Have
- Familiarity with the Java Spring Boot framework.
- Hands-on experience with Apache Cassandra and HashiCorp Vault.
- Coding experience with React and Next.js for supporting front-end admin services.
Technical Stack
- Languages: Python, Go, JavaScript
- Frameworks & Frontend: React, Next.js, Java Spring Boot
- Cloud & Infrastructure: AWS, GCP, Azure, Kubernetes, Docker, Terraform, Helm, Ansible
- Observability: Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, OpenTelemetry
- Databases & Security: Apache Cassandra, HashiCorp Vault
Team & Environment
You will collaborate closely with software, platform, and networking teams across NVIDIA.
Benefits & Compensation
- Compensation range: $184,000 to $287,500 USD, plus equity eligibility
- Comprehensive benefits package
NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. NVIDIA does not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.



