Full-time

NVIDIA is hiring a Senior Site Reliability Engineer, Cloud

About the Role

NVIDIA is looking for a Senior Site Reliability Engineer to design, build, and maintain its large-scale production cloud systems with high efficiency and availability. In this role, you will be instrumental in ensuring the reliability and uptime of our GPU cloud services while enabling safe, rapid changes.

What You'll Do

  • Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters, focusing on performance, monitoring, logging, and alerting.
  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
  • Support services pre-launch through system design consulting, developing software tools and platforms, capacity management, and launch reviews.
  • Maintain live services by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems.

What We're Looking For

  • BS degree in Computer Science or a related technical field involving coding, or equivalent experience.
  • 5+ years of experience with infrastructure automation, distributed systems design, and developing tools for running large-scale private or public cloud systems in production.
  • Experience in one or more of the following: Python, Go, Perl, or Ruby.
  • In-depth knowledge of Linux, networking, and containers.

Nice to Have

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
  • Ability to debug and optimize code and automate routine tasks.
  • Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Technical Stack

  • Languages: Python, Go, Perl, Ruby
  • Platforms & Infrastructure: Linux, Kubernetes, OpenStack, Docker

Team & Environment

We foster a culture of diversity, intellectual curiosity, problem solving, and openness. We encourage collaboration, thinking big, and taking risks in a blame-free environment. You will have the opportunity for self-direction on meaningful projects with support and mentorship.

Benefits & Compensation

  • Compensation: $144,000 - $230,000 USD for Level 3 and $168,000 - $270,250 USD for Level 4, plus equity.
  • Comprehensive benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Required Skills
Python, Go, Perl, Ruby, Linux, Kubernetes, OpenStack, Docker, Site Reliability Engineering, Cloud Infrastructure, Automation, Distributed Systems, Monitoring, CI/CD
About company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category: Infrastructure
Posted: 7 months ago