Remote (Global), Full-time

NVIDIA is hiring a Senior Site Reliability Engineer, NIM Factory

About the Role

NVIDIA is looking for a Senior Site Reliability Engineer for the NIM Factory, where you will operate and improve the automation that produces NVIDIA Inference Microservices (NIMs). Your work will ensure the performance, accuracy, and availability of these services, directly impacting AI-powered applications.

What You'll Do

  • Operate a software factory that transforms AI models into deployable services validated across Cloud, On-prem, and Kubernetes environments.
  • Collaborate with the development team to deliver rapid iterations on technical strategies and roadmaps, continuously evolving the NIM factory.
  • Ensure the factory's availability, stability, and observability, monitor its critical metrics, and track service deployments across multiple cloud hosts.
  • Partner with internal and external SRE teams to provide the best experience for developers and users, securing infrastructure with robust configurations and management.
  • Collaborate broadly with AI model teams to build an efficient infrastructure, driving improvements based on user feedback and mentoring team members.
  • Participate in an on-call rotation for maintaining the reliability of NVIDIA NIMs and the NIM Factory.

What We're Looking For

  • Advanced system engineering skills in operating and improving the observability and maintainability of distributed microservices cloud applications.
  • Proven experience in working with multi-functional teams, principals, architects, and across organizational boundaries.
  • Demonstrated ability to mentor teams, grow team members, and adapt to the needs of customers.
  • Experience in operating distributed containerized applications using Docker, K8s, Cloud Endpoints, Helm, and Prometheus.
  • Experience using Infrastructure as Code tools like Terraform, Puppet, or Ansible.
  • Skill in pinpointing issues in cloud systems and an understanding of security for public cloud services.
  • A BS or MS in Computer Science, Computer Engineering, or equivalent experience.
  • 8+ years of experience as an SRE or Developer working on high-performance microservices and cloud software.

Nice to Have

  • Excellent communication and interpersonal skills for engaging a multi-functional team.
  • Experience with event-driven applications using services such as Temporal, Airflow, Kafka, or Redis.
  • A background of building and deploying containers for Microservices, Cloud, and On-prem deployments, along with their associated CI/CD pipelines.
  • A history of working with high-cardinality, high-dimensional metrics.

Technical Stack

  • Containerization & Orchestration: Docker, K8s, Helm
  • Monitoring & Observability: Prometheus
  • Infrastructure as Code: Terraform, Puppet, Ansible
  • Event-Driven & Workflow: Temporal, Airflow, Kafka, Redis
  • Integration: Cloud Endpoints

Benefits & Compensation

  • Compensation: $184,000 USD - $287,500 USD for Level 4, and $224,000 USD - $356,500 USD for Level 5.
  • Equity eligibility.
  • Comprehensive benefits package.

Team & Environment

Join a diverse and supportive environment where everyone is inspired to do their best work. You will collaborate broadly across teams to build efficient infrastructure and drive improvements.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Required Skills
Docker, Kubernetes, Terraform, Prometheus, Helm, Ansible, Puppet, Cloud Endpoints, Temporal, Airflow, Python, Go, CI/CD, Distributed Systems, Observability
About the Company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category: Infrastructure
Posted 8 months ago