NVIDIA is looking for a Senior Software Engineer to improve our HPC infrastructure for business-critical services and AI applications. You will join a team building and operating sophisticated, cloud-native systems using modern distributed systems patterns.
What You'll Do
- Apply modern distributed systems patterns to push the limits of scale, latency, and reliability.
- Continuously improve infrastructure provisioning and operations with automation, APIs, and self‑service platforms.
- Operate in a globally distributed, hybrid multi‑cloud environment (AWS, GCP, on‑prem), building systems that are cloud‑native and location‑agnostic.
- Build strong cross-functional relationships and align with collaborators across various business units.
- Improve uptime and Quality of Service (QoS) through data-driven operations, well-defined SLOs, and robust incident practices.
- Participate in the team’s on‑call rotation and lead high‑impact incident response when needed.
What We're Looking For
- Strong coding skills in at least two of: Go, Java, C/C++, Scala, Python, Elixir, with a focus on backend, systems, or infrastructure engineering.
- Deep understanding of scalability, consistency, and performance trade‑offs in server‑side systems; ability to build horizontally scalable, resilient, and low‑latency services.
- Experience owning services end‑to‑end: architecture, design reviews, implementation, testing, rollout, observability, and iterative improvement.
- Hands‑on experience with at least one major cloud provider (GCP, AWS, or Azure) and cloud‑native primitives (managed storage, messaging, compute).
- Proficiency with modern CI/CD, GitOps workflows, and Infrastructure as Code practices for safe, repeatable changes.
- Bias for action, strong problem‑solving skills, and a track record of simplifying complex systems.
- B.S. in Computer Science or a related field (or equivalent experience), with 5+ years of relevant experience.
- Clear communication and strong collaboration skills; comfortable guiding technical decisions across teams.
Nice to Have
- Prior experience building core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems managed by job schedulers (e.g., Slurm or Kubernetes).
- Maintainer or co‑maintainer responsibilities for an open-source component (plugin, operator, exporter, controller, or SDK) used in production at large scale.
Technical Stack
- Languages: Go, Java, C/C++, Scala, Python, Elixir
- Cloud: AWS, GCP, Azure
Benefits & Compensation
- Compensation Range: $152,000 USD - $241,500 USD
- Equity
- Comprehensive benefits package
Work Mode
This role follows a hybrid work model.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.




