NVIDIA is hiring a Senior AI/HPC Engineer to join its Infrastructure Specialist team. In this role, you will be the primary technical contact for customers, working with partners and internal teams to analyze, define, and implement large-scale AI/HPC projects across Networking, System Design, and Automation.
What You'll Do
- Deploy, manage, and maintain AI/HPC infrastructure in Linux-based environments.
- Act as the domain expert for customers from planning calls through implementation.
- Create comprehensive handover documentation and perform knowledge transfers for complex systems.
- Provide feedback to internal teams through bug reports, workarounds, and suggested improvements.
What We're Looking For
- A BS/MS/PhD or equivalent experience in Computer Science, Engineering, Physics, Mathematics, or a related field.
- 5+ years providing in-depth support and deployment services for hardware and software products.
- Expert knowledge of and experience with Linux System Administration, including process management, package management, kernel management, boot troubleshooting, and performance optimization.
- Experience with cluster management technologies and schedulers such as SLURM, LSF, or UGE.
- Scripting proficiency, strong organizational skills, and the ability to prioritize tasks with limited supervision.
- Excellent verbal and written English skills and good interpersonal skills for resolving critical customer issues.
- Industry-standard Linux certifications.
- Experience with advanced networking, including routing, tuning, and monitoring.
Nice to Have
- Hands-on experience with MPI (e.g., OpenMPI, MPICH), including distributed communication programming and cluster debugging.
- In-depth understanding of NCCL principles and expertise in collective communication optimization for NVIDIA GPU clusters.
- Experience deploying and optimizing high-speed networks (InfiniBand/Ethernet) and understanding their impact on GPU cluster performance.
- Familiarity with automation tools such as Ansible, Salt, or Puppet for batch configuration and operational automation.
- Knowledge of and hands-on experience with Kubernetes for container orchestration, resource scheduling, and integration with HPC environments.
Technical Stack
- Linux, SLURM, LSF, UGE
- MPI (OpenMPI, MPICH), NCCL
- InfiniBand, Ethernet
- Ansible, Salt, Puppet, Kubernetes
Team & Environment
You will join NVIDIA's Infrastructure Specialist team, a diverse and supportive environment where everyone is inspired to do their best work.
NVIDIA is an equal opportunity employer and values diversity. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.




