NVIDIA is hiring a Senior HPC DevOps Engineer to help build the supercomputers and HPC clusters of the future. In this role, you will be a key player in groundbreaking advancements in artificial intelligence and GPU computing, driving the latest breakthroughs in at-scale system design and tuning.
What You'll Do
- Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
- Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
- Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
- Develop scripts and tools to automate deployment, configuration management, and operational monitoring.
- Develop complex networking automation.
- Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
- Serve as a technical resource, developing and sharing best practices with internal teams.
- Support R&D activities and engage in proofs of concept (PoCs) and proofs of value (PoVs) for future improvements.
What We're Looking For
- B.Sc. in Computer Science, Engineering, or a related field with 5+ years of experience.
- Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
- Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles.
- Familiarity with Jenkins, Ansible, Puppet/Chef.
- Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security.
- Deep understanding of networking technologies such as InfiniBand and Ethernet.
- Experience with job scheduling and orchestration tools such as Slurm and Kubernetes.
- Background with multiple storage solutions like Lustre, GPFS, ZFS, and XFS.
- Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
- Familiarity with cloud platforms (AWS, Azure, Google Cloud).
Nice to Have
- Proven networking experience, or strong networking knowledge gained through professional training.
- Knowledge of CPU and/or GPU architecture.
- Understanding of Kubernetes and container-related microservice technologies.
- Experience with GPU-focused hardware/software (DGX, CUDA).
- Background with RDMA (InfiniBand or RoCE) fabrics.
Technical Stack
- Automation & CI/CD: Jenkins, Ansible, Puppet/Chef
- Operating Systems: Windows, Linux (Red Hat/CentOS, Ubuntu)
- Networking: InfiniBand, Ethernet
- Orchestration: Slurm, Kubernetes
- Storage: Lustre, GPFS, ZFS, XFS
- Virtualization: VMware, Hyper-V, KVM, Citrix
- Cloud Platforms: AWS, Azure, Google Cloud
- GPU Technologies: DGX, CUDA
NVIDIA values diversity and is committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We provide reasonable accommodations to ensure all individuals can participate in the job application or interview process, perform essential job functions, and receive other benefits and privileges of employment.


