Full-time

NVIDIA is hiring a Senior HPC DevOps Engineer

About the Role

NVIDIA is hiring a Senior HPC DevOps Engineer to help build the supercomputers and HPC clusters of the future. In this role, you will be a key player in groundbreaking advancements in artificial intelligence and GPU computing, driving the latest breakthroughs in at-scale system design and tuning.

What You'll Do

  • Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
  • Use and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
  • Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
  • Develop scripts and tools to automate deployment, configuration management, and operational monitoring.
  • Develop complex networking automation.
  • Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
  • Serve as a technical resource, developing and sharing best practices with internal teams.
  • Support R&D activities and engage in proofs of concept (POCs) and proofs of value (POVs) for future improvements.

What We're Looking For

  • B.Sc. in Computer Science, Engineering, or a related field, plus 5+ years of experience.
  • Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
  • Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles.
  • Familiarity with Jenkins, Ansible, Puppet/Chef.
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security.
  • Deep understanding of networking protocols such as InfiniBand and Ethernet.
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes.
  • Background with multiple storage solutions like Lustre, GPFS, ZFS, and XFS.
  • Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
  • Familiarity with cloud platforms (AWS, Azure, Google Cloud).

Nice to Have

  • Proven networking experience or strong knowledge through professional networking training.
  • Knowledge of CPU and/or GPU architecture.
  • Understanding of Kubernetes and container-related microservice technologies.
  • Experience with GPU-focused hardware/software (DGX, CUDA).
  • Background with RDMA (InfiniBand or RoCE) fabrics.

Technical Stack

  • Automation & CI/CD: Jenkins, Ansible, Puppet/Chef
  • Operating Systems: Windows, Linux (Red Hat/CentOS, Ubuntu)
  • Networking: InfiniBand, Ethernet
  • Orchestration: Slurm, Kubernetes
  • Storage: Lustre, GPFS, ZFS, XFS
  • Virtualization: VMware, Hyper-V, KVM, Citrix
  • Cloud Platforms: AWS, Azure, Google Cloud
  • GPU Technologies: DGX, CUDA

NVIDIA values diversity and is committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We provide reasonable accommodations to ensure all individuals can participate in the job application or interview process, perform essential job functions, and receive other benefits and privileges of employment.

Required Skills
Jenkins, Ansible, Puppet/Chef, Linux, Red Hat/CentOS, Ubuntu, InfiniBand, Ethernet, Slurm, Kubernetes, Lustre, HPC, DevOps, Windows
About company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category: Infrastructure
Posted: 3 months ago