Full-time

NVIDIA is hiring an AI Infrastructure Engineer, DGXC Lepton

About the Role

NVIDIA is looking for an AI Infrastructure Engineer to join the DGX Cloud (DGXC) Lepton team. You will design, build, and maintain the AI infrastructure that enables large-scale AI training and inferencing, implementing software and systems engineering practices to ensure high efficiency and availability.

What You'll Do

  • Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
  • Develop and optimize tools to improve infrastructure efficiency and resiliency.
  • Analyze, triage, and root-cause failures from the application level to the hardware level.
  • Enhance infrastructure and products underpinning NVIDIA's AI platforms.
  • Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
  • Define meaningful and actionable reliability metrics to track and improve system and service reliability.
  • Apply strong problem-solving, root cause analysis, and optimization skills.

What We're Looking For

  • 12+ years of experience developing software infrastructure for large-scale AI systems.
  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
  • Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
  • Proven track record in building and scaling large-scale distributed systems.
  • Experience with AI training and inferencing and data infrastructure services.
  • Familiarity with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
  • Proficiency in programming languages such as Python, C/C++, and scripting languages.
  • Excellent communication and collaboration skills.

Nice to Have

  • Experience working with large-scale AI clusters.
  • Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL).
  • Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, and Ray.
  • Experience with root cause analysis of failures at datacenter scale.
  • Strong background in software design and development.

Technical Stack

  • Languages: Python, C/C++
  • Observability: ELK, Prometheus, Loki
  • Frameworks: PyTorch, TensorFlow, JAX, Ray

Team & Environment

You will be part of the DGX Cloud Team. We cultivate a dynamic and supportive environment that values learning and growth, with a culture of blameless postmortems, iterative improvement, and risk-taking. We value diversity, intellectual curiosity, problem solving, and openness.

Benefits & Compensation

  • Compensation: $224,000 USD - $356,500 USD for Level 5 and $272,000 USD - $425,500 USD for Level 6, plus equity eligibility.
  • Comprehensive benefits package.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Required Skills
Python, C/C++, ELK, Prometheus, Loki, PyTorch, TensorFlow, JAX, Ray, AI Infrastructure, Distributed Systems, Observability, Performance Optimization, GPU Computing
About company
NVIDIA

NVIDIA is the platform upon which every new AI‑powered application is built.

Job Details
Category: Infrastructure
Posted 5 months ago