NVIDIA is hiring a Principal Software Engineer – Large-Scale LLM Memory and Storage Systems to define the vision and roadmap for large-scale LLM memory and storage systems within the Dynamo inference framework. You will design and evolve a unified memory layer spanning multiple tiers and architect deep integrations with leading LLM serving engines.
What You'll Do
- Design and evolve a unified memory layer that spans GPU memory, pinned host memory, RDMA-accessible memory, SSD tiers, and remote storage to support large-scale LLM inference (a conceptual sketch follows this list).
- Architect and implement deep integrations with leading LLM serving engines (such as vLLM, SGLang, and TensorRT-LLM), focusing on KV-cache offload, reuse, and remote sharing across heterogeneous clusters.
- Co-design interfaces and protocols enabling disaggregated prefill, peer-to-peer KV-cache sharing, and multi-tier KV-cache storage for high-throughput, low-latency inference.
- Partner closely with GPU architecture, networking, and platform teams to exploit GPUDirect, RDMA, NVLink, and similar technologies for low-latency KV-cache access across heterogeneous accelerators.
- Mentor engineers, set technical direction for memory and storage subsystems, and represent the team in internal reviews and external forums.
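To make the tiered-memory responsibility above concrete, here is a purely illustrative Rust sketch of the idea behind a multi-tier KV cache: each tier named in the role (GPU HBM, pinned host, SSD, remote storage) is modeled as a bounded LRU pool, and cold blocks are demoted down the hierarchy rather than dropped. All names here (MemoryTier, KvBlock, TieredKvCache) are invented for this sketch and are not Dynamo or serving-engine APIs.

```rust
// Toy multi-tier KV-cache placement policy. Hypothetical names throughout;
// not part of Dynamo, vLLM, SGLang, or TensorRT-LLM.
use std::collections::VecDeque;

/// The memory tiers named in the role, ordered fastest to slowest.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryTier {
    GpuHbm,
    PinnedHost,
    LocalSsd,
    RemoteStore,
}

impl MemoryTier {
    /// Next-slower tier to demote into, if any.
    fn demote(self) -> Option<MemoryTier> {
        match self {
            MemoryTier::GpuHbm => Some(MemoryTier::PinnedHost),
            MemoryTier::PinnedHost => Some(MemoryTier::LocalSsd),
            MemoryTier::LocalSsd => Some(MemoryTier::RemoteStore),
            MemoryTier::RemoteStore => None,
        }
    }
}

/// A block of KV-cache state for one sequence's token range.
#[derive(Debug)]
struct KvBlock {
    seq_id: u64,
    tier: MemoryTier,
}

/// Each tier holds a bounded number of blocks; on overflow, the
/// least-recently-inserted block cascades to the next-slower tier.
struct TieredKvCache {
    capacity_per_tier: usize,
    tiers: Vec<VecDeque<KvBlock>>, // index 0 = GpuHbm ... 3 = RemoteStore
}

impl TieredKvCache {
    fn new(capacity_per_tier: usize) -> Self {
        Self {
            capacity_per_tier,
            tiers: (0..4).map(|_| VecDeque::new()).collect(),
        }
    }

    fn tier_index(tier: MemoryTier) -> usize {
        match tier {
            MemoryTier::GpuHbm => 0,
            MemoryTier::PinnedHost => 1,
            MemoryTier::LocalSsd => 2,
            MemoryTier::RemoteStore => 3,
        }
    }

    /// Insert a block at the given tier, cascading demotions as needed.
    fn insert(&mut self, mut block: KvBlock, tier: MemoryTier) {
        block.tier = tier;
        let idx = Self::tier_index(tier);
        self.tiers[idx].push_back(block);
        if self.tiers[idx].len() > self.capacity_per_tier {
            let victim = self.tiers[idx].pop_front().unwrap();
            if let Some(lower) = tier.demote() {
                // Demote to the next tier instead of dropping the block.
                self.insert(victim, lower);
            }
            // RemoteStore overflow: a real system would spill to cold storage.
        }
    }
}

fn main() {
    let mut cache = TieredKvCache::new(2);
    for seq_id in 0..5 {
        cache.insert(KvBlock { seq_id, tier: MemoryTier::GpuHbm }, MemoryTier::GpuHbm);
    }
    for (i, tier) in cache.tiers.iter().enumerate() {
        println!("tier {}: {:?}", i, tier);
    }
}
```

A production design would go well beyond this cascade: promotion on reuse, GPUDirect/RDMA transfer paths between tiers, and peer-to-peer sharing across nodes. The sketch only illustrates the placement-and-demotion dimension of the problem.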
What We're Looking For
- Master's degree, PhD, or equivalent experience.
- 15+ years of experience building large-scale distributed systems, high-performance storage, or ML systems infrastructure in C/C++ and Python, with a track record of delivering production services.
- Deep understanding of memory hierarchies (GPU HBM, host DRAM, SSD, remote storage) and experience designing systems spanning multiple tiers for performance and cost.
- Expertise in distributed caching or key-value systems optimized for low latency and high concurrency.
- Hands-on experience with networked I/O and RDMA/NVMe-oF/NVLink-style technologies, and familiarity with disaggregated and aggregated deployments for AI clusters.
- Strong skills in profiling and optimizing systems across CPU, GPU, memory, and network, using metrics to drive architectural decisions and validate improvements.
- Excellent communication skills and prior experience leading cross-functional efforts.
Nice to Have
- Prior contributions to open-source LLM serving or systems projects focused on KV-cache optimization, compression, streaming, or reuse.
- Experience designing unified memory or storage layers exposing a single logical KV or object model across GPU, host, SSD, and cloud tiers, especially in enterprise or hyperscale environments.
- Publications or patents in areas like LLM systems, memory-disaggregated architectures, RDMA/NVLink-based data planes, or KV-cache systems for ML.
Technical Stack
- Languages: Rust, Python, C/C++
- Technologies: GPU, RDMA, NVLink, NVMe-oF
- Frameworks: vLLM, SGLang, TensorRT-LLM
Benefits & Compensation
- Compensation: $272,000–$425,500 USD.
- Equity package.
- Comprehensive benefits package.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.