Responsibilities
- Develop and refine a unified memory architecture that integrates GPU memory, pinned host memory, RDMA-capable memory, SSD layers, and remote file, object, and cloud storage to enable efficient large-scale LLM inference.
- Design and build deep integrations with leading LLM serving frameworks such as vLLM, SGLang, and TensorRT-LLM, focusing on KV-cache offloading, reuse, and cross-cluster sharing in heterogeneous and disaggregated environments.
- Collaborate on defining interfaces and protocols that support disaggregated prefill operations, peer-to-peer KV-cache exchange, and multi-tiered caching across GPU, CPU, local disk, and remote memory for low-latency, high-throughput inference.
- Work closely with hardware and systems teams to leverage technologies such as GPUDirect, RDMA, and NVLink for fast, efficient access and sharing of KV-cache data across diverse accelerators and memory resources.
- Guide engineering teams, establish the technical vision for memory and storage subsystems, and represent the organization in technical reviews, open-source communities, conferences, and customer engagements.
Benefits
- Highly competitive salaries
- Comprehensive benefits package
- Eligibility for equity
Compensation
Highly competitive salaries, comprehensive benefits package, and eligibility for equity
Other
- Applications for this position will be accepted until at least December 26, 2025.
- The company is committed to building a diverse and inclusive workplace and is proud to be an equal opportunity employer. It does not discriminate based on race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, veteran status, disability status, or any other legally protected characteristic.
