Responsibilities
- Build and maintain distributed inference systems capable of handling high request volumes and supporting text, image, and multimodal models reliably.
- Develop and refine distributed inference methods such as expert parallelism for Mixture of Experts models, tensor parallelism, and pipeline parallelism to maximize serving performance.
- Improve inference speed and resource efficiency using CUDA graphs, TensorRT and TensorRT-LLM optimizations, PyTorch compilation, and speculative decoding.
- Partner with hardware teams to identify performance bottlenecks and jointly optimize inference workloads across GPUs, TPUs, and specialized accelerators.
- Collaborate with AI researchers and infrastructure teams to create optimized execution strategies and streamline end-to-end model serving workflows.
Benefits
- Competitive compensation
- Startup equity
- Health insurance
- Other competitive benefits
Compensation
Competitive compensation