Build and maintain distributed inference systems capable of handling high request volumes and supporting text, image, and multimodal models reliably.
Develop and refine distributed inference methods such as expert parallelism for Mixture of Experts (MoE) models, tensor parallelism, and pipeline parallelism to maximize serving performance.
Improve inference speed and resource efficiency using CUDA graphs, TensorRT/TRT-LLM optimizations, PyTorch compilation, and speculative decoding techniques.
Partner with hardware teams to identify performance bottlenecks and jointly optimize inference workloads across GPUs, TPUs, and specialized accelerators.
Collaborate with AI researchers and infrastructure teams to create optimized execution strategies and streamline end-to-end model serving workflows.

Competitive compensation

Together AI is hiring an LLM Inference Frameworks and Optimization Engineer