Responsibilities

Define and execute the full technical roadmap for voice inference, focusing on STT, TTS, and speech-to-speech models while aligning with industry trends and platform positioning.
Design and implement high-performance inference systems that achieve industry-leading time-to-first-byte, throughput, and GPU efficiency for voice workloads.
Own the scalable deployment of voice models, including serverless and dedicated serving architectures, with optimized batching, streaming pipelines, and real-time memory management.
Develop a comprehensive evaluation framework that assesses STT accuracy across diverse accents, languages, and noise conditions, and TTS quality in terms of naturalness, latency, and pronunciation.
Establish internal benchmarking standards that guide model selection and influence long-term roadmap decisions.
Design system architecture to support emerging voice AI paradigms such as audio-native large language models, codec-based models like SNAC and EnCodec, and end-to-end speech systems.
Lead technical integration with external model partners, managing the full lifecycle from onboarding to optimization and sustained performance.
Investigate and resolve deep technical issues across the stack, using profiling and root-cause analysis from kernel to framework layers.
Collaborate with platform engineering leaders to shape infrastructure decisions that meet strict latency and reliability requirements for real-time voice APIs.
Drive the technical vision for customer-facing fine-tuning of STT and TTS models, enabling customized voice experiences at scale.

Compensation

Competitive salary and equity package

Work Arrangement

Remote

Team

Part of the core machine learning infrastructure team focused on voice AI innovation

Responsibilities

Own the voice inference roadmap end-to-end: define and execute the technical strategy for optimizing STT, TTS, and speech-to-speech models across Together's infrastructure, with a clear-eyed view of where the field is heading and how to position the platform ahead of it.
Drive best-in-class inference performance: architect and implement systems targeting leading time-to-first-byte (TTFB), throughput, and GPU utilization for voice workloads; set the performance bar that others in the industry measure against rather than merely catching up to it.
Lead productionization of voice models at scale: design the serving architecture for serverless and dedicated endpoints, including batching strategies, streaming inference pipelines, and memory management tailored to real-time audio; own reliability and latency SLAs.
Build the voice evaluation platform: design a rigorous, extensible evaluation framework covering word error rate (WER) across accents, languages, and noise conditions for STT, and naturalness, latency, and pronunciation fidelity for TTS; establish the internal benchmark methodology that informs model selection and roadmap decisions.
Shape the architecture for next-generation model support: anticipate and enable emerging model paradigms (audio-native LLMs, codec-based architectures such as SNAC and EnCodec, and end-to-end speech-to-speech systems) before they're mainstream, not after.
Serve as the technical DRI (directly responsible individual) for model partner integrations: lead deep collaboration with partners such as Cartesia, Deepgram, and Rime; own the full lifecycle from integration to optimization to ongoing performance accountability.
Diagnose and resolve the hardest performance problems in the stack: conduct systematic profiling and root-cause analysis from GPU kernel behavior to framework-level bottlenecks; drive shipped improvements with documented, measurable impact.
Influence platform architecture across the organization: partner with platform engineering leadership to ensure the serving layer is built for the latency and reliability demands of real-time voice APIs; your technical decisions should raise the ceiling for the whole team.
Define and scale voice fine-tuning capabilities: lead the technical direction for enabling customers to fine-tune STT and TTS models on Together's infrastructure, establishing the primitives for differentiated voice experiences.
Lay technical foundations for a category-defining product surface: architect systems with enough foresight that they support multiple new voice products with minimal rework; think in terms of platforms, not point solutions.

Available

Together AI is hiring a Staff Machine Learning Engineer, Voice AI
