Responsibilities
- Cluster Hardening & Validation: Design and execute rigorous pre-handover test suites (NCCL, DCGM, GPU Burn) to ensure clusters are stable under the extreme stress of multi-node training.
- Technical Partnership: Act as the primary technical point of contact for model labs, helping them tune their orchestration layer (Kubernetes or SLURM) for maximum throughput.
- Infrastructure Optimization: Profile and debug low-level bottlenecks in InfiniBand (IB) fabrics, NVLink topologies, and high-performance storage systems.
- Opinionated Onboarding: Build reference designs and "out-of-the-box" configurations for training frameworks to reduce customer time-to-train.
- Benchmarking & Migration: Lead complex benchmarking exercises to demonstrate the performance impact of migrating to new hardware families or Together AI’s optimized infrastructure.
- Product Feedback Loop: Directly influence our hardware and software roadmap by surfacing edge cases and performance gaps found during customer deployments.
Requirements
- Experience: 5+ years in a technical role, with a strong focus on large-scale GPU infrastructure.
- Orchestration Mastery: Deep, hands-on experience with Kubernetes (specifically the NVIDIA GPU Operator and device plugins) and/or SLURM for workload scheduling.
- Networking & Interconnects: Expert knowledge of InfiniBand, RoCE, and NVLink; ability to diagnose network failures that degrade collective communication (NCCL).
- Storage Knowledge: Familiarity with parallel file systems (VAST or Weka preferred) and object storage, specifically in the context of large-scale checkpointing.
- Benchmarking Skills: Ability to run and interpret training benchmarks and communication tests to validate cluster health and performance.
- Coding & Automation: Proficiency in Python and shell scripting; experience with Ansible or similar tools for automated cluster configuration.
- Willingness to dive into the customer's stack to solve hard problems, and comfort with the high-stakes, fast-paced environment of frontier model labs.