San Francisco Hybrid Employment $270,000 - $300,000

Together AI is hiring a Forward Deployed Engineer (GPU Clusters)

Responsibilities

  • Cluster Hardening & Validation: Design and execute rigorous pre-handover test suites (NCCL, DCGM, GPU Burn) to ensure clusters are stable under the extreme stress of multi-node training.
  • Technical Partnership: Act as the primary technical point of contact for model labs, helping them tune their orchestration layer (Kubernetes or SLURM) for maximum throughput.
  • Infrastructure Optimization: Profile and debug low-level bottlenecks in InfiniBand (IB) fabrics, NVLink topologies, and high-performance storage systems.
  • Opinionated Onboarding: Build reference designs and "out-of-the-box" configurations for training frameworks to reduce customer time-to-train.
  • Benchmarking & Migration: Lead complex benchmarking exercises to demonstrate the performance impact of migrating to new hardware families or Together AI’s optimized infrastructure.
  • Product Feedback Loop: Directly influence our hardware and software roadmap by surfacing edge cases and performance gaps found during customer deployments.

Requirements

  • Experience: 5+ years in a technical role, with a strong focus on large-scale GPU infrastructure.
  • Orchestration Mastery: Deep, hands-on experience with Kubernetes (specifically GPU-operator and device plugins) and/or SLURM for workload scheduling.
  • Networking & Interconnects: Expert knowledge of InfiniBand, RoCE, and NVLink; ability to diagnose network failures that degrade collective communication (NCCL).
  • Storage Knowledge: Familiarity with parallel file systems (VAST or Weka preferred) and object storage, specifically in the context of large-scale checkpointing.
  • Benchmarking Skills: Ability to run and interpret training benchmarks and communication tests to validate cluster health and performance.
  • Coding & Automation: Proficiency in Python and shell scripting; experience with Ansible or similar tools for automated cluster configuration.
  • Willingness to dive into the customer's stack to solve hard problems, and comfort with the high-stakes, fast-paced environment of frontier model labs.
Required Skills
Kubernetes, Python, Shell Scripting
About the company
Together AI
Together AI is a research-driven artificial intelligence company that believes open and transparent AI systems will drive innovation. They are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models, and have contributed to leading open-source research, models, and datasets.
Job Details
Category other
Posted 8 days ago