Amsterdam Hybrid Employment

Together AI is hiring a Staff Engineer, Distributed Storage, HPC & AI Infrastructure

Responsibilities

  • Architect multi-petabyte storage solutions for AI and machine learning workloads using systems such as WekaFS and Ceph, and lead capacity forecasting and cost reduction through tiered storage and resource optimization.
  • Design and fine-tune high-performance networks using RDMA, InfiniBand, and 400GbE to maximize throughput and minimize latency, including deployment of NVMe-oF and iSCSI and optimization of TCP/IP for storage traffic.
  • Develop Kubernetes storage controllers and operators to enable automated provisioning, self-service interfaces, secure multi-tenancy, and quota management, along with reusable infrastructure patterns using Helm and Terraform.
  • Achieve data throughput of 10–50 GB/s per GPU node by optimizing caching strategies for models and datasets, tuning parallel filesystems, and improving data pathways across large-scale clusters.
  • Create hierarchical caching layers using local NVMe, distributed caches, and object storage, improving data locality and model weight delivery with intelligent prefetch and eviction logic.
  • Establish comprehensive observability with monitoring, alerting, and service-level objectives; design disaster recovery and backup procedures with documented runbooks; and conduct chaos engineering to ensure resilience.
  • Collaborate with machine learning and site reliability teams, provide mentorship on storage efficiency, contribute to open-source projects, and produce technical documentation and post-incident reviews.

Work Arrangement

Hybrid

About company
Together AI
Together AI is a research-driven artificial intelligence company that believes open and transparent AI systems will drive innovation. They are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models, and have contributed to leading open-source research, models, and datasets.
Job Details
Category: Infrastructure
Posted: 8 days ago