Architect multi-petabyte storage solutions for AI and machine learning on systems such as WekaFS and Ceph, and lead capacity forecasting and cost reduction through tiered storage and resource optimization.
Design and fine-tune high-performance networks using RDMA, InfiniBand, and 400GbE to maximize throughput and minimize latency, including deployment of NVMe-oF and iSCSI and optimization of TCP/IP for storage traffic.
Develop Kubernetes storage controllers and operators to enable automated provisioning, self-service interfaces, secure multi-tenancy, and quota management, along with reusable infrastructure patterns using Helm and Terraform.
Achieve data throughput of 10–50 GB/s per GPU node by optimizing caching strategies for models and datasets, tuning parallel filesystems, and improving data pathways across large-scale clusters.
Create hierarchical caching layers using local NVMe, distributed caches, and object storage, improving data locality and model weight delivery with intelligent prefetch and eviction logic.
Establish comprehensive observability with monitoring, alerting, and service-level objectives; design disaster recovery and backup procedures with documented runbooks; and conduct chaos engineering to ensure resilience.
Collaborate with machine learning and site reliability teams, provide mentorship on storage efficiency, contribute to open-source projects, and produce technical documentation and post-incident reviews.

Hybrid

Together AI is hiring a Staff Engineer, Distributed Storage, HPC & AI Infrastructure