Responsibilities
- Architect multi-petabyte storage solutions for AI and machine learning workloads using systems such as WekaFS and Ceph, and lead capacity forecasting and cost reduction through tiered storage and resource optimization.
- Design and fine-tune high-performance networks using RDMA, InfiniBand, and 400GbE to maximize throughput and minimize latency, including deploying NVMe-oF and iSCSI and optimizing TCP/IP for storage traffic.
- Develop Kubernetes storage controllers and operators to enable automated provisioning, self-service interfaces, secure multi-tenancy, and quota management, along with reusable infrastructure patterns using Helm and Terraform.
- Achieve data throughput of 10–50 GB/s per GPU node by optimizing caching strategies for models and datasets, tuning parallel filesystems, and improving data pathways across large-scale clusters.
- Create hierarchical caching layers using local NVMe, distributed caches, and object storage, improving data locality and model weight delivery with intelligent prefetch and eviction logic.
- Establish comprehensive observability with monitoring, alerting, and service-level objectives; design disaster recovery and backup procedures with documented runbooks; and conduct chaos engineering to ensure resilience.
- Collaborate with machine learning and site reliability teams, provide mentorship on storage efficiency, contribute to open-source projects, and produce technical documentation and post-incident reviews.
Work Arrangement
Hybrid