Responsibilities
- Design and optimize high-performance data pipelines for distributed training and storage, using tools such as Arrow, DuckDB, LanceDB, BigQuery, and vector databases.
- Focus on low-level optimizations (latency, throughput, reliability, GPU utilization).
- Build monitoring and visualization tools for tracking data quality, pipeline performance, and experiments.
- Optimize distributed AI workloads for reliability, latency, and efficiency.
- Scope and supervise projects so that interns, PhD students, and post-docs can contribute and collaborate effectively.
- Support recruiting efforts and help shape the growth of the infrastructure team.
Requirements
- 5+ years of backend or infrastructure engineering experience
- Strong Python programming skills (bonus points for lower-level languages)
- Experience with distributed systems and cloud platforms (AWS, GCP, Azure)
- Hands-on experience with containerization (Docker, Kubernetes) and infrastructure as code (Terraform)
- Experience building or supporting ML/AI infrastructure in production
- Experience with high-performance data tools (DuckDB, Apache Spark, Delta Lake)
- Experience with GPU orchestration and large-scale model training
- Familiarity with ML platforms (SageMaker, Vertex AI) and frameworks (PyTorch, JAX)
- Experience mentoring junior engineers, interns, or researchers and breaking down complex projects into manageable tasks
- Experience participating in technical hiring processes and evaluating candidates
Nice to Have
- Deep knowledge of training architectures, CUDA programming, or TPU optimization
- Full-stack development experience with frameworks like React for building web applications
- Experience managing HPC infrastructure, such as Slurm or Kubernetes clusters
- Background in monitoring stacks (Prometheus, Grafana) for ML pipeline observability
Benefits
- Medical, dental, and vision insurance; ESP covers 100% of the premium
- 401k plan with match (if based in the United States)
- 2,000 USD home office stipend
- Unlimited paid time off, with a recommended minimum of three weeks per year
- Flexible working hours
- Regular team retreats around the world