About the Role
The role involves developing and refining model training infrastructure, working closely with ML practitioners to enhance training efficiency, scalability, and usability of core systems.
Responsibilities
- Design and implement scalable systems for training deep learning models
- Collaborate with machine learning teams to understand training workflow challenges
- Optimize training pipelines for speed, cost, and reliability
- Build tools that simplify hyperparameter tuning and experiment tracking
- Improve distributed training performance across GPU clusters
- Develop abstractions that make training accessible to non-experts
- Diagnose and resolve bottlenecks in data loading and model convergence
- Ensure training systems integrate smoothly with deployment infrastructure
- Maintain high standards for code quality and system observability
- Contribute to architectural decisions around model lifecycle management
- Evaluate new frameworks and libraries for training efficiency gains
- Support reproducibility and versioning of training runs
- Work on fault tolerance and automatic recovery in long-running jobs
- Help define best practices for training large-scale models
- Bridge gaps between research prototypes and production-ready systems
Nice to Have
- Experience with PyTorch or TensorFlow at scale
- Knowledge of mixed-precision training and memory optimization
- Background in high-performance computing or systems programming
- Prior work on ML platforming or MLOps tooling
- Familiarity with data parallelism and model parallelism strategies
- Contributions to open-source machine learning projects
- Understanding of model convergence monitoring and debugging
Benefits
- Comprehensive health, dental, and vision insurance
- Flexible paid time off policy
- Home office setup allowance
- Ongoing professional development budget
- Equity compensation in a growing startup
- Parental leave policy
- Mental health and wellness resources
- 401(k) or equivalent retirement plan
Compensation
Competitive salary and equity package
Work Arrangement
Remote-friendly with flexibility for hybrid or in-office collaboration
Team
Small, focused engineering team building infrastructure for machine learning workflows
About the Role
- This position focuses on strengthening the core infrastructure used to train modern machine learning models.
- You will work on reducing iteration time for ML teams and increasing the efficiency of compute usage.
What We Value
- Practical problem solving over theoretical perfection.
- Clear communication when coordinating across technical domains.
- Ownership of systems from design through production support.
Available for qualified candidates


