Baseten is hiring a Senior Software Engineer - Model Training

About the Role

In this role, you will develop and refine model training infrastructure, working closely with ML practitioners to improve the efficiency, scalability, and usability of core training systems.

Responsibilities

  • Design and implement scalable systems for training deep learning models
  • Collaborate with machine learning teams to understand training workflow challenges
  • Optimize training pipelines for speed, cost, and reliability
  • Build tools that simplify hyperparameter tuning and experiment tracking
  • Improve distributed training performance across GPU clusters
  • Develop abstractions that make training accessible to non-experts
  • Diagnose and resolve bottlenecks in data loading and model convergence
  • Ensure training systems integrate smoothly with deployment infrastructure
  • Maintain high standards for code quality and system observability
  • Contribute to architectural decisions around model lifecycle management
  • Evaluate new frameworks and libraries for training efficiency gains
  • Support reproducibility and versioning of training runs
  • Work on fault tolerance and automatic recovery in long-running jobs
  • Help define best practices for training large-scale models
  • Bridge gaps between research prototypes and production-ready systems

Nice to Have

  • Experience with PyTorch or TensorFlow at scale
  • Knowledge of mixed-precision training and memory optimization
  • Background in high-performance computing or systems programming
  • Prior work on ML platforms or MLOps tooling
  • Familiarity with data parallelism and model parallelism strategies
  • Contributions to open-source machine learning projects
  • Understanding of model convergence monitoring and debugging

Benefits

  • Comprehensive health, dental, and vision insurance
  • Flexible paid time off policy
  • Home office setup allowance
  • Ongoing professional development budget
  • Equity compensation in a growing startup
  • Parental leave policy
  • Mental health and wellness resources
  • 401(k) or equivalent retirement plan

Compensation

Competitive salary and equity package

Work Arrangement

Remote-friendly with flexibility for hybrid or in-office collaboration

Team

Small, focused engineering team building infrastructure for machine learning workflows

About the Role

  • This position focuses on strengthening the core infrastructure used to train modern machine learning models.
  • You will work on reducing iteration time for ML teams and increasing the efficiency of compute usage.

What We Value

  • Practical problem solving over theoretical perfection.
  • Clear communication when coordinating across technical domains.
  • Ownership of systems from design through production support.

Available for qualified candidates

Required Skills

  • PyTorch
  • Kubernetes
  • Distributed Systems
  • Machine Learning

About company
Baseten
Baseten provides the infrastructure, tooling, and expertise needed to bring great AI products to market - fast. We’re trusted by leading AI-driven innovators to deliver industry-leading performance, security, and reliability for their mission-critical workloads.
Job Details
Posted 10 months ago