About the Role

You will develop and refine model training infrastructure, working closely with ML practitioners to improve the efficiency, scalability, and usability of core training systems.

Responsibilities

Design and implement scalable systems for training deep learning models
Collaborate with machine learning teams to understand training workflow challenges
Optimize training pipelines for speed, cost, and reliability
Build tools that simplify hyperparameter tuning and experiment tracking
Improve distributed training performance across GPU clusters
Develop abstractions that make training accessible to non-experts
Diagnose and resolve bottlenecks in data loading and model convergence
Ensure training systems integrate smoothly with deployment infrastructure
Maintain high standards for code quality and system observability
Contribute to architectural decisions around model lifecycle management
Evaluate new frameworks and libraries for training efficiency gains
Support reproducibility and versioning of training runs
Work on fault tolerance and automatic recovery in long-running jobs
Help define best practices for training large-scale models
Bridge gaps between research prototypes and production-ready systems

Nice to Have

Experience with PyTorch or TensorFlow at scale
Knowledge of mixed-precision training and memory optimization
Background in high-performance computing or systems programming
Prior work on ML platforms or MLOps tooling
Familiarity with data parallelism and model parallelism strategies
Contributions to open-source machine learning projects
Understanding of model convergence monitoring and debugging

Benefits

Comprehensive health, dental, and vision insurance
Flexible paid time off policy
Home office setup allowance
Ongoing professional development budget
Equity compensation in a growing startup
Parental leave policy
Mental health and wellness resources
401(k) or equivalent retirement plan

Compensation

Competitive salary and equity package

Work Arrangement

Remote-friendly with flexibility for hybrid or in-office collaboration

Team

Small, focused engineering team building infrastructure for machine learning workflows

About the Role

This position focuses on strengthening the core infrastructure used to train modern machine learning models.
You will work on reducing iteration time for ML teams and increasing the efficiency of compute usage.

What We Value

Practical problem solving over theoretical perfection
Clear communication when coordinating across technical domains
Ownership of systems from design through production support

Available for qualified candidates

Baseten is hiring a Senior Software Engineer - Model Training
