About the Role
This role builds and improves the infrastructure behind machine learning workflows so that models are trained, deployed, and monitored efficiently in production.
Responsibilities
- Design and implement scalable ML pipelines for training and inference
- Develop tools to automate model deployment and rollback processes
- Monitor system performance and model behavior in production
- Collaborate with data scientists to integrate models into production systems
- Improve reliability and efficiency of ML infrastructure
- Create versioning systems for models and datasets
- Optimize resource usage for training and serving workloads
- Ensure reproducibility across ML workflows
- Support continuous integration and delivery for ML systems
- Troubleshoot issues in model serving environments
- Maintain documentation for ML operations processes
- Enforce security and compliance standards in ML systems
- Work with infrastructure teams to manage cloud resources
- Implement monitoring and alerting for model performance
- Contribute to internal frameworks for experiment tracking
- Support model validation and testing procedures
- Assist in scaling systems for high-throughput inference
- Evaluate new tools and technologies for MLOps
- Promote best practices in machine learning engineering
- Ensure seamless collaboration between research and engineering teams
Nice to Have
- Master’s degree in computer science or a related field
- Experience with large-scale distributed systems
- Contributions to open-source ML projects
- Knowledge of real-time data processing frameworks
- Background in automated testing for ML systems
- Experience with feature store implementations
- Familiarity with regulatory requirements for ML systems
- Prior work in safety-critical or high-assurance domains
Compensation
Competitive salary and benefits package
Work Arrangement
Remote position with flexible hours
Team
Collaborative team focused on scalable machine learning systems
About the Team
This team builds foundational systems that enable reliable and scalable machine learning in production. Members work closely with researchers and engineers to bridge the gap between experimentation and deployment.
What We Value
We prioritize technical excellence, clear communication, and a collaborative mindset. Candidates should demonstrate a strong sense of ownership and a drive to solve complex infrastructure challenges.
Available for qualified candidates
