About the Role
Build and maintain systems that empower researchers to iterate quickly on machine learning models by delivering robust, efficient tooling and infrastructure.
Responsibilities
- Develop core infrastructure for training and evaluating machine learning models
- Create tools that streamline experimental workflows for research teams
- Optimize performance and scalability of distributed training systems
- Collaborate with researchers to understand system requirements
- Design abstractions that simplify complex ML workflows
- Improve debugging and monitoring capabilities for training jobs
- Maintain reliability and efficiency across compute clusters
- Implement versioning and reproducibility features for experiments
- Support integration of new hardware into existing pipelines
- Troubleshoot low-level system issues affecting model training
- Ensure compatibility across software and hardware configurations
- Contribute to documentation and internal tooling standards
- Evaluate new technologies for potential adoption in the research stack
- Automate repetitive tasks in the research development cycle
- Work closely with software engineers to align tooling with research needs
- Enhance data handling pipelines for faster model input
- Build interfaces between research code and production systems
- Monitor system usage patterns to guide infrastructure improvements
- Support secure access to sensitive model assets
- Refactor legacy systems to improve maintainability
- Develop APIs for internal research tools
- Instrument systems for performance measurement and analysis
- Assist in capacity planning for compute resources
- Participate in code reviews and system design discussions
- Ensure tools meet evolving research demands
Nice to Have
- Advanced degree in computer science or related field
- Direct experience with large-scale model training
- Contributions to open-source ML projects
- Background in high-performance computing
- Experience with reinforcement learning systems
- Knowledge of formal verification methods
- Familiarity with safety-critical software development
- Experience working with experimental programming languages
- Research publications in systems or ML conferences
- Experience in startup or research lab environments
Compensation
Competitive salary and benefits package offered
Work Arrangement
Hybrid or remote work options available
Team
Part of a research-focused engineering team building advanced AI systems
Research Culture
- Work in an environment that values rigorous inquiry and methodical development
- Engage with interdisciplinary teams exploring AI safety and capabilities
- Contribute to long-term research goals with real-world impact
Technology Stack
- Use modern ML frameworks and custom tooling for model development
- Work with GPU clusters and distributed training infrastructure
- Leverage internal systems for experiment tracking and analysis
Visa Sponsorship
Visa sponsorship may be available for qualified candidates