Responsibilities
- Take a central role in creating cutting-edge large language models and foundational AI systems within a lean, high-impact team
- Help develop robust, scalable, and maintainable software infrastructure
- Use the resulting stack to train top-tier models that are released as open source and built into new products
- Architect, build, and refine core components across the entire model development lifecycle, including data preparation, evaluation, pre-training, and post-training phases
- Identify and fix issues related to GPU or kernel performance, memory and storage constraints, and failures across multiple compute nodes
- Work closely with team members to troubleshoot training instabilities and associated technical challenges
- Develop essential tooling, monitoring systems, and observability frameworks to support large-scale LLM development
- Provide technical leadership as training systems grow in scale and complexity
- Share insights and help shape strategic decisions across the organization
Work Arrangement
Remote (Worldwide)
Team
A small team that blends research and engineering expertise and focuses on high-impact contributions
