Luma AI is looking for a Research Scientist / Engineer – Multimodal Capabilities to unlock advanced behaviors in our foundation models. You'll join the Multimodal Capabilities team to conduct research that combines vision, audio, and language to answer fundamental questions about multimodal modeling.
What You'll Do
- Collaborate with the Foundation Models team to identify capability gaps and research solutions.
- Design datasets, experiments, and methodologies to systematically improve model capabilities across vision, audio, and language.
- Develop evaluation frameworks and benchmarking approaches for multimodal AI capabilities.
- Create prototypes and demonstrations that showcase new multimodal capabilities.
What We're Looking For
- Strong programming skills in Python and hands-on experience with PyTorch.
- Experience with multimodal data processing pipelines and large-scale dataset curation.
- Understanding of computer vision, audio processing, and/or natural language processing techniques.
Nice to Have
- Expertise working with interleaved multimodal data.
- Hands-on experience with Vision Language Models, Audio Language Models, or generative video models.
Technical Stack
- Python, PyTorch
Team & Environment
You will be part of the Multimodal Capabilities team and collaborate closely with the Foundation Models team.
Benefits & Compensation
- Salary: $200,000 - $300,000/yr + competitive equity in the form of stock options.
- A comprehensive benefits plan.
Luma AI is an equal opportunity employer.