Luma AI is hiring a Research Scientist / Engineer – Performance Optimization to maximize the efficiency and performance of our AI models. You’ll work closely with both research and engineering teams to ensure cutting-edge multimodal models can be trained efficiently and deployed at scale while maintaining the highest quality standards.
What You'll Do
- Profile and optimize GPU/CPU/Accelerator code for maximum utilization and minimal latency.
- Write high-performance PyTorch, Triton, and CUDA code, falling back to custom PyTorch operations when necessary.
- Develop fused kernels that exploit tensor cores and other modern hardware features across platforms.
- Optimize model architectures and implementations for distributed multi-node production deployment.
- Build performance monitoring and analysis tools and automation.
- Research and implement cutting-edge optimization techniques for transformer models.
What We're Looking For
- Expert-level proficiency in Triton/CUDA programming and GPU optimization.
- Strong PyTorch skills.
- Experience with PyTorch kernel development and custom operations.
- Proficiency with profiling tools such as NVIDIA Nsight and the PyTorch profiler, as well as building custom tooling.
- Deep understanding of transformer architectures and attention mechanisms.
Nice to Have
- Experience with compilers/exporters such as torch.compile, TensorRT, ONNX, or XLA.
- Experience optimizing inference workloads for latency and throughput.
- Experience with Triton compiler and kernel fusion techniques.
- Knowledge of warp-level intrinsics and advanced CUDA optimization.
- Background in compiler optimization or hardware-software co-design.
Technical Stack
- PyTorch, Triton, CUDA
- NVIDIA Nsight, PyTorch profiler
- torch.compile, TensorRT, ONNX, XLA
Benefits & Compensation
- Salary: $180,000 - $250,000/yr
- Competitive equity packages in the form of stock options.
- A comprehensive benefits plan.
Luma AI is an equal opportunity employer.