Responsibilities
- Architect and implement the foundational systems that power Claude and push the frontiers of what's possible with large language models.
- Own GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.
- Work at the intersection of hardware and software, implementing state-of-the-art techniques from custom kernel development to distributed system architecture.
- Span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in synchronization.
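As a flavor of the back-of-envelope analysis this kind of performance work involves, the sketch below applies a roofline-style check to decide whether an operation is compute-bound or memory-bandwidth-bound. The hardware numbers are illustrative assumptions, not the specs of any particular cluster:

```python
# Sketch: roofline-style check for whether a GPU op is compute- or
# memory-bandwidth-bound. Hardware numbers below are assumptions.
PEAK_TFLOPS = 312.0   # assumed dense FP16 peak throughput, TFLOP/s
PEAK_BW_GBS = 2039.0  # assumed HBM memory bandwidth, GB/s

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    """Compare the op's intensity to the machine's balance point."""
    ridge = (PEAK_TFLOPS * 1e12) / (PEAK_BW_GBS * 1e9)  # FLOP/byte
    return arithmetic_intensity(flops, bytes_moved) > ridge

# Example: a 4096 x 4096 x 4096 FP16 matmul.
m = n = k = 4096
flops = 2 * m * n * k                      # one multiply-add per MAC
bytes_moved = 2 * (m * k + k * n + m * n)  # FP16 = 2 bytes/element
print(is_compute_bound(flops, bytes_moved))  # prints True
```

Small or skinny matmuls fall below the balance point, which is why kernel fusion and bandwidth optimization matter as much as raw FLOPs.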
Requirements
- Have deep experience with GPU programming and optimization at scale
- Are impact-driven, passionate about delivering measurable performance breakthroughs
- Can navigate complex systems from hardware interfaces to high-level ML frameworks
- Enjoy collaborative problem-solving and pair programming
- Want to work on state-of-the-art language models with real-world impact
- Care about the societal impacts of your work
- Thrive in ambiguous environments where you define the path forward
Nice to Have
- GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization
- ML Compilers & Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators
- Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight
- Distributed Systems: NCCL, NVLink, collective communication, model parallelism
- Low-Precision: INT8/FP8 quantization, mixed-precision techniques
- Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration
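To make the low-precision item above concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization, one of the simpler schemes in that family. It is an illustrative CPU/NumPy toy, not production inference code:

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: q = round(x / scale),
    with scale chosen so the largest magnitude maps to 127."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(x)
# Round-to-nearest keeps per-element error within half a quantization step.
err = float(np.max(np.abs(dequantize(q, scale) - x)))
assert err <= scale / 2 + 1e-6
```

Production schemes layer on per-channel scales, calibration, and fused INT8/FP8 kernels, but the scale-and-round core is the same.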