Responsibilities
- Profile and optimize training workloads end to end, identifying bottlenecks in GPU kernel execution, NCCL communication, and storage I/O.
- Work with systems engineering teams to enhance scheduling efficiency, collective communication performance, and kernel-level execution.
- Develop and maintain precise monitoring tools to track and visualize model FLOPs utilization (MFU), throughput metrics, and cluster availability.
- Design structured technical procedures such as incident response protocols and postmortem analyses to prevent recurring performance issues.
Work Arrangement
Hybrid


