As a Senior HPC and AI Networking Performance Research and Analysis Engineer, you will investigate and improve the performance of AI workloads running on large-scale GPU and CPU systems. Your primary focus will be distributed deep learning applications, particularly large language model training and inference, where communication patterns and network efficiency are critical.
Key Responsibilities
- Conduct in-depth profiling and analysis of AI workloads to uncover performance bottlenecks, especially in communication and data transfer layers
- Design and execute benchmarking strategies to evaluate system behavior under real-world conditions
- Collaborate with hardware and software teams to assess performance across CPUs, GPUs, host channel adapters, and network switches
- Develop and apply simulation models, performance tools, and analytical methods to diagnose system limitations
- Investigate low-level system interactions to determine root causes of performance issues
- Establish performance baselines and define testing strategies for emerging technologies
- Guide optimization efforts to achieve maximum system throughput and efficiency
Qualifications
Applicants should hold a Bachelor's degree in Computer Science or Software Engineering and bring at least six years of hands-on experience in high-performance networking. Essential skills include deep familiarity with RDMA, MPI, NCCL, and networking protocols such as RoCE. Proficiency in Python, Bash, and C is required, along with strong Linux system knowledge.
Experience with NVIDIA GPUs, CUDA libraries, and deep learning frameworks such as TensorFlow or PyTorch is also required, as is a demonstrated ability in performance analysis, problem solving, and cross-team collaboration.
Preferred Background
- Proven track record in benchmarking AI workloads, especially for distributed LLM training
- Strong understanding of CUDA and NCCL internals
- Comprehensive knowledge of system architecture, including CPUs (Intel, AMD, ARM), GPUs, memory, and PCI subsystems
- Familiarity with congestion control mechanisms in high-speed networks
