ABOUT THE ORGANIZATION

In the coming decade, Artificial General Intelligence will emerge, and only a few organizations will achieve it. Success will go to those that compound strategic advantages fastest. These organizations will move at unprecedented speed, attract top global talent, and lead in applied research, advanced engineering, infrastructure, and large-scale deployment, with continuous model training and scaling at the core. Substantial capital will fuel the journey, building powerful economic ecosystems centered on user and customer success. Our mission is to build the technological foundation on which AI drives economic productivity and scientific progress.

TEAM COMPOSITION

We are a distributed team across Europe and North America, converging for three-day in-person sessions each month and for two extended collaborative retreats a year. Our research and production teams blend research-oriented and engineering-focused professionals, united by a commitment to system quality and sound software development practice. We believe superior engineering enables faster iteration.

ROLE OVERVIEW

You will join our pre-training division, working on distributed training and inference infrastructure for Large Language Models. This is a hands-on role that emphasizes software reliability and fault tolerance: cross-platform checkpointing, NCCL failure recovery, hardware fault detection, and high-level diagnostic tooling. Debugging down to the kernel-module level is expected, with access to extensive GPU test environments. Exceptional engineering skills are mandatory, along with a solid understanding of PyTorch, NVIDIA architectures, distributed systems, and coding best practices. Basic knowledge of LLM training is essential. We seek rapid learners who are comfortable with a steep learning curve.
MISSION STATEMENT

Develop the world's leading foundation models for source code generation.

RESPONSIBILITIES

- Diagnose and resolve hardware issues that disrupt training
- Minimize GPU downtime during both routine and critical faults
- Build tools that accelerate training recovery
- Improve checkpointing performance and reliability
- Write high-quality code in Python, Cython, C/C++, and CUDA

REQUIRED SKILLS

- Understanding of Large Language Models and Transformer fundamentals
- Deep learning principles
- Strong engineering background and programming expertise
- Linux API and kernel interaction
- Advanced algorithmic skills
- Python (NumPy, PyTorch, JAX) and C/C++ proficiency
- NCCL knowledge
- Adaptability to new tools and critical thinking
- Distributed systems: reliability mechanisms, observability, fault tolerance
- Kubernetes ecosystem

RECRUITMENT PROCESS

- Initial discussion with a Founding Engineer
- Technical interview
- Team compatibility assessment
- Final interview with a Founding Engineer

COMPENSATION PACKAGE

- Fully remote work with flexible scheduling
- 37 annual vacation/holiday days
- Comprehensive health coverage
- Equipment provision
- Professional development allowance
- Wellness support
- Inclusive organizational culture
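To give a flavor of the checkpointing reliability work described above, here is a minimal, illustrative sketch of crash-safe checkpoint writing: serialize to a temporary file, fsync, then atomically rename over the target so a mid-write failure never leaves a truncated checkpoint. This is not Poolside's code; the function names are hypothetical, plain pickle stands in for a real serializer, and production LLM training would use sharded distributed checkpoint formats (e.g. via PyTorch) rather than a single file.

```python
import os
import pickle
import tempfile


def save_checkpoint_atomic(state, path, retries=3):
    """Write `state` to `path` atomically, retrying on transient I/O errors.

    The temp file lives in the same directory as `path` so os.replace()
    is a same-filesystem rename, which is atomic on POSIX: readers see
    either the old complete checkpoint or the new one, never a partial file.
    """
    last_err = None
    for _ in range(retries):
        target_dir = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())  # force data to disk before the rename
            os.replace(tmp, path)  # atomic rename over the target
            return
        except OSError as err:
            last_err = err
        finally:
            if os.path.exists(tmp):  # clean up leftover temp file on failure
                os.remove(tmp)
    raise last_err


def load_checkpoint(path):
    """Read back a checkpoint written by save_checkpoint_atomic()."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The same write-then-rename pattern underlies most fault-tolerant checkpoint schemes; the hard parts in a real multi-node training job are coordinating it across ranks and overlapping it with compute.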
Poolside is hiring a Member of Engineering (Pre-training and inference fault tolerance)
Remote (Global) | Full-time