Responsibilities
- Design and maintain training systems that can process and learn from petabyte-scale multimodal datasets (e.g., video and point cloud data). This includes ensuring data is efficiently loaded, distributed, and processed across large GPU clusters.
- Identify and resolve bottlenecks in the training pipeline, including data loading, preprocessing, model computation, and inter-node communication, to maximize GPU utilization and reduce training time.
- Work with the ML team to develop and refine neural network architectures suitable for autonomy tasks, particularly those handling high-dimensional and sequential sensor data.
- Create and adjust loss functions and training strategies that help the model learn effectively from complex multimodal inputs and improve autonomy performance.
- Configure, monitor, and maintain large-scale distributed training jobs across multiple machines and GPUs, ensuring stability, fault tolerance, and efficient resource usage.
- Implement scalable systems to preprocess, transform, and augment large robotics datasets so that they are suitable for model training.
- Work closely with ML scientists and other engineers to integrate new models, experiments, and training approaches into the production training pipeline.
- Analyze training metrics, model outputs, and experiment logs to assess model performance and guide improvements in architecture, data usage, or training strategies.
- Develop tools and workflows that allow teams to run experiments, track results, and iterate quickly on new model ideas or training approaches.
Requirements
- Master’s or PhD in Computer Science, Robotics, Electrical Engineering, Machine Learning, or a closely related technical discipline.
- Minimum of 5 years of professional experience developing, training, and deploying machine learning models in production environments.
- Hands-on experience training machine learning models across multiple GPUs or compute nodes, including familiarity with distributed training frameworks and handling large datasets.
- Strong programming skills in Python for implementing machine learning models, data pipelines, and training workflows.
- Solid knowledge of core machine learning concepts such as neural networks, optimization algorithms, loss functions, model evaluation, and training methodologies.
Nice to Have
- Experience identifying and resolving training bottlenecks related to compute utilization, memory usage, and data throughput in machine learning systems.
- Experience training machine learning models on robotics or autonomous driving datasets involving multimodal sensor inputs such as camera video, LiDAR point clouds, radar, or telemetry data.
- Experience developing models that combine multiple data modalities (e.g., images, point clouds, and structured sensor data) into a unified learning system.
- Peer-reviewed publications or significant research contributions in machine learning, robotics, or related areas.
Work Arrangement
Hybrid
Additional Information
- We are open to qualified candidates working remotely in Canada.