Responsibilities
- Develop and maintain reliable, secure, and scalable backend services and operators deployed in data centers, automating hardware tasks such as InfiniBand configuration, parallel storage setup, and virtual machine provisioning.
- Architect and implement the infrastructure-as-a-service layer for a new data center featuring thousands of GPUs.
- Contribute to the development of a global, multi-exabyte object storage system optimized for large-scale pretraining datasets.
- Design and deploy advanced monitoring and observability systems with automated node lifecycle management to support resilient distributed training.
- Conduct architectural research and design for decentralized AI computing workloads.
- Collaborate on the development of the core open-source platform powering AI infrastructure.
- Develop internal and external tools, services, and comprehensive technical documentation for developers.
- Build testing frameworks to ensure system robustness and fault tolerance under failure conditions.
Work Arrangement
Hybrid, with two days per week in the Amsterdam office.