Responsibilities
- Develop and maintain a distributed GPU scheduling system for on-demand cluster provisioning.
- Build a global management plane to oversee data center resources, including compute, networking, and storage.
- Create new customer-facing cloud services that deliver powerful AI capabilities for enterprise users.
- Design and implement core backend systems that underpin the cloud platform’s functionality.
- Evaluate and enhance the resilience and performance of distributed systems, APIs, databases, and infrastructure components.
- Collaborate with product teams to translate business requirements into technical solutions.
- Write clean, tested, and maintainable code and infrastructure-as-code for current and new systems.
- Lead code and design reviews, write technical documentation, and establish testing practices that improve system reliability.
- Join an on-call rotation to respond to and resolve critical production incidents.
Work Arrangement
Remote (Worldwide)