Responsibilities
- Develop and enhance a scalable on-demand GPU workstation platform using lightweight containerization or virtualization technologies
- Implement features including scheduling, reservations, registration, image management, storage mounting, SSH access with single sign-on, and intuitive developer access workflows
- Automate the configuration of cluster namespaces with dynamic allocation of CPU, GPU, memory, and storage resources
- Support tiered capacity allocation models, managed by administrators through role-based access control
- Automate data workflows for storage import, export, and archiving in response to changing resource allocations
- Establish monitoring, alerting, and automated incident reporting for large-scale cluster infrastructure
- Enhance integrations across version control, continuous integration and delivery pipelines, package distribution, and GPU-enabled development environments
- Develop automation tools, scripts, and agentic systems to streamline infrastructure operations and daily research processes
Work Arrangement
Remote