Responsibilities
- Lead strategic efforts in managing large-scale high-performance computing systems, covering compute, networking, and storage deployment
- Enhance the ecosystem for GPU-powered computing by designing and refining scalable automation tools
- Design, deploy, and manage heterogeneous AI and machine learning clusters across on-premises and cloud environments
- Foster strong relationships with internal teams and users to ensure cluster reliability and adaptability to changing requirements
- Assist research teams with workload execution, including performance evaluation and optimization
- Perform in-depth root cause investigations and recommend effective remediation steps
- Identify potential system issues proactively and implement preventive solutions
Other
- The company supports a diverse and inclusive workplace and is an equal opportunity employer
- Employment decisions are made without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, veteran status, disability, or other legally protected characteristics


