Responsibilities
- Operate and maintain a large fleet of GPU servers, including H100, B200, and GB200 models, used for AI and machine learning workloads, ensuring performance, system health, and adherence to service level agreements (SLAs)
- Diagnose and resolve intricate issues involving hardware, firmware, operating systems, and applications within GPU clusters, collaborating with vendors and engineering teams to address recurring problems
- Create and maintain automated scripts to enable scalable server provisioning, configuration, monitoring, and issue remediation
- Design and enhance tools for GPU health monitoring, performance analysis, driver verification, and self-healing recovery processes
- Carry out server setup, OS installation, firmware upgrades, and configuration using automated platforms, overseeing full lifecycle operations from deployment to retirement
- Participate in a 24x7 on-call rotation to respond to critical system incidents, working with infrastructure, networking, and software teams to restore services
- Lead incident post-mortems, identify root causes, and implement improvements to boost automation, system reliability, monitoring, and operational effectiveness
Benefits
- Competitive compensation combining salary and equity
- Retirement or pension plan aligned with regional standards
- Comprehensive health, dental, and vision coverage
- Generous paid time off policy consistent with local practices
Compensation
Competitive total compensation package including salary and equity
Other
- Participation in a 24x7 on-call rotation is required
- The annual base salary range for this role is $200,000 to $300,000, adjusted based on experience, skills, qualifications, and geographic location
- Total compensation may include equity in the form of stock options
- A confirmation email will be sent upon successful application submission. If no confirmation is received, contact careers@fluidstack.io with your resume/CV, the position applied for, and the application date