Nebius is hiring a Senior Hardware Support Engineer to own production hardware reliability across our large-scale, mission-critical data center environments. This role works at the intersection of hardware engineering, operations, and vendor management to ensure fleet stability and continuous improvement.
What You'll Do
- Lead root cause analysis for complex hardware and firmware failures across production fleets.
- Aggregate recurring problems and error patterns to identify systemic reliability issues.
- Act as the senior escalation point for hardware-related incidents impacting availability or performance.
- Coordinate with vendors to drive timely diagnostics, RMAs, firmware fixes, and corrective actions.
- Partner with internal engineering teams to validate fixes and prevent recurrence.
- Perform hardware and firmware validation before fleet-wide rollout.
- Drive structured incident investigations using established IT problem management methodologies.
- Support on-site teams with technical coordination during critical hardware events.
- Improve hardware observability, failure tracking, and reporting processes.
- Contribute to long-term hardware reliability strategy and fleet-wide stability improvements.
What We're Looking For
- Strong hands-on expertise with server hardware in data center or large-scale production environments.
- Proven experience performing root cause analysis of hardware and firmware failures.
- Deep understanding of server components (CPU, memory, storage, networking, power, BMC) and failure modes.
- Experience working directly with hardware vendors and engineering teams to resolve production issues.
- Structured problem-solving skills using formal IT or incident management methodologies.
- Strong analytical skills and the ability to interpret logs, telemetry, and error patterns.
- Experience coordinating technical activities with on-site operations teams.
- Ability to manage multiple concurrent investigations with production impact.
- Clear written and verbal communication skills in cross-functional environments.
Nice to Have
- Experience in GPU-dense, AI, or high-performance computing environments.
- Exposure to firmware lifecycle management and large-scale rollout validation.
- Familiarity with Linux-based production systems and infrastructure tooling.
- Experience improving fleet-wide hardware reliability metrics at scale.
Benefits & Compensation
- Compensation: $125,000 – $180,000 per year.
- Comprehensive medical, dental, and vision coverage.
- 401(k) plan with company contribution.
- Flexible paid time off.
- Paid parental leave.
- Professional development support.
Work Mode
This is a local (in-country) position based in the United States.
Nebius is an equal opportunity employer.





