United States | Remote (Country) | Employment | USD 125,000 - 180,000 Yearly

Nebius is hiring a Senior Hardware Support Engineer

About the Role

Nebius is hiring a Senior Hardware Support Engineer to own production hardware reliability across our large-scale, mission-critical data center environments. The role sits at the intersection of hardware engineering, operations, and vendor management. Its goal is to keep the fleet stable and to improve it continuously.

What You'll Do

  • Lead root cause analysis for complex hardware and firmware failures across production fleets.
  • Aggregate recurring failures and error patterns to identify systemic reliability issues.
  • Act as the senior escalation point for hardware-related incidents impacting availability or performance.
  • Coordinate with vendors to drive timely diagnostics, RMAs, firmware fixes, and corrective actions.
  • Partner with internal engineering teams to validate fixes and prevent recurrence.
  • Perform hardware and firmware validation before fleet-wide rollout.
  • Drive structured incident investigations using established IT problem management methodologies.
  • Support on-site teams with technical coordination during critical hardware events.
  • Improve hardware observability, failure tracking, and reporting processes.
  • Contribute to long-term hardware reliability strategy and fleet-wide stability improvements.

What We're Looking For

  • Strong hands-on expertise with server hardware in data center or large-scale production environments.
  • Proven experience performing root cause analysis of hardware and firmware failures.
  • Deep understanding of server components (CPU, memory, storage, networking, power, BMC) and failure modes.
  • Experience working directly with hardware vendors and engineering teams to resolve production issues.
  • Structured problem-solving skills using formal IT or incident management methodologies.
  • Strong analytical capabilities and ability to interpret logs, telemetry, and error patterns.
  • Experience coordinating technical activities with on-site operations teams.
  • Ability to manage multiple concurrent investigations with production impact.
  • Clear written and verbal communication skills in cross-functional environments.

Nice to Have

  • Experience in GPU-dense, AI, or high-performance computing environments.
  • Exposure to firmware lifecycle management and large-scale rollout validation.
  • Familiarity with Linux-based production systems and infrastructure tooling.
  • Experience improving fleet-wide hardware reliability metrics at scale.

Benefits & Compensation

  • Compensation: $125,000 – $180,000 per year.
  • Comprehensive medical, dental, and vision coverage.
  • 401(k) plan with company contribution.
  • Flexible paid time off.
  • Paid parental leave.
  • Professional development support.

Work Mode

This is a remote position open only to candidates located in the United States.

Nebius is an equal opportunity employer.

Required Skills
server hardware, root cause analysis, CPU, memory, storage, networking, power, BMC, firmware, incident management, vendor management, data center operations
About company
Nebius

Nebius is leading a new era in cloud computing to serve the global AI economy. It creates tools and resources for customers to solve real-world challenges without massive infrastructure costs.

Job Details
Department Engineering
Category infrastructure
Posted 14 days ago