Full-time

Vultr is hiring an AI Cluster Architect

About the Role

Vultr is looking for an AI Cluster Architect to design large-scale GPU clusters within strict power and infrastructure limits. You will focus on power-aware design to maximize GPU density, evaluate networking fabrics, and balance customer requirements against facility constraints.

What You'll Do

  • Architect large-scale GPU clusters within fixed site power budgets to maximize GPU density while reserving necessary headroom for compute services, storage, and networking.
  • Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, and storage) against facility limits.
  • Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, SpectrumX) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
  • Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
  • Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
  • Develop power-aware cluster configuration templates and capacity-planning models that scale across sites with varying constraints.
  • Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
  • Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
  • Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).

What We're Looking For

  • 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
  • Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
  • Strong familiarity with InfiniBand, RoCE, and SpectrumX networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
  • Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
  • Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
  • Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
  • Experience balancing customer-driven requirements for compute, storage, and service density against overall GPU count.
  • Strong documentation, communication, and cross-functional collaboration skills.

Technical Stack

  • GPU clusters, InfiniBand, RoCE, SpectrumX, PCIe/NVLink/NVSwitch/ROCm

Benefits & Compensation

  • Compensation: $165,000 - $185,000
  • Excellent Medical Benefits with 100% company-paid premiums for the employee-only plan, plus 100% company-paid dental & vision premiums
  • 401(k) plan that matches 100% up to 4% with immediate vesting
  • Professional Development Reimbursement of $2,500 each year
  • 11 Holidays + Paid Time Off Accrual + Rollover Plan + take your birthday off
  • Increased PTO at 3-year & 10-year anniversaries + 1-month paid sabbatical every 5 years + Anniversary Bonus each year
  • $500 first year remote office setup + $400 each following year for new equipment
  • Internet reimbursement up to $75 per month
  • Gym membership reimbursement up to $50 per month
  • Company-paid Wellable subscription

We are an equal opportunity employer and are committed to creating an inclusive environment for all employees. We welcome applications from individuals of all backgrounds and experiences.

Required Skills
GPU clusters, InfiniBand, RoCE, SpectrumX, PCIe, NVLink, NVSwitch, ROCm, Kubernetes, Linux, Python, Bash, Terraform, Ansible, CI/CD
About company
Vultr

Vultr makes high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators worldwide. It is the world’s largest privately-held cloud infrastructure company with 32 global data center locations, trusted by hundreds of thousands of customers across 185 countries.

Job Details
Category: infrastructure
Posted: 2 months ago