Vultr is looking for an AI Cluster Architect to design large-scale GPU clusters within strict power and infrastructure limits. You will focus on power-aware design to maximize GPU density, evaluate networking fabrics, and balance customer requirements against facility constraints.
What You'll Do
- Architect large-scale GPU clusters within fixed site power budgets to maximize GPU density while reserving necessary headroom for compute services, storage, and networking.
- Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, and storage) against facility limits.
- Evaluate tradeoffs across network fabric architectures (InfiniBand, RoCE, Spectrum-X) and across multi-plane, 2-tier/3-tier, and rail-optimized topologies.
- Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
- Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
- Develop power-aware cluster configuration templates and capacity-planning models that scale across sites with varying constraints.
- Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
- Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
- Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).
What We're Looking For
- 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
- Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
- Strong familiarity with InfiniBand, RoCE, and Spectrum-X networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
- Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
- Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
- Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
- Experience balancing customer-driven requirements for compute, storage, and service density against overall GPU count.
- Strong documentation, communication, and cross-functional collaboration skills.
Technical Stack
- GPU clusters, InfiniBand, RoCE, Spectrum-X, PCIe/NVLink/NVSwitch/ROCm
Benefits & Compensation
- Compensation: $165,000 - $185,000
- Excellent Medical Benefits with 100% company-paid premiums for the employee-only plan, plus 100% company-paid dental & vision premiums
- 401(k) plan that matches 100% up to 4% with immediate vesting
- Professional Development Reimbursement of $2,500 each year
- 11 Holidays + Paid Time Off Accrual + Rollover Plan + your birthday off
- Increased PTO at 3-year & 10-year anniversaries + 1-month paid sabbatical every 5 years + Anniversary Bonus each year
- $500 first year remote office setup + $400 each following year for new equipment
- Internet reimbursement up to $75 per month
- Gym membership reimbursement up to $50 per month
- Company-paid Wellable subscription
We are an equal opportunity employer and are committed to creating an inclusive environment for all employees. We welcome applications from individuals of all backgrounds and experiences.