Vultr is looking for an AI Cluster Architect to design large-scale GPU clusters within strict power and infrastructure limits. You will focus on power-aware design to maximize GPU density, evaluate networking fabrics, and balance customer requirements with facility constraints.

What You'll Do

Architect large-scale GPU clusters within fixed site power budgets to maximize GPU density while reserving necessary headroom for compute services, storage, and networking.
Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, and storage) against facility limits.
Evaluate tradeoffs across fabric networking architectures (InfiniBand, RoCE, Spectrum-X) and across multi-plane, 2-tier/3-tier, and rail-optimized topologies.
Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
Develop power-aware cluster configuration templates and capacity-planning models that scale across sites with varying constraints.
Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).

What We're Looking For

7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
Strong familiarity with InfiniBand, RoCE, and Spectrum-X networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
Experience balancing customer-driven requirements for compute, storage, and service density against overall GPU count.
Strong documentation, communication, and cross-functional collaboration skills.

Technical Stack

GPU clusters, InfiniBand, RoCE, Spectrum-X, PCIe/NVLink/NVSwitch/ROCm

Benefits & Compensation

Compensation: $165,000 - $185,000
Excellent Medical Benefits with 100% company-paid premiums for the employee-only plan, plus 100% company-paid dental & vision premiums
401(k) plan that matches 100% up to 4% with immediate vesting
Professional Development Reimbursement of $2,500 each year
11 Holidays + Paid Time Off Accrual + Rollover Plan + take your birthday off
Increased PTO at 3-year & 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
$500 first year remote office setup + $400 each following year for new equipment
Internet reimbursement up to $75 per month
Gym membership reimbursement up to $50 per month
Company-paid Wellable subscription

We are an equal opportunity employer and are committed to creating an inclusive environment for all employees. We welcome applications from individuals of all backgrounds and experiences.
