San Francisco, California · Remote (US) · Full-time

Lavendo is hiring an HPC Solutions Architect

About the Role

Lavendo is looking for an HPC Solutions Architect to design and optimize high-performance computing and GPU clusters for AI training, large-scale simulations, and data-intensive workloads in cloud environments. This role involves deep technical collaboration with customers, partners like NVIDIA, and internal engineering teams to build scalable, efficient infrastructure that performs like a supercomputer.

What You'll Do

  • Design and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm
  • Think about node types, GPU topology, queues, partitions, and failure modes when architecting clusters
  • Integrate NVIDIA Hopper and Blackwell-class GPUs with NVLink/NVSwitch and InfiniBand/RoCE
  • Ensure hardware layout matches the communication patterns of running workloads
  • Deploy and manage GPU Operator and Network Operator for consistent, automated driver, CUDA, firmware, and networking management across large GPU fleets
  • Design and validate cloud-native HPC environments that deliver low latency, high bandwidth, and predictable scheduling
  • Analyze utilization, preemption, fragmentation, and optimize performance in cloud HPC environments
  • Define and document reference architectures for AI model training, data pipelines, and MLOps, including observability and CI/CD
  • Set the standard for what 'good' AI/HPC architecture looks like for customers
  • Collaborate with NVIDIA and other partners to evaluate new GPU generations, interconnects, and software stacks
  • Help determine which technologies are ready for production use and under what conditions
  • Benchmark performance and identify bottlenecks across compute, network, and storage
  • Recommend and implement concrete performance improvements
  • Lead design sessions, architecture reviews, and operational excellence check-ins with customers
  • Translate customer issues (e.g., job timeouts) into technical changes in topology and scheduler configuration
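The scheduler-configuration work above can look like the following slurm.conf fragment. This is a minimal sketch assuming a Slurm cluster with NVIDIA GPUs declared as GRES; the node names, partition names, and limits are illustrative, not an actual production configuration.

```
# Illustrative slurm.conf fragment (hypothetical names and values).
# Separate partitions keep long multi-node training jobs from being
# starved behind short interactive work.
GresTypes=gpu
NodeName=gpu[001-064] Gres=gpu:h100:8 CPUs=96 RealMemory=1024000

# Week-scale MaxTime on `train` addresses job-timeout complaints for
# long training runs; interactive jobs get a short, default partition.
PartitionName=train Nodes=gpu[001-048] MaxTime=7-00:00:00 State=UP
PartitionName=interactive Nodes=gpu[049-064] MaxTime=04:00:00 Default=YES State=UP
```

In practice a topology.conf describing the NVLink/InfiniBand layout would accompany this, so the scheduler can place multi-node jobs on nodes that share a fabric segment.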

What We're Looking For

  • Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus)
  • 3+ years of experience building or running HPC or large GPU clusters—on-prem, cloud, or hybrid
  • Ownership of outcomes, not just job submission
  • Strong Linux background
  • Experience with Kubernetes and container runtimes (containerd, CRI-O, Docker) in real environments
  • Experience integrating CI/CD into infrastructure workflows
  • Solid understanding of HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch
  • Understanding of why topology and fabric design matter, and experience diagnosing misconfigured systems
  • Experience with storage and I/O for large workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage, or similar
  • Focus on throughput, latency, and contention in storage systems
  • Proficiency with Terraform, Ansible, Helm, and GitOps-style workflows
  • Ability to keep configurations reproducible and maintainable
  • Good scripting skills in Python or Bash
  • Ability to automate checks, integrate systems, and prototype tooling
  • Clear written and verbal communication skills
  • Ability to lead design reviews without losing the audience
  • Ability to communicate effectively with both engineers and non-technical stakeholders
  • Legal authorization to work in the U.S. on a full-time basis without visa sponsorship
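As a flavor of the utilization and fragmentation analysis mentioned above, here is a small Python sketch. The node data and the "fragmentation" definition are hypothetical, for illustration only; a real pipeline would pull allocation state from Slurm (e.g. sinfo/squeue output) or DCGM rather than a static list.

```python
# Sketch: simple utilization and fragmentation metrics for a GPU fleet,
# computed from per-node allocation counts. All inputs are hypothetical.

def fleet_metrics(nodes):
    """nodes: list of (allocated_gpus, total_gpus) tuples, one per node.

    Returns (utilization, fragmentation), both in [0, 1].
    """
    total = sum(t for _, t in nodes)
    used = sum(a for a, _ in nodes)
    utilization = used / total if total else 0.0

    # Treat a node as "fragmented" when it is partially allocated: its
    # free GPUs cannot serve a job that needs a whole node.
    fragmented_free = sum(t - a for a, t in nodes if 0 < a < t)
    total_free = total - used
    fragmentation = fragmented_free / total_free if total_free else 0.0
    return utilization, fragmentation

if __name__ == "__main__":
    # Three 8-GPU nodes: one fully allocated, one half-allocated, one idle.
    util, frag = fleet_metrics([(8, 8), (4, 8), (0, 8)])
    print(f"utilization={util:.2f} fragmentation={frag:.2f}")
```

Here half the fleet is busy (utilization 0.50), but a third of the free GPUs sit on a partially-allocated node (fragmentation 0.33), which is the kind of signal that feeds back into partition and bin-packing decisions.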

Nice to Have

  • Hands-on experience with the NVIDIA ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight
  • Experience managing CUDA stacks across production GPU clusters
  • Experience with MLflow, Kubeflow, NeMo, or similar AI/ML pipeline tools
  • Experience with distributed training frameworks like PyTorch DDP, DeepSpeed, or Megatron
  • Real cluster experience with Slurm, LSF, or PBS (not just lab environments)
  • Experience with multi-tenant GPU environments or 'AI training farms'
  • Familiarity with observability stacks for HPC: Prometheus, DCGM Exporter, Grafana, NGC tools
  • Open-source contributions in HPC, CUDA, or Kubernetes (strong plus)

Technical Stack

Kubernetes, Slurm, NVIDIA Hopper GPUs, NVIDIA Blackwell GPUs, NVLink, NVSwitch, InfiniBand, RoCE, GPU Operator, Network Operator, CUDA, Terraform, Ansible, Helm, GitOps, Python, Bash, containerd, CRI-O, Docker, Ceph, Lustre, NFS, GPUDirect Storage, Prometheus, DCGM Exporter, Grafana, NGC, MLflow, Kubeflow, NeMo, PyTorch DDP, DeepSpeed, Megatron

Team & Environment

Engineering-driven team with low bureaucracy and high ownership. We focus on solving hard infrastructure problems and seeing our work impact real customer workloads. Our culture values technical excellence, low ego, and doing things properly at scale.

Benefits & Compensation

  • 100% employer-paid medical, dental, and vision insurance for employee and family
  • 4% 401(k) match with immediate vesting
  • Company-paid short-term and long-term disability insurance
  • Company-paid life insurance
  • 20 weeks paid parental leave for primary caregivers
  • 12 weeks paid parental leave for secondary caregivers
  • Remote-first work model within the US
  • Home office support including mobile and internet stipend
  • Access to top-tier hardware: H200, B200, GB200-class GPUs, NVLink/NVSwitch, InfiniBand/RoCE

Compensation: $225,000–$315,000 OTE with equity included.

Work Mode

Remote-first, work from anywhere in the United States.

We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other characteristic protected by applicable federal, state, or local law.

Required Skills
Kubernetes, Slurm, NVIDIA Hopper GPUs, NVIDIA Blackwell GPUs, NVLink, NVSwitch, InfiniBand, RoCE, GPU Operator, Network Operator, Linux, containerd, CRI-O, Docker, HPC
About company
Lavendo
Building AI-centric cloud infrastructure that combines large GPU clusters, high-speed networks, and cloud-native tooling into a platform used by enterprises, startups, and research teams. The goal is to enable serious AI and simulation workloads without requiring customers to build their own supercomputers.
Job Details
Category: Infrastructure
Posted: 2 months ago