San Francisco, California · Remote (US) · Full-time

Lavendo is hiring an HPC Solutions Architect

About the Role

Lavendo is looking for an HPC Solutions Architect to design and optimize high-performance computing and GPU clusters for AI training, large-scale simulations, and data-intensive workloads in cloud environments. This role involves deep technical collaboration with customers, partners like NVIDIA, and internal engineering teams to build scalable, efficient infrastructure that performs like a supercomputer.

What You'll Do

  • Design and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm
  • Think about node types, GPU topology, queues, partitions, and failure modes when architecting clusters
  • Integrate NVIDIA Hopper and Blackwell-class GPUs with NVLink/NVSwitch and InfiniBand/RoCE
  • Ensure hardware layout matches the communication patterns of running workloads
  • Deploy and manage GPU Operator and Network Operator for consistent, automated driver, CUDA, firmware, and networking management across large GPU fleets
  • Design and validate cloud-native HPC environments that deliver low latency, high bandwidth, and predictable scheduling
  • Analyze utilization, preemption, fragmentation, and optimize performance in cloud HPC environments
  • Define and document reference architectures for AI model training, data pipelines, and MLOps, including observability and CI/CD
  • Set the standard for what 'good' AI/HPC architecture looks like for customers
  • Collaborate with NVIDIA and other partners to evaluate new GPU generations, interconnects, and software stacks
  • Help determine which technologies are ready for production use and under what conditions
  • Benchmark performance and identify bottlenecks across compute, network, and storage
  • Recommend and implement concrete performance improvements
  • Lead design sessions, architecture reviews, and operational excellence check-ins with customers
  • Translate customer issues (e.g., job timeouts) into technical changes in topology and scheduler configuration
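The scheduler-configuration work above can look like the following slurm.conf fragment. This is a minimal sketch assuming a Slurm cluster with NVIDIA GPUs declared as GRES; the node names, partition names, and limits are illustrative, not an actual production configuration.

```
# Illustrative slurm.conf fragment (hypothetical names and values).
# Separate partitions keep long multi-node training jobs from being
# starved behind short interactive work.
GresTypes=gpu
NodeName=gpu[001-064] Gres=gpu:h100:8 CPUs=96 RealMemory=1024000

# Week-scale MaxTime on `train` addresses job-timeout complaints for
# long training runs; interactive jobs get a short, default partition.
PartitionName=train Nodes=gpu[001-048] MaxTime=7-00:00:00 State=UP
PartitionName=interactive Nodes=gpu[049-064] MaxTime=04:00:00 Default=YES State=UP
```

In practice a topology.conf describing the NVLink/InfiniBand layout would accompany this, so the scheduler can place multi-node jobs on nodes that share a fabric segment.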

What We're Looking For

  • Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus)
  • 3+ years of experience building or running HPC or large GPU clusters—on-prem, cloud, or hybrid
  • Ownership of outcomes, not just job submission
  • Strong Linux background
  • Experience with Kubernetes and container runtimes (containerd, CRI-O, Docker) in real environments
  • Experience integrating CI/CD into infrastructure workflows
  • Solid understanding of HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch
  • Understanding of why topology and fabric design matter, and experience diagnosing misconfigured systems
  • Experience with storage and I/O for large workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage, or similar
  • Focus on throughput, latency, and contention in storage systems
  • Proficiency with Terraform, Ansible, Helm, and GitOps-style workflows
  • Ability to keep configurations reproducible and maintainable
  • Good scripting skills in Python or Bash
  • Ability to automate checks, integrate systems, and prototype tooling
  • Clear written and verbal communication skills
  • Ability to lead design reviews without losing the audience
  • Ability to communicate effectively with both engineers and non-technical stakeholders
  • Legal authorization to work in the U.S. on a full-time basis without visa sponsorship
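As a flavor of the utilization and fragmentation analysis mentioned above, here is a small Python sketch. The node data and the "fragmentation" definition are hypothetical, for illustration only; a real pipeline would pull allocation state from Slurm (e.g. sinfo/squeue output) or DCGM rather than a static list.

```python
# Sketch: simple utilization and fragmentation metrics for a GPU fleet,
# computed from per-node allocation counts. All inputs are hypothetical.

def fleet_metrics(nodes):
    """nodes: list of (allocated_gpus, total_gpus) tuples, one per node.

    Returns (utilization, fragmentation), both in [0, 1].
    """
    total = sum(t for _, t in nodes)
    used = sum(a for a, _ in nodes)
    utilization = used / total if total else 0.0

    # Treat a node as "fragmented" when it is partially allocated: its
    # free GPUs cannot serve a job that needs a whole node.
    fragmented_free = sum(t - a for a, t in nodes if 0 < a < t)
    total_free = total - used
    fragmentation = fragmented_free / total_free if total_free else 0.0
    return utilization, fragmentation

if __name__ == "__main__":
    # Three 8-GPU nodes: one fully allocated, one half-allocated, one idle.
    util, frag = fleet_metrics([(8, 8), (4, 8), (0, 8)])
    print(f"utilization={util:.2f} fragmentation={frag:.2f}")
```

Here half the fleet is busy (utilization 0.50), but a third of the free GPUs sit on a partially-allocated node (fragmentation 0.33), which is the kind of signal that feeds back into partition and bin-packing decisions.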

Nice to Have

  • Hands-on experience with the NVIDIA ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight
  • Experience managing CUDA stacks across production GPU clusters
  • Experience with MLflow, Kubeflow, NeMo, or similar AI/ML pipeline tools
  • Experience with distributed training frameworks like PyTorch DDP, DeepSpeed, or Megatron
  • Real cluster experience with Slurm, LSF, or PBS (not just lab environments)
  • Experience with multi-tenant GPU environments or 'AI training farms'
  • Familiarity with observability stacks for HPC: Prometheus, DCGM Exporter, Grafana, NGC tools
  • Open-source contributions in HPC, CUDA, or Kubernetes (strong plus)

Technical Stack

Kubernetes, Slurm, NVIDIA Hopper GPUs, NVIDIA Blackwell GPUs, NVLink, NVSwitch, InfiniBand, RoCE, GPU Operator, Network Operator, CUDA, Terraform, Ansible, Helm, GitOps, Python, Bash, containerd, CRI-O, Docker, Ceph, Lustre, NFS, GPUDirect Storage, Prometheus, DCGM Exporter, Grafana, NGC, MLflow, Kubeflow, NeMo, PyTorch DDP, DeepSpeed, Megatron

Team & Environment

Engineering-driven team with low bureaucracy and high ownership. We focus on solving hard infrastructure problems and seeing our work impact real customer workloads. Our culture values technical excellence, low ego, and doing things properly at scale.

Benefits & Compensation

  • 100% employer-paid medical, dental, and vision insurance for employee and family
  • 4% 401(k) match with immediate vesting
  • Company-paid short-term and long-term disability insurance
  • Company-paid life insurance
  • 20 weeks paid parental leave for primary caregivers
  • 12 weeks paid parental leave for secondary caregivers
  • Remote-first work model within the US
  • Home office support including mobile and internet stipend
  • Access to top-tier hardware: H200, B200, GB200-class GPUs, NVLink/NVSwitch, InfiniBand/RoCE

Compensation: $225,000–$315,000 OTE with equity included.

Work Mode

Remote-first, work from anywhere in the United States.

We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other characteristic protected by applicable federal, state, or local law.

Required Skills
Kubernetes, Slurm, NVIDIA Hopper GPUs, NVIDIA Blackwell GPUs, NVLink, NVSwitch, InfiniBand, RoCE, GPU Operator, Network Operator, Linux, containerd, CRI-O, Docker, HPC
About company
Lavendo
Building AI-centric cloud infrastructure that combines large GPU clusters, high-speed networks, and cloud-native tooling into a platform used by enterprises, startups, and research teams. The goal is to enable serious AI and simulation workloads without requiring customers to build their own supercomputers.
Job Details
Category: Infrastructure
Posted: 2 months ago