Remote (Global) Full-time

Andromeda Cluster is hiring a Performance Engineer - AI Infrastructure

About the Role

Andromeda Cluster is hiring a Performance Engineer to join our AI Infrastructure team. This role focuses on optimizing the efficiency and throughput of our massive-scale AI clusters. You will profile end-to-end training runs to identify bottlenecks in compute, communication, and storage, translating performance data into concrete engineering improvements.

What You'll Do

  • Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
  • Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
  • Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
  • Design technical processes (e.g., postmortem reviews, incident response) to help the team operate effectively and avoid repeating performance regressions.

What We're Looking For

  • Strong systems intuition and a drive to dig into systems and understand how they interact, from the training loop down to the hardware.
  • Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
  • Strong programming skills in Python and C++.
  • Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
  • Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
  • A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Nice to Have

  • Experience with Rust or CUDA.
  • Low-level mastery: experience with Linux kernel tuning and eBPF, and an understanding of systems design tradeoffs at the hardware level.
  • Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
  • Expertise in security best practices for high-scale infrastructure.
  • Familiarity with monitoring tools like Prometheus and Grafana.

Technical Stack

  • Languages: Python, C++, Rust, CUDA
  • Frameworks: PyTorch, JAX, TensorFlow
  • Infrastructure: Kubernetes, Linux
  • Low-Level Tools: eBPF, NCCL, MPI, UCX
  • Monitoring: Prometheus, Grafana

Team & Environment

This role is part of the Growth team. It is a builder’s role with ownership and autonomy to shape how systems run.

Work Mode

This position is open to remote work worldwide, with optional hubs in San Francisco.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Required Skills
Python, C++, Rust, CUDA, PyTorch, JAX, TensorFlow, Kubernetes, Linux, eBPF, AI Infrastructure, Performance Engineering, Distributed Systems, Benchmarking, Profiling

About company
Andromeda Cluster

Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. It works with leading AI labs, data centers, and cloud providers to deliver compute globally, routing training and inference jobs across global supply. Its long-term vision is to build the liquidity layer for global AI compute.

Job Details
Category: Infrastructure
Posted: 21 days ago