San Francisco, USA Hybrid

Andromeda Cluster is hiring a Performance Engineer - AI Infrastructure

Responsibilities

  • Profile and optimize training workloads from start to finish, identifying bottlenecks in GPU kernel execution, NCCL communication, and storage input/output.
  • Work with systems engineering teams to enhance scheduling efficiency, collective communication performance, and kernel-level execution.
  • Develop and maintain precise monitoring tools to track and visualize model flops utilization, throughput metrics, and cluster availability.
  • Design structured technical procedures such as incident response protocols and postmortem analyses to prevent recurring performance issues.

Work Arrangement

Hybrid

Required Skills
PythonC++RustCUDAPytorchTensorFlowKubernetesLinuxDistributed Systems
About company
Andromeda Cluster
Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. It works with leading AI labs, data centers, and cloud providers to deliver compute globally, routing training and inference jobs across global supply. Its long-term vision is to build the liquidity layer for global AI compute.
All jobs at Andromeda Cluster Visit website
Job Details
Category infrastructure
Posted 3 months ago