London, England, United Kingdom | Hybrid Employment

Wayve is hiring a Site Reliability Engineer

About the Role

Wayve is seeking a founding Cloud Site Reliability Engineer to build and scale the reliability foundations of our AI cloud platform, including the Model Development Platform and GPU Compute platform. This is a pivotal role where you will define frameworks, automation, and operational standards to ensure our infrastructure operates predictably and at scale.

What You'll Do

  • Own the reliability, availability, and performance of the Model Development Platform and GPU Compute environments.
  • Define and operationalise SLOs, SLIs, and error budgets across platform services.
  • Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters.
  • Partner with ML, platform, and software teams to establish clear production readiness standards.
  • Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents.
  • Lead incident triage, escalation, communications, and root cause analysis.
  • Translate post-incident learning into durable architectural or automation improvements.
  • Continuously reduce alert noise and recurring operational burden.
  • Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
  • Build dashboards that reflect real user-centric platform health.
  • Improve deployment safety through better change management, validation, and rollback mechanisms.
  • Build automation for cluster operations, training workflows, remediation, and scaling tasks.
  • Implement self-healing patterns and resilient recovery workflows.
  • Harden CI/CD and release processes to improve deployment safety and velocity.
  • Support infrastructure-as-code and policy-driven guardrails to ensure secure, reliable cloud environments.

What We're Looking For

  • Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
  • Strong Kubernetes experience, including operating production clusters.
  • Hands-on experience running production workloads in AWS, GCP, or Azure.
  • Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads.
  • Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred.
  • Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation.
  • Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale.
  • Experience designing and operating observability stacks (e.g. Datadog, Prometheus, Grafana, OpenTelemetry).
  • Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements.

Nice to Have

  • Experience operating GPU-backed environments or large-scale ML infrastructure.
  • Experience running model training or inference pipelines in production (MLOps).
  • Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments.
  • Experience defining and running SLOs/SLIs and building reliability programs across multiple teams.
  • Experience as an early or founding SRE hire establishing processes from scratch.
  • Interest in helping shape and grow a Cloud SRE function, with potential to take on leadership responsibilities over time.

Technical Stack

  • Kubernetes
  • AWS/GCP/Azure
  • Linux
  • Python/Go/C++
  • Datadog/Prometheus/Grafana/OpenTelemetry
  • Infrastructure-as-code (e.g. Terraform)
  • CI/CD

Team & Environment

This is a founding Cloud SRE role. You'll help create the SRE function.

Work Mode

This role operates on a hybrid basis from our London office.

Wayve is committed to creating a diverse, fair and respectful culture that is inclusive of everyone based on their unique skills and perspectives, and regardless of sex, race, religion or belief, ethnic or national origin, disability, age, citizenship, marital, domestic or civil partnership status, sexual orientation, gender identity, veteran status, pregnancy or related condition (including breastfeeding) or any other basis as protected by applicable law.

Required Skills

Kubernetes, AWS, GCP, Azure, Linux, Python, Go, C++, Datadog, Prometheus, Grafana, OpenTelemetry, Terraform, CI/CD, Distributed Systems, AI/ML Workloads
About the Company
Wayve

Wayve is the leading developer of Embodied AI technology. Our advanced AI software and foundation models enable vehicles to perceive, understand, and navigate any complex environment, enhancing the usability and safety of automated driving systems.

Job Details
Department: Engineering
Category: Infrastructure
Posted 14 days ago