Wayve is seeking a founding Cloud Site Reliability Engineer to build and scale the reliability foundations of our AI cloud platform, including the Model Development Platform and GPU Compute platform. This is a pivotal role where you will define frameworks, automation, and operational standards to ensure our infrastructure operates predictably and at scale.
What You'll Do
- Own the reliability, availability, and performance of the Model Development Platform and GPU Compute platform.
- Define and operationalise SLOs, SLIs, and error budgets across platform services.
- Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters.
- Partner with ML, platform, and software teams to establish clear production readiness standards.
- Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents.
- Lead incident triage, escalation, communications, and root cause analysis.
- Translate post-incident learning into durable architectural or automation improvements.
- Continuously reduce alert noise and recurring operational burden.
- Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
- Build dashboards that reflect platform health as users actually experience it.
- Improve deployment safety through better change management, validation, and rollback mechanisms.
- Build automation for cluster operations, training workflows, remediation, and scaling tasks.
- Implement self-healing patterns and resilient recovery workflows.
- Harden CI/CD and release processes to improve deployment safety and velocity.
- Support infrastructure-as-code and policy-driven guardrails to ensure secure, reliable cloud environments.
What We're Looking For
- Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
- Strong Kubernetes experience, including operating production clusters.
- Hands-on experience running production workloads in AWS, GCP, or Azure.
- Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads.
- Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred.
- Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation.
- Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale.
- Experience designing and operating observability stacks (e.g. Datadog, Prometheus, Grafana, OpenTelemetry).
- Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements.
Nice to Have
- Experience operating GPU-backed environments or large-scale ML infrastructure.
- Experience running model training or inference pipelines in production (MLOps).
- Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments.
- Experience defining and running SLOs/SLIs and building reliability programmes across multiple teams.
- Experience as an early or founding SRE hire establishing processes from scratch.
- Interest in helping shape and grow a Cloud SRE function, with potential to take on leadership responsibilities over time.
Technical Stack
- Kubernetes
- AWS/GCP/Azure
- Linux
- Python/Go/C++
- Datadog/Prometheus/Grafana/OpenTelemetry
- Infrastructure-as-code (e.g. Terraform)
- CI/CD
Team & Environment
This is a founding Cloud SRE role. You'll help create the SRE function.
Work Mode
This role operates on a hybrid basis from our London office.
Wayve is committed to creating a diverse, fair and respectful culture that is inclusive of everyone based on their unique skills and perspectives, and regardless of sex, race, religion or belief, ethnic or national origin, disability, age, citizenship, marital, domestic or civil partnership status, sexual orientation, gender identity, veteran status, pregnancy or related condition (including breastfeeding) or any other basis as protected by applicable law.


