Dave is looking for a Staff Site Reliability Engineer for a senior, deeply hands-on role on our small, high-leverage SRE team. You'll serve as a technical anchor across cloud infrastructure and networking, shaping how reliability, automation, and performance are embedded into every layer of our platform. You'll design systems to prevent incidents and evolve our GCP platform to support product velocity while protecting long-term durability.
What You'll Do
- Lead architecture and automation across our GCP environment, ensuring reliability, scalability, security, and thoughtful cost management.
- Define and improve SLIs, SLOs, and error budgets using Cloud Monitoring and Datadog, connecting reliability goals to real business outcomes.
- Shape our multi-region, disaster recovery, and capacity planning strategies so the platform holds up as we grow.
- Design and optimize cloud networking, including VPC architecture, ingress/egress, Cloud Armor, VPN, and DNS, to support internal systems, partner integrations, and member-facing services.
- Drive infrastructure-as-code and GitOps practices using Terraform, Kubernetes, Helm, and ArgoCD to make deployments predictable and repeatable.
- Mentor SREs and infrastructure engineers through design reviews, incident retros, and hands-on collaboration, strengthening technical depth across the team.
- Explore practical LLM-driven automation where it meaningfully reduces operational toil and shortens incident resolution time.
What We're Looking For
- 8+ years in software, infrastructure, or site reliability engineering.
- 5+ years of hands-on experience operating production systems in GCP (compute, networking, storage, IAM, observability).
- Deep experience with Kubernetes (GKE), Helm, containerization, Terraform (IaC), and ArgoCD.
- Strong programming skills in Python, Go, or TypeScript/JavaScript for automation and internal tooling.
- Experience defining and operating against SLIs, SLOs, and error budgets.
- Strong knowledge of relational and distributed databases (e.g., MySQL, Cloud SQL, Cloud Spanner, Redis), including performance tuning and HA strategies.
- Experience leading incident response, root cause analysis, and systemic remediation.
Nice to Have
- Experience in fintech or regulated environments.
- Familiarity with CI tooling (GitHub Actions, Jenkins, Tekton, CircleCI).
- Experience in high-growth startups.
Technical Stack
- Cloud: GCP
- Orchestration & IaC: Kubernetes (GKE), Helm, Terraform, ArgoCD
- Observability: Cloud Monitoring, Datadog
- Languages: Python, Go, TypeScript/JavaScript
- Databases: MySQL, Cloud SQL, Cloud Spanner, Redis
Team & Environment
You'll join a small, high-leverage SRE team of 3–4 engineers and report to the Director of DevX & Infrastructure Engineering.
Benefits & Compensation
- Flexible hours and a virtual-first work culture with a home office stipend.
- Premium medical, dental, and vision insurance plans.
- Generous paid parental and caregiver leave.
- 401(k) savings plan with matching contributions.
- Financial advisor and financial wellness support.
- Flexible PTO and generous company holidays, including Juneteenth and Winter Break.
- All-company in-person events once or twice a year and virtual events throughout.
Work Mode
This is a remote position open to candidates located anywhere in the United States, except Hawaii.
Dave Operating LLC is proud to be an Equal Employment Opportunity employer and is dedicated to cultivating a diverse and inclusive workplace. We will consider for employment all qualified applicants and do not discriminate on any basis protected by federal, state, or local law, including the City of Los Angeles’ Fair Chance Initiative for Hiring Ordinance relating to an applicant's criminal history.