AuthZed is looking for an Engineering Manager: SRE to lead the team responsible for the reliability, scalability, and performance of our infrastructure as we grow globally. This is a hands-on leadership role where you will manage and develop a team of SREs while staying deeply engaged with production systems, incident response, and platform architecture.
What You'll Do
- Lead a global team of Site Reliability Engineers delivering infrastructure automation, observability, and operational scalability across multi-cloud, multi-region Kubernetes architectures.
- Recruit, hire, onboard, and develop engineers while elevating the overall strength of the team.
- Act as a player-coach by contributing to critical projects while mentoring engineers and supporting their professional growth.
- Participate in on-call rotations at a sustainable level to stay grounded in real operational issues.
- Guide project planning by defining milestones, identifying dependencies, and working toward timely delivery.
- Identify toil and lead initiatives to eliminate it through engineering solutions.
- Drive automation and platform engineering: safer deploys, progressive delivery, guardrails, and paved paths that reduce toil.
- Collaborate with product and engineering to ship features such as self-service workflows and infrastructure-as-code standards with reliability built in.
- Serve as a senior escalation point for complex incident triage and root cause analysis.
What We're Looking For
- 10+ years of experience in infrastructure, SRE, or platform engineering roles.
- 5+ years of team management or technical leadership in SRE or Platform Engineering.
- Experience managing distributed teams across the US, Canada, the EU, and other global time zones.
- Experience leading or mentoring SRE/Infrastructure/Platform teams in a production SaaS environment.
- Strong leadership skills with the ability to mentor and coach senior-level engineers.
- Strong grasp of SRE fundamentals: SLOs/SLIs, error budgets, incident management, capacity planning, and operational excellence.
- Extensive experience with AWS, GCP, and Azure managed services.
- Strong programming skills and experience writing production-quality automation or tooling (e.g., Go, Python, Bash).
- Hands-on experience with Kubernetes, Kubernetes Operators/Controllers, containerized workloads, and Infrastructure as Code (Terraform, Pulumi).
- Experience with monitoring and observability systems (e.g., Prometheus, Grafana, logging/tracing pipelines).
- Excellent communication skills: able to explain reliability tradeoffs to product and executive stakeholders and to write crisp incident reports and postmortems.
- Proven ability to translate operational pain points into engineering deliverables.
Nice to Have
- Experience working with or integrating AI-powered systems or tooling.
- Experience operating multi-tenant or high-isolation customer environments.
- Familiarity with distributed databases and performance tuning at scale.
- Experience building internal developer platforms or paved paths.
Technical Stack
- Cloud: AWS, GCP, Azure
- Infrastructure: Kubernetes, Terraform, Pulumi
- Observability: Prometheus, Grafana
- Programming: Go, Python, Bash
Benefits & Compensation
- Competitive salary based on experience.
- Stock options at an early-stage startup.
- Comprehensive benefits including healthcare (US-based) and other insurance.
- Twice-yearly travel for team offsites focused on team bonding, collaboration, and having fun.
Work Mode
This is a remote role open to candidates in the US, Canada, and Europe.
AuthZed celebrates the representation of diverse perspectives and backgrounds as a catalyst for creating an inclusive work environment.