Confluent is hiring a Senior Manager - Incident Response Engineering to lead a dedicated team responsible for incident command, response, postmortems, and customer-facing root cause analysis for Confluent Cloud's most critical incidents. You will own the program end-to-end, including people, process, tooling, and outcomes, serving as a player-coach who steps in to run high-severity incidents.
What You'll Do
- Recruit, hire, and develop a team of ~5 senior incident response engineers distributed across AMER and APAC time zones.
- Design sustainable on-call models with follow-the-sun coverage.
- Provide incident command for high-severity and critical customer-impacting incidents, acting as the senior escalation point.
- Set and enforce standards for how incidents are run, including communications cadence and stakeholder coordination.
- Drive a customer-first posture in every incident to ensure timely, accurate updates and clear ownership.
- Own postmortem quality end-to-end, including facilitation, root cause analysis, and corrective action definition.
- Manage the Customer Root Cause Analysis (CRCA) program, ensuring documents are timely, accurate, and clearly written.
- Drive an AI-centric approach to scaling incident operations, using intelligent tooling to improve triage speed and documentation.
- Own and evolve the incident management tooling stack with a bias towards agentic assistance.
- Analyze incident data to identify recurring patterns and feed learnings back into engineering practices.
- Partner with Legal, PR, and Customer Success on customer-facing communications during and after major incidents.
- Brief engineering leadership and executives during active incidents with clarity and composure.
What We're Looking For
- 10+ years in SRE, incident management, or reliability engineering.
- At least 5 years managing teams in SRE, incident management, or reliability engineering.
- Proven experience as an incident commander in high-severity, customer-impacting outages at scale.
- Cloud infrastructure experience with at least one of AWS, GCP, or Azure.
- Deep understanding of distributed systems failure modes.
- Strong track record with postmortem facilitation and driving corrective actions to completion.
- Excellent written communication skills, especially when explaining root cause analysis to customers.
- Experience working with cross-functional stakeholders (legal, PR, customer success) during incident response.
- Track record of hiring and developing senior technical talent in a globally distributed, remote-first environment.
- Comfort operating with significant autonomy and making high-stakes decisions under pressure.
Nice to Have
- Kafka/event streaming experience.
- Experience with incident response in a multi-cloud context.
- Experience building an incident management function or team from scratch.
- Familiarity with post-incident review methodologies beyond the standard '5 whys' (e.g., Learning from Incidents, resilience engineering).
- Demonstrated use of AI-assisted tooling to improve operational quality at scale.
Technical Stack
- AWS
- GCP
- Azure
Team & Environment
The team consists of ~5 experienced incident response engineers providing 24/7 coverage across time zones. This role sits within the Cloud Architecture & Reliability (CAR) organization.
Benefits & Compensation
- Compensation: CA$271.6K - CA$319.1K
Work Mode
This is a global role open to candidates in the AMER and APAC regions.
We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.





