Remote (Global), Full-time

Confluent is hiring a Senior Manager - Incident Response Engineering

About the Role

Confluent is hiring a Senior Manager - Incident Response Engineering to lead a dedicated team responsible for incident command, response, postmortems, and customer-facing root cause analysis for Confluent Cloud's most critical incidents. You will own the program end-to-end, including people, process, tooling, and outcomes, serving as a player-coach who steps in to run high-severity incidents.

What You'll Do

  • Recruit, hire, and develop a team of ~5 senior incident response engineers distributed across AMER and APAC time zones.
  • Design sustainable on-call models with follow-the-sun coverage.
  • Provide incident command for high-severity and critical customer-impacting incidents, acting as the senior escalation point.
  • Set and enforce standards for how incidents are run, including communications cadence and stakeholder coordination.
  • Drive a customer-first posture in every incident to ensure timely, accurate updates and clear ownership.
  • Own postmortem quality end-to-end, including facilitation, root cause analysis, and corrective action definition.
  • Manage the Customer Root Cause Analysis (CRCA) program, ensuring documents are timely, accurate, and clearly written.
  • Drive an AI-centric approach to scaling incident operations using intelligent tooling to improve triage speed and documentation.
  • Own and evolve the incident management tooling stack with a bias towards agentic assistance.
  • Analyze incident data to identify recurring patterns and feed learnings back into engineering practices.
  • Partner with Legal, PR, and Customer Success on customer-facing communications during and after major incidents.
  • Brief engineering leadership and executives during active incidents with clarity and composure.

What We're Looking For

  • 10+ years in SRE, incident management, or reliability engineering.
  • At least 5 years managing SRE, incident management, or reliability engineering teams.
  • Proven experience as an incident commander in high-severity, customer-impacting outages at scale.
  • Cloud infrastructure experience across at least one of AWS, GCP, or Azure.
  • Deep understanding of distributed systems failure modes.
  • Strong track record with postmortem facilitation and driving corrective actions to completion.
  • Excellent written communication, particularly in customer-facing root cause analyses.
  • Experience working with cross-functional stakeholders (legal, PR, customer success) during incident response.
  • Track record of hiring and developing senior technical talent in a globally distributed, remote-first environment.
  • Comfort operating with significant autonomy and making high-stakes decisions under pressure.

Nice to Have

  • Kafka/event streaming experience.
  • Experience with incident response in a multi-cloud context.
  • Experience building an incident management function or team from scratch.
  • Familiarity with post-incident review methodologies beyond the standard '5 whys' (e.g., Learning from Incidents, resilience engineering).
  • Demonstrated use of AI-assisted tooling to improve operational quality at scale.

Technical Stack

  • AWS
  • GCP
  • Azure

Team & Environment

The team consists of ~5 experienced incident response engineers providing 24/7 coverage across time zones. This role sits within the Cloud Architecture & Reliability (CAR) organization.

Benefits & Compensation

  • Compensation: CA$271.6K - CA$319.1K

Work Mode

This is a global role open to candidates in the AMER and APAC regions.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Required Skills
Incident Response, AWS, GCP, Azure, Threat Hunting, Digital Forensics, SIEM, SOAR, Python, Scripting, Cloud Security, Security Operations, Threat Intelligence
About company
Confluent

Confluent provides a data streaming platform that puts information in motion in near real-time across AWS, GCP, and Azure, enabling companies to react faster and build smarter.

Job Details
Category: Security
Posted: 21 days ago