Responsibilities
- Lead incident management during daytime hours in your region by directing technical investigations, centralizing communication, and escalating to engineering or SRE teams when needed.
- Handle issues escalated from Tier 1 support, using runbooks, system data, and diagnostic tools to resolve them or determine whether Tier 3 involvement is required.
- Create and maintain runbooks, operational workflows, and documentation to ensure consistent handling of recurring incidents, working with product teams to broaden issue coverage.
- Build, update, and improve automation scripts and tools to simplify remediation tasks, speed up responses, and reduce manual effort in operations.
- Monitor system health through metrics, logs, and traces, using platforms such as Grafana, Prometheus, GCP Monitoring, and OpenTelemetry, to detect issues early and improve detection systems.
- Serve as the main communication hub during active incidents, providing timely updates and ensuring correct routing to relevant engineering and SRE teams.
- Work with reliability and product engineering groups to exchange operational insights, suggest system improvements, and refine processes for better system stability and manageability.
- Take part in a rotating weekend on-call schedule to ensure continuous support for production systems, responding to incidents and coordinating with technical teams as needed.
- Support the development of operational standards, optimize incident response workflows, and help lay the groundwork for expanding the reliability operations function.