Responsibilities

Lead incident management during daytime hours in your region by directing technical investigations, coordinating communication, and engaging engineering or SRE teams when escalation is needed.
Handle issues escalated from Tier 1 support, using runbooks, system data, and diagnostic tools to resolve them or determine whether Tier 3 involvement is required.
Create and maintain runbooks, operational workflows, and documentation to ensure consistent handling of recurring incidents, working with product teams to broaden issue coverage.
Build, update, and improve automation scripts and tools to simplify remediation tasks, speed up responses, and reduce manual effort in operations.
Monitor system health using metrics, logging, and tracing tools such as Grafana, Prometheus, GCP Monitoring, and OpenTelemetry to detect issues early and improve detection systems.
Serve as the main communication hub during active incidents, providing timely updates and ensuring correct routing to relevant engineering and SRE teams.
Work with reliability and product engineering groups to exchange operational insights, suggest system improvements, and refine processes for better system stability and manageability.
Take part in a rotating weekend on-call schedule to ensure continuous support for production systems, responding to incidents and coordinating with technical teams as needed.
Support the development of operational standards, optimize incident response workflows, and help lay the groundwork for expanding the reliability operations function.

Serve Robotics is hiring a Senior Reliability Operations Engineer (Malaysia)
