Lead incident investigations during local business hours, delivering timely updates, escalating when necessary, and assisting senior engineers in incident response.
Handle escalations from Tier 1 support by applying runbooks and analyzing metrics, logs, and diagnostics to resolve issues, escalating to Tier 3 as required.
Maintain and improve runbooks and operational documentation based on incident learnings, feedback, and new findings to ensure clarity and consistency.
Execute existing automation workflows and work with senior team members to refine tools and scripts that improve troubleshooting and remediation efficiency.
Use observability platforms such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to analyze system metrics, logs, and traces for anomaly detection and performance validation.
Deliver clear and accurate incident updates, ensuring timely communication to relevant engineering and SRE personnel and supporting structured incident management processes.
Engage in root cause discussions, share operational knowledge, and help implement process changes that improve system reliability and maintainability.
Take part in a rotating weekend on-call schedule to ensure continuous production system coverage, responding to incidents and coordinating with engineering teams as needed.
Proactively improve operational workflows, adopt industry best practices, and help establish and mature the Reliability Operations function.

SEK 490K - SEK 600K

This position includes participation in a shared weekend on-call rotation to ensure continuous operational coverage for production systems.

Serve Robotics is hiring a Reliability Operations Engineer (Sweden)