Responsibilities
- Lead incident investigations during local business hours: deliver timely updates, escalate when necessary, and assist senior engineers in incident response.
- Handle escalations from Tier 1 support by applying runbooks and analyzing metrics, logs, and diagnostics to resolve issues, escalating to Tier 3 as required.
- Maintain and improve runbooks and operational documentation based on incident learnings, feedback, and new findings to ensure clarity and consistency.
- Execute existing automation workflows and work with senior team members to refine tools and scripts that improve troubleshooting and remediation efficiency.
- Use observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to analyze system metrics, logs, and traces for anomaly detection and performance validation.
- Deliver clear and accurate incident updates, ensuring timely communication to relevant engineering and SRE personnel and supporting structured incident management processes.
- Engage in root cause discussions, share operational knowledge, and help implement process changes that improve system reliability and maintainability.
- Take part in a rotating weekend on-call schedule to ensure continuous production system coverage, responding to incidents and coordinating with engineering teams as needed.
- Proactively improve operational workflows, adopt industry best practices, and help establish and mature the Reliability Operations function.
Compensation
SEK 490K - SEK 600K
Other
This position includes participation in a shared weekend on-call rotation to ensure continuous operational coverage for production systems.