Responsibilities
- Participate in a 24/7 on-call rotation, responding immediately to critical system outages and maintaining uptime for the North American eCommerce platform.
- Serve as a key contributor in incident triage, diagnosing and resolving complex production issues to reduce recovery time.
- Maintain and enhance operational runbooks and standard procedures to ensure consistent and effective incident handling.
- Lead and take part in blameless post-incident reviews and root cause analyses to uncover systemic issues and prevent recurrence.
- Work closely with development and platform teams to design scalable reliability improvements based on incident learnings.
- Establish and monitor service level indicators and objectives to quantify system performance and availability.
- Collaborate with product leadership to define service expectations and manage error budgets that balance release velocity with system resilience.
- Assess monthly release cycles for risks to system health and to compliance with service level targets.
- Use and refine observability tools such as Dynatrace and GCP Logging to monitor system behavior and detect issues early.
- Identify gaps in monitoring coverage and implement technical enhancements for full system visibility.
- Develop, manage, and improve metrics, dashboards, and alerts using Terraform in alignment with organizational standards.
- Design alerting strategies with thresholds based on service level violations and error budget consumption.
- Drive automation by building scripts and tools to eliminate repetitive manual operations.
- Build self-healing systems that automatically detect and correct common failures, minimizing human intervention.
- Deploy and oversee AI-powered observability platforms to enable predictive monitoring and maintenance.
- Collaborate with engineering teams to address performance bottlenecks and improve operational workflows.
- Produce clear, data-backed reports on system reliability, incident patterns, and SRE program progress for leadership review.
Work Arrangement
On-site
Other
- Grade 7 or 8.


