Responsibilities
- Apply established SRE frameworks, best practices, and operational playbooks from the Center of Excellence.
- Work as a hands-on engineer focused on improving observability, system reliability, and incident response capabilities.
- Collaborate with senior SREs and leadership to standardize monitoring and incident handling procedures.
- Support automation initiatives that enhance system reliability and minimize manual intervention.
- Develop and manage monitoring tools using platforms such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
- Design and improve dashboards, metrics, and alerting systems to detect anomalies proactively.
- Expand observability across infrastructure components, applications, APIs, and database layers.
- Define and implement service level indicators (SLIs), service level objectives (SLOs), service level agreements (SLAs), and error budgets with product and platform teams.
- Help reduce mean time to detection (MTTD) and mean time to resolution (MTTR) through better instrumentation and automated responses.
- Engage in capacity planning, system resiliency testing, and scalability assessments.
- Support chaos engineering efforts and reliability validation exercises.
- Take part in incident response operations, including rotating on-call duties for round-the-clock coverage.
- Assist in conducting root cause analyses and deploying corrective measures to prevent recurrence.
- Ensure compliance with IT service management processes for incident, problem, and change control.
- Develop and maintain runbooks and playbooks to improve on-call team preparedness.
- Work cross-functionally with Engineering, Product, Security, Cloud, and DevSecOps teams to integrate reliability into development lifecycles.
- Provide guidance on operational readiness, including instrumentation and monitoring integration for new services.
- Partner with database administrators and platform teams to improve database observability and performance.
- Share expertise within the SRE team and learn from senior-level engineers to advance team-wide practices.
Work Arrangement
Hybrid
Team
Site Reliability Engineering Center of Excellence (CoE)