Responsibilities

Apply established SRE frameworks, best practices, and operational playbooks from the Center of Excellence.
Work as a hands-on engineer focused on improving observability, system reliability, and incident response capabilities.
Collaborate with senior SREs and leadership to standardize monitoring and incident handling procedures.
Support automation initiatives that enhance system reliability and minimize manual intervention.
Develop and manage monitoring tools using platforms such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
Design and improve dashboards, metrics, and alerting systems to detect anomalies proactively.
Expand observability across infrastructure components, applications, APIs, and database layers.
Define and implement service level indicators (SLIs), objectives (SLOs), agreements (SLAs), and error budgets with product and platform teams.
Help reduce mean time to detection (MTTD) and mean time to resolution (MTTR) through better instrumentation and automated responses.
Engage in capacity planning, system resiliency testing, and scalability assessments.
Support chaos engineering efforts and reliability validation exercises.
Take part in incident response operations, including rotating on-call duties for round-the-clock coverage.
Assist in conducting root cause analyses and deploying corrective measures to prevent recurrence.
Ensure compliance with IT service management processes for incident, problem, and change control.
Develop and maintain runbooks and playbooks to improve on-call team preparedness.
Work cross-functionally with Engineering, Product, Security, Cloud, and DevSecOps teams to integrate reliability into development lifecycles.
Provide guidance on operational readiness, including instrumentation and monitoring integration for new services.
Partner with database administrators and platform teams to improve database observability and performance.
Share expertise within the SRE team and learn from senior-level engineers to advance team-wide practices.

Work Arrangement

Hybrid

Team

Site Reliability Engineering Center of Excellence (CoE)

A1M Solutions is hiring a Sr Site Reliability Engineer
