Hyderabad, Telangana, India | Hybrid

A1M Solutions is hiring a Sr Site Reliability Engineer

Responsibilities

  • Apply established SRE frameworks, best practices, and operational playbooks from the Center of Excellence.
  • Serve as an active engineer focused on improving observability, system reliability, and incident response capabilities.
  • Collaborate with senior SREs and leadership to standardize monitoring and incident handling procedures.
  • Support automation initiatives that enhance system reliability and minimize manual intervention.
  • Develop and manage monitoring tools using platforms such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
  • Design and improve dashboards, metrics, and alerting systems to detect anomalies proactively.
  • Expand observability across infrastructure components, applications, APIs, and database layers.
  • Define and implement service level indicators (SLIs), service level objectives (SLOs), service level agreements (SLAs), and error budgets with product and platform teams.
  • Help reduce mean time to detection (MTTD) and mean time to resolution (MTTR) through better instrumentation and automated responses.
  • Engage in capacity planning, system resiliency testing, and scalability assessments.
  • Support chaos engineering efforts and reliability validation exercises.
  • Take part in incident response operations, including rotating on-call duties for round-the-clock coverage.
  • Assist in conducting root cause analyses and deploying corrective measures to prevent recurrence.
  • Ensure compliance with IT service management processes for incident, problem, and change control.
  • Develop and maintain runbooks and playbooks to improve on-call team preparedness.
  • Work cross-functionally with Engineering, Product, Security, Cloud, and DevSecOps teams to integrate reliability into development lifecycles.
  • Provide guidance on operational readiness, including instrumentation and monitoring integration for new services.
  • Partner with database administrators and platform teams to improve database observability and performance.
  • Share expertise within the SRE team and learn from senior-level engineers to advance team-wide practices.

Work Arrangement

Hybrid

Team

Site Reliability Engineering Center of Excellence (CoE)

About the company
A1M Solutions
A1M Solutions is a woman-owned small business focused on preserving and improving government healthcare programs for underserved populations in the United States. The company works on projects with nationwide impact at the intersection of policy, data, and user experience design, and helps teams improve their agile, user-centered design practices.
Job Details
Department: Site Reliability Engineering
Category: Infrastructure
Posted 9 days ago