Own and operate the monitoring and observability stack across on-prem and GCP environments
Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
Establish observability standards and best practices across teams
Improve visibility into system health, performance, and reliability
Apply SRE principles to improve availability, performance, and resilience
Define and track SLIs, SLOs, and error budgets
Participate in on-call rotations and SEV incident response
Lead or contribute to incident investigations and root cause analysis (RCA)
Drive preventative actions to reduce repeat incidents
Support and monitor Kubernetes environments (GKE and on-prem clusters)
Monitor cluster health, capacity, and resource utilization
Troubleshoot platform-level issues impacting application reliability
Collaborate with Platform and Engineering teams on reliability improvements
Provide L2/L3 application support coverage during Support team resource shortages, high-severity incidents (SEVs), and peak support periods or escalations
Triage and troubleshoot application issues using existing runbooks and dashboards
Collaborate with Application Support and Engineering teams during incidents
Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Deep expertise in monitoring, observability, and reliability engineering
Experience supporting systems running across on-premises infrastructure and Google Cloud Platform (GCP)
Strong focus on Grafana and Kubernetes environments
Ability to design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
Ability to define, tune, and maintain alerts to ensure a high signal-to-noise ratio
Experience establishing observability standards and best practices across teams
Application of SRE principles to improve availability, performance, and resilience
Experience defining and tracking SLIs, SLOs, and error budgets
Participation in on-call rotations and SEV incident response
Experience leading or contributing to incident investigations and root cause analysis (RCA)
Ability to drive preventative actions to reduce repeat incidents
Support and monitoring of Kubernetes environments (GKE and on-prem clusters)
Monitoring of cluster health, capacity, and resource utilization
Troubleshooting of platform-level issues impacting application reliability
Collaboration with Platform and Engineering teams on reliability improvements
Provision of L2/L3 application support coverage during periods of resource constraints or major incidents
Triage and troubleshooting of application issues using existing runbooks and dashboards
Collaboration with Application Support and Engineering teams during incidents
Documentation of all actions, findings, and resolutions in ServiceNow (SNOW)

Devsu is hiring a Senior Site Reliability Engineer (SRE) - (GCP)