Responsibilities
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
- Establish observability standards and best practices across teams
- Improve visibility into system health, performance, and reliability
- Apply SRE principles to improve availability, performance, and resilience
- Define and track SLIs, SLOs, and error budgets
- Participate in on-call rotations and SEV incident response
- Lead or contribute to incident investigations and root cause analysis (RCA)
- Drive preventative actions to reduce repeat incidents
- Support and monitor Kubernetes environments (GKE and on-prem clusters)
- Monitor cluster health, capacity, and resource utilization
- Troubleshoot platform-level issues impacting application reliability
- Collaborate with Platform and Engineering teams on reliability improvements
- Provide L2/L3 application support coverage during Support team resource shortages, high-severity incidents (SEVs), and peak support periods or escalations
- Triage and troubleshoot application issues using existing runbooks and dashboards
- Collaborate with Application Support and Engineering teams during incidents
- Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)
Requirements
- Deep expertise in monitoring, observability, and reliability engineering
- Experience supporting systems running across on-premises infrastructure and Google Cloud Platform (GCP)
- Strong focus on Grafana and Kubernetes environments
- Ability to design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Ability to define, tune, and maintain alerts to ensure a high signal-to-noise ratio
- Experience establishing observability standards and best practices across teams
- Experience applying SRE principles to improve availability, performance, and resilience
- Experience defining and tracking SLIs, SLOs, and error budgets
- Experience participating in on-call rotations and SEV incident response
- Experience leading or contributing to incident investigations and root cause analysis (RCA)
- Ability to drive preventative actions to reduce repeat incidents
- Experience supporting and monitoring Kubernetes environments (GKE and on-prem clusters)
- Experience monitoring cluster health, capacity, and resource utilization
- Ability to troubleshoot platform-level issues impacting application reliability
- Ability to collaborate with Platform and Engineering teams on reliability improvements
- Ability to provide L2/L3 application support coverage during periods of resource constraints or major incidents
- Ability to triage and troubleshoot application issues using existing runbooks and dashboards
- Ability to collaborate with Application Support and Engineering teams during incidents
- Ability to document all actions, findings, and resolutions in ServiceNow (SNOW)