Remote (Global), Full-time

Devsu is hiring a Senior Site Reliability Engineer (SRE) - (GCP)

Responsibilities

  • Own and operate the monitoring and observability stack across on-prem and GCP environments
  • Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
  • Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
  • Establish observability standards and best practices across teams
  • Improve visibility into system health, performance, and reliability
  • Apply SRE principles to improve availability, performance, and resilience
  • Define and track SLIs, SLOs, and error budgets
  • Participate in on-call rotations and SEV incident response
  • Lead or contribute to incident investigations and root cause analysis (RCA)
  • Drive preventative actions to reduce repeat incidents
  • Support and monitor Kubernetes environments (GKE and on-prem clusters)
  • Monitor cluster health, capacity, and resource utilization
  • Troubleshoot platform-level issues impacting application reliability
  • Collaborate with Platform and Engineering teams on reliability improvements
  • Provide L2/L3 application support coverage during Support team resource shortages, high-severity incidents (SEVs), and peak support periods or escalations
  • Triage and troubleshoot application issues using existing runbooks and dashboards
  • Collaborate with Application Support and Engineering teams during incidents
  • Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Requirements

  • Deep expertise in monitoring, observability, and reliability engineering
  • Experience supporting systems running across on-premises infrastructure and Google Cloud Platform (GCP)
  • Strong focus on Grafana and Kubernetes environments
  • Ability to design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
  • Ability to define, tune, and maintain alerts to ensure a high signal-to-noise ratio
  • Experience establishing observability standards and best practices across teams
  • Application of SRE principles to improve availability, performance, and resilience
  • Experience defining and tracking SLIs, SLOs, and error budgets
  • Participation in on-call rotations and SEV incident response
  • Leadership or contribution to incident investigations and root cause analysis (RCA)
  • Ability to drive preventative actions to reduce repeat incidents
  • Support and monitoring of Kubernetes environments (GKE and on-prem clusters)
  • Monitoring of cluster health, capacity, and resource utilization
  • Troubleshooting of platform-level issues impacting application reliability
  • Collaboration with Platform and Engineering teams on reliability improvements
  • Provision of L2/L3 application support coverage during periods of resource constraints or major incidents
  • Triage and troubleshooting of application issues using existing runbooks and dashboards
  • Collaboration with Application Support and Engineering teams during incidents
  • Documentation of all actions, findings, and resolutions in ServiceNow (SNOW)

Required Skills

  • Monitoring

About company

Devsu
A company that offers software development services and custom technology solutions for businesses.

Job Details

  • Department: Engineering
  • Category: Infrastructure
  • Posted: 15 days ago