Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.).
Optimize system performance, troubleshoot incidents, and conduct post-mortems/RCA to prevent future issues.
Collaborate with developers to enhance application reliability, scalability, and performance.
Drive cost optimization efforts in cloud environments.
Experience with multiple data stores, such as MongoDB, Redis, Elasticsearch (ES), and queue-based systems.

7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
Hands-on experience with GCP and AWS.
Terraform, Helm, or equivalent tools for Infrastructure as Code (IaC).
Docker, Kubernetes (GKE) for containerization and orchestration.
Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools.
Proficiency in Python, Bash, or Shell scripting.
Basic understanding of API parsing and JSON manipulation.
Hands-on experience with Jenkins, GitHub Actions, ArgoCD, or similar CI/CD tools.
Experience with on-call rotations, SLOs, SLIs, SLAs, Escalation Policies, and incident resolution.
Experience monitoring MongoDB, Redis, Elasticsearch (ES), and queue-based systems.

Remote (Worldwide)

The company is an Equal Opportunity Employer.
Applicants may be asked to voluntarily provide demographic information for compliance with affirmative action regulations.
AI tools may be used to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses.
Final hiring decisions are made by humans.
Any demographic information provided will be kept separate from the application and will not be used in hiring decisions.

HighLevel is hiring a Lead Site Reliability Engineer