Responsibilities
- Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.).
- Optimize system performance, troubleshoot incidents, and conduct post-mortems and root cause analysis (RCA) to prevent recurrence.
- Collaborate with developers to enhance application reliability, scalability, and performance.
- Drive cost optimization efforts in cloud environments.
- Work with multiple data stores, such as MongoDB, Redis, Elasticsearch, and queue-based systems.
Requirements
- 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
- Hands-on experience with GCP and AWS.
- Experience with Terraform, Helm, or equivalent tools for Infrastructure as Code (IaC).
- Experience with Docker and Kubernetes (GKE) for containerization and orchestration.
- Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools.
- Proficiency in Python, Bash, or other shell scripting.
- Basic understanding of working with APIs and parsing and manipulating JSON.
- Hands-on experience with Jenkins, GitHub Actions, ArgoCD, or similar CI/CD tools.
- Experience with on-call rotations, SLOs, SLIs, SLAs, escalation policies, and incident resolution.
- Experience monitoring MongoDB, Redis, Elasticsearch, and queue-based systems.
Work Arrangement
Remote (Worldwide)
Additional Information
- The company is an Equal Opportunity Employer.
- Applicants may be asked to voluntarily provide demographic information for compliance with affirmative action regulations.
- AI tools may be used to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses.
- Final hiring decisions are made by humans.
- Any demographic information provided will be kept separate from the application and will not be used in hiring decisions.