MoEngage Inc. is looking for a Site Reliability Engineer (SRE-2) to take ownership of the health and performance of our key services and contribute to the evolution of our infrastructure at scale. You will be a critical member of the SRE team, responsible for driving impactful reliability initiatives and ensuring the efficiency of our systems.
What You'll Do
- Take ownership of the reliability, performance, and efficiency of critical services.
- Design, develop, and implement robust automation solutions to eliminate toil, streamline operations, and improve system resilience.
- Lead troubleshooting efforts for complex production incidents, perform in-depth root cause analysis, and implement sustainable preventative measures.
- Actively contribute to the design, implementation, and optimization of cloud infrastructure on AWS and GCP, leveraging technologies like Kubernetes.
- Implement and refine advanced monitoring, alerting, and logging solutions to gain deep insights into system behavior.
- Partner closely with development teams to influence architectural decisions, ensuring reliability, scalability, and security are built in from the start.
- Implement and advocate for advanced security practices within infrastructure and operational workflows.
- Analyze and optimize cloud infrastructure spend, identifying and implementing cost-saving opportunities.
- Mentor and guide SRE-1 engineers, contributing to the team's growth and knowledge sharing.
- Participate in the on-call rotation, acting as a key point of escalation and resolution for critical issues.
What We're Looking For
- 3-5 years of hands-on experience in Site Reliability Engineering, DevOps, or a similar role with a strong focus on production systems.
- Demonstrated expertise in Python or Go with a proven track record of automating complex tasks.
- Strong command of AWS and/or GCP cloud platforms.
- In-depth experience with containerization and orchestration using Kubernetes and its ecosystem (ArgoCD, Helm/Kustomize).
- Solid understanding of, and experience with, monitoring and observability stacks (VictoriaMetrics, Prometheus, Grafana, ELK stack, etc.).
- Deep knowledge of Linux/Unix system internals and advanced networking concepts.
- Proven ability to diagnose and resolve complex issues in large-scale distributed systems.
- A strong understanding of Cloud Security and Information Security principles and best practices.
- Experience with cloud cost analysis and optimization techniques.
- Familiarity with CI/CD pipelines and GitOps methodologies.
- Excellent communication, collaboration, and problem-solving skills.
- A desire to mentor and lead by example.
Nice to Have
- Experience with infrastructure-as-code tools such as Terraform or Ansible.
- Experience with message queues and distributed systems (Celery, Kafka).
Technical Stack
- Languages: Python, Go
- Cloud: AWS, GCP
- Orchestration: Kubernetes (ArgoCD, Helm/Kustomize)
- Infrastructure as Code: Terraform, Ansible
- Monitoring/Observability: VictoriaMetrics, Prometheus, Grafana, ELK stack
- Messaging: Celery, Kafka
Team & Environment
You will be a critical member of the SRE team.
Work Mode
This role is based in our office in Bengaluru, Karnataka, India.
Employment at MoEngage is based solely on professional competence, skills, and experience. We stand firmly against all forms of discrimination and support equal rights and opportunities regardless of gender, ethnicity, abilities, age, identity, orientation or expression, marital status (including pregnancy), religion and beliefs, or any other status protected by law.

