The Site Reliability Engineer will be responsible for ensuring the reliability, performance, and efficiency of critical systems. This includes driving automation, managing incidents, optimizing cloud infrastructure across AWS and GCP, and mentoring junior engineers within a scalable, distributed environment.

Responsibilities

Own the reliability, performance, and operational efficiency of key services.
Design and implement automation solutions to reduce manual work and strengthen system resilience.
Lead diagnosis and resolution of complex production issues, conducting root cause analysis and applying long-term fixes.
Contribute to the architecture and optimization of cloud infrastructure on AWS and GCP, applying Kubernetes expertise.
Enhance monitoring, alerting, and logging systems to proactively detect and resolve issues.
Collaborate with development teams to integrate reliability, scalability, and security into system design.
Promote and apply strong security practices across infrastructure and operational processes.
Optimize cloud spending by identifying cost-saving opportunities and efficient resource usage.
Guide and mentor junior SREs to foster team growth and knowledge sharing.
Participate in on-call rotations as a primary responder for critical system incidents.

Requirements

3 to 5 years of hands-on experience in Site Reliability Engineering, DevOps, or a related role focused on production systems.
Proven experience using Python or Go to automate complex operational tasks.
Strong proficiency with AWS and/or GCP cloud platforms.
Deep experience in containerization and orchestration using Kubernetes and related tooling such as ArgoCD, Helm, or Kustomize.
Experience with infrastructure-as-code tools such as Terraform or Ansible.
Familiarity with observability tools including Prometheus, VictoriaMetrics, Grafana, and the ELK stack.
Solid understanding of Linux/Unix internals and advanced networking concepts.
Demonstrated ability to troubleshoot and resolve issues in large-scale distributed systems.
Knowledge of cloud and information security best practices.
Experience analyzing and optimizing cloud costs.
Exposure to CI/CD pipelines and GitOps workflows.
Strong problem-solving, communication, and collaboration skills.
Willingness to mentor others and lead through example.

Nice to Have

Experience working with distributed systems and message queues such as Kafka or Celery.

Tech Stack

AWS, GCP, Kubernetes, ArgoCD, Helm, Kustomize, Terraform, Ansible, VictoriaMetrics, Prometheus, Grafana, ELK stack, Python, Go, CI/CD, GitOps, Celery, Kafka

Benefits

Work on a platform that processes over a billion messages each day.
Collaborate with a talented and passionate technology team.
Be part of an inclusive and diverse workplace.
Equal access to growth and success opportunities regardless of background.
Work in an environment that values differences and promotes inclusivity.
Opportunity to bring your authentic self to work.
Supportive culture focused on solving meaningful technical challenges.
Respect for individual perspectives and identities in the workplace.

Compensation

Not specified

Work Arrangement

Local position in Bengaluru, Karnataka, India

Team

SRE team consisting of SRE-1 and SRE-2 roles, working collaboratively with development teams

Committed to diversity and inclusion across all dimensions.
Encourages collaboration among individuals with varied backgrounds and perspectives.
Provides equal opportunities and opposes discrimination in all forms.
Supports employees in bringing their authentic selves to work.
Driven by passion for technology and team success.
Focused on solving impactful challenges through teamwork.

Additional Information

Hiring decisions are based on professional competence, skills, and experience.
Company maintains a strict stance against all forms of discrimination.
Complies with national, state, and local non-discrimination laws.
Core values include inclusivity, respect, and equal opportunity.
Opportunity to mentor others and lead by example.
On-call participation is a required part of the role.
Emphasis on automation, reliability, and scalability in infrastructure design.
Platform operates at massive scale, serving over a billion monthly customers.

MoEngage Inc is hiring a Site Reliability Engineer
