The Site Reliability Engineer is responsible for the reliability, performance, and efficiency of critical systems in a scalable, distributed environment. The role includes driving automation, managing incidents, optimizing cloud infrastructure across AWS and GCP, and mentoring junior engineers.
Responsibilities
- Own the reliability, performance, and operational efficiency of key services.
- Design and implement automation solutions to reduce manual work and strengthen system resilience.
- Lead diagnosis and resolution of complex production issues, conducting root cause analysis and applying long-term fixes.
- Contribute to the architecture and optimization of cloud infrastructure on AWS and GCP, applying Kubernetes expertise.
- Enhance monitoring, alerting, and logging systems to proactively detect and resolve issues.
- Collaborate with development teams to integrate reliability, scalability, and security into system design.
- Promote and apply strong security practices across infrastructure and operational processes.
- Optimize cloud spending by identifying cost-saving opportunities and efficient resource usage.
- Guide and mentor junior SREs to foster team growth and knowledge sharing.
- Participate in on-call rotations as a primary responder for critical system incidents.
Requirements
- 3 to 5 years of hands-on experience in Site Reliability Engineering, DevOps, or a related role focused on production systems.
- Proven experience using Python or Go to automate complex operational tasks.
- Strong proficiency with AWS and/or GCP cloud platforms.
- Deep experience with containerization and Kubernetes orchestration, including deployment and configuration tooling such as ArgoCD, Helm, or Kustomize.
- Experience with infrastructure-as-code tools such as Terraform or Ansible.
- Familiarity with observability tools including Prometheus, VictoriaMetrics, Grafana, and the ELK stack.
- Solid understanding of Linux/Unix internals and advanced networking concepts.
- Demonstrated ability to troubleshoot and resolve issues in large-scale distributed systems.
- Knowledge of cloud and information security best practices.
- Experience analyzing and optimizing cloud costs.
- Exposure to CI/CD pipelines and GitOps workflows.
- Strong problem-solving, communication, and collaboration skills.
- Willingness to mentor others and lead through example.
Nice to Have
- Experience working with distributed systems and message queues such as Kafka or Celery.
Tech Stack
AWS, GCP, Kubernetes, ArgoCD, Helm, Kustomize, Terraform, Ansible, VictoriaMetrics, Prometheus, Grafana, ELK stack, Python, Go, CI/CD, GitOps, Celery, Kafka
Benefits
- Work on a platform that processes over a billion messages each day.
- Collaborate with a talented and passionate technology team.
- Be part of an inclusive and diverse workplace.
- Equal access to growth and success opportunities regardless of background.
- Work in an environment that values differences and promotes inclusivity.
- Opportunity to bring your authentic self to work.
- Supportive culture focused on solving meaningful technical challenges.
- Respect for individual perspectives and identities in the workplace.
Compensation
Not specified
Work Arrangement
Local position in Bengaluru, Karnataka, India
Team
SRE team consisting of SRE-1 and SRE-2 roles, working collaboratively with development teams
- Committed to diversity and inclusion across all dimensions.
- Encourages collaboration among individuals with varied backgrounds and perspectives.
- Provides equal opportunities and opposes discrimination in all forms.
- Supports employees in bringing their authentic selves to work.
- Driven by passion for technology and team success.
- Focused on solving impactful challenges through teamwork.
Additional Information
- Hiring decisions are based on professional competence, skills, and experience.
- Company maintains a strict stance against all forms of discrimination.
- Complies with national, state, and local non-discrimination laws.
- Core values include inclusivity, respect, and equal opportunity.
- Opportunity to mentor others and lead by example.
- On-call participation is a required part of the role.
- Emphasis on automation, reliability, and scalability in infrastructure design.
- Platform operates at massive scale, serving over a billion monthly customers.