This role ensures operational stability by resolving incidents and problems across business systems. The engineer implements change requests, maintains knowledge documentation, and works with vendors and service management teams. It is a Site Reliability Engineering position focused on observability, automation, and building resilient systems.
Responsibilities
- Resolve incidents and problems affecting various components of business systems
- Maintain stable operations across platforms and services
- Create and execute Requests for Change (RFCs) to support system updates
- Keep knowledge base content current to aid troubleshooting and support
- Work with external vendors and service teams to analyze and resolve technical issues
- Monitor and improve system uptime, latency, and throughput to meet service level objectives
- Lead responses during system outages or critical incidents
- Manage escalation procedures for unresolved or high-severity issues
- Perform root cause analysis to identify underlying system failures
- Lead postmortem reviews to document lessons and prevent recurrence
- Build and maintain CI/CD pipelines for automated software delivery
- Automate infrastructure provisioning and management tasks
- Reduce manual operational work using scripting and orchestration tools
- Implement monitoring, logging, and tracing systems using tools like Prometheus, Grafana, ELK, and Datadog
- Forecast infrastructure resource needs based on usage trends
- Design infrastructure that scales efficiently with demand
- Ensure systems maintain performance during traffic surges
- Collaborate with development teams to deploy new features safely, including automated testing and rollback capabilities
- Implement multi-region deployment strategies to enhance system resilience
- Conduct chaos engineering tests to validate system robustness
- Automate failover processes to support business continuity
- Use data from post-incident reviews to improve operational workflows
- Enhance system reliability through data-informed improvements
- Work with product, design, machine learning, and DevOps teams to develop intelligent, reliable workflows
- Apply Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, or Pulumi, with delivery through pipelines such as Azure DevOps
Requirements
- Minimum of 3 years of professional experience in a technical operations or engineering role
- Completion of 15 years of full-time education
- Demonstrated skills in SRE observability practices
- Proficiency in Python, Bash, PowerShell, and YAML for automation and tool development
- Hands-on experience with Azure cloud environments, container orchestration with Kubernetes, and infrastructure provisioning with Terraform
- Strong understanding of Windows systems, networking fundamentals, and distributed system architectures
- Experience using observability platforms such as Dynatrace and Azure Monitor
- Familiarity with incident management and alerting systems like ServiceNow (SNOW)
- Proficiency with CI/CD tools including Azure DevOps, GitHub Actions, or GitLab CI
- Working knowledge of security, compliance, and performance optimization for highly available systems
Tech Stack
Python, Bash, PowerShell, YAML, Azure, Kubernetes, Terraform, Windows systems, Networking, Distributed architectures, Dynatrace, Azure Monitoring, ServiceNow (SNOW), Prometheus, Grafana, ELK Stack, Datadog, Azure DevOps, GitHub Actions, GitLab CI, CloudFormation, Pulumi, Infrastructure as Code (IaC)
Work Arrangement
Onsite (Bengaluru)
Additional Information
- This position is based at the Bengaluru office
- 15 years of full-time education is required


