This role ensures operational stability by resolving incidents and problems across business systems. The engineer implements change requests, maintains knowledge documentation, and works with vendors and service management teams. It is a Site Reliability Engineering position focused on observability, automation, and building resilient systems.

Responsibilities

Resolve incidents and problems affecting various components of business systems
Maintain stable operations across platforms and services
Create and execute Requests for Change (RFC) to support system updates
Keep knowledge base content current to aid troubleshooting and support
Work with external vendors and service teams to analyze and resolve technical issues
Monitor and improve system uptime, latency, and throughput to meet service level objectives
Lead responses during system outages or critical incidents
Manage escalation procedures for unresolved or high-severity issues
Perform root cause analysis to identify underlying system failures
Lead postmortem reviews to document lessons and prevent recurrence
Build and maintain CI/CD pipelines for automated software delivery
Automate infrastructure provisioning and management tasks
Reduce manual operational work using scripting and orchestration tools
Implement monitoring, logging, and tracing systems using tools like Prometheus, Grafana, ELK, and Datadog
Forecast infrastructure resource needs based on usage trends
Design infrastructure that scales efficiently with demand
Ensure systems maintain performance during traffic surges
Collaborate with development teams to deploy new features safely, including automated testing and rollback capabilities
Implement multi-region deployment strategies to enhance system resilience
Conduct chaos engineering tests to validate system robustness
Automate failover processes to support business continuity
Use data from post-incident reviews to improve operational workflows
Enhance system reliability through data-informed improvements
Work with product, design, machine learning, and DevOps teams to develop intelligent, reliable workflows
Apply Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, or Pulumi, delivered through pipelines such as Azure DevOps

Requirements

Minimum of 3 years of professional experience in a technical operations or engineering role
Completion of 15 years of full-time education
Demonstrated skills in SRE observability practices
Proficiency in Python, Bash, PowerShell, and YAML for automation and tool development
Hands-on experience with Azure cloud environments and orchestration tools like Kubernetes and Terraform
Strong understanding of Windows systems, networking fundamentals, and distributed system architectures
Experience using observability platforms such as Dynatrace and Azure Monitor
Familiarity with incident management and alerting systems like ServiceNow (SNOW)
Proficiency with CI/CD tools including Azure DevOps, GitHub Actions, or GitLab CI
Working knowledge of security, compliance, and performance optimization for highly available systems

Tech Stack

Python, Bash, PowerShell, YAML, Azure, Kubernetes, Terraform, Windows systems, Networking, Distributed architectures, Dynatrace, Azure Monitor, ServiceNow (SNOW), Prometheus, Grafana, ELK Stack, Datadog, Azure DevOps, GitHub Actions, GitLab CI, CloudFormation, Pulumi, Infrastructure as Code (IaC)

Work Arrangement

Onsite, Bengaluru

Additional Information

This position is based at the Bengaluru office
15 years of full-time education is required

NeuraFlash is hiring a Technology Support Engineer

