Site Reliability Engineer at PhonePe (Expired)

PhonePe is looking for a Site Reliability Engineer with 5 to 12 years of experience to manage, scale, and ensure the high availability of our core infrastructure. This role is open to experts who specialize in either Microsoft Azure or AWS. You will be responsible for in-depth cloud architecture, automation, and complex networking to support a high-volume, mission-critical environment where downtime is not an option.

What You'll Do

Configure, maintain, and manage Ubuntu/Linux Virtual Machines in your primary cloud environment (Azure or AWS).
Design and manage cloud-native components for log storage, database management, and alerting (e.g., Azure Storage/ADX or AWS S3/CloudWatch).
Configure and maintain critical network components, including Firewalls, Route Tables, and Virtual Gateways (VPC/VNet).
Establish and manage high-speed connectivity via ExpressRoute (Azure) or Direct Connect (AWS), along with IPsec VPNs for external environments.
Resolve complex routing issues and manage network migrations with zero-to-minimal downtime.
Drive automation for all BAU (Business As Usual) tasks using Terraform, writing new code for all infrastructure components.
Use SaltStack or Ansible for automated deployment and configuration of services on VMs.
Develop custom scripts or services in Python, Go, or Java to eliminate manual toil.
Set up and manage HA services like MySQL and Aerospike.
Implement database replication across regions, manage migrations, and ensure data synchronization during network partitions.
Design and maintain robust backup strategies for databases, logs, and system configurations.
Implement and manage monitoring systems like Prometheus, VictoriaMetrics, or Riemann.
Use Loki for centralized logging and Grafana for building mission-critical dashboards and alerting.
Integrate platform and VM-level services with the SOC; collaborate with Infosec to fix vulnerabilities.
Manage Nginx and HAProxy at an expert level (proxy management, endpoint addition, and complex rewrite rules).
Work with RabbitMQ (RMQ) and containerize services using Docker.
Proactively identify and solve infrastructure challenges before they impact users.
Lead incident response, create Root Cause Analysis (RCA) documents, and manage post-mortems.
Define SLOs/SLIs and reduce toil through automation.
Identify and implement cloud resource optimization to save costs.

What We're Looking For

5 to 12 years in an SRE or high-level DevOps role.
Deep hands-on experience with either Azure (VMs, Storage Accounts, Cosmos DB, ADX) or AWS (EC2, S3, RDS).
Expert proficiency in Linux (Ubuntu) for system administration and kernel-level performance troubleshooting.
Deep knowledge of DNS, BGP routing, and private connectivity troubleshooting.

Technical Stack

Cloud: Microsoft Azure, AWS
OS/Languages: Ubuntu/Linux, Python, Go, Java
Infrastructure as Code: Terraform, SaltStack, Ansible
Data Stores: MySQL, Aerospike
Monitoring/Observability: Prometheus, VictoriaMetrics, Riemann, Loki, Grafana
Networking/Services: Nginx, HAProxy, RabbitMQ (RMQ)
Containerization: Docker

Benefits & Compensation

Medical Insurance
Critical Illness Insurance
Accidental Insurance
Life Insurance
Employee Assistance Program
Onsite Medical Center
Emergency Support System
Maternity Benefit
Paternity Benefit Program
Adoption Assistance Program
Day-care Support Program
Relocation benefits
Transfer Support Policy
Travel Policy
Employee PF Contribution
Flexible PF Contribution
Gratuity
NPS
Leave Encashment
Higher Education Assistance
Car Lease
Salary Advance Policy

PhonePe is an equal opportunity employer and is committed to treating all its employees and job applicants equally, regardless of gender, sexual preference, religion, race, color, or disability.
