PhonePe is looking for a Site Reliability Engineer with 5 to 12 years of experience to manage, scale, and ensure the high availability of our core infrastructure. This role is open to specialists in either Microsoft Azure or AWS. You will be responsible for in-depth cloud architecture, automation, and complex networking in support of a high-volume, mission-critical environment where downtime is not an option.
What You'll Do
- Configure, maintain, and manage Ubuntu/Linux Virtual Machines in your primary cloud environment (Azure or AWS).
- Design and manage cloud-native components for log storage, database management, and alerting (e.g., Azure Storage/ADX or AWS S3/CloudWatch).
- Configure and maintain critical network components, including Firewalls, Route Tables, and Virtual Gateways (VPC/VNet).
- Establish and manage high-speed connectivity via ExpressRoute (Azure) or Direct Connect (AWS), along with IPsec VPNs for external environments.
- Resolve complex routing issues and manage network migrations with zero-to-minimal downtime.
- Drive automation for all BAU (Business As Usual) tasks using Terraform, writing new code for all infrastructure components.
- Use SaltStack or Ansible for automated deployment and configuration of services on VMs.
- Develop custom scripts or services in Python, Go, or Java to eliminate manual toil.
- Set up and manage HA services like MySQL and Aerospike.
- Implement database replication across regions, manage migrations, and ensure data synchronization during network partitions.
- Implement robust backup strategies for databases, logs, and system configurations.
- Implement and manage monitoring systems like Prometheus, VictoriaMetrics, or Riemann.
- Use Loki for centralized logging and Grafana for building mission-critical dashboards and alerting.
- Integrate platform and VM-level services with the SOC; collaborate with Infosec to fix vulnerabilities.
- Manage Nginx and HAProxy at an expert level (proxy management, endpoint addition, and complex rewrite rules).
- Work with RabbitMQ (RMQ) and containerized services using Docker.
- Proactively identify and solve infrastructure challenges before they impact users.
- Lead incident response, create Root Cause Analysis (RCA) documents, and manage post-mortems.
- Define SLOs/SLIs and reduce toil through automation.
- Identify and implement cloud resource optimizations to reduce costs.
What We're Looking For
- 5 to 12 years of experience in an SRE or high-level DevOps role.
- Deep hands-on experience with either Azure (VMs, Storage Accounts, Cosmos DB, ADX) or AWS (EC2, S3, RDS).
- Expert proficiency in Linux (Ubuntu) for system administration and kernel-level performance troubleshooting.
- Deep knowledge of DNS, BGP routing, and private connectivity troubleshooting.
Technical Stack
- Cloud: Microsoft Azure, AWS
- OS/Languages: Ubuntu/Linux, Python, Go, Java
- Infrastructure as Code: Terraform, SaltStack, Ansible
- Data Stores: MySQL, Aerospike
- Monitoring/Observability: Prometheus, VictoriaMetrics, Riemann, Loki, Grafana
- Networking/Services: Nginx, HAProxy, RabbitMQ (RMQ)
- Containerization: Docker
Benefits & Compensation
- Medical Insurance
- Critical Illness Insurance
- Accidental Insurance
- Life Insurance
- Employee Assistance Program
- Onsite Medical Center
- Emergency Support System
- Maternity Benefit
- Paternity Benefit Program
- Adoption Assistance Program
- Day-care Support Program
- Relocation benefits
- Transfer Support Policy
- Travel Policy
- Employee PF Contribution
- Flexible PF Contribution
- Gratuity
- NPS
- Leave Encashment
- Higher Education Assistance
- Car Lease
- Salary Advance Policy
PhonePe is an equal opportunity employer and is committed to treating all its employees and job applicants equally, regardless of gender, sexual preference, religion, race, color, or disability.



