Bangalore, Karnataka, India Employment

PhonePe is hiring a Site Reliability Engineer

About the Role

PhonePe is looking for a Site Reliability Engineer with 5 to 12 years of experience to manage, scale, and ensure the high availability of our core infrastructure. This role is open to experts specializing in either Microsoft Azure or AWS. You will be responsible for deep-level cloud architecture, automation, and complex networking to support a high-volume, mission-critical environment where downtime is not an option.

What You'll Do

  • Configure, maintain, and manage Ubuntu/Linux Virtual Machines in your primary cloud environment (Azure or AWS).
  • Design and manage cloud-native components for log storage, database management, and alerting (e.g., Azure Storage/ADX or AWS S3/CloudWatch).
  • Configure and maintain critical network components, including Firewalls, Route Tables, and Virtual Gateways (VPC/VNet).
  • Establish and manage high-speed connectivity via ExpressRoute (Azure) or Direct Connect (AWS), along with IPsec VPNs for external environments.
  • Resolve complex routing issues and manage network migrations with zero-to-minimal downtime.
  • Drive automation for all BAU (Business As Usual) tasks using Terraform, writing new code for all infrastructure components.
  • Use SaltStack or Ansible for automated deployment and configuration of services on VMs.
  • Develop custom scripts or services in Python, Go, or Java to eliminate manual toil.
  • Set up and manage HA services like MySQL and Aerospike.
  • Implement database replication across regions, manage migrations, and ensure data synchronization during network partitions.
  • Handle robust backup strategies for databases, logs, and system configurations.
  • Implement and manage monitoring systems like Prometheus, VictoriaMetrics, or Riemann.
  • Use Loki for centralized logging and Grafana to build mission-critical dashboards and alerting.
  • Integrate platform and VM-level services with the SOC; collaborate with Infosec to fix vulnerabilities.
  • Manage Nginx and HAProxy, including proxy management, endpoint addition, and complex rewrite rules.
  • Work with RabbitMQ (RMQ) and containerize services using Docker.
  • Proactively identify and solve infrastructure challenges before they impact users.
  • Lead incident response, create Root Cause Analysis (RCA) documents, and manage post-mortems.
  • Define SLOs/SLIs and reduce toil through automation.
  • Identify and implement cloud resource optimization to save costs.

What We're Looking For

  • 5 to 12 years in an SRE or high-level DevOps role.
  • Deep hands-on experience with either Azure (VMs, Storage Accounts, Cosmos DB, ADX) or AWS (EC2, S3, RDS).
  • Expert proficiency in Linux (Ubuntu) for system administration and kernel-level performance troubleshooting.
  • Deep knowledge of DNS, BGP routing, and private connectivity troubleshooting.

Technical Stack

  • Cloud: Microsoft Azure, AWS
  • OS/Languages: Ubuntu/Linux, Python, Go, Java
  • Infrastructure as Code: Terraform, SaltStack, Ansible
  • Data Stores: MySQL, Aerospike
  • Monitoring/Observability: Prometheus, VictoriaMetrics, Riemann, Loki, Grafana
  • Networking/Services: Nginx, HAProxy, RabbitMQ (RMQ)
  • Containerization: Docker

Benefits & Compensation

  • Medical Insurance
  • Critical Illness Insurance
  • Accidental Insurance
  • Life Insurance
  • Employee Assistance Program
  • Onsite Medical Center
  • Emergency Support System
  • Maternity Benefit
  • Paternity Benefit Program
  • Adoption Assistance Program
  • Day-care Support Program
  • Relocation benefits
  • Transfer Support Policy
  • Travel Policy
  • Employee PF Contribution
  • Flexible PF Contribution
  • Gratuity
  • NPS
  • Leave Encashment
  • Higher Education Assistance
  • Car Lease
  • Salary Advance Policy

PhonePe is an equal opportunity employer and is committed to treating all its employees and job applicants equally, regardless of gender, sexual preference, religion, race, color, or disability.

Required Skills
Microsoft Azure, AWS, Ubuntu/Linux, Terraform, SaltStack, Ansible, Python, Go, Java, MySQL, DNS, BGP routing, Performance Troubleshooting, System Administration
About company
PhonePe

PhonePe's flagship product is a digital payments app. Its portfolio includes distribution of financial products (Insurance, Lending, and Wealth) and new consumer tech businesses in India: Pincode for hyperlocal e-commerce and Indus AppStore, a localized app store for the Android ecosystem.

Job Details
Department: Information Technology
Category: Infrastructure
Posted: 14 days ago