Mastercard powers economies and empowers people across more than 200 countries and territories. We are committed to building an inclusive digital economy that benefits everyone, everywhere. We are hiring a Site Reliability Engineer to lead efforts to ensure the end-to-end service quality, availability, scalability, and resilience of our critical core payment systems.
What You'll Do
- Lead continuous assessments of application infrastructure health, performance, monitoring, alerting, and capacity for critical Mastercard applications.
- Collaborate with Product and Development teams to forecast growth requirements and ensure scalability and resiliency.
- Champion observability by assessing environments for monitoring gaps and implementing strategies to close them.
- Build custom dashboards to investigate and perform root cause analysis on complex issues.
- Lead regular incident reviews with internal support teams to ensure root causes are identified and remediated.
- Develop and implement strategies to mitigate risks from patterns of failure or compatibility issues.
- Leverage automation and AI technologies to enhance proactive issue detection and enable self-healing capabilities.
- Develop testing and validation plans for new environment builds and disaster recovery exercises.
- Champion continuous learning, development, and knowledge sharing across networking and infrastructure disciplines.
- Lead training initiatives for team members and stakeholders on the networking aspects of the platforms.
- Evaluate vendor hardware, firmware, and software upgrade roadmaps and conduct proof-of-concept testing.
What We're Looking For
- 5–10 years of experience in an SRE or SRE-related operations role.
- 3+ years supporting e-commerce, financial services, or large-scale SaaS platforms.
- Excellent infrastructure troubleshooting and analytical problem-solving skills.
- Strong hands-on experience with observability tools such as Splunk and Dynatrace.
- Proven ability to triage and investigate complex issues.
- Familiarity with network telemetry tools such as SolarWinds and NetScout.
- Proficiency in packet-level debugging using tcpdump and Wireshark.
- Broad understanding of end-to-end infrastructure supporting payment platforms.
- Experience with automation and Infrastructure as Code tools such as Chef, Ansible, and Terraform.
- Experience with structured data formats (JSON/YAML).
- Excellent communication skills with the ability to coordinate cross-functional troubleshooting efforts.
- Demonstrated ability to troubleshoot complex production issues and drive long-term corrective actions.
- Experience partnering with development teams to shape architecture and define SLIs/SLOs.
- Strong understanding of monitoring ecosystems, including Prometheus, Grafana, ELK/EFK, and OpenTelemetry.
- Effective incident management skills with a structured, analytical approach.
Technical Stack
- Monitoring/Observability: Splunk, Dynatrace, SolarWinds, NetScout, Prometheus, Grafana, ELK/EFK, OpenTelemetry
- Networking Tools: tcpdump, Wireshark
- Automation/IaC: Chef, Ansible, Terraform
- Data Formats: JSON, YAML
Team & Environment
You will join the Payments Network SRE team, one of Mastercard's program-aligned Site Reliability Engineering (SRE) teams.
Mastercard is an equal opportunity employer.




