Responsibilities
- Serve as Incident Commander for major incidents — coordinating cross-functional response teams, driving investigation, making escalation decisions, and ensuring incidents are resolved within SLA targets.
- Own all incident communications: draft and send clear, timely updates to senior leadership, Customer Success, and partner/customer contacts throughout the incident lifecycle, and manage customer-facing status page updates (status.xsolla.com).
- Facilitate blameless Post-Incident Reviews (PIRs) for major incidents — leading root cause identification, assigning corrective actions with clear owners and deadlines, and tracking them to closure.
- During non-incident periods, proactively analyze incident trends, recurring issues, and production bugs — identify patterns, create Problem tickets, and report findings and recommendations to product and engineering teams on a regular cadence.
- Enforce the incident management framework across the organization, including the severity model, priority matrix, SLA targets, escalation procedures, and deployment readiness gates.
- Oversee and mentor the Operations Engineer on your shift — coaching on triage, investigation, runbook execution, and documentation quality while conducting regular knowledge transfer sessions to build depth across the service portfolio.
- Produce shift handoff reports and deliver regular operational reporting: incident trends, KPI performance (MTTD, MTTA, MTTR), SLA adherence, proactive detection rate, and repeat incident analysis.
- Audit service catalogue completeness on a regular cadence and govern JIRA Service Management workflows for incident, PIR, and problem management.
- Cover for the Operations Engineer role during absences, breaks, or surge incidents. Participate in the weekend on-call rotation for major incidents.
Requirements
- 6+ years of experience in incident management, SRE, NOC leadership, or technical operations in a production environment supporting high-availability, high-transaction systems (payments, e-commerce, SaaS, or gaming platforms preferred).
- Proven incident management experience — coordinating multi-team response, making real-time escalation decisions, and communicating with executive stakeholders under pressure.
- Excellent written and verbal communication skills in English — ability to draft clear, concise executive updates at 3 AM under pressure, facilitate blameless PIRs, present operational metrics to senior leadership, and communicate incident status to customers and partners with clarity and professionalism.
- Strong ITIL foundation — understanding of incident, problem, and change management lifecycles with practical experience implementing or operating ITIL-aligned workflows.
- Technical depth across the observability stack — ability to read and interpret logs, traces, and metrics in Datadog (or equivalent: Grafana, Splunk, New Relic). Understanding of APM, SLOs, error budgets, burn-rate alerting, and synthetic monitoring.
- Hands-on experience with incident tooling: Datadog, PagerDuty or OpsGenie, JIRA or JIRA Service Management, Slack, and Confluence.
- Analytical mindset — ability to identify trends, patterns, and recurring issues from incident data and translate them into actionable recommendations for product and engineering teams.
- Experience with SLA/SLO-driven operations where MTTD, MTTA, and MTTR are measured, reported, and improved.
- Experience with or strong interest in AI/ML-assisted operations: anomaly detection, alert correlation, predictive alerting, automated remediation, or self-healing automation.
- Comfort with 24x7 shift-based operations as part of a follow-the-sun model with handoff overlaps. Weekend on-call (rotating) for critical severities is required.
Nice to Have
- Experience in the gaming, payments, or fintech industry.
- Experience with customer/partner-facing incident communications and status page management.
- JIRA Service Management administration experience: workflows, SLA timers, automation rules, queues, and permissions.
- Familiarity with Datadog Service Catalog, scorecards, and SLOs — especially burn-rate alerts and multi-window SLOs.
- Experience building an operations function from scratch — defining processes, writing runbooks, establishing governance cadences.
- Background in Kubernetes, cloud infrastructure (GCP preferred), microservices architecture, or distributed systems.
- ITIL certification (Foundation or higher) is a plus but not required.
Benefits
- Medical, dental, and vision
- PTO
- A personalized career roadmap for each employee
- Professional development through training and educational opportunities
Additional Information
- This position is subject to a background check, which may include a criminal history check, employment verification, and education verification.
- The background check is relevant to this position because the role involves accessing confidential company data and ensuring compliance with regulatory requirements.
