Xceptor is looking for a Site Reliability Engineer to join our cross-cutting SRE function. In this role, you will partner with tribes across the company to make our services reliable, performant, secure, and operable in production. This is an AI-first SRE role: you will use AI to accelerate investigation, diagnostics, runbook creation, and automation, while remaining accountable for verification and safe operation.
What You'll Do
- Contribute at the tribe level to service reliability, performance, and operability.
- Help build and run the reliability system: observability standards, incident response practices, runbooks, and automation that reduces toil.
- Partner closely with Software Engineering, QA, Platform Engineering, and Senior/Lead SREs to embed reliability into delivery.
- Own well-scoped operational improvements end-to-end, from design through implementation, testing, rollout, and measurement.
- Contribute to defining and improving SLIs, SLOs, and service health signals aligned to customer outcomes.
- Implement reliability improvements within established patterns such as timeouts, retries, graceful degradation, and safe failure modes.
- Support capacity and performance work, including basic baselining, load investigation, and scaling hygiene.
- Help maintain operational quality across production and staging environments, improving consistency where possible.
- Participate in incident response and on-call duties, contributing to triage, mitigation, and recovery.
- Produce clear post-incident notes and support root cause analysis, focusing on actions that prevent recurrence.
- Create and improve runbooks and playbooks to make incident resolution faster and more consistent.
- Help improve change safety through practical release checks, readiness checks, and operational guardrails.
- Implement and improve observability for services, including logs, metrics, traces, dashboards, and alerting aligned to standards.
- Tune alerts to reduce noise and improve actionability, helping manage flakiness and false positives.
- Build and maintain service health dashboards that support quick diagnosis and release confidence.
- Work with QA and Engineering to align operational signals with end-to-end journey health.
- Automate repetitive operational tasks and reduce toil through scripts, tooling, and pipeline improvements.
- Contribute to deployment automation and reliability guardrails in CI/CD, working with Platform Engineering.
Team & Environment
You will be part of a cross-cutting SRE function that partners with tribes across Xceptor. The team culture emphasizes client centricity, operating as one team, and delivering impactful results.





