Xceptor is hiring a Site Reliability Engineer to join a cross-cutting function that partners with tribes across the company to make services reliable, performant, secure, and operable in production. This is an AI-first role where you will use AI routinely to accelerate investigation, diagnostics, runbook creation, and automation, while embedding reliability into the delivery process from the start.
What You'll Do
- Contribute at the tribe level to service reliability, performance, and operability.
- Help build and run the reliability system: observability standards, incident response practices, runbooks, and automation.
- Partner closely with Software Engineering, QA, Platform Engineering, and Senior/Lead SREs.
- Own well-scoped operational improvements end-to-end, from design and implementation through testing, rollout, and measurement.
- Contribute to defining and improving SLIs/SLOs and service health signals, aligned to customer outcomes.
- Implement reliability improvements using established patterns such as timeouts, retries, graceful degradation, and safe failure modes.
- Support capacity and performance work, including basic baselining, load investigation, and scaling hygiene.
- Help maintain operational quality across production and staging environments and improve environment consistency.
- Participate in incident response and on-call rotations, contributing to triage, mitigation, and recovery.
- Produce clear post-incident notes and support root cause analysis, focusing on actions that prevent recurrence.
- Create and improve runbooks and playbooks so incidents are resolved faster and more consistently.
- Help improve change safety through practical release/readiness checks and operational guardrails.
- Implement and improve observability for services: logs, metrics, traces, dashboards, and alerting aligned to standards.
- Tune alerts to reduce noise and improve actionability; help manage flakiness and false positives.
- Build and maintain service health dashboards that support quick diagnosis and release confidence.
- Work with QA and Engineering to align operational signals with end-to-end journey health.
- Automate repetitive operational tasks and reduce toil through scripts, tooling, and pipeline improvements.
- Contribute to deployment automation and reliability guardrails in CI/CD, working with Platform Engineering.
Team & Environment
You will be part of a cross-cutting function that partners with tribes across Xceptor, embedding reliability practices directly into their workflows and systems.
Xceptor fosters a company culture built on Client Centricity, One Team, and Impactful work.