London or Guildford Employment

Xceptor is hiring a Site Reliability Engineer

About the Role

Xceptor is looking for a Site Reliability Engineer to join our cross-cutting SRE function. In this role, you will partner with tribes across the company to make our services reliable, performant, secure, and operable in production. This is an AI-first SRE role where you will use AI to accelerate investigation, diagnostics, runbook creation, and automation, while staying accountable for verification and safe operation.

What You'll Do

  • Contribute at the tribe level to service reliability, performance, and operability.
  • Help build and run the reliability system: observability standards, incident response practices, runbooks, and automation that reduces toil.
  • Partner closely with Software Engineering, QA, Platform Engineering, and Senior/Lead SREs to embed reliability into delivery.
  • Own well-scoped operational improvements end-to-end, from design through to implementation, testing, rollout, and measurement.
  • Contribute to defining and improving SLIs, SLOs, and service health signals aligned to customer outcomes.
  • Implement reliability improvements within established patterns such as timeouts, retries, graceful degradation, and safe failure modes.
  • Support capacity and performance work, including basic baselining, load investigation, and scaling hygiene.
  • Help maintain operational quality across production and staging environments, improving consistency where possible.
  • Participate in incident response and on-call duties, contributing to triage, mitigation, and recovery.
  • Produce clear post-incident notes and support root cause analysis, focusing on actions that prevent recurrence.
  • Create and improve runbooks and playbooks to make incident resolution faster and more consistent.
  • Help improve change safety through practical release checks, readiness checks, and operational guardrails.
  • Implement and improve observability for services, including logs, metrics, traces, dashboards, and alerting aligned to standards.
  • Tune alerts to reduce noise and improve actionability, helping manage flakiness and false positives.
  • Build and maintain service health dashboards that support quick diagnosis and release confidence.
  • Work with QA and Engineering to align operational signals with end-to-end journey health.
  • Automate repetitive operational tasks and reduce toil through scripts, tooling, and pipeline improvements.
  • Contribute to deployment automation and reliability guardrails in CI/CD, working with Platform Engineering.

Team & Environment

You will be part of a cross-cutting SRE function that partners with tribes across Xceptor. The team culture emphasises client centricity, operating as one team, and delivering impactful results.

Required Skills
Site Reliability Engineering, SRE, AWS, Azure, GCP, Kubernetes, Docker, Terraform, CI/CD, GitLab, Jenkins, GitHub Actions, Python, Bash, Monitoring, Alerting, Incident Management, On-call
About the Company
Xceptor

Xceptor's work centres on data manipulation: it sources data from wherever it flows, then curates, normalises, validates, repairs, and enriches that data so it reaches its destination in a reliable and consistent format. The company specialises in the Financial Services vertical, enabling business users to solve their data challenges themselves.

Job Details
Department: Engineering
Category: Infrastructure
Posted 14 days ago