Remote (Global) Full-time

Endgame Systems, LLC is hiring a Senior Site Reliability Engineer (Resilience) - Platform Resilience

About the Role

As a Senior Site Reliability Engineer (Resilience), you'll drive the evolution of large-scale, multi-cloud systems by designing automation and improving platform reliability. You'll work across engineering teams to build resilient infrastructure for cloud-hosted and serverless services, ensuring systems scale efficiently and operate consistently under real-world demands.

What You'll Do

  • Design and implement automation to streamline system engineering workflows and strengthen platform stability.
  • Scale global infrastructure to meet growing demand through code-driven tooling and maintainable systems.
  • Lead incident response and problem management efforts to reduce recurring customer impact and improve resolution efficiency.
  • Participate in a follow-the-sun on-call rotation across time zones, with shifts falling primarily within regular working hours.
  • Foster a culture of shared ownership, operational rigor, and continuous learning within engineering teams.

Requirements

  • Demonstrated experience applying engineering principles to improve platform reliability and reduce operational toil.
  • Customer-focused mindset with the ability to assess and resolve operational issues through an SRE lens.
  • Strong software engineering background enabling effective collaboration on system design and implementation.

Preferred Qualifications

  • Hands-on experience with public cloud platforms and managed Kubernetes services.
  • Proven work with Infrastructure-as-Code tools such as Crossplane or Terraform in SaaS environments.
  • Experience operating large-scale Kubernetes clusters across multiple cloud providers.
  • Proficiency in Golang or other programming languages for building system-level tools.
  • Familiarity with container technologies like Docker and distributed Linux environments.
  • Track record of improving alerting, incident response, and observability systems using tools like Prometheus, Influx, Graphite, or the Elastic Stack.
  • Experience mentoring engineers and promoting knowledge sharing in distributed teams.
  • Background in inclusive communication practices that strengthen team and partner relationships.
  • Remote work experience in self-directed, globally distributed teams.

Benefits

  • Compensation aligned with role impact, not prior salary history.
  • Comprehensive health coverage for employees and dependents in many regions.
  • Flexible work arrangements with support for remote and asynchronous collaboration.
  • Generous annual vacation allowance.
  • Up to $2,000 in matched donations for charitable giving or community service.
  • 40 hours annually dedicated to volunteer activities.
  • Minimum of 16 weeks of parental leave.
  • Commitment to diversity, equity, and inclusion across a global workforce.
  • Clear pathways for professional development regardless of age, background, or tenure.

Required Skills

Kubernetes, Terraform, Docker, Prometheus, Elastic Stack, Golang, Linux, Crossplane, Influx, Graphite, Site Reliability Engineering, Platform Resilience, SRE, Cloud Infrastructure, Observability

About the Company

Endgame Systems, LLC provides consulting services related to Elastic technology to government agencies with heightened security needs. It is a wholly owned subsidiary of Elastic, focused on government services.

Job Details

  • Department: Platform Engineering
  • Category: Infrastructure
  • Posted: 11 days ago