Remote (Global)

Ethos is hiring a Lead Site Reliability Engineer

About the Role

Happy Money is hiring a Lead Site Reliability Engineer to spearhead the SRE team, driving operational maturity by defining reliability standards, hardening our security posture, and scaling the Intellum platform. You are an experienced software engineer who excels at architecture, code optimization, and deep troubleshooting.

What You'll Do

  • Set clear goals for the SRE team and partner with Engineering leadership to align platform initiatives with business objectives.
  • Lead the definition and enforcement of SLAs, SLIs, and SLOs. Architect observability frameworks to translate telemetry data into actionable roadmaps.
  • Take ownership of critical code components and lead efforts to identify bottlenecks, optimize performance, and improve code quality across the engineering department.
  • Champion infrastructure security. Partner with InfoSec to define hardening standards, manage perimeter defense, and automate vulnerability remediation within the CI/CD pipeline.
  • Participate in the 24x7 on-call rotation and lead post-incident reviews (RCAs), ensuring action items are implemented to improve MTTR and prevent recurrence.
  • Empower developers with better tooling and guidance on performant coding practices, fostering a culture of collaboration and reliability.

What We're Looking For

  • 10+ years of engineering experience, with 5+ years specifically developing Ruby on Rails applications.
  • Expertise in Cloud Computing (AWS/GCP) and Infrastructure as Code (Terraform/Ansible).
  • Strong proficiency with SQL databases (PostgreSQL) and the ability to quickly navigate and optimize complex, unfamiliar codebases.
  • Proven experience designing monitoring solutions (Datadog, New Relic, Prometheus) based on the 'Golden Signals'.
  • Demonstrated ability to define SLIs/SLOs from scratch, negotiate Error Budgets, and use data to balance feature velocity with reliability.
  • Experience securing cloud environments and container platforms (Kubernetes), including hands-on management of WAF rules and edge security.
  • Experience leading post-incident reviews (RCAs) and implementing action items that directly improve MTTR and MTTD.
  • Proven experience leading technical teams, mentoring engineers, and working in a team-oriented, collaborative environment with strong communication skills.
  • Skilled in documenting solutions and training operational teams on how to effectively support and maintain systems.
  • Demonstrated ability to communicate clearly, seek help proactively, and take ownership of tasks, leading them to completion.
  • Bachelor’s degree in Computer Science or related technical field.

Nice to Have

  • Experience developing solutions with server automation tools such as Terraform and Ansible.
  • Experience in writing and maintaining CI/CD pipelines and services.
  • Experience in building, deploying, and optimizing Kubernetes-based infrastructure.
  • Experience configuring and managing Web Application Firewalls (WAF) and DDoS protection mechanisms.

Technical Stack

  • Languages/Frameworks: Ruby on Rails, Node.js
  • Databases/Caches: PostgreSQL, MongoDB, Redis, Memcached, Elasticsearch
  • Infrastructure/Queueing: Sidekiq, ActiveJob, WebSockets
  • Platforms: Linux, AWS, Google Cloud
  • Cloud Services: MongoDB Atlas, ECS/EC2/Kubernetes, ElastiCache, Memorystore, RDS, Cloud SQL, BigQuery
  • Tools: GitHub, Terragrunt, Terraform, Ansible, Spinnaker, Jenkins
  • Monitoring: New Relic, AWS CloudWatch, Google Cloud Stackdriver, Squadcast, Jira

Team & Environment

You will partner with Engineering leadership to define platform strategy and reliability goals.

Benefits & Compensation

  • Medical - 100% of employee premiums for selected individual plans
  • Dental - 100% of employee premiums covered
  • Vision - 100% of employee premiums covered
  • LinkedIn Learning
  • 401(k) plus matching (US Based Only)
  • Unlimited PTO
  • Calm subscription
  • Annual Company Retreat
  • Personal development budgets

Work Mode

This is a remote position open to candidates in the United States.

Intellum is an equal-opportunity employer. We're committed to building an inclusive team that celebrates diversity in people, perspectives, and backgrounds regardless of race, color, national origin, gender, sexual orientation, age, religion, disability, citizenship, veteran status, or any other protected status.

Required Skills
Ruby on Rails, Node.js, PostgreSQL, MongoDB, Redis, Memcached, Sidekiq, Elasticsearch, AWS, Kubernetes, Docker, Terraform, CI/CD, Observability, Incident Management
About company
Ethos

Ethos is a leading life insurance technology company on a mission to protect families by democratizing access to life insurance and empowering agents at scale. It offers instant, accessible life insurance products with a seamless online process requiring no medical exams.

Job Details
Category infrastructure
Posted 2 months ago