United States Employment USD 150,000 - 185,000 Yearly

HavocAI is hiring a Senior Site Reliability Engineer

About the Role

HavocAI is seeking a Senior Site Reliability Engineer with 7+ years of experience to join our Cloud Platform team. In this role, you will be a key technical leader responsible for ensuring the availability, performance, and resilience of mission-critical services that support autonomy, simulation, and data-intensive workloads.

What You'll Do

  • Design and evolve reliability architecture for distributed and cloud-hosted systems.
  • Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning.
  • Partner with platform and application teams to design systems for reliability, scalability, and operability.
  • Identify and mitigate systemic reliability risks across infrastructure and services.
  • Lead incident response processes including on-call rotations, escalation, and post-incident reviews.
  • Conduct root cause analysis for complex production incidents and drive long-term improvements.
  • Improve operational readiness through runbooks, automation, and resilience testing.
  • Reduce operational toil through tooling, automation, and process improvements.
  • Design and maintain observability systems for metrics, logging, tracing, and alerting.
  • Ensure services and data pipelines are observable, debuggable, and performant in production.
  • Drive performance analysis and tuning across infrastructure and service layers.
  • Build automation to improve system reliability, deployment safety, and recovery processes.
  • Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns.
  • Support and improve Kubernetes-based environments and containerized workloads.
  • Collaborate with security teams to ensure secure and resilient system design.
  • Participate in disaster recovery planning and testing.
  • Maintain strong operational practices around access control, secrets management, and change management.

What We're Looking For

  • 7+ years of experience in SRE, infrastructure, or systems engineering roles.
  • Strong experience operating large-scale distributed production systems.
  • Deep understanding of Linux systems, networking, and distributed systems fundamentals.
  • Hands-on experience with Kubernetes and container orchestration.
  • Programming or scripting experience in Go, Python, or similar languages.
  • Experience designing and operating observability systems for production environments.
  • Proven ability to lead incident response and reliability improvements.
  • Strong communication skills and ability to collaborate across engineering teams.
  • Must be a US Citizen.
  • Must be eligible to obtain a Government Clearance if required.

Nice to Have

  • Experience supporting autonomy, robotics, simulation, or real-time systems.
  • Familiarity with AWS and large-scale cloud infrastructure.
  • Experience with chaos engineering, fault injection, or resilience testing.
  • Knowledge of CI/CD systems and progressive delivery practices.
  • Experience working in high-reliability or safety-critical environments.

Technical Stack

  • Kubernetes
  • Go
  • Python
  • AWS

Team & Environment

You will be a key technical leader within the Cloud Platform team.

Benefits & Compensation

  • 100% Employer-paid Health, Dental and Vision Insurance for you and your family
  • Life Insurance (Employer Paid)
  • Ability to participate in the company's 401(k) program with matching
  • Unlimited PTO policy with an enforced 2-week minimum
  • Equity Package
  • Work / Home Office Stipend
  • Global Entry
  • 16-Week Paid Parental Leave
  • Monthly Health and Wellness Stipend

HavocAI is an Equal Opportunity Employer and is committed to creating an inclusive and diverse workplace. We welcome applicants from all backgrounds and do not discriminate based on race, color, religion, gender, sexual orientation, age, national origin, disability, veteran status, or any other legally protected status.

Required Skills
Kubernetes, Go, Python, AWS, Linux, Networking, Distributed Systems, Site Reliability Engineering, Infrastructure, Container Orchestration, Scripting, Systems Engineering
About company
HavocAI

HavocAI is an unquestioned leader in collaborative autonomy. We set the standard for autonomous surface vessels for a wide range of defense and commercial maritime missions.

Job Details
Department: Engineering
Category: Infrastructure
Posted 14 days ago