As a Senior Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical distributed systems that power autonomous maritime vessels. You will play a key role in defining and implementing SRE best practices, driving incident management and post-mortem culture, and enabling engineering teams to build resilient, observable, and maintainable systems. This role requires deep technical expertise, strong collaboration skills, and a proactive mindset to anticipate and mitigate systemic risks before they impact operations.
Responsibilities
- Architect and refine reliability frameworks for distributed and cloud-native systems.
- Establish and operationalize SRE practices such as service level indicators (SLIs), service level objectives (SLOs), error budgets, and capacity forecasting.
- Collaborate with engineering teams to build systems with reliability, scalability, and maintainability at their core.
- Proactively identify and resolve systemic risks in infrastructure and services.
- Lead incident management, including on-call duties, escalation protocols, and post-mortem analysis.
- Conduct deep-dive root cause investigations and implement corrective actions to prevent recurrence.
- Enhance operational preparedness through automation, runbooks, and resilience testing.
- Reduce manual operational burden via tooling, automation, and process optimization.
- Develop and manage observability platforms covering metrics, logs, traces, and alerting.
- Ensure production systems and data pipelines are observable, debuggable, and high-performing.
- Perform performance analysis and tuning across infrastructure and service layers.
- Build automated solutions to increase system reliability, deployment safety, and recovery efficiency.
- Work closely with DevOps and platform teams on CI/CD reliability and safe rollout strategies.
- Support and optimize Kubernetes environments and containerized workloads.
- Partner with security teams to integrate resilience and security into system design.
- Participate in disaster recovery planning and validation exercises.
- Uphold strong operational standards in access control, secrets handling, and change management.
Requirements
- Minimum of 7 years of experience in roles focused on site reliability, systems engineering, or infrastructure operations.
- Proven track record operating large-scale distributed systems in production.
- Strong knowledge of Linux, networking, and distributed systems principles.
- Hands-on experience with Kubernetes and container orchestration technologies.
- Proficiency in programming or scripting in Go, Python, or a similar language.
- Demonstrated experience building and managing observability systems in production environments.
- Experience leading incident response and driving reliability improvements.
- Excellent communication skills, with the ability to collaborate across engineering functions.
- Must be a U.S. citizen.
- Must be eligible to obtain a government security clearance if required.
Nice to Have
- Experience with autonomy, robotics, simulation, or real-time control systems.
- Familiarity with AWS and large-scale cloud infrastructure operations.
- Background in chaos engineering, fault injection, or resilience testing.
- Knowledge of CI/CD pipelines and progressive delivery techniques.
- Experience in high-reliability or safety-critical operational environments.
Tech Stack
Kubernetes, Go, Python, AWS, Distributed Systems, Linux, Networking, Container Orchestration, Observability, CI/CD, Chaos Engineering, Fault Injection, Resilience Testing
Benefits
- Employer-paid health, dental, and vision insurance for employees and their families.
- Employer-provided life insurance.
- Participation in a 401(k) plan with company matching.
- Unlimited paid time off, with a mandatory minimum of two weeks.
- Equity compensation package.
- Work-from-home or home office stipend.
- Global Entry program benefit.
- 16 weeks of paid parental leave.
Team
This role is embedded within the Platform Reliability Engineering team, which is responsible for ensuring the stability, scalability, and operational excellence of the company's core infrastructure and services. The team works closely with product engineering, DevOps, and security to drive SRE adoption, improve system resilience, and reduce operational toil. You will collaborate with cross-functional teams across software development, operations, and maritime systems to deliver reliable, high-performance solutions for autonomous vessel operations.
Additional Information
- This is a full-time, remote position based in the United States.
- Occasional travel to company offices or maritime test sites may be required.
- Candidates must be able to work during core business hours in a US time zone.
- The company provides all necessary equipment and technical resources.
- We are committed to building a diverse and inclusive workplace.
- Employees are encouraged to participate in professional development and conference attendance.


