About the Role

The candidate will collaborate with engineering teams to build and maintain highly available systems, improve observability, automate operations, and drive reliability best practices across production environments.

Responsibilities

Design and implement scalable infrastructure for autonomous vehicle operations
Own on-call incident response and postmortem analysis processes
Develop automation tools to reduce manual operational overhead
Enhance monitoring, alerting, and metrics collection systems
Collaborate with development teams to improve service reliability
Drive adoption of SRE principles across engineering teams
Optimize system performance and troubleshoot complex production issues
Maintain and evolve CI/CD pipelines and deployment strategies
Ensure infrastructure meets security and compliance standards
Lead capacity planning and scalability initiatives
Contribute to disaster recovery and business continuity planning
Improve logging infrastructure for faster root cause analysis
Support cloud and edge computing environments
Work closely with product teams to influence system design
Promote a blameless postmortem culture and ensure follow-through on action items
Evaluate and integrate new reliability tools and technologies
Document system architecture and operational procedures
Mentor junior engineers in SRE best practices
Participate in system design reviews and provide operational feedback
Monitor service level objectives and manage error budgets
Reduce technical debt in production systems
Implement proactive alerting to minimize mean time to detection
Support high-availability requirements for real-time vehicle operations
Contribute to the incident command structure during major outages
Ensure systems are resilient under peak load conditions

Compensation

Competitive salary and equity package

Work Arrangement

Hybrid or remote with team presence in the Bay Area

Team

Engineering team focused on autonomous middle-mile logistics

Why This Role Matters

The systems you maintain directly impact the safety and efficiency of autonomous delivery fleets operating in real-world conditions.
Your work ensures minimal downtime for critical logistics operations serving commercial customers.
You’ll help scale infrastructure to support rapid geographic and operational expansion.

Tech Stack

Kubernetes for container orchestration
AWS and hybrid cloud environments
Prometheus, Grafana, and ELK stack for observability
Terraform for infrastructure as code
GitLab CI/CD for pipelines
Go and Python for tooling and automation
gRPC and REST APIs for service communication

Growth Opportunities

Opportunity to shape SRE practices in a growing engineering organization.
Exposure to cutting-edge challenges in autonomous vehicle operations.
Leadership roles available for staff-level contributors.
Cross-functional collaboration with AI, robotics, and product teams.

Available for qualified candidates

Gatik AI is hiring a Senior/Staff Site Reliability Engineer
