Site Reliability Engineer at GitLab

You will ensure the reliability, scalability, and efficiency of user-facing services and production systems, specializing in Environment Automation. Your work will focus on automating the provisioning, management, and maintenance of numerous isolated environments to support secure and consistent operations at scale.

Responsibilities

Design and deploy automation for provisioning and managing large numbers of isolated environments using Terraform, Ansible, and Kubernetes
Handle complex state management and workspace configurations to support scalability and long-term maintainability
Diagnose and resolve issues across Kubernetes clusters, cloud platforms, and application layers
Determine root causes of deployment failures, pod crash loops, and scheduling conflicts to maintain service uptime
Replace manual processes with scalable infrastructure-as-code solutions
Automate version upgrades, configuration changes, and provisioning workflows across multi-tenant systems
Develop observability pipelines using Prometheus, ELK, and Grafana to detect performance bottlenecks and optimize resource use
Lead incident response and conduct postmortems to improve system resilience
Apply deep technical expertise to resolve operational issues and establish standards that reduce future risks
Influence system architecture decisions related to automation, scalability, and operational best practices
Collaborate with engineering teams to enhance automation, platform stability, and production readiness

Requirements

Demonstrated experience operating and troubleshooting production workloads across multiple environments or tenants
Strong understanding of failure modes in distributed systems at scale and techniques to build resilient systems
Extensive hands-on experience with Terraform, including state management, workspaces, and scalable automation patterns
Skilled at resolving state isolation issues and writing reliable, reusable infrastructure code
Proficient in diagnosing deployment issues, analyzing pod logs, and debugging scheduling and rollback problems in live systems
Knowledge of how Kubernetes components such as pods, ReplicaSets, and controllers function in production
Ability to read and debug code written in Go or Ruby
Experience identifying performance and scalability issues and improving infrastructure tooling through code analysis
Background supporting infrastructure serving multiple customers or environments concurrently
Comfortable managing isolation, scaling, monitoring, and incident response across diverse workloads
Able to analyze and reason through complex system and operational challenges
On-call experience, including leading technical incident resolution under pressure
Proven ability to collaborate across teams and with stakeholders to solve technical problems while meeting service commitments
Familiarity with using GitLab daily for infrastructure automation, collaboration, and operational workflows

Nice to Have

Experience with Ansible and templating tools such as Jsonnet

Tech Stack

Terraform, Ansible, Kubernetes, Helm Charts, omnibus-gitlab, GCP, AWS, Prometheus, ELK, Grafana, Jsonnet, Go, Ruby

Benefits

Comprehensive benefits supporting health, financial security, and personal well-being
Flexible Paid Time Off policy
Employee-led resource groups to support inclusion and community
Equity compensation and Employee Stock Purchase Plan
Funding for growth, learning, and professional development

Compensation

The base salary range for this role is $124,300 - $266,400 USD per year for U.S. residents. This does not include bonuses, equity, or benefits. Equity compensation and an Employee Stock Purchase Plan are offered. Sales roles may be eligible for incentive pay up to 100% of base salary, though this role is not sales-focused.

Work Arrangement

Global (worldwide). All roles are remote.

Team

Part of the Dedicated team, which delivers a fully managed, single-tenant GitLab experience through the GitLab Dedicated platform.

AI is integrated as a core productivity tool across daily workflows
Culture of high performance, continuous learning, and knowledge sharing
All voices are valued and encouraged to contribute
Environment where careers grow rapidly and innovation thrives
Collaboration with technical leaders to solve complex challenges

Additional Information

Careers grow quickly here, innovation is central, and every team member's input is valued
AI is expected to be used by all team members as a key driver of efficiency, innovation, and impact
Candidates with diverse experience levels are encouraged to apply
People from underrepresented groups are strongly encouraged to apply, even if they don’t meet every listed requirement
Hiring occurs globally, with team members located around the world
Some roles may have location-specific eligibility criteria
The company is an equal opportunity employer and affirmative action workplace