You will ensure the reliability, scalability, and efficiency of user-facing services and production systems, specializing in Environment Automation. Your work will focus on automating the provisioning, management, and maintenance of numerous isolated environments to support secure and consistent operations at scale.
Responsibilities
- Design and deploy automation for provisioning and managing large numbers of isolated environments using Terraform, Ansible, and Kubernetes
- Handle complex state management and workspace configurations to support scalability and long-term maintainability
- Diagnose and resolve issues across Kubernetes clusters, cloud platforms, and application layers
- Determine root causes of deployment failures, pod crash loops, and scheduling conflicts to maintain service uptime
- Replace manual processes with scalable infrastructure-as-code solutions
- Automate version upgrades, configuration changes, and provisioning workflows across multi-tenant systems
- Develop observability pipelines using Prometheus, ELK, and Grafana to detect performance bottlenecks and optimize resource use
- Lead incident response and conduct postmortems to improve system resilience
- Apply deep technical expertise to resolve operational issues and establish standards that reduce future risks
- Influence system architecture decisions related to automation, scalability, and operational best practices
- Collaborate with engineering teams to enhance automation, platform stability, and production readiness
Requirements
- Demonstrated experience operating and troubleshooting production workloads across multiple environments or tenants
- Strong understanding of failure modes in distributed systems at scale and techniques to build resilient systems
- Extensive hands-on experience with Terraform, including state management, workspaces, and scalable automation patterns
- Skilled at resolving state isolation issues and writing reliable, reusable infrastructure code
- Proficient in diagnosing deployment issues, analyzing pod logs, and debugging scheduling and rollback problems in live systems
- Knowledge of how Kubernetes components such as pods, ReplicaSets, and controllers function in production
- Ability to read and debug code written in Go or Ruby
- Experience identifying performance and scalability issues and improving infrastructure tooling through code analysis
- Background supporting infrastructure serving multiple customers or environments concurrently
- Comfortable managing isolation, scaling, monitoring, and incident response across diverse workloads
- Able to analyze and reason through complex system and operational challenges
- On-call experience, including leading technical incident resolution under pressure
- Proven ability to collaborate across teams and with stakeholders to solve technical problems while meeting service commitments
- Familiarity with daily use of GitLab for infrastructure automation, collaboration, and operational workflows
Nice to Have
- Experience with Ansible and templating tools such as Jsonnet
Tech Stack
Terraform, Ansible, Kubernetes, Helm Charts, omnibus-gitlab, GCP, AWS, Prometheus, ELK, Grafana, Jsonnet, Go, Ruby
Benefits
- Comprehensive benefits supporting health, financial security, and personal well-being
- Flexible Paid Time Off policy
- Employee-led resource groups to support inclusion and community
- Equity compensation and Employee Stock Purchase Plan
- Funding for growth, learning, and professional development
Compensation
The base salary range for this role is $124,300 - $266,400 USD per year for U.S. residents. This range excludes bonuses, equity, and benefits. Equity compensation and an Employee Stock Purchase Plan are offered. Sales roles may be eligible for incentive pay of up to 100% of base salary; this role is not sales-focused.
Work Arrangement
Remote, worldwide. All roles are remote.
Team
Part of the Dedicated team, which delivers a fully managed, single-tenant GitLab experience through the GitLab Dedicated platform.
- AI is integrated as a core productivity tool across daily workflows
- Culture of high performance, continuous learning, and knowledge sharing
- All voices are valued and encouraged to contribute
- Environment where careers grow rapidly and innovation thrives
- Collaboration with technical leaders to solve complex challenges
Additional Information
- Careers grow quickly here, innovation is central, and every team member's input is valued
- AI is expected to be used by all team members as a key driver of efficiency, innovation, and impact
- Candidates with diverse experience levels are encouraged to apply
- People from underrepresented groups are strongly encouraged to apply, even if they don’t meet every listed requirement
- Hiring occurs globally, with team members located around the world
- Some roles may have location-specific eligibility criteria
- The company is an equal opportunity employer and affirmative action workplace