Responsibilities
- Plan and execute rolling upgrades across tens of thousands of ClickHouse clusters, ensuring safety, correctness, and minimal customer impact
- Own the full release pipeline, from pre-upgrade validation and staged rollouts to post-upgrade monitoring and incident response
- Investigate and resolve production issues as part of a regular on-call rotation, including snowflake clusters and edge cases that automation can't yet handle
- Build and improve the internal tooling and automation that makes large-scale database operations reliable and repeatable
- Work closely with the core database and cloud infrastructure teams to identify operational pain points and turn them into solved problems
- Support and educate other engineering teams that use our internal tools
Requirements
- 5+ years of experience operating stateful distributed systems in production, such as databases, message queues, or storage systems
- Hands-on experience running upgrades or maintenance operations at scale on live production data stores
- Strong production debugging skills; you are comfortable digging into unfamiliar systems under pressure
- Experience with cloud infrastructure (AWS, Azure, or GCP) and Kubernetes
- Software development experience in Go (or strong experience in another language and a genuine willingness to learn Go)
Nice to Have
- Experience with ClickHouse (as a user, operator, or contributor)
Team
The Release Team owns the safe, continuous delivery of ClickHouse Cloud, a managed database platform running tens of thousands of ClickHouse clusters. We are responsible for upgrading and maintaining those clusters at scale, building the internal tooling that makes it possible, and being the last line of defense when something doesn't go according to plan.