Responsibilities
- Continuously enhance the reliability and performance of the core database system.
- Develop and improve metrics and alerts to identify and prevent production issues before they impact users.
- Investigate common customer issues to find root causes and propose fixes, reports, and improvements.
- Refine incident response processes and post-mortem analyses for outages, collaborating with support and cloud teams to inform affected users.
- Plan, implement, and lead chaos engineering initiatives across engineering teams based on internal priorities.
- Oversee on-call processes to address performance and reliability issues, establishing best practices for issue resolution and minimizing user impact.
Requirements
- Bachelor’s or Master’s degree in Computer Science or a related field.
- At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering.
- Previous experience operating the core database system or other SQL databases in production.
- Scripting experience with Shell or Python, and ability to read and understand C++ code.
- Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
- Strong problem-solving skills and solid production debugging abilities.
- Ability to thrive in a fast-paced environment as part of a global team, with a focus on business goals.
- High level of responsibility, ownership, and accountability.
- Excellent communication skills.
Nice to Have
- Excellent understanding of distributed database internals and SQL, particularly the core database system.
Work Arrangement
Remote (Worldwide)
Team
Site Reliability Engineering team for the core database system