Responsibilities
- Define and guide the architectural strategy for database systems including SQL Server, Aurora PostgreSQL, and Snowflake, with a focus on availability, disaster recovery, replication, backup, capacity, performance, and security.
- Lead end-to-end delivery of complex, organization-wide initiatives such as multi-region failover, platform upgrades, alerting improvements, and cost efficiency, from problem identification to production deployment.
- Set technical standards by producing high-quality code, system designs, RFCs, and operational runbooks that elevate engineering practices across the database reliability team.
- Minimize manual operations by building automated solutions for provisioning, patching, scaling, failover, and decommissioning, treating any human intervention as a defect to resolve.
- Improve alerting effectiveness by reducing noise and increasing signal clarity, working closely with development teams to assign ownership, define SLAs, and refine alert logic.
- Take charge during critical incidents, conduct root-cause analyses, and implement systemic changes to prevent future occurrences.
- Establish and monitor key reliability metrics such as uptime, mean time to recovery, alert sustainability, and SLA compliance through dashboards and regular reporting.
- Collaborate with application developers, infrastructure, and SRE teams on data modeling, query optimization, lifecycle management, and shared reliability patterns, while advising leadership on strategic planning and resource allocation.
Compensation
Competitive salary and comprehensive benefits package
Work Arrangement
Hybrid
Team
Part of the Database Reliability Engineering (DBRE) team, focused on scalable data infrastructure
Responsibilities
- Drive architectural direction for the database platform across SQL Server, Aurora PostgreSQL, and Snowflake, covering high availability, disaster recovery, replication, backup and recovery, capacity, performance, and security.
- Own complex, cross-cutting initiatives such as cross-region disaster recovery, platform refresh orchestration, alerting redesign, and cost optimization, taking each from problem statement through to a deployed, owned solution.
- Lead by example through code, design documents, RFCs, and runbooks that set the standard for technical writing, code quality, and operational rigor across the DBRE team.
- Reduce operational toil by engineering automation across provisioning, refresh, patching, scaling, failover, and decommissioning — treating manual operations as bugs to be eliminated.
- Lead alert engineering to drive sustainable reductions in alert volume while improving signal quality, partnering with application teams on alert ownership, attribution, and SLA design.
- Drive incident response and root-cause analysis for the most complex production incidents, and convert RCAs into platform-level improvements that prevent recurrence.
- Define reliability KPIs (availability, MTTR, alert sustainability, SLA adherence) and build the dashboards and reporting cadence to track them.
- Partner with application engineering, infrastructure, and SRE teams on schema design, query performance, data lifecycle, and shared reliability patterns, and engage senior leadership on strategy, multi-quarter roadmaps, and budget trade-offs.
Available for qualified candidates