Responsibilities
- Design and operate Braze’s MongoDB infrastructure to meet strict enterprise-grade SLAs, with deep ownership of availability, durability, and query performance
- Build proactive monitoring and alerting that fires on symptoms – before customers feel impact – with rich MongoDB-specific observability (oplog lag, replication health, lock contention, index hit rates, etc.)
- Lead capacity planning and sharding strategy as data volumes and query patterns evolve
- Drive root-cause analysis on MongoDB incidents and translate findings into permanent system improvements
- Partner with product engineering teams to review schema designs, index strategies, and aggregation pipelines – catching scalability anti-patterns before they reach production
- Build self-service tooling, automation, and runbooks that let engineers interact with MongoDB safely and efficiently without needing to page the platform team
- Define and enforce connection pool sizing, write-concern defaults, and read-preference standards across the fleet
- Manage MongoDB cluster lifecycle (provisioning, upgrades, failovers, decommissions) on Kubernetes using the MongoDB Enterprise Kubernetes Operator, with infrastructure defined as code via Terraform and Ansible
- Develop and maintain automated backup, restore, and point-in-time recovery workflows – tested regularly against real workloads
- Contribute to internal platform tooling in Ruby and/or Go that reduces operational toil across the SRE organization
- Participate in a PagerDuty on-call rotation with a clear charter: use every quiet shift to eliminate the next page
- Lead incident retrospectives with a bias toward systemic fixes, automation, and documentation – not blame
- Maintain and improve runbooks so that any engineer on the team can respond effectively to MongoDB incidents
Requirements
- 5+ years of experience as a Software Engineer, DevOps Engineer, or Site Reliability Engineer in a production environment
- Hands-on MongoDB expertise: replica sets, sharding, index design, aggregation pipelines, explain plans, and performance tuning under real load
- Strong Linux fundamentals and comfort operating at the OS level (disk I/O, memory, networking, process management)
- Strong programming skills in at least one of Python, Go, Ruby, or JavaScript – you write automation, not just scripts. JavaScript or Python experience is a plus for MongoDB shell scripting and aggregation pipeline work
- Experience with IaC tools: Terraform, Ansible, or equivalent
- Experience with container orchestration: Docker and Kubernetes
- Systems thinking: you reason about interfaces, failure modes, edge cases, and cascading effects across the stack
- Bias toward documentation and asynchronous collaboration across global remote teams
Nice to Have
- Experience running MongoDB at multi-terabyte scale or in a sharded topology
- Familiarity with MongoDB Atlas, Ops Manager, or Cloud Manager
- Experience with complementary data technologies in Braze’s stack: Redis, Kafka, Postgres
- Prior work on database platform engineering or database reliability engineering (DBRE) teams