Responsibilities

Design and operate Braze’s MongoDB infrastructure to meet strict enterprise-grade SLAs, with deep ownership of availability, durability, and query performance
Build proactive monitoring and alerting that fires on symptoms – before customers feel impact – with rich MongoDB-specific observability (oplog lag, replication health, lock contention, index hit rates, etc.)
Lead capacity planning and sharding strategy as data volumes and query patterns evolve
Drive root-cause analysis on MongoDB incidents and translate findings into permanent system improvements
Partner with product engineering teams to review schema designs, index strategies, and aggregation pipelines – catching scalability anti-patterns before they reach production
Build self-service tooling, automation, and runbooks that let engineers interact with MongoDB safely and efficiently without needing to page the platform team
Define and enforce connection pool sizing, write-concern defaults, and read-preference standards across the fleet
Manage MongoDB cluster lifecycle (provisioning, upgrades, failovers, decommissions) on Kubernetes using the MongoDB Enterprise Kubernetes Operator, with infrastructure defined as code via Terraform and Ansible
Develop and maintain automated backup, restore, and point-in-time recovery workflows – tested regularly against real workloads
Contribute to internal platform tooling in Ruby and/or Go that reduces operational toil across the SRE organization
Participate in a PagerDuty on-call rotation with a clear charter: use every quiet shift to eliminate the next page
Lead incident retrospectives with a bias toward systemic fixes, automation, and documentation – not blame
Maintain and improve runbooks so that any engineer on the team can respond effectively to MongoDB incidents

Requirements

5+ years of experience as a Software Engineer, DevOps Engineer, or Site Reliability Engineer in a production environment
Hands-on MongoDB expertise: replica sets, sharding, index design, aggregation pipelines, explain plans, and performance tuning under real load
Strong Linux fundamentals and comfort operating at the OS level (disk I/O, memory, networking, process management)
Strong programming skills in at least one of Python, Go, Ruby, or JavaScript – you write automation, not just scripts (JavaScript or Python experience is a plus for MongoDB shell scripting and aggregation pipeline work)
Experience with IaC tools: Terraform, Ansible, or equivalent
Experience with container orchestration: Docker and Kubernetes
A systems thinker who reasons about interfaces, failure modes, edge cases, and cascading effects across the stack
Bias toward documentation and asynchronous collaboration across global remote teams

Nice to Have

Experience running MongoDB at multi-terabyte scale or in a sharded topology
Familiarity with MongoDB Atlas, Ops Manager, or Cloud Manager
Experience with complementary data technologies in Braze’s stack: Redis, Kafka, Postgres
Prior work on database platform engineering or database reliability engineering (DBRE) teams

Braze is hiring a Senior Site Reliability Engineer
