Responsibilities
- Lead, coach, and expand a site reliability engineering team, promoting accountability, continuous learning, and teamwork.
- Ensure high availability, performance, and scalability of critical production systems.
- Refine incident management procedures to enable rapid resolution, and conduct thorough post-incident reviews for continuous improvement.
- Establish and monitor service level objectives, indicators, and agreements to uphold reliability standards across engineering groups.
- Lead automation efforts to eliminate manual tasks, enhance system observability, and minimize operational burden.
- Collaborate with software engineering, platform, and security teams to design robust, secure, and scalable system architectures.
- Forecast infrastructure demand and plan capacity growth to support business needs while keeping costs under control.
- Promote proactive reliability strategies including chaos engineering, failure testing, and capacity modeling.
- Oversee a sustainable on-call schedule that ensures reliable coverage while supporting team members' work-life balance.
Team
SRE team reporting to the SRE Manager