About the Role

In this role, you will maintain and improve the reliability of production systems, combining engineering expertise with an operational focus. You will design automated solutions, respond to incidents, and contribute to scalable infrastructure.

Responsibilities

Monitor system performance and respond to alerts promptly
Design and implement automation to reduce manual operations
Collaborate with development teams to improve code deployability
Maintain high availability and uptime for critical services
Troubleshoot complex production issues across distributed systems
Develop tools to streamline incident response and resolution
Participate in on-call rotations with support from the team
Optimize system performance and resource utilization
Enforce best practices in configuration management
Contribute to disaster recovery planning and execution
Ensure systems meet defined service level objectives
Drive improvements in observability and monitoring coverage
Support secure and reliable deployment pipelines
Document system architecture and operational procedures
Evaluate new technologies for operational efficiency
Implement proactive alerting to prevent outages
Conduct root cause analysis after incidents
Promote a blameless post-mortem culture
Assist in capacity planning for future growth
Integrate security practices into system design
Maintain cloud infrastructure configurations
Ensure compliance with operational standards
Collaborate on scalability challenges
Refine incident escalation protocols
Support platform audits and reviews

Nice to Have

Experience in an education technology environment
Background in large-scale distributed systems
Familiarity with Kubernetes in production
Knowledge of Terraform or similar IaC tools
Experience with service mesh technologies
Exposure to zero-downtime deployment strategies
Understanding of SRE principles and error budgets
Prior work with observability platforms like Datadog
Involvement in platform security initiatives
Track record of mentoring junior engineers

Compensation

Competitive salary and benefits package

Work Arrangement

Hybrid or remote options available

Team

Collaborative engineering team focused on platform stability and performance

Our Engineering Culture

We value transparency, ownership, and continuous improvement in our technical practices.
Engineers are encouraged to propose and lead infrastructure initiatives.

Growth Opportunities

You will have access to learning budgets and time for professional development.
We support engineers who want to grow into technical leadership roles.

Available for qualified candidates

Arbor Education is hiring a Site Reliability Engineer