About the Role
In this role, you will maintain and improve the reliability of production systems, combining engineering expertise with an operational focus. You will design automated solutions, respond to incidents, and contribute to scalable infrastructure.
Responsibilities
- Monitor system performance and respond to alerts promptly
- Design and implement automation to reduce manual operations
- Collaborate with development teams to improve code deployability
- Maintain high availability and uptime for critical services
- Troubleshoot complex production issues across distributed systems
- Develop tools to streamline incident response and resolution
- Participate in on-call rotations with support from the team
- Optimize system performance and resource utilization
- Enforce best practices in configuration management
- Contribute to disaster recovery planning and execution
- Ensure systems meet defined service level objectives
- Drive improvements in observability and monitoring coverage
- Support secure and reliable deployment pipelines
- Document system architecture and operational procedures
- Evaluate new technologies for operational efficiency
- Implement proactive alerting to prevent outages
- Conduct root cause analysis after incidents
- Promote a blameless post-mortem culture
- Assist in capacity planning for future growth
- Integrate security practices into system design
- Maintain cloud infrastructure configurations
- Ensure compliance with operational standards
- Collaborate across teams to address scalability challenges
- Refine incident escalation protocols
- Support platform audits and reviews
Nice to Have
- Experience in an education technology environment
- Background in large-scale distributed systems
- Familiarity with Kubernetes in production
- Knowledge of Terraform or similar IaC tools
- Experience with service mesh technologies
- Exposure to zero-downtime deployment strategies
- Understanding of SRE principles and error budgets
- Prior work with observability platforms like Datadog
- Involvement in platform security initiatives
- Track record of mentoring junior engineers
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid or remote options available
Team
Collaborative engineering team focused on platform stability and performance
Our Engineering Culture
- We value transparency, ownership, and continuous improvement in our technical practices.
- Engineers are encouraged to propose and lead infrastructure initiatives.
Growth Opportunities
- You will have access to learning budgets and time for professional development.
- We support engineers who want to grow into technical leadership roles.
Available for qualified candidates