About the Role
This role combines software engineering and systems operations to build and maintain reliable, scalable systems. The engineer will focus on automation, monitoring, incident response, and system performance.
Responsibilities
- Design and maintain infrastructure to ensure high availability and performance
- Implement automated deployment and configuration management systems
- Monitor systems to detect and resolve issues proactively
- Respond to incidents and lead resolution efforts
- Conduct post-incident reviews to identify root causes and prevent recurrence
- Optimize system reliability and operational efficiency
- Collaborate with development teams to improve service resilience
- Develop tools and scripts to streamline operations
- Manage cloud infrastructure and services
- Ensure systems meet security and compliance standards
- Participate in on-call rotations for critical systems
- Troubleshoot complex production issues
- Improve monitoring and alerting systems
- Support disaster recovery planning and testing
- Drive adoption of best practices in reliability engineering
- Perform capacity planning and performance tuning
- Integrate reliability into the development lifecycle
- Maintain documentation for systems and procedures
- Evaluate new technologies for operational improvements
- Promote a culture of continuous improvement and learning
Compensation
Competitive salary based on experience
Work Arrangement
Hybrid work model with on-site and remote options
Team
Collaborative engineering environment focused on reliability and scalability
Why Join Us
- Opportunity to work on large-scale systems with real impact
- Supportive team culture that values innovation and ownership
Technology Stack
- Modern cloud infrastructure and automation tools
- Kubernetes, Terraform, and observability platforms