About the Role
This role is responsible for keeping production systems highly available and performant, combining software engineering and operational expertise to build resilient infrastructure.
Responsibilities
- Monitor system performance and respond to incidents
- Design and implement automated deployment pipelines
- Troubleshoot and resolve infrastructure issues
- Collaborate with development teams to improve code deployability
- Maintain system security and compliance standards
- Optimize resource utilization and system efficiency
- Develop tools for monitoring and alerting
- Participate in on-call rotations
- Ensure disaster recovery procedures are tested and effective
- Improve incident response workflows
- Manage configuration and version control for infrastructure
- Support capacity planning initiatives
- Enforce observability best practices across services
- Contribute to post-mortem analyses after outages
- Drive adoption of reliability best practices
- Scale systems to meet growing demand
- Reduce technical debt in operational systems
- Implement self-healing mechanisms in production environments
- Work with distributed systems and cloud platforms
- Ensure service level objectives are met
Nice to Have
- Master's degree in a technical field
- Experience with large-scale distributed systems
- Contributions to open-source projects
- Certifications in cloud or systems engineering
- Prior work in high-availability environments
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model with flexibility for remote work
Team
Collaborative engineering team focused on system reliability and performance
Technology Stack
- Uses modern cloud-native technologies including Kubernetes, Prometheus, and Terraform
- Leverages managed services on Google Cloud Platform
Growth Opportunities
- Engineers are encouraged to lead initiatives and mentor others
- Opportunities for advancement in technical and leadership tracks
Available for qualified candidates


