About the Role
In this role, you will improve system reliability through automation, monitoring, and incident response, working closely with development and operations teams to deliver resilient infrastructure and services.
Responsibilities
- Design and maintain scalable, reliable, and secure production systems
- Implement automated solutions for deployment, monitoring, and recovery
- Respond to and resolve critical system incidents promptly
- Collaborate with engineering teams to improve service reliability
- Develop tools and scripts to streamline operations
- Monitor system performance and proactively address issues
- Optimize system architecture for high availability
- Manage incident response processes and post-mortem analyses
- Ensure systems meet defined service level objectives
- Support CI/CD pipelines with reliability-focused practices
- Troubleshoot complex production problems across distributed systems
- Enforce best practices in configuration management
- Maintain comprehensive documentation of systems and procedures
- Evaluate and integrate new technologies to improve system performance
- Participate in on-call rotations for incident support
- Drive improvements in system observability and metrics collection
- Plan capacity and optimize resource usage
- Implement and manage logging infrastructure
- Support cloud infrastructure operations and migration efforts
- Promote a culture of reliability across engineering teams
Nice to Have
- Master’s degree in a technical field
- Experience in large-scale logistics or e-commerce environments
- Certifications in cloud or systems administration
- Prior work with observability platforms like Prometheus or Grafana
- Experience building high-traffic web applications
- Knowledge of service mesh technologies
- Experience with disaster recovery planning
- Familiarity with regulatory compliance standards
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model available
Team
Part of the engineering division focused on system reliability and performance
Why Join Us
- Opportunity to work on large-scale systems impacting millions of users
- Supportive environment that values innovation and technical excellence
What We Offer
- Professional development opportunities
- Flexible working arrangements
- Modern tech stack and infrastructure
Visa sponsorship available for qualified candidates