About the Role
This role focuses on maintaining system uptime, improving operational efficiency, and supporting scalable services through automation and proactive monitoring.
Responsibilities
- Design and maintain reliable and scalable systems
- Implement automation to reduce manual operations
- Monitor system performance and respond to incidents
- Collaborate with engineering teams to improve service reliability
- Develop tools for operational efficiency
- Troubleshoot and resolve production issues
- Optimize system performance and resource usage
- Support deployment pipelines and CI/CD workflows
- Ensure systems meet reliability and availability targets
- Participate in on-call rotations for incident response
- Contribute to post-incident reviews and action plans
- Maintain documentation for systems and procedures
- Enforce best practices in configuration management
- Work on capacity planning and forecasting
- Improve monitoring coverage and alerting accuracy
- Integrate reliability into the development lifecycle
- Support cloud infrastructure operations
- Drive initiatives to reduce system downtime
- Evaluate and implement new operational tools
- Promote a culture of continuous improvement
- Ensure compliance with security and operational standards
- Scale systems in response to growing demand
- Analyze system metrics to identify trends
- Assist in disaster recovery planning
- Contribute to system architecture discussions
Compensation
Competitive salary and benefits package, commensurate with experience.
Work Arrangement
Remote
Team
Part of a global engineering team focused on system reliability and operational excellence.
Why This Role Matters
This position plays a critical role in ensuring the stability and performance of large-scale systems that support global operations.
What We Offer
Opportunities for professional growth, a collaborative remote work environment, and access to cutting-edge technologies.