About the Role
This role involves designing and maintaining highly available systems, improving operational workflows, and implementing automation to support large-scale infrastructure.
Responsibilities
- Design and manage scalable infrastructure across distributed environments
- Monitor system performance and respond to incidents with urgency
- Develop automation tools to reduce manual intervention
- Collaborate with development teams to improve service reliability
- Implement and maintain CI/CD pipelines
- Troubleshoot complex production issues across multiple layers
- Optimize system performance and resource utilization
- Enforce security and compliance standards across infrastructure
- Lead post-mortem analyses after critical incidents
- Drive reliability improvements through proactive monitoring
- Maintain documentation for systems and procedures
- Support capacity planning and system forecasting
- Ensure high availability and disaster recovery readiness
- Integrate observability into services and platforms
- Promote best practices in configuration management
- Work closely with product teams during major releases
- Evaluate and adopt new technologies for operational efficiency
- Participate in the on-call rotation and follow rapid-response protocols
- Improve deployment safety through automated checks
- Reduce technical debt in legacy systems
- Implement scalable logging and alerting frameworks
- Support cloud infrastructure management and optimization
- Ensure infrastructure-as-code principles are followed
- Drive incident response improvements through data analysis
- Foster a culture of operational excellence
Compensation
Competitive salary and benefits package
Work Arrangement
Remote with flexible hours
Team
Collaborative engineering team focused on scalable systems
Why This Role Matters
- The systems you build and maintain directly impact the reliability of core services used by thousands of users daily.
- You will play a key role in shaping how engineering teams approach scalability, resilience, and operational rigor.
- Your work ensures that failures are minimized and recovery is fast, reducing business impact during outages.
What You’ll Build
- Automated recovery systems that reduce downtime without human intervention.
- Monitoring dashboards that provide actionable insights across services.
- Self-service tools that empower developers to deploy safely and efficiently.
Available for qualified candidates


