About the Role
Responsibilities
- Ensure high availability, reliability, and performance of applications and infrastructure.
- Define and monitor SLIs, SLOs, and SLAs to maintain service reliability.
- Implement automation to reduce manual operations and improve system efficiency.
- Monitor systems, detect anomalies, and respond to incidents in a timely manner.
- Lead incident management, root cause analysis (RCA), and post-mortem processes.
- Collaborate with development and DevOps teams to improve system resilience and scalability.
- Manage observability tools (monitoring, logging, tracing) to gain system insights.
- Optimize system performance, capacity planning, and cost efficiency.
- Implement reliability best practices, including redundancy, failover, and disaster recovery.
- Continuously improve system reliability through proactive engineering initiatives.
Benefits
- Contractor model
- Remote model
- Salary in USD
- Paid vacation
- Day off on your birthday
- Benefits for courses and/or certifications
- Opportunity to work with top-tier U.S. clients
- Entrepreneurial, multicultural team culture
Compensation
Salary in USD
Work Arrangement
Remote (Worldwide)
Additional Information
- Must have experience working for U.S. clients
- Advanced English proficiency (C1) required