About the Role
This role maintains the high availability, performance, and resilience of production systems through automation, monitoring, and incident response.
Responsibilities
- Design and implement reliable and scalable systems
- Monitor system performance and troubleshoot issues
- Automate operational tasks to reduce manual intervention
- Respond to and resolve production incidents
- Collaborate with development teams to improve system design
- Maintain system documentation and runbooks
- Ensure systems meet security and compliance standards
- Optimize infrastructure for cost and performance
- Support deployment pipelines and CI/CD processes
- Implement proactive alerting and monitoring solutions
- Conduct root cause analysis for critical incidents
- Drive improvements in system reliability and uptime
- Participate in on-call rotations
- Evaluate and integrate new technologies
- Promote best practices in infrastructure as code
- Work closely with IT and security teams
- Manage cloud-based infrastructure services
- Ensure disaster recovery readiness
- Support capacity planning and forecasting
- Improve incident response procedures
- Foster a culture of continuous improvement
- Maintain consistency across development, staging, and production environments
- Contribute to post-mortem reviews
- Ensure compliance with internal policies
- Assist in onboarding and mentoring junior engineers
Compensation
Competitive salary and benefits package
Work Arrangement
Remote
Team
Collaborative engineering team focused on reliability and scalable systems
Why This Role Matters
This position plays a key role in ensuring the stability and performance of critical systems that support patient-facing applications and internal operations.
What We Offer
Opportunities for professional growth, a supportive remote work culture, access to cutting-edge technologies, and comprehensive health and wellness benefits.