About the Role
The role involves building and maintaining reliable, scalable systems with a focus on automation, monitoring, and deployment pipelines within a distributed environment.
Responsibilities
- Design and manage cloud infrastructure for high availability and performance
- Implement and maintain CI/CD pipelines to streamline software delivery
- Monitor system health and respond to incidents promptly
- Automate operational workflows to reduce manual intervention
- Collaborate with development teams to improve deployment reliability
- Ensure infrastructure meets security and regulatory compliance standards
- Troubleshoot production issues across multiple environments
- Optimize resource usage and reduce operational costs
- Maintain documentation for systems and processes
- Support on-call rotations for critical system alerts
- Integrate monitoring and alerting tools across services
- Manage configuration and provisioning using infrastructure-as-code tools
- Deploy and manage containerized applications
- Work with distributed systems and microservices architecture
- Improve system resilience through proactive testing and failover planning
Nice to Have
- Experience with large-scale distributed systems
- Background in site reliability engineering
- Knowledge of service mesh technologies
- Experience with infrastructure-as-code tools like Terraform
- Familiarity with logging and tracing systems
- Understanding of compliance frameworks
- Experience in fast-paced startup environments
Compensation
Competitive salary and benefits package
Work Arrangement
Remote; US night-shift hours required
Team
Collaborative engineering team focused on scalable infrastructure and deployment systems
Why This Role Matters
This position plays a key role in ensuring the stability and scalability of core systems that support real-time AI services.
What We Expect
Proactive ownership of system health, a focus on automation, and a commitment to operational excellence are essential.
No visa sponsorship available