Responsibilities
- Ensure consistent performance, reliability, and operational integrity of live AI services
- Refactor existing systems to improve robustness, readability, and long-term maintainability
- Troubleshoot and resolve technical issues across distributed systems, data processing pipelines, and storage components
- Build monitoring, alerting, and diagnostic tools to support highly available production environments
- Collaborate with researchers and development teams to deploy scalable, production-grade predictive models
- Define and implement standards for testing, deployment, capacity forecasting, and incident management
- Participate in incident response efforts and post-incident reviews to drive systemic improvements
Benefits
- Fully remote-friendly work environment
- Comprehensive health coverage including medical, dental, vision, and accident insurance
- Basic and optional supplemental life and accidental death & dismemberment insurance
- Flexible Spending Account for eligible healthcare expenses
- 401(k) retirement plan with employer matching contributions
- Paid public holidays and a flexible paid time off policy
Work Arrangement
Hybrid
Other
- The company covers approved travel expenses.
- Candidates must confirm their ability to travel within the United States.


