Responsibilities
- Participate in the on-call rotation (PagerDuty) to respond to production incidents
- Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
- Build monitoring systems to ensure the highest quality service for our customers
- Design and implement operational processes (such as deployments and upgrades)
- Debug production issues across all services and levels of the stack
- Identify improvements to the product architecture from reliability, performance, and availability perspectives
- Plan the growth of Together AI's infrastructure
Requirements
- 5+ years of professional experience in AI infrastructure or a related area
- Bachelor's degree in Computer Science or a related field, or equivalent work experience
- Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
- Proficiency in programming/scripting languages
- Direct experience in monitoring and observability practices
- Knowledge of cloud services
- Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts
Benefits
- Startup equity
- Health insurance
- Other competitive benefits