Responsibilities
- Create and manage robust, scalable, and highly available infrastructure systems
- Monitor and operate live environments and resolve issues as they arise, including on-call duties and user support
- Enhance monitoring, alerting, and incident response mechanisms to reduce system outages
- Develop and maintain CI/CD pipelines and the containerization, orchestration, and observability tooling for APIs and large-scale training jobs
- Join on-call rotations occasionally to respond to incidents and conduct root cause analyses
- Advance automation in infrastructure deployment, scaling, and orchestration
- Work with software engineers to build reliable, repeatable model-training workflows
- Assist in developing a cloud platform that abstracts infrastructure for science and engineering teams
- Build tools and workflows to boost system reliability, performance, and availability
- Partner with security specialists to uphold infrastructure compliance and security standards
- Maintain clear documentation for operational processes and team knowledge sharing
- Contribute externally through open-source projects, research, blog posts, or conference talks