Site Reliability Engineer at Mistral AI (Expired)

Create and manage robust, scalable, and highly available infrastructure systems
Monitor and operate live environments and resolve issues as they arise, including on-call duties and user support
Enhance monitoring, alerting, and incident response mechanisms to reduce system outages
Develop and maintain CI/CD pipelines, containerization, orchestration, and observability tools for APIs and large-scale training jobs
Occasionally join on-call rotations to address incidents and conduct root cause investigations
Advance automation in infrastructure deployment, scaling, and orchestration
Work with software engineers to build reliable, repeatable model-training workflows
Assist in developing a cloud platform that abstracts infrastructure for science and engineering teams
Build tools and workflows to boost system reliability, performance, and availability
Partner with security specialists to uphold infrastructure compliance and security standards
Maintain clear documentation for operational processes and team knowledge sharing
Support external contributions through open-source projects, research, blogs, or conference participation

Mistral AI was looking for a Site Reliability Engineer