Responsibilities
- Architect and manage scalable cloud systems on GCP with Kubernetes to support demanding machine learning workloads.
- Develop automated workflows for model training, evaluation, and deployment using platforms such as Jenkins, GitHub Actions, or Airflow.
- Integrate monitoring solutions to detect model drift, performance drops, and accuracy issues in live environments.
- Facilitate collaboration between data, machine learning, backend, and frontend teams to ensure seamless production operations.
- Establish monitoring systems that track both infrastructure health and ML-specific metrics, including data distribution changes and prediction quality.
- Provide monitoring tools that enable individual engineering teams to oversee their own services and workloads.
- Support on-call duties and contribute to maintaining compliance with security standards like SOC.
Other
- Participate in the on-call rotation
- Help manage security posture to ensure compliance with standards such as SOC
- At least an upper-intermediate level of spoken and written English