Responsibilities
- Design and manage scalable cloud infrastructure on GCP with Kubernetes to support high-performance machine learning workloads.
- Develop automated workflows for training, evaluating, and releasing ML models using tools such as Jenkins, GitHub Actions, or Airflow.
- Set up observability systems to detect model drift, accuracy changes, latency issues, and performance degradation in live environments.
- Coordinate work across data, machine learning, backend, and frontend engineering teams so that handoffs and releases run smoothly.
- Establish monitoring solutions covering both system health metrics and ML-specific signals like feature drift and data distribution changes.
- Enable individual engineering teams to monitor their own services through self-service tooling and platforms.
- Take part in the on-call rotation and help maintain compliance with security standards such as SOC.