Responsibilities
- Lead the MLOps function, providing technical guidance, mentorship, task prioritization, and hands-on execution support for engineers and platform contributors.
- Design, develop, and maintain scalable MLOps platforms, services, and workflows that enable model experimentation, training, validation, deployment, monitoring, and retirement.
- Define and implement practical standards for model CI/CD pipelines, feature engineering pipelines, model registries, artifact storage, reproducible training, environment control, and release promotion across development, staging, and production environments.
- Collaborate with Data Science teams to transition prototypes into robust, production-grade systems, including batch and real-time inference, model APIs, decision services, and data-driven features.
- Work with Product leadership to define measurable outcomes for ML-powered features, such as user adoption, forecast accuracy, system reliability, and operational cost, and to validate those outcomes after deployment.
- Develop and enhance monitoring systems for machine learning models and services, covering service health, latency, throughput, data quality, concept drift, model performance, cost metrics, and operational alerts.
- Design and implement safe deployment and rollback strategies for ML models, ensuring observability, repeatability, and auditability through canary releases, shadow deployments, A/B testing, and model versioning.
- Partner with Data Engineering and Platform teams to ensure reliable feature pipelines, data contracts, workflow orchestration, scheduling, data lineage, and dependency tracking for ML workloads.
- Support cloud-native ML infrastructure on AWS and related environments, including containerization, Kubernetes, orchestration tools, storage, networking, IAM policies, and cost-efficient compute patterns for training and inference.
- Collaborate with Security, Compliance, and Engineering leaders to establish controls for access management, model governance, audit trails, data handling, secrets protection, and responsible AI practices.
- Lead incident response, operational readiness planning, runbook development, postmortem analysis, and corrective actions for production ML services and platform components.
- Assess and integrate appropriate MLOps tools, balancing operational efficiency, developer experience, security, cost, and long-term maintainability.
Work Arrangement
Hybrid — Bucharest, Romania
Team
Reports to: Senior Director of Platform Technology
Working Hours
10 AM - 7 PM
Schedule
Full-time; 2-3 days in the office