Responsibilities
- Evaluate the current AI/ML pipeline by auditing training and serving systems to determine which components to retain and which to redesign, with clear justification for each decision.
- Develop a unified AI/ML platform encompassing training infrastructure, experiment tracking, model registry, serving mechanisms, and monitoring tools for broad organizational use.
- Manage the complete machine learning lifecycle, including data ingestion, feature engineering, and annotation processes, so that data science teams can iterate on models quickly.
- Lead training infrastructure on Databricks with Unity Catalog, prioritizing speed, reproducibility, and data lineage tracking.
- Design and implement model serving solutions including low-latency APIs, batch scoring pipelines, and caching strategies integrated with Java/Spring backend services.
- Establish observability frameworks to detect data drift, model drift, accuracy degradation, and shifts in business KPIs using Grafana, Prometheus, and alerting systems.
- Optimize ML compute usage to control costs, define efficient patterns for training and inference, and report infrastructure spend and ROI to financial stakeholders.
- Act as a technical mentor to AI/ML and infrastructure engineers, raising engineering standards and promoting best practices across teams.
- Promote adoption of modern AI development tools by demonstrating effective use of AI-assisted coding, agentic workflows, and AI-powered incident response.
Work Arrangement
Hybrid — Prague