Responsibilities
- Design and manage automated workflows for continuous integration, continuous delivery, and continuous training (CI/CD/CT) of machine learning models.
- Package and deploy ML models as scalable microservices or batch processes with high uptime requirements.
- Implement centralized monitoring, logging, and alerting to track model behavior, system performance, and data or concept drift.
- Configure and optimize cloud-hosted ML environments, including GPU-accelerated clusters, using Infrastructure as Code methods.
- Collaborate with product teams to enhance infrastructure efficiency through APIs, SDKs, and automation tools.
- Ensure full traceability of data, code, and model versions to support compliance, security, and reproducibility.
- Partner with data engineers to strengthen data pipelines using streaming technologies for real-time inference.
- Guide teams on best practices in ML engineering, scalable operations, and AI-native development patterns.
Other
- Approach unclear challenges methodically, defining problem scope before selecting tools or solutions.
- Communicate technical decisions clearly, collaborate across disciplines, and respect diverse viewpoints.
- Take end-to-end ownership of systems from design through deployment and ongoing operations.
- Integrate AI development tools as active collaborators: delegate specific coding tasks, critically validate their outputs, and manage multiple concurrent workflows.