Zyte is seeking a Core & ML Ops Team Lead to build the bedrock infrastructure that powers our services at scale. In this hands-on technical leadership role, you will lead a cross-functional squad responsible for designing and maintaining the scalable foundation for all Zyte services.
What You'll Do
- Design and evolve the core platform, including Kubernetes, Mesos, GPU scheduling/autoscaling, and distributed compute.
- Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring.
- Build the Golden Path, including reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts, high-performance clients, and production‑ready defaults.
- Operate a secure, multi‑tenant model registry and training platform with standardized experiment/evaluation harnesses.
- Provide turnkey serving patterns, drift/quality monitoring, and rollback playbooks.
- Integrate public/open‑source AI capabilities as managed platform services with cost and data‑governance guardrails.
- Run the squad: own roadmap/prioritization, delivery, mentoring, and high engineering standards.
- Partner with product engineering, Prod Ops, and Security on adoption and rollout plans.
- Own container orchestration, GPU provisioning and autoscaling, and environment and secret management.
- Own operators, sidecars, and internal SDKs/libraries (Go/Rust/Python/Java) that enforce the Golden Path contract.
- Own observability: logging/metrics/tracing pipelines.
- Own the billing pipeline: metering/events/cost tracking abstractions.
- Own the Golden Path: Java, Python, and ML templates, CI/CD blueprints, docs, and the scaffold CLI.
- Own reliability enablement (SRE practices), cost governance, and supply‑chain security.
What We're Looking For
- 5+ years of experience building distributed systems.
- 3+ years in MLOps/ML platform engineering, or equivalent impact.
- Knowledge of Linux/OS internals, networking (TCP/IP, HTTP/2), concurrency, and performance profiling.
- Deep understanding of Kubernetes.
- Proficiency in developing high-performance services in Java, Rust, Go, or C++.
- Strong Python skills.
- Experience with GPU infrastructure (scheduling, containerization, optimization).
- Track record of designing and operating model platforms in production.
- Demonstrated success leading technical teams and implementing organization-wide platform solutions.
Nice to Have
- Streaming & workflows: Kafka plus Argo/Temporal/Airflow or equivalents.
- eBPF‑based observability, perf tooling, or io_uring experience.
- Cost optimization for ML/AI; multi‑tenant quotas and fairness.
- Hands‑on experience authoring Golden Paths (service chassis/templates, CI/CD blueprints, CLI scaffolds).
- SRE practices (SLIs/SLOs, incident management).
Technical Stack
- Platform: Kubernetes, Mesos, GPU infrastructure
- Languages: Java, Rust, Go, C++, Python
- Frameworks: Vert.x, Netty
- Streaming & Workflow: Kafka, Argo, Temporal, Airflow
- Systems: eBPF, io_uring
Team & Environment
The Core & MLOps Squad is part of a globally distributed team of more than 250 people.
Benefits & Compensation
- Work from anywhere in a completely remote company.
- Work with a wide range of open-source technologies and tools.
- Be part of a self-motivated, progressive, multi-cultural team.
- Foster and nourish new ideas and bring them to market.
Work Mode
This is a global, fully remote role. You can work from more than 28 countries.
Zyte is an equal opportunity employer.



