Responsibilities
- Own the design, implementation, and evolution of core MLOps systems across Hyperstack — including the infrastructure and workflows that underpin AI Studio
- Build and improve systems that orchestrate model training, fine-tuning, evaluation, and deployment — engineered for long-running, resource-intensive GPU workloads
- Own production readiness across ML infrastructure — monitoring, alerting, incident response, and continuous improvement based on real-world usage
- Define and embed strong MLOps practices across teams — model versioning, reproducibility, deployment safety, rollback strategies, and environment management
- Provide technical leadership through architecture decisions, implementation guidance, and shared standards, working closely with Product, Engineering, and other cross-functional teams
Requirements
- Proven experience designing, building, and operating production ML infrastructure, platform systems, or MLOps workflows in cloud environments
- Hands-on Python development skills, with experience building backend systems, automation, and developer or platform tooling
- Experience supporting LLM, generative AI, or fine-tuning workflows in production — including training, evaluation, deployment, inference, and lifecycle management
- Production experience with Docker, Kubernetes, CI/CD, and infrastructure-as-code
- Experience owning complex, asynchronous, or resource-intensive workloads end to end — including orchestration, reliability, observability, and incident response
- Ability to work cross-functionally and provide technical leadership through influence — shaping standards, direction, and ways of working across engineering teams
Nice to Have
- Exposure to GPU-intensive, distributed, or performance-sensitive ML workloads
- Experience building internal developer platforms or tooling that improve experimentation, reproducibility, and delivery speed for ML teams
- Background in cloud infrastructure, platform products, or technically complex B2B software
Benefits
- Competitive salary and annual discretionary bonus scheme
- Employee wellbeing benefits
- 25 days of holiday, plus public holidays
- Flexible working arrangements (remote or hybrid, depending on role and location)
- Real ownership and autonomy, with the trust to take initiative and experiment
- The opportunity to make a visible, meaningful impact as we scale
- Clear career progression and growth opportunities in a fast-growing company
- A collaborative, international culture built on trust, transparency, and ownership
- The chance to help shape NexGen Cloud's team, culture, and future alongside ambitious, mission-driven colleagues


