Remote (Global), Full-time

Flower Labs is hiring a Founding ML Engineer in the Flower Frontier Model Team (all seniority levels welcome)

About the Role

Flower Labs is building the Flower Frontier Model Team, a small, high-impact group dedicated to creating state-of-the-art LLMs and foundation models. We are hiring a Founding ML Engineer at all seniority levels to play a critical role in inventing and building the training paradigms that will define the next decade of AI, blending cutting-edge techniques with Flower's pioneering decentralized learning methods.

What You'll Do

  • Design, implement and optimize core components across the full spectrum of frontier model building: data curation, evals, pre-training, and post-training.
  • Build a reliable, maintainable, and scalable software stack to produce world-leading models that are open-sourced and integrated into new Flower Labs products.
  • Diagnose and resolve GPU/kernel issues, memory/storage bottlenecks, and multi-node failures at scale; collaborate on debugging training instabilities.
  • Devise surrounding infrastructure, tooling, monitoring, and observability essential for large-scale LLM development.
  • Assume technical leadership as the training system scales in complexity and capability.

What We're Looking For

  • Exceptional software engineering skills in Python, deep learning frameworks, testing, profiling, refactoring, and reproducibility.
  • Expertise with modern ML training stacks: PyTorch, JAX or equivalent; experience implementing model architectures from scratch and working within libraries like DeepSpeed, Megatron or equivalent.
  • Ability to tune, debug, and profile large-scale training runs.
  • Hands-on experience working with large GPU clusters, including job orchestration, scheduling, multi-node runs, NCCL/RDMA issues, and GPU performance optimization.
  • Ability to collaborate effectively with both research-oriented and engineering-oriented colleagues; comfortable turning research ideas into robust, maintainable implementations.
  • Good engineering hygiene: modular design, code reviews, documentation, reproducibility, versioning of data/models/configurations.
  • Familiarity with common tools: Linux command line, git, Docker.
  • Openness to adopting new tooling.
  • Solid understanding of distributed systems and networking.
  • Strong written English and open, honest, transparent communication skills.

Nice to Have

  • PhD or Master's degree in a relevant discipline.
  • Familiarity with various components and stages relevant to building LLMs and foundation models, such as architectures, pre-training, data curation, post-training, and evaluation.
  • Experience with post-training methods like SFT, RLHF, DPO, or reward modeling.
  • Ability to read, implement, and extend cutting-edge research papers quickly.
  • Prior track record with advanced distributed training frameworks and concepts.
  • Strong grasp of optimization and training techniques: mixed precision, curriculum/data strategies, LR schedules, checkpointing.
  • Background writing high-performance kernels (CUDA, Triton).
  • Experience in developing components within systems used by thousands of users.
  • Track record of working in open-source projects.
  • Ability to adapt to different HPC configurations and GPU architectures (e.g., AMD/NVIDIA).

Technical Stack

  • Python, PyTorch, JAX
  • DeepSpeed, Megatron
  • Linux, git, Docker
  • CUDA, Triton

Team & Environment

You will be a founding member of the new Flower Frontier Model Team, a small, high-impact group composed of contributors with a mix of research and engineering backgrounds. We operate in a collaborative, fast-paced, and demanding start-up environment. It's a team of experts where everyone learns something new every day, with the opportunity to contribute ideas, be heard, and influence the direction of the company across the board.

Work Mode

This is a fully remote position open to candidates globally.

Flower Labs is an equal opportunity employer.

Required Skills

Python, PyTorch, JAX, DeepSpeed, Megatron, Linux, git, Docker, CUDA, Triton, Machine Learning, Distributed Training, Model Optimization, GPU Programming, Research

About company
Flower Labs

Flower Labs is a world-class AI startup best known for building the world's most popular open-source framework for training AI on distributed data and compute resources using decentralized and federated methods. It is trusted by industry leaders such as Mozilla, JP Morgan, Owkin, Banking Circle, and Temenos.

Job Details

  • Category: data
  • Posted: 3 months ago