Responsibilities
- Design, implement, and maintain SFT and RL post-training pipelines for multi-step coding agents.
- Train and adapt LLMs for agent workflows, including planning, tool use, and multi-step interactions inside JetBrains IDEs.
- Build evaluation and simulation environments where coding agents can act, be measured, and be compared on realistic developer tasks.
- Design evaluation frameworks and metrics for agent behavior, analyze traces and logs, and close the loop from evaluation back into training, data, and reward design.
- Analyze training and evaluation results to propose and implement improvements to model architectures, training recipes, and datasets.
- Work with large-scale infrastructure, including distributed training on GPU clusters and large MapReduce-style data processing for pre-training and fine-tuning datasets.
- Collaborate closely with research, product, and infrastructure teams to turn high-level product visions into concrete models, experiments, and shipped features.
Requirements
- Extensive hands-on experience training LLMs (pre-training, fine-tuning, or post-training) in a research or production setting.
- Deep expertise in modern deep learning frameworks such as PyTorch and in specialized LLM training stacks (e.g. Megatron, NeMo, verl).
- Strong theoretical and practical understanding of LLM fundamentals: architectures, tokenization, data pipelines, batching, mixed precision, distributed training, and debugging unstable runs.
- The ability to own projects end to end, starting from a high-level problem or product pain point and carrying them through design, experimentation, implementation, and iteration.
- A product-aware mindset – you care about how developers actually use agents and can translate product needs and failure modes into modeling and evaluation work.
- At least 3 years of experience writing clean, maintainable Python code in modern ML codebases.
Nice to Have
- ML orchestrators and workflow tools such as Kubeflow, Dagster, Airflow, or ZenML, and job schedulers such as Kubernetes or SLURM.
- Large-scale data and training pipelines, e.g. MapReduce-style clusters, multi-node GPU training, or workloads on the order of 1M+ CPU/GPU hours.
- Designing and maintaining evaluation pipelines for LLMs or agents, including metrics, dashboards, experiment tracking, and automated regression checks.
- AI agent development, such as tool-using agents, planners, or multi-step coding workflows, and familiarity with agentic frameworks or patterns.
- Experiment tracking and observability using tools such as Weights & Biases, MLflow, or Langfuse.
- Inference optimization and serving optimized models in production.