Protege is hiring a Senior Machine Learning Researcher / Principal Scientist to lead the evaluation and optimization of large-scale datasets used to train state-of-the-art AI models. You'll define what 'high-quality data' means in practice using statistical, computational, and ML-driven methods. This role is central to solving AI's data problem, which we see as a generational opportunity.
What You'll Do
- Design and apply statistical and machine learning methods to curate, filter, and enrich large-scale unstructured datasets.
- Develop frameworks to assess data diversity, duplication, and informativeness.
- Design statistical approaches to de-risk training datasets.
- Collaborate with model training teams to identify data bottlenecks and optimize dataset performance.
- Provide leadership on data quality strategy and shape internal best practices.
- Evaluate external datasets for integration, focusing on scalability, quality, and relevance to model performance.
- Help build data scorecards.
- Contribute to research and development of tools that automate data preprocessing and validation.
What We're Looking For
- PhD, or an equivalent combination of a Master's degree and 4+ years of industry experience, in machine learning, economics, mathematics, engineering, computer science, statistics, or a related quantitative field.
- Strong understanding of AI model training pipelines, including pre-processing and evaluation.
- Experience working with large, unstructured datasets, especially text.
- Background in statistical analysis, bias detection, and data validation.
- Ability to identify high-impact problems and drive solutions independently.
Nice to Have
- Experience with synthetic data generation or augmentation strategies.
- Publications or open-source contributions in data-centric AI or related areas.
- Experience developing evaluation frameworks or performance metrics for training data.
- Experience collaborating cross-functionally with product, infrastructure, or partnership teams.
Team & Environment
This role is part of the Core Data Team.
Protege is an equal opportunity employer.





