Protege is looking for a Senior Applied Research Scientist to join our Core Data Team. You will play a leading role in defining what 'high-quality data' means in practice for training state-of-the-art AI models. Your work will focus on using statistical, computational, and machine learning methods to ensure our datasets are diverse, representative, and high-impact.
What You'll Do
- Design and apply statistical and machine learning methods to curate, filter, and enrich large-scale unstructured datasets.
- Develop frameworks to assess data diversity, duplication, and informativeness.
- Design statistical approaches to de-risk training datasets.
- Collaborate with model training teams to identify data bottlenecks and optimize dataset performance.
- Provide leadership on data quality strategy and shape internal best practices.
- Evaluate external datasets for integration, focusing on scalability, quality, and relevance to model performance.
- Help build data scorecards.
- Contribute to research and development of tools that automate data preprocessing and validation.
What We're Looking For
- A PhD, or a Master's degree plus 4+ years of industry experience, in machine learning, economics, mathematics, engineering, computer science, statistics, or a related quantitative field.
- Strong understanding of AI model training pipelines, including preprocessing and evaluation.
- Experience working with large, unstructured datasets, especially text.
- Background in statistical analysis, bias detection, and data validation.
- Ability to identify high-impact problems and drive solutions independently.
Nice to Have
- Experience with synthetic data generation or augmentation strategies.
- Publications or open-source contributions in data-centric AI or related areas.
- Experience developing evaluation frameworks or performance metrics for training data.
- Experience collaborating cross-functionally with product, infrastructure, or partnership teams.
Team & Environment
You'll be a senior member of the Core Data Team, a lean, fast-moving, high-trust group of builders obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.






