Protege is looking for a Senior Data Scientist to be at the heart of how we curate, assess, and prepare the training data that powers real-world AI systems. You'll lead the evaluation and optimization of large-scale datasets used to train state-of-the-art AI models.
What You'll Do
- Design and apply statistical and machine learning methods to curate, filter, and enrich large-scale unstructured datasets
- Develop frameworks to assess data diversity, duplication, and informativeness
- Design statistical approaches to de-risk training datasets
- Collaborate with model training teams to identify data bottlenecks and optimize dataset performance
- Provide leadership on data quality strategy and shape internal best practices
- Evaluate external datasets for integration, focusing on scalability, quality, and relevance to model performance
- Help build data scorecards
- Contribute to research and development of tools that automate data preprocessing and validation
What We're Looking For
- PhD, or a Master's degree with 4+ years of industry experience, in machine learning, economics, mathematics, engineering, computer science, statistics, or a related quantitative field
- Strong understanding of AI model training pipelines, including pre-processing and evaluation
- Experience working with large, unstructured datasets, especially text
- Background in statistical analysis, bias detection, and data validation
- Ability to identify high-impact problems and drive solutions independently
Nice to Have
- Experience with synthetic data generation or augmentation strategies
- Publications or open-source contributions in data-centric AI or related areas
- Experience developing evaluation frameworks or performance metrics for training data
- Experience collaborating cross-functionally with product, infrastructure, or partnership teams
Team & Environment
You will collaborate with research and engineering teams within our lean, fast-moving, high-trust environment. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.