About the Role
This role involves building and refining data pipelines, sourcing and curating diverse text corpora, and collaborating with research and engineering teams to support the development of foundation language models.
Responsibilities
- Design and implement data collection strategies for large-scale text datasets
- Evaluate and filter raw text sources for quality and relevance
- Develop tools and automation for data preprocessing and cleaning
- Collaborate with research teams to align data composition with model goals
- Ensure data diversity and representativeness across domains and languages
- Monitor data pipeline performance and troubleshoot issues
- Maintain documentation for data sources and processing steps
- Work with engineers to scale data ingestion systems
- Identify and mitigate risks related to data bias and contamination
- Stay current with advancements in data curation for language models
- Support compliance with data usage policies and ethical guidelines
- Optimize data throughput and storage efficiency
- Contribute to benchmarking and evaluation of training data effectiveness
- Integrate feedback from model performance into data refinement
- Assist in sourcing underrepresented language data
- Collaborate on data versioning and reproducibility practices
- Participate in cross-team discussions on data strategy
- Analyze statistical properties of text corpora
- Ensure consistency in data formatting and encoding
- Support audits and validation of training datasets
- Develop heuristics for detecting low-quality or synthetic text
- Work with legal and policy teams on data rights and licensing
- Contribute to internal tools for data inspection and labeling
- Help define best practices for data lifecycle management
- Engage in peer review of data-related research
Nice to Have
- Master’s or PhD in a technical field
- Direct experience with pre-training data for language models
- Contributions to open-source data or ML projects
- Experience with multilingual data processing
- Background in computational linguistics
- Publications in relevant research areas
- Experience with data labeling platforms
- Knowledge of data provenance tracking
- Familiarity with legal aspects of data licensing
- Prior work in AI safety or alignment
Compensation
Competitive salary and equity package
Work Arrangement
Hybrid work model
Team
Part of the core research and data infrastructure team focused on large language models
Our Impact
- Our work directly shapes the capabilities and reliability of large language models used by developers and enterprises worldwide.
- We prioritize responsible data practices to build models that are robust, fair, and trustworthy.
Life at the Company
- We foster a collaborative, research-driven environment with regular knowledge sharing and technical discussions.
- Engineers are encouraged to lead initiatives and contribute to long-term data strategy.
Available for qualified candidates