About the Role

The position involves building and refining data pipelines, sourcing and curating diverse text corpora, and collaborating with research and engineering teams to support the development of foundational language models.

Responsibilities

Design and implement data collection strategies for large-scale text datasets
Evaluate and filter raw text sources for quality and relevance
Develop tools and automation for data preprocessing and cleaning
Collaborate with research teams to align data composition with model goals
Ensure data diversity and representativeness across domains and languages
Monitor data pipeline performance and troubleshoot issues
Maintain documentation for data sources and processing steps
Work with engineers to scale data ingestion systems
Identify and mitigate risks related to data bias and contamination
Stay current with advancements in data curation for language models
Support compliance with data usage policies and ethical guidelines
Optimize data throughput and storage efficiency
Contribute to benchmarking and evaluation of training data effectiveness
Integrate feedback from model performance into data refinement
Assist in sourcing underrepresented language data
Collaborate on data versioning and reproducibility practices
Participate in cross-team discussions on data strategy
Analyze statistical properties of text corpora
Ensure consistency in data formatting and encoding
Support audits and validation of training datasets
Develop heuristics for detecting low-quality or synthetic text
Work with legal and policy teams on data rights and licensing
Contribute to internal tools for data inspection and labeling
Help define best practices for data lifecycle management
Engage in peer review of data-related research

Nice to Have

Master’s or PhD in a technical field
Direct experience with pre-training data for language models
Contributions to open-source data or ML projects
Experience with multilingual data processing
Background in computational linguistics
Publications in relevant research areas
Experience with data labeling platforms
Knowledge of data provenance tracking
Familiarity with legal aspects of data licensing
Work history in AI safety or alignment

Compensation

Competitive salary and equity package

Work Arrangement

Hybrid work model

Team

Part of the core research and data infrastructure team focused on large language models

Our Impact

Our work directly shapes the capabilities and reliability of large language models used by developers and enterprises worldwide.
We prioritize responsible data practices to build models that are robust, fair, and trustworthy.

Life at the Company

We foster a collaborative, research-driven environment with regular knowledge sharing and technical discussions.
Engineers are encouraged to lead initiatives and contribute to long-term data strategy.

Available for qualified candidates

Cohere is hiring a Member of Technical Staff, Pre-Training Data

