Toronto; London; Montreal; New York; Paris; San Francisco; Remote (Global)

Cohere is hiring a Member of Technical Staff, Pre-Training Data

About the Role

The position involves building and refining data pipelines, sourcing and curating diverse text corpora, and collaborating with research and engineering teams to support the development of foundational language models.

Responsibilities

  • Design and implement data collection strategies for large-scale text datasets
  • Evaluate and filter raw text sources for quality and relevance
  • Develop tools and automation for data preprocessing and cleaning
  • Collaborate with research teams to align data composition with model goals
  • Ensure data diversity and representativeness across domains and languages
  • Monitor data pipeline performance and troubleshoot issues
  • Maintain documentation for data sources and processing steps
  • Work with engineers to scale data ingestion systems
  • Identify and mitigate risks related to data bias and contamination
  • Stay current with advancements in data curation for language models
  • Support compliance with data usage policies and ethical guidelines
  • Optimize data throughput and storage efficiency
  • Contribute to benchmarking and evaluation of training data effectiveness
  • Integrate feedback from model performance into data refinement
  • Assist in sourcing underrepresented language data
  • Collaborate on data versioning and reproducibility practices
  • Participate in cross-team discussions on data strategy
  • Analyze statistical properties of text corpora
  • Ensure consistency in data formatting and encoding
  • Support audits and validation of training datasets
  • Develop heuristics for detecting low-quality or synthetic text
  • Work with legal and policy teams on data rights and licensing
  • Contribute to internal tools for data inspection and labeling
  • Help define best practices for data lifecycle management
  • Engage in peer review of data-related research

Nice to Have

  • Master’s or PhD in a technical field
  • Direct experience with pre-training data for language models
  • Contributions to open-source data or ML projects
  • Experience with multilingual data processing
  • Background in computational linguistics
  • Publications in relevant research areas
  • Experience with data labeling platforms
  • Knowledge of data provenance tracking
  • Familiarity with legal aspects of data licensing
  • Work history in AI safety or alignment

Compensation

Competitive salary and equity package

Work Arrangement

Hybrid work model

Team

Part of the core research and data infrastructure team focused on large language models

Our Impact

  • Our work directly shapes the capabilities and reliability of large language models used by developers and enterprises worldwide.
  • We prioritize responsible data practices to build models that are robust, fair, and trustworthy.

Life at the Company

  • We foster a collaborative, research-driven environment with regular knowledge sharing and technical discussions.
  • Engineers are encouraged to lead initiatives and contribute to long-term data strategy.

Available for qualified candidates

About the Company

Cohere trains and deploys frontier AI models for developers and enterprises building systems for content generation, semantic search, RAG, and agents.

Job Details

Department: Modeling
Category: Other
Posted: 10 months ago