Responsibilities
- Lead and contribute to research projects investigating new methods for detecting misuse of Claude, identifying malicious organizations and accounts, strengthening model safeguards, and addressing other safety needs.
- Design and run offline analyses over model usage data to surface abuse patterns, build classifiers and detection systems, and evaluate their effectiveness.
- Develop and iterate on prototypes that could eventually feed signals into the real-time safeguards path, partnering with engineers on tech transfer.
- Contribute to a broader research portfolio investigating methods for detecting abusive behavior in chat-based or agentic workflows, and for training the model to robustly refrain from dangerous responses or behaviors without over-refusing.
- Build evaluations and methodologies for measuring whether safeguards actually work, including in agentic settings.
- Write up findings clearly so they inform decisions across Trust & Safety, research, and product teams.
Requirements
- Have a track record of independently driving research projects from ambiguous problem statements to concrete results, ideally in AI, ML, security, integrity, or a related technical field.
- Are comfortable scoping your own work and switching between research, engineering, and analysis as a project demands.
- Have working familiarity with how large language models operate (sampling, prompting, training), even if LLMs aren't your primary background.
- Are proficient in Python and comfortable working with large datasets.
- Care about the societal impacts of AI and want your work to directly reduce real-world harm.
Nice to Have
- Experience building and training machine learning models, including classifiers for abuse, fraud, integrity, or security applications.
- Knowledge of evaluation methodologies for language models and experience designing evals.
- Experience with agentic environments and evaluating model behavior in them.
- Background in trust and safety, integrity, fraud detection, threat intelligence, or adversarial ML.
- Experience with red teaming, jailbreak research, or interpretability methods like steering vectors.
- A history of taking research prototypes and transferring them into production systems.
Team
We are a small team, with roughly a 3:1 ratio of researchers to software engineers.
