Design, build, and operate scalable, fault-tolerant infrastructure for LLM Research: distributed compute, data orchestration, and storage across modalities.
Develop high-throughput systems for data ingestion, processing, and transformation — including training data catalogs, deduplication, quality checks, and search.
Build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle.
Implement and maintain monitoring and alerting to support platform reliability and performance.
Collaborate with research teams to unlock new features, improve data quality, and accelerate training cycles.

Bachelor’s degree or equivalent experience in computer science, engineering, or a similar field.
Proficiency in at least one backend language (we use Python or Rust).
Are fluent in distributed compute frameworks such as Apache Spark or Ray.
Are deeply familiar with cloud infrastructure, data lake architectures, and batch and streaming pipelines.
Comfort operating across the stack and owning projects end-to-end.
Thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
A bias for action: you take the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.

Have hands-on experience with Kafka, dbt, Terraform, and Airflow.
Have experience building a web crawler.
Have a deep understanding of, and extensive experience scaling, deduplication, data mining, and search.
Have strong knowledge of file formats and storage systems (e.g., Parquet, Delta Lake) and how they impact performance and scalability.
Are proactive about documentation, testing, and empowering your teammates with good tooling.

On-site — San Francisco, California

This is an 'evergreen role' that we keep open on an ongoing basis so that candidates can express interest.
We continuously review applications and reach out to applicants as new opportunities open.
You may reapply if you gain more experience, but please avoid applying more than once every 6 months.
You are welcome to apply to project- or team-specific roles in addition to this evergreen role.

Visa sponsorship available

Thinking Machines Lab is hiring a Software Engineer, Data Infrastructure