The Senior Data Engineer at Absentia Labs will own the architecture of end-to-end data systems for large-scale biomedical datasets. This role is central to shaping the data infrastructure of an AI-driven biomedical platform: making long-term architectural decisions and ensuring data is modeled, validated, versioned, and served reliably across scientific and machine learning workflows.
What You'll Do
- Architect and lead the design of end-to-end data systems for large-scale biomedical datasets (chemical, biological, toxicology, omics, assay, clinical, and experimental data).
- Define and evolve schema-driven data models that reconcile noisy, semi-structured, and heterogeneous sources into coherent, interoperable representations.
- Establish best practices for data quality, validation, provenance, lineage, and versioning suitable for scientific and ML workflows.
- Build and maintain cloud-native data infrastructure (data lakes, warehouses, object storage, streaming systems) with an emphasis on scalability and reliability.
- Design pipelines that support both batch and streaming access for ML training, evaluation, and inference.
- Partner closely with ML engineers, scientists, and product leads to translate research needs into durable data abstractions.
- Make principled trade-offs around performance, cost, flexibility, and correctness in production systems.
- Provide technical leadership through design reviews, architectural guidance, and mentorship of other engineers.
- Identify and proactively address systemic risks in data integrity, scalability, and operational complexity.
What We're Looking For
- 5+ years of experience in data engineering, platform engineering, or ML infrastructure roles, with clear ownership of production systems.
- Proven experience designing and operating large-scale, production-grade data pipelines.
- Strong proficiency in Python and data-centric software engineering practices.
- Deep experience with cloud platforms (AWS, GCP, or Azure), including storage, compute, and security primitives.
- Familiarity with distributed data processing and orchestration systems (e.g., Spark, Beam, Ray, Airflow, Dagster).
- Experience supporting ML/AI workloads, including dataset generation, feature pipelines, and reproducible training workflows.
- Strong architectural judgment and the ability to communicate technical decisions clearly across disciplines.
Nice to Have
- Prior work with biomedical or life-science data (e.g., omics, assays, molecular representations, clinical or toxicology data).
- Experience with streaming platforms (Kafka, Pub/Sub, Kinesis).
- Exposure to ontology-aware data modeling or schema evolution in scientific domains.
- Infrastructure-as-code and systems experience (Terraform, Docker, Kubernetes).
- Experience in early-stage startups or research-heavy environments.
- Open-source contributions or technical publications.
Technical Stack
- Python, AWS, GCP, Azure, Spark, Beam, Ray, Airflow, Dagster, Kafka, Pub/Sub, Kinesis, Terraform, Docker, Kubernetes
Benefits & Compensation
- A chance to architect the data backbone of an AI-driven biomedical platform.
- Direct impact on how scientific data is translated into machine intelligence.
- High autonomy, high trust, and ownership over critical systems.
- Flexible remote or hybrid work arrangements.
- A deeply technical, low-ego culture focused on learning and rigor.
- Competitive compensation and meaningful equity participation.
Absentia Labs is an equal opportunity employer. We value diversity and are committed to creating an inclusive environment for all employees.