About the Role
The role involves building and maintaining scalable data pipelines that power machine learning systems, with a focus on reliability, performance, and integration within cloud environments.
Responsibilities
- Develop and optimize batch and streaming data pipelines for machine learning applications, balancing speed, cost, and scalability
- Design and implement data workflows on AWS, deploying infrastructure with IaC tools such as Terraform
- Ensure data reliability, quality, and accessibility across systems, in compliance with data governance policies
- Collaborate with data scientists to operationalize ML models
- Monitor pipeline performance, enhance alerting, and troubleshoot production issues to maintain system uptime
- Integrate and ingest data from multiple sources and formats into existing pipelines
- Implement automated testing, versioning, and reproducibility for data workflows
- Use Python for data processing and pipeline orchestration
- Maintain data architecture documentation and best practices
- Participate in code reviews and technical design discussions, collaborating across engineering and research teams
Nice to Have
- Experience with MLOps frameworks
- Knowledge of feature store implementations
- Background in real-time data processing
- Familiarity with data mesh architectures
- Experience with large-scale data platforms
- Contributions to open-source data projects
- Advanced degree in computer science or a related field
- Published work in data engineering or ML systems
Compensation
Competitive salary and equity package
Work Arrangement
Remote with flexible hours
Team
Collaborative team of data scientists, engineers, and ML researchers
Tech Stack
Python, AWS (S3, Lambda, Glue, Redshift), Airflow, Docker, Kubernetes, Terraform, PostgreSQL, Apache Parquet, Pandas, NumPy
Impact
Your work will directly enable faster iteration on machine learning models by ensuring clean, reliable, and timely data delivery across research and production systems.