Responsibilities
- Pipeline Development & Data Integration
  - Build, maintain, and optimize ETL/ELT pipelines using Python, SQL, or Scala
  - Orchestrate workflows using Airflow, Prefect, Dagster, or similar orchestration tools
  - Ingest structured and unstructured data from APIs, SaaS platforms, databases, files, and streaming systems
  - Develop scalable connectors and automated ingestion workflows
- Data Warehousing & Modeling
  - Manage and optimize cloud data warehouses such as Snowflake, BigQuery, or Redshift
  - Design scalable schemas using star and snowflake modeling techniques
  - Implement partitioning, clustering, indexing, and performance optimization strategies
  - Build clean, analytics-ready datasets for business intelligence and reporting use cases
- Data Quality, Governance & Reliability
  - Implement validation checks, anomaly detection, logging, and monitoring to ensure data integrity
  - Enforce naming conventions, lineage tracking, and documentation standards using tools such as dbt or Great Expectations
  - Maintain audit-ready data processes and ensure compliance with GDPR, HIPAA, or industry-specific requirements
  - Monitor pipeline health and proactively resolve failures or inconsistencies
- Streaming & Real-Time Data Processing
  - Build and manage real-time data pipelines using Kafka, Kinesis, Pub/Sub, or similar platforms
  - Support low-latency ingestion and event-driven architectures for time-sensitive applications
  - Monitor streaming infrastructure and optimize throughput and reliability
- Collaboration & Analytics Enablement
  - Partner closely with analysts, data scientists, and business stakeholders to deliver reliable datasets
  - Support dashboard and reporting initiatives across Tableau, Looker, or Power BI
  - Translate business requirements into scalable data solutions and models
  - Maintain clear technical documentation for pipelines, schemas, and workflows
- Infrastructure, DevOps & Automation
  - Containerize data services using Docker and manage deployments through Kubernetes when applicable
  - Automate deployments using CI/CD pipelines such as GitHub Actions, Jenkins, or GitLab CI
  - Manage cloud infrastructure using Terraform, CloudFormation, or similar Infrastructure-as-Code tools
  - Continuously optimize performance, scalability, reliability, and cloud costs
Requirements
- 3+ years of experience in Data Engineering, Back-End Engineering, or Data Infrastructure roles
- Strong proficiency in Python and SQL
- Experience with at least one modern data warehouse (Snowflake, Redshift, BigQuery)
- Hands-on experience with orchestration tools such as Airflow or Prefect
- Strong understanding of ETL/ELT pipelines, data modeling, and data transformation workflows
- Familiarity with cloud platforms such as AWS, GCP, or Azure
Nice to Have
- Experience with dbt for data modeling and transformation management
- Experience with streaming and event-driven data pipelines (Kafka, Kinesis, Pub/Sub)
- Experience with cloud-native data services such as AWS Glue, GCP Dataflow, or Azure Data Factory
- Familiarity with Docker, Kubernetes, Terraform, or CI/CD workflows
- Background in regulated industries such as healthcare, fintech, or enterprise SaaS
- Experience optimizing warehouse costs and query performance at scale
Additional Information
- Working hours align with U.S. client business hours, with flexibility for pipeline monitoring, deployments, and data refresh cycles