Requirements
- Extensive Data Engineering Experience: 8–12+ years in data engineering or backend engineering, including senior/lead roles. Experience designing end-to-end data systems, solving scale/performance challenges, integrating diverse sources, and operating pipelines in production.
- Big Data & Cloud Expertise: Strong skills in Python and/or Java/Scala. Deep experience with Spark, Hadoop, Hive/Impala, and Airflow. Hands-on work with AWS, Azure, or GCP using cloud-native processing and storage services (e.g., S3, Glue, EMR, Data Factory). Ability to design scalable, cost-efficient workloads for experimental and variable R&D environments.
- AI/ML Data Lifecycle Knowledge: Understanding of data needs for machine learning, including dataset preparation, feature/label management, and support for real-time or batch training pipelines. Experience with feature stores or streaming data is a plus.
- Leadership & Mentorship: Ability to translate ambiguous goals into clear plans, guide engineers, and lead technical execution.
- Problem-Solving Mindset: Approach issues systematically, using analysis and data to select scalable, maintainable solutions.
- Education & Background: Bachelor’s degree in Computer Science, Engineering, or a related field. 8–12+ years of proven experience architecting and operating production-grade data systems, especially those supporting analytics or ML workloads.
- Pipeline Development: Expert in ETL/ELT design and implementation, working with diverse data sources, transformations, and targets. Strong experience scheduling and orchestrating pipelines using Airflow or similar tools.
- Programming & Databases: Advanced Python and/or Scala/Java skills and strong software engineering fundamentals (version control, CI, code reviews). Excellent SQL abilities, including performance tuning on large datasets.
- Big Data Technologies: Hands-on Spark experience (RDDs/DataFrames, optimization). Familiar with Hadoop components (HDFS, YARN), Hive/Impala, and streaming systems like Kafka or Kinesis.
- Cloud Infrastructure: Experience deploying data systems on AWS/Azure/GCP. Familiar with cloud data lakes, warehouses (Redshift, BigQuery, Snowflake), and cloud-based processing engines (EMR, Dataproc, Glue, Synapse). Comfortable with Linux and shell scripting.
- Data Governance & Security: Knowledge of data privacy regulations, PII handling, access controls, encryption/masking, and data quality validation. Experience with metadata management or data cataloging tools is a plus.
- Collaboration & Agile Delivery: Strong communication skills and experience working with cross-functional teams. Ability to document designs clearly and deliver iteratively using agile practices.
Nice to Have
- Advanced Cloud & Data Platform Expertise: Experience with AWS data engineering services, Databricks, and Lakehouse/Delta Lake architectures (including bronze/silver/gold layers).
- Modern Data Stack: Familiarity with dbt, Great Expectations, containerization (Docker/Kubernetes), and monitoring tools like Grafana or cloud-native monitoring.
- DevOps & CI/CD for Data: Experience implementing CI/CD pipelines for data workflows and using IaC tools like Terraform or CloudFormation. Knowledge of data versioning (e.g., Delta Lake time travel) and of continuous delivery practices for ML systems.
- Continuous Learning: Motivation to explore emerging technologies, especially in AI and generative AI data workflows.
