The Data Quality Engineer is responsible for establishing and maintaining data trust, governance, and certification across the enterprise Data Lakehouse platform. The role develops the automated data quality frameworks, semantic models, and governance processes that enable reliable data consumption for business intelligence and machine learning, on a stack that includes Databricks, Apache Iceberg, AWS, Dremio, Atlan, and Power BI.
Responsibilities
- Design and manage automated data validation systems across data pipeline layers from ingestion to curated datasets.
- Create and execute tests for schema changes, data anomalies, record reconciliation, timeliness, and referential integrity.
- Integrate data quality checks into Databricks environments using Delta Lake, Delta Live Tables, and Unity Catalog (see the expectations sketch after this list).
- Implement Apache Iceberg pipeline validations that account for schema evolution and time travel (see the snapshot-comparison sketch after this list).
- Define and manage data certification processes to ensure only approved datasets are used for analytics and AI.
- Use metadata tools like Atlan and AWS Glue Catalog for managing data lineage, business glossaries, and access policies.
- Develop a governed semantic layer on high-quality data to serve BI and AI/ML workloads.
- Support Power BI reporting with certified metrics and self-service data access.
- Work with data stewards to align data models with business-defined terminology in Atlan.
- Certify datasets used in conversational analytics and natural language query systems.
- Collaborate with AI teams to connect LLM-based query interfaces with Dremio, Databricks SQL, and Power BI.
- Ensure LLM-generated insights are based on verified, high-integrity datasets to prevent inaccurate outputs.
- Produce and maintain feature-ready datasets for machine learning training and inference in SageMaker Studio.
- Partner with ML engineers to validate that input data meets all nine data quality dimensions.
- Monitor for data drift to keep model performance reliable over time (a drift-scoring sketch follows this list).
- Enforce continuous compliance with the nine data quality dimensions: accuracy, completeness, consistency, timeliness, validity, uniqueness, integrity, conformity, and reliability.
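As context for the Delta Live Tables responsibility above, here is a minimal sketch of row-level expectations in a DLT pipeline. It only runs inside a Databricks DLT pipeline, and the table and column names (raw_orders, order_id, order_ts, amount) are hypothetical:

import dlt

@dlt.table(comment="Curated orders with row-level quality gates.")
@dlt.expect("fresh_enough", "order_ts >= date_sub(current_date(), 7)")  # log violations, keep rows
@dlt.expect_or_drop("non_null_key", "order_id IS NOT NULL")             # drop offending rows
@dlt.expect_or_fail("valid_amount", "amount >= 0")                      # fail the update outright
def curated_orders():
    return dlt.read_stream("raw_orders")  # hypothetical upstream table

The three decorator variants map to the usual triage options for a quality violation: warn, quarantine, or halt the pipeline.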
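For the Iceberg validation bullet, a snapshot-comparison sketch in PySpark. It assumes a Spark session already configured with an Iceberg catalog; the table name and snapshot id are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog is configured

# Read the current state and, via time travel, a prior snapshot of the same table.
current = spark.read.table("lakehouse.sales.orders")
previous = spark.sql(
    "SELECT * FROM lakehouse.sales.orders VERSION AS OF 5867126458487843617"  # hypothetical snapshot id
)

# Record reconciliation: the curated table should not lose rows across a schema change.
assert current.count() >= previous.count(), "row count regressed after schema change"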
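And for the drift-monitoring bullet, one common scoring approach is the population stability index (PSI). This plain NumPy sketch and its thresholds are illustrative, not a prescribed standard:

import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((p_cur - p_base) * ln(p_cur / p_base)) over baseline-derived bins."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_counts, _ = np.histogram(baseline, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    eps = 1e-6  # avoids division by zero and log(0) in empty bins
    p_base = base_counts / max(base_counts.sum(), 1) + eps
    p_cur = cur_counts / max(cur_counts.sum(), 1) + eps
    return float(np.sum((p_cur - p_base) * np.log(p_cur / p_base)))

# Rule-of-thumb reading: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.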
Requirements
- Extensive experience in data engineering, data quality, or data governance roles.
- Proficiency in Python, PySpark, and SQL for data processing and validation.
- Hands-on experience with Databricks, including Delta Lake, Unity Catalog, and Delta Live Tables.
- Practical knowledge of Apache Iceberg and its integration into data pipelines.
- Strong familiarity with AWS data services such as S3, Glue ETL, Glue Catalog, Athena, EMR, Redshift, and SageMaker Studio.
- Experience with Power BI, including semantic model design, DAX, and dataset certification.
- Working knowledge of query engines like Trino or Presto.
- Experience using data quality frameworks such as Great Expectations, Deequ, or Soda (see the sketch after this list).
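As an illustration of the framework experience listed above, a minimal Great Expectations check using the classic pandas-backed API (API details vary across GX versions; the column names and toy data are hypothetical):

import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, 2], "email": ["a@x.com", None, "c@x.com"]}))

completeness = df.expect_column_values_to_not_be_null("email")  # completeness dimension
uniqueness = df.expect_column_values_to_be_unique("user_id")    # uniqueness dimension

print(completeness.success, uniqueness.success)  # False, False for this toy frame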
Nice to Have
- Exposure to conversational analytics or natural language query systems over data lakehouses or Power BI.
- Experience integrating LLM pipelines with enterprise data platforms using tools like LangChain, OpenAI, or Amazon Bedrock.
- Familiarity with data observability platforms such as Monte Carlo, Bigeye, Datadog, or Grafana.
- Understanding of data compliance standards including GDPR, CCPA, and HIPAA.
- Cloud certifications such as AWS Data Analytics Specialty or Databricks Certified Data Engineer.
Tech Stack
Databricks, Apache Iceberg, Amazon S3, AWS Glue ETL, AWS Glue Catalog, Amazon Athena, Amazon EMR, Amazon Redshift, SageMaker Studio, Dremio, Atlan, Power BI, Delta Lake, Delta Live Tables, Unity Catalog, Python, PySpark, SQL, Trino, Presto, Great Expectations, Deequ, Soda, Monte Carlo, Bigeye, Datadog, Grafana, LangChain, OpenAI, Amazon Bedrock
Benefits
- Equal employment opportunity and non-discrimination policy
- Inclusive workplace culture that values diversity, belonging, respect, and meaningful individual contribution
- Accommodation support for applicants with disabilities
- A problem-solving-centered organization committed to leading in global innovation and advancing social impact through technology
Additional Information
- The application deadline is expected to be October 25, 2024, though the position may close earlier if a suitable candidate is identified.
- The organization does not require payment as a condition of applying or receiving a job offer.
- Accommodation requests can be submitted to jobs.accommodations@wdc.com with a description of the need and the relevant job title or requisition number.
- Harassment and discrimination based on legally protected characteristics are strictly prohibited.
- Compliance with Equal Employment Opportunity laws and regulations is required.
- Candidates are encouraged to report unethical recruitment practices to the WD Ethics Helpline or compliance@wdc.com.
