About the Role
The role involves building and optimizing data extraction workflows, ensuring data accuracy and reliability, and supporting analytics initiatives through robust pipeline architecture.
Responsibilities
- Develop and manage automated web scraping frameworks for diverse online sources
- Design scalable ETL pipelines to process unstructured and semi-structured data
- Ensure data integrity and consistency across ingestion and transformation stages
- Monitor and troubleshoot data workflows for performance and reliability
- Collaborate with data analysts and scientists to understand data requirements
- Optimize data storage solutions for efficient querying and access
- Implement error handling and retry mechanisms in data collection systems (see the sketch after this list)
- Maintain documentation for data pipelines and source configurations
- Evaluate new data sources for integration potential
- Apply data validation techniques to ensure quality standards
- Support compliance with website terms of service and data usage policies
- Work with security teams to ensure ethical data collection practices
- Improve data processing efficiency through automation and tooling
- Respond to data quality incidents with root cause analysis
- Participate in code reviews and system design discussions
- Integrate third-party APIs into existing data workflows
- Scale infrastructure to handle increasing data volume and velocity
- Use version control for pipeline development and deployment
- Stay current with changes in website structures and anti-bot measures
- Contribute to data governance and metadata management practices
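To give candidates a concrete sense of the "error handling and retry mechanisms" responsibility above, here is a minimal illustrative sketch in Python, not our production code: transient network failures and 5xx responses are retried with exponential backoff, while 4xx client errors surface immediately. The function name, timeout, and backoff parameters are all hypothetical.

```python
import logging
import time

import requests

logger = logging.getLogger(__name__)

def fetch_with_retries(url: str, max_attempts: int = 4, base_delay: float = 1.0) -> requests.Response:
    """Fetch a URL, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code < 500:
                # 4xx client errors are not retried: raise_for_status()
                # surfaces them immediately; success returns the response.
                response.raise_for_status()
                return response
            error = requests.HTTPError(f"server error {response.status_code}")
        except (requests.ConnectionError, requests.Timeout) as exc:
            error = exc
        if attempt == max_attempts:
            raise error
        delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
        logger.warning("attempt %d for %s failed (%s); retrying in %.1fs",
                       attempt, url, error, delay)
        time.sleep(delay)
```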
Nice to Have
- Master’s degree in a technical discipline
- Experience with large-scale distributed data processing tools like Spark
- Background in natural language processing or text extraction
- Knowledge of browser automation tools such as Puppeteer or Selenium (an illustrative example follows this list)
- Experience with proxy rotation and IP management for scraping
- Familiarity with CAPTCHA-solving techniques and tools
- Contributions to open-source data engineering projects
- Published work or projects involving public web data analysis
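For the browser automation item above, a minimal Selenium sketch might look like the following, assuming Selenium 4+ with Chrome installed. The target URL and CSS selector are placeholders for illustration only; the point is waiting explicitly for JavaScript-rendered content rather than sleeping blindly.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # run without a visible browser window

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical target URL and selector, for illustration only.
    driver.get("https://example.com/listings")
    # Wait up to 10s for the JavaScript-rendered elements to appear.
    items = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.listing"))
    )
    for item in items:
        print(item.text)
finally:
    driver.quit()  # always release the browser session
```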
Compensation
Competitive salary with performance-based bonuses
Work Arrangement
Hybrid remote with office availability in major cities
Team
Collaborative data engineering team within a growing technology division
Technology Stack
- Primary languages: Python, SQL
- Frameworks: Scrapy, BeautifulSoup, Selenium
- Cloud: AWS (S3, EC2, Lambda, CloudWatch)
- Orchestration: Apache Airflow (see the pipeline sketch after this list)
- Databases: PostgreSQL, MongoDB
- Containerization: Docker, Kubernetes
- Monitoring: Prometheus, Grafana
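As a sense of how these pieces fit together, here is a minimal, hypothetical Airflow DAG (assuming Airflow 2.4+ for the `schedule` argument) wiring a scrape step into validation and loading. The task bodies are stubs; a real pipeline would call the team's scraping and warehouse modules.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; real pipelines would invoke the scraping,
# validation, and loading modules here.
def scrape(**context):
    print("extract raw pages")

def validate(**context):
    print("apply data quality checks")

def load(**context):
    print("write to PostgreSQL / S3")

with DAG(
    dag_id="example_scrape_pipeline",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    scrape_task = PythonOperator(task_id="scrape", python_callable=scrape)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Declare ordering: a failed scrape blocks validation and loading.
    scrape_task >> validate_task >> load_task
```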
Data Ethics Policy
- All data collection must comply with website terms of service
- Respect for robots.txt and crawl-delay directives is mandatory (see the check sketched after this list)
- No personal data collection without explicit consent
- Regular audits of data sources for compliance
- Transparency in data usage and retention practices
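To make the robots.txt requirement concrete, Python's standard library can express the check directly. The user agent string below is hypothetical, and a production crawler would cache the parsed file per host rather than re-fetching it for every URL.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleDataBot"  # hypothetical crawler name

def allowed_to_fetch(url: str):
    """Return (allowed, crawl_delay_seconds) for a URL per robots.txt."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt
    allowed = parser.can_fetch(USER_AGENT, url)
    # Honor an explicit Crawl-delay; fall back to 1s when unspecified.
    delay = parser.crawl_delay(USER_AGENT) or 1.0
    return allowed, delay

allowed, delay = allowed_to_fetch("https://example.com/products")
if allowed:
    print(f"fetch permitted; wait at least {delay}s between requests")
else:
    print("robots.txt disallows this URL for our user agent")
```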