The Role:
We are looking for an experienced Data Engineer with expertise in web scraping, data extraction, data transformation, and large-scale data processing. In this role, you will design, develop, and maintain robust, scalable web scraping infrastructure to gather, transform, and process data from diverse online sources. You will collaborate with data scientists, ML engineers, and software engineers to deliver clean, structured, high-quality data for our AI-driven commerce solutions.
Key Responsibilities:
Design, build, and maintain scalable and distributed web scraping pipelines to extract structured and unstructured data from multiple online sources.
Develop solutions to handle anti-scraping mechanisms (CAPTCHAs, IP blocking, JavaScript-rendered content, etc.) using proxies, headless browsers, and other techniques.
Implement best practices for data extraction, transformation, and loading (ETL), ensuring high accuracy and consistency.
Work with large-scale data processing frameworks (e.g., Apache Spark, Dask, or similar) to clean, transform, and analyze scraped data efficiently.
Optimize scraping strategies for performance, reliability, and maintainability.
Monitor and maintain scraping infrastructure, ensuring system uptime and quick response to website changes.
Collaborate with data science and engineering teams to integrate transformed web-scraped data into data lakes, warehouses, and ML pipelines.
Ensure compliance with ethical and legal guidelines regarding web scraping and data usage.
Requirements:
5+ years of experience in data engineering, with a strong focus on web scraping, data extraction, and data transformation.
Expertise in Python and relevant libraries/frameworks (e.g., Scrapy, BeautifulSoup, Selenium, Playwright, Puppeteer, or similar).
Strong understanding of ETL processes, data transformation functions, and data pipeline architecture.
Experience with headless browsers and JavaScript-rendered content extraction.
Strong understanding of proxy management, rotating IPs, and anti-scraping evasion techniques.
Experience working with cloud-based solutions (GCP) for distributed data processing.
Proficiency in SQL and NoSQL databases (e.g., PostgreSQL, BigQuery, MongoDB, or similar) for data storage and retrieval.
Experience with large-scale data processing tools (e.g., Apache Spark, Dask, or Hadoop) is a plus.
Familiarity with CI/CD pipelines and containerization (Docker, Kubernetes).
Strong problem-solving skills, attention to detail, and ability to adapt to evolving technical challenges.
Experience in e-commerce, search, or product catalog data scraping is a plus.