Data Engineer (Web Scraping) - Sage Labs

Category: Data
Tags: web scraping, data extraction, data transformation, data processing, ETL

The Role:
We are looking for an experienced Data Engineer with expertise in web scraping, data extraction, data transformation, and large-scale data processing. In this role, you will be responsible for designing, developing, and maintaining a robust and scalable web scraping infrastructure to gather, transform, and process data from diverse online sources. You will collaborate with data scientists, ML engineers, and software engineers to ensure clean, structured, and high-quality data for our AI-driven commerce solutions.


Key Responsibilities:

  • Design, build, and maintain scalable and distributed web scraping pipelines to extract structured and unstructured data from multiple online sources.

  • Develop solutions to handle anti-scraping mechanisms (CAPTCHAs, IP blocking, JavaScript-rendered content, etc.) using proxies, headless browsers, and other techniques; a minimal headless-browser sketch follows this list.

  • Implement best practices for data extraction, transformation, and loading (ETL), ensuring high accuracy and consistency.

  • Work with large-scale data processing frameworks (e.g., Apache Spark, Dask, or similar) to clean, transform, and analyze scraped data efficiently.

  • Optimize scraping strategies for performance, reliability, and maintainability.

  • Monitor and maintain scraping infrastructure, ensuring system uptime and quick response to website changes.

  • Collaborate with data science and engineering teams to integrate transformed web-scraped data into data lakes, warehouses, and ML pipelines.

  • Ensure compliance with ethical and legal guidelines regarding web scraping and data usage.
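
As an illustration of the headless-browser work described above, here is a minimal sketch using Playwright to render a JavaScript-heavy page through a proxy and pull structured fields from the rendered DOM. The URL, proxy address, and CSS selectors are hypothetical placeholders, not details from this posting.

```python
# Minimal sketch: render a JavaScript-heavy page through a proxy with Playwright,
# then extract structured fields from the rendered DOM.
# The URL, proxy address, and CSS selectors below are hypothetical placeholders.
from playwright.sync_api import sync_playwright

def scrape_rendered_page(url: str, proxy_server: str) -> list[dict]:
    with sync_playwright() as p:
        # Route traffic through a (rotating) proxy to reduce IP-based blocking.
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": proxy_server},
        )
        page = browser.new_page()
        # Wait until network activity settles so JS-rendered content is present.
        page.goto(url, wait_until="networkidle")
        items = []
        for card in page.query_selector_all("div.product-card"):  # placeholder selector
            title = card.query_selector("h2")
            price = card.query_selector(".price")
            items.append({
                "title": title.inner_text().strip() if title else None,
                "price": price.inner_text().strip() if price else None,
            })
        browser.close()
        return items

if __name__ == "__main__":
    # Placeholder target and proxy; a real pipeline would draw from a proxy pool.
    print(scrape_rendered_page("https://example.com/catalog", "http://127.0.0.1:8080"))
```

In practice a component like this would sit inside a distributed pipeline with proxy rotation, retries, and monitoring, per the responsibilities above.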


Requirements:

  • 5+ years of experience in data engineering, with a strong focus on web scraping, data extraction, and data transformation.

  • Expertise in Python and relevant libraries/frameworks (e.g., Scrapy, BeautifulSoup, Selenium, Playwright, Puppeteer, or similar).

  • Strong understanding of ETL processes, data transformation functions, and data pipeline architecture.

  • Experience with headless browsers and JavaScript-rendered content extraction.

  • Strong understanding of proxy management, rotating IPs, and anti-scraping evasion techniques.

  • Experience working with cloud-based solutions (GCP) for distributed data processing.

  • Proficiency in SQL and NoSQL databases (e.g., PostgreSQL, BigQuery, MongoDB, or similar) for data storage and retrieval.

  • Experience with large-scale data processing tools (e.g., Apache Spark, Dask, or Hadoop) is a plus; see the PySpark sketch after this list.

  • Familiarity with CI/CD pipelines and containerization (Docker, Kubernetes).

  • Strong problem-solving skills, attention to detail, and ability to adapt to evolving technical challenges.

  • Experience in e-commerce, search, or product catalog data scraping is a plus.
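
As a rough illustration of the ETL and large-scale processing skills listed above, the sketch below uses PySpark to clean, normalize, and deduplicate scraped product records before writing them in a warehouse-friendly format. The paths and column names are hypothetical placeholders.

```python
# Rough sketch: clean, normalize, and deduplicate scraped product records with PySpark.
# Input/output paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scraped-product-etl").getOrCreate()

# Assume one JSON object per line, as a scraping pipeline might emit.
raw = spark.read.json("s3://example-bucket/scraped/products/*.jsonl")

clean = (
    raw
    # Normalize whitespace in titles and lowercase the source URL for dedup.
    .withColumn("title", F.trim(F.col("title")))
    .withColumn("url", F.lower(F.col("url")))
    # Strip currency symbols/commas and cast price to a numeric type.
    .withColumn("price", F.regexp_replace(F.col("price"), "[^0-9.]", "").cast("double"))
    # Drop rows missing the fields downstream ML pipelines depend on.
    .dropna(subset=["title", "url", "price"])
    # Keep one record per URL; scraped feeds often contain duplicates.
    .dropDuplicates(["url"])
)

# Write warehouse-friendly Parquet for downstream lakes, warehouses, and ML pipelines.
clean.write.mode("overwrite").parquet("s3://example-bucket/clean/products/")
```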


Apply for this job
Please mention that you found this job on remotewlb.com. Thanks & good luck!