
Building Image Datasets via Web Scraping: Best Practices for Machine Learning using ScrapeGraphAI

Learn how to build image datasets via web scraping using ScrapeGraphAI.

Tutorials · 5 min read · By Mohammad Ehsan Ansari

In machine learning, and especially in computer vision, a high-quality dataset is often more valuable than a complex model architecture. If you're building a system that needs to classify products, identify logos, recognize fashion trends, or understand packaging design, you need a dataset that matches your domain exactly. Public datasets often lack the required specificity or freshness. This guide explains how to build large-scale, high-quality image datasets by scraping e-commerce websites with the ScrapeGraphAI API, filtering images by resolution and aspect ratio, deduplicating them with perceptual hashing, and preprocessing them for model readiness with OpenCV.

Why Use ScrapeGraphAI Instead of Traditional Scraping?

ScrapeGraphAI is a language model-powered scraping framework that lets you extract structured data from websites using prompts and schemas. Unlike traditional methods such as BeautifulSoup or XPath, ScrapeGraphAI interprets web content contextually. This means it can adapt to different page layouts and extract data that aligns with your goals.

Key advantages:

  • No brittle HTML parsing or CSS selectors
  • Schema-first structured output
  • LLM-based understanding of content
  • Ideal for modern JavaScript-heavy websites
  • Works well with OpenAI, Groq, Mistral, and other providers

Whether you're targeting images, titles, prices, or user reviews, ScrapeGraphAI can retrieve data in a single call using its public API.

Step 1: Prepare Your API Call

Instead of using an SDK, you can call the API directly with Python's requests library.

Sample Code (Using the ScrapeGraphAI API):

python
import requests

# Authenticate with your ScrapeGraphAI API key
headers = {
    "SGAI-APIKEY": "your-api-key",
    "Content-Type": "application/json"
}

# Describe what to extract and the shape of the output
payload = {
    "website_url": "https://example.com/product-page",
    "user_prompt": "Extract product title and image URL from this e-commerce product page.",
    "output_schema": {
        "product_title": "string",
        "image_url": "string"
    }
}

response = requests.post(
    "https://api.scrapegraphai.com/v1/smartscraper",
    json=payload,
    headers=headers,
)
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()
print(data["result"])

Loop through a list of URLs or paginate to collect thousands of records.
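For example, a minimal collection loop might look like the sketch below. It reuses the headers and payload from above; the product_urls list and the one-second delay are assumptions, so adjust the pacing to the API's rate limits:

python
import time

product_urls = ["https://example.com/product-page"]  # replace with your URL list
image_urls = []

for url in product_urls:
    payload["website_url"] = url
    try:
        resp = requests.post(
            "https://api.scrapegraphai.com/v1/smartscraper",
            json=payload,
            headers=headers,
        )
        resp.raise_for_status()
        result = resp.json()["result"]
        if result.get("image_url"):
            image_urls.append(result["image_url"])
    except requests.RequestException:
        continue  # log failed URLs and retry them later
    time.sleep(1)  # simple pacing between requests

The resulting image_urls list feeds directly into Step 2.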

Step 2: Download and Filter Images by Quality

Once you have your image URLs, download them and filter out low-quality samples. The filter below drops unreadable files, images smaller than 300 px on either side, and images with extreme aspect ratios (outside roughly 0.75–1.8).

python
import os
import urllib.request
import cv2

def download_and_filter(urls, folder="dataset"):
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(urls):
        try:
            path = os.path.join(folder, f"{i}.jpg")
            urllib.request.urlretrieve(url, path)
            img = cv2.imread(path)
            if img is None:  # unreadable or corrupt file
                os.remove(path)
                continue
            h, w = img.shape[:2]
            ratio = w / h
            # Drop small images and extreme aspect ratios
            if h < 300 or w < 300 or ratio < 0.75 or ratio > 1.8:
                os.remove(path)
        except Exception:
            continue  # skip failed downloads; log them in production

Step 3: Deduplicate with Perceptual Hashing

Scraped product catalogs often contain the same photo many times. A perceptual hash (pHash) fingerprints an image's visual content, so exact and near-exact duplicates map to the same hash even after resizing or re-encoding.

python
import os
from PIL import Image
import imagehash

def deduplicate_images(folder="dataset"):
    seen = {}
    for fname in os.listdir(folder):
        path = os.path.join(folder, fname)
        try:
            img = Image.open(path)
            hash_val = imagehash.phash(img)  # perceptual hash of the image
            if hash_val in seen:
                os.remove(path)  # exact-hash duplicate of a kept image
            else:
                seen[hash_val] = fname
        except Exception:
            continue  # unreadable file; skip it
    return seen  # {hash: filename} map, useful for near-duplicate checks
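Exact hash matching only removes images whose hashes collide exactly. pHashes can also be compared by Hamming distance to catch near-duplicates; the imagehash library exposes this through subtraction. A minimal sketch (the threshold of 5 bits is an assumption to tune on your data):

python
def near_duplicates(hashes, threshold=5):
    # hashes: the {hash: filename} map returned by deduplicate_images
    items = list(hashes.items())
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            h1, f1 = items[i]
            h2, f2 = items[j]
            if h1 - h2 <= threshold:  # '-' is Hamming distance for ImageHash
                pairs.append((f1, f2))
    return pairs

The pairwise comparison is O(n²), so for very large datasets consider an index structure such as a BK-tree instead.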

Step 4: Preprocess Images for Deep Learning Models

Most vision models expect a fixed input size, so resize everything to a common resolution; 224×224 is the standard input for many ImageNet-pretrained backbones.

python
def preprocess_images(input_folder="dataset", output_folder="processed"):
    os.makedirs(output_folder, exist_ok=True)
    for fname in os.listdir(input_folder):
        try:
            img = cv2.imread(os.path.join(input_folder, fname))
            if img is None:
                continue
            img = cv2.resize(img, (224, 224))
            # Note: cv2.imread/imwrite work in BGR order. Convert BGR -> RGB
            # in your data-loading pipeline, not before imwrite, or the saved
            # files will have swapped channels.
            cv2.imwrite(os.path.join(output_folder, fname), img)
        except Exception:
            continue
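From here, a minimal sketch of loading the processed folder into a normalized NumPy array for training (it assumes every file is a readable 224×224 image):

python
import numpy as np

def load_dataset(folder="processed"):
    images = []
    for fname in sorted(os.listdir(folder)):
        img = cv2.imread(os.path.join(folder, fname))
        if img is None:
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # most models expect RGB
        images.append(img.astype("float32") / 255.0)  # scale to [0, 1]
    return np.stack(images)  # shape: (N, 224, 224, 3)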

Best Practices

  • Use consistent schemas for all URLs
  • Validate schema fields (e.g., URLs, file formats); see the sketch after this list
  • Log failures and retry problematic pages
  • Deduplicate before training
  • Use JSON or CSV alongside image datasets
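As a quick sketch of the validation point above (the accepted extensions and the helper name are illustrative assumptions):

python
from urllib.parse import urlparse

VALID_EXTS = (".jpg", ".jpeg", ".png", ".webp")

def is_valid_image_url(url):
    # Cheap sanity check before downloading anything
    if not isinstance(url, str):
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return parsed.path.lower().endswith(VALID_EXTS)

Filtering records this way before Step 2 saves bandwidth and keeps junk files out of your dataset.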

Example Use Cases

  • Visual product classifiers for fashion or electronics
  • Competitor price and image comparison systems
  • Dataset creation for GAN training (generative models)
  • Image search engines using embeddings

Frequently Asked Questions

Can I scrape 1000+ pages with the API?

Yes. Loop through your list of URLs (or through paginated listing pages) and call the API with the same schema for each request.

How is the API better than using SDKs?

Using the API directly means fewer dependencies, language independence, and more control over how and when requests are made. It also lets you implement your own logging and retry logic.

Can I extract price, name, rating, and image together?

Yes. Define all those keys in your schema and include them in the prompt; ScrapeGraphAI maps the data accordingly.
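For instance, a combined schema might look like this sketch (the field names are illustrative):

python
payload["output_schema"] = {
    "product_title": "string",
    "price": "string",
    "rating": "string",
    "image_url": "string"
}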

Is rate limiting enforced?

Yes. Respect the rate limits (check the headers returned in API responses) and use delays or exponential backoff for large jobs, as in the sketch below.
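A minimal retry helper with exponential backoff might look like this sketch. Treating HTTP 429 as the rate-limit signal is an assumption; check the API documentation for the exact status codes and headers:

python
import time
import requests

def post_with_backoff(url, payload, headers, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers)
        if resp.status_code != 429:  # assumed rate-limit status code
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
        delay *= 2  # double the wait after each rate-limited attempt
    raise RuntimeError("Still rate limited after retries")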

What format is the result returned in?

Standard JSON, with keys mapped to your schema fields. You can write the results directly to CSV or load them into pandas.
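For example, collecting each call's result into a list of dicts and writing it out with pandas (a sketch, assuming the schema from Step 1):

python
import pandas as pd

records = []
# Inside your scraping loop:
# records.append(data["result"])  # e.g. {"product_title": ..., "image_url": ...}

df = pd.DataFrame(records)
df.to_csv("products.csv", index=False)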

Conclusion

With ScrapeGraphAI's API, you can now go from URL lists to structured data and preprocessed images—all within a few hundred lines of code. No fragile parsing. No manual guesswork. If you're building a visual ML system, this approach lets you stay lean, fast, and accurate from day one.
