
Building Image Datasets via Web Scraping: Best Practices for Machine Learning using ScrapeGraphAI

Learn how to build image datasets via web scraping using ScrapeGraphAI.

Tutorials · 5 min read · By Mohammad Ehsan Ansari

In machine learning, and especially in computer vision, a high-quality dataset is often more valuable than a complex model architecture. If you're building a system that needs to classify products, identify logos, recognize fashion trends, or understand packaging design, you need a dataset that matches your domain exactly. Public datasets often lack the required specificity or freshness. This guide explains how to build large-scale, high-quality image datasets by scraping e-commerce websites with the ScrapeGraphAI API, filtering images by resolution and aspect ratio, deduplicating them with perceptual hashing, and preprocessing them for model readiness with OpenCV.

Why Use ScrapeGraphAI Instead of Traditional Scraping?

ScrapeGraphAI is a language model-powered scraping framework that lets you extract structured data from websites using prompts and schemas. Unlike traditional methods such as BeautifulSoup or XPath, ScrapeGraphAI interprets web content contextually. This means it can adapt to different page layouts and extract data that aligns with your goals.

Key advantages:

  • No brittle HTML parsing or CSS selectors
  • Schema-first structured output
  • LLM-based understanding of content
  • Ideal for modern JavaScript-heavy websites
  • Works well with OpenAI, Groq, Mistral, and other providers

Whether you're targeting images, titles, prices, or user reviews, ScrapeGraphAI can retrieve data in a single call using its public API.

Step 1: Prepare Your API Call

Instead of using an SDK, you can call the API directly with Python's requests library.

Sample Code (Using the ScrapeGraphAI API):

python
import requests

# Authenticate with your ScrapeGraphAI API key
headers = {
    "SGAI-APIKEY": "your-api-key",
    "Content-Type": "application/json"
}

# Describe what to extract and the shape of the output
payload = {
    "website_url": "https://example.com/product-page",
    "user_prompt": "Extract product title and image URL from this e-commerce product page.",
    "output_schema": {
        "product_title": "string",
        "image_url": "string"
    }
}

response = requests.post(
    "https://api.scrapegraphai.com/v1/smartscraper",
    json=payload,
    headers=headers,
)
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()
print(data["result"])

Loop through a list of URLs or paginate to collect thousands of records.
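For example, a minimal collection loop might look like the sketch below. It reuses the headers and payload from above; the product_urls list and the one-second delay are assumptions, so adjust the pacing to the API's rate limits:

python
import time

product_urls = ["https://example.com/product-page"]  # replace with your URL list
image_urls = []

for url in product_urls:
    payload["website_url"] = url
    try:
        resp = requests.post(
            "https://api.scrapegraphai.com/v1/smartscraper",
            json=payload,
            headers=headers,
        )
        resp.raise_for_status()
        result = resp.json()["result"]
        if result.get("image_url"):
            image_urls.append(result["image_url"])
    except requests.RequestException:
        continue  # log failed URLs and retry them later
    time.sleep(1)  # simple pacing between requests

The resulting image_urls list feeds directly into Step 2.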

Step 2: Download and Filter Images by Quality

Once you have your image URLs, download them and filter out low-quality samples. The filter below drops unreadable files, images smaller than 300 px on either side, and images with extreme aspect ratios (outside roughly 0.75–1.8).

python
import os
import urllib.request
import cv2

def download_and_filter(urls, folder="dataset"):
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(urls):
        try:
            path = os.path.join(folder, f"{i}.jpg")
            urllib.request.urlretrieve(url, path)
            img = cv2.imread(path)
            if img is None:  # unreadable or corrupt file
                os.remove(path)
                continue
            h, w = img.shape[:2]
            ratio = w / h
            # Drop small images and extreme aspect ratios
            if h < 300 or w < 300 or ratio < 0.75 or ratio > 1.8:
                os.remove(path)
        except Exception:
            continue  # skip failed downloads; log them in production

Step 3: Deduplicate with Perceptual Hashing

Scraped product catalogs often contain the same photo many times. A perceptual hash (pHash) fingerprints an image's visual content, so exact and near-exact duplicates map to the same hash even after resizing or re-encoding.

python
import os
from PIL import Image
import imagehash

def deduplicate_images(folder="dataset"):
    seen = {}
    for fname in os.listdir(folder):
        path = os.path.join(folder, fname)
        try:
            img = Image.open(path)
            hash_val = imagehash.phash(img)  # perceptual hash of the image
            if hash_val in seen:
                os.remove(path)  # exact-hash duplicate of a kept image
            else:
                seen[hash_val] = fname
        except Exception:
            continue  # unreadable file; skip it
    return seen  # {hash: filename} map, useful for near-duplicate checks
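Exact hash matching only removes images whose hashes collide exactly. pHashes can also be compared by Hamming distance to catch near-duplicates; the imagehash library exposes this through subtraction. A minimal sketch (the threshold of 5 bits is an assumption to tune on your data):

python
def near_duplicates(hashes, threshold=5):
    # hashes: the {hash: filename} map returned by deduplicate_images
    items = list(hashes.items())
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            h1, f1 = items[i]
            h2, f2 = items[j]
            if h1 - h2 <= threshold:  # '-' is Hamming distance for ImageHash
                pairs.append((f1, f2))
    return pairs

The pairwise comparison is O(n²), so for very large datasets consider an index structure such as a BK-tree instead.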

Step 4: Preprocess Images for Deep Learning Models

Most vision models expect a fixed input size, so resize everything to a common resolution; 224×224 is the standard input for many ImageNet-pretrained backbones.

python
def preprocess_images(input_folder="dataset", output_folder="processed"):
    os.makedirs(output_folder, exist_ok=True)
    for fname in os.listdir(input_folder):
        try:
            img = cv2.imread(os.path.join(input_folder, fname))
            if img is None:
                continue
            img = cv2.resize(img, (224, 224))
            # Note: cv2.imread/imwrite work in BGR order. Convert BGR -> RGB
            # in your data-loading pipeline, not before imwrite, or the saved
            # files will have swapped channels.
            cv2.imwrite(os.path.join(output_folder, fname), img)
        except Exception:
            continue
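From here, a minimal sketch of loading the processed folder into a normalized NumPy array for training (it assumes every file is a readable 224×224 image):

python
import numpy as np

def load_dataset(folder="processed"):
    images = []
    for fname in sorted(os.listdir(folder)):
        img = cv2.imread(os.path.join(folder, fname))
        if img is None:
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # most models expect RGB
        images.append(img.astype("float32") / 255.0)  # scale to [0, 1]
    return np.stack(images)  # shape: (N, 224, 224, 3)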

Best Practices

  • Use consistent schemas for all URLs
  • Validate schema fields (e.g., URLs, file formats); see the sketch after this list
  • Log failures and retry problematic pages
  • Deduplicate before training
  • Use JSON or CSV alongside image datasets
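As a quick sketch of the validation point above (the accepted extensions and the helper name are illustrative assumptions):

python
from urllib.parse import urlparse

VALID_EXTS = (".jpg", ".jpeg", ".png", ".webp")

def is_valid_image_url(url):
    # Cheap sanity check before downloading anything
    if not isinstance(url, str):
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return parsed.path.lower().endswith(VALID_EXTS)

Filtering records this way before Step 2 saves bandwidth and keeps junk files out of your dataset.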

Example Use Cases

  • Visual product classifiers for fashion or electronics
  • Competitor price and image comparison systems
  • Dataset creation for GAN training (generative models)
  • Image search engines using embeddings

Frequently Asked Questions

Can I scrape 1000+ pages with the API?

Yes. Loop through your list of URLs (or through paginated listing pages) and call the API with the same schema for each request.

How is the API better than using SDKs?

Using the API directly means fewer dependencies, language independence, and more control over how and when requests are made. It also lets you implement your own logging and retry logic.

Can I extract price, name, rating, and image together?

Yes. Define all those keys in your schema and include them in the prompt; ScrapeGraphAI maps the data accordingly.
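For instance, a combined schema might look like this sketch (the field names are illustrative):

python
payload["output_schema"] = {
    "product_title": "string",
    "price": "string",
    "rating": "string",
    "image_url": "string"
}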

Is rate limiting enforced?

Yes. Respect the rate limits (check the headers returned in API responses) and use delays or exponential backoff for large jobs, as in the sketch below.
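A minimal retry helper with exponential backoff might look like this sketch. Treating HTTP 429 as the rate-limit signal is an assumption; check the API documentation for the exact status codes and headers:

python
import time
import requests

def post_with_backoff(url, payload, headers, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers)
        if resp.status_code != 429:  # assumed rate-limit status code
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
        delay *= 2  # double the wait after each rate-limited attempt
    raise RuntimeError("Still rate limited after retries")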

What format is the result returned in?

Standard JSON, with keys mapped to your schema fields. You can write the results directly to CSV or load them into pandas.
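For example, collecting each call's result into a list of dicts and writing it out with pandas (a sketch, assuming the schema from Step 1):

python
import pandas as pd

records = []
# Inside your scraping loop:
# records.append(data["result"])  # e.g. {"product_title": ..., "image_url": ...}

df = pd.DataFrame(records)
df.to_csv("products.csv", index=False)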

Conclusion

With ScrapeGraphAI's API, you can now go from URL lists to structured data and preprocessed images—all within a few hundred lines of code. No fragile parsing. No manual guesswork. If you're building a visual ML system, this approach lets you stay lean, fast, and accurate from day one.
