Building Image Datasets via Web Scraping: Best Practices for Machine Learning using ScrapeGraphAI
Learn how to build image datasets via web scraping using ScrapeGraphAI.


In machine learning, especially in computer vision, a high-quality dataset is often more valuable than a complex model architecture. If you're building a system that needs to classify products, identify logos, recognize fashion trends, or understand packaging design, you need a dataset that matches your domain exactly. Public datasets often lack the specificity or freshness required. This guide explains how to build large-scale, high-quality image datasets by scraping e-commerce websites using the ScrapeGraphAI API, filtering by resolution and aspect ratio, deduplicating images with perceptual hashing, and preprocessing them for model readiness using OpenCV.
Why Use ScrapeGraphAI Instead of Traditional Scraping?
ScrapeGraphAI is a language model-powered scraping framework that lets you extract structured data from websites using prompts and schemas. Unlike traditional methods such as BeautifulSoup or XPath, ScrapeGraphAI interprets web content contextually. This means it can adapt to different page layouts and extract data that aligns with your goals.
Key advantages:
- No brittle HTML parsing or CSS selectors
- Schema-first structured output
- LLM-based understanding of content
- Ideal for modern JavaScript-heavy websites
- Works well with OpenAI, Groq, Mistral, and other providers
Whether you're targeting images, titles, prices, or user reviews, ScrapeGraphAI can retrieve data in a single call using its public API.
Step 1: Prepare Your API Call
Instead of using an SDK, you can call the API directly with Python's requests library.
Sample Code (Using ScrapeGraphAI API):
```python
import requests

headers = {
    "SGAI-APIKEY": "your-api-key",
    "Content-Type": "application/json",
}

payload = {
    "website_url": "https://example.com/product-page",
    "user_prompt": "Extract product title and image URL from this e-commerce product page.",
    "output_schema": {
        "product_title": "string",
        "image_url": "string",
    },
}

response = requests.post(
    "https://api.scrapegraphai.com/v1/smartscraper",
    json=payload,
    headers=headers,
)
data = response.json()
print(data["result"])
```
Loop through a list of URLs or paginate to collect thousands of records.
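That loop can be sketched like this. The helper names and the pagination scheme (`?page=N`) are illustrative assumptions, not part of the ScrapeGraphAI API:

```python
import requests

API_URL = "https://api.scrapegraphai.com/v1/smartscraper"

def page_urls(base_url, num_pages):
    # Build paginated listing URLs, e.g. https://example.com/products?page=1
    return [f"{base_url}?page={n}" for n in range(1, num_pages + 1)]

def scrape_all(urls, api_key):
    headers = {"SGAI-APIKEY": api_key, "Content-Type": "application/json"}
    results = []
    for url in urls:
        payload = {
            "website_url": url,
            "user_prompt": "Extract product title and image URL.",
            "output_schema": {"product_title": "string", "image_url": "string"},
        }
        resp = requests.post(API_URL, json=payload, headers=headers, timeout=60)
        if resp.ok:  # skip failed pages; log them in a real pipeline
            results.append(resp.json().get("result"))
    return results
```

Keeping the same schema across every call makes the results trivially easy to merge into one table later.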
Step 2: Download and Filter Images by Quality
Once you have your image URLs, download them and filter out low-quality samples.
```python
import os
import urllib.request

import cv2

def download_and_filter(urls, folder="dataset"):
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(urls):
        path = os.path.join(folder, f"{i}.jpg")
        try:
            urllib.request.urlretrieve(url, path)
        except Exception:
            continue  # network error or bad URL: skip
        img = cv2.imread(path)
        if img is None:  # unreadable or corrupt file
            os.remove(path)
            continue
        h, w = img.shape[:2]
        ratio = w / h
        # Drop images that are too small or have an extreme aspect ratio
        if h < 300 or w < 300 or ratio < 0.75 or ratio > 1.8:
            os.remove(path)
```
Step 3: Deduplicate with Perceptual Hashing
Exact-byte comparison misses resized or recompressed copies of the same image; a perceptual hash (pHash) flags images that look the same even when their files differ.

```python
import os

from PIL import Image
import imagehash

def deduplicate_images(folder="dataset"):
    seen = {}
    for fname in os.listdir(folder):
        path = os.path.join(folder, fname)
        try:
            img = Image.open(path)
            # Perceptual hash: robust to resizing and recompression
            hash_val = imagehash.phash(img)
        except Exception:
            continue  # unreadable file: leave it for the quality filter
        if hash_val in seen:
            os.remove(path)  # near-duplicate of an image we already kept
        else:
            seen[hash_val] = fname
```
Step 4: Preprocess Images for Deep Learning Models
Resize everything to a fixed input size, and watch out for OpenCV's channel order: cv2.imwrite expects BGR, so converting to RGB before saving would swap the channels in the file on disk.

```python
import os

import cv2

def preprocess_images(input_folder="dataset", output_folder="processed"):
    os.makedirs(output_folder, exist_ok=True)
    for fname in os.listdir(input_folder):
        img = cv2.imread(os.path.join(input_folder, fname))
        if img is None:  # skip unreadable files
            continue
        img = cv2.resize(img, (224, 224))  # common input size for CNN backbones
        # Save as-is (BGR); apply cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # when loading batches for training, since most frameworks expect RGB.
        cv2.imwrite(os.path.join(output_folder, fname), img)
```
Best Practices
- Use consistent schemas for all URLs
- Validate schema fields (e.g., URLs, file formats)
- Log failures and retry problematic pages
- Deduplicate before training
- Use JSON or CSV alongside image datasets
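Keeping metadata next to the images is easy with the standard csv module. A minimal sketch, where the field names match the example schema earlier and `local_file` is an illustrative extra column linking each record to its downloaded image:

```python
import csv
import os

def save_metadata(records, csv_path):
    """Write scraped records (a list of dicts) to a CSV alongside the images."""
    fieldnames = sorted({key for record in records for key in record})
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)

records = [
    {"product_title": "Red sneaker", "image_url": "https://example.com/a.jpg", "local_file": "0.jpg"},
    {"product_title": "Blue jacket", "image_url": "https://example.com/b.jpg", "local_file": "1.jpg"},
]
os.makedirs("dataset", exist_ok=True)
save_metadata(records, "dataset/metadata.csv")
```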
Example Use Cases
- Visual product classifiers for fashion or electronics
- Competitor price and image comparison systems
- Dataset creation for GAN training (generative models)
- Image search engines using embeddings
Frequently Asked Questions
Can I scrape 1000+ pages with the API?
Yes. Loop through your URL list (or paginated listing pages) and call the API with the same schema for each page; at that scale, add delays and retries between requests.
How is the API better than using SDKs?
Using the API directly means fewer dependencies, language independence, and more control over how and when requests are made. You can also implement your own logging and retry logic.
Can I extract price, name, rating, and image together?
Yes. Define all those keys in your schema and include them in the prompt. ScrapeGraphAI maps data accordingly.
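For example, a combined payload might look like this (the schema key names are illustrative; choose whatever field names fit your pipeline):

```python
payload = {
    "website_url": "https://example.com/product-page",
    "user_prompt": "Extract the product name, price, average rating, and main image URL.",
    "output_schema": {
        "product_name": "string",
        "price": "string",
        "rating": "string",
        "image_url": "string",
    },
}
```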
Is rate limiting enforced?
Yes. Respect rate limits (check headers in API response). Use exponential backoff or delays for scale jobs.
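A generic exponential-backoff wrapper for that pattern might look like this. It is a sketch, not part of any ScrapeGraphAI client, and it retries on any exception; in practice you'd retry only on rate-limit or transient network errors:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying failures with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # 1s, 2s, 4s, ... plus jitter so parallel workers don't retry in sync
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 2))
```

Usage: wrap each API call, e.g. `with_backoff(lambda: requests.post(url, json=payload, headers=headers))`.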
What format is the result returned in?
Standard JSON, with keys mapped to your schema fields. You can write directly to CSV or load into Pandas.
References
- ScrapeGraphAI Docs: https://docs.scrapegraphai.com
- ScrapeGraphAI API Reference: https://docs.scrapegraphai.com/api
- OpenCV: https://docs.opencv.org
- Pillow: https://pillow.readthedocs.io
- imagehash: https://github.com/JohannesBuchner/imagehash
Conclusion
With ScrapeGraphAI's API, you can now go from URL lists to structured data and preprocessed images—all within a few hundred lines of code. No fragile parsing. No manual guesswork. If you're building a visual ML system, this approach lets you stay lean, fast, and accurate from day one.
Related Resources
Want to learn more about data innovation and AI-powered analysis? Explore these guides:
- Web Scraping 101 - Master the basics of data collection
- AI Agent Web Scraping - Learn about AI-powered data extraction
- LlamaIndex Integration - Discover advanced data analysis techniques
- Building Intelligent Agents - Learn how to build AI agents for data analysis
- Pre-AI to Post-AI Scraping - See how AI has transformed data collection
- Structured Output - Master handling structured data
- Stock Analysis with AI - Learn about AI-powered financial analysis
- LinkedIn Lead Generation with AI - Discover AI-driven business intelligence
- Web Scraping Legality - Understand the legal aspects of data collection
These resources will help you understand how to leverage AI and modern tools for innovative data collection and analysis.