Scrape Wine Websites: The Ultimate Guide to Web Scraping for Wine
Learn how to scrape wine websites using ScrapeGraphAI. Discover the best tools and techniques for web scraping wine data.


How to Build a Wine Dataset: Scraping CellarTracker with ScrapeGraphAI
Wine data is incredibly valuable for building recommendation systems, conducting market analysis, or creating wine discovery applications. CellarTracker, one of the world's largest wine databases, contains detailed information about thousands of wines including ratings, tasting notes, vintages, and pricing data.
In this tutorial, we'll build a comprehensive wine dataset by scraping CellarTracker using ScrapeGraphAI's powerful AI-driven extraction capabilities. By the end, you'll have a complete pipeline for gathering structured wine data at scale.
Why CellarTracker for Wine Data?
CellarTracker is a goldmine for wine enthusiasts and data scientists because it contains:
- Detailed wine profiles with producer, region, and vintage information
- Professional and user ratings from wine critics and enthusiasts
- Comprehensive tasting notes describing flavor profiles and characteristics
- Price tracking showing current and historical wine values
- Vintage comparisons across different years
- Food pairing recommendations and serving suggestions
Let's explore how to extract this rich dataset efficiently.
Setting Up ScrapeGraphAI
First, let's set up our environment with ScrapeGraphAI's Python client:
```python
# Install ScrapeGraphAI first:
# pip install scrapegraph-py

from scrapegraph_py import Client
from scrapegraph_py.logger import sgai_logger
import json
import pandas as pd
import time
from typing import List, Dict, Any

# Enable logging to track our scraping progress
sgai_logger.set_logging(level="INFO")

# Initialize the client with your API key
sgai_client = Client(api_key="your-api-key-here")
```
Understanding CellarTracker's Structure
Let's examine a typical CellarTracker wine page using our example URL:
https://www.cellartracker.com/classic/wine.asp?iWine=547249
This page contains rich information about a specific wine including:
- Wine name, producer, and region
- Vintage year and alcohol content
- Professional critic scores and user ratings
- Detailed tasting notes and reviews
- Current market prices and availability
- Food pairing suggestions
- Drinking windows and aging potential
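Since every wine page is keyed by that numeric iWine identifier, you can generate page URLs directly from a list of IDs once you know them. Below is a minimal sketch; `build_wine_url` is a hypothetical helper written for this tutorial, not part of any library:
```python
def build_wine_url(iwine_id: int) -> str:
    """Build a CellarTracker wine page URL from a numeric iWine identifier."""
    return f"https://www.cellartracker.com/classic/wine.asp?iWine={iwine_id}"

# Example: the wine page used throughout this tutorial
print(build_wine_url(547249))
# https://www.cellartracker.com/classic/wine.asp?iWine=547249
```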
Creating Our First Wine Scraper
Let's start by scraping a single wine page to understand the data structure:
```python
def scrape_single_wine(wine_url: str) -> Dict[str, Any]:
    """
    Scrape detailed information from a single CellarTracker wine page.
    """
    # Define a comprehensive prompt for wine data extraction
    wine_prompt = """
    Extract comprehensive wine information from this CellarTracker page:

    Basic Information:
    - Wine name and full producer name
    - Vintage year and region/appellation
    - Alcohol percentage and wine type (red, white, etc.)
    - Grape varieties/blend composition

    Ratings and Scores:
    - Professional critic scores (Wine Spectator, Robert Parker, etc.)
    - Average user rating and number of user ratings
    - Any special awards or accolades

    Tasting Profile:
    - Detailed tasting notes and flavor descriptors
    - Aroma characteristics
    - Structure notes (tannins, acidity, body)
    - Color and appearance

    Commercial Information:
    - Current average price and price range
    - Market availability status
    - Historical price trends if available
    - Best vintages or vintage variations

    Consumption Guidelines:
    - Recommended drinking window
    - Optimal serving temperature
    - Food pairing suggestions
    - Aging potential and cellar recommendations

    Return all data in structured JSON format with clear field names.
    """

    try:
        # Execute the scraping request
        response = sgai_client.smartscraper(
            website_url=wine_url,
            user_prompt=wine_prompt
        )
        print(f"Successfully scraped: {wine_url}")
        return response.result
    except Exception as e:
        print(f"Error scraping {wine_url}: {str(e)}")
        return None


# Test with our example wine
example_url = "https://www.cellartracker.com/classic/wine.asp?iWine=547249"
wine_data = scrape_single_wine(example_url)

print("Scraped Wine Data:")
print(json.dumps(wine_data, indent=2))
```
Building a Scalable Wine Dataset
Now let's create a more robust system for scraping multiple wines and building a comprehensive dataset:
```python
class WineDatasetBuilder:
    def __init__(self, api_key: str, delay: float = 2.0):
        """
        Initialize the wine dataset builder.

        Args:
            api_key: ScrapeGraphAI API key
            delay: Delay between requests to be respectful to the server
        """
        self.client = Client(api_key=api_key)
        self.delay = delay
        self.scraped_wines = []

    def scrape_wine_batch(self, wine_urls: List[str]) -> List[Dict[str, Any]]:
        """
        Scrape multiple wine pages with proper rate limiting.
        """
        results = []
        for i, url in enumerate(wine_urls):
            print(f"Processing wine {i+1}/{len(wine_urls)}: {url}")
            wine_data = self.scrape_comprehensive_wine_data(url)
            if wine_data:
                results.append(wine_data)
                self.scraped_wines.append(wine_data)
            # Respectful delay between requests
            if i < len(wine_urls) - 1:
                time.sleep(self.delay)
        return results

    def scrape_comprehensive_wine_data(self, wine_url: str) -> Dict[str, Any]:
        """
        Extract comprehensive wine data with enhanced prompting.
        """
        enhanced_prompt = """
        Extract ALL available wine information from this CellarTracker page and structure it as JSON:
        {
          "basic_info": {
            "name": "Full wine name",
            "producer": "Producer/winery name",
            "region": "Geographic region/appellation",
            "country": "Country of origin",
            "vintage": "Year as integer",
            "alcohol_content": "Alcohol percentage as float",
            "wine_type": "Red/White/Rosé/Sparkling/Dessert",
            "grape_varieties": ["List of grape varieties with percentages if available"]
          },
          "ratings": {
            "professional_scores": {
              "wine_spectator": "Score if available",
              "robert_parker": "Score if available",
              "wine_advocate": "Score if available",
              "other_critics": ["Any other professional scores"]
            },
            "user_rating": "Average user rating as float",
            "total_user_ratings": "Number of user ratings as integer",
            "awards": ["Any awards or accolades"]
          },
          "tasting_notes": {
            "color": "Wine color description",
            "aroma": "Aroma characteristics",
            "palate": "Taste profile and flavors",
            "finish": "Finish description",
            "structure": {
              "tannins": "Tannin level description",
              "acidity": "Acidity level",
              "body": "Body description (light/medium/full)",
              "sweetness": "Sweetness level"
            }
          },
          "commercial": {
            "current_price": "Current average price",
            "price_range": "Price range if available",
            "availability": "Market availability status",
            "value_rating": "Value for money assessment"
          },
          "consumption": {
            "drink_from": "Start of drinking window",
            "drink_until": "End of drinking window",
            "peak_drinking": "Peak drinking period",
            "serving_temp": "Optimal serving temperature",
            "food_pairings": ["Recommended food pairings"],
            "aging_potential": "Aging recommendations"
          },
          "additional_info": {
            "production_notes": "Any production method details",
            "vineyard_info": "Vineyard or terroir information",
            "vintage_notes": "Vintage-specific information",
            "similar_wines": ["Similar wine recommendations if available"]
          }
        }

        Extract as much information as available. If certain fields are not present, use null values.
        """

        try:
            response = self.client.smartscraper(
                website_url=wine_url,
                user_prompt=enhanced_prompt
            )
            # Add metadata
            result = response.result
            if result:
                result['scraped_url'] = wine_url
                result['scrape_timestamp'] = time.time()
            return result
        except Exception as e:
            print(f"Error scraping {wine_url}: {str(e)}")
            return None

    def save_dataset(self, filename: str = "wine_dataset.json"):
        """
        Save the scraped dataset to a JSON file.
        """
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.scraped_wines, f, indent=2, ensure_ascii=False)
        print(f"Dataset saved to {filename} with {len(self.scraped_wines)} wines")

    def to_dataframe(self) -> pd.DataFrame:
        """
        Convert the wine dataset to a pandas DataFrame for analysis.
        """
        if not self.scraped_wines:
            print("No wines scraped yet!")
            return pd.DataFrame()

        # Flatten nested structure for DataFrame
        flattened_data = []
        for wine in self.scraped_wines:
            if not wine:
                continue
            flat_wine = {}

            # Basic info
            if 'basic_info' in wine:
                basic = wine['basic_info']
                flat_wine.update({
                    'name': basic.get('name'),
                    'producer': basic.get('producer'),
                    'region': basic.get('region'),
                    'country': basic.get('country'),
                    'vintage': basic.get('vintage'),
                    'alcohol_content': basic.get('alcohol_content'),
                    'wine_type': basic.get('wine_type'),
                    # Fields may come back as null, so guard before joining
                    'grape_varieties': ', '.join(basic.get('grape_varieties') or [])
                })

            # Ratings
            if 'ratings' in wine:
                ratings = wine['ratings']
                scores = ratings.get('professional_scores') or {}
                flat_wine.update({
                    'user_rating': ratings.get('user_rating'),
                    'total_user_ratings': ratings.get('total_user_ratings'),
                    'wine_spectator_score': scores.get('wine_spectator'),
                    'parker_score': scores.get('robert_parker')
                })

            # Commercial info
            if 'commercial' in wine:
                commercial = wine['commercial']
                flat_wine.update({
                    'current_price': commercial.get('current_price'),
                    'availability': commercial.get('availability')
                })

            # Add metadata
            flat_wine['scraped_url'] = wine.get('scraped_url')
            flat_wine['scrape_timestamp'] = wine.get('scrape_timestamp')

            flattened_data.append(flat_wine)

        return pd.DataFrame(flattened_data)
```
Discovering Wine URLs for Bulk Scraping
To build a large dataset, we need to discover wine URLs. Here's how to scrape wine search results or category pages:
```python
def discover_wine_urls(search_base_url: str, max_pages: int = 5) -> List[str]:
    """
    Discover wine URLs from CellarTracker search results or category pages.
    """
    discovery_prompt = """
    Find all individual wine page links on this CellarTracker page.
    Look for links that go to specific wine pages (usually containing 'wine.asp?iWine=' or similar).

    Return a JSON list of complete URLs:
    {
      "wine_urls": [
        "https://www.cellartracker.com/classic/wine.asp?iWine=123456",
        "https://www.cellartracker.com/classic/wine.asp?iWine=789012"
      ]
    }

    Extract ALL wine links found on the page.
    """

    try:
        response = sgai_client.smartscraper(
            website_url=search_base_url,
            user_prompt=discovery_prompt
        )
        urls = response.result.get('wine_urls', []) if response.result else []
        print(f"Discovered {len(urls)} wine URLs from {search_base_url}")
        return urls
    except Exception as e:
        print(f"Error discovering URLs from {search_base_url}: {str(e)}")
        return []
```
```python
# Example: discover wines from a few search results pages
search_urls = [
    "https://www.cellartracker.com/list.asp?Table=List&szSearch=bordeaux",
    "https://www.cellartracker.com/list.asp?Table=List&szSearch=burgundy",
    "https://www.cellartracker.com/list.asp?Table=List&szSearch=champagne"
]

all_wine_urls = []
for search_url in search_urls:
    discovered_urls = discover_wine_urls(search_url)
    all_wine_urls.extend(discovered_urls)

print(f"Total wine URLs discovered: {len(all_wine_urls)}")
```
Complete Wine Dataset Building Pipeline
Here's our complete pipeline that ties everything together:
```python
def build_wine_dataset(api_key: str, wine_urls: List[str],
                       output_file: str = "comprehensive_wine_dataset.json"):
    """
    Complete pipeline to build a wine dataset from CellarTracker.
    """
    # Initialize dataset builder
    builder = WineDatasetBuilder(api_key=api_key, delay=2.0)

    print(f"Starting to scrape {len(wine_urls)} wines...")

    # Process wines in batches to manage memory and provide progress updates
    batch_size = 50
    for i in range(0, len(wine_urls), batch_size):
        batch_urls = wine_urls[i:i + batch_size]
        batch_num = i // batch_size + 1
        total_batches = (len(wine_urls) + batch_size - 1) // batch_size

        print(f"\nProcessing batch {batch_num}/{total_batches}")
        builder.scrape_wine_batch(batch_urls)

        # Save progress after each batch
        builder.save_dataset(f"wine_dataset_batch_{batch_num}.json")

    # Save final dataset
    builder.save_dataset(output_file)

    # Create DataFrame for analysis
    df = builder.to_dataframe()
    df.to_csv(output_file.replace('.json', '.csv'), index=False)

    print("\nDataset completed!")
    print(f"Total wines scraped: {len(builder.scraped_wines)}")
    print(f"Saved to: {output_file}")
    print(f"CSV available: {output_file.replace('.json', '.csv')}")

    return builder, df


# Example usage
if __name__ == "__main__":
    # Sample wine URLs for demonstration
    sample_wine_urls = [
        "https://www.cellartracker.com/classic/wine.asp?iWine=547249",
        "https://www.cellartracker.com/classic/wine.asp?iWine=123456",  # Replace with real URLs
        "https://www.cellartracker.com/classic/wine.asp?iWine=789012",  # Replace with real URLs
    ]

    # Build the dataset
    builder, df = build_wine_dataset(
        api_key="your-api-key-here",
        wine_urls=sample_wine_urls,
        output_file="cellartracker_wine_dataset.json"
    )

    # Quick analysis
    print("\nDataset Overview:")
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print("\nWine types distribution:")
    print(df['wine_type'].value_counts())
    print("\nTop regions:")
    print(df['region'].value_counts().head())
```
Advanced Data Analysis and Insights
Once you have your wine dataset, here are some advanced analysis techniques:
```python
def analyze_wine_dataset(df: pd.DataFrame):
    """
    Perform comprehensive analysis on the wine dataset.
    """
    print("WINE DATASET ANALYSIS")
    print("=" * 50)

    # Basic statistics
    print("Dataset Overview:")
    print(f"  - Total wines: {len(df)}")
    print(f"  - Unique producers: {df['producer'].nunique()}")
    print(f"  - Countries represented: {df['country'].nunique()}")
    print(f"  - Vintage range: {df['vintage'].min()} - {df['vintage'].max()}")

    # Rating analysis
    if 'user_rating' in df.columns:
        avg_rating = df['user_rating'].mean()
        print(f"  - Average user rating: {avg_rating:.2f}/100")
        print(f"  - Highest rated wine: {df.loc[df['user_rating'].idxmax(), 'name']}")

    # Price analysis (if available)
    if 'current_price' in df.columns:
        # Clean price data (remove currency symbols, convert to numeric)
        df['price_numeric'] = pd.to_numeric(
            df['current_price'].str.replace(r'[^\d.]', '', regex=True),
            errors='coerce'
        )
        avg_price = df['price_numeric'].mean()
        print(f"  - Average price: {avg_price:.2f}")

    # Regional analysis
    print("\nTop Wine Regions:")
    top_regions = df['region'].value_counts().head(10)
    for region, count in top_regions.items():
        print(f"  - {region}: {count} wines")

    # Vintage trends
    if 'vintage' in df.columns:
        vintage_counts = df['vintage'].value_counts().sort_index()
        print("\nVintage Distribution (most recent 10 vintages):")
        for vintage, count in vintage_counts.tail(10).items():
            print(f"  - {vintage}: {count} wines")

    return df


# Run the analysis
analyzed_df = analyze_wine_dataset(df)
```
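If you want visual summaries as well, a quick plotting pass works directly on the flattened DataFrame. Here is a minimal sketch, assuming matplotlib and seaborn are installed and using the `analyzed_df` produced above (column names follow the earlier flattening step):
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of user ratings by wine type
plt.figure(figsize=(10, 5))
sns.boxplot(data=analyzed_df, x='wine_type', y='user_rating')
plt.title('User Ratings by Wine Type')
plt.xlabel('Wine type')
plt.ylabel('User rating')
plt.tight_layout()
plt.savefig('ratings_by_wine_type.png')

# Price distribution (uses the price_numeric column added during analysis)
plt.figure(figsize=(10, 5))
sns.histplot(analyzed_df['price_numeric'].dropna(), bins=30)
plt.title('Wine Price Distribution')
plt.xlabel('Price')
plt.tight_layout()
plt.savefig('price_distribution.png')
```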
Best Practices for Wine Data Scraping
1. Respectful Scraping
```python
# Always include a delay between requests
time.sleep(2)  # 2-second delay between requests

# Monitor your request rate
# Don't exceed roughly 30 requests per minute
```
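If you'd rather enforce that cap programmatically instead of relying on fixed sleeps, a small helper can space requests out automatically. This is an illustrative sketch; the `RateLimiter` class and its 30-per-minute default are written for this tutorial, not part of the ScrapeGraphAI client:
```python
import time

class RateLimiter:
    """Simple rate limiter that spaces calls evenly over a time window."""

    def __init__(self, max_requests_per_minute: int = 30):
        self.min_interval = 60.0 / max_requests_per_minute
        self.last_request = 0.0

    def wait(self):
        # Sleep just long enough to respect the configured rate
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

# Usage: call limiter.wait() before each scraping request
limiter = RateLimiter(max_requests_per_minute=30)
```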
2. Data Validation and Cleaning
```python
def validate_wine_data(wine_data: Dict[str, Any]) -> bool:
    """
    Validate scraped wine data for completeness and accuracy.
    """
    # The scraper nests name and producer under 'basic_info'
    basic_info = wine_data.get('basic_info') or {}
    if not basic_info.get('name') or not basic_info.get('producer'):
        return False

    # Validate vintage year
    vintage = basic_info.get('vintage')
    if isinstance(vintage, (int, float)) and (vintage < 1800 or vintage > 2025):
        return False

    return True
```
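In practice you can run this check over everything the builder has collected and keep only the records that pass. A short usage sketch, assuming a populated `builder` from the earlier steps:
```python
# Keep only records that pass validation
valid_wines = [w for w in builder.scraped_wines if w and validate_wine_data(w)]
print(f"Kept {len(valid_wines)} of {len(builder.scraped_wines)} scraped wines")
```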
3. Error Handling and Retry Logic
```python
import time
from typing import Optional

def scrape_with_retry(url: str, user_prompt: str, max_retries: int = 3) -> Optional[Dict[str, Any]]:
    """
    Scrape with retry logic for robustness.
    """
    for attempt in range(max_retries):
        try:
            response = sgai_client.smartscraper(
                website_url=url,
                user_prompt=user_prompt
            )
            return response.result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {url}: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(5)  # Wait before retrying
            else:
                print(f"Failed to scrape {url} after {max_retries} attempts")
    return None
```
Use Cases for Your Wine Dataset
Once you've built your comprehensive wine dataset, you can use it for:
1. Wine Recommendation System
```python
def find_similar_wines(target_wine: str, df: pd.DataFrame, top_n: int = 5):
    """
    Find wines similar to a target wine based on characteristics
    (region, grape varieties, ratings, price range).
    """
    # Implementation for wine similarity matching goes here
    pass
```
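As one possible way to fill in that stub, you can score candidates on shared region, overlapping grape varieties, and rating proximity. This is a minimal sketch under the flattened-column assumptions from `to_dataframe` above; the scoring weights are arbitrary illustrations:
```python
def find_similar_wines_sketch(target_name: str, df: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Rank wines by a simple similarity score to the target wine."""
    target = df[df['name'] == target_name].iloc[0]  # assumes the target is in the DataFrame
    target_grapes = set(str(target['grape_varieties']).split(', '))

    def score(row):
        s = 0.0
        if row['region'] == target['region']:
            s += 2.0  # same region
        grapes = set(str(row['grape_varieties']).split(', '))
        s += len(grapes & target_grapes)  # shared grape varieties
        if pd.notna(row['user_rating']) and pd.notna(target['user_rating']):
            s -= abs(row['user_rating'] - target['user_rating']) / 10.0  # rating proximity
        return s

    candidates = df[df['name'] != target_name].copy()
    candidates['similarity'] = candidates.apply(score, axis=1)
    return candidates.sort_values('similarity', ascending=False).head(top_n)
```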
2. Price Prediction Model
```python
def predict_wine_price(wine_features: Dict, model):
    """
    Predict wine price based on characteristics
    (producer, region, vintage, ratings).
    """
    # Machine learning model for price prediction goes here
    pass
```
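One way to get a baseline is a tree ensemble over a few numeric and one-hot encoded features from the dataset. The sketch below uses scikit-learn (an extra dependency, installed separately) and assumes `current_price` was scraped as a string like "$45.00" and that the flattened columns from earlier are present:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Build a simple feature matrix from the flattened DataFrame
features = pd.get_dummies(
    df[['vintage', 'user_rating', 'region', 'wine_type']].dropna(),
    columns=['region', 'wine_type']
)
# Strip currency symbols and coerce prices to numbers
target = pd.to_numeric(
    df.loc[features.index, 'current_price'].str.replace(r'[^\d.]', '', regex=True),
    errors='coerce'
)
mask = target.notna()
X_train, X_test, y_train, y_test = train_test_split(
    features[mask], target[mask], test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"R^2 on held-out wines: {model.score(X_test, y_test):.2f}")
```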
3. Market Analysis
```python
def analyze_wine_market_trends(df: pd.DataFrame):
    """
    Analyze wine market trends and patterns:
    price trends by region, rating distributions by wine type,
    seasonal availability patterns.
    """
    pass
```
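A couple of pandas group-bys already go a long way here. This is a minimal sketch, again assuming the flattened columns plus the `price_numeric` column created during the earlier analysis step:
```python
def market_trend_summary(df: pd.DataFrame) -> None:
    """Print simple price and rating summaries by region and wine type."""
    # Average price by region (ten most represented regions)
    top_regions = df['region'].value_counts().head(10).index
    price_by_region = (
        df[df['region'].isin(top_regions)]
        .groupby('region')['price_numeric']
        .mean()
        .sort_values(ascending=False)
    )
    print("Average price by region:")
    print(price_by_region)

    # Rating distribution by wine type
    rating_by_type = df.groupby('wine_type')['user_rating'].describe()
    print("\nUser rating distribution by wine type:")
    print(rating_by_type)

market_trend_summary(analyzed_df)
```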
Conclusion
Building a comprehensive wine dataset from CellarTracker using ScrapeGraphAI provides you with rich, structured data for analysis, machine learning, and application development. The AI-powered extraction ensures you capture not just basic information, but nuanced details like tasting notes, food pairings, and expert ratings.
Key advantages of using ScrapeGraphAI for wine data:
- Comprehensive extraction of complex, nested information
- Natural language prompts that adapt to different page layouts
- Structured JSON output perfect for databases and analysis
- Reliable performance with built-in error handling
- Scalable approach for large datasets
Remember to always scrape responsibly, respect website terms of service, and use appropriate delays between requests. Your wine dataset will be a valuable asset for wine discovery applications, market research, or building the next great wine recommendation system.
Ready to start building your wine dataset? Get your ScrapeGraphAI API key and begin scraping the world's wine knowledge today!