How to Build a Wine Dataset: Scraping CellarTracker with ScrapeGraphAI

Wine data is useful for building recommendation systems, conducting market analysis, and creating wine discovery applications. CellarTracker, one of the world's largest wine databases, holds detailed information about wines, including ratings, tasting notes, vintages, and pricing data.

In this tutorial, we'll build a wine dataset by scraping CellarTracker with ScrapeGraphAI's AI-driven extraction. By the end, you'll have a pipeline that discovers wine pages, extracts structured data from each one, and saves the results as JSON and CSV.

Why CellarTracker for Wine Data?

CellarTracker is a goldmine for wine enthusiasts and data scientists because it contains:

Detailed wine profiles with producer, region, and vintage information
Professional and user ratings from wine critics and enthusiasts
Comprehensive tasting notes describing flavor profiles and characteristics
Price tracking showing current and historical wine values
Vintage comparisons across different years
Food pairing recommendations and serving suggestions

Let's explore how to extract this rich dataset efficiently.

Setting Up ScrapeGraphAI

First, let's set up our environment with ScrapeGraphAI's Python client:

# Install ScrapeGraphAI
# pip install scrapegraph-py
 
from scrapegraph_py import Client
from scrapegraph_py.logger import sgai_logger
import json
import pandas as pd
import time
from typing import List, Dict, Any
 
# Enable logging to track our scraping progress
sgai_logger.set_logging(level="INFO")
 
# Initialize the client with your API key
sgai_client = Client(api_key="your-scrapegraph-api-key-here")
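
Hard-coding the key is fine for a quick test, but it is easy to commit by accident. Here is a minimal sketch of reading it from the environment instead. The variable name `SGAI_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not something this setup requires:

```python
import os


def load_api_key(env_var: str = "SGAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption; use whatever name your deployment sets.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key
```

With this in place, the client could be created as `Client(api_key=load_api_key())`.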

Understanding CellarTracker's Structure

Let's examine a typical CellarTracker wine page using our example URL: https://www.cellartracker.com/classic/wine.asp?iWine=547249

This page contains rich information about a specific wine including:

Wine name, producer, and region
Vintage year and alcohol content
Professional critic scores and user ratings
Detailed tasting notes and reviews
Current market prices and availability
Food pairing suggestions
Drinking windows and aging potential
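
The example URL identifies the wine through its `iWine` query parameter. A small sketch of pulling that ID out with the standard library, which can later serve as a stable key for deduplication; `extract_wine_id` is a hypothetical helper, and it assumes wine pages keep the `iWine=<number>` pattern shown above:

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_wine_id(url: str) -> Optional[int]:
    """Return the numeric iWine query parameter from a wine page URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("iWine", [])
    if values and values[0].isdecimal():
        return int(values[0])
    return None


print(extract_wine_id("https://www.cellartracker.com/classic/wine.asp?iWine=547249"))
```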

Creating Our First Wine Scraper

Let's start by scraping a single wine page to understand the data structure:

def scrape_single_wine(wine_url: str) -> Dict[str, Any]:
    """
    Scrape detailed information from a single CellarTracker wine page
    """
    
    # Define comprehensive prompt for wine data extraction
    wine_prompt = """
    Extract comprehensive wine information from this CellarTracker page:
    
    Basic Information:
    - Wine name and full producer name
    - Vintage year and region/appellation
    - Alcohol percentage and wine type (red, white, etc.)
    - Grape varieties/blend composition
    
    Ratings and Scores:
    - Professional critic scores (Wine Spectator, Robert Parker, etc.)
    - Average user rating and number of user ratings
    - Any special awards or accolades
    
    Tasting Profile:
    - Detailed tasting notes and flavor descriptors
    - Aroma characteristics
    - Structure notes (tannins, acidity, body)
    - Color and appearance
    
    Commercial Information:
    - Current average price and price range
    - Market availability status
    - Historical price trends if available
    - Best vintages or vintage variations
    
    Consumption Guidelines:
    - Recommended drinking window
    - Optimal serving temperature
    - Food pairing suggestions
    - Aging potential and cellar recommendations
    
    Return all data in structured JSON format with clear field names.
    """
    
    try:
        # Execute the scraping request
        response = sgai_client.smartscraper(
            website_url=wine_url,
            user_prompt=wine_prompt
        )
        
        print(f"Successfully scraped: {wine_url}")
        return response["result"]  # the client returns a dict; the data is under "result"
        
    except Exception as e:
        print(f"Error scraping {wine_url}: {str(e)}")
        return None
 
# Test with our example wine
example_url = "https://www.cellartracker.com/classic/wine.asp?iWine=547249"
wine_data = scrape_single_wine(example_url)
 
print("Scraped Wine Data:")
print(json.dumps(wine_data, indent=2))
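
Because this first prompt does not fix any field names, the keys in the returned JSON can differ from page to page. A small sketch for inspecting what came back, so you can decide which fields to rely on; `list_key_paths` is a hypothetical helper, and the sample dict is made up for illustration rather than real scraper output:

```python
from typing import Any, List


def list_key_paths(data: Any, prefix: str = "") -> List[str]:
    """Return dotted paths for every leaf value in nested dicts."""
    if not isinstance(data, dict):
        return [prefix] if prefix else []
    paths: List[str] = []
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        paths.extend(list_key_paths(value, path))
    return paths


# Made-up sample; real output depends on the page and the model
sample = {"wine": {"name": "Example Rouge", "vintage": 2015}, "ratings": {"user": 91.2}}
print(list_key_paths(sample))
```

Calling `list_key_paths(wine_data)` on a few different wines gives a quick view of how stable the structure is before you write flattening code against it.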

Building a Scalable Wine Dataset

Now let's create a more robust system for scraping multiple wines and building a comprehensive dataset:

class WineDatasetBuilder:
    def __init__(self, api_key: str, delay: float = 2.0):
        """
        Initialize the wine dataset builder
        
        Args:
            api_key: ScrapeGraphAI API key
            delay: Delay between requests to be respectful to the server
        """
        self.client = Client(api_key=api_key)
        self.delay = delay
        self.scraped_wines = []
        
    def scrape_wine_batch(self, wine_urls: List[str]) -> List[Dict[str, Any]]:
        """
        Scrape multiple wine pages with proper rate limiting
        """
        results = []
        
        for i, url in enumerate(wine_urls):
            print(f"Processing wine {i+1}/{len(wine_urls)}: {url}")
            
            wine_data = self.scrape_comprehensive_wine_data(url)
            if wine_data:
                results.append(wine_data)
                self.scraped_wines.append(wine_data)
            
            # Respectful delay between requests
            if i < len(wine_urls) - 1:
                time.sleep(self.delay)
        
        return results
    
    def scrape_comprehensive_wine_data(self, wine_url: str) -> Dict[str, Any]:
        """
        Extract comprehensive wine data with enhanced prompting
        """
        enhanced_prompt = """
        Extract ALL available wine information from this CellarTracker page and structure it as JSON:
 
        {
          "basic_info": {
            "name": "Full wine name",
            "producer": "Producer/winery name", 
            "region": "Geographic region/appellation",
            "country": "Country of origin",
            "vintage": "Year as integer",
            "alcohol_content": "Alcohol percentage as float",
            "wine_type": "Red/White/Rosé/Sparkling/Dessert",
            "grape_varieties": ["List of grape varieties with percentages if available"]
          },
          "ratings": {
            "professional_scores": {
              "wine_spectator": "Score if available",
              "robert_parker": "Score if available", 
              "wine_advocate": "Score if available",
              "other_critics": ["Any other professional scores"]
            },
            "user_rating": "Average user rating as float",
            "total_user_ratings": "Number of user ratings as integer",
            "awards": ["Any awards or accolades"]
          },
          "tasting_notes": {
            "color": "Wine color description",
            "aroma": "Aroma characteristics",
            "palate": "Taste profile and flavors",
            "finish": "Finish description",
            "structure": {
              "tannins": "Tannin level description",
              "acidity": "Acidity level", 
              "body": "Body description (light/medium/full)",
              "sweetness": "Sweetness level"
            }
          },
          "commercial": {
            "current_price": "Current average price",
            "price_range": "Price range if available",
            "availability": "Market availability status",
            "value_rating": "Value for money assessment"
          },
          "consumption": {
            "drink_from": "Start of drinking window",
            "drink_until": "End of drinking window", 
            "peak_drinking": "Peak drinking period",
            "serving_temp": "Optimal serving temperature",
            "food_pairings": ["Recommended food pairings"],
            "aging_potential": "Aging recommendations"
          },
          "additional_info": {
            "production_notes": "Any production method details",
            "vineyard_info": "Vineyard or terroir information",
            "vintage_notes": "Vintage-specific information",
            "similar_wines": ["Similar wine recommendations if available"]
          }
        }
        
        Extract as much information as available. If certain fields are not present, use null values.
        """
        
        try:
            response = self.client.smartscraper(
                website_url=wine_url,
                user_prompt=enhanced_prompt
            )
            
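            # The client returns a dict; the extracted data is under "result"
            result = response["result"]
            if not isinstance(result, dict):
                # Nothing usable came back (null or an unexpected shape)
                return None
            
            # Add metadata
            result['scraped_url'] = wine_url
            result['scrape_timestamp'] = time.time()
            
            return result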
            
        except Exception as e:
            print(f"Error scraping {wine_url}: {str(e)}")
            return None
    
    def save_dataset(self, filename: str = "wine_dataset.json"):
        """
        Save the scraped dataset to a JSON file
        """
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.scraped_wines, f, indent=2, ensure_ascii=False)
        print(f"Dataset saved to {filename} with {len(self.scraped_wines)} wines")
    
    def to_dataframe(self) -> pd.DataFrame:
        """
        Convert the wine dataset to a pandas DataFrame for analysis
        """
        if not self.scraped_wines:
            print("No wines scraped yet!")
            return pd.DataFrame()
        
        # Flatten nested structure for DataFrame
        flattened_data = []
        
        for wine in self.scraped_wines:
            if not wine:
                continue
                
            flat_wine = {}
            
            # Basic info (sections may be missing or null, so fall back to empty dicts)
            basic = wine.get('basic_info') or {}
            flat_wine.update({
                'name': basic.get('name'),
                'producer': basic.get('producer'),
                'region': basic.get('region'), 
                'country': basic.get('country'),
                'vintage': basic.get('vintage'),
                'alcohol_content': basic.get('alcohol_content'),
                'wine_type': basic.get('wine_type'),
                # grape_varieties can be null; str() guards against non-string items
                'grape_varieties': ', '.join(str(g) for g in (basic.get('grape_varieties') or []))
            })
            
            # Ratings
            ratings = wine.get('ratings') or {}
            scores = ratings.get('professional_scores') or {}
            flat_wine.update({
                'user_rating': ratings.get('user_rating'),
                'total_user_ratings': ratings.get('total_user_ratings'),
                'wine_spectator_score': scores.get('wine_spectator'),
                'parker_score': scores.get('robert_parker')
            })
            
            # Commercial info
            commercial = wine.get('commercial') or {}
            flat_wine.update({
                'current_price': commercial.get('current_price'),
                'availability': commercial.get('availability')
            })
            
            # Add metadata
            flat_wine['scraped_url'] = wine.get('scraped_url')
            flat_wine['scrape_timestamp'] = wine.get('scrape_timestamp')
            
            flattened_data.append(flat_wine)
        
        return pd.DataFrame(flattened_data)
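
The `to_dataframe` method keeps a hand-picked subset of columns and drops sections such as `tasting_notes` and `consumption`. If you would rather keep every field, a generic flattener is one option. This is a sketch that assumes records follow the nested schema from the prompt above; `flatten_record` is a hypothetical helper, and joining lists into one string is a design choice made so that each value fits in a single CSV cell:

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten_record(value, name, sep))
        elif isinstance(value, list):
            # Lists become a single comma-separated string
            flat[name] = ", ".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat
```

A full-width DataFrame could then be built with `pd.DataFrame([flatten_record(w) for w in builder.scraped_wines if w])`.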

Discovering Wine URLs for Bulk Scraping

To build a large dataset, we need to discover wine URLs. Here's how to scrape wine search results or category pages:

def discover_wine_urls(search_base_url: str) -> List[str]:
    """
    Discover wine URLs from a single CellarTracker search results or category page
    """
    discovery_prompt = """
    Find all individual wine page links on this CellarTracker page.
    Look for links that go to specific wine pages (usually containing 'wine.asp?iWine=' or similar).
    
    Return a JSON list of complete URLs:
    {
      "wine_urls": [
        "https://www.cellartracker.com/classic/wine.asp?iWine=123456",
        "https://www.cellartracker.com/classic/wine.asp?iWine=789012"
      ]
    }
    
    Extract ALL wine links found on the page.
    """
    
    try:
        response = sgai_client.smartscraper(
            website_url=search_base_url,
            user_prompt=discovery_prompt
        )
        
        # The client returns a dict; the extracted data is under "result"
        result = response.get("result") or {}
        urls = result.get('wine_urls') or []
        print(f"Discovered {len(urls)} wine URLs from {search_base_url}")
        return urls
        
    except Exception as e:
        print(f"Error discovering URLs from {search_base_url}: {str(e)}")
        return []
 
# Example: Discover wines from a search results page
search_urls = [
    "https://www.cellartracker.com/list.asp?Table=List&szSearch=bordeaux",
    "https://www.cellartracker.com/list.asp?Table=List&szSearch=burgundy", 
    "https://www.cellartracker.com/list.asp?Table=List&szSearch=champagne"
]
 
all_wine_urls = []
for search_url in search_urls:
    discovered_urls = discover_wine_urls(search_url)
    all_wine_urls.extend(discovered_urls)
 
print(f"Total wine URLs discovered: {len(all_wine_urls)}")
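
Several searches can return the same wine more than once, and links extracted by a language model may occasionally be relative or point somewhere else entirely. The sketch below resolves relative links, keeps only CellarTracker URLs that carry an `iWine` parameter, and removes duplicates while preserving order. `clean_wine_urls` is a hypothetical helper, and the filtering rules are assumptions based on the URL pattern used in this tutorial:

```python
from typing import Iterable, List
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = "https://www.cellartracker.com/"


def clean_wine_urls(urls: Iterable[str], base_url: str = BASE_URL) -> List[str]:
    """Resolve relative links, keep CellarTracker wine pages, drop duplicates."""
    seen = set()
    cleaned: List[str] = []
    for raw in urls:
        if not isinstance(raw, str) or not raw.strip():
            continue
        url = urljoin(base_url, raw.strip())
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host != "cellartracker.com" and not host.endswith(".cellartracker.com"):
            continue
        if "iWine" not in parse_qs(parsed.query):
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned
```

Running `all_wine_urls = clean_wine_urls(all_wine_urls)` before scraping avoids paying for the same page twice.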

Complete Wine Dataset Building Pipeline

Here's our complete pipeline that ties everything together:

def build_wine_dataset(api_key: str, wine_urls: List[str], 
                      output_file: str = "comprehensive_wine_dataset.json"):
    """
    Complete pipeline to build a wine dataset from CellarTracker
    """
    # Initialize dataset builder
    builder = WineDatasetBuilder(api_key=api_key, delay=2.0)
    
    print(f"Starting to scrape {len(wine_urls)} wines...")
    
    # Process wines in batches so progress can be checkpointed and reported
    batch_size = 50
    for i in range(0, len(wine_urls), batch_size):
        batch_urls = wine_urls[i:i + batch_size]
        batch_num = i // batch_size + 1
        total_batches = (len(wine_urls) + batch_size - 1) // batch_size
        
        print(f"\nProcessing batch {batch_num}/{total_batches}")
        builder.scrape_wine_batch(batch_urls)
        
        # Save a cumulative checkpoint after each batch (all wines scraped so far)
        builder.save_dataset(f"wine_dataset_batch_{batch_num}.json")
    
    # Save final dataset
    builder.save_dataset(output_file)
    
    # Create DataFrame for analysis
    df = builder.to_dataframe()
    df.to_csv(output_file.replace('.json', '.csv'), index=False)
    
    print(f"\n✅ Dataset completed!")
    print(f"📊 Total wines scraped: {len(builder.scraped_wines)}")
    print(f"💾 Saved to: {output_file}")
    print(f"📈 CSV available: {output_file.replace('.json', '.csv')}")
    
    return builder, df
 
# Example usage
if __name__ == "__main__":
    # Sample wine URLs for demonstration
    sample_wine_urls = [
        "https://www.cellartracker.com/classic/wine.asp?iWine=547249",
        "https://www.cellartracker.com/classic/wine.asp?iWine=123456",  # Replace with real URLs
        "https://www.cellartracker.com/classic/wine.asp?iWine=789012",  # Replace with real URLs
    ]
    
    # Build the dataset
    builder, df = build_wine_dataset(
        api_key="your-scrapegraph-api-key-here",
        wine_urls=sample_wine_urls,
        output_file="cellartracker_wine_dataset.json"
    )
    
    # Quick analysis
    print("\n📊 Dataset Overview:")
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
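    # These columns are missing when nothing was scraped, so check before using them
    if 'wine_type' in df.columns:
        print("\nWine types distribution:")
        print(df['wine_type'].value_counts())
    if 'region' in df.columns:
        print("\nTop regions:")
        print(df['region'].value_counts().head())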
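
The per-batch checkpoint files also make it possible to resume an interrupted run instead of starting over. This is a sketch under the assumption that a checkpoint is a JSON list of records that each carry a `scraped_url` key, as written by `save_dataset`; both helper names are hypothetical:

```python
import json
import os
from typing import List, Set


def load_scraped_urls(checkpoint_path: str) -> Set[str]:
    """Return the URLs already present in a checkpoint file (empty set if none)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        records = json.load(f)
    return {r["scraped_url"] for r in records if isinstance(r, dict) and r.get("scraped_url")}


def pending_urls(all_urls: List[str], done: Set[str]) -> List[str]:
    """Keep the original order but drop URLs that were already scraped."""
    return [url for url in all_urls if url not in done]
```

Passing `pending_urls(sample_wine_urls, load_scraped_urls("wine_dataset_batch_1.json"))` to the pipeline would skip finished pages. Note that a fresh builder starts with an empty list, so the earlier records would still need to be merged into the final dataset.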

Advanced Data Analysis and Insights

Once you have your wine dataset, here are some advanced analysis techniques:

def analyze_wine_dataset(df: pd.DataFrame):
    """
    Perform comprehensive analysis on the wine dataset
    """
    
    print("🍷 WINE DATASET ANALYSIS")
    print("=" * 50)
    
    # Basic statistics
    print(f"📊 Dataset Overview:")
    print(f"  • Total wines: {len(df)}")
    print(f"  • Unique producers: {df['producer'].nunique()}")
    print(f"  • Countries represented: {df['country'].nunique()}")
    print(f"  • Vintage range: {df['vintage'].min()} - {df['vintage'].max()}")
    
    # Rating analysis (coerce to numeric first: extracted values may be strings or null)
    if 'user_rating' in df.columns:
        ratings = pd.to_numeric(df['user_rating'], errors='coerce')
        if ratings.notna().any():
            print(f"  • Average user rating: {ratings.mean():.2f}/100")
            print(f"  • Highest rated wine: {df.loc[ratings.idxmax(), 'name']}")
    
    # Price analysis (if available)
    if 'current_price' in df.columns:
        # Clean price data (remove currency symbols, convert to numeric)
        df['price_numeric'] = pd.to_numeric(
            df['current_price'].astype(str).str.replace(r'[^\d.]', '', regex=True), 
            errors='coerce'
        )
        avg_price = df['price_numeric'].mean()
        if pd.notna(avg_price):
            print(f"  • Average price: {avg_price:.2f}")
 
    # Regional analysis
    print(f"\n🌍 Top Wine Regions:")
    top_regions = df['region'].value_counts().head(10)
    for region, count in top_regions.items():
        print(f"  • {region}: {count} wines")
    
    # Vintage trends
    if 'vintage' in df.columns:
        vintage_counts = df['vintage'].value_counts().sort_index()
        print(f"\n📅 Vintage Distribution (Top 10):")
        for vintage, count in vintage_counts.tail(10).items():
            print(f"  • {vintage}: {count} wines")
    
    return df
 
# Run analysis
analyzed_df = analyze_wine_dataset(df)
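
Stripping every non-digit character works for values like "$45.00", but a range such as "$20 - $30" would collapse into 2030. Here is a more defensive parser, sketched under the assumption that prices arrive as free-form strings using "." as the decimal separator and "," as the thousands separator. `parse_price` is a hypothetical helper that takes the first number it finds:

```python
import re
from typing import Any, Optional

_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def parse_price(value: Any) -> Optional[float]:
    """Return the first number in a price string as a float, or None."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMBER.search(str(value))
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```

It could be applied column-wide with `df['price_numeric'] = df['current_price'].map(parse_price)`. For a range, taking the lower bound is an arbitrary choice; averaging both ends would be equally defensible.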

Best Practices for Wine Data Scraping

1. Respectful Scraping

# Always include delays between requests
time.sleep(2)  # 2-second delay between requests
 
# Monitor your request rate
# Don't exceed 30 requests per minute
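
A fixed `time.sleep(2)` adds two seconds on top of however long each request takes. If the goal is simply to stay under a request rate, a limiter that enforces a minimum interval between calls is one alternative. `RateLimiter` is a hypothetical class for illustration; the clock and sleep functions are injectable so the behavior can be tested without waiting:

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Block so that consecutive calls to wait() are at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept


# 30 requests per minute corresponds to one request every 2 seconds
limiter = RateLimiter(min_interval=60 / 30)
```

Calling `limiter.wait()` immediately before each scraping request keeps the rate at or below the target.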

2. Data Validation and Cleaning

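def validate_wine_data(wine_data: Dict[str, Any]) -> bool:
    """
    Validate scraped wine data for completeness and plausibility
    """
    # name and producer live inside the nested basic_info section
    basic_info = wine_data.get('basic_info') if wine_data else None
    if not isinstance(basic_info, dict):
        return False
    
    # Check for required fields
    required_fields = ['name', 'producer']
    for field in required_fields:
        if not basic_info.get(field):
            return False
    
    # Validate vintage year (it may arrive as a string; null means unknown)
    vintage = basic_info.get('vintage')
    if vintage is not None:
        try:
            vintage = int(vintage)
        except (TypeError, ValueError):
            return False
        # Use the current year as the upper bound instead of a hard-coded one
        if vintage < 1800 or vintage > time.localtime().tm_year:
            return False
    
    return True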

3. Error Handling and Retry Logic

import time
from typing import Optional
 
def scrape_with_retry(url: str, user_prompt: str, max_retries: int = 3) -> Optional[Dict[str, Any]]:
    """
    Scrape with retry logic for robustness
    """
    for attempt in range(max_retries):
        try:
            response = sgai_client.smartscraper(
                website_url=url,
                # Passed in by the caller: wine_prompt is local to scrape_single_wine
                user_prompt=user_prompt
            )
            # The client returns a dict; the extracted data is under "result"
            return response["result"]
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {url}: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(5)  # Wait before retry
            else:
                print(f"Failed to scrape {url} after {max_retries} attempts")
                return None
    # Reached only when max_retries is zero or negative
    return None
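
A fixed five-second wait is simple, but exponential backoff with a little jitter tends to behave better when a service is briefly overloaded. Here is a sketch of a generic wrapper. `retry_with_backoff` and its parameters are hypothetical and independent of ScrapeGraphAI, and the 10% jitter is an arbitrary choice:

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func until it succeeds; wait base_delay * 2**attempt plus jitter between tries."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries - 1:
                print(f"Giving up after {max_retries} attempts: {exc}")
                return None
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: add up to 10% so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    return None
```

The wrapped callable has to raise on failure for retries to trigger, so wrap the raw client call (for example with a `lambda`) rather than a function that already catches its own exceptions.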

Use Cases for Your Wine Dataset

Once you've built your comprehensive wine dataset, you can use it for:

1. Wine Recommendation System

def find_similar_wines(target_wine: str, df: pd.DataFrame, top_n: int = 5):
    """
    Find wines similar to a target wine based on characteristics
    """
    # Implementation for wine similarity matching
    # Based on region, grape varieties, ratings, price range
    pass
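
One possible way to fill in this stub, offered as a sketch and not as a tuned recommender: score candidates by the Jaccard overlap of their grape varieties, plus a bonus for sharing a region. It works on plain dicts with `name`, `region`, and `grape_varieties` (a list) keys, which is an assumption about how you store records. The function names and the 0.7/0.3 weights are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two lists treated as case-insensitive sets."""
    set_a = {x.strip().lower() for x in a}
    set_b = {x.strip().lower() for x in b}
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def rank_similar_wines(
    target: Dict[str, Any], wines: List[Dict[str, Any]], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs for the wines most similar to target."""
    scored = []
    for wine in wines:
        if wine.get("name") == target.get("name"):
            continue  # skip the target itself
        grape_score = jaccard(
            target.get("grape_varieties") or [], wine.get("grape_varieties") or []
        )
        same_region = bool(wine.get("region")) and wine.get("region") == target.get("region")
        # Arbitrary illustrative weights: 70% grapes, 30% region
        score = 0.7 * grape_score + (0.3 if same_region else 0.0)
        scored.append((wine.get("name"), score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Ratings and price range, which the stub also mentions, could be added as further weighted terms once those columns are cleaned.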

2. Price Prediction Model

def predict_wine_price(wine_features: Dict, model):
    """
    Predict wine price based on characteristics
    """
    # Machine learning model to predict wine prices
    # Based on producer, region, vintage, ratings
    pass
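
Before training any model, it helps to have a baseline to beat. The sketch below predicts the median price of a wine's region and falls back to the global median for unseen regions. This is a deliberately simple stand-in, not a machine learning model. It assumes records are dicts with a `region` and an already numeric `price`, and both function names are hypothetical:

```python
from collections import defaultdict
from statistics import median
from typing import Any, Dict, List, Optional


def fit_region_baseline(records: List[Dict[str, Any]]) -> Dict[Optional[str], float]:
    """Map each region to its median price; the key None holds the global median."""
    by_region = defaultdict(list)
    for rec in records:
        price = rec.get("price")
        if isinstance(price, (int, float)) and not isinstance(price, bool):
            by_region[rec.get("region")].append(float(price))
    all_prices = [p for prices in by_region.values() for p in prices]
    if not all_prices:
        return {}
    baseline: Dict[Optional[str], float] = {
        region: median(prices) for region, prices in by_region.items() if region
    }
    baseline[None] = median(all_prices)
    return baseline


def predict_baseline_price(
    region: Optional[str], baseline: Dict[Optional[str], float]
) -> Optional[float]:
    """Return the region median, falling back to the global median."""
    return baseline.get(region, baseline.get(None))
```

Any model built on producer, vintage, and ratings should at least outperform this baseline on held-out wines.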

3. Market Analysis

def analyze_wine_market_trends(df: pd.DataFrame):
    """
    Analyze wine market trends and patterns
    """
    # Price trends by region
    # Rating distributions by wine type
    # Seasonal availability patterns
    pass
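
As a starting point for the "rating distributions by wine type" item, here is a sketch that groups records by type and reports the count and mean user rating for each. It assumes dict records with `wine_type` and a numeric `user_rating`; `rating_by_wine_type` is a hypothetical helper:

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Tuple


def rating_by_wine_type(records: List[Dict[str, Any]]) -> Dict[str, Tuple[int, float]]:
    """Return {wine_type: (number of rated wines, mean user rating)}."""
    groups = defaultdict(list)
    for rec in records:
        rating = rec.get("user_rating")
        wine_type = rec.get("wine_type")
        is_number = isinstance(rating, (int, float)) and not isinstance(rating, bool)
        if wine_type and is_number:
            groups[wine_type].append(float(rating))
    return {t: (len(r), mean(r)) for t, r in sorted(groups.items())}
```

With a DataFrame, the same summary is available through `df.groupby('wine_type')['user_rating'].agg(['count', 'mean'])` once the rating column is numeric.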

Conclusion

Building a wine dataset from CellarTracker using ScrapeGraphAI provides you with rich, structured data for analysis, machine learning, and application development. The AI-powered extraction can capture not just basic information but also nuanced details such as tasting notes, food pairings, and expert ratings, wherever a page provides them.

Key advantages of using ScrapeGraphAI for wine data:

Comprehensive extraction of complex, nested information
Natural language prompts that adapt to different page layouts
Structured JSON output perfect for databases and analysis
Reliable performance with built-in error handling
Scalable approach for large datasets

Remember to always scrape responsibly, respect website terms of service, and use appropriate delays between requests. Your wine dataset will be a valuable asset for wine discovery applications, market research, or building the next great wine recommendation system.

Ready to start building your wine dataset? Get your ScrapeGraphAI API key and begin scraping the world's wine knowledge today!
