The Art of Prompting: How to Write Perfect Prompts for SmartScraper and SearchScraper
Master the art of prompt engineering for AI web scraping. Learn how to write effective prompts, avoid common mistakes, and use schemas for structured data extraction.


In the world of AI-powered web scraping, the difference between mediocre results and exceptional data extraction often comes down to one crucial factor: how well you craft your prompts. Whether you're using SmartScraper to extract data from specific websites or SearchScraper to aggregate information from multiple sources, mastering the art of prompting is essential for getting the structured, accurate data you need.
This comprehensive guide will transform you from a prompting novice to an expert, showing you exactly how to write prompts that deliver precise, structured results every time. We'll explore real examples, common pitfalls, and advanced techniques including schema usage for type-safe data extraction.
Table of Contents
- Understanding the Endpoints
- The Anatomy of a Perfect Prompt
- Bad Prompts: What Not to Do
- Good Prompts: Best Practices in Action
- Using Schemas for Structured Data
- Testing Your Prompts
- Advanced Prompting Techniques
Understanding the Endpoints
Before diving into prompt engineering, it's crucial to understand when to use each endpoint:
SmartScraper
Purpose: Extract structured data from a specific URL
Use When: You have a target website and need specific information from it
Key Feature: Context-aware extraction from single sources
SearchScraper
Purpose: Search and aggregate data from multiple web sources
Use When: You need comprehensive information on a topic from various sources
Key Feature: Multi-source aggregation with attribution
The fundamental difference affects how you structure your prompts: SmartScraper prompts focus on what to extract, while SearchScraper prompts focus on what to find.
The Anatomy of a Perfect Prompt
A well-crafted prompt consists of four essential components:
- Clear Objective: What data do you need?
- Specific Context: Why do you need this data?
- Structured Requirements: How should the data be formatted?
- Constraints: Any limitations or specific criteria?
Let's see this in action:
```python
# Perfect prompt structure
prompt = """
Extract product information for price comparison analysis.

Focus on:
- Product name and brand
- Current price and any discounts
- Availability status
- Customer ratings (if available)
- Key specifications

Format as structured JSON with consistent field names.
Only include products currently in stock.
"""
```
Bad Prompts: What Not to Do
Understanding why prompts fail is crucial for improvement. Let's examine common mistakes:
1. The Vague Request
❌ Bad Prompt:
```python
"Get product data"
```
Why it fails:
- No specification of which data points
- No structure definition
- No context for the AI to understand importance
- Results in inconsistent, unstructured output
2. The Everything Request
❌ Bad Prompt:
```python
"Extract all information from this page"
```
Why it fails:
- Overwhelming and unfocused
- Returns unnecessary data
- Difficult to process programmatically
- Wastes API resources
3. The Ambiguous Format
❌ Bad Prompt:
```python
"Find prices and names and descriptions and everything else important"
```
Why it fails:
- No clear structure
- "Everything else important" is subjective
- Run-on sentence structure confuses parsing
- No hierarchy of importance
4. The Context-Free Request
❌ Bad Prompt:
```python
"List all companies mentioned"
```
Why it fails:
- No context about which companies matter
- No format specification
- No filtering criteria
- May include irrelevant mentions
Good Prompts: Best Practices in Action
Now let's look at exemplary prompts for each endpoint:
SmartScraper Examples
E-commerce Product Extraction
✅ Good Prompt:
```python
prompt = """
Extract detailed product information for inventory management system.

Required fields:
1. Product identification:
   - name (full product name)
   - sku (if available)
   - brand
2. Pricing information:
   - current_price (numeric value)
   - original_price (if on sale)
   - currency
   - discount_percentage (calculate if both prices available)
3. Availability:
   - in_stock (boolean)
   - stock_count (if displayed)
4. Product details:
   - main_image_url
   - description (first 200 characters)
   - key_features (list of up to 5 main features)

Return as structured JSON. If a field is not found, use null.
Focus only on the main product, ignore related or recommended items.
"""
```
Why it works:
- Clear purpose stated upfront
- Structured field requirements
- Specific data types indicated
- Handling for missing data
- Constraints to avoid noise
Real Estate Listing Extraction
✅ Good Prompt:
```python
prompt = """
Extract real estate listing data for market analysis database.

Core information needed:
- Property type (house/apartment/condo)
- Price (numeric, in USD)
- Bedrooms (number)
- Bathrooms (number)
- Square footage (number)
- Address (full address if available)
- Year built
- Listing date
- MLS number (if present)

Additional features:
- Amenities (list key amenities like pool, garage, etc.)
- Property description (first paragraph only)
- Agent name and contact
- Monthly HOA fees (if applicable)

Format as clean JSON with snake_case keys.
Only extract if property is actively for sale.
"""
```
SearchScraper Examples
Market Research Query
✅ Good Prompt:
```python
prompt = """
Research the current state of AI-powered web scraping tools in 2024.

I need:
1. Market leaders and their key features
2. Pricing models (subscription vs usage-based)
3. Technical capabilities (AI models used, supported sites)
4. Recent innovations or announcements
5. Common use cases and customer segments

Focus on:
- Tools launched or updated in 2023-2024
- Enterprise-grade solutions
- Comparison of features across competitors

Provide factual information with source attribution.
Structure the response by tool/company for easy comparison.
"""
```
Why it works:
- Clear research objective
- Specific aspects to investigate
- Time-bound criteria
- Comparison-friendly structure
- Attribution requirement
Competitive Intelligence Query
✅ Good Prompt:
```python
prompt = """
Analyze competitive landscape for B2B SaaS email marketing platforms.

Key information required:
1. Top 5 platforms by market share
2. Pricing tiers and included features
3. Recent product updates (last 6 months)
4. Integration ecosystems
5. Customer acquisition strategies
6. Reported growth metrics or funding

Prioritize:
- Direct feature comparisons
- Publicly available pricing
- Verified customer counts or revenue
- Recent news or announcements

Exclude platforms focused solely on B2C or with fewer than 1,000 customers.
Organize findings by platform with clear sections.
"""
```
Using Schemas for Structured Data
One of the most powerful features of ScrapeGraphAI is the ability to use schemas (such as Pydantic models) to enforce data structure and types. This ensures consistent, validated output every time.
Basic Schema Example
```python
from pydantic import BaseModel, Field
from typing import Optional
from scrapegraph_py import Client

class ProductSchema(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price as float")
    currency: str = Field(default="USD", description="Price currency")
    in_stock: bool = Field(description="Availability status")
    rating: Optional[float] = Field(None, description="Average rating 0-5")
    review_count: Optional[int] = Field(None, description="Number of reviews")

# Use with SmartScraper
client = Client(api_key="your-api-key")
response = client.smartscraper(
    website_url="https://example-shop.com/product",
    user_prompt="Extract product information",
    output_schema=ProductSchema
)

# Response will be validated against the schema
product = ProductSchema(**response['result'])
print(f"Product: {product.name} - ${product.price}")
```
Advanced Schema with Nested Models
```python
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Optional
from datetime import datetime

class PriceInfo(BaseModel):
    current: float = Field(description="Current price")
    original: Optional[float] = Field(None, description="Original price before discount")
    discount_percentage: Optional[float] = Field(None, description="Calculated discount %")
    currency: str = Field(default="USD")

class Specification(BaseModel):
    key: str = Field(description="Specification name")
    value: str = Field(description="Specification value")

class ProductListing(BaseModel):
    # Basic info
    name: str = Field(description="Product title")
    brand: str = Field(description="Brand or manufacturer")
    sku: Optional[str] = Field(None, description="Product SKU")

    # Pricing
    pricing: PriceInfo = Field(description="Price information")

    # Availability
    in_stock: bool = Field(description="Stock availability")
    stock_count: Optional[int] = Field(None, ge=0, description="Units in stock")

    # Details
    description: str = Field(description="Product description", max_length=500)
    main_image: HttpUrl = Field(description="Primary product image URL")
    specifications: List[Specification] = Field(
        default_factory=list,
        description="Technical specifications"
    )

    # Metadata
    scraped_at: datetime = Field(default_factory=datetime.now)
    source_url: HttpUrl

# Craft a prompt that aligns with your schema
prompt = """
Extract comprehensive product data matching our inventory system requirements.

Focus on:
1. Complete product identification (name, brand, SKU)
2. All pricing information (current, original, calculate discount)
3. Stock availability with specific count if shown
4. Technical specifications as key-value pairs
5. Product description (limit to 500 chars)
6. Main product image URL

Ensure all monetary values are numeric (not strings).
For specifications, extract all technical details shown on the page.
"""

response = client.smartscraper(
    website_url="https://example.com/product",
    user_prompt=prompt,
    output_schema=ProductListing
)
```
Schema Benefits
- Type Safety: Automatic validation of data types
- Consistency: Same structure across all extractions
- Error Handling: Clear errors for missing required fields
- Documentation: Self-documenting code with field descriptions
- IDE Support: Autocomplete and type hints
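The "Error Handling" benefit is easy to demonstrate even without the API in the loop. The stdlib-only sketch below (hypothetical helper names, using `dataclasses` instead of Pydantic) mirrors what a schema gives you: a missing required field fails loudly instead of slipping into your pipeline, much like Pydantic's `ValidationError` does when you pass `output_schema`:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:
    name: str
    price: float
    sku: Optional[str] = None  # optional field with a default

def from_result(result: dict) -> Product:
    """Fail loudly if a required field is absent, instead of silently passing bad data."""
    missing = [f for f in ("name", "price") if f not in result]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return Product(name=result["name"], price=float(result["price"]), sku=result.get("sku"))
```

With Pydantic, all of this comes for free from the model definition; the sketch just makes the failure mode explicit.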
Testing Your Prompts
The key to perfect prompts is iterative testing. Here's a systematic approach:
1. Start with the Playground
ScrapeGraphAI's playground (playground.scrapegraphai.com) is your best friend:
```python
# Test variations quickly
test_prompts = [
    "Extract product price",  # Too simple
    "Extract product price as a number without currency symbols",  # Better
    "Extract product price as float, currency as separate field",  # Best
]
```
2. Test Edge Cases
```python
# Test different scenarios
edge_cases = {
    "out_of_stock": "Extract price even if product is out of stock",
    "sale_price": "Extract both original and sale price, calculate discount",
    "multiple_variants": "Extract price range if multiple variants exist",
    "no_price": "Return null for price if not displayed (private/quote only)"
}
```
3. Validate Consistency
Run the same prompt multiple times to ensure consistent results:
```python
# Consistency test
for i in range(3):
    response = client.smartscraper(
        website_url=url,
        user_prompt=prompt,
        output_schema=ProductSchema
    )
    print(f"Run {i+1}: {response['result']}")
```
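If you want to automate this check rather than eyeballing the output, a small stdlib-only helper (hypothetical, not part of the SDK) can compare the fields you care about across repeated runs:

```python
def is_consistent(results: list, keys: list) -> bool:
    """Return True if every extraction run agrees on the given fields."""
    if not results:
        return False
    first = results[0]
    return all(run.get(k) == first.get(k) for run in results[1:] for k in keys)

# After collecting results from repeated calls:
# assert is_consistent(runs, ["name", "price"]), "prompt is not deterministic enough"
```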
4. Performance Optimization
```python
# Measure what matters
import time

# Specific prompt (faster, focused)
start = time.time()
specific_response = client.smartscraper(
    website_url=url,
    user_prompt="Extract only: product name, current price as float, stock status as boolean"
)
specific_time = time.time() - start

# General prompt (slower, more data)
start = time.time()
general_response = client.smartscraper(
    website_url=url,
    user_prompt="Extract all product information available on the page"
)
general_time = time.time() - start

print(f"Specific prompt: {specific_time:.2f}s")
print(f"General prompt: {general_time:.2f}s")
```
Advanced Prompting Techniques
1. Conditional Extraction
```python
prompt = """
Extract product data with conditional logic:

If product has reviews:
- Extract rating (float)
- Extract review count (integer)
- Extract top positive review excerpt
- Extract top critical review excerpt

If product is on sale:
- Extract original price
- Extract sale price
- Calculate discount percentage
- Extract sale end date if shown

If product has variants:
- Extract all variant options (color, size, etc.)
- Extract price range (min-max)
- Note which variant is selected by default
"""
```
2. Multi-Stage Extraction
```python
# Stage 1: Identify page type
identification_prompt = """
Identify the type of page:
- Product listing page (multiple products)
- Product detail page (single product)
- Category page
- Search results page

Return only the page type.
"""

# Stage 2: Extract based on page type
if page_type == "product detail page":
    extraction_prompt = "Extract detailed product information..."
elif page_type == "product listing page":
    extraction_prompt = "Extract summary info for each product..."
```
3. Context-Aware Prompting
```python
# Industry-specific prompt
prompt = """
Extract automotive parts data for inventory system:

Required technical specifications:
- Part number (OEM and aftermarket if both shown)
- Compatibility (make, model, year range)
- Fitment position (front/rear, left/right)
- Material composition
- Dimensions (with units)
- Weight (with units)
- Warranty period

Cross-reference information:
- OEM equivalent numbers
- Superseded part numbers
- Compatible vehicle list

Use automotive industry standard terminology.
Convert all measurements to metric if shown in imperial.
"""
```
4. Comparative Extraction
```python
# For comparison shopping
prompt = """
Extract data optimized for price comparison:

Standardize the following across products:
1. Product title (remove marketing fluff, keep: brand + model + key spec)
2. Price per unit (calculate if bulk pricing shown)
3. Shipping cost (separate from product price)
4. Total cost (product + shipping)
5. Availability (in-stock, pre-order, out-of-stock)
6. Seller rating (normalize to 0-5 scale)
7. Return policy summary

Make prices directly comparable by:
- Converting to USD if other currency
- Including all fees in total cost
- Noting if price is per item or per pack
"""
```
Pro Tips for Prompt Excellence
1. Be Explicit About Data Types
Instead of "extract the price", specify:
- "Extract price as a float"
- "Extract price as decimal number without currency symbols"
- "Extract price in cents as integer"
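Even with an explicit prompt, it is worth normalizing prices defensively on the client side. This stdlib-only sketch (a hypothetical helper, not part of the SDK) strips currency symbols and thousands separators so that whatever string variant comes back still yields a clean float:

```python
import re
from typing import Optional

def parse_price(raw: str) -> Optional[float]:
    """Extract a numeric price from a scraped string, or None if no number is present."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None
    return float(match.group().replace(",", ""))
```

For example, `parse_price("$1,299.99")` yields `1299.99`, while `parse_price("Call for price")` yields `None`.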
2. Handle Missing Data Gracefully
Always specify what to do when data isn't found:
```python
"If review count is not displayed, return 0"
"If SKU is not found, return null"
"If multiple prices are shown, extract the lowest"
```
3. Use Examples in Complex Cases
```python
prompt = """
Extract phone numbers in standardized format.

Examples of input -> output:
- "(555) 123-4567" -> "+1-555-123-4567"
- "555.123.4567" -> "+1-555-123-4567"
- "Call us at 555-1234" -> "+1-555-555-1234" (assuming local area code)
"""
```
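The same normalization can also be enforced after extraction as a safety net. This regex-based sketch (a hypothetical helper, US numbers only) implements the mapping shown in the prompt, including the local-number assumption:

```python
import re
from typing import Optional

def normalize_us_phone(raw: str, default_area: str = "555") -> Optional[str]:
    """Normalize a scraped US phone string to +1-AAA-PPP-NNNN form."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop leading country code
    if len(digits) == 7:
        digits = default_area + digits       # assume a local area code
    if len(digits) != 10:
        return None
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```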
4. Specify Priority Order
```python
prompt = """
Extract product description with priority:
1. First check the 'Product Details' section
2. If not found, check the 'Description' tab
3. If still not found, use the first paragraph under the product title
4. As a last resort, use the meta description

Limit to 200 characters.
"""
```
5. Version Your Prompts
```python
PROMPT_VERSIONS = {
    "v1": "Extract product name and price",
    "v2": "Extract product name (string) and price (float)",
    "v3": "Extract product: name (string), price (float), currency (string)",
    "v4": "Extract product: name (string), price (float), currency (ISO code), in_stock (boolean)"
}

# Track which version works best
current_version = "v4"
```
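To actually find out which version works best, something as simple as a success counter per version will do. This is an illustrative stdlib sketch, not an SDK feature; "success" here is whatever check you define (schema validation passed, required fields present, and so on):

```python
from collections import defaultdict
from typing import Optional

class PromptTracker:
    """Track success rates per prompt version."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"runs": 0, "ok": 0})

    def record(self, version: str, success: bool) -> None:
        self.stats[version]["runs"] += 1
        if success:
            self.stats[version]["ok"] += 1

    def best(self) -> Optional[str]:
        """Return the version with the highest observed success rate."""
        scored = {v: s["ok"] / s["runs"] for v, s in self.stats.items() if s["runs"]}
        return max(scored, key=scored.get) if scored else None
```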
Common Patterns by Industry
E-commerce
```python
"Extract product catalog data: name, price, availability, shipping info, return policy highlights"
```
Real Estate
```python
"Extract property listing: address, price, beds/baths, square footage, lot size, year built, days on market"
```
Job Listings
```python
"Extract job posting: title, company, location, salary range, required skills, experience level, application deadline"
```
News Articles
```python
"Extract article metadata: headline, author, publication date, summary (first paragraph), main topics/tags"
```
Financial Data
```python
"Extract stock information: current price, change (amount and percentage), volume, market cap, P/E ratio"
```
Troubleshooting Common Issues
Issue: Inconsistent Results
Solution: Add more structure and constraints
```python
# Instead of:
"Get product reviews"

# Use:
"Extract up to 5 most recent reviews with: reviewer name, rating (1-5), review date (MM/DD/YYYY), review text (first 100 chars)"
```
Issue: Too Much Data Returned
Solution: Add filtering criteria
```python
"Extract only products that are currently in stock and priced under $100"
```
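A prompt constraint like this is worth mirroring client-side as a safety net, since the model may occasionally let an off-spec item through. A minimal sketch, assuming each result dict carries `in_stock` and `price` keys:

```python
def filter_products(products: list, max_price: float = 100.0) -> list:
    """Keep only in-stock products under the price ceiling."""
    return [
        p for p in products
        if p.get("in_stock") and p.get("price", float("inf")) < max_price
    ]
```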
Issue: Wrong Data Format
Solution: Provide format examples
```python
"Extract date in ISO format (YYYY-MM-DD), price as decimal (e.g., 29.99), phone as E.164 format"
```
Issue: Missing Important Fields
Solution: Mark required vs optional clearly
```python
"Required fields: name, price, SKU. Optional fields: reviews, ratings, manufacturer"
```
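The required/optional split is also easy to verify after the fact. A one-line sketch (hypothetical helper) that reports which required fields came back empty:

```python
def missing_required(record: dict, required=("name", "price", "sku")) -> list:
    """List required fields that are absent or empty in an extraction result."""
    return [k for k in required if record.get(k) in (None, "")]
```

An empty return list means the extraction met the contract; anything else names exactly what to retry or flag.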
Measuring Prompt Quality
A good prompt should score well on these metrics:
- Specificity: Does it clearly define what to extract?
- Structure: Does it specify the output format?
- Completeness: Does it handle edge cases?
- Efficiency: Does it avoid requesting unnecessary data?
- Consistency: Does it produce reliable results across different pages?
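Four of these five metrics can be spot-checked statically; consistency has to be measured by re-running the prompt. The keyword heuristic below is purely illustrative, a rough linting pass rather than a real quality score:

```python
def score_prompt(prompt: str) -> dict:
    """Rough static checklist for prompt quality (consistency needs repeated runs)."""
    p = prompt.lower()
    return {
        "specificity": any(w in p for w in ("extract", "field", "only")),
        "structure": any(w in p for w in ("json", "format", "list")),
        "completeness": any(w in p for w in ("if ", "null", "missing")),
        "efficiency": "all information" not in p,
    }
```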
Conclusion
Mastering the art of prompting for ScrapeGraphAI is about finding the perfect balance between specificity and flexibility. The best prompts are:
- Clear in their objectives
- Specific about data requirements
- Structured in their output expectations
- Contextual about the use case
- Robust in handling edge cases
Remember, prompt engineering is an iterative process. Start with the playground, test thoroughly, and refine based on results. With practice and these guidelines, you'll be writing prompts that consistently deliver exactly the data you need.
The difference between amateur and professional web scraping often comes down to prompt quality. Invest time in crafting and testing your prompts—your future self (and your data pipeline) will thank you.
Next Steps
- Practice in the Playground: Start with simple extractions and gradually increase complexity
- Build a Prompt Library: Save successful prompts for reuse
- Implement Schemas: Add type safety with Pydantic schemas
- Monitor Performance: Track which prompts work best for your use cases
- Stay Updated: Follow ScrapeGraphAI updates for new features and capabilities
Happy scraping! 🚀
Related Resources
Want to dive deeper into AI-powered web scraping? Explore these guides:
- Mastering ScrapeGraphAI Endpoints - Deep dive into all endpoints
- Web Scraping with Pydantic - Advanced schema usage
- AI Agent Web Scraping - Building intelligent agents
- SearchScraper Deep Dive - Master multi-source extraction
- Web Scraping 101 - Fundamentals of web scraping
- LangChain Integration - Combine with LangChain
- Real-Time Data Extraction - Live data monitoring
- Building Full-Stack Apps - Complete applications
- Web Scraping Legality - Legal considerations