The Art of Prompting: How to Write Perfect Prompts for SmartScraper and SearchScraper

Master the art of prompt engineering for AI web scraping. Learn how to write effective prompts, avoid common mistakes, and use schemas for structured data extraction.

Tutorials · 13 min read · By Lorenzo Padoan

In the world of AI-powered web scraping, the difference between mediocre results and exceptional data extraction often comes down to one crucial factor: how well you craft your prompts. Whether you're using SmartScraper to extract data from specific websites or SearchScraper to aggregate information from multiple sources, mastering the art of prompting is essential for getting the structured, accurate data you need.

This comprehensive guide will transform you from a prompting novice to an expert, showing you exactly how to write prompts that deliver precise, structured results every time. We'll explore real examples, common pitfalls, and advanced techniques including schema usage for type-safe data extraction.

Understanding the Endpoints

Before diving into prompt engineering, it's crucial to understand when to use each endpoint:

SmartScraper

Purpose: Extract structured data from a specific URL
Use When: You have a target website and need specific information from it
Key Feature: Context-aware extraction from single sources

SearchScraper

Purpose: Search and aggregate data from multiple web sources
Use When: You need comprehensive information on a topic from various sources
Key Feature: Multi-source aggregation with attribution

The fundamental difference affects how you structure your prompts: SmartScraper prompts focus on what to extract, while SearchScraper prompts focus on what to find.
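This difference shows up directly in how you call each endpoint. A minimal sketch below contrasts the two prompt styles; the product, topic, and the commented-out SDK calls are illustrative assumptions based on the `scrapegraph_py` client used later in this post:

```python
# Two prompts illustrating the focus difference (product and topic are hypothetical):
SMARTSCRAPER_PROMPT = (
    "Extract the product name, current price as a float, "
    "and stock status from this page."
)
SEARCHSCRAPER_PROMPT = (
    "Find the leading AI-powered web scraping tools of 2024, "
    "their pricing models, and key features."
)

# With an API key, the calls differ in one key argument -- SmartScraper is
# anchored to a URL, SearchScraper is not (signatures assumed from the SDK):
# client.smartscraper(website_url="https://example.com/product/1",
#                     user_prompt=SMARTSCRAPER_PROMPT)
# client.searchscraper(user_prompt=SEARCHSCRAPER_PROMPT)
```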

The Anatomy of a Perfect Prompt

A well-crafted prompt consists of four essential components:

  1. Clear Objective: What data do you need?
  2. Specific Context: Why do you need this data?
  3. Structured Requirements: How should the data be formatted?
  4. Constraints: Any limitations or specific criteria?

Let's see this in action:

python
# Perfect prompt structure
prompt = """
Extract product information for price comparison analysis.
Focus on:
- Product name and brand
- Current price and any discounts
- Availability status
- Customer ratings (if available)
- Key specifications

Format as structured JSON with consistent field names.
Only include products currently in stock.
"""

Bad Prompts: What Not to Do

Understanding why prompts fail is crucial for improvement. Let's examine common mistakes:

1. The Vague Request

Bad Prompt:

python
"Get product data"

Why it fails:

  • No specification of which data points
  • No structure definition
  • No context for the AI to understand importance
  • Results in inconsistent, unstructured output

2. The Everything Request

Bad Prompt:

python
"Extract all information from this page"

Why it fails:

  • Overwhelming and unfocused
  • Returns unnecessary data
  • Difficult to process programmatically
  • Wastes API resources

3. The Ambiguous Format

Bad Prompt:

python
"Find prices and names and descriptions and everything else important"

Why it fails:

  • No clear structure
  • "Everything else important" is subjective
  • Run-on sentence structure confuses parsing
  • No hierarchy of importance

4. The Context-Free Request

Bad Prompt:

python
"List all companies mentioned"

Why it fails:

  • No context about which companies matter
  • No format specification
  • No filtering criteria
  • May include irrelevant mentions

Good Prompts: Best Practices in Action

Now let's look at exemplary prompts for each endpoint:

SmartScraper Examples

E-commerce Product Extraction

Good Prompt:

python
prompt = """
Extract detailed product information for inventory management system.

Required fields:
1. Product identification:
   - name (full product name)
   - sku (if available)
   - brand
   
2. Pricing information:
   - current_price (numeric value)
   - original_price (if on sale)
   - currency
   - discount_percentage (calculate if both prices available)
   
3. Availability:
   - in_stock (boolean)
   - stock_count (if displayed)
   
4. Product details:
   - main_image_url
   - description (first 200 characters)
   - key_features (list of up to 5 main features)
   
Return as structured JSON. If a field is not found, use null.
Focus only on the main product, ignore related or recommended items.
"""

Why it works:

  • Clear purpose stated upfront
  • Structured field requirements
  • Specific data types indicated
  • Handling for missing data
  • Constraints to avoid noise

Real Estate Listing Extraction

Good Prompt:

python
prompt = """
Extract real estate listing data for market analysis database.

Core information needed:
- Property type (house/apartment/condo)
- Price (numeric, in USD)
- Bedrooms (number)
- Bathrooms (number) 
- Square footage (number)
- Address (full address if available)
- Year built
- Listing date
- MLS number (if present)

Additional features:
- Amenities (list key amenities like pool, garage, etc.)
- Property description (first paragraph only)
- Agent name and contact
- Monthly HOA fees (if applicable)

Format as clean JSON with snake_case keys.
Only extract if property is actively for sale.
"""

SearchScraper Examples

Market Research Query

Good Prompt:

python
prompt = """
Research the current state of AI-powered web scraping tools in 2024.

I need:
1. Market leaders and their key features
2. Pricing models (subscription vs usage-based)
3. Technical capabilities (AI models used, supported sites)
4. Recent innovations or announcements
5. Common use cases and customer segments

Focus on:
- Tools launched or updated in 2023-2024
- Enterprise-grade solutions
- Comparison of features across competitors

Provide factual information with source attribution.
Structure the response by tool/company for easy comparison.
"""

Why it works:

  • Clear research objective
  • Specific aspects to investigate
  • Time-bound criteria
  • Comparison-friendly structure
  • Attribution requirement

Competitive Intelligence Query

Good Prompt:

python
prompt = """
Analyze competitive landscape for B2B SaaS email marketing platforms.

Key information required:
1. Top 5 platforms by market share
2. Pricing tiers and included features
3. Recent product updates (last 6 months)
4. Integration ecosystems
5. Customer acquisition strategies
6. Reported growth metrics or funding

Prioritize:
- Direct feature comparisons
- Publicly available pricing
- Verified customer counts or revenue
- Recent news or announcements

Exclude platforms focused solely on B2C or with fewer than 1,000 customers.
Organize findings by platform with clear sections.
"""

Using Schemas for Structured Data

One of the most powerful features of ScrapeGraphAI is the ability to use schemas (like Pydantic) to enforce data structure and types. This ensures consistent, validated output every time.

Basic Schema Example

python
from pydantic import BaseModel, Field
from typing import List, Optional
from scrapegraph_py import Client

class ProductSchema(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price as float")
    currency: str = Field(default="USD", description="Price currency")
    in_stock: bool = Field(description="Availability status")
    rating: Optional[float] = Field(None, description="Average rating 0-5")
    review_count: Optional[int] = Field(None, description="Number of reviews")

# Use with SmartScraper
client = Client(api_key="your-api-key")
response = client.smartscraper(
    website_url="https://example-shop.com/product",
    user_prompt="Extract product information",
    output_schema=ProductSchema
)

# Response will be validated against schema
product = ProductSchema(**response['result'])
print(f"Product: {product.name} - ${product.price}")

Advanced Schema with Nested Models

python
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Optional
from datetime import datetime

class PriceInfo(BaseModel):
    current: float = Field(description="Current price")
    original: Optional[float] = Field(None, description="Original price before discount")
    discount_percentage: Optional[float] = Field(None, description="Calculated discount %")
    currency: str = Field(default="USD")

class Specifications(BaseModel):
    key: str = Field(description="Specification name")
    value: str = Field(description="Specification value")

class ProductListing(BaseModel):
    # Basic info
    name: str = Field(description="Product title")
    brand: str = Field(description="Brand or manufacturer")
    sku: Optional[str] = Field(None, description="Product SKU")
    
    # Pricing
    pricing: PriceInfo = Field(description="Price information")
    
    # Availability
    in_stock: bool = Field(description="Stock availability")
    stock_count: Optional[int] = Field(None, ge=0, description="Units in stock")
    
    # Details
    description: str = Field(description="Product description", max_length=500)
    main_image: HttpUrl = Field(description="Primary product image URL")
    specifications: List[Specifications] = Field(
        default_factory=list,
        description="Technical specifications"
    )
    
    # Metadata
    scraped_at: datetime = Field(default_factory=datetime.now)
    source_url: HttpUrl

# Craft a prompt that aligns with your schema
prompt = """
Extract comprehensive product data matching our inventory system requirements.

Focus on:
1. Complete product identification (name, brand, SKU)
2. All pricing information (current, original, calculate discount)
3. Stock availability with specific count if shown
4. Technical specifications as key-value pairs
5. Product description (limit to 500 chars)
6. Main product image URL

Ensure all monetary values are numeric (not strings).
For specifications, extract all technical details shown on the page.
"""

response = client.smartscraper(
    website_url="https://example.com/product",
    user_prompt=prompt,
    output_schema=ProductListing
)

Schema Benefits

  1. Type Safety: Automatic validation of data types
  2. Consistency: Same structure across all extractions
  3. Error Handling: Clear errors for missing required fields
  4. Documentation: Self-documenting code with field descriptions
  5. IDE Support: Autocomplete and type hints
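The type-safety and error-handling benefits are easy to see in isolation. Below, a trimmed version of the earlier ProductSchema coerces a numeric string into a float but rejects a non-numeric price with an error that names the offending field (a sketch; the sample values are made up):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class ProductSchema(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price as float")
    in_stock: bool = Field(description="Availability status")
    rating: Optional[float] = Field(None, description="Average rating 0-5")

# A numeric string is coerced to float automatically
good = ProductSchema(name="Widget", price="19.99", in_stock=True)

# A non-numeric price fails fast, pointing at the exact field
try:
    ProductSchema(name="Widget", price="contact us", in_stock=True)
except ValidationError as err:
    bad_field = err.errors()[0]["loc"][0]  # 'price'
```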

Testing Your Prompts

The key to perfect prompts is iterative testing. Here's a systematic approach:

1. Start with the Playground

ScrapeGraphAI's playground (playground.scrapegraphai.com) is your best friend:

python
# Test variations quickly
test_prompts = [
    "Extract product price",  # Too simple
    "Extract product price as a number without currency symbols",  # Better
    "Extract product price as float, currency as separate field",  # Best
]

2. Test Edge Cases

python
# Test different scenarios
edge_cases = {
    "out_of_stock": "Extract price even if product is out of stock",
    "sale_price": "Extract both original and sale price, calculate discount",
    "multiple_variants": "Extract price range if multiple variants exist",
    "no_price": "Return null for price if not displayed (private/quote only)"
}

3. Validate Consistency

Run the same prompt multiple times to ensure consistent results:

python
# Consistency test
for i in range(3):
    response = client.smartscraper(
        website_url=url,
        user_prompt=prompt,
        output_schema=ProductSchema
    )
    print(f"Run {i+1}: {response['result']}")
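You can quantify the drift across those runs with a small post-processing helper (hypothetical, not part of the SDK) that counts how many distinct values each field produced; any field with more than one distinct value is a candidate for a tighter prompt:

```python
def field_consistency(runs):
    """Count how many distinct values each field took across repeated runs."""
    keys = set().union(*runs)
    return {key: len({repr(run.get(key)) for run in runs}) for key in keys}

runs = [
    {"name": "Widget", "price": 19.99},
    {"name": "Widget", "price": 19.99},
    {"name": "Widget Pro", "price": 19.99},  # one run drifted on 'name'
]
report = field_consistency(runs)  # 'name' -> 2 distinct values: unstable
```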

4. Performance Optimization

python
# Measure what matters
import time

# Specific prompt (faster, focused)
start = time.time()
specific_response = client.smartscraper(
    website_url=url,
    user_prompt="Extract only: product name, current price as float, stock status as boolean"
)
specific_time = time.time() - start

# General prompt (slower, more data)
start = time.time()
general_response = client.smartscraper(
    website_url=url,
    user_prompt="Extract all product information available on the page"
)
general_time = time.time() - start

print(f"Specific prompt: {specific_time:.2f}s")
print(f"General prompt: {general_time:.2f}s")

Advanced Prompting Techniques

1. Conditional Extraction

python
prompt = """
Extract product data with conditional logic:

If product has reviews:
  - Extract rating (float)
  - Extract review count (integer)
  - Extract top positive review excerpt
  - Extract top critical review excerpt
  
If product is on sale:
  - Extract original price
  - Extract sale price  
  - Calculate discount percentage
  - Extract sale end date if shown
  
If product has variants:
  - Extract all variant options (color, size, etc)
  - Extract price range (min-max)
  - Note which variant is default selected
"""

2. Multi-Stage Extraction

python
# Stage 1: Identify page type
identification_prompt = """
Identify the type of page:
- Product listing page (multiple products)
- Product detail page (single product)
- Category page
- Search results page

Return only the page type.
"""

# Stage 2: run the extraction that matches the detected page type
page_type = client.smartscraper(
    website_url=url,
    user_prompt=identification_prompt
)['result']

if page_type == "product detail page":
    extraction_prompt = "Extract detailed product information..."
elif page_type == "product listing page":
    extraction_prompt = "Extract summary info for each product..."

3. Context-Aware Prompting

python
# Industry-specific prompt
prompt = """
Extract automotive parts data for inventory system:

Required technical specifications:
- Part number (OEM and aftermarket if both shown)
- Compatibility (make, model, year range)
- Fitment position (front/rear, left/right)
- Material composition
- Dimensions (with units)
- Weight (with units)
- Warranty period

Cross-reference information:
- OEM equivalent numbers
- Superseded part numbers
- Compatible vehicle list

Use automotive industry standard terminology.
Convert all measurements to metric if shown in imperial.
"""

4. Comparative Extraction

python
# For comparison shopping
prompt = """
Extract data optimized for price comparison:

Standardize the following across products:
1. Product title (remove marketing fluff, keep: brand + model + key spec)
2. Price per unit (calculate if bulk pricing shown)
3. Shipping cost (separate from product price)
4. Total cost (product + shipping)
5. Availability (in-stock, pre-order, out-of-stock)
6. Seller rating (normalize to 0-5 scale)
7. Return policy summary

Make prices directly comparable by:
- Converting to USD if other currency
- Including all fees in total cost
- Noting if price is per item or per pack
"""
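The normalization rules at the end of that prompt can also be enforced in post-processing. A minimal sketch, where the exchange rate and listing numbers are made up for illustration:

```python
def total_cost(price, shipping, rate_to_usd=1.0):
    """Product price plus shipping, converted to USD at an assumed rate."""
    return round((price + shipping) * rate_to_usd, 2)

def price_per_unit(pack_price, units):
    """Normalize bulk pricing so listings are directly comparable."""
    return round(pack_price / units, 4)

# Two hypothetical listings for the same item:
usd_listing = total_cost(24.99, 4.99)                    # 29.98
eur_listing = total_cost(21.50, 6.00, rate_to_usd=1.08)  # ~29.70 at an assumed rate
six_pack = price_per_unit(35.94, 6)                      # 5.99 per unit
```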

Pro Tips for Prompt Excellence

1. Be Explicit About Data Types

Instead of "extract the price", specify:

  • "Extract price as a float"
  • "Extract price as decimal number without currency symbols"
  • "Extract price in cents as integer"
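Even with explicit type instructions, it is worth guarding against a price coming back as marked-up text. A small defensive parser (a hypothetical helper, not part of the SDK) that extracts the first price-like token as both a float and integer cents:

```python
import re

def parse_price(raw):
    """Extract the first price-like token from text as (float, integer cents)."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None
    value = float(match.group().replace(",", ""))
    return value, round(value * 100)

print(parse_price("$1,299.99"))    # (1299.99, 129999)
print(parse_price("Now only 49€")) # (49.0, 4900)
```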

2. Handle Missing Data Gracefully

Always specify what to do when data isn't found:

python
"If review count is not displayed, return 0"
"If SKU is not found, return null"
"If multiple prices shown, extract the lowest"
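The same fallback rules can be applied defensively on your side, so downstream code never meets an unexpected missing key. A sketch, where the field names and fallback values are illustrative:

```python
# Agreed fallbacks for fields the extraction may omit (field names illustrative)
DEFAULTS = {"review_count": 0, "sku": None, "rating": None}

def apply_fallbacks(result, defaults=DEFAULTS):
    """Overlay extracted values on the fallbacks, ignoring explicit nulls."""
    present = {k: v for k, v in result.items() if v is not None}
    return {**defaults, **present}

raw = {"name": "Widget", "price": 19.99, "sku": None}
clean = apply_fallbacks(raw)  # review_count -> 0, sku stays None, name/price kept
```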

3. Use Examples in Complex Cases

python
prompt = """
Extract phone numbers in standardized format.
Examples of input -> output:
- "(555) 123-4567" -> "+1-555-123-4567"
- "555.123.4567" -> "+1-555-123-4567"  
- "Call us at 555-1234" -> "+1-555-555-1234" (assuming local area code)
"""
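The transformation that prompt describes can also be double-checked locally. A US-centric normalizer matching the examples above, where the default-area-code fallback is the same assumption the prompt makes:

```python
import re

def normalize_us_phone(raw, default_area="555"):
    """Normalize US phone strings to +1-XXX-XXX-XXXX (area-code fallback assumed)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 7:                       # local number: assume area code
        digits = default_area + digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # strip leading country code
    if len(digits) != 10:
        return None                            # not a recognizable US number
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```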

4. Specify Priority Order

python
prompt = """
Extract product description with priority:
1. First check 'Product Details' section
2. If not found, check 'Description' tab
3. If still not found, use first paragraph under product title
4. As last resort, use meta description

Limit to 200 characters.
"""
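Once each candidate source has been extracted, the same priority chain is a one-liner to apply in code (the sample strings below are hypothetical):

```python
def first_available(*candidates):
    """Return the first non-empty candidate, mirroring the priority order above."""
    return next((c for c in candidates if c), None)

description = first_available(
    None,                          # 'Product Details' section missing
    "",                            # 'Description' tab empty
    "A rugged widget built for everyday use, with a steel frame.",  # first paragraph
    "Fallback meta description",
)
description = description[:200] if description else None  # enforce the length cap
```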

5. Version Your Prompts

python
PROMPT_VERSIONS = {
    "v1": "Extract product name and price",
    "v2": "Extract product name (string) and price (float)",
    "v3": "Extract product: name (string), price (float), currency (string)",
    "v4": "Extract product: name (string), price (float), currency (ISO code), in_stock (boolean)"
}

# Track which version works best
current_version = "v4"
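A lightweight way to track which version works best is to score each run by how many required fields came back non-null. The scoring rule and field set here are illustrative:

```python
from collections import defaultdict

REQUIRED = {"name", "price", "currency", "in_stock"}  # illustrative field set
scores = defaultdict(list)

def record_run(version, result):
    """Score a run by the fraction of required fields that came back non-null."""
    filled = sum(1 for field in REQUIRED if result.get(field) is not None)
    scores[version].append(filled / len(REQUIRED))

record_run("v4", {"name": "Widget", "price": 19.99,
                  "currency": "USD", "in_stock": True})
record_run("v1", {"name": "Widget", "price": None})

best = max(scores, key=lambda v: sum(scores[v]) / len(scores[v]))  # 'v4'
```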

Common Patterns by Industry

E-commerce

python
"Extract product catalog data: name, price, availability, shipping info, return policy highlights"

Real Estate

python
"Extract property listing: address, price, beds/baths, square footage, lot size, year built, days on market"

Job Listings

python
"Extract job posting: title, company, location, salary range, required skills, experience level, application deadline"

News Articles

python
"Extract article metadata: headline, author, publication date, summary (first paragraph), main topics/tags"

Financial Data

python
"Extract stock information: current price, change (amount and percentage), volume, market cap, P/E ratio"

Troubleshooting Common Issues

Issue: Inconsistent Results

Solution: Add more structure and constraints

python
# Instead of:
"Get product reviews"

# Use:
"Extract up to 5 most recent reviews with: reviewer name, rating (1-5), review date (MM/DD/YYYY), review text (first 100 chars)"

Issue: Too Much Data Returned

Solution: Add filtering criteria

python
"Extract only products that are currently in stock and priced under $100"

Issue: Wrong Data Format

Solution: Provide format examples

python
"Extract date in ISO format (YYYY-MM-DD), price as decimal (e.g., 29.99), phone as E.164 format"

Issue: Missing Important Fields

Solution: Mark required vs optional clearly

python
"Required fields: name, price, SKU. Optional fields: reviews, ratings, manufacturer"

Measuring Prompt Quality

A good prompt should score well on these metrics:

  1. Specificity: Does it clearly define what to extract?
  2. Structure: Does it specify the output format?
  3. Completeness: Does it handle edge cases?
  4. Efficiency: Does it avoid requesting unnecessary data?
  5. Consistency: Does it produce reliable results across different pages?
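The first four metrics can even be linted mechanically before you spend an API call. A rough lexical check, where the keyword lists are illustrative assumptions rather than a real quality model:

```python
# Illustrative keyword lists, one per quality metric (not a real quality model)
CHECKS = {
    "specificity": ("extract", "only", "focus"),
    "structure":   ("json", "list", "field", "format"),
    "types":       ("float", "integer", "boolean", "string", "number"),
    "edge_cases":  ("null", "if not", "if available", "missing"),
}

def lint_prompt(prompt):
    """Flag which quality dimensions a prompt visibly addresses."""
    text = prompt.lower()
    return {name: any(kw in text for kw in kws) for name, kws in CHECKS.items()}

report = lint_prompt(
    "Extract product name (string) and price (float); return null if missing."
)
# report shows this prompt never specifies an output format ('structure': False)
```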

Conclusion

Mastering the art of prompting for ScrapeGraphAI is about finding the perfect balance between specificity and flexibility. The best prompts are:

  • Clear in their objectives
  • Specific about data requirements
  • Structured in their output expectations
  • Contextual about the use case
  • Robust in handling edge cases

Remember, prompt engineering is an iterative process. Start with the playground, test thoroughly, and refine based on results. With practice and these guidelines, you'll be writing prompts that consistently deliver exactly the data you need.

The difference between amateur and professional web scraping often comes down to prompt quality. Invest time in crafting and testing your prompts—your future self (and your data pipeline) will thank you.

Next Steps

  1. Practice in the Playground: Start with simple extractions and gradually increase complexity
  2. Build a Prompt Library: Save successful prompts for reuse
  3. Implement Schemas: Add type safety with Pydantic schemas
  4. Monitor Performance: Track which prompts work best for your use cases
  5. Stay Updated: Follow ScrapeGraphAI updates for new features and capabilities

Happy scraping! 🚀

Want to dive deeper into AI-powered web scraping? Explore the other guides on our blog.