The Art of Prompting: How to Write Perfect Prompts for SmartScraper and SearchScraper

Master the art of prompt engineering for AI web scraping. Learn how to write effective prompts, avoid common mistakes, and use schemas for structured data extraction.

Tutorials · 13 min read · By Lorenzo Padoan

In the world of AI-powered web scraping, the difference between mediocre results and exceptional data extraction often comes down to one crucial factor: how well you craft your prompts. Whether you're using SmartScraper to extract data from specific websites or SearchScraper to aggregate information from multiple sources, mastering the art of prompting is essential for getting the structured, accurate data you need.

This comprehensive guide will transform you from a prompting novice to an expert, showing you exactly how to write prompts that deliver precise, structured results every time. We'll explore real examples, common pitfalls, and advanced techniques including schema usage for type-safe data extraction.

Understanding the Endpoints

Before diving into prompt engineering, it's crucial to understand when to use each endpoint:

SmartScraper

Purpose: Extract structured data from a specific URL
Use When: You have a target website and need specific information from it
Key Feature: Context-aware extraction from single sources

SearchScraper

Purpose: Search and aggregate data from multiple web sources
Use When: You need comprehensive information on a topic from various sources
Key Feature: Multi-source aggregation with attribution

The fundamental difference affects how you structure your prompts: SmartScraper prompts focus on what to extract, while SearchScraper prompts focus on what to find.
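This difference shows up directly in how you call each endpoint. A minimal sketch below contrasts the two prompt styles; the product, topic, and the commented-out SDK calls are illustrative assumptions based on the `scrapegraph_py` client used later in this post:

```python
# Two prompts illustrating the focus difference (product and topic are hypothetical):
SMARTSCRAPER_PROMPT = (
    "Extract the product name, current price as a float, "
    "and stock status from this page."
)
SEARCHSCRAPER_PROMPT = (
    "Find the leading AI-powered web scraping tools of 2024, "
    "their pricing models, and key features."
)

# With an API key, the calls differ in one key argument -- SmartScraper is
# anchored to a URL, SearchScraper is not (signatures assumed from the SDK):
# client.smartscraper(website_url="https://example.com/product/1",
#                     user_prompt=SMARTSCRAPER_PROMPT)
# client.searchscraper(user_prompt=SEARCHSCRAPER_PROMPT)
```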

The Anatomy of a Perfect Prompt

A well-crafted prompt consists of four essential components:

  1. Clear Objective: What data do you need?
  2. Specific Context: Why do you need this data?
  3. Structured Requirements: How should the data be formatted?
  4. Constraints: Any limitations or specific criteria?

Let's see this in action:

python
# Perfect prompt structure
prompt = """
Extract product information for price comparison analysis.
Focus on:
- Product name and brand
- Current price and any discounts
- Availability status
- Customer ratings (if available)
- Key specifications

Format as structured JSON with consistent field names.
Only include products currently in stock.
"""

Bad Prompts: What Not to Do

Understanding why prompts fail is crucial for improvement. Let's examine common mistakes:

1. The Vague Request

Bad Prompt:

python
"Get product data"

Why it fails:

  • No specification of which data points
  • No structure definition
  • No context for the AI to understand importance
  • Results in inconsistent, unstructured output

2. The Everything Request

Bad Prompt:

python
"Extract all information from this page"

Why it fails:

  • Overwhelming and unfocused
  • Returns unnecessary data
  • Difficult to process programmatically
  • Wastes API resources

3. The Ambiguous Format

Bad Prompt:

python
"Find prices and names and descriptions and everything else important"

Why it fails:

  • No clear structure
  • "Everything else important" is subjective
  • Run-on sentence structure confuses parsing
  • No hierarchy of importance

4. The Context-Free Request

Bad Prompt:

python
"List all companies mentioned"

Why it fails:

  • No context about which companies matter
  • No format specification
  • No filtering criteria
  • May include irrelevant mentions

Good Prompts: Best Practices in Action

Now let's look at exemplary prompts for each endpoint:

SmartScraper Examples

E-commerce Product Extraction

Good Prompt:

python
prompt = """
Extract detailed product information for inventory management system.

Required fields:
1. Product identification:
   - name (full product name)
   - sku (if available)
   - brand
   
2. Pricing information:
   - current_price (numeric value)
   - original_price (if on sale)
   - currency
   - discount_percentage (calculate if both prices available)
   
3. Availability:
   - in_stock (boolean)
   - stock_count (if displayed)
   
4. Product details:
   - main_image_url
   - description (first 200 characters)
   - key_features (list of up to 5 main features)
   
Return as structured JSON. If a field is not found, use null.
Focus only on the main product, ignore related or recommended items.
"""

Why it works:

  • Clear purpose stated upfront
  • Structured field requirements
  • Specific data types indicated
  • Handling for missing data
  • Constraints to avoid noise

Real Estate Listing Extraction

Good Prompt:

python
prompt = """
Extract real estate listing data for market analysis database.

Core information needed:
- Property type (house/apartment/condo)
- Price (numeric, in USD)
- Bedrooms (number)
- Bathrooms (number) 
- Square footage (number)
- Address (full address if available)
- Year built
- Listing date
- MLS number (if present)

Additional features:
- Amenities (list key amenities like pool, garage, etc.)
- Property description (first paragraph only)
- Agent name and contact
- Monthly HOA fees (if applicable)

Format as clean JSON with snake_case keys.
Only extract if property is actively for sale.
"""

SearchScraper Examples

Market Research Query

Good Prompt:

python
prompt = """
Research the current state of AI-powered web scraping tools in 2024.

I need:
1. Market leaders and their key features
2. Pricing models (subscription vs usage-based)
3. Technical capabilities (AI models used, supported sites)
4. Recent innovations or announcements
5. Common use cases and customer segments

Focus on:
- Tools launched or updated in 2023-2024
- Enterprise-grade solutions
- Comparison of features across competitors

Provide factual information with source attribution.
Structure the response by tool/company for easy comparison.
"""

Why it works:

  • Clear research objective
  • Specific aspects to investigate
  • Time-bound criteria
  • Comparison-friendly structure
  • Attribution requirement

Competitive Intelligence Query

Good Prompt:

python
prompt = """
Analyze competitive landscape for B2B SaaS email marketing platforms.

Key information required:
1. Top 5 platforms by market share
2. Pricing tiers and included features
3. Recent product updates (last 6 months)
4. Integration ecosystems
5. Customer acquisition strategies
6. Reported growth metrics or funding

Prioritize:
- Direct feature comparisons
- Publicly available pricing
- Verified customer counts or revenue
- Recent news or announcements

Exclude platforms focused solely on B2C or with fewer than 1,000 customers.
Organize findings by platform with clear sections.
"""

Using Schemas for Structured Data

One of the most powerful features of ScrapeGraphAI is the ability to use schemas (like Pydantic) to enforce data structure and types. This ensures consistent, validated output every time.

Basic Schema Example

python
from pydantic import BaseModel, Field
from typing import List, Optional
from scrapegraph_py import Client

class ProductSchema(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price as float")
    currency: str = Field(default="USD", description="Price currency")
    in_stock: bool = Field(description="Availability status")
    rating: Optional[float] = Field(None, description="Average rating 0-5")
    review_count: Optional[int] = Field(None, description="Number of reviews")

# Use with SmartScraper
client = Client(api_key="your-api-key")
response = client.smartscraper(
    website_url="https://example-shop.com/product",
    user_prompt="Extract product information",
    output_schema=ProductSchema
)

# Response will be validated against schema
product = ProductSchema(**response['result'])
print(f"Product: {product.name} - ${product.price}")

Advanced Schema with Nested Models

python
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Optional
from datetime import datetime

class PriceInfo(BaseModel):
    current: float = Field(description="Current price")
    original: Optional[float] = Field(None, description="Original price before discount")
    discount_percentage: Optional[float] = Field(None, description="Calculated discount %")
    currency: str = Field(default="USD")

class Specifications(BaseModel):
    key: str = Field(description="Specification name")
    value: str = Field(description="Specification value")

class ProductListing(BaseModel):
    # Basic info
    name: str = Field(description="Product title")
    brand: str = Field(description="Brand or manufacturer")
    sku: Optional[str] = Field(None, description="Product SKU")
    
    # Pricing
    pricing: PriceInfo = Field(description="Price information")
    
    # Availability
    in_stock: bool = Field(description="Stock availability")
    stock_count: Optional[int] = Field(None, ge=0, description="Units in stock")
    
    # Details
    description: str = Field(description="Product description", max_length=500)
    main_image: HttpUrl = Field(description="Primary product image URL")
    specifications: List[Specifications] = Field(
        default_factory=list,
        description="Technical specifications"
    )
    
    # Metadata
    scraped_at: datetime = Field(default_factory=datetime.now)
    source_url: HttpUrl

# Craft a prompt that aligns with your schema
prompt = """
Extract comprehensive product data matching our inventory system requirements.

Focus on:
1. Complete product identification (name, brand, SKU)
2. All pricing information (current, original, calculate discount)
3. Stock availability with specific count if shown
4. Technical specifications as key-value pairs
5. Product description (limit to 500 chars)
6. Main product image URL

Ensure all monetary values are numeric (not strings).
For specifications, extract all technical details shown on the page.
"""

response = client.smartscraper(
    website_url="https://example.com/product",
    user_prompt=prompt,
    output_schema=ProductListing
)

Schema Benefits

  1. Type Safety: Automatic validation of data types
  2. Consistency: Same structure across all extractions
  3. Error Handling: Clear errors for missing required fields
  4. Documentation: Self-documenting code with field descriptions
  5. IDE Support: Autocomplete and type hints
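The type-safety and error-handling benefits are easy to see in isolation. Below, a trimmed version of the earlier ProductSchema coerces a numeric string into a float but rejects a non-numeric price with an error that names the offending field (a sketch; the sample values are made up):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class ProductSchema(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price as float")
    in_stock: bool = Field(description="Availability status")
    rating: Optional[float] = Field(None, description="Average rating 0-5")

# A numeric string is coerced to float automatically
good = ProductSchema(name="Widget", price="19.99", in_stock=True)

# A non-numeric price fails fast, pointing at the exact field
try:
    ProductSchema(name="Widget", price="contact us", in_stock=True)
except ValidationError as err:
    bad_field = err.errors()[0]["loc"][0]  # 'price'
```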

Testing Your Prompts

The key to perfect prompts is iterative testing. Here's a systematic approach:

1. Start with the Playground

ScrapeGraphAI's playground (playground.scrapegraphai.com) is your best friend:

python
# Test variations quickly
test_prompts = [
    "Extract product price",  # Too simple
    "Extract product price as a number without currency symbols",  # Better
    "Extract product price as float, currency as separate field",  # Best
]

2. Test Edge Cases

python
# Test different scenarios
edge_cases = {
    "out_of_stock": "Extract price even if product is out of stock",
    "sale_price": "Extract both original and sale price, calculate discount",
    "multiple_variants": "Extract price range if multiple variants exist",
    "no_price": "Return null for price if not displayed (private/quote only)"
}

3. Validate Consistency

Run the same prompt multiple times to ensure consistent results:

python
# Consistency test
for i in range(3):
    response = client.smartscraper(
        website_url=url,
        user_prompt=prompt,
        output_schema=ProductSchema
    )
    print(f"Run {i+1}: {response['result']}")
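You can quantify the drift across those runs with a small post-processing helper (hypothetical, not part of the SDK) that counts how many distinct values each field produced; any field with more than one distinct value is a candidate for a tighter prompt:

```python
def field_consistency(runs):
    """Count how many distinct values each field took across repeated runs."""
    keys = set().union(*runs)
    return {key: len({repr(run.get(key)) for run in runs}) for key in keys}

runs = [
    {"name": "Widget", "price": 19.99},
    {"name": "Widget", "price": 19.99},
    {"name": "Widget Pro", "price": 19.99},  # one run drifted on 'name'
]
report = field_consistency(runs)  # 'name' -> 2 distinct values: unstable
```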

4. Performance Optimization

python
# Measure what matters
import time

# Specific prompt (faster, focused)
start = time.time()
specific_response = client.smartscraper(
    website_url=url,
    user_prompt="Extract only: product name, current price as float, stock status as boolean"
)
specific_time = time.time() - start

# General prompt (slower, more data)
start = time.time()
general_response = client.smartscraper(
    website_url=url,
    user_prompt="Extract all product information available on the page"
)
general_time = time.time() - start

print(f"Specific prompt: {specific_time:.2f}s")
print(f"General prompt: {general_time:.2f}s")

Advanced Prompting Techniques

1. Conditional Extraction

python
prompt = """
Extract product data with conditional logic:

If product has reviews:
  - Extract rating (float)
  - Extract review count (integer)
  - Extract top positive review excerpt
  - Extract top critical review excerpt
  
If product is on sale:
  - Extract original price
  - Extract sale price  
  - Calculate discount percentage
  - Extract sale end date if shown
  
If product has variants:
  - Extract all variant options (color, size, etc)
  - Extract price range (min-max)
  - Note which variant is default selected
"""

2. Multi-Stage Extraction

python
# Stage 1: Identify page type
identification_prompt = """
Identify the type of page:
- Product listing page (multiple products)
- Product detail page (single product)
- Category page
- Search results page

Return only the page type.
"""

# Stage 2: run the extraction that matches the detected page type
page_type = client.smartscraper(
    website_url=url,
    user_prompt=identification_prompt
)['result']

if page_type == "product detail page":
    extraction_prompt = "Extract detailed product information..."
elif page_type == "product listing page":
    extraction_prompt = "Extract summary info for each product..."

3. Context-Aware Prompting

python
# Industry-specific prompt
prompt = """
Extract automotive parts data for inventory system:

Required technical specifications:
- Part number (OEM and aftermarket if both shown)
- Compatibility (make, model, year range)
- Fitment position (front/rear, left/right)
- Material composition
- Dimensions (with units)
- Weight (with units)
- Warranty period

Cross-reference information:
- OEM equivalent numbers
- Superseded part numbers
- Compatible vehicle list

Use automotive industry standard terminology.
Convert all measurements to metric if shown in imperial.
"""

4. Comparative Extraction

python
# For comparison shopping
prompt = """
Extract data optimized for price comparison:

Standardize the following across products:
1. Product title (remove marketing fluff, keep: brand + model + key spec)
2. Price per unit (calculate if bulk pricing shown)
3. Shipping cost (separate from product price)
4. Total cost (product + shipping)
5. Availability (in-stock, pre-order, out-of-stock)
6. Seller rating (normalize to 0-5 scale)
7. Return policy summary

Make prices directly comparable by:
- Converting to USD if other currency
- Including all fees in total cost
- Noting if price is per item or per pack
"""
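The normalization rules at the end of that prompt can also be enforced in post-processing. A minimal sketch, where the exchange rate and listing numbers are made up for illustration:

```python
def total_cost(price, shipping, rate_to_usd=1.0):
    """Product price plus shipping, converted to USD at an assumed rate."""
    return round((price + shipping) * rate_to_usd, 2)

def price_per_unit(pack_price, units):
    """Normalize bulk pricing so listings are directly comparable."""
    return round(pack_price / units, 4)

# Two hypothetical listings for the same item:
usd_listing = total_cost(24.99, 4.99)                    # 29.98
eur_listing = total_cost(21.50, 6.00, rate_to_usd=1.08)  # ~29.70 at an assumed rate
six_pack = price_per_unit(35.94, 6)                      # 5.99 per unit
```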

Pro Tips for Prompt Excellence

1. Be Explicit About Data Types

Instead of "extract the price", specify:

  • "Extract price as a float"
  • "Extract price as decimal number without currency symbols"
  • "Extract price in cents as integer"
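Even with explicit type instructions, it is worth guarding against a price coming back as marked-up text. A small defensive parser (a hypothetical helper, not part of the SDK) that extracts the first price-like token as both a float and integer cents:

```python
import re

def parse_price(raw):
    """Extract the first price-like token from text as (float, integer cents)."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None
    value = float(match.group().replace(",", ""))
    return value, round(value * 100)

print(parse_price("$1,299.99"))    # (1299.99, 129999)
print(parse_price("Now only 49€")) # (49.0, 4900)
```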

2. Handle Missing Data Gracefully

Always specify what to do when data isn't found:

python
"If review count is not displayed, return 0"
"If SKU is not found, return null"
"If multiple prices shown, extract the lowest"
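The same fallback rules can be applied defensively on your side, so downstream code never meets an unexpected missing key. A sketch, where the field names and fallback values are illustrative:

```python
# Agreed fallbacks for fields the extraction may omit (field names illustrative)
DEFAULTS = {"review_count": 0, "sku": None, "rating": None}

def apply_fallbacks(result, defaults=DEFAULTS):
    """Overlay extracted values on the fallbacks, ignoring explicit nulls."""
    present = {k: v for k, v in result.items() if v is not None}
    return {**defaults, **present}

raw = {"name": "Widget", "price": 19.99, "sku": None}
clean = apply_fallbacks(raw)  # review_count -> 0, sku stays None, name/price kept
```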

3. Use Examples in Complex Cases

python
prompt = """
Extract phone numbers in standardized format.
Examples of input -> output:
- "(555) 123-4567" -> "+1-555-123-4567"
- "555.123.4567" -> "+1-555-123-4567"  
- "Call us at 555-1234" -> "+1-555-555-1234" (assuming local area code)
"""
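The transformation that prompt describes can also be double-checked locally. A US-centric normalizer matching the examples above, where the default-area-code fallback is the same assumption the prompt makes:

```python
import re

def normalize_us_phone(raw, default_area="555"):
    """Normalize US phone strings to +1-XXX-XXX-XXXX (area-code fallback assumed)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 7:                       # local number: assume area code
        digits = default_area + digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # strip leading country code
    if len(digits) != 10:
        return None                            # not a recognizable US number
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```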

4. Specify Priority Order

python
prompt = """
Extract product description with priority:
1. First check 'Product Details' section
2. If not found, check 'Description' tab
3. If still not found, use first paragraph under product title
4. As last resort, use meta description

Limit to 200 characters.
"""
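Once each candidate source has been extracted, the same priority chain is a one-liner to apply in code (the sample strings below are hypothetical):

```python
def first_available(*candidates):
    """Return the first non-empty candidate, mirroring the priority order above."""
    return next((c for c in candidates if c), None)

description = first_available(
    None,                          # 'Product Details' section missing
    "",                            # 'Description' tab empty
    "A rugged widget built for everyday use, with a steel frame.",  # first paragraph
    "Fallback meta description",
)
description = description[:200] if description else None  # enforce the length cap
```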

5. Version Your Prompts

python
PROMPT_VERSIONS = {
    "v1": "Extract product name and price",
    "v2": "Extract product name (string) and price (float)",
    "v3": "Extract product: name (string), price (float), currency (string)",
    "v4": "Extract product: name (string), price (float), currency (ISO code), in_stock (boolean)"
}

# Track which version works best
current_version = "v4"
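A lightweight way to track which version works best is to score each run by how many required fields came back non-null. The scoring rule and field set here are illustrative:

```python
from collections import defaultdict

REQUIRED = {"name", "price", "currency", "in_stock"}  # illustrative field set
scores = defaultdict(list)

def record_run(version, result):
    """Score a run by the fraction of required fields that came back non-null."""
    filled = sum(1 for field in REQUIRED if result.get(field) is not None)
    scores[version].append(filled / len(REQUIRED))

record_run("v4", {"name": "Widget", "price": 19.99,
                  "currency": "USD", "in_stock": True})
record_run("v1", {"name": "Widget", "price": None})

best = max(scores, key=lambda v: sum(scores[v]) / len(scores[v]))  # 'v4'
```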

Common Patterns by Industry

E-commerce

python
"Extract product catalog data: name, price, availability, shipping info, return policy highlights"

Real Estate

python
"Extract property listing: address, price, beds/baths, square footage, lot size, year built, days on market"

Job Listings

python
"Extract job posting: title, company, location, salary range, required skills, experience level, application deadline"

News Articles

python
"Extract article metadata: headline, author, publication date, summary (first paragraph), main topics/tags"

Financial Data

python
"Extract stock information: current price, change (amount and percentage), volume, market cap, P/E ratio"

Troubleshooting Common Issues

Issue: Inconsistent Results

Solution: Add more structure and constraints

python
# Instead of:
"Get product reviews"

# Use:
"Extract up to 5 most recent reviews with: reviewer name, rating (1-5), review date (MM/DD/YYYY), review text (first 100 chars)"

Issue: Too Much Data Returned

Solution: Add filtering criteria

python
"Extract only products that are currently in stock and priced under $100"

Issue: Wrong Data Format

Solution: Provide format examples

python
"Extract date in ISO format (YYYY-MM-DD), price as decimal (e.g., 29.99), phone as E.164 format"

Issue: Missing Important Fields

Solution: Mark required vs optional clearly

python
"Required fields: name, price, SKU. Optional fields: reviews, ratings, manufacturer"

Measuring Prompt Quality

A good prompt should score well on these metrics:

  1. Specificity: Does it clearly define what to extract?
  2. Structure: Does it specify the output format?
  3. Completeness: Does it handle edge cases?
  4. Efficiency: Does it avoid requesting unnecessary data?
  5. Consistency: Does it produce reliable results across different pages?
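The first four metrics can even be linted mechanically before you spend an API call. A rough lexical check, where the keyword lists are illustrative assumptions rather than a real quality model:

```python
# Illustrative keyword lists, one per quality metric (not a real quality model)
CHECKS = {
    "specificity": ("extract", "only", "focus"),
    "structure":   ("json", "list", "field", "format"),
    "types":       ("float", "integer", "boolean", "string", "number"),
    "edge_cases":  ("null", "if not", "if available", "missing"),
}

def lint_prompt(prompt):
    """Flag which quality dimensions a prompt visibly addresses."""
    text = prompt.lower()
    return {name: any(kw in text for kw in kws) for name, kws in CHECKS.items()}

report = lint_prompt(
    "Extract product name (string) and price (float); return null if missing."
)
# report shows this prompt never specifies an output format ('structure': False)
```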

Conclusion

Mastering the art of prompting for ScrapeGraphAI is about finding the perfect balance between specificity and flexibility. The best prompts are:

  • Clear in their objectives
  • Specific about data requirements
  • Structured in their output expectations
  • Contextual about the use case
  • Robust in handling edge cases

Remember, prompt engineering is an iterative process. Start with the playground, test thoroughly, and refine based on results. With practice and these guidelines, you'll be writing prompts that consistently deliver exactly the data you need.

The difference between amateur and professional web scraping often comes down to prompt quality. Invest time in crafting and testing your prompts—your future self (and your data pipeline) will thank you.

Next Steps

  1. Practice in the Playground: Start with simple extractions and gradually increase complexity
  2. Build a Prompt Library: Save successful prompts for reuse
  3. Implement Schemas: Add type safety with Pydantic schemas
  4. Monitor Performance: Track which prompts work best for your use cases
  5. Stay Updated: Follow ScrapeGraphAI updates for new features and capabilities

Happy scraping! 🚀

Want to dive deeper into AI-powered web scraping? Explore the other guides on our blog.