TL;DR
AI web scraping has moved past "can I load the page?" The useful test is whether the tool returns records your product can trust.
- Raw HTML is unfinished work: you still need parsing, cleanup, validation, and retries.
- Markdown helps LLM ingestion: it is useful context, but it is not a database row.
- Browser agents are strongest for workflows: clicks, logins, forms, and multi-step tasks are their lane.
- Structured extraction is the last mile: URL, prompt, schema, JSON, with less glue code in between.
- ScrapeGraphAI is built for usable output: the API is judged by the data contract, not page access alone.
A status code is a terrible finish line. HTTP 200 can still mean your pipeline stored an empty array, missed every listing, or handed the next system a blob of markdown that needs another model call. That is fine for experiments. It is not fine for production data.
This guide looks at the shift from retrieval to structured extraction, then uses two practical tests, Realtor.com listings and Google Search result pages, to show where the difference becomes obvious. The goal is not to crown a universal winner for every scraping task. The goal is narrower and more useful: if your application needs structured JSON, compare tools on the JSON they return.
The Old Scraping Pipeline Had Too Many Places to Break
Classic scraping was a chain of small bets:
URL -> HTTP request -> HTML -> CSS selectors -> parser -> cleanup -> database
That worked when pages were mostly static. A little Python, requests, BeautifulSoup, and a few selectors could extract prices, titles, job posts, or property cards.
The problem is maintenance.
It only takes one class name to change and your scraper is broken. We have seen hundreds of scrapers fail for the same reasons: a site moves the price into a nested component and your extraction returns null, a page becomes JavaScript-rendered and your HTTP request only sees an empty shell, or a target adds anti-bot checks and your scraper starts collecting challenge pages instead of data.
You can add alerts when something breaks. You can keep patching selectors. But you still have to update the code, retest it, and redeploy it. Data gets lost in the gap, and eventually someone is awake at 2 a.m. fixing a scraper that technically returned HTTP 200.
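A minimal sketch of that failure mode, using only the Python standard library. The two HTML snippets and the class names (`price`, `amount-v2`) are made up for illustration; the point is that the parser does not crash when the markup changes, it just returns nothing.

```python
from html.parser import HTMLParser


class PriceByClass(HTMLParser):
    """Collect the text of the first element whose class list contains `target_class`."""

    def __init__(self, target_class: str) -> None:
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside the matching element
        self.price = None   # stays None if the class never matches

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.price is None and self.target_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and self.price is None and data.strip():
            self.price = data.strip()


def extract_price(html: str, target_class: str = "price"):
    parser = PriceByClass(target_class)
    parser.feed(html)
    return parser.price


# Hypothetical markup before and after a front-end redesign.
old_page = '<div class="card"><span class="price">$19.99</span></div>'
new_page = '<div class="card"><span class="amount-v2">$19.99</span></div>'

print(extract_price(old_page))  # $19.99
print(extract_price(new_page))  # None
```

Both calls would sit behind an HTTP 200 in a real pipeline, which is why the second case tends to go unnoticed until someone looks at the stored rows.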
AI Web Scraping Changed the Interface
So how do we fix it?
LLMs give us a more flexible extraction layer: a general function that can take page text in and return structured data out. The simple version looks like this:
// before
html -> selectors -> json
// now
html -> llms -> json
Simple to explain, harder to develop well.
AI scraping changes what you ask for. Instead of telling the scraper where the field lives in the DOM, you tell it what field you want back.
URL + prompt + schema -> structured JSON
For example, a product extraction task should be able to start like this:
{
  "url": "https://example.com/products",
  "prompt": "Extract product names, prices, ratings, and product URLs",
  "schema": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "price": { "type": "string" },
            "rating": { "type": ["string", "null"] },
            "product_url": { "type": "string" }
          }
        }
      }
    }
  }
}
The best systems do more than wrap a browser. They fetch, render when needed, clean noisy content, chunk large pages, extract against the prompt, validate the shape, and return JSON your application can inspect.
That is the line that matters. "AI scraping" is now used for everything from markdown conversion to browser automation. Those are useful, but they are not the same job.
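The "validate the shape" step can be sketched on the consumer side as well. The checker below is a deliberately small, standard-library-only illustration that understands just the schema keywords used in this article (`type`, `properties`, `items`, `required`); it is not a full JSON Schema validator, and a production system would more likely use a dedicated library. The `required` entry in `PRODUCT_SCHEMA` is an assumption added for the example.

```python
def _matches(value, type_name):
    """Check one JSON Schema type name against a Python value."""
    if type_name == "null":
        return value is None
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "object":
        return isinstance(value, dict)
    return False


def check_shape(value, schema, path="$"):
    """Return a list of problems; an empty list means the value fits the schema."""
    allowed = schema.get("type")
    if allowed is not None:
        names = [allowed] if isinstance(allowed, str) else list(allowed)
        if not any(_matches(value, name) for name in names):
            return [f"{path}: expected {names}, got {type(value).__name__}"]

    problems = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required key")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems += check_shape(value[key], sub_schema, f"{path}.{key}")
    elif isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            problems += check_shape(item, schema["items"], f"{path}[{index}]")
    return problems


PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["products"],  # assumption: not part of the request shown above
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string"},
                    "rating": {"type": ["string", "null"]},
                    "product_url": {"type": "string"},
                },
            },
        }
    },
}

good = {"products": [{"name": "A", "price": "$1", "rating": None,
                      "product_url": "https://example.com/a"}]}
bad = {"products": [{"name": "A", "price": 19.99, "rating": None,
                     "product_url": "https://example.com/a"}]}

print(check_shape(good, PRODUCT_SCHEMA))  # []
print(check_shape(bad, PRODUCT_SCHEMA))   # one problem, at $.products[0].price
```

Returning a list of problems with paths, rather than a bare boolean, makes it easier to log which field drifted when a page changes.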
Retrieval Is Not Extraction
Most confusion in this category comes from mixing up two separate tasks.
Retrieval means getting page content:
- HTML
- markdown
- screenshots
- links
- rendered DOM
- search result pages
Extraction means turning page content into typed records:
- product rows
- job listings
- property listings
- company profiles
- article metadata
- lead lists
- pricing tables
Retrieval asks, "Can I access this page?"
Extraction asks, "Can I use this data in my product?"
If the workflow is scrape -> markdown -> LLM -> JSON parser -> retry loop, the scraping layer only solved the first half of the problem. The expensive part is still sitting downstream.
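To make "typed records" concrete, here is one possible sketch of the step between extracted JSON and a database row. It borrows the listing fields used later in this article; the `Listing` class, the `price_usd` field name, and the parsing rules are illustrative assumptions, not part of any provider's API.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Listing:
    title: Optional[str]
    price_usd: Optional[int]
    address_or_area: Optional[str]
    beds: Optional[float]
    baths: Optional[float]
    listing_url: Optional[str]


def _to_number(text: Optional[str]) -> Optional[float]:
    """Pull the first number out of a string such as "$6,998,000" or "2.5"."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def to_listing(raw: dict) -> Listing:
    """Convert one extracted JSON object into a typed record."""
    price = _to_number(raw.get("price"))
    return Listing(
        title=raw.get("title"),
        price_usd=int(price) if price is not None else None,
        address_or_area=raw.get("address_or_area"),
        beds=_to_number(raw.get("beds")),
        baths=_to_number(raw.get("baths")),
        listing_url=raw.get("listing_url"),
    )


record = to_listing({
    "title": "House for sale - New construction",
    "price": "$6,998,000",
    "address_or_area": "45 Montclair Ter, San Francisco, CA 94109",
    "beds": "6",
    "baths": "9",
    "listing_url": None,
})
print(record.price_usd, record.beds, record.baths)  # 6998000 6.0 9.0
```

Fields that cannot be parsed become `None` instead of raising, so a single odd value does not discard the whole row; whether that is the right policy depends on the downstream table.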
Four Tool Categories, One Data Problem
The market looks crowded because several tools solve nearby problems.
Raw scraper APIs
ScrapingBee, ScraperAPI, Bright Data, and similar providers focus on infrastructure. Proxies, browser rendering, geolocation, retries, and unblocking are valuable when your team wants direct control over parsing.
The tradeoff is simple: you still own extraction quality. If the page loads but your parser misses the field, the API did its job and your pipeline still failed.
Markdown and crawl APIs
Firecrawl made a clean developer workflow popular: crawl a site, convert pages to markdown, then feed that content into an AI system.
That is useful for documentation ingestion, RAG, and content collection. But markdown is still an intermediate format. If the final destination is a database, someone still has to turn text into rows.
Browser agents
Browser Use, Anchor Browser, Browserbase, Stagehand, and Playwright-based agents shine when the task needs interaction:
- clicking filters
- logging in
- filling forms
- navigating multi-step flows
- using a web app like a person would
Anchor Browser also supports agentic browser tasks with structured output schemas. That makes it useful when extraction depends on browser state, clicks, or page interaction. For one-page structured extraction, it can still be more machinery than you need. If the output is JSON from a page that can be fetched directly, start by testing an extraction API.
Structured extraction APIs
Structured extraction APIs start from the final deliverable. You provide a URL, a prompt, and a schema. The API returns data in the shape your application expects.
That is the ScrapeGraphAI approach:
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI(api_key="your-api-key")

# One call: target URL, plain-language prompt, and the JSON Schema for the records.
response = sgai.extract(
    url="https://www.realtor.com/realestateandhomes-search/San-Francisco_CA",
    prompt="Extract visible real estate listings with title, price, address, beds, baths, and listing URL.",
    schema={
        "type": "object",
        "properties": {
            "listings": {
                "type": "array",
                "items": {
                    "type": "object",
                    # Every field is nullable so a missing value is explicit.
                    "properties": {
                        "title": {"type": ["string", "null"]},
                        "price": {"type": ["string", "null"]},
                        "address_or_area": {"type": ["string", "null"]},
                        "beds": {"type": ["string", "null"]},
                        "baths": {"type": ["string", "null"]},
                        "listing_url": {"type": ["string", "null"]}
                    }
                }
            }
        }
    }
)

# Print the structured result.
print(response.data.json_data)
Example 1: Realtor.com Real Estate Listings
Real estate pages make a good smoke test because they are messy in normal ways. Listing cards change, prices move around, URLs matter, and the useful answer is not "here is some page text." The useful answer is a set of property records.
We tested this page:
https://www.realtor.com/realestateandhomes-search/San-Francisco_CA
The extraction task was direct:
Extract visible real estate listings from this page. Include title, price, address_or_area, beds, baths, and listing_url. Only include listings actually visible in the page content.
The target output was a normalized listings array:
{
  "listings": [
    {
      "title": "House for sale - New construction",
      "price": "$6,998,000",
      "address_or_area": "45 Montclair Ter, San Francisco, CA 94109",
      "beds": "6",
      "baths": "9",
      "listing_url": "https://www.realtor.com/realestateandhomes-detail/45-Montclair-Ter_San-Francisco_CA_94109_M28358-48059?from=srp_next"
    },
    {
      "title": "Condo for sale - New construction",
      "price": "$291,719",
      "address_or_area": "285 Main St Unit 510, San Francisco, CA 94105",
      "beds": "1",
      "baths": "1",
      "listing_url": "https://www.realtor.com/realestateandhomes-detail/285-Main-St-Unit-510_San-Francisco_CA_94105_M91008-43004?from=srp_next"
    },
    {
      "title": "House for sale",
      "price": "$1,098,000",
      "address_or_area": "789 Arguello Blvd, San Francisco, CA 94118",
      "beds": "3",
      "baths": "1",
      "listing_url": "https://www.realtor.com/realestateandhomes-detail/789-Arguello-Blvd_San-Francisco_CA_94118_M10830-22867?from=srp_next"
    }
  ]
}
The same request can be made from a terminal:
curl -X POST https://v2-api.scrapegraphai.com/api/extract \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA",
    "prompt": "Extract visible real estate listings from this page. Include title, price, address_or_area, beds, baths, and listing_url. Only include listings actually visible in the page content.",
    "schema": {
      "type": "object",
      "required": ["listings"],
      "properties": {
        "listings": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": ["string", "null"] },
              "price": { "type": ["string", "null"] },
              "address_or_area": { "type": ["string", "null"] },
              "beds": { "type": ["string", "null"] },
              "baths": { "type": ["string", "null"] },
              "listing_url": { "type": ["string", "null"] }
            }
          }
        }
      }
    }
  }'
Here is what happened in the final smoke test:
| Provider | HTTP result | Structured output | Time |
|---|---|---|---|
| ScrapeGraphAI | 200 | Passed, returned 6 listings | ~12.6s |
| Firecrawl | 200 | Empty structured output | ~6.4s |
| ScrapingBee generic AI extraction | 500 | Failed before returning structured data | ~72.2s |
| Anchor Browser async browser task | 200 async completion | Returned 9 listings, but listing URLs pointed to Zillow instead of Realtor.com | ~65s after async start |
Anchor Browser did produce structured rows. The issue was source grounding: for this Realtor.com task, the records looked plausible but the listing_url values pointed at Zillow pages. That is different from an empty result, but it is still not the data contract we asked for.
The important bit is not only timing. A page response can be technically successful and still be useless for a product that needs source-grounded listings.
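A source-grounding check like the one this test called for can be sketched in a few lines. It assumes the `listing_url` field from the schema above and treats a record as grounded only when the URL's host is the expected domain or a subdomain of it. The Zillow URL in the example is a made-up placeholder, not a value from the test run.

```python
from urllib.parse import urlparse


def is_grounded(url, expected_domain):
    """True when the URL's host is `expected_domain` or one of its subdomains."""
    if not url:
        return False
    host = (urlparse(url).hostname or "").lower()
    expected = expected_domain.lower()
    return host == expected or host.endswith("." + expected)


def split_by_grounding(listings, expected_domain):
    """Separate records whose listing_url points at the source site from the rest."""
    grounded, rejected = [], []
    for item in listings:
        ok = is_grounded(item.get("listing_url"), expected_domain)
        (grounded if ok else rejected).append(item)
    return grounded, rejected


sample = [
    {"title": "House for sale",
     "listing_url": "https://www.realtor.com/realestateandhomes-detail/789-Arguello-Blvd_San-Francisco_CA_94118_M10830-22867?from=srp_next"},
    {"title": "Plausible but off-source",
     "listing_url": "https://www.zillow.com/homedetails/example"},  # hypothetical URL
    {"title": "No URL at all", "listing_url": None},
]

grounded, rejected = split_by_grounding(sample, "realtor.com")
print(len(grounded), len(rejected))  # 1 2
```

Relative URLs have no host and are rejected by this sketch; if a provider returns relative links, they would need to be resolved against the page URL first.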
Example 2: Google Search Result Pages
Search pages are familiar enough to validate by eye, but still dynamic enough to expose the difference between "fetch content" and "extract records."
We tested generic structured extraction on queries like these:
https://www.google.com/search?q=best%20ai%20lead%20generation%20software
https://www.google.com/search?q=top%20sales%20automation%20tools
https://www.google.com/search?q=best%20customer%20support%20ai%20tools
https://www.google.com/search?q=best%20data%20enrichment%20tools
https://www.google.com/search?q=best%20llm%20observability%20tools
The task was to return visible organic results only:
Extract visible organic search results from this page. Include title, url, snippet, and source/domain. Do not include ads or navigation links.
Here is the request shape:
curl -X POST https://v2-api.scrapegraphai.com/api/extract \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/search?q=top%20sales%20automation%20tools",
    "prompt": "Extract visible organic search results from this page. Include title, url, snippet, and source/domain. Do not include ads or navigation links.",
    "schema": {
      "type": "object",
      "required": ["results"],
      "properties": {
        "results": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": ["string", "null"] },
              "url": { "type": ["string", "null"] },
              "snippet": { "type": ["string", "null"] },
              "source": { "type": ["string", "null"] }
            }
          }
        }
      }
    }
  }'
The pattern repeated across multiple queries:
| Query | ScrapeGraphAI | Firecrawl | ScrapingBee generic AI extraction |
|---|---|---|---|
| best ai lead generation software | 8 results | empty | HTTP 400 |
| top sales automation tools | 8 results | empty | HTTP 400 |
| best customer support ai tools | 8 results | empty | HTTP 400 |
| best data enrichment tools | 8 results | empty | HTTP 400 |
| best llm observability tools | 8 results | empty | HTTP 400 |
ScrapeGraphAI returned records in a shape that could go straight into a database:
{
  "results": [
    {
      "title": "Best AI sales tools to crush outreach in 2026 - HeyReach",
      "url": "https://www.heyreach.io/blog/best-ai-sales-tools",
      "snippet": "Best AI sales tools to crush outreach in 2026 (without crushing your sales team) ...",
      "source": "heyreach.io"
    },
    {
      "title": "Best AI tools you've adopted? (Enterprise SaaS) - Reddit",
      "url": "https://www.reddit.com/r/sales/comments/1i303w3/best_ai_tools_youve_adopted_enterprise_saas/",
      "snippet": "Gong: Excellent for analyzing sales calls and spotting key trends in customer conversations. Mentionlytics: Great for tracking brand sentiment ...",
      "source": "reddit.com"
    }
  ]
}
We also ran Anchor Browser on the "top sales automation tools" query as an async browser task. It completed and returned 13 structured organic results. That is a good result for a browser-agent workflow, but it is a different operating mode from a direct URL-to-schema extraction API.
Small but important note: ScrapingBee has dedicated Google products. This test used its generic AI extraction path, not a dedicated Google Search API. That is exactly the point. A one-call generic structured extraction workflow behaves differently from a provider-specific scraper or an agentic browser workflow.
Empty Output Is a Real Failure
The annoying failures are not always 500s.
Sometimes the response looks successful:
{
  "listings": []
}
If your application treats that as success, you just lost data quietly. Lead lists dry up. Market intelligence dashboards flatten. A real estate tracker thinks there are no listings. Monitoring says everything is fine because the request returned HTTP 200.
Production extraction needs checks that match the business outcome:
- schema validation
- non-empty result checks
- source-grounded fields
- retry behavior
- diagnostics for thin or blocked pages
- predictable output shape
This is where structured extraction earns its keep. The system should optimize for the data contract, not the page response.
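One way to wire the non-empty check and retry behavior from the list above into a pipeline is sketched below. The `fetch` argument is a hypothetical zero-argument callable that returns the parsed JSON body from whichever provider you use; the function names, the three outcome labels, and the backoff values are assumptions for illustration.

```python
import time


def classify(payload, key="listings"):
    """Label a parsed response body by business outcome, not by HTTP status."""
    if not isinstance(payload, dict) or not isinstance(payload.get(key), list):
        return "invalid"
    if not payload[key]:
        return "empty"
    return "ok"


def extract_with_checks(fetch, key="listings", attempts=3, delay_s=0.0):
    """Call `fetch()` until it yields a well-shaped, non-empty payload.

    Returns (payload, outcomes) on success and raises RuntimeError once every
    attempt has come back empty or invalid.
    """
    outcomes = []
    for attempt in range(attempts):
        outcome = classify(fetch(), key) if False else None  # placeholder, replaced below
        break
    outcomes = []
    for attempt in range(attempts):
        payload = fetch()
        outcome = classify(payload, key)
        outcomes.append(outcome)
        if outcome == "ok":
            return payload, outcomes
        if attempt < attempts - 1:
            time.sleep(delay_s * (2 ** attempt))  # simple exponential backoff
    raise RuntimeError(f"extraction failed after {attempts} attempts: {outcomes}")


# Simulated provider: an empty HTTP-200-style body first, then real records.
responses = iter([{"listings": []}, {"listings": [{"title": "House for sale"}]}])
payload, outcomes = extract_with_checks(lambda: next(responses))
print(outcomes)  # ['empty', 'ok']
```

Raising after the final attempt is the important design choice here: an empty array becomes an alert instead of a silently stored result.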
Where Generic Tools Break Down
Access and extraction often get bundled together in marketing copy. In practice, they fail in different places.
| Tool type | What usually works | Where structured extraction breaks |
|---|---|---|
| ScrapeGraphAI | One-call structured extraction from URLs, HTML, or markdown | Built around the final schema and data contract |
| Firecrawl | Crawling, markdown generation, LLM content ingestion | Markdown or empty JSON can still require downstream extraction and validation |
| ScrapingBee | Scraping infrastructure, proxies, JS rendering, difficult fetches | Generic AI extraction is not the same as schema-first extraction |
| Bright Data | Enterprise unblocking, proxy networks, datasets | Infrastructure still needs an extraction layer for custom schemas |
| Browser Use / Anchor Browser / Browserbase | Interactive browser workflows, agentic tasks, and schema-backed browser extraction | Browser control adds moving parts when the deliverable is structured page data |
ScrapeGraphAI is the right fit when the application needs records, not page content. Product catalogs, property listings, job posts, search results, company profiles, pricing pages, lead lists, and news metadata all share the same basic shape: URL in, structured JSON out.
If the next step is a database, CRM, monitoring workflow, or data product, the extraction layer should be first-class. Otherwise you are just moving the brittle part later in the pipeline.
What This Means for Production Scraping
The old question was:
Can I get the HTML?
The better question is:
Can I get reliable structured data from this page?
Raw HTML, markdown, screenshots, and rendered DOM are useful intermediate formats. Your application still wants records: titles, prices, URLs, snippets, addresses, dates, names, and IDs.
That is the practical shift in AI web scraping. Not more page access for its own sake. Better extraction, tighter schemas, clearer failure modes, and output that can move directly into production systems.
ScrapeGraphAI is built around that final output. You describe the data, provide a schema, and get JSON back. That is what AI web scraping should mean in 2026.
Related Articles
- Scraper API: The Definitive 2026 Comparison of Web Scraping API Services - A broader look at scraper APIs, pricing, code examples, and where structured extraction fits.
- ScrapeGraphAI vs Firecrawl: Which AI Scraper Wins in 2026? - A direct comparison for teams choosing between markdown-first and extraction-first workflows.
- Why AI Web Scraping Beats Search APIs for Data - A deeper argument for why structured web data beats search snippets in production systems.
- Real Estate Scraping: Listings, Prices & Trends - A practical real estate scraping guide that expands on the property listing example.
- Web Scraping with Pydantic: Structured Data Guide - A schema-focused tutorial for making scraped data safer and easier to validate.