TL;DR
ScrapeGraphAI uses short API verbs. Use scrape when you have a URL and want Markdown, HTML, links, images, screenshots, summaries, branding, or JSON in one response. Use extract when you want structured JSON from a URL, HTML, or Markdown with a prompt and optional schema. Use search when the workflow starts with a web query.
- sgai.scrape returns Markdown, HTML, links, images, summaries, screenshots, branding, or JSON.
- sgai.extract is the direct path for prompt-driven structured JSON.
- sgai.search finds pages and can extract structured data from the results.
- fetchConfig controls rendering, stealth, cookies, headers, waiting, scrolling, timeout, and country routing.
- Async clients are available in Python for long-running or parallel workloads.
This guide keeps the slug because many pages already link here, but the API names below are the current ones.
Current API Names
ScrapeGraphAI groups web data workflows around single verbs.
| Service | Use it when | Docs |
|---|---|---|
| scrape | You have a URL and need page formats like Markdown, HTML, links, images, screenshots, summaries, branding, or JSON. | Scrape docs |
| extract | You need structured JSON from a URL, HTML, or Markdown using a prompt and optional schema. | Extract docs |
| search | You need to run a web query, fetch result pages, and optionally extract structured JSON from them. | Search docs |
The main split is simple:
- Use scrape when the input URL is known and you care about page formats.
- Use extract when the output must be typed JSON.
- Use search when ScrapeGraphAI needs to discover the source pages first.
For the broader product context, read ScrapeGraphAI V2: Better, Faster and Cheaper.
Setup
Install the Python SDK:
pip install scrapegraph-py
Set your API key:
export SGAI_API_KEY="your-api-key"
Then initialize the client:
from scrapegraph_py import ScrapeGraphAI
sgai = ScrapeGraphAI()
For JavaScript, install scrapegraph-js and use Node 22 or newer:
bun add scrapegraph-js
import { ScrapeGraphAI } from "scrapegraph-js";
const sgai = ScrapeGraphAI();
The REST API base is:
https://v2-api.scrapegraphai.com/api
Use scrape For Markdown, HTML, Links, Screenshots, And More
The scrape service fetches a known URL and returns one or more formats at the same time.
The request fields are:
| Field | Required | Meaning |
|---|---|---|
| url | Yes | The page to scrape. |
| formats | Yes | An array of output format objects. |
| contentType | No | Override the detected content type. |
| fetchConfig | No | Fetch options such as render mode, stealth, headers, cookies, wait, scrolls, timeout, and country. |
Convert a page to Markdown:
from scrapegraph_py import MarkdownFormatConfig
page = sgai.scrape(
    "https://example.com",
    formats=[MarkdownFormatConfig()],
)
if page.status == "success":
    markdown = page.data.results.get("markdown", {}).get("data", [])
    print(markdown[0] if markdown else "")
else:
    print(page.error)
The same request in JavaScript:
const page = await sgai.scrape({
  url: "https://example.com",
  formats: [{ type: "markdown" }],
});
if (page.status === "success") {
  console.log(page.data?.results.markdown?.data?.[0]);
} else {
  console.error(page.error);
}
And with cURL:
curl -X POST https://v2-api.scrapegraphai.com/api/scrape \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": [{ "type": "markdown" }]
  }'
The key improvement is the formats array. You can request Markdown, links, and a screenshot without scraping the same page three times:
const bundle = await sgai.scrape({
  url: "https://example.com",
  formats: [
    { type: "markdown", mode: "reader" },
    { type: "links" },
    { type: "screenshot", fullPage: false, width: 1280, height: 720 },
  ],
});
Common formats:
| Format | Use it for |
|---|---|
| markdown | LLM context, RAG pipelines, summaries, and clean text storage. |
| html | Parser workflows that need markup. |
| links | Discovery, crawl planning, and site audits. |
| images | Media extraction. |
| summary | Quick page understanding. |
| json | Structured extraction inside a scrape call. |
| screenshot | Visual capture and QA. |
| branding | Brand colors, typography, and logos. |
If you need only clean page text, scrape with markdown is the right call. If you need fields like price, title, author, company, rating, or availability, use extract or the json format.
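For readers calling the REST API without an SDK, the cURL request above can be sketched in Python with the standard library. This is a minimal sketch under stated assumptions: the base URL, the /scrape path, the SGAI-APIKEY header, and the body fields come from the cURL example in this guide, while the helper name build_scrape_request is hypothetical. The request is only built, not sent.

```python
import json
import os
import urllib.request

# Base URL as quoted in this guide's Setup section.
API_BASE = "https://v2-api.scrapegraphai.com/api"


def build_scrape_request(url: str, formats: list, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /scrape request asking for several formats.

    Hypothetical helper: it mirrors the cURL example, nothing more.
    """
    body = json.dumps({"url": url, "formats": formats}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/scrape",
        data=body,
        headers={"SGAI-APIKEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request(
    "https://example.com",
    [{"type": "markdown"}, {"type": "links"}],  # one request, two formats
    api_key=os.environ.get("SGAI_API_KEY", "your-api-key"),
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the call; that step is left out so the sketch runs without a key or network access.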
Use extract For Structured JSON
The extract service uses an LLM to pull structured data from a URL, raw HTML, or Markdown. Give it a prompt and, when the result must be stable, a JSON schema.
Simple extraction:
extraction = sgai.extract(
    "Extract the company name, one-sentence description, and main call to action.",
    url="https://scrapegraphai.com",
)
if extraction.status == "success":
    print(extraction.data.json_data)
else:
    print(extraction.error)
Schema-backed extraction:
from pydantic import BaseModel, Field
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str | None = Field(default=None, description="Listed price")
class Products(BaseModel):
    products: list[Product] = Field(default_factory=list)
products_result = sgai.extract(
    "Extract product names and prices",
    url="https://example.com/products",
    schema=Products.model_json_schema(),
)
if products_result.status == "success":
    parsed = Products.model_validate(products_result.data.json_data)
    print(parsed.products)
JavaScript:
const extraction = await sgai.extract({
  url: "https://example.com/products",
  prompt: "Extract product names and prices",
  schema: {
    type: "object",
    properties: {
      products: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            price: { type: "string" },
          },
          required: ["name"],
        },
      },
    },
  },
});
if (extraction.status === "success") {
  console.log(extraction.data?.json);
}
Use search When You Need Discovery First
The search service starts with a query instead of a URL. It returns search results with fetched content. Add a prompt and schema when you want the results summarized into structured JSON.
Basic search:
results = sgai.search(
    "AI web scraping API benchmarks 2026",
    num_results=3,
)
if results.status == "success":
    for item in results.data.results:
        print(item.title, item.url)
Search with extraction:
research = sgai.search(
    "browser automation API pricing pages",
    num_results=5,
    prompt="Return company name, pricing URL, and cheapest paid plan",
    schema={
        "type": "object",
        "properties": {
            "companies": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "pricingUrl": {"type": "string"},
                        "cheapestPlan": {"type": "string"},
                    },
                    "required": ["name", "pricingUrl"],
                },
            },
        },
    },
)
if research.status == "success":
    print(research.data.json_data)
JavaScript:
const research = await sgai.search({
  query: "browser automation API pricing pages",
  numResults: 5,
  prompt: "Return company name, pricing URL, and cheapest paid plan",
});
if (research.status === "success") {
  console.log(research.data?.json ?? research.data?.results);
}
Search accepts recency and location controls, which makes it useful for market research, competitive monitoring, and trend tracking. If your source URLs are already known, use scrape or extract instead.
FetchConfig: Rendering, Stealth, And Page Control
scrape, extract, and search can use fetchConfig when the page needs more than a default fetch.
const rendered = await sgai.scrape({
  url: "https://example.com",
  formats: [{ type: "markdown" }],
  fetchConfig: {
    mode: "js",
    stealth: true,
    wait: 2000,
    scrolls: 3,
    country: "us",
  },
});
Use it when:
- The page is JavaScript-rendered.
- Content appears after scrolling.
- You need a short wait after page load.
- The target requires a country-specific route.
- The site blocks simple fetches and needs stealth mode.
Keep the first request as small as possible. Add JavaScript rendering, stealth, scrolling, and screenshots only when the target needs them. That keeps latency and credits predictable.
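One way to apply the "smallest request first" advice is an escalation ladder: try a default fetch, then add one control at a time until the page looks complete. The sketch below is illustrative only. The option keys mirror the JavaScript fetchConfig example above and may be spelled differently in the Python SDK, and fake_fetch is a hypothetical stand-in for a real scrape call.

```python
from typing import Callable, Optional

# Ordered from cheapest to most expensive. Keys are assumptions copied from
# the JavaScript fetchConfig example, not a confirmed Python SDK surface.
FETCH_LADDER = [
    None,                                                   # default fetch
    {"mode": "js"},                                         # JavaScript rendering
    {"mode": "js", "wait": 2000, "scrolls": 3},             # plus wait and scrolling
    {"mode": "js", "wait": 2000, "scrolls": 3, "stealth": True},
]


def scrape_with_escalation(
    fetch: Callable[[str, Optional[dict]], str],
    url: str,
    is_complete: Callable[[str], bool],
):
    """Return (text, config) for the first rung whose output passes is_complete."""
    for config in FETCH_LADDER:
        text = fetch(url, config)
        if is_complete(text):
            return text, config
    return None, None  # every rung failed; the caller decides what to do next


def fake_fetch(url: str, config: Optional[dict]) -> str:
    # Hypothetical target: content only appears once JavaScript is rendered.
    return "full article text" if config and config.get("mode") == "js" else ""


text, used = scrape_with_escalation(fake_fetch, "https://example.com", lambda t: len(t) > 0)
print(used)
```

Recording which rung succeeded per domain lets later requests start at that rung instead of repeating the failed ones.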
Response Shapes
Every SDK call returns a status wrapper. Do not assume the payload exists before checking status. For Python, successful calls put service-specific data under result.data; failed calls put the error message under result.error. JavaScript follows the same pattern with status, data, error, and elapsedMs.
scrape returns a results object keyed by format. If you requested markdown, read results["markdown"]["data"] in Python or results.markdown.data in JavaScript. If you requested several formats, read each format key separately. This keeps a multi-format request explicit: the Markdown block, link list, screenshot URL, and JSON extraction are siblings, not one merged blob.
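The format-keyed layout described above can be read defensively with a small helper. This is a sketch that works on a plain dictionary shaped like results[format]["data"]; the helper name first_item and the sample payload are invented for illustration, and the exact item shape for formats other than markdown is an assumption.

```python
def first_item(results: dict, fmt: str, default=None):
    """Return the first entry of results[fmt]["data"], or default if it is missing."""
    data = (results.get(fmt) or {}).get("data") or []
    return data[0] if data else default


# Invented sample: only the results[format]["data"] nesting comes from this guide.
sample_results = {
    "markdown": {"data": ["# Example Domain"]},
    "links": {"data": ["https://www.iana.org/domains/example"]},
}

print(first_item(sample_results, "markdown"))
print(first_item(sample_results, "screenshot", default="(not requested)"))
```

Reading each format through the same helper keeps a missing or empty format from raising in the middle of a pipeline.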
extract returns JSON output as the primary result. In Python examples, that is data.json_data; in JavaScript, it is data?.json. When you pass a schema, still validate the returned object in your application layer. Pages are external input, and validation is where you decide whether to retry, drop a field, or send the row to review.
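As a rough illustration of that application-layer check, the sketch below triages an extracted payload shaped like the products schema used earlier (a required name, an optional string price). It uses only the standard library; triage_products and the sample payload are hypothetical, and a Pydantic model would serve the same purpose.

```python
def triage_products(payload):
    """Split extracted rows into rows safe to store and rows that need review."""
    accepted, review = [], []
    products = payload.get("products") if isinstance(payload, dict) else None
    for row in products if isinstance(products, list) else []:
        name = row.get("name") if isinstance(row, dict) else None
        if not isinstance(name, str) or not name.strip():
            review.append(row)  # required field missing or empty: send to review
            continue
        price = row.get("price")
        # Optional field with the wrong type: drop the field, keep the row.
        accepted.append({"name": name.strip(), "price": price if isinstance(price, str) else None})
    return accepted, review


# Invented sample payload for illustration.
sample = {
    "products": [
        {"name": "Widget", "price": "$9"},
        {"name": "", "price": "$1"},
        {"name": "Gadget", "price": 12},
    ]
}
accepted, review = triage_products(sample)
print(accepted)
print(review)
```

An empty accepted list on a page that should contain products is a reasonable signal to retry the call rather than store nothing.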
search returns ranked web results. Without a prompt, use the result list for titles and URLs. With a prompt and schema, use the structured JSON output for the synthesized answer and keep the source URLs for auditability.
Which API Should You Choose?
| Goal | API | Example |
|---|---|---|
| Convert one URL to Markdown | scrape | formats: [{ type: "markdown" }] |
| Get Markdown plus links | scrape | formats: [{ type: "markdown" }, { type: "links" }] |
| Capture a screenshot | scrape | formats: [{ type: "screenshot" }] |
| Extract product data | extract | Prompt plus JSON schema. |
| Extract structured fields while also saving Markdown | scrape | Include markdown and json formats. |
| Find relevant pages, then extract | search | Query plus prompt and schema. |
| Process many known pages | scrape or extract with async clients | Use parallelism and store results incrementally. |
The practical rule: if the output is a page format, reach for scrape. If the output is a business object, reach for extract. If the input is not known yet, reach for search.
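The practical rule can be written down as a tiny routing function, which is handy when a pipeline receives mixed jobs. This is a sketch only; choose_service and its two flags are hypothetical names, not part of any SDK.

```python
def choose_service(has_url: bool, wants_structured: bool) -> str:
    """Map a job description to the service name suggested by the practical rule."""
    if not has_url:
        return "search"  # the input is not known yet: discover pages first
    # Known URL: business objects go to extract, page formats go to scrape.
    return "extract" if wants_structured else "scrape"


print(choose_service(has_url=True, wants_structured=False))
print(choose_service(has_url=True, wants_structured=True))
print(choose_service(has_url=False, wants_structured=True))
```

The function deliberately ignores the case from the table where one scrape call returns both markdown and json formats; a real router would need a third flag for that.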
Common Implementation Mistakes
The most common mistake is using extract when the only thing you need is clean Markdown. That costs more and adds an LLM step you do not need. Use scrape with formats: [{ type: "markdown" }] for RAG ingestion, article archiving, documentation importers, and simple page summaries.
Another mistake is making several calls to the same URL because the pipeline needs multiple assets. If you need Markdown, links, and a screenshot, put those format entries in one scrape request. That keeps the fetch behavior consistent and gives you one request ID to debug.
Avoid sending vague prompts to extract. "Extract product data" is weaker than "Extract every product with name, listed price, currency, availability, and product URL." Field names should match the object you want to store. If the output feeds a database or queue, pass a schema and validate it after the call.
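One way to keep field names, prompt, and schema aligned is to derive all three from a single field list. The sketch below assumes string-typed fields and a plain JSON Schema dictionary like the ones shown earlier; FIELDS, build_prompt, and build_schema are hypothetical names.

```python
# Storage field name -> how the field is described in the prompt.
FIELDS = {
    "name": "name",
    "listed_price": "listed price",
    "currency": "currency",
    "availability": "availability",
    "product_url": "product URL",
}


def build_prompt(entity: str, fields: dict) -> str:
    """Spell out every field so the prompt is as specific as the schema."""
    labels = list(fields.values())
    if len(labels) <= 2:
        listing = " and ".join(labels)
    else:
        listing = ", ".join(labels[:-1]) + f", and {labels[-1]}"
    return f"Extract every {entity} with {listing}."


def build_schema(collection: str, fields: dict, required: list) -> dict:
    """Build a JSON Schema whose property names match the storage field names."""
    item = {
        "type": "object",
        "properties": {key: {"type": "string", "description": label} for key, label in fields.items()},
        "required": required,
    }
    return {"type": "object", "properties": {collection: {"type": "array", "items": item}}}


prompt = build_prompt("product", FIELDS)
schema = build_schema("products", FIELDS, required=["name"])
print(prompt)
```

Because both outputs come from FIELDS, adding or renaming a column changes the prompt and the schema together.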
Do not turn on JavaScript rendering, stealth mode, long waits, and scrolling by default. Those controls are useful, but they should be target-specific. Start with a normal fetch, inspect failed or incomplete pages, then add the smallest fetchConfig change that fixes the source.
For search, keep discovery and extraction goals separate in your prompt. First decide what pages should be found. Then describe what should be extracted from those pages. Mixing both into one vague sentence creates noisy results and makes retries harder to reason about.
Production Notes
Validate responses before writing to storage. Even schema-backed extraction should be checked at your boundary because scraped pages are third-party input.
Store the request ID, URL, prompt, schema version, status, and elapsed time with every result. That makes failed extractions easier to replay and lets you compare output quality when prompts change.
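A minimal sketch of such an audit record, using only the standard library. The record fields follow the list above; deriving the schema version from a hash of the schema is one possible convention rather than anything the API prescribes, and where the request ID comes from in the response is left to the caller.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ExtractionRecord:
    request_id: str       # however your client exposes it (assumption)
    url: str
    prompt: str
    schema_version: str
    status: str
    elapsed_ms: int


def schema_fingerprint(schema: dict) -> str:
    """Stable short hash of a schema, independent of key order."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def to_jsonl(record: ExtractionRecord) -> str:
    """Serialize one record as a single JSON line for append-only logs."""
    return json.dumps(asdict(record), sort_keys=True)


record = ExtractionRecord(
    request_id="req-123",  # placeholder value for illustration
    url="https://example.com/products",
    prompt="Extract product names and prices",
    schema_version=schema_fingerprint({"type": "object", "properties": {}}),
    status="success",
    elapsed_ms=1840,
)
print(to_jsonl(record))
```

Appending one line per call to a log file or table makes it straightforward to replay failures and to group results by prompt or schema version.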
Use async calls for large batches, but cap concurrency around your plan limits and target-site tolerance. High parallelism is not automatically better when targets throttle or render slowly.
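Capping concurrency can be sketched with asyncio.Semaphore. The example below does not call the SDK: fake_worker is a hypothetical coroutine standing in for an async scrape or extract call, and run_capped is an illustrative helper name.

```python
import asyncio

stats = {"active": 0, "peak": 0}  # tracks how many workers run at once


async def fake_worker(url: str) -> str:
    # Hypothetical stand-in for an async scrape or extract call.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return url.upper()


async def run_capped(urls, worker, max_concurrency: int = 5):
    """Run worker over urls with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            return await worker(url)

    # return_exceptions=True keeps one failed URL from cancelling the batch.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


urls = [f"https://example.com/{i}" for i in range(10)]
results = asyncio.run(run_capped(urls, fake_worker, max_concurrency=3))
print(len(results), "results, peak concurrency", stats["peak"])
```

Results come back in input order, so each one can be matched to its URL and stored incrementally; exceptions appear in the list as values and can be retried separately.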
For a broader buying view, read Web Scraping API: How to Choose One in 2026 and the scraper API comparison.