
ScrapeGraphAI + Cognee: Turn Live Web Data Into a Knowledge Graph



Written by Marco Vinciguerra

Standard RAG pipelines break on synthesis queries. When the answer requires correlating information across dozens of pages, or tracking how something changed over time, top-k retrieval over a vector index gives you chunks, not reasoning. The LLM fills the gaps with hallucinations.

The root cause is architectural: vector stores treat every document as an isolated bag of embeddings. There's no structure, no entity resolution, no relationship tracking. You can retrieve similar text, but you can't traverse a graph of what those texts actually mean.

Cognee fixes this by running entity extraction and relationship detection over your ingested content, building a property graph alongside the vector index, and combining both at query time. The result is a retrieval system that understands what the data says, not just how similar it is to your query.

ScrapeGraphAI fixes the data ingestion side by getting clean, structured content out of real websites without writing or maintaining scrapers.

This tutorial wires them together.


What you'll build

By the end of this tutorial you'll have a Python pipeline that:

  1. Takes a list of URLs
  2. Scrapes them with ScrapeGraphAI (JavaScript rendering, proxy handling, AI-powered extraction)
  3. Feeds the content into Cognee, which extracts entities and relationships and builds a knowledge graph
  4. Lets you ask natural language questions that synthesize information across all the scraped sources

Here's how the pieces connect:

URLs
 │
 ▼
ScrapeGraphAI SmartScraper
  ├─ renders JavaScript
  ├─ bypasses bot protection
  └─ extracts clean content via AI prompt
 │
 ▼
cognee.add()  ──→  raw text with source attribution
 │
 ▼
cognee.cognify()
  ├─ entity extraction (LLM)
  ├─ relationship detection
  ├─ vector embeddings
  └─ knowledge graph construction
 │
 ▼
cognee.search("your question")
  └─ graph traversal + LLM completion → grounded answer

The difference from standard RAG

Most RAG pipelines work like a library where someone threw all the books in a pile. You search by embedding similarity, grab the top-k chunks, and hope the answer is in there.

Cognee works more like an actual librarian who read every book, took notes on who did what, cross-referenced the characters, and can answer "how does Cognee compare to a plain vector store?" by actually reasoning over what it knows, not by finding the closest matching paragraph.

This is what makes Cognee a knowledge engine rather than another retrieval layer. It builds persistent, structured memory that your agents can reason over across sessions — not a flat index they search and forget. The graph it constructs isn't a cache of your documents; it's a living representation of what those documents mean and how the facts inside them connect.

Technically: Cognee runs entity extraction and relationship detection over your ingested text, builds a property graph, stores embeddings alongside it, and when you query it, it combines graph traversal with LLM completion. The answer is grounded in structured knowledge, not retrieved text fragments.


What ScrapeGraphAI contributes

Cognee can ingest text you hand it, but getting that text from real websites is not trivial. Sites have bot protection. They use JavaScript to render content. The HTML is full of noise: navigation menus, ads, footers, cookie banners.

ScrapeGraphAI handles all of that. You give it a URL and a natural language prompt describing what you want. It renders the page, runs the extraction, and gives you back structured, clean content. No CSS selectors, no maintenance when the site redesigns.

The integration packages the two together so you don't have to think about the handoff.


Setup

Requirements:

  • Python 3.10–3.13
  • ScrapeGraphAI API key → dashboard.scrapegraphai.com
  • An LLM API key — Cognee defaults to OpenAI for entity extraction and embeddings, but supports Anthropic, Google Gemini, Mistral, Ollama, AWS Bedrock, LLaMA.cpp, and any OpenAI-compatible endpoint (e.g. OpenRouter, vLLM). Set LLM_PROVIDER and EMBEDDING_PROVIDER to switch.
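
Switching providers is a matter of environment variables. A hypothetical example pointing Cognee at a local Ollama instance — LLM_PROVIDER and EMBEDDING_PROVIDER come from the note above, while the LLM_MODEL / EMBEDDING_MODEL variable names and model names are assumptions to confirm against the Cognee configuration docs for your version:

```shell
# Hypothetical provider switch: local Ollama instead of OpenAI.
# LLM_PROVIDER / EMBEDDING_PROVIDER are named above; the companion
# model variables and model names below are placeholders.
export LLM_PROVIDER="ollama"
export LLM_MODEL="llama3.1"
export EMBEDDING_PROVIDER="ollama"
export EMBEDDING_MODEL="nomic-embed-text"
```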

Install both packages:

pip install cognee-community-tasks-scrapegraph cognee

Or with uv (faster, cleaner):

uv pip install cognee-community-tasks-scrapegraph cognee

Export your keys:

export SGAI_API_KEY="your-scrapegraphai-api-key"
export LLM_API_KEY="your-openai-api-key"

The integration package lives in the cognee-community repository — a community-managed collection of plugins, adapters, and custom pipelines for Cognee. The cognee-community-tasks-scrapegraph package exposes two functions:

  • scrape_urls: scrapes a list of URLs and returns the extracted content
  • scrape_and_add: scrapes, ingests into Cognee, and runs cognify() in one call

Step 1: Scrape and inspect

Before building the full pipeline, it's worth looking at what ScrapeGraphAI actually extracts. scrape_urls gives you that directly:

import asyncio
import os
from cognee_community_tasks_scrapegraph import scrape_urls
 
os.environ["SGAI_API_KEY"] = "your-sgai-api-key"
 
async def main():
    results = await scrape_urls(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
    )
 
    for item in results:
        if item.get("error"):
            print(f"[!] {item['url']}: {item['error']}")
        else:
            print(f"\n=== {item['url']} ===")
            print(str(item["content"])[:500])
 
asyncio.run(main())

Sample output:

=== https://cognee.ai ===
Cognee is an open-source library for AI memory management. It converts raw
text, documents, and web pages into persistent knowledge graphs that AI agents
can query semantically. Core workflow: add data with cognee.add(), process
with cognee.cognify(), query with cognee.search(). Designed for agents that
need to reason across large, heterogeneous datasets rather than just retrieve
similar chunks...

=== https://docs.cognee.ai ===
Getting started requires three steps. First, cognee.add() ingests your data
into a named dataset. cognee.cognify() then processes the dataset: it runs
entity extraction via LLM, detects relationships between entities, generates
vector embeddings, and constructs a property graph. Finally, cognee.search()
queries the graph using hybrid graph traversal and vector similarity...

The user_prompt is the key lever here. It's sent to ScrapeGraphAI's SmartScraper and tells the AI what to focus on when extracting. Write it like you're describing what you want from the page to a smart research assistant.


Step 2: Build the knowledge graph

Now the full pipeline. scrape_and_add handles everything: scrape → ingest → cognify.

import asyncio
import os
import cognee
from cognee_community_tasks_scrapegraph import scrape_and_add
 
os.environ["SGAI_API_KEY"] = "your-sgai-api-key"
os.environ["LLM_API_KEY"] = "your-openai-api-key"
 
async def main():
    # Clear any previous state — important during development
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
 
    await scrape_and_add(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
            "https://github.com/topoteretes/cognee",
            "https://python.langchain.com/docs/introduction/",
            "https://docs.llamaindex.ai/en/stable/",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
        dataset_name="cognee_research",
    )
 
    # The graph is ready. Ask anything.
    results = await cognee.search("What is Cognee and what problems does it solve?")
    for r in results:
        print(r)
 
asyncio.run(main())

The prune calls reset the local database. Skip them if you're building incrementally on top of an existing graph.

After cognify() runs, Cognee has extracted entities (products, features, concepts, people) and the relationships between them. The search() call traverses that graph and generates a grounded answer: a synthesized response built from the graph, not a retrieved chunk.


Step 3: Structured extraction with Pydantic

When you know exactly what you want from each page, pass a Pydantic schema to ScrapeGraphAI directly. This gives Cognee more consistent, well-structured input and generally produces a better graph.

import asyncio
import os
import cognee
from pydantic import BaseModel, Field
from scrapegraph_py import Client
 
os.environ["SGAI_API_KEY"] = "your-sgai-api-key"
os.environ["LLM_API_KEY"] = "your-openai-api-key"
 
class ProductPage(BaseModel):
    name: str = Field(description="Product or company name")
    tagline: str = Field(description="One-line value proposition")
    features: list[str] = Field(description="List of key features or capabilities")
    pricing_model: str = Field(description="How the product is priced: per-seat, usage-based, flat rate, etc.")
    target_audience: str = Field(description="Who the product is primarily aimed at")
 
async def scrape_structured(urls: list[str]) -> list[str]:
    # The synchronous Client blocks the event loop; that's acceptable for
    # a handful of URLs. Use AsyncClient for larger batches.
    client = Client(api_key=os.environ["SGAI_API_KEY"])
    texts = []
    try:
        for url in urls:
            response = client.smartscraper(
                website_url=url,
                user_prompt="Extract product information",
                output_schema=ProductPage,
            )
            result = response.get("result", {})
            # Format as structured text for Cognee
            text = f"""Source: {url}
Product: {result.get('name', 'N/A')}
Tagline: {result.get('tagline', 'N/A')}
Features: {', '.join(result.get('features', []))}
Pricing: {result.get('pricing_model', 'N/A')}
Audience: {result.get('target_audience', 'N/A')}"""
            texts.append(text)
    finally:
        client.close()
    return texts
 
async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
 
    urls = [
        "https://cognee.ai",
        "https://scrapegraphai.com",
        "https://firecrawl.dev",
    ]
 
    print("Scraping with structured schema...")
    texts = await scrape_structured(urls)
 
    combined = "\n\n".join(texts)
    await cognee.add(combined, dataset_name="product_landscape")
    await cognee.cognify()
 
    results = await cognee.search("How do these products differ in their approach to data extraction?")
    for r in results:
        print(r)
 
asyncio.run(main())

The structured output is more predictable and easier for Cognee to work with than raw paragraph text. If you know your schema upfront, this approach is worth the extra code.


Step 4: Competitive intelligence pipeline

Here's the use case that keeps coming up: monitoring a competitive landscape by scraping multiple product and pricing pages and then querying across them.

import asyncio
import os
import cognee
from cognee_community_tasks_scrapegraph import scrape_and_add
 
os.environ["SGAI_API_KEY"] = "your-sgai-api-key"
os.environ["LLM_API_KEY"] = "your-openai-api-key"
 
COMPETITORS = [
    "https://scrapegraphai.com/pricing",
    "https://apify.com/pricing",
    "https://brightdata.com/pricing",
    "https://firecrawl.dev/#pricing",
]
 
QUESTIONS = [
    "What pricing tiers does each product offer and what are their limits?",
    "Which products have free tiers and what are the restrictions?",
    "Which product is best positioned for enterprise use cases?",
    "How do these products differ in their scraping approach?",
]
 
async def build_intel():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
 
    print(f"Scraping {len(COMPETITORS)} competitor pages...")
    await scrape_and_add(
        urls=COMPETITORS,
        user_prompt=(
            "Extract all pricing plans with their names, prices, included features, "
            "usage limits, and target customer type. Also extract the product's main "
            "value proposition and any enterprise or custom plan details."
        ),
        dataset_name="competitive_intel",
    )
 
    print("\nKnowledge graph ready. Querying...\n")
    for question in QUESTIONS:
        print(f"Q: {question}")
        results = await cognee.search(question, datasets=["competitive_intel"])
        for r in results:
            print(f"   {r}")
        print()
 
asyncio.run(build_intel())

The datasets parameter in cognee.search() scopes the query to a specific dataset. Use it when you have multiple knowledge graphs running in parallel for different projects.


Step 5: News monitoring with incremental updates

The real power of keeping a knowledge graph is that it gets better over time. This pattern scrapes a set of news or blog sources daily and adds to the existing graph rather than rebuilding from scratch:

import asyncio
import os
import cognee
from cognee_community_tasks_scrapegraph import scrape_and_add
from datetime import date
 
os.environ["SGAI_API_KEY"] = "your-sgai-api-key"
os.environ["LLM_API_KEY"] = "your-openai-api-key"
 
NEWS_SOURCES = [
    "https://techcrunch.com/category/artificial-intelligence/",
    "https://venturebeat.com/category/ai/",
    "https://www.thenewstack.io/category/machine-learning/",
]
 
async def daily_update():
    dataset = "ai_news"
    today = str(date.today())
 
    # Don't prune — add to the existing graph
    await scrape_and_add(
        urls=NEWS_SOURCES,
        user_prompt=(
            f"Extract all article headlines, brief summaries, publication dates, "
            f"and the main companies or technologies mentioned. Today is {today}."
        ),
        dataset_name=dataset,
    )
 
    print(f"Updated {dataset} for {today}")
 
    # Now query across everything accumulated
    results = await cognee.search(
        "What are the biggest AI industry trends in the last 7 days?",
        datasets=[dataset],
    )
    for r in results:
        print(r)
 
asyncio.run(daily_update())

Run this on a cron job and over time you build a knowledge base that understands how the space is evolving, not just what was written on any single day.
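
A minimal crontab entry for that schedule might look like the following — the path, time, and script name are placeholders for your environment:

```shell
# Run the news scrape every day at 06:00, appending logs for inspection.
# Assumes daily_update() is invoked from daily_update.py in /opt/ai-news.
0 6 * * * cd /opt/ai-news && /usr/bin/env python daily_update.py >> daily_update.log 2>&1
```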


Search modes

Cognee exposes 14 SearchType options. Here are the ones most relevant to this workflow:

from cognee import SearchType
 
# GRAPH_COMPLETION (default)
# Traverses the knowledge graph, then generates an answer via LLM.
# Best for: synthesis questions, "how does X relate to Y", comparisons
results = await cognee.search(
    "What are the key differences between these products?",
    query_type=SearchType.GRAPH_COMPLETION,
)
 
# CHUNKS
# Returns raw text segments from the original ingested content.
# Best for: verifying what was scraped, citation, debugging
chunks = await cognee.search(
    "enterprise pricing",
    query_type=SearchType.CHUNKS,
)
 
# SUMMARIES
# Returns pre-generated document summaries.
# Best for: quick overviews, faster responses when depth isn't needed
summaries = await cognee.search(
    "product overview",
    query_type=SearchType.SUMMARIES,
)
 
# TRIPLET_COMPLETION
# Uses extracted (subject, predicate, object) triples for retrieval.
# Best for: factual lookups, relationship queries
triplets = await cognee.search(
    "What does Cognee integrate with?",
    query_type=SearchType.TRIPLET_COMPLETION,
)

GRAPH_COMPLETION is the right default for most use cases. Use CHUNKS when you need to trace exactly where an answer came from.

The search() function also accepts useful parameters beyond query_type:

results = await cognee.search(
    "What are the main AI trends?",
    query_type=SearchType.GRAPH_COMPLETION,
    datasets=["ai_news"],        # scope to specific datasets
    top_k=20,                    # number of results (default: 10)
    session_id="my-session",     # persist Q&A for conversation continuity
    system_prompt="Answer concisely in bullet points.",  # custom LLM instructions
    verbose=True,                # include detailed graph context in results
)

Parallel scraping for larger datasets

The integration's scrape_urls function processes URLs sequentially. For larger lists, you can parallelize using asyncio.gather with your own batching:

import asyncio
import os
import cognee
from scrapegraph_py import AsyncClient
 
os.environ["SGAI_API_KEY"] = "your-sgai-api-key"
os.environ["LLM_API_KEY"] = "your-openai-api-key"
 
async def scrape_parallel(urls: list[str], user_prompt: str, batch_size: int = 5) -> list[dict]:
    results = []
    async with AsyncClient(api_key=os.environ["SGAI_API_KEY"]) as client:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            tasks = [
                client.smartscraper(website_url=url, user_prompt=user_prompt)
                for url in batch
            ]
            responses = await asyncio.gather(*tasks, return_exceptions=True)
            for url, resp in zip(batch, responses):
                if isinstance(resp, Exception):
                    results.append({"url": url, "content": "", "error": str(resp)})
                else:
                    results.append({"url": url, "content": resp.get("result", "")})
    return results
 
async def main():
    urls = [f"https://example.com/product/{i}" for i in range(20)]
 
    scraped = await scrape_parallel(urls, "Extract product name, price, and description")
    successful = [r for r in scraped if not r.get("error")]
 
    combined = "\n\n".join(f"Source: {r['url']}\n{r['content']}" for r in successful)
    await cognee.add(combined, dataset_name="products")
    await cognee.cognify()
 
asyncio.run(main())

Keep batch sizes reasonable. Five concurrent requests is a safe default. Too many parallel calls will hit rate limits.
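
One way to stay under those limits is a small retry wrapper with exponential backoff around each call. This is a generic sketch, not part of either library:

```python
import asyncio
import random

async def with_retries(make_call, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry an async call with exponential backoff plus jitter.

    make_call is a zero-argument callable that returns a fresh coroutine
    on each attempt, e.g.:
        lambda: client.smartscraper(website_url=url, user_prompt=prompt)
    """
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Delays of 1s, 2s, 4s, ... plus jitter so retries don't stampede.
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

In scrape_parallel above, you would build each task as `with_retries(lambda url=url: client.smartscraper(website_url=url, user_prompt=user_prompt))` instead of calling the client directly.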


How the integration code works

If you want to extend or fork the integration, here's the full source. It's intentionally thin:

async def scrape_and_add(
    urls: list[str],
    user_prompt: str = "Extract the main content, title, and key information from this page",
    api_key: str | None = None,
    dataset_name: str = "scrapegraph",
):
    # 1. Scrape each URL sequentially via ScrapeGraphAI
    scraped = await scrape_urls(urls=urls, user_prompt=user_prompt, api_key=api_key)
 
    # 2. Drop failed URLs, keep successful ones
    successful = [item for item in scraped if not item.get("error")]
    if not successful:
        raise RuntimeError("No URLs were scraped successfully.")
 
    # 3. Concatenate into one text block, preserving source attribution
    combined_text = "\n\n".join(
        f"Source: {item['url']}\n{item['content']}" for item in successful
    )
 
    # 4. Hand off to Cognee
    await cognee.add(combined_text, dataset_name=dataset_name)
    return await cognee.cognify()

The source attribution (Source: url\n) matters. Cognee picks it up and associates entities back to their origin URLs, which helps with grounded answers.
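
If you need to see what each page contributed, the concatenation is easy to invert. A small debugging helper — not part of the integration, and it assumes individual page content contains no blank lines, since blocks are joined with "\n\n":

```python
def split_sources(combined_text: str) -> dict[str, str]:
    """Map each source URL back to the content block it contributed.

    Inverts the 'Source: <url>\\n<content>' format shown above. Assumes
    page content itself has no blank lines.
    """
    sources = {}
    for block in combined_text.split("\n\n"):
        if block.startswith("Source: "):
            header, _, body = block.partition("\n")
            sources[header[len("Source: "):]] = body
    return sources
```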


Troubleshooting

RuntimeError: No URLs were scraped successfully

Check your SGAI_API_KEY is set and valid. Run scrape_urls first to see per-URL errors before using scrape_and_add.

cognify is slow

Expected. It's running LLM calls for entity extraction and embedding generation. For 5–10 pages with gpt-4o-mini it takes around 30–60 seconds. For larger datasets, consider running cognify in the background.
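
One way to do that is asyncio.create_task, so the rest of your pipeline keeps moving while the graph builds. A runnable sketch, with a stand-in coroutine where the real `await cognee.cognify()` would go:

```python
import asyncio

async def main():
    async def slow_graph_build():
        # Stand-in for `await cognee.cognify()` so the sketch runs anywhere.
        await asyncio.sleep(0.1)
        return "graph ready"

    # Schedule the heavy job without awaiting it yet.
    task = asyncio.create_task(slow_graph_build())

    # ... scrape the next batch, serve requests, etc. ...

    # Await only when you actually need the graph.
    return await task

print(asyncio.run(main()))  # → graph ready
```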

Empty or vague search results

Usually a prompt quality issue. Check what scrape_urls actually returned for your URLs. If the extracted content is thin, tighten the user_prompt. Also try query_type=SearchType.CHUNKS to see the raw material Cognee is working with.

Python version error

The package requires Python 3.10+. Check with python --version. The Cognee library itself also has native dependencies. If you hit build errors, try uv instead of pip.

Stale results from previous runs

Call cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) to reset. Only do this in development. In production you'll usually want to accumulate data, not wipe it.


When to use this combination

This stack is the right choice when:

  • You're querying across many sources: 5, 20, 100 scraped pages. The graph model starts winning decisively over vector search at this scale.
  • You need synthesis, not retrieval: "how do these products compare" rather than "find the paragraph that mentions pricing"
  • Your data changes: re-scraping on a schedule and adding to the graph keeps your knowledge base current
  • You're building agent memory: agents that need to reason across sessions benefit from persistent, structured knowledge rather than a rolling context window

It's overkill if you need a one-off data extraction from a single page. For that, SmartScraper alone is faster, cheaper, and simpler.


Cognee beyond this tutorial

This article covers one integration, but Cognee is a broader knowledge engine — it's designed to be the persistent memory layer for AI agents, not just a retrieval backend. Here's what's available if you go deeper:

Self-improving memory with memify(). After you build a knowledge graph with cognify(), calling memify() refines it over time: pruning stale nodes, strengthening frequently-accessed connections, and reweighting edges based on usage patterns. Memory isn't static storage — it evolves based on feedback and interaction.

Custom graph models. You can define your own Pydantic schemas extending DataPoint to control exactly what entities and relationships Cognee extracts. Pass a graph_model to cognify() and Cognee uses your schema instead of the default extraction — useful when you need domain-specific structure.

Bring your own infrastructure. Cognee isn't locked to any single database. Swap the graph backend (Kuzu default, Neo4j, AWS Neptune), vector store (LanceDB default, PGVector, ChromaDB), or relational database (SQLite default, PostgreSQL) via environment variables. The cognee-community repo adds adapters for Qdrant, Milvus, Weaviate, Pinecone, Redis, Memgraph, FalkorDB, and more.

Session and agent memory. Cognee tracks search interactions via session_id and persists conversation history into the knowledge graph with memify(). Agents get persistent, structured memory across sessions — not a rolling context window that forgets.

Temporal awareness. The temporal_cognify=True flag extracts events and timestamps from your data, building a time-aware graph that supports queries like "what happened before X" or "how did Y change over time."


Running the official example

The cognee-community repo has a working example that covers both scrape_urls and scrape_and_add:

git clone https://github.com/topoteretes/cognee-community
cd cognee-community/packages/task/scrapegraph_tasks
 
export SGAI_API_KEY="your-key"
export LLM_API_KEY="your-openai-key"
 
uv sync --all-extras
uv run python examples/example.py

Cost breakdown

Each scrape_urls call runs ScrapeGraphAI's SmartScraper on every URL, at 10 credits per URL. Ten pages cost 100 credits. You get 100 free credits when you sign up, so this tutorial is on us.

Cognee's cognify() uses your configured LLM provider (OpenAI by default). For a small dataset of 10 pages with gpt-4o-mini, expect roughly $0.02–0.05 in LLM costs. The entity extraction and embedding generation scale linearly with content size.
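
Those numbers make back-of-envelope budgeting easy. A tiny helper using the figures above — the defaults come from this section, and the linear-scaling assumption is mine:

```python
def estimate_costs(num_pages: int,
                   credits_per_page: int = 10,
                   llm_cost_per_10_pages: tuple[float, float] = (0.02, 0.05)):
    """Estimate ScrapeGraphAI credits and LLM dollar cost for a run.

    Defaults: 10 credits per URL, and $0.02-0.05 of LLM spend per
    10 pages, assumed to scale linearly with page count.
    """
    credits = num_pages * credits_per_page
    lo = llm_cost_per_10_pages[0] * num_pages / 10
    hi = llm_cost_per_10_pages[1] * num_pages / 10
    return credits, (lo, hi)

# A 50-page run: 500 credits, roughly $0.10-0.25 in LLM cost.
print(estimate_costs(50))
```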


What's next

We're looking at tighter integration between the two platforms, specifically around scheduled re-scraping that automatically updates the Cognee graph when source pages change, and richer metadata in the graph nodes (last-scraped timestamp, change detection, confidence scores).

If you build something interesting with this, I'd genuinely like to hear about it. Reach out on Twitter or open an issue in the cognee-community repo.

