
Enhancing Your AI Agents with ScrapeGraphAI & LangChain: Advanced Prompt Examples

In today's rapidly changing world, relying on static data is not enough. AI agents must be able to fetch live, relevant, and structured data from the web in real time. ScrapeGraphAI and LangChain together form a powerful combination to bring this capability to life. With LangChain's orchestration and ScrapeGraphAI's prompt-based intelligent scraping, developers can now build agents that think, act, and fetch.

This guide walks through setting up LangChain agents that use ScrapeGraphAI for real-time web scraping, including dynamic data extraction, LLM-driven structuring, and integrated reasoning. We'll also cover prompt engineering, schema definition, worked examples, configuration tips, and a before-and-after comparison of answer quality.


Why Combine ScrapeGraphAI and LangChain?

LangChain provides a framework for building AI-powered chains and agents with memory, tools, and external data integrations. ScrapeGraphAI brings language model-powered web scraping capabilities that remove the need for brittle XPath, CSS selectors, or custom scripts.

Combining them results in:

  • Agents that retrieve fresh data on the fly
  • Tools for summarizing or transforming real-time scraped data
  • More accurate responses that reflect the current state of the internet

Common Use Cases

  • AI assistants retrieving the latest stock prices
  • Market analysis bots comparing product prices
  • Academic research agents gathering fresh government stats
  • Customer service bots checking real-time inventory or news
  • Live fact-checking tools for journalists

Step 1: Install the SDKs

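Use pip to install the libraries. The langchain-openai package provides LangChain's OpenAI chat model integration:

pip install scrapegraphai
pip install langchain langchain-openai
# ScrapeGraphAI fetches pages with Playwright; install its browsers once
playwright install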

Step 2: Create a ScrapeGraphAI Wrapper as a Tool

We will now build a LangChain tool that wraps ScrapeGraphAI's SmartScraperGraph to scrape a webpage using prompts and return structured output.

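import json

from langchain.tools import Tool
from pydantic import BaseModel, Field
from scrapegraphai.graphs import SmartScraperGraph


# Output schema, defined as a Pydantic model
class NewsItem(BaseModel):
    headline: str = Field(description="Headline of the latest article")
    summary: str = Field(description="Short summary of the article")


def scrape_with_prompt(tool_input):
    # A LangChain Tool passes its input as one string, so accept JSON text
    # (or an already-parsed dict) with "url" and "prompt" keys.
    data = json.loads(tool_input) if isinstance(tool_input, str) else tool_input

    graph = SmartScraperGraph(
        prompt=data["prompt"],
        source=data["url"],
        schema=NewsItem,
        config={
            "llm": {
                # "provider/model" form, as in the ScrapeGraphAI README
                "model": "openai/gpt-4",
                "api_key": "your-api-key",
            }
        },
    )
    # run() returns the extracted data
    return graph.run()


scraper_tool = Tool(
    name="LiveScraper",
    func=scrape_with_prompt,
    description=(
        "Scrapes web content using prompt + schema with ScrapeGraphAI. "
        'Input: a JSON string such as {"url": "...", "prompt": "..."}'
    ),
)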
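
The agent hands a Tool its input as a single string, so the wrapper benefits from a small parsing step that fails with a clear message when the input is malformed. The sketch below shows one way to do that, assuming the agent sends a JSON object with "url" and "prompt" keys. parse_tool_input is a hypothetical helper for illustration, not part of LangChain or ScrapeGraphAI.

```python
import json


def parse_tool_input(raw):
    """Return (url, prompt) from the input an agent sends to the tool.

    Hypothetical helper: assumes a JSON object (or an already-parsed
    dict) with non-empty "url" and "prompt" keys.
    """
    if isinstance(raw, dict):
        data = raw
    else:
        try:
            data = json.loads(raw.strip())
        except json.JSONDecodeError as exc:
            raise ValueError(f"Tool input is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("Tool input must be a JSON object")

    # Report every missing or empty field in one message.
    missing = [key for key in ("url", "prompt") if not data.get(key)]
    if missing:
        raise ValueError(f"Tool input is missing: {', '.join(missing)}")
    return data["url"], data["prompt"]


url, prompt = parse_tool_input(
    '{"url": "https://example.com/latest-news", "prompt": "Extract the headline"}'
)
print(url, "|", prompt)
```

Raising ValueError with a readable message lets the calling code turn a bad tool call into a fallback answer instead of an unhandled crash.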

Step 3: Initialize Agent with Scraper Tool

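# ChatOpenAI lives in the langchain-openai package (pip install langchain-openai)
from langchain_openai import ChatOpenAI
# initialize_agent is the classic agent API; recent LangChain releases mark it deprecated
from langchain.agents import initialize_agent, AgentType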
 
llm = ChatOpenAI(model="gpt-4", temperature=0)
 
agent = initialize_agent(
    tools=[scraper_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

Step 4: Run Real-Time Scraping with Agent

# The agent takes a single text input and decides when to call LiveScraper
result = agent.run(
    "Use LiveScraper on https://example.com/latest-news to extract "
    "the latest headline and a short summary."
)
print(result)

Advanced Example: Scraping Product Pricing

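from pydantic import BaseModel
from scrapegraphai.graphs import SmartScraperGraph


# Pydantic model describing the fields to extract
class Product(BaseModel):
    product_title: str
    price: str  # kept as a string to preserve the currency symbol


prompt = "Extract product title and price from this product detail page"

graph = SmartScraperGraph(
    prompt=prompt,
    source="https://example.com/product-page",
    schema=Product,
    config={
        "llm": {
            "api_key": "your-api-key",
            # "provider/model" form, as in the ScrapeGraphAI README
            "model": "openai/gpt-4",
        }
    },
)

# run() returns the extracted fields
result = graph.run()
print(result)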

YAML-Based Configuration for Reusability

llm:
  provider: openai
  api_key: YOUR_KEY
  model: gpt-4
 
schema:
  product_title: string
  price: string
 
prompt: Extract product title and price
source: https://example.com/product
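
One way to consume a file like this is sketched below. It assumes the YAML has already been parsed into a dictionary (for example with PyYAML's yaml.safe_load) and that the returned pieces are then passed on to SmartScraperGraph. split_config and TYPE_NAMES are hypothetical names used for illustration, and the mapping from type names to Python types is an assumed convention, not something ScrapeGraphAI defines.

```python
# Map the type names used in the config file to Python types (assumed convention).
TYPE_NAMES = {"string": str, "integer": int, "number": float, "boolean": bool}


def split_config(cfg):
    """Split a parsed config dict into prompt, source, LLM settings and field types."""
    missing = [key for key in ("llm", "schema", "prompt", "source") if key not in cfg]
    if missing:
        raise KeyError(f"Config is missing sections: {', '.join(missing)}")

    fields = {}
    for name, type_name in cfg["schema"].items():
        if type_name not in TYPE_NAMES:
            raise ValueError(f"Unsupported type for {name!r}: {type_name!r}")
        fields[name] = TYPE_NAMES[type_name]

    return {
        "prompt": cfg["prompt"],
        "source": cfg["source"],
        "llm": dict(cfg["llm"]),  # copy so the caller's dict is not mutated
        "fields": fields,
    }


# The dictionary a YAML parser would produce for the file above.
parsed = {
    "llm": {"provider": "openai", "api_key": "YOUR_KEY", "model": "gpt-4"},
    "schema": {"product_title": "string", "price": "string"},
    "prompt": "Extract product title and price",
    "source": "https://example.com/product",
}
settings = split_config(parsed)
print(sorted(settings["fields"]))
```

Checking the sections up front means a typo in the file surfaces as a clear error before any scraping or LLM call is made.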

Accuracy Before & After Using ScrapeGraphAI

Question                              | Without Real-Time | With ScrapeGraphAI
What's the latest iPhone price?       | Incorrect guess   | Live scraped data
What's trending on government portal? | No results        | Fetched and summarized
What's the current weather alert?     | Generic output    | Live structured info

Best Practices

  • Log all scraped URLs, timestamps, and schemas used
  • Use caching to avoid redundant calls to the same site
  • Create prompt-schema pairs for reusable scraping agents
  • Respect robots.txt and site rate limits
  • Handle failures gracefully with fallback messages
  • Validate outputs before using them in final answers
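
Several of these practices (logging, caching, and graceful fallbacks) can live in one thin wrapper around the scraping function. The following is a minimal sketch using only the standard library. CachedScraper, its parameters, and the fallback text are hypothetical, and fake_scrape stands in for a real ScrapeGraphAI call so the example runs offline.

```python
import logging
import time

logger = logging.getLogger("scraper")


class CachedScraper:
    """Wrap a scrape function with a time-limited cache, logging and a fallback."""

    def __init__(self, scrape_fn, ttl_seconds=300,
                 fallback="Live data is unavailable right now."):
        self.scrape_fn = scrape_fn      # callable(url, prompt) -> result
        self.ttl_seconds = ttl_seconds
        self.fallback = fallback
        self._cache = {}                # (url, prompt) -> (timestamp, result)

    def __call__(self, url, prompt):
        key = (url, prompt)
        now = time.time()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            logger.info("cache hit url=%s", url)
            return cached[1]
        try:
            result = self.scrape_fn(url, prompt)
        except Exception:
            # Log the failure with its traceback and return the fallback message.
            logger.exception("scrape failed url=%s", url)
            return self.fallback
        logger.info("scraped url=%s prompt=%r timestamp=%.0f", url, prompt, now)
        self._cache[key] = (now, result)  # failures are never cached
        return result


calls = []


def fake_scrape(url, prompt):
    # Stand-in for a real scraping call; records each invocation.
    calls.append(url)
    return {"headline": "Example", "summary": "Stub result"}


scraper = CachedScraper(fake_scrape, ttl_seconds=60)
first = scraper("https://example.com/latest-news", "Extract the latest headline")
second = scraper("https://example.com/latest-news", "Extract the latest headline")
print(len(calls))  # the second call is served from the cache
```

Keying the cache on both URL and prompt keeps different questions about the same page from overwriting each other.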

Frequently Asked Questions

Can I use it with other LLM providers?

Yes, ScrapeGraphAI supports OpenAI, Groq, Mistral, and others via its flexible configuration.

What happens when the HTML structure changes?

ScrapeGraphAI uses LLMs to adapt to layout shifts and interpret content semantically, unlike brittle CSS selectors.

Can I extract complex tables?

Yes. Define a schema to match table rows and columns, and use prompts that explain what the table represents.
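
Scraped tables are worth checking before an agent quotes them. Here is a minimal sketch of such a check, assuming the scraper returns a dictionary with a "rows" list of per-row dictionaries. The "rows" key, the column names, and validate_rows itself are illustrative assumptions, not part of ScrapeGraphAI.

```python
def _has_value(row, column):
    """True when the row holds a non-empty value for the column."""
    value = row.get(column)
    return value is not None and str(value).strip() != ""


def validate_rows(result, columns):
    """Return only the rows that contain every expected column."""
    rows = result.get("rows", []) if isinstance(result, dict) else []
    return [
        row for row in rows
        if isinstance(row, dict) and all(_has_value(row, col) for col in columns)
    ]


scraped = {
    "rows": [
        {"region": "North", "population": "1,200"},
        {"region": "South", "population": ""},   # empty cell, rejected
        {"region": "East"},                      # missing column, rejected
    ]
}
clean = validate_rows(scraped, ["region", "population"])
print(len(clean), "of", len(scraped["rows"]), "rows passed")
```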

Can I scrape behind authentication?

ScrapeGraphAI is primarily designed for open-access content. Advanced setups can enable browser sessions if needed.


Conclusion

Integrating ScrapeGraphAI with LangChain lets your AI agents access, scrape, and structure real-time web data. Whether you are building research bots, news aggregators, or product monitoring tools, this integration fills the last-mile gap between language understanding and live data retrieval. Let your AI agents not just think, but think with context and data that is current, accurate, and aligned with the real world.

