Integrating ScrapeGraphAI with ADK: Complete Guide

Marco Vinciguerra

Agent Development Kit (ADK) is a powerful framework for building intelligent AI agents with Google's models such as Gemini, and it also supports other generative AI models. ScrapeGraphAI provides a complete MCP (Model Context Protocol) server that seamlessly integrates web scraping, crawling, and structured data extraction with ADK agents.

In this tutorial, we'll walk through how to combine these two technologies to create advanced AI agents capable of navigating the web, extracting information, and transforming it into ready-to-use structured data.


What is ADK?

Agent Development Kit (ADK) is a modern framework for developing AI agents with Google's Gemini models, with support for other generative AI models as well. ADK agents can:

  • Communicate naturally with users
  • Utilize external tools through the MCP protocol
  • Handle complex workflows with multi-step reasoning capabilities
  • Extract and process data from various sources

Why ScrapeGraphAI with ADK?

ScrapeGraphAI is an AI-powered platform for web scraping that offers:

  • Structured extraction: Transforms HTML into structured JSON using AI
  • Intelligent crawling: Automatically navigates complex websites
  • JavaScript support: Handles sites with heavy client-side rendering
  • MCP protocol: Standard integration with frameworks like ADK

By combining these technologies, you can create agents capable of:

  • Automatically searching for information on the web
  • Extracting structured data from web pages
  • Monitoring sites for changes
  • Gathering market intelligence
  • Automating complex research

Installation

To get started, you need to install the ScrapeGraphAI MCP server. The server is available as a Python package (requires Python 3.13 or higher):

pip install scrapegraph-mcp

Also make sure you have ADK installed:

pip install google-adk

Initial Setup

Before using ScrapeGraphAI with ADK, you'll need a ScrapeGraphAI API key. You can obtain it from the ScrapeGraphAI dashboard.

Save your API key in an environment variable or a Python variable:

SGAI_API_KEY = "YOUR_SCRAPEGRAPHAI_API_KEY"
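
Or export it as an environment variable, which the later examples read with os.getenv:

export SGAI_API_KEY="YOUR_SCRAPEGRAPHAI_API_KEY"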

Basic Integration with ADK

Here's how to set up a basic ADK agent that uses ScrapeGraphAI:

import asyncio
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from mcp import StdioServerParameters
 
SGAI_API_KEY = "YOUR_SCRAPEGRAPHAI_API_KEY"
 
# Create an ADK agent with ScrapeGraphAI integration
root_agent = Agent(
    model="gemini-2.5-pro",
    name="scrapegraph_assistant_agent",
    instruction="""Help the user with web scraping and data extraction using
ScrapeGraphAI. You can convert webpages to markdown, extract 
structured data using AI, perform web searches, crawl
multiple pages, and automate complex scraping workflows.""",
    tools=[
        MCPToolset(
            connection_params=StdioConnectionParams(
                server_params=StdioServerParameters(
                    # The following CLI command is available
                    # from `pip install scrapegraph-mcp`
                    command="scrapegraph-mcp",
                    env={
                        "SGAI_API_KEY": SGAI_API_KEY,
                    },
                ),
                timeout=300,
            ),
            # Optional: Filter which tools from the MCP server are exposed
            # tool_filter=["markdownify", "smartscraper", "searchscraper"]
        ),
    ],
)
 
runner = InMemoryRunner(agent=root_agent)

What This Code Does

  1. Imports Required Modules: Includes asyncio, Agent, InMemoryRunner, and MCP-related imports
  2. Creates an Agent: Uses Google's gemini-2.5-pro model
  3. Configures Instructions: Defines the agent's web scraping capabilities
  4. Adds MCPToolset: Integrates the ScrapeGraphAI MCP server
  5. Configures Connection: Uses stdio to communicate with the MCP server
  6. Sets Timeout: 300 seconds for complex operations
  7. Creates Runner: Initializes an InMemoryRunner to execute agent tasks

Using the Agent

Once the agent is configured, you can use it for various web scraping tasks:

# Example 1: Convert a webpage to markdown
response = asyncio.run(runner.run_debug("Convert this page to markdown: https://scrapegraphai.com"))
print(response)
 
# Example 2: Extract structured data
response = asyncio.run(runner.run_debug("Extract all products with name, price, and description from: https://scrapegraphai.com/blog"))
print(response)
 
# Example 3: Perform a web search
response = asyncio.run(runner.run_debug("Search for the latest AI news and return title, author, and publication date"))
print(response)

Available Tools

ScrapeGraphAI MCP Server offers a complete suite of tools for web scraping:

1. markdownify

Transforms any webpage into clean, structured markdown format.

Ideal for:

  • Archiving web content
  • Content migration
  • Reading and analyzing articles

Example:

response = asyncio.run(runner.run_debug("Convert https://docs.python.org/3/tutorial/ to markdown"))
print(response)

2. smartscraper

Uses AI to extract structured data from any webpage with support for infinite scrolling.

Ideal for:

  • E-commerce scraping (products, prices)
  • Business information extraction
  • Data collection from dynamic feeds

Example:

response = asyncio.run(runner.run_debug("""Extract all products from https://scrapegraphai.com with:
    - Product name
    - Price
    - Availability
    - Main image"""))
print(response)

3. searchscraper

Performs AI-powered web searches with structured, actionable results.

Ideal for:

  • Searching for information across multiple sites
  • Competitive intelligence
  • Market analysis

Example:

response = asyncio.run(runner.run_debug("Search for gaming laptop price information and return results from the top 5 sites"))
print(response)

4. scrape

Basic endpoint for fetching content with optional heavy JavaScript rendering.

Ideal for:

  • Fetching raw HTML
  • Page structure analysis
  • Pre-processing before other operations
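
Example (a sketch; the agent chooses the scrape tool based on the prompt wording):

response = asyncio.run(runner.run_debug("Fetch the raw HTML of https://scrapegraphai.com with heavy JavaScript rendering enabled"))
print(response)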

5. sitemap

Extracts sitemap URLs and structure for any website.

Ideal for:

  • Content discovery
  • Crawling planning
  • SEO analysis
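
Example (illustrative prompt):

response = asyncio.run(runner.run_debug("Get the sitemap of https://scrapegraphai.com and list the main sections of the site"))
print(response)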

6. smartcrawler_initiate / smartcrawler_fetch_results

Initiates intelligent multi-page crawling (asynchronous operation).

Ideal for:

  • Complete site crawling
  • Large-scale data collection
  • Content archiving
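
Example (a sketch; since crawling is asynchronous, the agent first initiates the crawl, then fetches the results):

response = asyncio.run(runner.run_debug("Crawl https://scrapegraphai.com, extract the title and summary of each page, and report the results once the crawl completes"))
print(response)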

7. agentic_scrapper

Runs advanced agentic scraping workflows with customizable steps and structured output schemas.

Ideal for:

  • Complex multi-step workflows
  • Form interactions
  • Guided navigation
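
Example (illustrative; the steps are described in natural language and the agent translates them into a workflow):

response = asyncio.run(runner.run_debug("""On https://scrapegraphai.com, navigate to the pricing page,
    then extract each plan with its name, monthly price, and included credits"""))
print(response)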

Practical Example: E-commerce Price Monitoring

Here's a complete example of how to use the integration to monitor product prices:

import asyncio
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from mcp import StdioServerParameters
import os
 
SGAI_API_KEY = os.getenv("SGAI_API_KEY")
 
# Create agent specialized for price monitoring
price_monitor_agent = Agent(
    model="gemini-2.5-pro",
    name="price_monitor_agent",
    instruction="""You are an agent specialized in e-commerce price monitoring.
When you receive a monitoring request:
1. Identify the product page
2. Extract product name, current price, and availability
3. Compare with historical prices if available
4. Return a structured summary with key information""",
    tools=[
        MCPToolset(
            connection_params=StdioConnectionParams(
                server_params=StdioServerParameters(
                    command="scrapegraph-mcp",
                    env={"SGAI_API_KEY": SGAI_API_KEY},
                ),
                timeout=300,
            ),
            # Filter only necessary tools for better performance
            tool_filter=["smartscraper", "markdownify"]
        ),
    ],
)
 
runner = InMemoryRunner(agent=price_monitor_agent)
 
# Use the agent
result = asyncio.run(runner.run_debug("Monitor the price of this product: https://scrapegraphai.com/pricing"))
print(result)

Filtering Tools for Performance

To optimize performance, you can filter which tools from the MCP server to expose to your agent:

MCPToolset(
    connection_params=StdioConnectionParams(...),
    tool_filter=["markdownify", "smartscraper", "searchscraper"]
)

This limits the agent to using only the specified tools, reducing:

  • Latency: Fewer tools to load
  • Costs: Avoids calls to unnecessary tools
  • Complexity: Simpler interface for the agent

Error Handling

It's important to properly handle errors when working with web scraping:

import asyncio
from google.adk.runners import InMemoryRunner
import logging
 
logging.basicConfig(level=logging.INFO)
 
runner = InMemoryRunner(agent=root_agent)
 
try:
    response = asyncio.run(runner.run_debug("Extract data from https://scrapegraphai.com"))
    print(response)
except Exception as e:
    logging.error(f"Error during scraping: {e}")
    # Handle the error or retry with different parameters

Common errors and solutions:

  1. Timeout: Increase timeout for complex operations
  2. Rate limiting: Implement exponential backoff
  3. Dynamic content: Use smartscraper with render_heavy_js=True (see the sketch after this list)
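
For the dynamic-content case, you can state the rendering requirement directly in the prompt; a minimal sketch (the agent is expected to forward the option to the smartscraper tool):

response = asyncio.run(runner.run_debug("Extract the article list from https://scrapegraphai.com/blog and enable heavy JavaScript rendering, since the content loads client-side"))
print(response)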

Best Practices

1. Use Clear Instructions

Provide specific instructions to the agent for better results:

# ❌ Not optimal
"Extract data from the page"
 
# ✅ Optimal
"Extract all products with name, price, description, and rating from https://scrapegraphai.com/blog"

2. Optimize Timeouts

Set appropriate timeouts based on operation complexity (see the snippet after this list):

  • Simple: 60-120 seconds
  • Medium: 180-300 seconds
  • Complex: 300-600 seconds
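
For example, a toolset intended for long crawls keeps the same connection setup as the basic example and only raises the timeout:

MCPToolset(
    connection_params=StdioConnectionParams(
        server_params=StdioServerParameters(
            command="scrapegraph-mcp",
            env={"SGAI_API_KEY": SGAI_API_KEY},
        ),
        timeout=600,  # complex operations: allow up to 10 minutes
    ),
)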

3. Filter Tools When Possible

Limiting available tools improves performance and reduces costs:

tool_filter=["smartscraper"]  # Only when necessary

4. Handle Rate Limiting

Implement backoff when making many requests:

import asyncio
from google.adk.runners import InMemoryRunner
 
async def scrape_with_backoff(runner, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await runner.run_debug(f"Extract data from {url}")
        except Exception as e:
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
 
# Usage
response = asyncio.run(scrape_with_backoff(runner, "https://scrapegraphai.com"))
print(response)

Advanced Use Cases

1. Competitive Intelligence

Create an agent that monitors competitors:

competitive_intel_agent = Agent(
    model="gemini-2.5-pro",
    name="competitive_intel",
    instruction="""Analyze competitor websites and extract:
    - Products and prices
    - Features and positioning
    - Marketing content
    - SEO strategies""",
    tools=[MCPToolset(...)],
)

2. Content Aggregation

Gather content from different sources:

content_aggregator = Agent(
    model="gemini-2.5-pro",
    name="content_aggregator",
    instruction="""Aggregate articles and content from different sources,
    extract title, author, date, and main content.""",
    tools=[MCPToolset(...)],
)

3. Market Research

Perform automated market research:

market_researcher = Agent(
    model="gemini-2.5-pro",
    name="market_researcher",
    instruction="""Perform market research on specific topics,
    aggregating data from multiple sources and providing structured insights.""",
    tools=[MCPToolset(...)],
)
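
These agents run exactly like the earlier examples; for instance (illustrative prompt):

runner = InMemoryRunner(agent=market_researcher)
result = asyncio.run(runner.run_debug("Research the market for AI-powered web scraping tools and summarize key players, pricing, and positioning"))
print(result)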

Limitations and Considerations

Technical Limitations

  • Python 3.13+: The MCP server requires Python 3.13 or higher
  • API Key Required: A valid ScrapeGraphAI API key is needed
  • Rate Limits: Respect the platform's rate limits

Cost Considerations

  • Each call to ScrapeGraphAI tools consumes credits
  • Monitor usage through the dashboard
  • Use tool filters to reduce costs

Legal Compliance

  • Always respect site robots.txt files
  • Don't violate terms of service
  • Use ethical and responsible scraping

Troubleshooting

Issue: MCP server won't start

Solution:

# Verify installation
pip show scrapegraph-mcp
 
# Verify API key is configured
echo $SGAI_API_KEY

Issue: Frequent timeouts

Solution:

# Increase timeout
timeout=600  # 10 minutes

Issue: Tools not available

Solution:

# Verify tools are correctly filtered
tool_filter=["smartscraper", "markdownify"]  # Explicit list


Conclusion

Integrating ScrapeGraphAI with ADK opens new possibilities for creating powerful AI agents capable of:

  • Navigating the web autonomously
  • Extracting structured data from any source
  • Aggregating information from multiple sources
  • Automating complex research
  • Providing insights based on real-time data

With this combination, you can build automated intelligence systems, market-monitoring pipelines, and advanced research agents that go far beyond traditional scraping capabilities.

Start experimenting with this powerful combination today and discover what you can build! 🚀


Frequently Asked Questions

How do I get a ScrapeGraphAI API key?

Sign up at dashboard.scrapegraphai.com to get your free API key.

Can I use different Gemini models?

Yes, you can use any Gemini model available in ADK:

  • gemini-2.5-pro (recommended for complex tasks)
  • gemini-1.5-pro (more economical)
  • gemini-1.5-flash (faster)

What is the cost of usage?

ScrapeGraphAI uses a credit system. Check the pricing page for details.

Can I use multiple agents simultaneously?

Yes, each agent can have its own MCP connection. Make sure to properly manage resources.
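
A minimal sketch, reusing the agents from the earlier examples (and run_debug as used throughout this guide):

async def run_in_parallel():
    # Each runner drives its own agent, each with its own MCP server process
    scraper_runner = InMemoryRunner(agent=root_agent)
    monitor_runner = InMemoryRunner(agent=price_monitor_agent)
    return await asyncio.gather(
        scraper_runner.run_debug("Convert https://scrapegraphai.com to markdown"),
        monitor_runner.run_debug("Check the current price on https://scrapegraphai.com/pricing"),
    )

results = asyncio.run(run_in_parallel())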

How do I handle network errors?

Implement retry logic with exponential backoff and appropriate error handling.
