Integrating ScrapeGraphAI with ADK: Complete Guide
Agent Development Kit (ADK) is a powerful framework for building intelligent AI agents with Google's models such as Gemini, with support for other generative AI models as well. ScrapeGraphAI provides a complete MCP (Model Context Protocol) server that seamlessly brings web scraping, crawling, and structured data extraction to ADK agents.
In this tutorial, we'll see how to combine these two technologies to create advanced AI agents capable of navigating the web, extracting information, and transforming it into ready-to-use structured data.
What is ADK?
Agent Development Kit (ADK) is a modern framework for developing AI agents around Google's Gemini models, with support for other generative AI models. ADK agents can:
- Communicate naturally with users
- Utilize external tools through the MCP protocol
- Handle complex workflows with multi-step reasoning capabilities
- Extract and process data from various sources
Why ScrapeGraphAI with ADK?
ScrapeGraphAI is an AI-powered platform for web scraping that offers:
- Structured extraction: Transforms HTML into structured JSON using AI
- Intelligent crawling: Automatically navigates complex websites
- JavaScript support: Handles sites with heavy client-side rendering
- MCP protocol: Standard integration with frameworks like ADK
By combining these technologies, you can create agents capable of:
- Automatically searching for information on the web
- Extracting structured data from web pages
- Monitoring sites for changes
- Gathering market intelligence
- Automating complex research
Installation
To get started, you need to install the ScrapeGraphAI MCP server. The server is available as a Python package (requires Python 3.13 or higher):
pip install scrapegraph-mcp
Also make sure you have ADK installed:
pip install google-adk
Initial Setup
Before using ScrapeGraphAI with ADK, you'll need a ScrapeGraphAI API key. You can obtain it from the ScrapeGraphAI dashboard.
Save your API key in an environment variable or a Python variable:
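If you go the environment-variable route, here's a minimal sketch for reading and validating the key at startup (illustrative; it assumes SGAI_API_KEY is already exported in your shell):
import os

# Read the API key from the environment and fail fast if it's missing
SGAI_API_KEY = os.getenv("SGAI_API_KEY")
if not SGAI_API_KEY:
    raise RuntimeError("SGAI_API_KEY environment variable is not set")
Or, for quick experiments, assign it directly: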
SGAI_API_KEY = "YOUR_SCRAPEGRAPHAI_API_KEY"Basic Integration with ADK
Here's how to set up a basic ADK agent that uses ScrapeGraphAI:
import asyncio
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from mcp import StdioServerParameters
SGAI_API_KEY = "YOUR_SCRAPEGRAPHAI_API_KEY"
# Create an ADK agent with ScrapeGraphAI integration
root_agent = Agent(
    model="gemini-2.5-pro",
    name="scrapegraph_assistant_agent",
    instruction="""Help the user with web scraping and data extraction using
    ScrapeGraphAI. You can convert webpages to markdown, extract
    structured data using AI, perform web searches, crawl
    multiple pages, and automate complex scraping workflows.""",
    tools=[
        MCPToolset(
            connection_params=StdioConnectionParams(
                server_params=StdioServerParameters(
                    # The following CLI command is available
                    # from `pip install scrapegraph-mcp`
                    command="scrapegraph-mcp",
                    env={
                        "SGAI_API_KEY": SGAI_API_KEY,
                    },
                ),
                timeout=300,
            ),
            # Optional: Filter which tools from the MCP server are exposed
            # tool_filter=["markdownify", "smartscraper", "searchscraper"]
        ),
    ],
)
runner = InMemoryRunner(agent=root_agent)
What This Code Does
- Imports Required Modules: Includes asyncio, Agent, InMemoryRunner, and MCP-related imports
- Creates an Agent: Uses Google's gemini-2.5-pro model
- Configures Instructions: Defines the agent's web scraping capabilities
- Adds MCPToolset: Integrates the ScrapeGraphAI MCP server
- Configures Connection: Uses stdio to communicate with the MCP server
- Sets Timeout: 300 seconds for complex operations
- Creates Runner: Initializes an InMemoryRunner to execute agent tasks
Using the Agent
Once the agent is configured, you can use it for various web scraping tasks:
# Example 1: Convert a webpage to markdown
response = asyncio.run(runner.run_debug("Convert this page to markdown: https://scrapegraphai.com"))
print(response)
# Example 2: Extract structured data
response = asyncio.run(runner.run_debug("Extract all products with name, price, and description from: https://scrapegraphai.com/blog"))
print(response)
# Example 3: Perform a web search
response = asyncio.run(runner.run_debug("Search for the latest AI news and return title, author, and publication date"))
print(response)
Available Tools
ScrapeGraphAI MCP Server offers a complete suite of tools for web scraping:
1. markdownify
Transforms any webpage into clean, structured markdown format.
Ideal for:
- Archiving web content
- Content migration
- Reading and analyzing articles
Example:
response = asyncio.run(runner.run_debug("Convert https://docs.python.org/3/tutorial/ to markdown"))
print(response)
2. smartscraper
Uses AI to extract structured data from any webpage with support for infinite scrolling.
Ideal for:
- E-commerce scraping (products, prices)
- Business information extraction
- Data collection from dynamic feeds
Example:
response = asyncio.run(runner.run_debug("""Extract all products from https://scrapegraphai.com with:
- Product name
- Price
- Availability
- Main image"""))
print(response)
3. searchscraper
Performs AI-powered web searches with structured, actionable results.
Ideal for:
- Searching for information across multiple sites
- Competitive intelligence
- Market analysis
Example:
response = asyncio.run(runner.run_debug("Search for gaming laptop price information and return results from the top 5 sites"))
print(response)
4. scrape
Basic endpoint for fetching content with optional heavy JavaScript rendering.
Ideal for:
- Fetching raw HTML
- Page structure analysis
- Pre-processing before other operations
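Example (an illustrative prompt; the agent decides when to route a request to this tool):
response = asyncio.run(runner.run_debug("Fetch the raw HTML of https://scrapegraphai.com, rendering JavaScript if needed"))
print(response)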
5. sitemap
Extracts sitemap URLs and structure for any website.
Ideal for:
- Content discovery
- Crawling planning
- SEO analysis
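Example (illustrative prompt):
response = asyncio.run(runner.run_debug("Get the sitemap of https://scrapegraphai.com and list its main sections"))
print(response)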
6. smartcrawler_initiate / smartcrawler_fetch_results
Initiates intelligent multi-page crawling (asynchronous operation).
Ideal for:
- Complete site crawling
- Large-scale data collection
- Content archiving
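Example (illustrative prompt; because crawling is asynchronous, the agent initiates the crawl and then fetches the results):
response = asyncio.run(runner.run_debug("Crawl https://scrapegraphai.com/blog and summarize each article; fetch the crawl results once they are ready"))
print(response)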
7. agentic_scrapper
Runs advanced agentic scraping workflows with customizable steps and structured output schemas.
Ideal for:
- Complex multi-step workflows
- Form interactions
- Guided navigation
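Example (illustrative prompt):
response = asyncio.run(runner.run_debug("Using agentic scraping on https://scrapegraphai.com, navigate to the pricing page and return each plan with name, price, and included features"))
print(response)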
Practical Example: E-commerce Price Monitoring
Here's a complete example of how to use the integration to monitor product prices:
import asyncio
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from mcp import StdioServerParameters
import os
SGAI_API_KEY = os.getenv("SGAI_API_KEY")
# Create agent specialized for price monitoring
price_monitor_agent = Agent(
    model="gemini-2.5-pro",
    name="price_monitor_agent",
    instruction="""You are an agent specialized in e-commerce price monitoring.
    When you receive a monitoring request:
    1. Identify the product page
    2. Extract product name, current price, and availability
    3. Compare with historical prices if available
    4. Return a structured summary with key information""",
    tools=[
        MCPToolset(
            connection_params=StdioConnectionParams(
                server_params=StdioServerParameters(
                    command="scrapegraph-mcp",
                    env={"SGAI_API_KEY": SGAI_API_KEY},
                ),
                timeout=300,
            ),
            # Filter only necessary tools for better performance
            tool_filter=["smartscraper", "markdownify"]
        ),
    ],
)
runner = InMemoryRunner(agent=price_monitor_agent)
# Use the agent
result = asyncio.run(runner.run_debug("Monitor the price of this product: https://scrapegraphai.com/pricing"))
print(result)
Filtering Tools for Performance
To optimize performance, you can filter which tools from the MCP server to expose to your agent:
MCPToolset(
    connection_params=StdioConnectionParams(...),
    tool_filter=["markdownify", "smartscraper", "searchscraper"]
)
This limits the agent to using only the specified tools, reducing:
- Latency: Fewer tools to load
- Costs: Avoids calls to unnecessary tools
- Complexity: Simpler interface for the agent
Error Handling
It's important to properly handle errors when working with web scraping:
import asyncio
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
import logging
logging.basicConfig(level=logging.INFO)
runner = InMemoryRunner(agent=root_agent)
try:
    response = asyncio.run(runner.run_debug("Extract data from https://scrapegraphai.com"))
    print(response)
except Exception as e:
    logging.error(f"Error during scraping: {e}")
    # Handle the error or retry with different parameters
Common errors and solutions:
- Timeout: Increase timeout for complex operations
- Rate limiting: Implement exponential backoff
- Dynamic content: Use smartscraper with render_heavy_js=True
Best Practices
1. Use Clear Instructions
Provide specific instructions to the agent for better results:
# ❌ Not optimal
"Extract data from the page"
# ✅ Optimal
"Extract all products with name, price, description, and rating from https://scrapegraphai.com/blog"2. Optimize Timeouts
Set appropriate timeouts based on operation complexity; see the sketch after this list:
- Simple: 60-120 seconds
- Medium: 180-300 seconds
- Complex: 300-600 seconds
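For example, a configuration for long-running crawls might sit at the top of that range (a sketch reusing the connection setup from earlier; the value is illustrative):
MCPToolset(
    connection_params=StdioConnectionParams(
        server_params=StdioServerParameters(
            command="scrapegraph-mcp",
            env={"SGAI_API_KEY": SGAI_API_KEY},
        ),
        timeout=600,  # complex, multi-page operations
    ),
)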
3. Filter Tools When Possible
Limiting available tools improves performance and reduces costs:
tool_filter=["smartscraper"] # Only when necessary4. Handle Rate Limiting
Implement backoff when making many requests:
import asyncio
import time
from google.adk.runners import InMemoryRunner
async def scrape_with_backoff(runner, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await runner.run_debug(f"Extract data from {url}")
        except Exception as e:
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
# Usage
response = asyncio.run(scrape_with_backoff(runner, "https://scrapegraphai.com"))
print(response)
Advanced Use Cases
1. Competitive Intelligence
Create an agent that monitors competitors:
competitive_intel_agent = Agent(
    model="gemini-2.5-pro",
    name="competitive_intel",
    instruction="""Analyze competitor websites and extract:
    - Products and prices
    - Features and positioning
    - Marketing content
    - SEO strategies""",
    tools=[MCPToolset(...)],
)
2. Content Aggregation
Gather content from different sources:
content_aggregator = Agent(
    model="gemini-2.5-pro",
    name="content_aggregator",
    instruction="""Aggregate articles and content from different sources,
    extract title, author, date, and main content.""",
    tools=[MCPToolset(...)],
)
3. Market Research
Perform automated market research:
market_researcher = Agent(
    model="gemini-2.5-pro",
    name="market_researcher",
    instruction="""Perform market research on specific topics,
    aggregating data from multiple sources and providing structured insights.""",
    tools=[MCPToolset(...)],
)
Limitations and Considerations
Technical Limitations
- Python 3.13+: The MCP server requires Python 3.13 or higher
- API Key Required: A valid ScrapeGraphAI API key is needed
- Rate Limits: Respect the platform's rate limits
Cost Considerations
- Each call to ScrapeGraphAI tools consumes credits
- Monitor usage through the dashboard
- Use tool filters to reduce costs
Legal Compliance
- Always respect site robots.txt files
- Don't violate terms of service
- Scrape ethically and responsibly
Troubleshooting
Issue: MCP server won't start
Solution:
# Verify installation
pip show scrapegraph-mcp
# Verify API key is configured
echo $SGAI_API_KEY
Issue: Frequent timeouts
Solution:
# Increase timeout
timeout=600  # 10 minutes
Issue: Tools not available
Solution:
# Verify tools are correctly filtered
tool_filter=["smartscraper", "markdownify"] # Explicit listAdditional Resources
To learn more about the integration, consult:
- ScrapeGraphAI MCP Server Documentation
- ScrapeGraphAI MCP Repository
- ADK Documentation
- ADK Integration Pull Request
Conclusion
Integrating ScrapeGraphAI with ADK opens new possibilities for creating powerful AI agents capable of:
- Navigating the web autonomously
- Extracting structured data from any source
- Aggregating information from multiple sources
- Automating complex research
- Providing insights based on real-time data
With this combination, you can build automated intelligence systems, market monitoring pipelines, and advanced research agents that go far beyond traditional scraping capabilities.
Start experimenting with this powerful combination today and discover what you can build! 🚀
Frequently Asked Questions
How do I get a ScrapeGraphAI API key?
Sign up at dashboard.scrapegraphai.com to get your free API key.
Can I use different Gemini models?
Yes, you can use any Gemini model available in ADK:
- gemini-2.5-pro (recommended for complex tasks)
- gemini-1.5-pro (more economical)
- gemini-1.5-flash (faster)
What is the cost of usage?
ScrapeGraphAI uses a credit system. Check the pricing page for details.
Can I use multiple agents simultaneously?
Yes, each agent can have its own MCP connection. Make sure to properly manage resources.
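A minimal sketch of one way to do that, assuming the imports and SGAI_API_KEY from the earlier examples (the make_scrapegraph_toolset helper is hypothetical, not part of either library):
def make_scrapegraph_toolset():
    # Hypothetical helper: each call builds a separate MCPToolset,
    # so each agent talks to its own scrapegraph-mcp process over stdio
    return MCPToolset(
        connection_params=StdioConnectionParams(
            server_params=StdioServerParameters(
                command="scrapegraph-mcp",
                env={"SGAI_API_KEY": SGAI_API_KEY},
            ),
            timeout=300,
        ),
    )

scraper_agent = Agent(
    model="gemini-2.5-pro",
    name="scraper_agent",
    instruction="Extract structured data from webpages.",
    tools=[make_scrapegraph_toolset()],
)

search_agent = Agent(
    model="gemini-2.5-pro",
    name="search_agent",
    instruction="Perform AI-powered web searches.",
    tools=[make_scrapegraph_toolset()],
)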
How do I handle network errors?
Implement retry logic with exponential backoff and appropriate error handling.
Related Articles
Want to learn more about AI agents and automation? Explore these guides:
- AI Agents Tutorial - Learn the fundamentals of AI agents
- Web Scraping with AI - Discover AI-powered scraping capabilities
- MCP Server Tutorial - Deep dive into the MCP protocol
- Structured Data Extraction - Learn to structure extracted data
- Web Scraping Best Practices - Improve your scraping techniques
