Recently, major AI companies like Google, Mistral, and Anthropic have released web search and fetch tools for extracting data from the internet. These solutions aim to simplify the process of extracting valuable information from the web, making it easily accessible and usable within various applications and services.
However, like any technology, they come with their own set of advantages and disadvantages. In this comprehensive review, we'll compare the various services available on the market, providing insights into how each one works.
The parameters considered for the comparison are: speed (in seconds), quality of the response, and price.
Let's break down the pros and cons of these services together!
Before Starting: Testing Methodology
Our benchmark evaluation employs standardized testing parameters to ensure objective comparison:
Test Parameters:
- Target: Amazon keyboard listings (high-complexity e-commerce site)
- Goal: Extract product prices and ratings
- URL: https://www.amazon.us/s?k=keyboards
- Evaluation Criteria: Response time, data accuracy, and extraction completeness
Amazon serves as an ideal benchmark due to its dynamic content loading, anti-scraping measures, and complex page structure.
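To keep the speed numbers comparable, each request can be timed end to end. Here is a minimal sketch of such a harness (the `fetch_fn` callable is a placeholder for whichever client call is under test, not part of any SDK):

import time

def timed_call(fetch_fn, *args, **kwargs):
    """Run a scraping call and report its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = fetch_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Completed in {elapsed:.1f}s")
    return result, elapsed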
Anthropic Fetch Tool: The Worst Performer
Let's start with the first tool. Using the Anthropic client, you can issue the request directly:
import anthropic

def fetch_keyboard_prices():
    """
    Fetch keyboard prices from Amazon using Anthropic's web fetch tool.

    Returns:
        anthropic.types.Message: Response from Claude with web fetch results
    """
    # Initialize the Anthropic client
    client = anthropic.Anthropic()

    # Create message with web fetch tool
    response = client.messages.create(
        model="claude-opus-4-1-20250805",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": "Please find the names and prices of keyboards from https://www.amazon.us/"
            }
        ],
        tools=[
            {
                "type": "web_fetch_20250910",
                "name": "web_fetch",
                "max_uses": 5
            }
        ],
        extra_headers={
            "anthropic-beta": "web-fetch-2025-09-10"
        }
    )
    return response

def main():
    """
    Main function to execute the web fetch operation and display results.
    """
    print("Fetching keyboard prices from Amazon...")

    # Execute the web fetch request
    response = fetch_keyboard_prices()

    if response:
        print("\n--- Response ---")
        print(f"Response ID: {response.id}")
        print(f"Model: {response.model}")
        print(f"Role: {response.role}")

        # Display content
        for content in response.content:
            if content.type == "text":
                print(f"\nContent:\n{content.text}")
            elif content.type == "tool_use":
                print(f"\nTool Used: {content.name}")
                print(f"Tool Input: {content.input}")
    else:
        print("Failed to fetch keyboard prices")

if __name__ == "__main__":
    main()
And here is the partial response:
{
  "id": "msg_01HEN3iocaJsASWgoSU6dWhF",
  "type": "message",
  "role": "assistant",
  "model": "claude-opus-4-1-20250805",
  "content": [
    {
      "type": "text",
      "text": "I apologize, but I'm unable to fetch the Amazon page directly at this moment. However, I can provide you with some helpful information about finding keyboard prices on Amazon..."
    }
  ]
}
As you can see, Anthropic is unable to access Amazon and returns an apology message instead of the requested data.
PDF Handling
Anthropic can also fetch PDFs using the same client:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Fetch the markdown from this link https://storage.dtelab.com.ar/uploads/2023/02/short-stories-for-children-ingles-primaria-continuemos-estudiando.pdf"
        }
    ],
    tools=[
        {
            "type": "web_fetch_20250910",
            "name": "web_fetch",
            "max_uses": 5
        }
    ],
    extra_headers={
        "anthropic-beta": "web-fetch-2025-09-10"
    }
)

print(response)
This snippet took 75 seconds to fetch and convert the PDF.
ScrapeGraph AI: The Best Solution
With ScrapeGraph AI, the same extraction takes just a few lines of code:
from scrapegraph_py import Client

sgai_client = Client(api_key="your_sgai_api_key")

response = sgai_client.smartscraper(
    website_url="https://www.amazon.it/s?k=keyboards",
    user_prompt="Get the keyboard prices and rating"
)

print(f"Results: {response.result}")
The result is the most accurate of the four services, and it took just 15 seconds!
Here is a snippet of the results:
{
  "keyboard_prices": [
    {
      "product": "Logitech G213 Prodigy Gaming Cablata",
      "price": "36.87 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard for Windows",
      "price": "26.57 $"
    },
    {
      "product": "Basic Keyboard",
      "price": "12.34 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard with Cable Business Edition",
      "price": "13.29 $"
    },
    {
      "product": "Ewent Professional USB Wired Keyboard Italian Layout QWERTY",
      "price": "11.00 $"
    }
  ]
}
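Because the response is already structured JSON, post-processing takes only a few lines. As a quick illustration (this helper is hypothetical, not part of the SDK), the price strings above can be converted to floats for sorting or analysis:

def parse_prices(result: dict) -> list[tuple[str, float]]:
    """Convert price strings like '36.87 $' into floats, cheapest first."""
    items = []
    for entry in result["keyboard_prices"]:
        price = float(entry["price"].replace("$", "").strip())
        items.append((entry["product"], price))
    return sorted(items, key=lambda item: item[1])

# e.g. parse_prices(response.result)[0] -> ('Ewent Professional USB Wired Keyboard Italian Layout QWERTY', 11.0)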
ScrapeGraph AI can also fetch PDFs from the internet using the markdownify endpoint:
from scrapegraph_py import Client

# Initialize the client
sgai_client = Client(api_key="sgai-api-key")

# Markdownify request for the PDF URL
response = sgai_client.markdownify(
    website_url="https://storage.dtelab.com.ar/uploads/2023/02/short-stories-for-children-ingles-primaria-continuemos-estudiando.pdf"
)

print("Result:", response.result)
The most interesting aspect of ScrapeGraph AI is that its pricing is credit-based per scraped website rather than token-based, so if you need to scale the service, it's much easier to forecast how many credits your agent will consume.
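Since the unit of billing is a request rather than a token count, capacity planning reduces to simple arithmetic. A quick sketch (the per-page credit cost below is a made-up placeholder; see the pricing link that follows for real figures):

def forecast_credits(pages_per_day: int, credits_per_page: int, days: int = 30) -> int:
    """Estimate total credits for a scraping workload.

    credits_per_page is illustrative only -- check the pricing page
    for the real per-request cost of each endpoint.
    """
    return pages_per_day * credits_per_page * days

# e.g. 500 pages/day at a hypothetical 10 credits each
print(forecast_credits(pages_per_day=500, credits_per_page=10))  # 150000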
For more information about the pricing, take a look here: https://scrapegraphai.com/pricing
You can also inspect the requests made by your scripts in the dashboard.
Google Gemini: Sometimes It Works
Google Gemini presents a mixed performance profile in our web scraping evaluation, demonstrating both the potential and limitations of search-integrated AI tools.
Implementation Requirements: To conduct this assessment, we used Google's official Gemini SDK (the google-genai package, installable with pip install google-genai):
from google import genai
from google.genai import types

def generate_content():
    # Configure the client
    client = genai.Client()

    # Define the grounding tool
    grounding_tool = types.Tool(
        google_search=types.GoogleSearch()
    )

    # Configure generation settings
    config = types.GenerateContentConfig(
        tools=[grounding_tool]
    )

    # Make the request with a single content item
    contents = [
        {
            "parts": [
                {"text": "Get the keyboard prices and rating from this url https://www.amazon.it"}
            ]
        }
    ]

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=contents,
        config=config,
    )

    # Print the grounded response
    print(response.text)
    return response

generate_content()
On the first attempt it did not work, but after a few retries it returned a correct response in 3 minutes and 9 seconds!
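Since the tool only succeeds intermittently, it helps to wrap the call in a retry loop with exponential backoff. A minimal sketch (assuming failures surface as exceptions; an empty or unhelpful response would need an explicit check on top of this):

import time

def generate_with_retries(max_attempts: int = 3, base_delay: float = 2.0):
    """Retry generate_content() with exponential backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return generate_content()  # the function defined above
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)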
Mistral: Cannot Access the Data
Mistral also offers a web search agent, but it cannot directly access a website given its URL:
from mistralai import Mistral

# Initialize the Mistral client
client = Mistral(api_key="your_mistral_api_key_here")

# Create the web search agent
websearch_agent = client.beta.agents.create(
    model="mistral-medium-2505",
    description="Agent able to search information over the web, such as news, weather, sport results...",
    name="Websearch Agent",
    instructions="You have the ability to perform web searches with `web_search` to find up-to-date information.",
    tools=[{"type": "web_search"}],
    completion_args={
        "temperature": 0.3,
        "top_p": 0.95,
    }
)

# Create a chat session with the agent
chat_response = client.beta.agents.complete(
    agent_id=websearch_agent.id,
    messages=[
        {
            "role": "user",
            "content": "Fetch the keyboard prices from this link only https://www.amazon.in/s?k=keyboards&crid=3HUDQNNRZ1M1R&sprefix=keyboar%2Caps%2C341&ref=nb_sb_noss_2"
        }
    ]
)

# Print the response
print(chat_response.choices[0].message.content)
Like Claude, it returns a result like this:
{
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "tool_calls": null,
        "content": "I can't directly fetch live data from external websites like Amazon due to technical limitations. However, I can guide you on how to extract keyboard prices from the Amazon USA search results page you provided.\n\n### Steps to Extract Keyboard Prices from Amazon USA:\n\n1. **Open the Amazon Link**:\n Visit: [Amazon USA Keyboards Search](https://www.amazon.in/s?k=keyboards&crid=3HUDQNNRZ1M1R&sprefix=keyboar%2Caps%2C341&ref=nb_sb_noss_2)\n\n2. **Inspect the Page**:\n - Right-click on the page and select **\"Inspect\"** (or press `F12`/`Ctrl+Shift+I`).\n - Go to the **\"Elements\"** tab to see the ..."
      }
    }
  ]
}
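When benchmarking several providers, it is handy to flag these refusals automatically instead of eyeballing every response. A crude heuristic sketch (the marker list is ad hoc, not an official API):

REFUSAL_MARKERS = (
    "i can't directly fetch",
    "i'm unable to fetch",
    "i apologize",
)

def is_refusal(text: str) -> bool:
    """Heuristically flag responses that decline instead of returning data."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# e.g. is_refusal(chat_response.choices[0].message.content) -> True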
Comparison Recap
Here's a table with all the services recapped:
Service | Speed | Quality of Response | PDF Support | Correct Result |
---|---|---|---|---|
Claude | - | No answer | ✅ | ❌ |
ScrapeGraphAI | 15s | Perfect | ✅ | ✅ |
Google Gemini | 3:09m | Good | ❌ | ❌ |
Mistral | - | No answer | ❌ | ❌ |
And here is the ranking (lower is better):
Service | Speed | Quality of Response |
---|---|---|
ScrapeGraphAI | 1 | 1 |
Google Gemini | 2 | 2 |
Claude | 3 | - |
Mistral | 4 | - |
Conclusions
Based on the tests in the previous sections, here are the main takeaways:
- Both the Claude and Google Gemini services are token-based, making it difficult to predict what you will spend at scale.
- Google Gemini cannot extract data from PDFs.
- ScrapeGraph AI delivers the best performance in terms of accuracy, reliability, and speed, and it provides a dashboard for reviewing results.
Frequently Asked Questions
What factors should I consider when choosing a web scraping solution?
Evaluate response time, data accuracy, pricing transparency, scalability, and compatibility with your target websites. Consider whether you need PDF processing capabilities and dashboard monitoring features.
Why do some scraping tools fail on major e-commerce sites?
E-commerce platforms implement sophisticated anti-scraping measures including IP blocking, CAPTCHA systems, and dynamic content loading. Tools must be specifically designed to handle these challenges.
What are web scraping tools and why do I need them?
Web scraping tools are automated solutions that extract data from websites, converting unstructured web content into structured, usable information. They're essential for businesses that need to gather market intelligence, monitor competitors, track prices, or collect data at scale without manual effort.
Why did Anthropic's fetch tool fail on Amazon?
Anthropic's web fetch tool couldn't access Amazon's content, likely due to the site's anti-scraping measures or the tool's limitations with dynamic content. The tool returned an apology message instead of the requested data, making it unsuitable for this type of e-commerce scraping.
What makes ScrapeGraph AI the best solution according to your tests?
ScrapeGraph AI excelled in several areas:
- Fastest performance: Completed the task in just 15 seconds
- Highest accuracy: Provided perfect, structured results with actual product names and prices
- Predictable pricing: Credit-based system tied to websites scraped, not tokens used
- Scalability: Easy to forecast costs for large-scale operations
- Additional features: Includes a dashboard and supports PDF extraction
How do pricing models differ between these tools?
- Token-based pricing (Claude, Google Gemini): Costs depend on input/output tokens, making it difficult to predict expenses
- Credit-based pricing (ScrapeGraph AI): Costs are tied to websites scraped, providing predictable scaling costs
What are the main limitations of built-in LLM web scraping tools?
Common limitations include:
- Inability to bypass anti-scraping measures
- Poor handling of dynamically loaded content
- Susceptibility to rate limiting
- Limited customization options
- Unpredictable token-based pricing
Can these tools handle JavaScript-heavy websites?
Most built-in tools struggle with JavaScript-heavy sites that require browser rendering. ScrapeGraph AI is specifically designed to handle dynamic content and complex JavaScript interactions.
What security considerations should I keep in mind?
Important security aspects include (the first two are sketched in code after this list):
- Respecting robots.txt files
- Implementing proper rate limiting
- Following website terms of service
- Using appropriate user agents
- Avoiding excessive server load
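As promised, here is a minimal sketch of robots.txt checking and rate limiting, using only the Python standard library (the domain, URLs, and user agent string are placeholders):

import time
import urllib.robotparser

# Check robots.txt before fetching, and pause between requests.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # placeholder domain
robots.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if not robots.can_fetch("MyScraperBot/1.0", url):
        print(f"Disallowed by robots.txt: {url}")
        continue
    # ... fetch the page here ...
    time.sleep(1.0)  # simple rate limit: at most one request per second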
How can I integrate these tools into my existing workflow?
Integration options include:
- API-based integration
- SDK libraries for popular languages
- Webhook support for real-time processing
- Dashboard monitoring and analytics
- Export capabilities for various formats
What are the legal considerations for web scraping?
Legal considerations include:
- Respecting website terms of service
- Following robots.txt guidelines
- Avoiding copyrighted content
- Implementing proper data protection measures
- Understanding jurisdiction-specific regulations
How do I choose the right tool for my specific use case?
Consider these factors:
- Target website complexity
- Required data volume
- Budget constraints
- Technical requirements
- Compliance needs
- Integration capabilities
Related Resources
Want to learn more about web scraping and AI-powered data extraction? Explore these guides:
- Web Scraping 101 - Master the basics of data collection
- AI Agent Web Scraping - Learn about AI-powered data extraction
- LlamaIndex Integration - Discover advanced data analysis techniques
- Building Intelligent Agents - Learn how to build AI agents for data analysis
- Pre-AI to Post-AI Scraping - See how AI has transformed data collection
- Structured Output - Master handling structured data
- Stock Analysis with AI - Learn about AI-powered financial analysis
- LinkedIn Lead Generation with AI - Discover AI-driven business intelligence
- Web Scraping Legality - Understand the legal aspects of data collection
- Data Innovation: 5 Ways to Transform Your Business in 2025 - Explore how advanced data collection technologies are revolutionizing business operations
These resources will help you understand how to leverage AI and modern tools for innovative data collection and analysis.