Recently, major AI companies like Google, Mistral, and Anthropic have released web search and fetch tools for extracting data from the internet. These solutions aim to simplify the process of extracting valuable information from the web, making it easily accessible and usable within various applications and services.
However, like any technology, they come with their own set of advantages and disadvantages. In this comprehensive review, we'll compare the various services available on the market, providing insights into how each one works.
The parameters considered for the comparison are: speed (in seconds), quality of the response, and price.
Let's break down the pros and cons of these services together!
Before Starting: Testing Methodology
Our benchmark evaluation employs standardized testing parameters to ensure objective comparison:
Test Parameters:
- Target: Amazon keyboard listings (high-complexity e-commerce site)
- Goal: Extract product prices and ratings
- URL: https://www.amazon.us/s?k=keyboards
- Evaluation Criteria: Response time, data accuracy, and extraction completeness
Amazon serves as an ideal benchmark due to its dynamic content loading, anti-scraping measures, and complex page structure.
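To keep the speed numbers comparable, each request can be timed end to end. Here is a minimal sketch of such a harness (the `fetch_fn` callable is a placeholder for whichever client call is under test, not part of any SDK):

import time

def timed_call(fetch_fn, *args, **kwargs):
    """Run a scraping call and report its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = fetch_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Completed in {elapsed:.1f}s")
    return result, elapsed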
Anthropic Fetch Tool: The Worst Performer
Let's start with the first tool. Using the Anthropic client, you can issue the request directly:
import anthropic

def fetch_keyboard_prices():
    """
    Fetch keyboard prices from Amazon using Anthropic's web fetch tool.

    Returns:
        anthropic.types.Message: Response from Claude with web fetch results
    """
    # Initialize the Anthropic client
    client = anthropic.Anthropic()

    # Create message with web fetch tool
    response = client.messages.create(
        model="claude-opus-4-1-20250805",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": "Please find the names and prices of keyboards from https://www.amazon.us/"
            }
        ],
        tools=[
            {
                "type": "web_fetch_20250910",
                "name": "web_fetch",
                "max_uses": 5
            }
        ],
        extra_headers={
            "anthropic-beta": "web-fetch-2025-09-10"
        }
    )
    return response

def main():
    """
    Main function to execute the web fetch operation and display results.
    """
    print("Fetching keyboard prices from Amazon...")

    # Execute the web fetch request
    response = fetch_keyboard_prices()

    if response:
        print("\n--- Response ---")
        print(f"Response ID: {response.id}")
        print(f"Model: {response.model}")
        print(f"Role: {response.role}")

        # Display content
        for content in response.content:
            if content.type == "text":
                print(f"\nContent:\n{content.text}")
            elif content.type == "tool_use":
                print(f"\nTool Used: {content.name}")
                print(f"Tool Input: {content.input}")
    else:
        print("Failed to fetch keyboard prices")

if __name__ == "__main__":
    main()
And here is the partial response:
{
  "id": "msg_01HEN3iocaJsASWgoSU6dWhF",
  "type": "message",
  "role": "assistant",
  "model": "claude-opus-4-1-20250805",
  "content": [
    {
      "type": "text",
      "text": "I apologize, but I'm unable to fetch the Amazon page directly at this moment. However, I can provide you with some helpful information about finding keyboard prices on Amazon..."
    }
  ]
}
As you can see, Anthropic is unable to access Amazon and returns an apology message instead of the requested data.
PDF Handling
Anthropic can also fetch PDFs using the same client:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Fetch the markdown from this link https://storage.dtelab.com.ar/uploads/2023/02/short-stories-for-children-ingles-primaria-continuemos-estudiando.pdf"
        }
    ],
    tools=[
        {
            "type": "web_fetch_20250910",
            "name": "web_fetch",
            "max_uses": 5
        }
    ],
    extra_headers={
        "anthropic-beta": "web-fetch-2025-09-10"
    }
)

print(response)
This snippet took 75 seconds to fetch and convert the PDF.
ScrapeGraph AI: The Best Solution
With ScrapeGraph AI, the same extraction takes just a few lines of code:
from scrapegraph_py import Client

sgai_client = Client(api_key="your_sgai_api_key")

response = sgai_client.smartscraper(
    website_url="https://www.amazon.it/s?k=keyboards",
    user_prompt="Get the keyboard prices and rating"
)

print(f"Results: {response.result}")
The result is the most accurate of the four services, and it took just 15 seconds!
Here is a snippet of the results:
{
  "keyboard_prices": [
    {
      "product": "Logitech G213 Prodigy Gaming Cablata",
      "price": "36.87 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard for Windows",
      "price": "26.57 $"
    },
    {
      "product": "Basic Keyboard",
      "price": "12.34 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard with Cable Business Edition",
      "price": "13.29 $"
    },
    {
      "product": "Ewent Professional USB Wired Keyboard Italian Layout QWERTY",
      "price": "11.00 $"
    }
  ]
}
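Because the response is already structured JSON, post-processing takes only a few lines. As a quick illustration (this helper is hypothetical, not part of the SDK), the price strings above can be converted to floats for sorting or analysis:

def parse_prices(result: dict) -> list[tuple[str, float]]:
    """Convert price strings like '36.87 $' into floats, cheapest first."""
    items = []
    for entry in result["keyboard_prices"]:
        price = float(entry["price"].replace("$", "").strip())
        items.append((entry["product"], price))
    return sorted(items, key=lambda item: item[1])

# e.g. parse_prices(response.result)[0] -> ('Ewent Professional USB Wired Keyboard Italian Layout QWERTY', 11.0)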
ScrapeGraph AI can also fetch PDFs from the internet using the markdownify endpoint:
from scrapegraph_py import Client

# Initialize the client
sgai_client = Client(api_key="sgai-api-key")

# Markdownify request for the PDF URL
response = sgai_client.markdownify(
    website_url="https://storage.dtelab.com.ar/uploads/2023/02/short-stories-for-children-ingles-primaria-continuemos-estudiando.pdf"
)

print("Result:", response.result)
The most interesting aspect of ScrapeGraph AI is that its pricing is credit-based per scraped website rather than token-based, so if you need to scale the service, it's much easier to forecast how many credits your agent will consume.
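Since the unit of billing is a request rather than a token count, capacity planning reduces to simple arithmetic. A quick sketch (the per-page credit cost below is a made-up placeholder; see the pricing link that follows for real figures):

def forecast_credits(pages_per_day: int, credits_per_page: int, days: int = 30) -> int:
    """Estimate total credits for a scraping workload.

    credits_per_page is illustrative only -- check the pricing page
    for the real per-request cost of each endpoint.
    """
    return pages_per_day * credits_per_page * days

# e.g. 500 pages/day at a hypothetical 10 credits each
print(forecast_credits(pages_per_day=500, credits_per_page=10))  # 150000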
For more information about the pricing, take a look here: https://scrapegraphai.com/pricing
You can also inspect the requests made by your scripts in the dashboard.
Google Gemini: Sometimes It Works
Google Gemini presents a mixed performance profile in our web scraping evaluation, demonstrating both the potential and limitations of search-integrated AI tools.
Implementation Requirements: To conduct this assessment, we used Google's official Gemini SDK (the google-genai package, installable with pip install google-genai):
from google import genai
from google.genai import types

def generate_content():
    # Configure the client
    client = genai.Client()

    # Define the grounding tool
    grounding_tool = types.Tool(
        google_search=types.GoogleSearch()
    )

    # Configure generation settings
    config = types.GenerateContentConfig(
        tools=[grounding_tool]
    )

    # Make the request with a single content item
    contents = [
        {
            "parts": [
                {"text": "Get the keyboard prices and rating from this url https://www.amazon.it"}
            ]
        }
    ]

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=contents,
        config=config,
    )

    # Print the grounded response
    print(response.text)
    return response

generate_content()
On the first attempt it did not work, but after a few retries it returned a correct response in 3 minutes and 9 seconds!
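Since the tool only succeeds intermittently, it helps to wrap the call in a retry loop with exponential backoff. A minimal sketch (assuming failures surface as exceptions; an empty or unhelpful response would need an explicit check on top of this):

import time

def generate_with_retries(max_attempts: int = 3, base_delay: float = 2.0):
    """Retry generate_content() with exponential backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return generate_content()  # the function defined above
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)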
Mistral: Cannot Access the Data
Mistral also offers a web search agent, but it cannot directly access a website given its URL:
from mistralai import Mistral

# Initialize the Mistral client
client = Mistral(api_key="your_mistral_api_key_here")

# Create the web search agent
websearch_agent = client.beta.agents.create(
    model="mistral-medium-2505",
    description="Agent able to search information over the web, such as news, weather, sport results...",
    name="Websearch Agent",
    instructions="You have the ability to perform web searches with `web_search` to find up-to-date information.",
    tools=[{"type": "web_search"}],
    completion_args={
        "temperature": 0.3,
        "top_p": 0.95,
    }
)

# Create a chat session with the agent
chat_response = client.beta.agents.complete(
    agent_id=websearch_agent.id,
    messages=[
        {
            "role": "user",
            "content": "Fetch the keyboard prices from this link only https://www.amazon.in/s?k=keyboards&crid=3HUDQNNRZ1M1R&sprefix=keyboar%2Caps%2C341&ref=nb_sb_noss_2"
        }
    ]
)

# Print the response
print(chat_response.choices[0].message.content)
Like Claude, it returns a result like this:
{
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "tool_calls": null,
        "content": "I can't directly fetch live data from external websites like Amazon due to technical limitations. However, I can guide you on how to extract keyboard prices from the Amazon USA search results page you provided.\n\n### Steps to Extract Keyboard Prices from Amazon USA:\n\n1. **Open the Amazon Link**:\n Visit: [Amazon USA Keyboards Search](https://www.amazon.in/s?k=keyboards&crid=3HUDQNNRZ1M1R&sprefix=keyboar%2Caps%2C341&ref=nb_sb_noss_2)\n\n2. **Inspect the Page**:\n - Right-click on the page and select **\"Inspect\"** (or press `F12`/`Ctrl+Shift+I`).\n - Go to the **\"Elements\"** tab to see the ..."
      }
    }
  ]
}
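When benchmarking several providers, it is handy to flag these refusals automatically instead of eyeballing every response. A crude heuristic sketch (the marker list is ad hoc, not an official API):

REFUSAL_MARKERS = (
    "i can't directly fetch",
    "i'm unable to fetch",
    "i apologize",
)

def is_refusal(text: str) -> bool:
    """Heuristically flag responses that decline instead of returning data."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# e.g. is_refusal(chat_response.choices[0].message.content) -> True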
Comparison Recap
Here's a table with all the services recapped:
Service | Speed | Quality of Response | PDF Support | Correct Result |
---|---|---|---|---|
Claude | - | No answer | ✅ | ❌ |
ScrapeGraphAI | 15s | Perfect | ✅ | ✅ |
Google Gemini | 3:09m | Good | ❌ | ❌ |
Mistral | - | No answer | ❌ | ❌ |
And here is the ranking (lower is better):
Service | Speed | Quality of Response |
---|---|---|
ScrapeGraphAI | 1 | 1 |
Google Gemini | 2 | 2 |
Claude | 3 | - |
Mistral | 4 | - |
Conclusions
Based on the tests in the previous sections, here are the main takeaways:
- Both the Claude and Google Gemini services are token-based, making it difficult to predict what you will spend at scale.
- Google Gemini cannot extract data from PDFs.
- ScrapeGraph AI delivers the best performance in terms of accuracy, reliability, and speed, and it provides a dashboard for reviewing results.
Frequently Asked Questions
What factors should I consider when choosing a web scraping solution?
Evaluate response time, data accuracy, pricing transparency, scalability, and compatibility with your target websites. Consider whether you need PDF processing capabilities and dashboard monitoring features.
Why do some scraping tools fail on major e-commerce sites?
E-commerce platforms implement sophisticated anti-scraping measures including IP blocking, CAPTCHA systems, and dynamic content loading. Tools must be specifically designed to handle these challenges.
What are web scraping tools and why do I need them?
Web scraping tools are automated solutions that extract data from websites, converting unstructured web content into structured, usable information. They're essential for businesses that need to gather market intelligence, monitor competitors, track prices, or collect data at scale without manual effort.
Why did Anthropic's fetch tool fail on Amazon?
Anthropic's web fetch tool couldn't access Amazon's content, likely due to the site's anti-scraping measures or the tool's limitations with dynamic content. The tool returned an apology message instead of the requested data, making it unsuitable for this type of e-commerce scraping.
What makes ScrapeGraph AI the best solution according to your tests?
ScrapeGraph AI excelled in several areas:
- Fastest performance: Completed the task in just 15 seconds
- Highest accuracy: Provided perfect, structured results with actual product names and prices
- Predictable pricing: Credit-based system tied to websites scraped, not tokens used
- Scalability: Easy to forecast costs for large-scale operations
- Additional features: Includes a dashboard and supports PDF extraction
How do pricing models differ between these tools?
- Token-based pricing (Claude, Google Gemini): Costs depend on input/output tokens, making it difficult to predict expenses
- Credit-based pricing (ScrapeGraph AI): Costs are tied to websites scraped, providing predictable scaling costs
What are the main limitations of built-in LLM web scraping tools?
Common limitations include:
- Inability to bypass anti-scraping measures
- Poor handling of dynamically loaded content
- Susceptibility to rate limiting
- Limited customization options
- Unpredictable token-based pricing
Can these tools handle JavaScript-heavy websites?
Most built-in tools struggle with JavaScript-heavy sites that require browser rendering. ScrapeGraph AI is specifically designed to handle dynamic content and complex JavaScript interactions.
What security considerations should I keep in mind?
Important security aspects include (the first two are sketched in code after this list):
- Respecting robots.txt files
- Implementing proper rate limiting
- Following website terms of service
- Using appropriate user agents
- Avoiding excessive server load
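As promised, here is a minimal sketch of robots.txt checking and rate limiting, using only the Python standard library (the domain, URLs, and user agent string are placeholders):

import time
import urllib.robotparser

# Check robots.txt before fetching, and pause between requests.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # placeholder domain
robots.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if not robots.can_fetch("MyScraperBot/1.0", url):
        print(f"Disallowed by robots.txt: {url}")
        continue
    # ... fetch the page here ...
    time.sleep(1.0)  # simple rate limit: at most one request per second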
How can I integrate these tools into my existing workflow?
Integration options include:
- API-based integration
- SDK libraries for popular languages
- Webhook support for real-time processing
- Dashboard monitoring and analytics
- Export capabilities for various formats
What are the legal considerations for web scraping?
Legal considerations include:
- Respecting website terms of service
- Following robots.txt guidelines
- Avoiding copyrighted content
- Implementing proper data protection measures
- Understanding jurisdiction-specific regulations
How do I choose the right tool for my specific use case?
Consider these factors:
- Target website complexity
- Required data volume
- Budget constraints
- Technical requirements
- Compliance needs
- Integration capabilities
Related Resources
Want to learn more about web scraping and AI-powered data extraction? Explore these guides:
- Web Scraping 101 - Master the basics of data collection
- AI Agent Web Scraping - Learn about AI-powered data extraction
- LlamaIndex Integration - Discover advanced data analysis techniques
- Building Intelligent Agents - Learn how to build AI agents for data analysis
- Pre-AI to Post-AI Scraping - See how AI has transformed data collection
- Structured Output - Master handling structured data
- Stock Analysis with AI - Learn about AI-powered financial analysis
- LinkedIn Lead Generation with AI - Discover AI-driven business intelligence
- Web Scraping Legality - Understand the legal aspects of data collection
- Data Innovation: 5 Ways to Transform Your Business in 2025 - Explore how advanced data collection technologies are revolutionizing business operations
These resources will help you understand how to leverage AI and modern tools for innovative data collection and analysis.