ScrapeGraphAI

Building Custom Web Scraping Agents with LLMs: Comparing LangGraph, OpenAI Library, and Local Models using Mistral

Marco Vinciguerra

Three weeks ago on this blog we ran a benchmark comparing the built-in fetching tools of the major LLM providers.

The conclusion was that LLM providers, on their own, are not suited for extracting data from the web.

Starting from those results, one question stuck in my mind: "Is it possible to build a custom LLM-based agent that extracts correct and precise data from the web?".

In this tutorial we are going to do exactly that, using the most popular frameworks for building agents.

We will test LangGraph, the OpenAI library, and a local model served through Ollama, to make a fair comparison between them.

Before starting

Here's a quick recap of that post:

Service         Speed    Quality of Response
Claude          -        No answer
ScrapeGraphAI   15s      Perfect
Google          3.09m    Good
Mistral         -        No answer

Here is a breakdown of the results:

  • Mistral and Claude did not provide an answer
  • Google took 3.09 minutes
  • ScrapeGraphAI took just 15 seconds

In this week's blog post we are going to build an agent around Mistral to unlock:

  • search capabilities
  • scraping capabilities
  • PDF fetching capabilities

To stay consistent with the previous blog post, we will keep using the same link: https://www.amazon.it/s?k=keyboards.

Comparison goals: The approaches will be compared on the following criteria, with response time measured end to end (see the timing sketch after the list):

  • Quality of the response
  • Response time
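Response time here is plain wall-clock measurement around each agent call. A minimal sketch of how such a timing can be taken (the timed helper is illustrative, not part of any library):

import time
 
def timed(fn, *args, **kwargs):
    """Run fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
 
# Example: result, elapsed = timed(agent.invoke, {"messages": [...]})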

Case 1: LangGraph

Let's start with LangGraph, the most popular framework for building agents.

The first step is to create the tools the agent will use.

Search tool

from scrapegraph_py import Client
from langchain_core.tools import tool
 
@tool
def search_scraper_tool(user_prompt: str, num_results: int = 3) -> dict:
    """Perform AI-powered web searches with structured results."""
    try:
        sgai_client = Client(api_key="sgai-***")
        response = sgai_client.searchscraper(
            user_prompt=user_prompt,
            num_results=num_results
        )
        return {"success": True, "results": response.result}
    except Exception as e:
        return {"success": False, "error": str(e)}

Scraping tool

from scrapegraph_py import Client
from langchain_core.tools import tool
 
@tool
def smart_scraper_tool(website_url: str, user_prompt: str) -> dict:
    """Extract structured data from a webpage using AI-powered scraping."""
    try:
        sgai_client = Client(api_key="sgai-***")
        response = sgai_client.smartscraper(
            website_url=website_url,
            user_prompt=user_prompt
        )
        return {"success": True, "data": response.result, "source_url": website_url}
    except Exception as e:
        return {"success": False, "error": str(e)}

PDF fetching tool

from scrapegraph_py import Client
from langchain_core.tools import tool
 
@tool
def markdownify_tool(website_url: str) -> dict:
    """Convert a webpage or PDF into clean, formatted markdown."""
    try:
        sgai_client = Client(api_key="sgai-***")
        response = sgai_client.markdownify(website_url=website_url)
        return {"success": True, "markdown": response.result, "source_url": website_url}
    except Exception as e:
        return {"success": False, "error": str(e)}

Agent definition with main

import os
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_mistralai import ChatMistralAI
 
class AgentState(TypedDict):
    # add_messages appends new messages to the state instead of overwriting it
    messages: Annotated[list, add_messages]
 
tools = [smart_scraper_tool, markdownify_tool, search_scraper_tool]
 
def agent_node(state):
    """Main agent node: the LLM with all tools bound"""
    llm = ChatMistralAI(
        api_key=os.getenv("MISTRAL_API_KEY", "your_mistral_key_here"),
        model="mistral-large-latest",
        temperature=0.1
    )
    llm_with_tools = llm.bind_tools(tools)
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}
 
def should_continue(state):
    """Route to the tool node if the last message requested a tool call"""
    last_message = state["messages"][-1]
    return "tools" if getattr(last_message, "tool_calls", None) else END
 
def create_agent():
    """Create a LangGraph agent that loops between the LLM and its tools"""
    workflow = StateGraph(AgentState)
    workflow.add_node("agent", agent_node)
    workflow.add_node("tools", ToolNode(tools))
    workflow.set_entry_point("agent")
    # Without this loop, the LLM's tool calls would never actually be executed
    workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    workflow.add_edge("tools", "agent")
    return workflow.compile()
 
def main():
    agent = create_agent()
    
    # System prompt
    system_msg = SystemMessage(content="""You are an AI assistant with web scraping capabilities.
    Use smart_scraper_tool for data extraction, markdownify_tool for PDFs, and search_scraper_tool for searches.""")
    
    # Example usage
    examples = [
        "Scrape https://www.amazon.it/s?k=keyboards and get prices",
    ]
    
    for query in examples:
        print(f"Query: {query}")
        result = agent.invoke({
            "messages": [system_msg, HumanMessage(content=query)]
        })
        print(f"Response: {result['messages'][-1].content}\n")
 
if __name__ == "__main__":
    main()

Results

The script execution took a total of 27.3 seconds.

The agent successfully understood the task and called the appropriate scraping tool. The response quality was excellent, providing well-structured data with accurate pricing information. The execution time includes the initial API calls to Mistral, the tool invocation overhead, and the ScrapeGraphAI processing time.

Here is a snippet of the result:

{
  "keyboard_prices": [
    {
      "product": "Logitech G213 Prodigy Gaming Cablata",
      "price": "36.87 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard for Windows",
      "price": "26.57 $"
    },
    {
      "product": "Basic Keyboard",
      "price": "12.34 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard with Cable Business Edition",
      "price": "13.29 $"
    },
    {
      "product": "Ewent Professional USB Wired Keyboard Italian Layout QWERTY",
      "price": "11.00 $"
    }
  ]
}

Case 2: Custom Agent with OpenAI Library

One of the few distinctive features LangGraph provides is tracing of your agent.

If you don't need tracing, there is little reason to use it: it is a heavy library with many features most projects never touch.

So let's start with the definition of the tools that will be used inside the agent.

Search tool

from scrapegraph_py import Client
 
def search_scraper_tool(user_prompt: str, num_results: int = 3) -> dict:
    """Perform AI-powered web searches with structured results."""
    try:
        sgai_client = Client(api_key="sgai-***")
        response = sgai_client.searchscraper(
            user_prompt=user_prompt,
            num_results=num_results
        )
        return {"success": True, "results": response.result}
    except Exception as e:
        return {"success": False, "error": str(e)}

Scraping tool

from scrapegraph_py import Client
 
def smart_scraper_tool(website_url: str, user_prompt: str) -> dict:
    """Extract structured data from a webpage using AI-powered scraping."""
    try:
        sgai_client = Client(api_key="sgai-***")
        response = sgai_client.smartscraper(
            website_url=website_url,
            user_prompt=user_prompt
        )
        return {"success": True, "data": response.result, "source_url": website_url}
    except Exception as e:
        return {"success": False, "error": str(e)}

PDF fetching tool

from scrapegraph_py import Client
 
def markdownify_tool(website_url: str) -> dict:
    """Convert a webpage or PDF into clean, formatted markdown."""
    try:
        sgai_client = Client(api_key="sgai-***")
        response = sgai_client.markdownify(website_url=website_url)
        return {"success": True, "markdown": response.result, "source_url": website_url}
    except Exception as e:
        return {"success": False, "error": str(e)}

Agent definition

import os
import json
from typing import Any, Dict
from openai import OpenAI
 
class MistralAgent:
    def __init__(self):
        self.client = OpenAI(
            api_key=os.getenv("MISTRAL_API_KEY", "your_mistral_key_here"),
            base_url="https://api.mistral.ai/v1"
        )
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": "smart_scraper_tool",
                    "description": "Extract structured data from a webpage using AI-powered scraping",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "website_url": {"type": "string", "description": "URL to scrape"},
                            "user_prompt": {"type": "string", "description": "What to extract"}
                        },
                        "required": ["website_url", "user_prompt"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "search_scraper_tool",
                    "description": "Perform AI-powered web searches with structured results",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "user_prompt": {"type": "string", "description": "Search query"},
                            "num_results": {"type": "integer", "description": "Number of results", "default": 3}
                        },
                        "required": ["user_prompt"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "markdownify_tool",
                    "description": "Convert a webpage or PDF into clean formatted markdown",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "website_url": {"type": "string", "description": "URL to convert"}
                        },
                        "required": ["website_url"]
                    }
                }
            }
        ]
    
    def call_mistral(self, messages, tools=None):
        """Call Mistral API with optional tools"""
        try:
            response = self.client.chat.completions.create(
                model="mistral-large-latest",
                messages=messages,
                tools=tools,
                temperature=0.1
            )
            return response
        except Exception as e:
            return {"error": str(e)}
    
    def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Execute the appropriate tool"""
        if tool_name == "smart_scraper_tool":
            return smart_scraper_tool(**arguments)
        elif tool_name == "search_scraper_tool":
            return search_scraper_tool(**arguments)
        elif tool_name == "markdownify_tool":
            return markdownify_tool(**arguments)
        else:
            return {"error": f"Unknown tool: {tool_name}"}
    
    def run(self, user_message: str) -> str:
        """Run the agent with user message"""
        messages = [
            {"role": "user", "content": user_message}
        ]
        
        # Get initial response from model
        response = self.call_mistral(messages, self.tools)
        
        # call_mistral returns a plain dict when the API call fails
        if isinstance(response, dict) and "error" in response:
            return f"Error: {response['error']}"
        
        # Check if model wants to use a tool
        if response.choices[0].message.tool_calls:
            tool_call = response.choices[0].message.tool_calls[0]
            tool_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            
            # Execute the tool
            tool_result = self.execute_tool(tool_name, arguments)
            
            # Add tool result to conversation and get final response
            # (content can be None on tool-call messages, so fall back to "")
            messages.append({"role": "assistant", "content": response.choices[0].message.content or ""})
            messages.append({
                "role": "user",
                "content": f"Tool result: {json.dumps(tool_result)}. Please provide a human-readable summary."
            })
            
            final_response = self.call_mistral(messages)
            
            if isinstance(final_response, dict) and "error" in final_response:
                return f"Error: {final_response['error']}"
            
            return final_response.choices[0].message.content
        
        else:
            return response.choices[0].message.content
 
def main():
    agent = MistralAgent()
    
    # Example usage
    examples = [
        "Scrape https://www.amazon.it/s?k=keyboards and get keyboard prices",
    ]
    
    for query in examples:
        print(f"User: {query}")
        response = agent.run(query)
        print(f"Assistant: {response}\n" + "="*60 + "\n")
 
if __name__ == "__main__":
    main()

Results

The script execution took a total of 22.8 seconds.

The OpenAI library approach proved to be more efficient than LangGraph. The implementation is more straightforward, with less overhead from the framework. The agent correctly identified the need to use the smart_scraper_tool and executed it efficiently. The result quality was identical to the LangGraph implementation, but with faster overall execution time. This is primarily due to the lighter weight of the OpenAI library compared to the more feature-rich LangGraph framework.

Here's an excerpt of the result:


{
  "keyboard_prices": [
    {
      "product": "Logitech G213 Prodigy Gaming Cablata",
      "price": "36.87 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard for Windows", 
      "price": "26.57 $"
    },
    {
      "product": "Basic Keyboard",
      "price": "12.34 $"
    }
  ]
}

Case 3: Local Agent with Ollama

For users concerned about privacy and operational costs, using a local model through Ollama offers an interesting alternative. While inference speed may be slower, you gain complete control and privacy.
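Before running the agent, Ollama must be serving locally (started with ollama serve) and the mistral model must be pulled (ollama pull mistral). A quick readiness check, sketched against Ollama's /api/tags endpoint, which lists the locally available models:

import requests
 
def ollama_ready(url: str = "http://localhost:11434", model: str = "mistral") -> bool:
    """Return True if Ollama responds and the requested model is pulled locally."""
    try:
        tags = requests.get(f"{url}/api/tags", timeout=5).json()
        return any(m["name"].startswith(model) for m in tags.get("models", []))
    except requests.RequestException:
        return False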

Local Mistral Agent Implementation

import json
import requests
from typing import Any, Dict
 
class LocalMistralAgent:
    def __init__(self, ollama_url: str = "http://localhost:11434"):
        self.ollama_url = ollama_url
        self.model = "mistral"
        self.tools = [
            {"name": "smart_scraper_tool", "description": "Extract structured data from webpages"},
            {"name": "search_scraper_tool", "description": "Perform web searches"},
            {"name": "markdownify_tool", "description": "Convert webpages/PDFs to markdown"}
        ]
    
    def call_ollama(self, messages, tools=None):
        """Call local Ollama model"""
        try:
            if tools:
                # Describe the tools and the expected JSON call format in the first
                # message, so parse_response below can recognize tool calls
                tools_description = "\n".join([f"- {t['name']}: {t['description']}" for t in tools])
                messages[0]["content"] += (
                    f"\n\nAvailable tools:\n{tools_description}\n"
                    'To use a tool, reply with JSON only: '
                    '{"action": "tool_call", "tool": "<tool name>", "arguments": {...}}'
                )
            
            payload = {
                "model": self.model,
                "messages": messages,
                "stream": False,
                # Ollama expects sampling parameters under "options"
                "options": {"temperature": 0.1}
            }
            
            response = requests.post(f"{self.ollama_url}/api/chat", json=payload, timeout=300)
            
            if response.status_code == 200:
                return response.json()
            else:
                return {"error": f"HTTP {response.status_code}: {response.text}"}
        except Exception as e:
            return {"error": str(e)}
    
    def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Execute the appropriate tool"""
        if tool_name == "smart_scraper_tool":
            return smart_scraper_tool(**arguments)
        elif tool_name == "search_scraper_tool":
            return search_scraper_tool(**arguments)
        elif tool_name == "markdownify_tool":
            return markdownify_tool(**arguments)
        else:
            return {"error": f"Unknown tool: {tool_name}"}
    
    def parse_response(self, response_content: str) -> Dict[Any, Any]:
        """Parse the model response to extract tool calls or direct responses"""
        try:
            parsed = json.loads(response_content)
            return parsed
        except json.JSONDecodeError:
            return {"action": "response", "content": response_content}
    
    def run(self, user_message: str) -> str:
        """Run the agent with user message"""
        messages = [
            {"role": "user", "content": user_message}
        ]
        
        # Get initial response from model
        response = self.call_ollama(messages, self.tools)
        
        if "error" in response:
            return f"Error: {response['error']}"
        
        model_response = response["message"]["content"]
        parsed_response = self.parse_response(model_response)
        
        # Check if model wants to use a tool
        if parsed_response.get("action") == "tool_call":
            tool_name = parsed_response.get("tool")
            arguments = parsed_response.get("arguments", {})
            
            # Execute the tool
            tool_result = self.execute_tool(tool_name, arguments)
            
            # Add tool result to conversation and get final response
            messages.extend([
                {"role": "assistant", "content": model_response},
                {"role": "user", "content": f"Tool result: {json.dumps(tool_result)}. Please provide a human-readable summary."}
            ])
            
            final_response = self.call_ollama(messages, tools=None)
            
            if "error" in final_response:
                return f"Error in final response: {final_response['error']}"
            
            return final_response["message"]["content"]
        
        elif parsed_response.get("action") == "response":
            return parsed_response.get("content", model_response)
        
        else:
            return model_response
 
def main():
    agent = LocalMistralAgent()
    
    # Test connectivity to Ollama
    try:
        test_response = agent.call_ollama([{"role": "user", "content": "Hello, are you working?"}])
        if "error" in test_response:
            print(f"Error connecting to Ollama: {test_response['error']}")
            print("Make sure Ollama is running with: ollama serve")
            return
    except Exception as e:
        print(f"Failed to connect to Ollama: {e}")
        return
    
    # Example usage
    examples = [
        "Scrape https://www.amazon.it/s?k=keyboards and get the keyboard prices",
    ]
    
    for query in examples:
        print(f"User: {query}")
        response = agent.run(query)
        print(f"Assistant: {response}\n" + "="*60 + "\n")
 
if __name__ == "__main__":
    main()

Results

The script execution took a total of 54.2 seconds.

The result is complete and accurate, although the inference speed is lower compared to cloud models. The local Mistral model successfully understood the task and executed the appropriate tools. The longer execution time is expected when running on local hardware compared to optimized cloud infrastructure. However, this approach offers superior privacy and eliminates recurring API costs. Here's an excerpt of the result:

{
  "keyboard_prices": [
    {
      "product": "Logitech G213 Prodigy Gaming Cablata",
      "price": "36.87 $"
    },
    {
      "product": "Logitech K120 Wired Keyboard for Windows", 
      "price": "26.57 $"
    },
    {
      "product": "Basic Keyboard",
      "price": "12.34 $"
    }
  ]
}

Advantages of Local Models

  • Privacy: All data remains on your hardware
  • Costs: No API call costs after initial hardware investment
  • Control: Complete control over model versions and configurations
  • Availability: Works offline (except for scraping tools)

Disadvantages

  • Performance: Slower inference speed compared to cloud models
  • Quality: Potentially lower quality compared to larger models
  • Resources: Requires adequate local hardware (GPU recommended)
  • Complexity: More complex setup compared to cloud solutions

Conclusions

Comparison Table

Looking at the results from the previous sections, here is a breakdown of all the approaches:

Service                   Response time   Quality of Response
Mistral (standalone)      -               No answer
Mistral with LangGraph    27.3s           Perfect
Mistral with OpenAI lib   22.8s           Perfect
Mistral with Ollama       54.2s           Perfect

Key Findings

As the results show, the Mistral API without external tools cannot access websites or PDFs on its own. However, when integrated with specialized agents and tools, it becomes a powerful solution for web scraping and data extraction.

The OpenAI library approach provides the best balance between performance and simplicity, completing the task in just 22.8 seconds with minimal overhead. LangGraph offers additional features for complex workflows at the cost of slightly increased execution time. The local Ollama approach trades speed for privacy and cost efficiency, making it suitable for scenarios where data security is paramount.

All three implementations successfully extracted structured data from the website with perfect accuracy, demonstrating that agent-based approaches effectively overcome the limitations of standalone LLM providers.

Recommendations

For production environments prioritizing speed and minimal complexity, use the OpenAI library approach. For complex multi-step workflows with debugging requirements, LangGraph is ideal. For privacy-sensitive applications where cost is less critical, deploy a local Ollama solution.

FAQ

Q: What is ScrapeGraphAI and why do you use it for the tools?

A: ScrapeGraphAI is a specialized service for AI-powered web scraping. We use it because, as shown in our previous benchmark, traditional LLM providers like Claude and Mistral alone couldn't effectively extract data from web sources. ScrapeGraphAI provides the missing capabilities for search, scraping, and PDF processing.

Q: Do I need a ScrapeGraphAI API key to run these examples?

A: Yes, you'll need a ScrapeGraphAI API key. Replace "sgai-***" in the code examples with your actual API key. You can get one from their website.
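Rather than hardcoding the key, it is safer to read it from an environment variable. A minimal sketch (the variable name SGAI_API_KEY is just a convention, not required by the SDK):

import os
from scrapegraph_py import Client
 
# Fail fast if the key is missing instead of sending requests with a placeholder
api_key = os.environ.get("SGAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the SGAI_API_KEY environment variable first")
sgai_client = Client(api_key=api_key)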

Q: Why did Mistral fail in the original benchmark but succeed with agents?

A: Mistral (and other LLMs) failed because they lack built-in web scraping capabilities. By creating agents that use specialized tools like ScrapeGraphAI, we give Mistral the ability to interact with web content, PDFs, and perform searches effectively.

Technical Implementation

Q: Which approach should I choose: LangGraph or OpenAI lib?

A: Choose based on your needs:

  • LangGraph: Better for complex workflows, built-in tracing, and debugging capabilities
  • OpenAI lib: Lighter weight, simpler implementation, better performance if you don't need advanced features
  • Ollama: Best for privacy-focused applications where speed is not critical

Q: Can I use other LLM providers instead of Mistral?

A: Yes! For the OpenAI lib approach, you can easily switch to any OpenAI-compatible API by changing the base_url. For LangGraph, you can use any LangChain-supported LLM provider. For Ollama, you can use any model available on their repository.
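For example, with the OpenAI lib approach the switch usually amounts to changing base_url and the model name. A sketch with placeholder values:

import os
from openai import OpenAI
 
# Placeholder endpoint and model: substitute your provider's actual values
client = OpenAI(
    api_key=os.getenv("PROVIDER_API_KEY"),
    base_url="https://api.your-provider.example/v1"
)
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)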

Q: How do I handle rate limits and errors?

A: Both implementations include basic error handling in the tool functions. For production use, consider adding (a retry sketch follows the list):

  • Retry logic with exponential backoff
  • Rate limiting mechanisms
  • Proper logging and monitoring
  • Circuit breakers for graceful degradation
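As a sketch of the first point, a small retry wrapper with exponential backoff and jitter (the helper name is illustrative):

import time
import random
 
def with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call fn, retrying with exponential backoff and jitter on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Wait roughly 1s, 2s, 4s, ... plus jitter between attempts
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
 
# Usage: result = with_retries(lambda: smart_scraper_tool(website_url=url, user_prompt=prompt))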

Q: Are there any costs involved?

A: Yes, you'll need to pay for:

  • Mistral API calls (or your chosen LLM provider)
  • ScrapeGraphAI API usage
  • Any additional services the tools access
  • Infrastructure costs for local models (electricity, hardware maintenance)

Customization and Extensions

Q: Can I add more tools to the agent?

A: Absolutely! All implementations are designed to be extensible (an example tool follows the list). Simply:

  1. Create new tool functions following the same pattern
  2. Add them to the tools list
  3. Update the system prompt to describe the new capabilities
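For instance, a new tool only needs to follow the same success/error dict pattern as the existing ones. A sketch of a hypothetical tool for the LangGraph version:

from langchain_core.tools import tool
 
@tool
def word_count_tool(text: str) -> dict:
    """Count the words in a piece of text (hypothetical example tool)."""
    try:
        return {"success": True, "word_count": len(text.split())}
    except Exception as e:
        return {"success": False, "error": str(e)}
 
# Then add it to the list: tools = [..., word_count_tool]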

Q: How can I improve the agent's performance?

A: Consider these optimizations (a caching sketch follows the list):

  • Adjust the temperature parameter for more consistent results
  • Fine-tune your system prompts with examples
  • Implement caching for repeated requests
  • Add input validation and sanitization
  • Use batch processing for multiple queries
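As a sketch of the caching point, a minimal in-memory cache keyed on the tool arguments (shown against the plain-function tool from the OpenAI lib section; production code would also want TTLs and size limits):

_cache = {}
 
def cached_smart_scraper(website_url: str, user_prompt: str) -> dict:
    """Reuse previous results for repeated (url, prompt) pairs."""
    key = (website_url, user_prompt)
    if key not in _cache:
        _cache[key] = smart_scraper_tool(website_url=website_url, user_prompt=user_prompt)
    return _cache[key]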

Q: Can these agents handle multiple concurrent requests?

A: The basic implementations are synchronous. For production use with concurrent requests, consider the following (a minimal async sketch follows the list):

  • Using async/await patterns
  • Implementing connection pooling
  • Adding queue management for heavy workloads
  • Deploying with load balancing
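A minimal async sketch, assuming the MistralAgent class above, that runs several synchronous agent.run calls concurrently in worker threads:

import asyncio
 
async def run_many(agent, queries):
    """Run multiple blocking agent.run calls concurrently via worker threads."""
    # asyncio.to_thread requires Python 3.9+
    tasks = [asyncio.to_thread(agent.run, q) for q in queries]
    return await asyncio.gather(*tasks)
 
# Usage: results = asyncio.run(run_many(MistralAgent(), ["query 1", "query 2"]))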

Troubleshooting

Q: The agent isn't calling tools when expected. What's wrong?

A: Common issues:

  • Check that your system prompt clearly describes when to use tools
  • Ensure your user queries are specific enough to trigger tool usage
  • Verify API keys are correctly set
  • Make sure tool schemas match the function signatures

Q: How do I debug what the agent is doing?

A: LangGraph provides built-in tracing. For the OpenAI lib approach, add logging to track (a logging sketch follows the list):

  • Messages sent to the LLM
  • Tool calls and their arguments
  • Tool execution results
  • Response times for each component
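A minimal logging sketch covering the points above (the logged_run wrapper is illustrative):

import logging
import time
 
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("agent")
 
def logged_run(agent, query):
    """Log the query, elapsed time, and a truncated response around agent.run."""
    log.info("query: %s", query)
    start = time.perf_counter()
    response = agent.run(query)
    log.info("done in %.1fs; response: %.200s", time.perf_counter() - start, response)
    return response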

Q: Can I deploy these agents in production?

A: Yes, but consider additional requirements:

  • Proper error handling and logging
  • Security measures (API key management, input sanitization)
  • Monitoring and alerting systems
  • Scaling considerations (load balancing, caching)
  • Rate limiting and cost controls
  • Environment-based configuration management
