Three weeks ago we did a benchmark in this blog where we compared all the various fetching tools built in by major LLM providers.
The conclusions were that llms providers are not suited for extracting data from web.
Starting from these results I had this question in my mind: "Is it possible to create a custom agent based on llms for extracting correct and precise data from the web?".
In this tutorial we are going to do that with the most famous frameworks for building agents.
We will test both Langraph and OpenAI lib for making a well done comparison between these libraries.
Before starting
Here's a quick recap of the two weeks ago post:
| Service | Speed | Quality of Response | PDF support | Correct result |
|---|---|---|---|---|
| Claude | - | No answer | ✅ | ❌ |
| ScrapeGraphAI | 15s | Perfect | ✅ | ✅ |
| 3,09m | Good | ❌ | ❌ | |
| Mistral | - | No answer | ❌ | ❌ |
Here is a breakdown of the results:
- Mistral and Claude did not provide an answer
- Google took 3.09m
- ScrapeGraphAI just took 15 s
In this week blogpost we are going to create an agent using Mistral for unlocking:
- search capabilities
- scraping capabilities
- fetching pdf capabilities
For being coherent with the previous blogpost we will continue to use the same link of the previous week: https://www.amazon.it/s?k=keyboards.
Comparison goals: The comparison features will be:
- Quality of the response
- Response time
Case 1: LangGraph
Let's start with LangGraph, the most famous tool for agents creation.
The first thing to do is creating the tools the agent will use
Search tool
from scrapegraph_py import Client
from langchain_core.tools import tool
@tool
def search_scraper_tool(user_prompt: str, num_results: int = 3) -> dict:
"""Perform AI-powered web searches with structured results."""
try:
sgai_client = Client(api_key="sgai-***")
response = sgai_client.searchscraper(
user_prompt=user_prompt,
num_results=num_results
)
return {"success": True, "results": response.result}
except Exception as e:
return {"success": False, "error": str(e)}Scraping tool
from scrapegraph_py import Client
from langchain_core.tools import tool
@tool
def smart_scraper_tool(website_url: str, user_prompt: str) -> dict:
"""Extract structured data from a webpage using AI-powered scraping."""
try:
sgai_client = Client(api_key="sgai-***")
response = sgai_client.smartscraper(
website_url=website_url,
user_prompt=user_prompt
)
return {"success": True, "data": response.result, "source_url": website_url}
except Exception as e:
return {"success": False, "error": str(e)}Fetching pdf capabilities
from scrapegraph_py import Client
from langchain_core.tools import tool
@tool
def markdownify_tool(website_url: str) -> dict:
"""Convert a webpage or PDF into clean, formatted markdown."""
try:
sgai_client = Client(api_key="sgai-***")
response = sgai_client.markdownify(website_url=website_url)
return {"success": True, "markdown": response.result, "source_url": website_url}
except Exception as e:
return {"success": False, "error": str(e)}Agent definition with main
import os
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_mistralai import ChatMistralAI
class AgentState(dict):
messages: list
task: str
def agent_node(state):
"""Main agent node with all tools"""
llm = ChatMistralAI(
api_key=os.getenv("MISTRAL_API_KEY", "your_mistral_key_here"),
model="mistral-large-latest",
temperature=0.1
)
tools = [smart_scraper_tool, markdownify_tool, search_scraper_tool]
llm_with_tools = llm.bind_tools(tools)
response = llm_with_tools.invoke(state["messages"])
return {"messages": state["messages"] + [response]}
def create_agent():
"""Create simple LangGraph agent"""
workflow = StateGraph(AgentState)
workflow.add_node("agent", agent_node)
workflow.set_entry_point("agent")
workflow.add_edge("agent", END)
return workflow.compile()
def main():
agent = create_agent()
# System prompt
system_msg = SystemMessage(content="""You are an AI assistant with web scraping capabilities.
Use smart_scraper_tool for data extraction, markdownify_tool for PDFs, and search_scraper_tool for searches.""")
# Example usage
examples = [
"Scrape https://www.amazon.it/s?k=keyboards and get prices",
]
for query in examples:
print(f"Query: {query}")
result = agent.invoke({
"messages": [system_msg, HumanMessage(content=query)]
})
print(f"Response: {result['messages'][-1].content}\n")
if __name__ == "__main__":
main()Result
The execution of the script took in total 27.3 seconds.
The agent successfully understood the task and called the appropriate scraping tool. The response quality was excellent, providing well-structured data with accurate pricing information. The execution time includes the initial API calls to Mistral, the tool invocation overhead, and the ScrapeGraphAI processing time.
Here is a snippet of the result:
{
"keyboard_prices": [
{
"product": "Logitech G213 Prodigy Gaming Cablata",
"price": "36.87 $"
},
{
"product": "Logitech K120 Wired Keyboard for Windows",
"price": "26.57 $"
},
{
"product": "Basic Keyboard",
"price": "12.34 $"
},
{
"product": "Logitech K120 Wired Keyboard with Cable Business Edition",
"price": "13.29 $"
},
{
"product": "Ewent Professional USB Wired Keyboard Italian Layout QWERTY",
"price": "11.00 $"
}
]
}Case 2: Custom Agent with OpenAI Library
One of the few interesting things LangGraph provides is tracing of your agent.
If this feature is not required there is no clue at using it, because it is a heavy lib with many useless features.
So lets start with the definition of the tools that will be used inside the agent.
Search tool
from scrapegraph_py import Client
def search_scraper_tool(user_prompt: str, num_results: int = 3) -> dict:
"""Perform AI-powered web searches with structured results."""
try:
sgai_client = Client(api_key="sgai-***")
response = sgai_client.searchscraper(
user_prompt=user_prompt,
num_results=num_results
)
return {"success": True, "results": response.result}
except Exception as e:
return {"success": False, "error": str(e)}Scraping tool
from scrapegraph_py import Client
def smart_scraper_tool(website_url: str, user_prompt: str) -> dict:
"""Extract structured data from a webpage using AI-powered scraping."""
try:
sgai_client = Client(api_key="sgai-***")
response = sgai_client.smartscraper(
website_url=website_url,
user_prompt=user_prompt
)
return {"success": True, "data": response.result, "source_url": website_url}
except Exception as e:
return {"success": False, "error": str(e)}Fetching pdf capabilities
from scrapegraph_py import Client
def markdownify_tool(website_url: str) -> dict:
"""Convert a webpage or PDF into clean, formatted markdown."""
try:
sgai_client = Client(api_key="sgai-***")
response = sgai_client.markdownify(website_url=website_url)
return {"success": True, "markdown": response.result, "source_url": website_url}
except Exception as e:
return {"success": False, "error": str(e)}Agent definition
import os
import json
from typing import Any, Dict
from openai import OpenAI
class MistralAgent:
def __init__(self):
self.client = OpenAI(
api_key=os.getenv("MISTRAL_API_KEY", "your_mistral_key_here"),
base_url="https://api.mistral.ai/v1"
)
self.tools = [
{
"type": "function",
"function": {
"name": "smart_scraper_tool",
"description": "Extract structured data from a webpage using AI-powered scraping",
"parameters": {
"type": "object",
"properties": {
"website_url": {"type": "string", "description": "URL to scrape"},
"user_prompt": {"type": "string", "description": "What to extract"}
},
"required": ["website_url", "user_prompt"]
}
}
},
{
"type": "function",
"function": {
"name": "search_scraper_tool",
"description": "Perform AI-powered web searches with structured results",
"parameters": {
"type": "object",
"properties": {
"user_prompt": {"type": "string", "description": "Search query"},
"num_results": {"type": "integer", "description": "Number of results", "default": 3}
},
"required": ["user_prompt"]
}
}
},
{
"type": "function",
"function": {
"name": "markdownify_tool",
"description": "Convert a webpage or PDF into clean formatted markdown",
"parameters": {
"type": "object",
"properties": {
"website_url": {"type": "string", "description": "URL to convert"}
},
"required": ["website_url"]
}
}
}
]
def call_mistral(self, messages, tools=None):
"""Call Mistral API with optional tools"""
try:
response = self.client.chat.completions.create(
model="mistral-large-latest",
messages=messages,
tools=tools,
temperature=0.1
)
return response
except Exception as e:
return {"error": str(e)}
def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
"""Execute the appropriate tool"""
if tool_name == "smart_scraper_tool":
return smart_scraper_tool(**arguments)
elif tool_name == "search_scraper_tool":
return search_scraper_tool(**arguments)
elif tool_name == "markdownify_tool":
return markdownify_tool(**arguments)
else:
return {"error": f"Unknown tool: {tool_name}"}
def run(self, user_message: str) -> str:
"""Run the agent with user message"""
messages = [
{"role": "user", "content": user_message}
]
# Get initial response from model
response = self.call_mistral(messages, self.tools)
if hasattr(response, 'error'):
return f"Error: {response.error}"
# Check if model wants to use a tool
if response.choices[0].message.tool_calls:
tool_call = response.choices[0].message.tool_calls[0]
tool_name = tool_call.function.name
arguments = json.loads(tool_call.function.arguments)
# Execute the tool
tool_result = self.execute_tool(tool_name, arguments)
# Add tool result to conversation and get final response
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({
"role": "user",
"content": f"Tool result: {json.dumps(tool_result)}. Please provide a human-readable summary."
})
final_response = self.call_mistral(messages)
if hasattr(final_response, 'error'):
return f"Error: {final_response.error}"
return final_response.choices[0].message.content
else:
return response.choices[0].message.content
def main():
agent = MistralAgent()
# Example usage
examples = [
"Scrape https://www.amazon.it/s?k=keyboards and get keyboard prices",
]
for query in examples:
print(f"User: {query}")
response = agent.run(query)
print(f"Assistant: {response}\n" + "="*60 + "\n")
if __name__ == "__main__":
main()Results
The script execution took a total of 22.8 seconds.
The OpenAI library approach proved to be more efficient than LangGraph. The implementation is more straightforward, with less overhead from the framework. The agent correctly identified the need to use the smart_scraper_tool and executed it efficiently. The result quality was identical to the LangGraph implementation, but with faster overall execution time. This is primarily due to the lighter weight of the OpenAI library compared to the more feature-rich LangGraph framework.
Here's an excerpt of the result:
json
{
"keyboard_prices": [
{
"product": "Logitech G213 Prodigy Gaming Cablata",
"price": "36.87 $"
},
{
"product": "Logitech K120 Wired Keyboard for Windows",
"price": "26.57 $"
},
{
"product": "Basic Keyboard",
"price": "12.34 $"
}
]
}Case 3: Local Agent with Ollama
For users concerned about privacy and operational costs, using a local model through Ollama offers an interesting alternative. While inference speed may be slower, you gain complete control and privacy.
Local Mistral Agent Implementation
import json
import requests
from typing import Any, Dict
class LocalMistralAgent:
def __init__(self, ollama_url: str = "http://localhost:11434"):
self.ollama_url = ollama_url
self.model = "mistral"
self.tools = [
{"name": "smart_scraper_tool", "description": "Extract structured data from webpages"},
{"name": "search_scraper_tool", "description": "Perform web searches"},
{"name": "markdownify_tool", "description": "Convert webpages/PDFs to markdown"}
]
def call_ollama(self, messages, tools=None):
"""Call local Ollama model"""
try:
payload = {
"model": self.model,
"messages": messages,
"stream": False,
"temperature": 0.1
}
if tools:
tools_description = "\n".join([f"- {t['name']}: {t['description']}" for t in tools])
messages[0]["content"] += f"\n\nAvailable tools:\n{tools_description}"
response = requests.post(f"{self.ollama_url}/api/chat", json=payload)
if response.status_code == 200:
return response.json()
else:
return {"error": f"HTTP {response.status_code}: {response.text}"}
except Exception as e:
return {"error": str(e)}
def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
"""Execute the appropriate tool"""
if tool_name == "smart_scraper_tool":
return smart_scraper_tool(**arguments)
elif tool_name == "search_scraper_tool":
return search_scraper_tool(**arguments)
elif tool_name == "markdownify_tool":
return markdownify_tool(**arguments)
else:
return {"error": f"Unknown tool: {tool_name}"}
def parse_response(self, response_content: str) -> Dict[Any, Any]:
"""Parse the model response to extract tool calls or direct responses"""
try:
parsed = json.loads(response_content)
return parsed
except json.JSONDecodeError:
return {"action": "response", "content": response_content}
def run(self, user_message: str) -> str:
"""Run the agent with user message"""
messages = [
{"role": "user", "content": user_message}
]
# Get initial response from model
response = self.call_ollama(messages, self.tools)
if "error" in response:
return f"Error: {response['error']}"
model_response = response["message"]["content"]
parsed_response = self.parse_response(model_response)
# Check if model wants to use a tool
if parsed_response.get("action") == "tool_call":
tool_name = parsed_response.get("tool")
arguments = parsed_response.get("arguments", {})
# Execute the tool
tool_result = self.execute_tool(tool_name, arguments)
# Add tool result to conversation and get final response
messages.extend([
{"role": "assistant", "content": model_response},
{"role": "user", "content": f"Tool result: {json.dumps(tool_result)}. Please provide a human-readable summary."}
])
final_response = self.call_ollama(messages, tools=None)
if "error" in final_response:
return f"Error in final response: {final_response['error']}"
return final_response["message"]["content"]
elif parsed_response.get("action") == "response":
return parsed_response.get("content", model_response)
else:
return model_response
def main():
agent = LocalMistralAgent()
# Test connectivity to Ollama
try:
test_response = agent.call_ollama([{"role": "user", "content": "Hello, are you working?"}])
if "error" in test_response:
print(f"Error connecting to Ollama: {test_response['error']}")
print("Make sure Ollama is running with: ollama serve")
return
except Exception as e:
print(f"Failed to connect to Ollama: {e}")
return
# Example usage
examples = [
"Scrape https://www.amazon.it/s?k=keyboards and get the keyboard prices",
]
for query in examples:
print(f"User: {query}")
response = agent.run(query)
print(f"Assistant: {response}\n" + "="*60 + "\n")
if __name__ == "__main__":
main()Results
The script execution took a total of 54.2 seconds.
The result is complete and accurate, although the inference speed is lower compared to cloud models. The local Mistral model successfully understood the task and executed the appropriate tools. The longer execution time is expected when running on local hardware compared to optimized cloud infrastructure. However, this approach offers superior privacy and eliminates recurring API costs. Here's an excerpt of the result:
{
"keyboard_prices": [
{
"product": "Logitech G213 Prodigy Gaming Cablata",
"price": "36.87 $"
},
{
"product": "Logitech K120 Wired Keyboard for Windows",
"price": "26.57 $"
},
{
"product": "Basic Keyboard",
"price": "12.34 $"
}
]
}Advantages of Local Models
- Privacy: All data remains on your hardware
- Costs: No API call costs after initial hardware investment
- Control: Complete control over model versions and configurations
- Availability: Works offline (except for scraping tools)
Disadvantages
- Performance: Slower inference speed compared to cloud models
- Quality: Potentially lower quality compared to larger models
- Resources: Requires adequate local hardware (GPU recommended)
- Complexity: More complex setup compared to cloud solutions
Conclusions
Comparison Table
Looking at the results provided in the previous chapters here is a comprehensive breakdown of all approaches:
| Service | Response time | Quality of Response | PDF support | Correct result |
|---|---|---|---|---|
| Mistral (standalone) | - | No answer | ❌ | ❌ |
| Mistral with LangGraph | 27.3s | Perfect | ✅ | ✅ |
| Mistral with OpenAI lib | 22.8s | Perfect | ✅ | ✅ |
| Mistral with Ollama | 54.2s | Perfect | ✅ | ✅ |
Key Findings
As is evident from the results, Mistral API without external tools cannot access websites or PDFs independently. However, when integrated with specialized agents and tools, it becomes a powerful solution for web scraping and data extraction.
The OpenAI library approach provides the best balance between performance and simplicity, completing the task in just 22.8 seconds with minimal overhead. LangGraph offers additional features for complex workflows at the cost of slightly increased execution time. The local Ollama approach trades speed for privacy and cost efficiency, making it suitable for scenarios where data security is paramount.
All three implementations successfully extracted structured data from the website with perfect accuracy, demonstrating that agent-based approaches effectively overcome the limitations of standalone LLM providers.
Recommendations
For production environments prioritizing speed and minimal complexity, use the OpenAI library approach. For complex multi-step workflows with debugging requirements, LangGraph is ideal. For privacy-sensitive applications where cost is less critical, deploy a local Ollama solution.
FAQ
Q: What is ScrapeGraphAI and why do you use it for the tools?
A: ScrapeGraphAI is a specialized service for AI-powered web scraping. We use it because, as shown in our previous benchmark, traditional LLM providers like Claude and Mistral alone couldn't effectively extract data from web sources. ScrapeGraphAI provides the missing capabilities for search, scraping, and PDF processing.
Q: Do I need a ScrapeGraphAI API key to run these examples?
A: Yes, you'll need a ScrapeGraphAI API key. Replace "sgai-***" in the code examples with your actual API key. You can get one from their website.
Q: Why did Mistral fail in the original benchmark but succeed with agents?
A: Mistral (and other LLMs) failed because they lack built-in web scraping capabilities. By creating agents that use specialized tools like ScrapeGraphAI, we give Mistral the ability to interact with web content, PDFs, and perform searches effectively.
Technical Implementation
Q: Which approach should I choose: LangGraph or OpenAI lib?
A: Choose based on your needs:
- LangGraph: Better for complex workflows, built-in tracing, and debugging capabilities
- OpenAI lib: Lighter weight, simpler implementation, better performance if you don't need advanced features
- Ollama: Best for privacy-focused applications where speed is not critical
Q: Can I use other LLM providers instead of Mistral?
A: Yes! For the OpenAI lib approach, you can easily switch to any OpenAI-compatible API by changing the base_url. For LangGraph, you can use any LangChain-supported LLM provider. For Ollama, you can use any model available on their repository.
Q: How do I handle rate limits and errors?
A: Both implementations include basic error handling in the tool functions. For production use, consider adding:
- Retry logic with exponential backoff
- Rate limiting mechanisms
- Proper logging and monitoring
- Circuit breakers for graceful degradation
Q: Are there any costs involved?
A: Yes, you'll need to pay for:
- Mistral API calls (or your chosen LLM provider)
- ScrapeGraphAI API usage
- Any additional services the tools access
- Infrastructure costs for local models (electricity, hardware maintenance)
Customization and Extensions
Q: Can I add more tools to the agent?
A: Absolutely! All implementations are designed to be extensible. Simply:
- Create new tool functions following the same pattern
- Add them to the tools list
- Update the system prompt to describe the new capabilities
Q: How can I improve the agent's performance?
A: Consider these optimizations:
- Adjust the temperature parameter for more consistent results
- Fine-tune your system prompts with examples
- Implement caching for repeated requests
- Add input validation and sanitization
- Use batch processing for multiple queries
Q: Can these agents handle multiple concurrent requests?
A: The basic implementations are synchronous. For production use with concurrent requests, consider:
- Using async/await patterns
- Implementing connection pooling
- Adding queue management for heavy workloads
- Deploying with load balancing
Troubleshooting
Q: The agent isn't calling tools when expected. What's wrong?
A: Common issues:
- Check that your system prompt clearly describes when to use tools
- Ensure your user queries are specific enough to trigger tool usage
- Verify API keys are correctly set
- Make sure tool schemas match the function signatures
Q: How do I debug what the agent is doing?
A: LangGraph provides built-in tracing. For the OpenAI lib approach, add logging to track:
- Messages sent to the LLM
- Tool calls and their arguments
- Tool execution results
- Response times for each component
Q: Can I deploy these agents in production?
A: Yes, but consider additional requirements:
- Proper error handling and logging
- Security measures (API key management, input sanitization)
- Monitoring and alerting systems
- Scaling considerations (load balancing, caching)
- Rate limiting and cost controls
- Environment-based configuration management
Related Articles
If you found this tutorial helpful, you might also be interested in these related articles:
- How to Create an AI Agent Without Frameworks - Learn alternative approaches to building AI agents that can use ScrapeGraphAI for data extraction
- Traditional vs AI Scraping: Which One Wins in 2025? - Deep dive into the differences between traditional and AI-powered scraping approaches
- Beyond Firecrawl: The Future of Web Scraping - Discover why AI-powered scraping with ScrapeGraphAI is revolutionizing web data extraction
- ScrapeGraphAI vs AgentQL: Which AI Web Scraper Wins in 2025 - Compare AI-powered web scraping tools and their capabilities
- 7 Best No-Code AI Web Scraper: Top Tools for 2025 - Explore no-code alternatives for web scraping
- ScrapeGraphAI vs Extracta AI: AI Scraper Battle in 2025 - Compare different AI scraping solutions
- Best Web Scraping API Providers - Explore API-based scraping solutions for your projects
- Firecrawl Alternative: Why Developers Choose ScrapeGraphAI - Detailed analysis of Firecrawl alternatives for developers
