Having clean data and neatly visualising that is the backbone of data science and business analytics which can boost your understanding of the market you are in.
Finding the right tools for doing this is easier and smooth than it has ever been.
Introducing ScrapeGraphAI X PandasAI
What is ScrapeGraphAI 🕷️
ScrapeGraphAI is an API for extracting data from the web with the use of AI.
So it will help you with the data part which is focussed on scraping and aggregating information from various sources to gain insights from. This service will fit in your data pipeline perfectly because of the easy to use apis that we provide which are fast and accurate.And it's all AI powered.
What is PandasAI 🐼
PandaAI is an open-source framework that brings together intelligent data processing and natural language analysis. Whether you're working with complex datasets or just starting your data journey, PandaAI provides the tools to define, process, and analyze your data efficiently. Through its powerful data preparation layer and intuitive natural language interface, you can transform raw data into actionable insights without writing complex cod created by Gabriele Venturi in 2022.
PandasAI fits into our second piece which helps you perform operations on the data and visualize the data that we scraped from ScrapeGraphAI. It will help you perform operations on the data frames via natural language avoiding the tedious work that goes into transforming data. It will also plot the data for you using prompting.
How to create a dataset for extracting data from web and create analytics
You can either clone the repository and run the notebook directly or follow the tutorial
In the guide below we are going to create a simple agent which has access to two tools called analyze data and scrape website which will handle the plotting of the graph and scraping of the website using the PandasAI and ScrapeGraphAI. We are going to use OpenAI as our model provider you can choose the model provider of your own
1.First you need to install the following dependencies
ipython==9.2.0
langchain\_core==0.3.59
langchain\_openai==0.3.16
langgraph==0.4.3
pandasai
numpy==1.24.4
python-dotenv==1.1.0
scrapegraph\_py==1.12.0
2.Then after that import the following packages:
from dotenv import load\_dotenv
import pandas as pd
import pandasai as pai
import logging
from scrapegraph\_py import Client
from langchain\_openai import ChatOpenAI
from langgraph.graph import MessagesState, StateGraph, START
from langgraph.prebuilt import tools\_condition, ToolNode
from langchain\_core.messages import HumanMessage, SystemMessage
3.Then you need to load the your API Keys:
\# Load environment variables from .env file
load\_dotenv()
\# Access environment variables
scrape\_graph\_api\_key \= os.getenv('SCRAPEGRAPH\_API\_KEY')
pandasai\_api\_key \= os.getenv('PANDASAI\_API\_KEY')
openai\_api\_key \= os.getenv('OPENAI\_API\_KEY')
\# Set up logging
logging.basicConfig(level=logging.INFO)
(You can get the api keys from following links)
[https://dashboard.scrapegraphai.com/](https://dashboard.scrapegraphai.com/)
[https://app.pandabi.ai/sign-in](https://app.pandabi.ai/sign-in)
[https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
4. Now we will define our Tool which will use the PandasAI
This tool will plot the graphs for us. we have added some validations on the data to have consistent values
def analyze\_data(df\_dict, question):
"""
Analyze the dataset and answer a question using pandasai.
Parameters:
\- df\_dict (dict): Dictionary containing data as lists under arbitrary keys and numbers as floats
\- question (str): Prompt for plotting the graph based on the data
Returns:
\- The model's response
Raises:
\- ValueError: If input cannot be converted to a valid DataFrame
"""
try:
\# Convert dictionary to DataFrame
df \= pd.DataFrame(df\_dict)
\# Clean columns
for column in df.columns:
\# Check if the column might contain mixed data
if df\[column\].dtype \== object:
\# Remove 'default' or 'N/A' entries
df \= df\[\~df\[column\].isin(\['default', 'N/A'\])\]
\# Function to convert strings to float if they represent numbers
def convert\_to\_float(value):
if isinstance(value, str):
\# Remove currency symbols and replace commas with dots
cleaned\_value \= value.replace('€', '').replace(',', '.')
\# Check if the cleaned value can be converted to float
try:
return float(cleaned\_value)
except ValueError:
\# Return original value if it can't be converted
return value
return value
\# Apply conversion to the column
df\[column\] \= df\[column\].apply(convert\_to\_float)
\# If the column contains numeric values, ensure they're float
if df\[column\].apply(lambda x: isinstance(x, (int, float))).any():
try:
df\[column\] \= pd.to\_numeric(df\[column\], errors='coerce')
\# Drop rows with NaN in this column
df \= df.dropna(subset=\[column\]) if df\[column\].notna().any() else df
except (AttributeError, ValueError):
continue
logging.info("Setting pandasai API key")
pai.api\_key.set(pandasai\_api\_key)
logging.info("Creating DataFrame for analysis")
\# Use pandasai directly with the DataFrame
response \= pai.DataFrame(df).chat(question)
\# Save any generated plot
response.save('plot.png')
return response
except Exception as e:
logging.error(f"Error in analyze\_data: {str(e)}")
raise ValueError(f"Failed to process data: {str(e)}")
5.Now we will define our Tool which will use ScrapeGraphAI
This tool accepts website url and user prompt as arguments and scrapes the content of the website
def scrape\_website(website\_url, user\_prompt) \-\> dict:
"""
Perform a scraping request on a website using ScrapeGraphAI.
Parameters:
\- website\_url (str): The URL of the website to scrape
\- user\_prompt (str): The data extraction prompt
Returns:
\- A dictionary containing the scraped data
"""
try:
sgai\_client \= Client(api\_key=scrape\_graph\_api\_key)
logging.info("Creating ScrapeGraphAI client")
response \= sgai\_client.smartscraper(
website\_url=website\_url,
user\_prompt=user\_prompt
)
sgai\_client.close()
result \= response.values()
\# Extract the data dictionary from the response (usually at index 4\)
data\_dict \= list(result)
\# Ensure the return is a dictionary
if not isinstance(data\_dict, dict):
data\_dict \= {"data": data\_dict}
return data\_dict
except Exception as e:
logging.error(f"Error in scrape\_website: {str(e)}")
raise
6.Now we will create the Agent and add these tools to the agent
\# Create the agent
tools \= \[analyze\_data, scrape\_website\]
llm \= ChatOpenAI(model="gpt-4o", api\_key=openai\_api\_key)
llm\_with\_tools \= llm.bind\_tools(tools)
\# System message for the assistant
sys\_msg \= SystemMessage(content="""You are a helpful assistant tasked with performing scraping scripts with scrapegraphai and analyzing the data with pandasai.
You have access to the following tools:
\- scrape\_website: to scrape a website
\- analyze\_data: to analyze a pandas dataframe
""")
\# Assistant function
def assistant(state: MessagesState):
return {"messages": \[llm\_with\_tools.invoke(\[sys\_msg\] \+ state\["messages"\])\]}
\# Build the graph
builder \= StateGraph(MessagesState)
builder.add\_node("assistant", assistant)
builder.add\_node("tools", ToolNode(tools))
builder.add\_edge(START, "assistant")
builder.add\_conditional\_edges(
"assistant",
tools\_condition,
)
builder.add\_edge("tools", "assistant")
react\_graph \= builder.compile()
7\. You can see the plot of the graph by using the below code:
from IPython.display import Image, display
from langchain\_core.runnables.graph import MermaidDrawMethod
display(
Image(
react\_graph.get\_graph().draw\_mermaid\_png(
draw\_method=MermaidDrawMethod.API,
)
)
)
This is the structure of our graph
9.Now lets run the agent
messages \= react\_graph.invoke(
input={
"messages": \[HumanMessage(content="""Draw me a histogram for rating against price for the products in the following link:
https://www.amazon.com/s?k=keyboards\&crid=2F2S3TU22QHOF\&sprefix=keyboar%2Caps%2C442\&ref=nb\_sb\_noss\_2""")\]
}
)
It will output a file called plot.png in your directory
This is how you can use ScrapeGraphAI and PandasAI together and enhance your data analysis work by 10x with our AI native pipelines that handle the complex stuff for you.
Frequently Asked Question (FAQ)
Are PandasAI and ScrapeGraphAI open source?
Absolutely both the services are open source, do leave a star on our repos
Why should I use ScrapeGraphAI and PandasAI together?
Both fit into the data pipelines seamlessly and work perfectly and both the services are AI powered.
In which scenarios is the ScrapeGraphAI + PandasAI combo most powerful?
If you want to do some research on competitor analysis or study the prices of products on e-commerce websites then this is a perfect combination because such data is quantitative and can be scraped using ScrapeGraphAI and PandasAI can help transform that data and plot beautiful graphs for visualizing.
Related Resources
Want to learn more about data analysis and AI-powered processing? Explore these guides:
- Web Scraping 101 - Master the basics of web scraping
- AI Agent Web Scraping - Learn about AI-powered scraping
- Mastering ScrapeGraphAI - Deep dive into our scraping platform
- LlamaIndex Integration) - Learn about advanced data processing
- Building Intelligent Agents - Create powerful data analysis agents
- Pre-AI to Post-AI Scraping - See how AI has transformed data processing
- Structured Output - Learn about data formatting
- Data Innovation - Discover innovative data analysis methods
- Full Stack Development - Build complete data solutions