PandasAI x ScrapeGraphAI: Building a scraping agent that produces analytics

Clean data and clear visualizations are the backbone of data science and business analytics, and they can sharpen your understanding of the market you operate in.
Finding the right tools for the job is easier than it has ever been.
Introducing ScrapeGraphAI x PandasAI.

What is ScrapeGraphAI 🕷️

ScrapeGraphAI is an API for extracting data from the web using AI.
It covers the data-collection part of the pipeline: scraping and aggregating information from various sources so that you can draw insights from it. The service fits into a data pipeline through easy-to-use APIs that are fast and accurate, and it is all AI powered.

What is PandasAI 🐼

PandaAI is an open-source framework that brings together intelligent data processing and natural language analysis. Whether you're working with complex datasets or just starting your data journey, PandaAI provides the tools to define, process, and analyze your data efficiently. Through its powerful data preparation layer and intuitive natural language interface, you can transform raw data into actionable insights without writing complex code. It was created by Gabriele Venturi in 2022.
PandasAI is the second piece of our pipeline: it performs operations on the data scraped with ScrapeGraphAI and visualizes it. You can manipulate data frames in natural language, which avoids the tedious work of transforming data by hand, and you can plot the data with a prompt.

How to create a dataset by extracting data from the web and build analytics on it

You can either clone the repository and run the notebook directly, or follow the tutorial below.

Repository with the tutorial

In the guide below we create a simple agent with access to two tools, analyze\_data and scrape\_website, which handle plotting the graph with PandasAI and scraping the website with ScrapeGraphAI. We use OpenAI as the model provider, but you can choose your own.

1. First, install the following dependencies:

ipython==9.2.0  
langchain\_core==0.3.59  
langchain\_openai==0.3.16  
langgraph==0.4.3  
pandasai  
numpy==1.24.4  
python-dotenv==1.1.0  
scrapegraph\_py==1.12.0
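
After installing, it can help to confirm that the environment matches the pins above. The following is a minimal sketch using only the standard library; the `check_versions` helper and the `PINNED` table are illustrative names, not part of the tutorial repository, and the package names are written in their hyphenated PyPI form.

```python
from importlib import metadata

# Pins copied from the dependency list above (hyphenated PyPI names).
# pandasai is unpinned in the list, so None means "any installed version".
PINNED = {
    "ipython": "9.2.0",
    "langchain-core": "0.3.59",
    "langchain-openai": "0.3.16",
    "langgraph": "0.4.3",
    "pandasai": None,
    "numpy": "1.24.4",
    "python-dotenv": "1.1.0",
    "scrapegraph-py": "1.12.0",
}


def check_versions(pinned):
    """Return {package: (wanted, installed)} for missing or mismatched packages."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed is None or (wanted is not None and installed != wanted):
            problems[name] = (wanted, installed)
    return problems


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(PINNED).items():
        print(f"{name}: wanted {wanted or 'any'}, found {installed or 'nothing'}")
```

An empty result means every listed package is installed at the pinned version.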

2. Then import the following packages:

import os  
from dotenv import load\_dotenv  
import pandas as pd  
import pandasai as pai  
import logging  
from scrapegraph\_py import Client  
from langchain\_openai import ChatOpenAI  
from langgraph.graph import MessagesState, StateGraph, START  
from langgraph.prebuilt import tools\_condition, ToolNode  
from langchain\_core.messages import HumanMessage, SystemMessage

3. Then load your API keys:

\# Load environment variables from .env file  
load\_dotenv()
 
\# Access environment variables  
scrape\_graph\_api\_key \= os.getenv('SCRAPEGRAPH\_API\_KEY')  
pandasai\_api\_key \= os.getenv('PANDASAI\_API\_KEY')  
openai\_api\_key \= os.getenv('OPENAI\_API\_KEY')
 
\# Set up logging  
logging.basicConfig(level=logging.INFO)

You can get the API keys from the following links:
 
[https://dashboard.scrapegraphai.com/](https://dashboard.scrapegraphai.com/)  
[https://app.pandabi.ai/sign-in](https://app.pandabi.ai/sign-in)  
[https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
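
Since `os.getenv` silently returns `None` for a variable that is not set, a missing key would otherwise only surface later as an authentication error inside one of the tools. The sketch below shows one way to fail early; `missing_keys` and `require_keys` are hypothetical helpers, and the variable names are the ones used in step 3.

```python
import os

# The three environment variables read in step 3.
REQUIRED_KEYS = ("SCRAPEGRAPH_API_KEY", "PANDASAI_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def require_keys(env=os.environ, required=REQUIRED_KEYS):
    """Raise a clear error if any required key is missing."""
    missing = missing_keys(env, required)
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))


# Usage: call require_keys() right after load_dotenv().
```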

4. Now we will define the tool that uses PandasAI

This tool plots the graphs for us. We have added some validation on the data to keep the values consistent.

def analyze\_data(df\_dict, question):  
    """  
    Analyze the dataset and answer a question using pandasai.
 
    Parameters:  
    \- df\_dict (dict): Dictionary containing data as lists under arbitrary keys and numbers as floats  
    \- question (str): Prompt for plotting the graph based on the data
 
    Returns:  
    \- The model's response
 
    Raises:  
    \- ValueError: If input cannot be converted to a valid DataFrame  
    """  
    try:  
        \# Convert dictionary to DataFrame  
        df \= pd.DataFrame(df\_dict)
 
        \# Clean columns  
        for column in df.columns:  
            \# Check if the column might contain mixed data  
            if df\[column\].dtype \== object:  
                \# Remove 'default' or 'N/A' entries  
                df \= df\[\~df\[column\].isin(\['default', 'N/A'\])\]
 
                \# Function to convert strings to float if they represent numbers  
                def convert\_to\_float(value):  
                    if isinstance(value, str):  
                        \# Remove currency symbols and replace commas with dots  
                        cleaned\_value \= value.replace('€', '').replace(',', '.')  
                        \# Check if the cleaned value can be converted to float  
                        try:  
                            return float(cleaned\_value)  
                        except ValueError:  
                            \# Return original value if it can't be converted  
                            return value  
                    return value
 
                \# Apply conversion to the column  
                df\[column\] \= df\[column\].apply(convert\_to\_float)
 
                \# If the column contains numeric values, ensure they're float  
                if df\[column\].apply(lambda x: isinstance(x, (int, float))).any():  
                    try:  
                        df\[column\] \= pd.to\_numeric(df\[column\], errors='coerce')  
                        \# Drop rows with NaN in this column  
                        df \= df.dropna(subset=\[column\]) if df\[column\].notna().any() else df  
                    except (AttributeError, ValueError):  
                        continue
 
        logging.info("Setting pandasai API key")  
        pai.api\_key.set(pandasai\_api\_key)
 
        logging.info("Creating DataFrame for analysis")  
        \# Use pandasai directly with the DataFrame  
        response \= pai.DataFrame(df).chat(question)
 
        \# Save the plot only when the response supports it (e.g. a chart)  
        if hasattr(response, 'save'):  
            response.save('plot.png')  
        return response
 
    except Exception as e:  
        logging.error(f"Error in analyze\_data: {str(e)}")  
        raise ValueError(f"Failed to process data: {str(e)}") from e
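
To see what the cleaning step does to typical scraped values, here is a standalone copy of the `convert_to_float` helper from `analyze_data`, applied to a few made-up sample strings. The samples are illustrative and not taken from a real scrape.

```python
def convert_to_float(value):
    """Same logic as the helper inside analyze_data: strip the euro sign,
    turn the decimal comma into a dot, and try to parse a float."""
    if isinstance(value, str):
        cleaned = value.replace("€", "").replace(",", ".")
        try:
            return float(cleaned)
        except ValueError:
            return value  # not a number: keep the original string
    return value  # non-strings pass through untouched


# Hypothetical sample values of the kind a product scrape might return.
samples = ["€12,50", "4.5", "N/A", "Logitech K120", 7, "1.299,00"]
converted = [convert_to_float(v) for v in samples]
print(converted)
```

The first two values become the floats 12.5 and 4.5, and the plain strings are returned unchanged. The last sample shows a limitation: a price with a thousands separator becomes `1.299.00` after the replacement, fails to parse, and stays a string, so in `analyze_data` it would later be coerced to NaN and its row dropped.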

5. Now we will define the tool that uses ScrapeGraphAI

This tool accepts a website URL and a user prompt as arguments and scrapes the content of the website.

def scrape\_website(website\_url, user\_prompt) \-\> dict:  
    """  
    Perform a scraping request on a website using ScrapeGraphAI.
 
    Parameters:  
    \- website\_url (str): The URL of the website to scrape  
    \- user\_prompt (str): The data extraction prompt
 
    Returns:  
    \- A dictionary containing the scraped data  
    """  
    try:  
        logging.info("Creating ScrapeGraphAI client")  
        sgai\_client \= Client(api\_key=scrape\_graph\_api\_key)
 
        response \= sgai\_client.smartscraper(  
            website\_url=website\_url,  
            user\_prompt=user\_prompt  
        )
 
        sgai\_client.close()  
        \# Extract the data dictionary from the response (usually at index 4\)  
        values \= list(response.values())  
        data\_dict \= values\[4\] if len(values) \> 4 else values
 
        \# Ensure the return is a dictionary  
        if not isinstance(data\_dict, dict):  
            data\_dict \= {"data": data\_dict}
 
        return data\_dict
 
    except Exception as e:  
        logging.error(f"Error in scrape\_website: {str(e)}")  
        raise
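
The `analyze_data` tool expects a dictionary of equal-length lists, while scraped results often arrive as a list of records. In the agent the LLM does this reshaping when it builds the tool call, but it can be sketched explicitly. The payload shape below (a `products` list with `name`, `price`, and `rating`) is an assumption for illustration, and `records_to_columns` is a hypothetical helper.

```python
def records_to_columns(records, fields):
    """Turn [{'name': ..., 'price': ...}, ...] into {'name': [...], 'price': [...]}.

    Missing values become 'N/A', which analyze_data filters out of text columns,
    and every list ends up the same length so pandas can build a DataFrame.
    """
    columns = {field: [] for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            columns[field].append("N/A" if value is None else value)
    return columns


# Hypothetical scrape result; real payloads depend on the prompt and the page.
scraped = {
    "products": [
        {"name": "Keyboard A", "price": "€49,99", "rating": 4.5},
        {"name": "Keyboard B", "price": None, "rating": 4.1},
    ]
}

df_dict = records_to_columns(scraped["products"], ["name", "price", "rating"])
print(df_dict)
```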

6. Now we will create the agent and add these tools to it

\# Create the agent  
tools \= \[analyze\_data, scrape\_website\]  
llm \= ChatOpenAI(model="gpt-4o", api\_key=openai\_api\_key)  
llm\_with\_tools \= llm.bind\_tools(tools)
 
\# System message for the assistant  
sys\_msg \= SystemMessage(content="""You are a helpful assistant tasked with performing scraping scripts with scrapegraphai and analyzing the data with pandasai.  
You have access to the following tools:  
\- scrape\_website: to scrape a website  
\- analyze\_data: to analyze a pandas dataframe  
""")
 
\# Assistant function  
def assistant(state: MessagesState):  
   return {"messages": \[llm\_with\_tools.invoke(\[sys\_msg\] \+ state\["messages"\])\]}
 
\# Build the graph  
builder \= StateGraph(MessagesState)  
builder.add\_node("assistant", assistant)  
builder.add\_node("tools", ToolNode(tools))  
builder.add\_edge(START, "assistant")  
builder.add\_conditional\_edges(  
   "assistant",  
   tools\_condition,  
)  
builder.add\_edge("tools", "assistant")  
react\_graph \= builder.compile()

7. You can render the structure of the graph with the code below:
 
from IPython.display import Image, display  
from langchain\_core.runnables.graph import MermaidDrawMethod  
display(  
   Image(  
       react\_graph.get\_graph().draw\_mermaid\_png(  
           draw\_method=MermaidDrawMethod.API,  
       )  
   )  
)

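The rendered image shows the structure of our graph: the assistant node hands off to the tools node whenever it requests a tool, and the tools node always returns control to the assistant.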
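
The control flow of this graph can also be pictured in plain Python. The sketch below is not LangGraph code: `FakeAIMessage`, `route`, and `run_loop` are stand-ins invented for illustration, and the assistant is a fixed script rather than a model. It only mirrors the routing rule that `tools_condition` applies, which sends execution to the tools node when the last assistant message contains tool calls and ends the run otherwise.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAIMessage:
    """Stand-in for an assistant message; only the fields the router needs."""
    content: str
    tool_calls: list = field(default_factory=list)


def route(last_message):
    """Go to 'tools' if the assistant asked for a tool, otherwise finish."""
    return "tools" if last_message.tool_calls else "end"


def run_loop(assistant, run_tool, messages, max_steps=10):
    """Alternate assistant and tools until the assistant stops calling tools."""
    trace = []
    for _ in range(max_steps):
        reply = assistant(messages)
        messages.append(reply)
        trace.append("assistant")
        if route(reply) == "end":
            return trace
        for call in reply.tool_calls:
            messages.append(run_tool(call))  # tool output goes back to the assistant
        trace.append("tools")
    raise RuntimeError("step limit reached without a final answer")


# Scripted replies: scrape first, then analyze, then answer.
script = iter([
    FakeAIMessage("", tool_calls=[{"name": "scrape_website"}]),
    FakeAIMessage("", tool_calls=[{"name": "analyze_data"}]),
    FakeAIMessage("Saved the chart to plot.png"),
])
trace = run_loop(
    assistant=lambda msgs: next(script),
    run_tool=lambda call: f"result of {call['name']}",
    messages=["user request"],
)
print(trace)
```

The trace alternates between the two nodes and ends on the assistant, which matches the edges added in step 6: `START` to assistant, a conditional edge out of assistant, and an unconditional edge from tools back to assistant.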

8. Now let's run the agent

messages \= react\_graph.invoke(  
   input={  
       "messages": \[HumanMessage(content="""Draw me a histogram for rating against price for the products in the following link:  
                        https://www.amazon.com/s?k=keyboards\&crid=2F2S3TU22QHOF\&sprefix=keyboar%2Caps%2C442\&ref=nb\_sb\_noss\_2""")\]  
   }  
)  

This writes a file called plot.png to your working directory.
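
The value returned by `react_graph.invoke` is the final graph state, whose `messages` list holds the whole conversation, including tool calls and tool outputs. A small helper like the one below can make a run easier to inspect. `summarize` is a hypothetical name; it relies only on the `type` and `content` attributes that LangChain message objects expose, and the demo uses stand-in objects so that it runs without any API keys.

```python
from pathlib import Path
from types import SimpleNamespace


def summarize(result, width=80):
    """Return one short 'type: content' line per message in the final state."""
    lines = []
    for message in result["messages"]:
        text = str(message.content).replace("\n", " ")
        lines.append(f"{message.type}: {text[:width]}")
    return lines


# Stand-in messages with the same two attributes as real LangChain messages.
demo = {
    "messages": [
        SimpleNamespace(type="human", content="Draw me a histogram"),
        SimpleNamespace(type="ai", content=""),
        SimpleNamespace(type="tool", content="{'data': ['...']}"),
        SimpleNamespace(type="ai", content="The plot was saved\nto plot.png"),
    ]
}

for line in summarize(demo):
    print(line)

# After a real run, the chart should exist next to the notebook.
print("plot.png exists:", Path("plot.png").exists())
```

With the real agent, pass the `messages` variable from the run above: `summarize(messages)`.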

This is how you can use ScrapeGraphAI and PandasAI together: an AI-native pipeline that handles the scraping, cleaning, and plotting for you, so that you can spend your time on the analysis itself.

Frequently Asked Questions (FAQ)

Are PandasAI and ScrapeGraphAI open source?

Yes, both are open source. Do leave a star on our repos.

Why should I use ScrapeGraphAI and PandasAI together?

Both fit seamlessly into a data pipeline, and both are AI powered.

In which scenarios is the ScrapeGraphAI + PandasAI combo most powerful?

For competitor analysis or for studying product prices on e-commerce websites, this combination works well: such data is quantitative, ScrapeGraphAI can scrape it, and PandasAI can transform it and plot graphs to visualize it.
