PandasAI x ScrapeGraphAI️: Building a scraping agent that makes analytics

·6 min read min read·Tutorials
PandasAI x ScrapeGraphAI️: Building a scraping agent that makes analytics

Having clean data and neatly visualising that is the backbone of data science and business analytics which can boost your understanding of the market you are in.
Finding the right tools for doing this is easier and smooth than it has ever been.
Introducing ScrapeGraphAI X PandasAI

What is ScrapeGraphAI 🕷️

ScrapeGraphAI is an API for extracting data from the web with the use of AI.
So it will help you with the data part which is focussed on scraping and aggregating information from various sources to gain insights from. This service will fit in your data pipeline perfectly because of the easy to use apis that we provide which are fast and accurate.And it's all AI powered.

What is PandasAI 🐼

PandaAI is an open-source framework that brings together intelligent data processing and natural language analysis. Whether you're working with complex datasets or just starting your data journey, PandaAI provides the tools to define, process, and analyze your data efficiently. Through its powerful data preparation layer and intuitive natural language interface, you can transform raw data into actionable insights without writing complex cod created by Gabriele Venturi in 2022.
PandasAI fits into our second piece which helps you perform operations on the data and visualize the data that we scraped from ScrapeGraphAI. It will help you perform operations on the data frames via natural language avoiding the tedious work that goes into transforming data. It will also plot the data for you using prompting.

How to create a dataset for extracting data from web and create analytics

You can either clone the repository and run the notebook directly or follow the tutorial

Repository with the tutorial

In the guide below we are going to create a simple agent which has access to two tools called analyze data and scrape website which will handle the plotting of the graph and scraping of the website using the PandasAI and ScrapeGraphAI. We are going to use OpenAI as our model provider you can choose the model provider of your own

1.First you need to install the following dependencies

bash
ipython==9.2.0  
langchain_core==0.3.59  
langchain_openai==0.3.16  
langgraph==0.4.3  
pandasai  
numpy==1.24.4  
python-dotenv==1.1.0  
scrapegraph_py==1.12.0

2.Then after that import the following packages:

python
from dotenv import load_dotenv  
import pandas as pd  
import pandasai as pai  
import logging  
from scrapegraph_py import Client  
from langchain_openai import ChatOpenAI  
from langgraph.graph import MessagesState, StateGraph, START  
from langgraph.prebuilt import tools_condition, ToolNode  
from langchain_core.messages import HumanMessage, SystemMessage

3.Then you need to load the your API Keys:

python
# Load environment variables from .env file  
load_dotenv()

# Access environment variables  
scrape_graph_api_key = os.getenv('SCRAPEGRAPH_API_KEY')  
pandasai_api_key = os.getenv('PANDASAI_API_KEY')  
openai_api_key = os.getenv('OPENAI_API_KEY')

# Set up logging  
logging.basicConfig(level=logging.INFO)  
(You can get the api keys from following links)

[https://dashboard.scrapegraphai.com/](https://dashboard.scrapegraphai.com/)  
[https://app.pandabi.ai/sign-in](https://app.pandabi.ai/sign-in)  
[https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
  1. Now we will define our Tool which will use the PandasAI

This tool will plot the graphs for us. we have added some validations on the data to have consistent values

python
def analyze_data(df_dict, question):  
    """  
    Analyze the dataset and answer a question using pandasai.

    Parameters:  
    - df_dict (dict): Dictionary containing data as lists under arbitrary keys and numbers as floats  
    - question (str): Prompt for plotting the graph based on the data

    Returns:  
    - The model's response

    Raises:  
    - ValueError: If input cannot be converted to a valid DataFrame  
    """  
    try:  
        # Convert dictionary to DataFrame  
        df = pd.DataFrame(df_dict)

        # Clean columns  
        for column in df.columns:  
            # Check if the column might contain mixed data  
            if df[column].dtype == object:  
                # Remove 'default' or 'N/A' entries  
                df = df[~df[column].isin(['default', 'N/A'])]

                # Function to convert strings to float if they represent numbers  
                def convert_to_float(value):  
                    if isinstance(value, str):  
                        # Remove currency symbols and replace commas with dots  
                        cleaned_value = value.replace('€', '').replace(',', '.')  
                        # Check if the cleaned value can be converted to float  
                        try:  
                            return float(cleaned_value)  
                        except ValueError:  
                            # Return original value if it can't be converted  
                            return value  
                    return value

                # Apply conversion to the column  
                df[column] = df[column].apply(convert_to_float)

Ready to Scale Your Data Collection?

Join thousands of businesses using ScrapeGrapAI to automate their web scraping needs. Start your journey today with our powerful API.

text
            # If the column contains numeric values, ensure they're float  
            if df[column].apply(lambda x: isinstance(x, (int, float))).any():  
                try:  
                    df[column] = pd.to_numeric(df[column], errors='coerce')  
                    # Drop rows with NaN in this column  
                    df = df.dropna(subset=[column]) if df[column].notna().any() else df  
                except (AttributeError, ValueError):  
                    continue

    logging.info("Setting pandasai API key")  
    pai.api_key.set(pandasai_api_key)

    logging.info("Creating DataFrame for analysis")  
    # Use pandasai directly with the DataFrame  
    response = pai.DataFrame(df).chat(question)

    # Save any generated plot  
    response.save('plot.png')  
    return response

except Exception as e:  
    logging.error(f"Error in analyze_data: {str(e)}")  
    raise ValueError(f"Failed to process data: {str(e)}")
text

5.Now we will define our Tool which will use ScrapeGraphAI

This tool accepts website url and user prompt as arguments and scrapes the content of the website 

```python
def scrape_website(website_url, user_prompt) -> dict:  
    """  
    Perform a scraping request on a website using ScrapeGraphAI.

    Parameters:  
    - website_url (str): The URL of the website to scrape  
    - user_prompt (str): The data extraction prompt

    Returns:  
    - A dictionary containing the scraped data  
    """  
    try:  
        sgai_client = Client(api_key=scrape_graph_api_key)  
        logging.info("Creating ScrapeGraphAI client")

        response = sgai_client.smartscraper(  
            website_url=website_url,  
            user_prompt=user_prompt  
        )

        sgai_client.close()  
        result = response.values()  
        # Extract the data dictionary from the response (usually at index 4)  
        data_dict = list(result)

        # Ensure the return is a dictionary  
        if not isinstance(data_dict, dict):  
            data_dict = {"data": data_dict}

        return data_dict

    except Exception as e:  
        logging.error(f"Error in scrape_website: {str(e)}")  
        raise

6.Now we will create the Agent and add these tools to the agent

python
# Create the agent  
tools = [analyze_data, scrape_website]  
llm = ChatOpenAI(model="gpt-4o", api_key=openai_api_key)  
llm_with_tools = llm.bind_tools(tools)

# System message for the assistant  
sys_msg = SystemMessage(content="""You are a helpful assistant tasked with   performing scraping scripts with scrapegraphai and analyzing the data with pandasai.  
You have access to the following tools:  
- scrape_website: to scrape a website  
- analyze_data: to analyze a pandas dataframe  
""")

# Assistant function  
def assistant(state: MessagesState):  
   return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])]}

# Build the graph  
builder = StateGraph(MessagesState)  
builder.add_node("assistant", assistant)  
builder.add_node("tools", ToolNode(tools))  
builder.add_edge(START, "assistant")  
builder.add_conditional_edges(  
   "assistant",  
   tools_condition,  
)  
builder.add_edge("tools", "assistant")  
react_graph = builder.compile()  
7. You can see the plot of the graph by using the below code:

from IPython.display import Image, display  
from langchain_core.runnables.graph import MermaidDrawMethod  
display(  
   Image(  
       react_graph.get_graph().draw_mermaid_png(  
           draw_method=MermaidDrawMethod.API,  
       )  
   )  
)

This is the structure of our graph

9.Now lets run the agent

python
messages = react_graph.invoke(  
   input={  
       "messages": [HumanMessage(content="""Draw me a histogram for rating against price for the products in the following link:  
                        https://www.amazon.com/s?k=keyboards&crid=2F2S3TU22QHOF&sprefix=keyboar%2Caps%2C442&ref=nb_sb_noss_2""")]  
   }  
)  

It will output a file called plot.png in your directory

This is how you can use ScrapeGraphAI and PandasAI together and enhance your data analysis work by 10x with our AI native pipelines that handle the complex stuff for you.

Frequently Asked Question (FAQ)

Are PandasAI and ScrapeGraphAI open source?

Absolutely both the services are open source, do leave a star on our repos
Why should I use ScrapeGraphAI and PandasAI together?

Both fit into the data pipelines seamlessly and work perfectly and both the services are AI powered.

In which scenarios is the ScrapeGraphAI + PandasAI combo most powerful?

If you want to do some research on competitor analysis or study the prices of products on e-commerce websites then this is a perfect combination because such data is quantitative and can be scraped using ScrapeGraphAI and PandasAI can help transform that data and plot beautiful graphs for visualizing.

Ready to Scale Your Data Collection?

Join thousands of businesses using ScrapeGrapAI to automate their web scraping needs. Start your journey today with our powerful API.

Did you find this article helpful?

Share it with your network!