ScrapeGraphAI x LlamaIndex: A Game-Changer for M&A Data Extraction

5 min read · Tutorials

The integration of ScrapeGraphAI with LlamaIndex unlocks powerful capabilities for structured content extraction. Whether you're monitoring mergers and acquisitions (M&A) or pulling detailed company information, this integration simplifies the workflow and keeps the extracted output consistent.

In this blog, we'll walk you through:

  1. How to integrate ScrapeGraphAI with LlamaIndex.
  2. An example use case: Extracting M&A data using the combined power of ScrapeGraphAI and LlamaIndex.
  3. A practical Python implementation.

Why LlamaIndex + ScrapeGraphAI?

The partnership between ScrapeGraphAI and LlamaIndex combines advanced AI-driven web scraping with LlamaIndex's indexing and querying capabilities. This integration enables:

  • Seamless structured data extraction.
  • Defined output schemas for consistency.
  • Enhanced scalability for dynamic content needs.

Getting Started

Step 1: Install Dependencies

To use the tools, install the required Python packages:

bash
pip install llama-index
pip install llama-index-tools-scrapegraph

Step 2: Set Up Your ScrapeGraph API Key

You'll need to set up your API key securely:

python
import os
from getpass import getpass

# Check if the API key is already set in the environment
sgai_api_key = os.getenv("SGAI_API_KEY")

if sgai_api_key:
    print("SGAI_API_KEY found in environment.")
else:
    print("SGAI_API_KEY not found in environment.")
    # Prompt the user to input the API key securely (hidden input)
    sgai_api_key = getpass("Please enter your SGAI_API_KEY: ").strip()
    if sgai_api_key:
        # Set the API key in the environment
        os.environ["SGAI_API_KEY"] = sgai_api_key
        print("SGAI_API_KEY has been set in the environment.")
    else:
        print("No API key entered. Please set the API key to continue.")

Defining Output Schemas

Output schemas provide a blueprint for the data you want to extract. Using Pydantic, you can define schemas for both simple and complex data structures.

Simple Schema Example

For straightforward information extraction:

python
from pydantic import BaseModel, Field

class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")
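
Once a schema is defined, it plugs straight into the extraction call. Here's a minimal sketch using the same `scrapegraph_smartscraper` call demonstrated later in this post (it assumes your `SGAI_API_KEY` is already set, as in Step 2):

python
import os
from llama_index.tools.scrapegraph.base import ScrapegraphToolSpec

# Run a schema-constrained extraction against a single page
tool = ScrapegraphToolSpec()
response = tool.scrapegraph_smartscraper(
    prompt="Extract the page title and description",
    url="https://scrapegraphai.com/",
    api_key=os.environ["SGAI_API_KEY"],
    schema=PageInfoSchema,
)
print(response["result"])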

Complex Schema Example

For structured data with multiple related items:

python
from pydantic import BaseModel, Field
from typing import List, Dict, Optional

# Schema for founder information
class FounderSchema(BaseModel):
    name: str = Field(description="Name of the founder")
    role: str = Field(description="Role of the founder in the company")
    linkedin: str = Field(description="LinkedIn profile of the founder")

# Schema for pricing plans
class PricingPlanSchema(BaseModel):
    tier: str = Field(description="Name of the pricing tier")
    price: str = Field(description="Price of the plan")
    credits: int = Field(description="Number of credits included in the plan")

# Schema for social links
class SocialLinksSchema(BaseModel):
    linkedin: str = Field(description="LinkedIn page of the company")
    twitter: str = Field(description="Twitter page of the company")
    github: str = Field(description="GitHub page of the company")

# Schema for company information
class CompanyInfoSchema(BaseModel):
    company_name: str = Field(description="Name of the company")
    description: str = Field(description="Brief description of the company")
    founders: List[FounderSchema] = Field(description="List of company founders")
    logo: str = Field(description="Logo URL of the company")
    partners: List[str] = Field(description="List of company partners")
    pricing_plans: List[PricingPlanSchema] = Field(description="Details of pricing plans")
    contact_emails: List[str] = Field(description="Contact emails of the company")
    social_links: SocialLinksSchema = Field(description="Social links of the company")
    privacy_policy: str = Field(description="URL to the privacy policy")
    terms_of_service: str = Field(description="URL to the terms of service")
    api_status: str = Field(description="API status page URL")
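
The same pattern extends naturally to the M&A use case from the introduction. As a purely hypothetical sketch (the field names below are illustrative, not a prescribed format), a deal-announcement page could be modeled like this:

python
from pydantic import BaseModel, Field
from typing import List, Optional

# Hypothetical schema for a single M&A deal announcement
class DealSchema(BaseModel):
    acquirer: str = Field(description="Name of the acquiring company")
    target: str = Field(description="Name of the company being acquired")
    deal_value: Optional[str] = Field(default=None, description="Reported deal value, if disclosed")
    announcement_date: Optional[str] = Field(default=None, description="Date the deal was announced")
    status: Optional[str] = Field(default=None, description="Deal status, e.g. announced or completed")

# Hypothetical wrapper for a news page listing multiple deals
class MADealsSchema(BaseModel):
    deals: List[DealSchema] = Field(description="M&A deals found on the page")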

Extracting Data with ScrapeGraphAI and LlamaIndex

Here's how to extract structured data using our schema:

python
from llama_index.tools.scrapegraph.base import ScrapegraphToolSpec

scrapegraph_tool = ScrapegraphToolSpec()

def extract_company_info(url: str, api_key: str):
    response = scrapegraph_tool.scrapegraph_smartscraper(
        prompt="Extract detailed company information including name, description, industry, founding year, employee count, location details, and contact information",
        url=url,
        api_key=api_key,
        schema=CompanyInfoSchema,
    )
    return response["result"]

url = "https://scrapegraphai.com/"
company_info = extract_company_info(url, sgai_api_key)

Output:

json
{
  "company_name": "ScrapeGraphAI",
  "description": "Transform any website into clean, organized data for AI agents and Data Analytics. Enhance your apps with our AI-powered API.",
  "founders": [
    {
      "name": "Marco Perini",
      "role": "Founder & Technical Lead",
      "linkedin": "https://www.linkedin.com/in/perinim/"
    },
    {
      "name": "Marco Vinciguerra",
      "role": "Founder & Software Engineer",
      "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/"
    },
    {
      "name": "Lorenzo Padoan",
      "role": "Founder & Product Engineer",
      "linkedin": "https://www.linkedin.com/in/lorenzo-padoan-4521a2154/"
    }
  ],
  "logo": "https://scrapegraphai.com/images/scrapegraphai_logo.svg",
  "partners": ["LangChain", "PostHog", "AWS", "NVIDIA"],
  "pricing_plans": [
    {
      "tier": "Free",
      "price": "$0",
      "credits": 100
    },
    {
      "tier": "Starter",
      "price": "$20/month",
      "credits": 5000
    },
    {
      "tier": "Growth",
      "price": "$100/month",
      "credits": 40000
    },
    {
      "tier": "Pro",
      "price": "$500/month",
      "credits": 250000
    }
  ],
  "contact_emails": ["contact@scrapegraphai.com"],
  "social_links": {
    "linkedin": "https://www.linkedin.com/company/101881123",
    "twitter": "https://x.com/scrapegraphai",
    "github": "https://github.com/ScrapeGraphAI/Scrapegraph-ai"
  },
  "privacy_policy": "https://scrapegraphai.com/privacy",
  "terms_of_service": "https://scrapegraphai.com/terms",
  "api_status": "https://scrapegraphapi.openstatus.dev"
}
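
Because LLM-backed extraction can occasionally come back incomplete, it's worth re-validating the response against your schema before any downstream processing. A defensive sketch, assuming Pydantic v2's `model_validate`:

python
from pydantic import ValidationError

try:
    # Raises immediately on missing or mistyped fields instead of
    # failing later in the pipeline
    validated = CompanyInfoSchema.model_validate(company_info)
    print(f"Extracted data for {validated.company_name}")
except ValidationError as exc:
    print("Extraction did not match the expected schema:")
    print(exc)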

Processing and Saving Results

To process and save the extracted data:

python
import pandas as pd

# Flatten and organize the data
company_info_flat = {
    "company_name": company_info["company_name"],
    "description": company_info["description"],
    "founders": company_info["founders"],
    "logo": company_info["logo"],
    "partners": company_info["partners"],
    "pricing_plans": company_info["pricing_plans"],
    "contact_emails": ", ".join(company_info["contact_emails"]),
    "privacy_policy": company_info["privacy_policy"],
    "terms_of_service": company_info["terms_of_service"],
    "api_status": company_info["api_status"],
    "linkedin": company_info["social_links"]["linkedin"],
    "twitter": company_info["social_links"]["twitter"],
    "github": company_info["social_links"].get("github", None)
}

# Create separate DataFrames for different aspects
df_company = pd.DataFrame([company_info_flat])
df_founders = pd.DataFrame(company_info["founders"])
df_pricing = pd.DataFrame(company_info["pricing_plans"])
df_partners = pd.DataFrame({"partner": company_info["partners"]})

# Save to CSV files
df_company.to_csv("company_info.csv", index=False)
df_founders.to_csv("founders.csv", index=False)
df_pricing.to_csv("pricing_plans.csv", index=False)
df_partners.to_csv("partners.csv", index=False)
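
One caveat: writing `df_company` to CSV will stringify nested values such as the `founders` and `pricing_plans` lists. If you need to preserve the nesting, a simple option is to also persist the raw response as JSON:

python
import json

# Keep the full nested structure alongside the flat CSVs
with open("company_info.json", "w") as f:
    json.dump(company_info, f, indent=2)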

This integration showcases the power of combining ScrapeGraphAI's structured extraction capabilities with LlamaIndex's data processing features. Whether you're analyzing M&A data, tracking company information, or building comprehensive market research datasets, this combination provides a robust solution for your data extraction needs.

Data Processing Details

Our data processing pipeline involves the following steps:

  1. Data Extraction: We use ScrapeGraphAI's AI-powered API to extract structured data from websites.
  2. Data Transformation: We clean and reshape the extracted data into a format suitable for analysis and visualization.
  3. Data Storage: We store the transformed data in a secure and scalable database (a minimal sketch follows this list).
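
For the storage step, any relational or document store will do. Here's a minimal sketch using SQLite, chosen purely for illustration; substitute your own database:

python
import sqlite3

# Persist the flattened tables from the previous section into SQLite.
# df_company still carries nested lists (founders, pricing_plans, partners),
# so only its scalar columns are written here.
scalar_cols = [c for c in df_company.columns
               if c not in ("founders", "pricing_plans", "partners")]

conn = sqlite3.connect("scraped_data.db")
df_company[scalar_cols].to_sql("companies", conn, if_exists="append", index=False)
df_founders.to_sql("founders", conn, if_exists="append", index=False)
df_pricing.to_sql("pricing_plans", conn, if_exists="append", index=False)
df_partners.to_sql("partners", conn, if_exists="append", index=False)
conn.close()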

Conclusion

In this blog post, we walked through integrating ScrapeGraphAI with LlamaIndex: defining Pydantic output schemas, extracting structured company data, and processing and saving the results. The same workflow scales from one-off company lookups to ongoing M&A monitoring and market-research pipelines.

