ScrapeGraphAI x LlamaIndex: A Game-Changer for M&A Data Extraction
Integrating ScrapeGraphAI with LlamaIndex unlocks powerful capabilities for structured content extraction. Whether you're monitoring mergers and acquisitions (M&A) or extracting detailed company information, this integration simplifies the workflow and improves accuracy.
In this blog, we'll walk you through:
- How to integrate ScrapeGraphAI with LlamaIndex.
- An example use case: Extracting M&A data using the combined power of ScrapeGraphAI and LlamaIndex.
- A practical Python implementation.
Why LlamaIndex + ScrapeGraphAI?
The partnership between ScrapeGraphAI and LlamaIndex combines advanced AI-driven web scraping with LlamaIndex's indexing and querying capabilities. This integration enables:
- Seamless structured data extraction.
- Defined output schemas for consistency.
- Enhanced scalability for dynamic content needs.
Getting Started
Step 1: Install Dependencies
To use the tools, install the required Python packages:
```bash
pip install llama-index
pip install llama-index-tools-scrapegraphai
```
Step 2: Set Your ScrapeGraph API Key
You'll need to set up your API key securely:
```python
import os
from getpass import getpass

# Check if the API key is already set in the environment
sgai_api_key = os.getenv("SGAI_API_KEY")

if sgai_api_key:
    print("SGAI_API_KEY found in environment.")
else:
    print("SGAI_API_KEY not found in environment.")
    # Prompt the user to input the API key securely (hidden input)
    sgai_api_key = getpass("Please enter your SGAI_API_KEY: ").strip()
    if sgai_api_key:
        # Set the API key in the environment
        os.environ["SGAI_API_KEY"] = sgai_api_key
        print("SGAI_API_KEY has been set in the environment.")
    else:
        print("No API key entered. Please set the API key to continue.")
```
Defining Output Schemas
Output schemas provide a blueprint for the data you want to extract. Using Pydantic, you can define schemas for both simple and complex data structures.
Simple Schema Example
For straightforward information extraction:
```python
from pydantic import BaseModel, Field

class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")
```
Complex Schema Example
For structured data with multiple related items:
```python
from pydantic import BaseModel, Field
from typing import List

# Schema for founder information
class FounderSchema(BaseModel):
    name: str = Field(description="Name of the founder")
    role: str = Field(description="Role of the founder in the company")
    linkedin: str = Field(description="LinkedIn profile of the founder")

# Schema for pricing plans
class PricingPlanSchema(BaseModel):
    tier: str = Field(description="Name of the pricing tier")
    price: str = Field(description="Price of the plan")
    credits: int = Field(description="Number of credits included in the plan")

# Schema for social links
class SocialLinksSchema(BaseModel):
    linkedin: str = Field(description="LinkedIn page of the company")
    twitter: str = Field(description="Twitter page of the company")
    github: str = Field(description="GitHub page of the company")

# Schema for company information
class CompanyInfoSchema(BaseModel):
    company_name: str = Field(description="Name of the company")
    description: str = Field(description="Brief description of the company")
    founders: List[FounderSchema] = Field(description="List of company founders")
    logo: str = Field(description="Logo URL of the company")
    partners: List[str] = Field(description="List of company partners")
    pricing_plans: List[PricingPlanSchema] = Field(description="Details of pricing plans")
    contact_emails: List[str] = Field(description="Contact emails of the company")
    social_links: SocialLinksSchema = Field(description="Social links of the company")
    privacy_policy: str = Field(description="URL to the privacy policy")
    terms_of_service: str = Field(description="URL to the terms of service")
    api_status: str = Field(description="API status page URL")
```
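Because these schemas are plain Pydantic models, you can sanity-check a payload locally before wiring anything into the scraper. Here is a minimal sketch; the `DemoFounder`/`DemoCompany` names and sample data are made up for illustration, trimmed-down stand-ins for the schemas above:

```python
from typing import List
from pydantic import BaseModel, Field

# Trimmed-down stand-ins for FounderSchema / CompanyInfoSchema above
class DemoFounder(BaseModel):
    name: str = Field(description="Name of the founder")
    role: str = Field(description="Role of the founder in the company")

class DemoCompany(BaseModel):
    company_name: str = Field(description="Name of the company")
    founders: List[DemoFounder] = Field(description="List of company founders")

payload = {
    "company_name": "ExampleCo",
    "founders": [{"name": "Ada", "role": "CEO"}],
}

# Pydantic validates and coerces the nested structure in one call
company = DemoCompany(**payload)
print(company.founders[0].role)  # CEO
```

If the payload is missing a required field or has the wrong type, Pydantic raises a `ValidationError`, which is exactly the consistency guarantee the output schemas give you at extraction time.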
Extracting Data with ScrapeGraphAI and LlamaIndex
Here's how to extract structured data using our schema:
```python
from llama_index.tools.scrapegraph.base import ScrapegraphToolSpec

scrapegraph_tool = ScrapegraphToolSpec()

def extract_company_info(url: str, api_key: str):
    response = scrapegraph_tool.scrapegraph_smartscraper(
        prompt="Extract detailed company information including name, description, industry, founding year, employee count, location details, and contact information",
        url=url,
        api_key=api_key,
        schema=CompanyInfoSchema,
    )
    return response["result"]

url = "https://scrapegraphai.com/"
company_info = extract_company_info(url, sgai_api_key)
```
Output:
```json
{
  "company_name": "ScrapeGraphAI",
  "description": "Transform any website into clean, organized data for AI agents and Data Analytics. Enhance your apps with our AI-powered API.",
  "founders": [
    {
      "name": "Marco Perini",
      "role": "Founder & Technical Lead",
      "linkedin": "https://www.linkedin.com/in/perinim/"
    },
    {
      "name": "Marco Vinciguerra",
      "role": "Founder & Software Engineer",
      "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/"
    },
    {
      "name": "Lorenzo Padoan",
      "role": "Founder & Product Engineer",
      "linkedin": "https://www.linkedin.com/in/lorenzo-padoan-4521a2154/"
    }
  ],
  "logo": "https://scrapegraphai.com/images/scrapegraphai_logo.svg",
  "partners": ["LangChain", "PostHog", "AWS", "NVIDIA"],
  "pricing_plans": [
    { "tier": "Free", "price": "$0", "credits": 100 },
    { "tier": "Starter", "price": "$20/month", "credits": 5000 },
    { "tier": "Growth", "price": "$100/month", "credits": 40000 },
    { "tier": "Pro", "price": "$500/month", "credits": 250000 }
  ],
  "contact_emails": ["contact@scrapegraphai.com"],
  "social_links": {
    "linkedin": "https://www.linkedin.com/company/101881123",
    "twitter": "https://x.com/scrapegraphai",
    "github": "https://github.com/ScrapeGraphAI/Scrapegraph-ai"
  },
  "privacy_policy": "https://scrapegraphai.com/privacy",
  "terms_of_service": "https://scrapegraphai.com/terms",
  "api_status": "https://scrapegraphapi.openstatus.dev"
}
```
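The same pattern carries over to the M&A monitoring use case mentioned at the start: you swap in a deal-oriented schema and a matching prompt. The sketch below is hypothetical; the field names and the `MADealSchema`/`MANewsPageSchema` classes are illustrative, not part of the ScrapeGraphAI API:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

# Hypothetical schema for a single M&A deal mentioned on a news page
class MADealSchema(BaseModel):
    acquirer: str = Field(description="Name of the acquiring company")
    target: str = Field(description="Name of the company being acquired")
    deal_value: Optional[str] = Field(default=None, description="Reported deal value, e.g. '$1.2B'")
    announcement_date: Optional[str] = Field(default=None, description="Date the deal was announced")
    status: str = Field(description="Deal status, e.g. 'announced', 'pending', 'completed'")

# Top-level schema: one page can mention several deals
class MANewsPageSchema(BaseModel):
    deals: List[MADealSchema] = Field(description="All M&A deals mentioned on the page")
```

You would then pass `schema=MANewsPageSchema` to `scrapegraph_smartscraper` with a prompt along the lines of "Extract every M&A deal mentioned on this page, including acquirer, target, deal value, announcement date, and status."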
Processing and Saving Results
To process and save the extracted data:
```python
import pandas as pd

# Flatten and organize the data
company_info_flat = {
    "company_name": company_info["company_name"],
    "description": company_info["description"],
    "founders": company_info["founders"],
    "logo": company_info["logo"],
    "partners": company_info["partners"],
    "pricing_plans": company_info["pricing_plans"],
    "contact_emails": ", ".join(company_info["contact_emails"]),
    "privacy_policy": company_info["privacy_policy"],
    "terms_of_service": company_info["terms_of_service"],
    "api_status": company_info["api_status"],
    "linkedin": company_info["social_links"]["linkedin"],
    "twitter": company_info["social_links"]["twitter"],
    "github": company_info["social_links"].get("github", None),
}

# Create separate DataFrames for different aspects
df_company = pd.DataFrame([company_info_flat])
df_founders = pd.DataFrame(company_info["founders"])
df_pricing = pd.DataFrame(company_info["pricing_plans"])
df_partners = pd.DataFrame({"partner": company_info["partners"]})

# Save to CSV files
df_company.to_csv("company_info.csv", index=False)
df_founders.to_csv("founders.csv", index=False)
df_pricing.to_csv("pricing_plans.csv", index=False)
df_partners.to_csv("partners.csv", index=False)
```
This integration showcases the power of combining ScrapeGraphAI's structured extraction capabilities with LlamaIndex's data processing features. Whether you're analyzing M&A data, tracking company information, or building comprehensive market research datasets, this combination provides a robust solution for your data extraction needs.
Data Processing Details
Our data processing pipeline involves the following steps:
- Data Extraction: We use ScrapeGraphAI's AI-powered API to extract structured data from websites.
- Data Transformation: We clean and transform the extracted data into a format suitable for analysis and visualization.
- Data Storage: We store the transformed data in a secure and scalable database.
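Put together, the three steps above can be sketched end to end. This is a minimal sketch, not production code: the extraction step is stubbed out (in practice it would be the `extract_company_info` call from earlier), and a local SQLite file stands in for "a secure and scalable database":

```python
import sqlite3

# Extraction (stubbed): stand-in for extract_company_info(url, sgai_api_key)
def extract_stub(url: str) -> dict:
    return {"company_name": "ExampleCo", "contact_emails": ["a@example.com", "b@example.com"]}

# Transformation: flatten list fields into storable strings
def transform(record: dict) -> dict:
    return {
        "company_name": record["company_name"],
        "contact_emails": ", ".join(record["contact_emails"]),
    }

# Storage: persist into SQLite and return the current row count
def store(row: dict, db_path: str = ":memory:") -> int:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS companies (company_name TEXT, contact_emails TEXT)")
    conn.execute("INSERT INTO companies VALUES (?, ?)", (row["company_name"], row["contact_emails"]))
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
    conn.close()
    return count

row = transform(extract_stub("https://scrapegraphai.com/"))
print(store(row))  # number of stored rows
```

Swapping the stub for the real extraction call and SQLite for your database of choice gives you the full pipeline.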
Conclusion
In this blog post, we walked through integrating ScrapeGraphAI with LlamaIndex: defining Pydantic output schemas, extracting structured company data with the smartscraper tool, and processing and saving the results. Whether you're monitoring M&A activity, tracking company information, or building comprehensive market research datasets, this combination provides a robust, schema-driven foundation for your data extraction needs.
Did you find this article helpful?
Share it with your network!