Integrating ScrapeGraphAI with LlamaIndex: A Practical Guide
When I first started working with web scraping and data extraction, I spent way too much time writing custom parsers and dealing with inconsistent data formats. Then I discovered the power of combining ScrapeGraphAI with LlamaIndex, and it completely changed how I approach structured data extraction.
Let me show you how to set up this integration and use it for something practical - extracting company information in a structured way.
Why This Combination Works
ScrapeGraphAI handles the messy part of web scraping - navigating websites, dealing with JavaScript, and understanding page layouts. LlamaIndex excels at organizing and querying that data once you have it.
Together, they solve a common problem: getting clean, structured data from websites that you can actually use in your applications.
Setting Up the Integration
First, let's get everything installed:
pip install llama-index llama-index-tools-scrapegraph
Next, you'll need to set up your API key. Here's a simple way to handle it:
import os
from getpass import getpass

# Check if the API key is already set
sgai_api_key = os.getenv("SGAI_API_KEY")

if not sgai_api_key:
    # Prompt for the API key if not found
    sgai_api_key = getpass("Enter your SGAI_API_KEY: ").strip()
    os.environ["SGAI_API_KEY"] = sgai_api_key
    print("API key set successfully.")
else:
    print("API key found in environment.")
Defining Your Data Structure
This is where the magic happens. Instead of hoping your scraper returns consistent data, you define exactly what you want using Pydantic schemas.
Simple Example: Basic Page Info
Let's start with something simple:
from pydantic import BaseModel, Field

class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")
Complex Example: Company Information
For more complex data, you can create nested schemas:
from pydantic import BaseModel, Field
from typing import List

class FounderSchema(BaseModel):
    name: str = Field(description="Name of the founder")
    role: str = Field(description="Role of the founder in the company")
    linkedin: str = Field(description="LinkedIn profile of the founder")

class PricingPlanSchema(BaseModel):
    tier: str = Field(description="Name of the pricing tier")
    price: str = Field(description="Price of the plan")
    credits: int = Field(description="Number of credits included in the plan")

class SocialLinksSchema(BaseModel):
    linkedin: str = Field(description="LinkedIn page of the company")
    twitter: str = Field(description="Twitter page of the company")
    github: str = Field(description="GitHub page of the company")

class CompanyInfoSchema(BaseModel):
    company_name: str = Field(description="Name of the company")
    description: str = Field(description="Brief description of the company")
    founders: List[FounderSchema] = Field(description="List of company founders")
    logo: str = Field(description="Logo URL of the company")
    partners: List[str] = Field(description="List of company partners")
    pricing_plans: List[PricingPlanSchema] = Field(description="Details of pricing plans")
    contact_emails: List[str] = Field(description="Contact emails of the company")
    social_links: SocialLinksSchema = Field(description="Social links of the company")
    privacy_policy: str = Field(description="URL to the privacy policy")
    terms_of_service: str = Field(description="URL to the terms of service")
    api_status: str = Field(description="API status page URL")
Extracting Structured Data
Now let's put it all together and extract some real data:
from llama_index.tools.scrapegraph.base import ScrapegraphToolSpec
scrapegraph_tool = ScrapegraphToolSpec()
def extract_company_info(url: str, api_key: str):
    response = scrapegraph_tool.scrapegraph_smartscraper(
        prompt="Extract detailed company information including name, description, founders, pricing, and contact details",
        url=url,
        api_key=api_key,
        schema=CompanyInfoSchema,
    )
    return response["result"]
# Extract data from a real website
url = "https://scrapegraphai.com/"
company_info = extract_company_info(url, sgai_api_key)
This will return structured data like:
{
  "company_name": "ScrapeGraphAI",
  "description": "Transform any website into clean, organized data for AI agents and Data Analytics.",
  "founders": [
    {
      "name": "Marco Perini",
      "role": "Founder & Technical Lead",
      "linkedin": "https://www.linkedin.com/in/perinim/"
    },
    {
      "name": "Marco Vinciguerra",
      "role": "Founder & Software Engineer",
      "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/"
    }
  ],
  "pricing_plans": [
    {
      "tier": "Free",
      "price": "$0",
      "credits": 100
    },
    {
      "tier": "Starter",
      "price": "$20/month",
      "credits": 5000
    }
  ],
  "contact_emails": ["contact@scrapegraphai.com"],
  "social_links": {
    "linkedin": "https://www.linkedin.com/company/101881123",
    "twitter": "https://x.com/scrapegraphai",
    "github": "https://github.com/ScrapeGraphAI/Scrapegraph-ai"
  }
}
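Because the result mirrors your schema, you can optionally re-parse it with Pydantic before using it downstream. A minimal sketch, assuming Pydantic v2 (on v1 you'd use parse_obj instead of model_validate); a ValidationError here is itself a useful signal that the extraction came back incomplete:

# Optional: re-validate the raw result against the schema for type guarantees
validated = CompanyInfoSchema.model_validate(company_info)
print(validated.company_name, len(validated.founders))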
Processing and Saving Your Data
Once you have structured data, you can easily work with it:
import pandas as pd
# Create separate DataFrames for different aspects
df_company = pd.DataFrame([{
    "company_name": company_info["company_name"],
    "description": company_info["description"],
    "logo": company_info["logo"],
    "contact_emails": ", ".join(company_info["contact_emails"]),
    "linkedin": company_info["social_links"]["linkedin"],
    "twitter": company_info["social_links"]["twitter"],
    "github": company_info["social_links"].get("github", ""),
}])
df_founders = pd.DataFrame(company_info["founders"])
df_pricing = pd.DataFrame(company_info["pricing_plans"])
df_partners = pd.DataFrame({"partner": company_info["partners"]})
# Save to CSV files
df_company.to_csv("company_info.csv", index=False)
df_founders.to_csv("founders.csv", index=False)
df_pricing.to_csv("pricing_plans.csv", index=False)
df_partners.to_csv("partners.csv", index=False)
Real-World Applications
I've used this approach for several practical projects:
Market Research: Extracting competitor information across multiple company websites to build comparison charts.
Lead Generation: Gathering contact information and company details from directory websites.
Price Monitoring: Tracking pricing changes across different service providers.
Due Diligence: Collecting detailed company information for investment research.
Tips for Success
Start simple: Begin with basic schemas and add complexity as needed.
Test your prompts: Different prompts can yield different results. Experiment to find what works best.
Handle missing data: Not all websites have all the information you're looking for. Make fields optional where appropriate (see the sketch after this list).
Validate your results: Always check the extracted data for accuracy, especially for important business decisions.
Batch processing: If you're scraping multiple sites, add delays between requests to be respectful.
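For the missing-data tip, here's a minimal sketch of how you might relax the founder schema when sites don't publish every detail. Which fields to make optional is my own example choice, not part of the original schema:

from typing import Optional
from pydantic import BaseModel, Field

class FounderSchema(BaseModel):
    name: str = Field(description="Name of the founder")
    # Optional fields default to None when the page doesn't mention them
    role: Optional[str] = Field(default=None, description="Role of the founder, if listed")
    linkedin: Optional[str] = Field(default=None, description="LinkedIn profile, if available")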
Common Challenges
Inconsistent website structures: Some sites are easier to scrape than others. You might need to adjust your schema for different types of sites.
Missing information: Not every website will have all the data you're looking for. Build your schemas to handle optional fields.
Rate limiting: APIs have limits. Plan your scraping accordingly and implement proper error handling (a retry sketch follows this list).
Data quality: AI isn't perfect. Always validate critical information before using it.
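One way to handle rate limits and transient API errors is a small retry wrapper around the extraction call. This is a sketch that reuses extract_company_info from earlier; the broad except clause and the timing values are placeholders you'd tune for your setup:

import time

def extract_with_retries(url: str, api_key: str, attempts: int = 3, delay: float = 5.0):
    # Try the extraction a few times, waiting between failed attempts
    for attempt in range(1, attempts + 1):
        try:
            return extract_company_info(url, api_key)
        except Exception as exc:  # narrow this to the SDK's error types in real code
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)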
The Data Processing Pipeline
Here's how I typically structure my data processing workflow:
- Extract: Use ScrapeGraphAI to get raw data from websites
- Transform: Clean and normalize the data using your schemas
- Store: Save the structured data in a database or files
- Analyze: Use the organized data for your business needs (the last two steps are sketched below)
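As a rough illustration of the store and analyze steps, you can persist the structured result as JSON and hand it to LlamaIndex for natural-language querying. This sketch assumes an LLM and embedding model are already configured for LlamaIndex (for example via OPENAI_API_KEY):

import json
from llama_index.core import Document, VectorStoreIndex

# Store: persist the structured result to disk
with open("company_info.json", "w") as f:
    json.dump(company_info, f, indent=2)

# Analyze: index the structured data and query it in natural language
doc = Document(text=json.dumps(company_info))
index = VectorStoreIndex.from_documents([doc])
query_engine = index.as_query_engine()
print(query_engine.query("Who founded the company and what do the pricing tiers cost?"))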
Advanced Usage
Once you're comfortable with the basics, you can:
- Create more complex schemas with nested relationships
- Set up automated scraping schedules
- Build data pipelines that process multiple sites (see the sketch after this list)
- Integrate with your existing data systems
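A minimal multi-site pipeline might just loop over a URL list with a polite delay and flatten the results into one table. The URL list and delay below are placeholders, and the loop reuses extract_company_info from earlier:

import time
import pandas as pd

urls = [
    "https://scrapegraphai.com/",
    # add the other sites you want to profile
]

rows = []
for url in urls:
    rows.append(extract_company_info(url, sgai_api_key))
    time.sleep(2)  # small delay between requests to be respectful

# Nested lists (founders, pricing plans) stay as raw objects in their cells
pd.json_normalize(rows).to_csv("companies.csv", index=False)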
Troubleshooting
Schema validation errors: Check that your field descriptions match what's actually on the website.
Missing data: Make sure your prompts are specific enough to guide the AI to the right information.
API errors: Verify your API key is set correctly and you haven't exceeded rate limits.
Inconsistent results: Try refining your prompts or breaking complex extractions into smaller steps.
Final Thoughts
The combination of ScrapeGraphAI and LlamaIndex has made my data extraction workflows much more reliable and scalable. Instead of writing custom scrapers for each website, I can define what I want and let the AI figure out how to get it.
Start with simple examples, experiment with different schemas, and gradually build up to more complex use cases. The key is to focus on the data structure you need, not the mechanics of how to extract it.
This approach has saved me countless hours of maintenance and debugging, and I think it'll do the same for you.
Quick FAQ
Q: How accurate is the extracted data? A: Pretty good, but always validate important information. The AI understands context well but isn't perfect.
Q: Can I use this for any website? A: Most websites work, but some with heavy anti-bot measures might be challenging.
Q: What if the website structure changes? A: That's the beauty of this approach - the AI adapts to layout changes better than traditional scrapers.
Q: How much does it cost? A: It depends on your usage. You pay for API calls, so start small and scale as needed.
Q: Can I extract data from password-protected sites? A: Currently, this works best with publicly accessible content.
Remember to always respect robots.txt files and terms of service when scraping websites.