ScrapeGraphAI

Social Media Scraping for Trend Analysis: Collecting Insights from Instagram, LinkedIn, and Reddit

Marco Vinciguerra

In the world of data-driven decisions, social media platforms like Instagram, LinkedIn, and Reddit are goldmines of public opinion, user behavior, and emerging trends. Whether you're building a trend monitoring dashboard, training an AI model, or running competitive analysis — the first step is structured data extraction. Enter ScrapeGraphAI: a smart web scraping framework that leverages LLMs (Large Language Models) to convert messy HTML into structured, ready-to-use JSON.

This blog will walk you through how to collect social media insights programmatically from Instagram, LinkedIn, and Reddit using ScrapeGraphAI, and how you can use that data for trend analysis.


Why Scrape Social Platforms?

Scraping Instagram, LinkedIn, and Reddit allows companies, researchers, and developers to:

  • Track hashtags, topics, or keywords over time.
  • Analyze engagement across communities or audiences.
  • Understand sentiment toward products, services, or competitors.
  • Detect trending discussions, pain points, or user needs in real time.
  • Build trend dashboards, AI training datasets, or market intelligence tools.

Why Use ScrapeGraphAI?

ScrapeGraphAI combines the power of Python with LLM-powered extraction, meaning:

  • You don't need to write brittle regex or custom parsers.
  • You define your schema (like "title", "author", "timestamp"), and the model fills it in.
  • It's modular — choose different browser engines, proxies, and LLMs (OpenAI, Groq, Mistral, etc.).
  • You get clean JSON directly from semi-structured pages.

Perfect for dynamic, changing pages like Reddit threads or Instagram profiles.


🔧 Setting Up the Project

First, install ScrapeGraphAI:

pip install scrapegraphai

You'll also need an LLM API key (OpenAI, Groq, or any compatible provider) and optionally a proxy service like BrightData.
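If you'd rather not hard-code the key in your scripts, a small optional sketch is to read it from an environment variable (the variable name below is just a common convention, not something ScrapeGraphAI requires):

import os

# Assumed setup: the key lives in the OPENAI_API_KEY environment variable.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set OPENAI_API_KEY before running the scrapers")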


Example 1: Scraping Reddit for Trending Discussions

Let's extract top posts from a subreddit.

import json
from scrapegraphai.graphs import SmartScraperGraph
 
# Define what we want to extract
schema = {
    "posts": [
        {
            "title": "Post title",
            "author": "Username who posted",
            "upvotes": "Number of upvotes",
            "comments": "Number of comments",
            "timestamp": "When it was posted"
        }
    ]
}
 
# Set up the scraper
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_KEY",
        "model": "gpt-3.5-turbo",
    },
    "browser": {
        "headless": False  # Set to True for production
    }
}
 
# Create and run the scraper
smart_scraper = SmartScraperGraph(
    prompt="Extract the trending posts from this subreddit page",
    source="https://reddit.com/r/technology",
    config=graph_config,
    schema=schema
)
 
result = smart_scraper.run()
print(json.dumps(result, indent=2))

This will return structured JSON with trending posts from r/technology.
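Because the result is plain JSON, you can post-process it with ordinary Python. As a hedged example (field types depend on what the model actually extracted, so counts may come back as strings), here is one way to rank the posts by upvotes:

def to_int(value):
    """Best-effort conversion of an extracted count to an integer."""
    try:
        return int(str(value).replace(",", ""))
    except (TypeError, ValueError):
        return 0

posts = result.get("posts", [])
for post in sorted(posts, key=lambda p: to_int(p.get("upvotes")), reverse=True)[:5]:
    print(f"{post.get('title')} ({post.get('upvotes')} upvotes)")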

Example 2: Instagram Profile Analysis

from scrapegraphai.graphs import SmartScraperGraph
 
# Instagram profile schema
instagram_schema = {
    "profile": {
        "username": "Profile username",
        "followers": "Number of followers",
        "following": "Number following",
        "posts_count": "Total posts",
        "bio": "Profile bio text",
        "verified": "Is verified account",
        "recent_posts": [
            {
                "caption": "Post caption",
                "likes": "Number of likes",
                "comments": "Number of comments",
                "timestamp": "When posted"
            }
        ]
    }
}
 
graph_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",
        "model": "gpt-4",
    }
}
 
scraper = SmartScraperGraph(
    prompt="Extract profile information and recent posts data",
    source="https://instagram.com/username",
    config=graph_config,
    schema=instagram_schema
)
 
profile_data = scraper.run()
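As a quick, hedged follow-up, you might summarize the extracted recent posts, for example by averaging likes (the field names match the schema above; the values are whatever the model returned, so the conversion is defensive):

recent = profile_data.get("profile", {}).get("recent_posts", [])
likes = []
for post in recent:
    try:
        likes.append(int(str(post.get("likes", 0)).replace(",", "")))
    except ValueError:
        continue
if likes:
    print(f"Average likes over {len(likes)} recent posts: {sum(likes) / len(likes):.0f}")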

Building a Trend Analysis Pipeline

Now let's create a complete pipeline that monitors multiple platforms:

import time
import pandas as pd
from datetime import datetime
from scrapegraphai.graphs import SmartScraperGraph
 
class SocialTrendMonitor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.data_store = []
        
    def scrape_reddit_trends(self, subreddit):
        """Scrape trending posts from Reddit"""
        config = {
            "llm": {"api_key": self.api_key, "model": "gpt-3.5-turbo"}
        }
        
        schema = {
            "posts": [
                {
                    "title": "str",
                    "upvotes": "int", 
                    "comments": "int",
                    "author": "str"
                }
            ]
        }
        
        scraper = SmartScraperGraph(
            prompt=f"Extract trending posts from r/{subreddit}",
            source=f"https://reddit.com/r/{subreddit}",
            config=config,
            schema=schema
        )
        
        return scraper.run()
    
    def scrape_linkedin_posts(self, hashtag):
        """Scrape LinkedIn posts by hashtag"""
        # Similar implementation for LinkedIn
        pass
    
    def analyze_trends(self, data):
        """Analyze extracted data for trends"""
        df = pd.DataFrame(data)
        
        # Simple trend analysis
        trending_keywords = self.extract_keywords(df['title'].tolist())
        engagement_metrics = df['upvotes'].describe()
        
        return {
            "trending_keywords": trending_keywords,
            "engagement_stats": engagement_metrics.to_dict(),
            "timestamp": datetime.now().isoformat()
        }
    
    def extract_keywords(self, titles):
        """Extract trending keywords from titles"""
        # Simple frequency count below; swap in an NLP library for smarter keyword extraction
        all_words = " ".join(titles).lower().split()
        word_freq = {}
        
        for word in all_words:
            if len(word) > 3:  # Filter short words
                word_freq[word] = word_freq.get(word, 0) + 1
        
        # Return top 10 most frequent words
        return sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
 
# Usage
monitor = SocialTrendMonitor(api_key="your_key")
reddit_data = monitor.scrape_reddit_trends("ArtificialIntelligence")  # subreddit names cannot contain spaces
trends = monitor.analyze_trends(reddit_data['posts'])
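Building on the class above, here is a minimal scheduling sketch (the subreddit list and one-hour interval are illustrative choices) that keeps each analysis snapshot in the data_store list the class already defines:

subreddits = ["technology", "MachineLearning", "datascience"]

for _ in range(3):  # three polling rounds, just for the demo
    for sub in subreddits:
        raw = monitor.scrape_reddit_trends(sub)
        snapshot = monitor.analyze_trends(raw["posts"])
        snapshot["subreddit"] = sub
        monitor.data_store.append(snapshot)
    time.sleep(60 * 60)  # wait an hour between rounds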

Advanced Use Cases

1. Sentiment Analysis Dashboard

from textblob import TextBlob
 
def analyze_sentiment(text_data):
    sentiments = []
    for text in text_data:
        blob = TextBlob(text)
        sentiments.append({
            'text': text,
            'polarity': blob.sentiment.polarity,
            'subjectivity': blob.sentiment.subjectivity
        })
    return sentiments
 
# Apply to scraped social media posts
reddit_posts = monitor.scrape_reddit_trends("cryptocurrency")
sentiment_data = analyze_sentiment([post['title'] for post in reddit_posts['posts']])
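To turn the per-post scores into something dashboard-friendly, a simple aggregation (just a sketch) is the average polarity across all scraped titles:

if sentiment_data:
    avg_polarity = sum(s["polarity"] for s in sentiment_data) / len(sentiment_data)
    print(f"Average polarity across {len(sentiment_data)} posts: {avg_polarity:.2f}")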

2. Competitive Brand Monitoring

def monitor_brand_mentions(brand_name, platforms):
    """Monitor brand mentions across platforms"""
    mentions = {}
    
    for platform in platforms:
        if platform == 'reddit':
            data = search_reddit_mentions(brand_name)
        elif platform == 'instagram':
            data = search_instagram_hashtags(brand_name)
        # Add more platforms
        
        mentions[platform] = data
    
    return mentions
 
# Track your brand vs competitors
brand_data = monitor_brand_mentions("YourBrand", ["reddit", "instagram"])
competitor_data = monitor_brand_mentions("Competitor", ["reddit", "instagram"])
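The helper functions above (search_reddit_mentions and search_instagram_hashtags) are left for you to implement. Here is a hedged sketch of the Reddit one built on SmartScraperGraph; the search URL pattern and the schema are illustrative assumptions, not part of the library:

from scrapegraphai.graphs import SmartScraperGraph

def search_reddit_mentions(brand_name):
    """Sketch: search Reddit for a brand name and extract matching posts."""
    schema = {
        "mentions": [
            {"title": "Post title", "subreddit": "Subreddit name", "upvotes": "Number of upvotes"}
        ]
    }
    scraper = SmartScraperGraph(
        prompt=f"Extract posts that mention '{brand_name}'",
        source=f"https://www.reddit.com/search/?q={brand_name}",  # assumed search URL pattern
        config={"llm": {"api_key": "YOUR_API_KEY", "model": "gpt-3.5-turbo"}},
        schema=schema,
    )
    return scraper.run()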

3. Real-time Trend Alerts

import smtplib
from email.mime.text import MIMEText
 
class TrendAlert:
    def __init__(self, threshold=100):
        self.threshold = threshold
        
    def check_viral_content(self, posts):
        """Check if any post is going viral"""
        viral_posts = []
        
        for post in posts:
            if post.get('upvotes', 0) > self.threshold:
                viral_posts.append(post)
        
        return viral_posts
    
    def send_alert(self, viral_posts):
        """Send email alert for viral content"""
        if viral_posts:
            message = f"Alert: {len(viral_posts)} posts are trending!"
            # Send email notification
            print(message)
 
# Set up alerts
alert_system = TrendAlert(threshold=500)
reddit_data = monitor.scrape_reddit_trends("technology")
viral_content = alert_system.check_viral_content(reddit_data['posts'])
alert_system.send_alert(viral_content)
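The class imports smtplib and MIMEText but only prints; a hedged sketch of the actual notification (the SMTP host, credentials, and addresses are placeholders you would replace) could look like this:

def send_email_alert(message, subject="Trend alert"):
    """Sketch: send the alert text via SMTP. All connection details are placeholders."""
    msg = MIMEText(message)
    msg["Subject"] = subject
    msg["From"] = "alerts@example.com"
    msg["To"] = "you@example.com"

    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("alerts@example.com", "APP_PASSWORD")
        server.send_message(msg)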

Data Storage and Analytics

Store your scraped data for long-term analysis:

import sqlite3
import json
from datetime import datetime
 
class SocialDataStore:
    def __init__(self, db_path="social_trends.db"):
        self.db_path = db_path
        self.init_database()
    
    def init_database(self):
        """Initialize SQLite database"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS social_posts (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                platform TEXT,
                title TEXT,
                content TEXT,
                author TEXT,
                engagement_score INTEGER,
                timestamp DATETIME,
                scraped_at DATETIME
            )
        """)
        
        conn.commit()
        conn.close()
    
    def store_posts(self, posts, platform):
        """Store scraped posts in database"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        for post in posts:
            cursor.execute("""
                INSERT INTO social_posts 
                (platform, title, content, author, engagement_score, timestamp, scraped_at)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            """, (
                platform,
                post.get('title', ''),
                post.get('content', ''),
                post.get('author', ''),
                post.get('upvotes', 0),
                post.get('timestamp', ''),
                datetime.now()
            ))
        
        conn.commit()
        conn.close()
 
# Usage
data_store = SocialDataStore()
reddit_posts = monitor.scrape_reddit_trends("webdevelopment")
data_store.store_posts(reddit_posts['posts'], 'reddit')
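Once posts accumulate, you can pull them back out for historical analysis. A small sketch using pandas against the same SQLite file (the per-day aggregation is just an example):

import pandas as pd

def load_history(db_path="social_trends.db", platform="reddit"):
    """Sketch: read stored posts back into a DataFrame."""
    conn = sqlite3.connect(db_path)
    df = pd.read_sql_query(
        "SELECT * FROM social_posts WHERE platform = ?", conn, params=(platform,)
    )
    conn.close()
    return df

history = load_history()
print(history.groupby(history["scraped_at"].str[:10])["engagement_score"].mean())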

Best Practices and Considerations

1. Rate Limiting and Respect

import time
import random
 
def respectful_scraping(urls, delay_range=(1, 3)):
    """Implement delays between requests"""
    results = []
    
    for url in urls:
        # Add random delay
        delay = random.uniform(*delay_range)
        time.sleep(delay)
        
        # Perform scraping
        result = scrape_url(url)
        results.append(result)
    
    return results
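The scrape_url function used here (and in the retry example below) is a placeholder. One hedged way to implement it is a thin wrapper around SmartScraperGraph; the prompt and config are assumptions:

from scrapegraphai.graphs import SmartScraperGraph

def scrape_url(url):
    """Sketch: minimal wrapper so the snippets in this section have something to call."""
    scraper = SmartScraperGraph(
        prompt="Extract the main posts or content from this page",
        source=url,
        config={"llm": {"api_key": "YOUR_API_KEY", "model": "gpt-3.5-turbo"}},
    )
    return scraper.run()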

2. Error Handling

def robust_scraper(url, max_retries=3):
    """Scraper with error handling and retries"""
    for attempt in range(max_retries):
        try:
            result = scrape_url(url)
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                return None

3. Data Quality Checks

def validate_scraped_data(data):
    """Validate scraped data quality"""
    issues = []
    
    for item in data:
        if not item.get('title'):
            issues.append("Missing title")
        if not item.get('author'):
            issues.append("Missing author")
        # Add more validation rules
    
    return issues
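A quick usage sketch tying the check into the earlier pipeline (monitor and data_store are the objects created above), so that posts are only stored when they pass validation:

reddit_posts = monitor.scrape_reddit_trends("technology")
issues = validate_scraped_data(reddit_posts["posts"])
if issues:
    print(f"Found {len(issues)} data quality issues:", issues)
else:
    data_store.store_posts(reddit_posts["posts"], "reddit")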

Conclusion

Social media scraping with ScrapeGraphAI opens up powerful possibilities for trend analysis, competitive intelligence, and market research. The combination of LLM-powered extraction and structured schemas makes it easy to collect and analyze social data at scale.

Key takeaways:

  • Start simple - Begin with basic post extraction
  • Scale gradually - Add more platforms and features over time
  • Respect platforms - Use appropriate delays and follow ToS
  • Store systematically - Build a database for historical analysis
  • Monitor trends - Set up alerts for important changes

The social media landscape is constantly evolving, and automated data collection gives you the insights needed to stay ahead of trends and understand your audience better.

