In the world of data-driven decisions, social media platforms like Instagram, LinkedIn, and Reddit are goldmines of public opinion, user behavior, and emerging trends. Whether you're building a trend monitoring dashboard, training an AI model, or running competitive analysis — the first step is structured data extraction. Enter ScrapeGraphAI: a smart web scraping framework that leverages LLMs (Large Language Models) to convert messy HTML into structured, ready-to-use JSON.
This blog will walk you through how to collect social media insights programmatically from Instagram, LinkedIn, and Reddit using ScrapeGraphAI, and how you can use that data for trend analysis.
Why Scrape Social Platforms?
Scraping Instagram, LinkedIn, and Reddit allows companies, researchers, and developers to:
- Track hashtags, topics, or keywords over time.
- Analyze engagement across communities or audiences.
- Understand sentiment toward products, services, or competitors.
- Detect trending discussions, pain points, or user needs in real time.
- Build trend dashboards, AI training datasets, or market intelligence tools.
Why Use ScrapeGraphAI?
ScrapeGraphAI combines the power of Python with LLM-powered extraction, meaning:
- You don't need to write brittle regex or custom parsers.
- You define your schema (like "title", "author", "timestamp"), and the model fills it in.
- It's modular: choose different browser engines, proxies, and LLMs (OpenAI, Groq, Mistral, etc.).
- You get clean JSON directly from semi-structured pages.
Perfect for dynamic, changing pages like Reddit threads or Instagram profiles.
Setting Up the Project
First, install ScrapeGraphAI along with the Playwright browser binaries it uses to fetch pages:
pip install scrapegraphai
playwright install
You'll also need an API key from an LLM provider (OpenAI, Groq, or another compatible service), and optionally a proxy service such as BrightData.
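Rather than hardcoding the key in your scripts, it's safer to load it from an environment variable. A minimal sketch, assuming the key is exported as OPENAI_API_KEY:

import os

# Read the LLM key from the environment instead of hardcoding it
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first")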
Example 1: Scraping Reddit for Trending Discussions
Let's extract top posts from a subreddit.
import json
from scrapegraphai.graphs import SmartScraperGraph

# Define what we want to extract
schema = {
    "posts": [
        {
            "title": "Post title",
            "author": "Username who posted",
            "upvotes": "Number of upvotes",
            "comments": "Number of comments",
            "timestamp": "When it was posted"
        }
    ]
}

# Set up the scraper
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_KEY",
        "model": "gpt-3.5-turbo",  # newer ScrapeGraphAI releases expect a provider prefix, e.g. "openai/gpt-4o-mini"
    },
    "headless": False  # set to True for production; "headless" lives at the top level of the config
}

# Create and run the scraper
smart_scraper = SmartScraperGraph(
    prompt="Extract the trending posts from this subreddit page",
    source="https://reddit.com/r/technology",
    config=graph_config,
    schema=schema
)

result = smart_scraper.run()
print(json.dumps(result, indent=2))
This will return structured JSON with trending posts from r/technology.
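The exact values depend on what the model finds on the page, but the shape of the result mirrors the schema. An illustrative example (all values are placeholders):

{
  "posts": [
    {
      "title": "Example post title",
      "author": "example_user",
      "upvotes": 1532,
      "comments": 214,
      "timestamp": "5 hours ago"
    }
  ]
}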
Example 2: Instagram Profile Analysis
from scrapegraphai.graphs import SmartScraperGraph

# Instagram profile schema
instagram_schema = {
    "profile": {
        "username": "Profile username",
        "followers": "Number of followers",
        "following": "Number following",
        "posts_count": "Total posts",
        "bio": "Profile bio text",
        "verified": "Is verified account",
        "recent_posts": [
            {
                "caption": "Post caption",
                "likes": "Number of likes",
                "comments": "Number of comments",
                "timestamp": "When posted"
            }
        ]
    }
}

graph_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",
        "model": "gpt-4",
    }
}

scraper = SmartScraperGraph(
    prompt="Extract profile information and recent posts data",
    source="https://instagram.com/username",
    config=graph_config,
    schema=instagram_schema
)

profile_data = scraper.run()
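One practical wrinkle: follower and like counts often come back as display strings like "1.2M" rather than integers, because that's how Instagram renders them. A small helper to normalize them, assuming the usual K/M/B suffix format:

def parse_count(value):
    """Convert display counts like '1.2K' or '3.4M' to integers."""
    if isinstance(value, int):
        return value
    value = str(value).strip().upper().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    if value and value[-1] in multipliers:
        return int(float(value[:-1]) * multipliers[value[-1]])
    return int(float(value))

followers = parse_count(profile_data["profile"]["followers"])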
Building a Trend Analysis Pipeline
Now let's create a complete pipeline that monitors multiple platforms:
import time
import pandas as pd
from datetime import datetime
from scrapegraphai.graphs import SmartScraperGraph

class SocialTrendMonitor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.data_store = []

    def scrape_reddit_trends(self, subreddit):
        """Scrape trending posts from Reddit"""
        config = {
            "llm": {"api_key": self.api_key, "model": "gpt-3.5-turbo"}
        }
        schema = {
            "posts": [
                {
                    "title": "str",
                    "upvotes": "int",
                    "comments": "int",
                    "author": "str"
                }
            ]
        }
        scraper = SmartScraperGraph(
            prompt=f"Extract trending posts from r/{subreddit}",
            source=f"https://reddit.com/r/{subreddit}",
            config=config,
            schema=schema
        )
        return scraper.run()

    def scrape_linkedin_posts(self, hashtag):
        """Scrape LinkedIn posts by hashtag"""
        # Similar implementation for LinkedIn
        pass

    def analyze_trends(self, data):
        """Analyze extracted data for trends"""
        df = pd.DataFrame(data)
        # Simple trend analysis
        trending_keywords = self.extract_keywords(df['title'].tolist())
        engagement_metrics = df['upvotes'].describe()
        return {
            "trending_keywords": trending_keywords,
            "engagement_stats": engagement_metrics.to_dict(),
            "timestamp": datetime.now().isoformat()
        }

    def extract_keywords(self, titles):
        """Extract trending keywords from titles"""
        # Simple frequency count; swap in an NLP library for better results
        all_words = " ".join(titles).lower().split()
        word_freq = {}
        for word in all_words:
            if len(word) > 3:  # Filter short words
                word_freq[word] = word_freq.get(word, 0) + 1
        # Return top 10 most frequent words
        return sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]

# Usage
monitor = SocialTrendMonitor(api_key="your_key")
reddit_data = monitor.scrape_reddit_trends("artificial")  # subreddit names can't contain spaces
trends = monitor.analyze_trends(reddit_data['posts'])
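To turn this into a real monitor, poll several subreddits on a schedule and accumulate the snapshots in data_store. A minimal sketch (the subreddit list and the 15-minute interval are arbitrary choices):

subreddits = ["technology", "programming", "artificial"]

while True:
    for sub in subreddits:
        try:
            data = monitor.scrape_reddit_trends(sub)
            snapshot = monitor.analyze_trends(data["posts"])
            monitor.data_store.append({"subreddit": sub, **snapshot})
        except Exception as e:
            print(f"Failed to scrape r/{sub}: {e}")
    time.sleep(15 * 60)  # wait 15 minutes between sweeps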
Advanced Use Cases
1. Sentiment Analysis Dashboard
from textblob import TextBlob

def analyze_sentiment(text_data):
    sentiments = []
    for text in text_data:
        blob = TextBlob(text)
        sentiments.append({
            'text': text,
            'polarity': blob.sentiment.polarity,
            'subjectivity': blob.sentiment.subjectivity
        })
    return sentiments

# Apply to scraped social media posts
reddit_posts = monitor.scrape_reddit_trends("cryptocurrency")
sentiment_data = analyze_sentiment([post['title'] for post in reddit_posts['posts']])
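From here it's straightforward to roll the per-post scores up into a single snapshot, for example the average polarity and a rough positive/negative split (the ±0.1 neutrality band is an arbitrary cutoff):

avg_polarity = sum(s['polarity'] for s in sentiment_data) / len(sentiment_data)
positive = sum(1 for s in sentiment_data if s['polarity'] > 0.1)
negative = sum(1 for s in sentiment_data if s['polarity'] < -0.1)
print(f"Avg polarity: {avg_polarity:.2f} ({positive} positive / {negative} negative)")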
2. Competitive Brand Monitoring
def monitor_brand_mentions(brand_name, platforms):
    """Monitor brand mentions across platforms"""
    mentions = {}
    for platform in platforms:
        data = None  # stays None for unsupported platforms
        if platform == 'reddit':
            data = search_reddit_mentions(brand_name)  # placeholder helper; sketched below
        elif platform == 'instagram':
            data = search_instagram_hashtags(brand_name)  # placeholder helper
        # Add more platforms
        mentions[platform] = data
    return mentions

# Track your brand vs competitors
brand_data = monitor_brand_mentions("YourBrand", ["reddit", "instagram"])
competitor_data = monitor_brand_mentions("Competitor", ["reddit", "instagram"])
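The search helpers above are placeholders. One way to implement search_reddit_mentions is to point SmartScraperGraph at Reddit's search page; a sketch under that assumption (the prompt and schema are illustrative, not a fixed API):

def search_reddit_mentions(brand_name):
    """Sketch: scrape Reddit search results for posts mentioning a brand."""
    scraper = SmartScraperGraph(
        prompt=f"Extract posts that mention '{brand_name}'",
        source=f"https://www.reddit.com/search/?q={brand_name}",
        config={"llm": {"api_key": "YOUR_API_KEY", "model": "gpt-3.5-turbo"}},
        schema={"posts": [{"title": "str", "subreddit": "str", "upvotes": "int"}]}
    )
    return scraper.run()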
3. Real-time Trend Alerts
import smtplib
from email.mime.text import MIMEText

class TrendAlert:
    def __init__(self, threshold=100):
        self.threshold = threshold

    def check_viral_content(self, posts):
        """Check if any post is going viral"""
        viral_posts = []
        for post in posts:
            if post.get('upvotes', 0) > self.threshold:
                viral_posts.append(post)
        return viral_posts

    def send_alert(self, viral_posts):
        """Send email alert for viral content"""
        if viral_posts:
            message = f"Alert: {len(viral_posts)} posts are trending!"
            # Stub: real email delivery using smtplib is sketched below
            print(message)

# Set up alerts
alert_system = TrendAlert(threshold=500)
reddit_data = monitor.scrape_reddit_trends("technology")
viral_content = alert_system.check_viral_content(reddit_data['posts'])
alert_system.send_alert(viral_content)
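The send_alert stub only prints; the smtplib and MIMEText imports at the top of the block are there for real delivery. A sketch of what that could look like, assuming a reachable SMTP server (host, port, and addresses are placeholders):

def send_email_alert(viral_posts, smtp_host="localhost", smtp_port=25):
    """Sketch: email an alert about viral posts via SMTP."""
    titles = "\n".join(p.get('title', '(untitled)') for p in viral_posts)
    msg = MIMEText(f"{len(viral_posts)} posts are trending:\n\n{titles}")
    msg["Subject"] = "Viral content alert"
    msg["From"] = "alerts@example.com"
    msg["To"] = "you@example.com"
    with smtplib.SMTP(smtp_host, smtp_port) as server:
        server.send_message(msg)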
Data Storage and Analytics
Store your scraped data for long-term analysis:
import sqlite3
from datetime import datetime

class SocialDataStore:
    def __init__(self, db_path="social_trends.db"):
        self.db_path = db_path
        self.init_database()

    def init_database(self):
        """Initialize SQLite database"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS social_posts (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                platform TEXT,
                title TEXT,
                content TEXT,
                author TEXT,
                engagement_score INTEGER,
                timestamp DATETIME,
                scraped_at DATETIME
            )
        """)
        conn.commit()
        conn.close()

    def store_posts(self, posts, platform):
        """Store scraped posts in database"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        for post in posts:
            cursor.execute("""
                INSERT INTO social_posts
                (platform, title, content, author, engagement_score, timestamp, scraped_at)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            """, (
                platform,
                post.get('title', ''),
                post.get('content', ''),
                post.get('author', ''),
                post.get('upvotes', 0),
                post.get('timestamp', ''),
                datetime.now().isoformat()  # store as ISO string; avoids the deprecated datetime adapter
            ))
        conn.commit()
        conn.close()

# Usage
data_store = SocialDataStore()
reddit_posts = monitor.scrape_reddit_trends("webdevelopment")
data_store.store_posts(reddit_posts['posts'], 'reddit')
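With posts persisted, historical analysis is a SQL query away. For example, pulling average engagement per platform back into pandas (assumes the social_posts table populated above):

import sqlite3
import pandas as pd

conn = sqlite3.connect("social_trends.db")
df = pd.read_sql_query(
    "SELECT platform, AVG(engagement_score) AS avg_engagement, COUNT(*) AS posts "
    "FROM social_posts GROUP BY platform",
    conn
)
conn.close()
print(df)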
Best Practices and Considerations
1. Rate Limiting and Respect
import time
import random

def respectful_scraping(urls, delay_range=(1, 3)):
    """Implement delays between requests"""
    results = []
    for url in urls:
        # Add a random delay so requests don't hammer the site
        delay = random.uniform(*delay_range)
        time.sleep(delay)
        # Perform scraping (scrape_url is a placeholder for your scraper of choice)
        result = scrape_url(url)
        results.append(result)
    return results
2. Error Handling
import time

def robust_scraper(url, max_retries=3):
    """Scraper with error handling and retries"""
    for attempt in range(max_retries):
        try:
            result = scrape_url(url)  # placeholder for your scraping call
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                return None
3. Data Quality Checks
def validate_scraped_data(data):
    """Validate scraped data quality"""
    issues = []
    for i, item in enumerate(data):
        if not item.get('title'):
            issues.append(f"Item {i}: missing title")
        if not item.get('author'):
            issues.append(f"Item {i}: missing author")
        # Add more validation rules
    return issues
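In practice, run the check right after scraping and decide whether to store or discard the batch:

posts = reddit_posts['posts']
issues = validate_scraped_data(posts)
if issues:
    print(f"Data quality issues found: {issues}")
else:
    data_store.store_posts(posts, 'reddit')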
Conclusion
Social media scraping with ScrapeGraphAI opens up powerful possibilities for trend analysis, competitive intelligence, and market research. The combination of LLM-powered extraction and structured schemas makes it easy to collect and analyze social data at scale.
Key takeaways:
- Start simple - Begin with basic post extraction
- Scale gradually - Add more platforms and features over time
- Respect platforms - Use appropriate delays and follow ToS
- Store systematically - Build a database for historical analysis
- Monitor trends - Set up alerts for important changes
The social media landscape is constantly evolving, and automated data collection gives you the insights needed to stay ahead of trends and understand your audience better.
Related Resources
Learn more about web scraping and social media analysis:
- Web Scraping 101 - Master the fundamentals
- AI Agent Web Scraping - Advanced AI techniques
- Mastering ScrapeGraphAI - Complete platform guide
- Web Scraping Legality - Legal considerations
- Scraping with Python - Python-specific techniques
These resources will help you build sophisticated social media monitoring systems while maintaining ethical and legal compliance.