ScrapeGraphAI

Tweet Scraper: How to Extract X/Twitter Data in Python [2026]



Marco Vinciguerra

Tweet Scraper: How to Extract X/Twitter Data in Python

X (formerly Twitter) pumps out roughly 500 million posts a day. That's a staggering amount of unstructured text, engagement signals, and real-time opinion data sitting right there in public. If you know how to tap into it, you've got a firehose of market sentiment, trend signals, and competitive intel.

The problem? X doesn't make it easy. The official API now costs $100/month minimum and the free tier is borderline useless. snscrape broke (again). Selenium scripts rot faster than you can maintain them.

This guide walks you through building a tweet scraper that actually works in 2026 — from the official API to AI-powered extraction with ScrapeGraphAI — with real code you can run today.

Why Build a Tweet Scraper?

Tweet data is valuable because of the combination of text, timing, engagement metrics, and network context.

Financial Signal Detection — Hedge funds track specific accounts (earnings whispers, industry analysts, company insiders) and monitor for unusual posting patterns. A tweet scraper watching 500 finance accounts can detect sentiment shifts 15-30 minutes before they show up in price action.

Brand Crisis Detection — When a product breaks or a PR disaster unfolds, X is where it hits first. A twitter scraper monitoring your brand mentions gives your comms team a 20-minute head start over batch-processing social listening platforms.

Academic Research — Computational social science runs on tweet datasets. Studying misinformation spread, political polarization, language evolution, crisis communication — all of it needs large volumes of tweet data with metadata intact. Since X gutted academic API access in 2023, researchers have shifted to scraping. A well-built pipeline with proper deduplication can collect millions of tweets for a fraction of what the API costs.

Competitive Intelligence — Track what your competitors' users are saying. Monitor feature requests, complaints, praise. Watch how engagement patterns shift after product launches. This data feeds directly into product roadmaps if you're paying attention.

Is Tweet Scraping Legal?

US case law leans in your favor. The hiQ v. LinkedIn ruling (upheld in 2022) established that scraping publicly available data doesn't violate the Computer Fraud and Abuse Act.

X's Terms of Service say no. Their ToS prohibits automated data collection, but a ToS violation is a civil matter, not criminal.

GDPR is the real risk. If you're collecting data from EU-based users and storing personal identifiers, you need a lawful basis under GDPR.

Practical advice:

  • Stick to public data only — never attempt to scrape protected accounts
  • Don't store data longer than your use case requires
  • Using a third-party extraction service like ScrapeGraphAI adds a layer of separation between your infrastructure and X's servers
  • Rate-limit your requests — hammering their servers is how you get noticed
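If you manage requests yourself, the rate-limit advice above mostly reduces to adding jittered pauses between fetches so your traffic doesn't look machine-regular. A minimal sketch (the delay bounds are arbitrary defaults, not X-specific guidance):

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 3.0) -> float:
    """Sleep for base + U(0, jitter) seconds and return the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Between consecutive page fetches:
# for url in urls:
#     fetch(url)
#     polite_delay()
```

Randomizing the gap matters more than its exact size; fixed intervals are an easy bot signature.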

Approach 1: Official X API with Tweepy

Tweepy wraps X's API v2 in a clean Python interface. You'll need a developer account at developer.x.com and at least the Basic tier ($100/month) for meaningful access.

pip install tweepy

import tweepy
 
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
 
query = "artificial intelligence -is:retweet lang:en"
 
response = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=["created_at", "public_metrics", "author_id", "lang"],
    expansions=["author_id"],
    user_fields=["username", "public_metrics"]
)
 
users = {u.id: u for u in response.includes.get("users", [])}
 
tweets = []
for tweet in response.data or []:  # data is None when the query matches nothing
    author = users.get(tweet.author_id)
    tweets.append({
        "id": tweet.id,
        "text": tweet.text,
        "author": author.username if author else None,
        "created_at": tweet.created_at.isoformat(),
        "likes": tweet.public_metrics["like_count"],
        "retweets": tweet.public_metrics["retweet_count"],
        "replies": tweet.public_metrics["reply_count"],
        "impressions": tweet.public_metrics.get("impression_count", 0),
    })
 
for t in tweets[:5]:
    print(f"@{t['author']}: {t['text'][:80]}... | {t['likes']} likes")

The cost problem kills Tweepy for most use cases. Basic tier gives you 10,000 tweets/month for $100 — that's $0.01 per tweet. Pro tier ($5,000/month) gets you 1 million tweets with strict rate limits. For any serious data collection, the math doesn't work.
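The tier math is worth sanity-checking yourself:

```python
# Rough per-tweet cost of the X API tiers mentioned above
tiers = {
    "Basic": (100, 10_000),      # ($/month, tweets/month)
    "Pro": (5_000, 1_000_000),
}
for name, (dollars, tweets) in tiers.items():
    print(f"{name}: ${dollars / tweets:.4f} per tweet")
# Basic works out to $0.0100 per tweet, Pro to $0.0050
```

At either price point, collecting tweets at research scale (millions of posts) runs into five or six figures per year.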

Approach 2: snscrape and Playwright (Free but Fragile)

snscrape was the go-to free tweet scraper for years, scraping X's internal API endpoints without authentication. The reality in 2026: X keeps changing those endpoints, the library has 200+ open issues, and the last meaningful commit was months ago. It still works intermittently for one-off research grabs, but building a production pipeline on it is building on sand.

Playwright (browser automation) works by rendering X's React SPA in headless Chrome. It handles JavaScript rendering but it's slow (2-5 seconds per scroll), resource-hungry, and fragile — X changes their DOM structure regularly. You'll spend more time maintaining selectors than analyzing data. X also fingerprints headless browsers, so you'll need stealth plugins and proxy rotation.

Both approaches require significant maintenance effort. For anything beyond quick one-off grabs, you want something more reliable.

Approach 3: ScrapeGraphAI (AI-Powered Tweet Scraping)

ScrapeGraphAI takes a fundamentally different approach to twitter data extraction. Instead of targeting specific CSS selectors or API endpoints, you describe what you want in plain English. The AI handles rendering, navigation, anti-bot protections, and data structuring.

No browser to manage. No selectors to maintain. No API keys from X.

pip install scrapegraph-py

Scraping Tweets from a Profile

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/OpenAI",
    user_prompt="Extract the 10 most recent tweets with full text, timestamp, like count, retweet count, reply count, and view count"
)
 
tweets = response["result"]
for tweet in tweets:
    print(f"{tweet['timestamp']} | {tweet['text'][:80]}... | {tweet['likes']} likes")
 
client.close()

Keyword Search Extraction

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/search?q=web+scraping+python&src=typed_query&f=live",
    user_prompt="""Extract all visible tweets. For each tweet return:
    - author_handle
    - display_name
    - full_text
    - timestamp
    - likes
    - retweets
    - replies
    - views
    - has_media (true/false)
    - media_urls (list of image/video URLs if present)"""
)
 
tweets = response["result"]
print(f"Extracted {len(tweets)} tweets")
 
client.close()

Profile + Tweets Combined Extraction

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/ylecun",
    user_prompt="""Extract:
    1. Profile info: display name, handle, bio, follower count, following count, verified status
    2. The 5 most recent tweets with text, timestamp, and all engagement metrics
    3. Any pinned tweet if present"""
)
 
data = response["result"]
print(f"@{data.get('handle')} | {data.get('follower_count')} followers")
print(f"Bio: {data.get('bio')}")
 
client.close()

Threads and Quote Tweets

Threads and quote tweets are where the interesting context lives. ScrapeGraphAI handles these natively — just point it at a thread URL and describe what you want.

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/username/status/1234567890",
    user_prompt="""This is a tweet thread. Extract:
    - The original tweet text and metrics
    - All replies from the original author (thread continuation)
    - Any quoted tweet text and author
    - Total engagement across the thread (sum of likes, retweets, replies)"""
)
 
thread = response["result"]
print(f"Thread length: {len(thread.get('thread_tweets', []))} posts")
print(f"Total likes across thread: {thread.get('total_likes', 'N/A')}")
 
client.close()

Example JSON Output

Here's what the structured output looks like from a ScrapeGraphAI tweet scraper extraction:

{
  "profile": {
    "display_name": "Yann LeCun",
    "handle": "ylecun",
    "bio": "VP & Chief AI Scientist at Meta. Professor at NYU. ACM Turing Award Laureate.",
    "followers": "695.4K",
    "following": "2,891",
    "verified": true
  },
  "pinned_tweet": null,
  "recent_tweets": [
    {
      "text": "New paper on V-JEPA 2.0. We trained a video prediction model that learns rich visual representations without any labels.",
      "timestamp": "2026-03-23T16:45:00Z",
      "likes": 8420,
      "retweets": 2150,
      "replies": 342,
      "views": "1.8M",
      "has_media": true,
      "is_thread": false
    }
  ]
}
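Note that X renders large counts as abbreviated strings ("695.4K", "1.8M") and the extraction returns them verbatim. A small normalizer (a sketch, assuming only K/M/B suffixes and comma separators) converts them to integers before storage:

```python
def parse_count(value) -> int:
    """Convert X-style counts like '695.4K', '1.8M', or '2,891' to an int."""
    if isinstance(value, (int, float)):
        return int(value)
    text = str(value).strip().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    if text and text[-1].upper() in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1].upper()])
    return int(float(text)) if text else 0

print(parse_count("695.4K"))  # 695400
print(parse_count("1.8M"))    # 1800000
print(parse_count("2,891"))   # 2891
```

Run this on follower and view counts before inserting them, so numeric comparisons and aggregations work in SQL.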

Comparison: Which Tweet Scraper Should You Use?

| Feature | ScrapeGraphAI | Tweepy (X API) | snscrape | Playwright |
|---|---|---|---|---|
| Cost | Free tier + usage | $100-5,000/mo | Free | Free |
| Setup time | 5 minutes | 30 minutes | 15 minutes | 1-2 hours |
| Auth required | SGAI key only | X developer account | None | None |
| JS rendering | Handled | N/A (API) | No | Yes |
| Maintenance | None | Low | Very high | Very high |
| Rate limits | Plan-based | Strict per-endpoint | Breaks unpredictably | Self-managed |
| Structured output | AI-structured JSON | Native JSON | Python objects | Manual parsing |
| Thread support | Yes | Yes (with pagination) | Partial | Fragile |
| Media extraction | Yes | Yes | Partial | Manual |
| Best for | Production pipelines | Real-time streaming | Quick one-off grabs | Custom edge cases |

For most use cases — brand monitoring, research data collection, competitive analysis — ScrapeGraphAI hits the sweet spot of reliability, cost, and zero maintenance. Tweepy wins if you specifically need real-time streaming. The rest are fallbacks.

Building a Tweet Data Pipeline

Scraping is step one. Here's how to build the complete pipeline from collection to storage.

Step 1: Collection with Batching

from scrapegraph_py import Client
import json
import time
from datetime import datetime
 
def collect_tweets(keywords, max_per_keyword=100):
    client = Client(api_key="your-sgai-key")
    all_tweets = []
 
    for keyword in keywords:
        search_url = f"https://x.com/search?q={keyword.replace(' ', '+')}&src=typed_query&f=live"
 
        response = client.smartscraper(
            website_url=search_url,
            user_prompt=f"Extract all visible tweets about '{keyword}'. Include author_handle, full_text, timestamp, likes, retweets, replies, views."
        )
 
        tweets = response.get("result", [])
        for tweet in tweets[:max_per_keyword]:
            tweet["keyword"] = keyword
            tweet["collected_at"] = datetime.utcnow().isoformat()
            all_tweets.append(tweet)
 
        time.sleep(2)
 
    client.close()
    return all_tweets
 
keywords = ["ScrapeGraphAI", "web scraping API", "AI data extraction"]
raw_tweets = collect_tweets(keywords)
 
with open("raw_tweets.json", "w") as f:
    json.dump(raw_tweets, f, indent=2)
 
print(f"Collected {len(raw_tweets)} tweets across {len(keywords)} keywords")

Step 2: Cleaning, Dedup, and Storage

import json
import re
import sqlite3
from hashlib import sha256
 
def clean_tweet_text(text):
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
 
def deduplicate_tweets(tweets):
    seen = set()
    unique = []
    for tweet in tweets:
        content_hash = sha256(
            f"{tweet.get('author_handle', '')}:{tweet.get('full_text', '')[:100]}".encode()
        ).hexdigest()
        if content_hash not in seen:
            seen.add(content_hash)
            tweet["content_hash"] = content_hash
            tweet["clean_text"] = clean_tweet_text(tweet.get("full_text", ""))
            unique.append(tweet)
    return unique
 
def init_db(db_path="tweets.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            content_hash TEXT PRIMARY KEY,
            author_handle TEXT,
            full_text TEXT,
            clean_text TEXT,
            timestamp TEXT,
            likes INTEGER DEFAULT 0,
            retweets INTEGER DEFAULT 0,
            replies INTEGER DEFAULT 0,
            views TEXT,
            keyword TEXT,
            collected_at TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_keyword ON tweets(keyword)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON tweets(timestamp)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_author ON tweets(author_handle)")
    conn.commit()
    return conn
 
def insert_tweets(conn, tweets):
    inserted = 0
    for tweet in tweets:
        cur = conn.execute("""
            INSERT OR IGNORE INTO tweets
            (content_hash, author_handle, full_text, clean_text, timestamp,
             likes, retweets, replies, views, keyword, collected_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            tweet.get("content_hash"),
            tweet.get("author_handle"),
            tweet.get("full_text"),
            tweet.get("clean_text"),
            tweet.get("timestamp"),
            tweet.get("likes", 0),
            tweet.get("retweets", 0),
            tweet.get("replies", 0),
            tweet.get("views"),
            tweet.get("keyword"),
            tweet.get("collected_at"),
        ))
        # rowcount is 0 when INSERT OR IGNORE skips a duplicate,
        # so inserted reflects rows actually written
        inserted += cur.rowcount
    conn.commit()
    return inserted
 
with open("raw_tweets.json") as f:
    raw = json.load(f)
 
cleaned = deduplicate_tweets(raw)
conn = init_db()
count = insert_tweets(conn, cleaned)
print(f"Deduped: {len(raw)} -> {len(cleaned)} | Inserted {count} into database")
conn.close()

Handling X's Anti-Bot Measures

X has gotten aggressive about blocking scrapers. Here's what to know.

Rate Limiting — X tracks request patterns per IP. If you're hitting their servers directly, you'll get rate-limited after a few hundred requests. Add random delays, rotate residential proxies, and limit concurrency. With ScrapeGraphAI, this is handled for you — the service manages its own proxy rotation and request pacing.

Login Walls — X increasingly shows login prompts to unauthenticated visitors, especially on search pages. This blocks snscrape and basic HTTP scrapers entirely. ScrapeGraphAI handles authentication barriers automatically.

JavaScript Rendering — X is a React SPA. Simple HTTP GET returns almost no tweet content. This eliminates requests/BeautifulSoup approaches entirely. You need either a headless browser, the official API, or a rendering service like ScrapeGraphAI.

Fingerprinting — X fingerprints browser environments. Headless Chrome has detectable characteristics (missing WebGL extensions, navigator properties) that X's bot detection picks up. If you're rolling your own Playwright setup, you'll need stealth plugins and constant maintenance. ScrapeGraphAI handles this internally.

Scheduling and Automation

A tweet scraper that runs once is a script. One that runs on a schedule is a pipeline.

#!/usr/bin/env python3
import sqlite3
import time
from datetime import datetime
from hashlib import sha256
from scrapegraph_py import Client
 
KEYWORDS = ["ScrapeGraphAI", "web scraping", "AI scraping"]
DB_PATH = "/var/data/tweets.db"
API_KEY = "your-sgai-key"
 
def run_collection():
    client = Client(api_key=API_KEY)
    conn = sqlite3.connect(DB_PATH)
 
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            content_hash TEXT PRIMARY KEY,
            author_handle TEXT,
            full_text TEXT,
            timestamp TEXT,
            likes INTEGER DEFAULT 0,
            retweets INTEGER DEFAULT 0,
            keyword TEXT,
            collected_at TEXT
        )
    """)
 
    total = 0
    for keyword in KEYWORDS:
        url = f"https://x.com/search?q={keyword.replace(' ', '+')}&f=live"
        try:
            response = client.smartscraper(
                website_url=url,
                user_prompt="Extract all visible tweets with author_handle, full_text, timestamp, likes, retweets"
            )
            for tweet in response.get("result", []):
                h = sha256(f"{tweet.get('author_handle')}:{tweet.get('full_text', '')[:100]}".encode()).hexdigest()
                cur = conn.execute(
                    "INSERT OR IGNORE INTO tweets VALUES (?,?,?,?,?,?,?,?)",
                    (h, tweet.get("author_handle"), tweet.get("full_text"),
                     tweet.get("timestamp"), tweet.get("likes", 0),
                     tweet.get("retweets", 0), keyword, datetime.utcnow().isoformat())
                )
                total += cur.rowcount  # 0 when the row was a duplicate
            time.sleep(3)
        except Exception as e:
            print(f"Error collecting '{keyword}': {e}")
 
    conn.commit()
    conn.close()
    client.close()
    print(f"[{datetime.utcnow().isoformat()}] Collected {total} tweets")
 
if __name__ == "__main__":
    run_collection()
Schedule it with cron to run every six hours:

# Run every 6 hours
0 */6 * * * /usr/bin/python3 /opt/scripts/collect_tweets.py >> /var/log/tweet_collector.log 2>&1

Troubleshooting Common Issues

Empty Results from Search Pages — X is likely showing a login wall or you're rate-limited. Switch to ScrapeGraphAI which handles login walls and rendering. If using Playwright, add a logged-in session cookie.

Inconsistent Engagement Metrics — This isn't a bug. X's counters are eventually consistent. Accept variance of 5-10% for metrics and use the latest value.
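If you re-scrape the same tweets across runs, a simple policy for those drifting counters is latest-wins: keep the most recently collected snapshot per tweet. A sketch, keyed on the same content hash used in the pipeline above and assuming snapshots arrive oldest-first:

```python
def merge_snapshots(snapshots):
    """Keep the latest metrics per tweet; snapshots ordered oldest-first."""
    latest = {}
    for snap in snapshots:
        latest[snap["content_hash"]] = snap  # later snapshots overwrite earlier ones
    return list(latest.values())

runs = [
    {"content_hash": "abc", "likes": 100, "collected_at": "2026-03-23T10:00:00Z"},
    {"content_hash": "abc", "likes": 112, "collected_at": "2026-03-23T16:00:00Z"},
]
merged = merge_snapshots(runs)
print(merged[0]["likes"])  # 112
```

Swap the overwrite for a `max()` on each metric if you care about peak engagement rather than the latest reading.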

Unicode and Emoji Handling — Always use UTF-8 encoding when writing to files:

with open("tweets.json", "w", encoding="utf-8") as f:
    json.dump(tweets, f, ensure_ascii=False, indent=2)

Real-World Example: Competitive Monitoring

Here's a concrete scenario. You're a product manager tracking mentions of your product and top competitors on X, analyzing sentiment weekly, and flagging negative spikes.

from scrapegraph_py import Client
from textblob import TextBlob
import sqlite3
from datetime import datetime
from hashlib import sha256
 
TARGETS = {
    "your_product": ["YourProduct", "@yourhandle"],
    "competitor_a": ["CompetitorA", "@compa"],
    "competitor_b": ["CompetitorB", "@compb"],
}
 
def collect_and_analyze():
    client = Client(api_key="your-sgai-key")
    conn = sqlite3.connect("competitive_intel.db")
 
    conn.execute("""
        CREATE TABLE IF NOT EXISTS mentions (
            hash TEXT PRIMARY KEY,
            target TEXT,
            author TEXT,
            text TEXT,
            timestamp TEXT,
            likes INTEGER,
            sentiment REAL,
            sentiment_label TEXT,
            collected_at TEXT
        )
    """)
 
    for target_name, keywords in TARGETS.items():
        query = " OR ".join(keywords)
        url = f"https://x.com/search?q={query}&f=live"
 
        response = client.smartscraper(
            website_url=url,
            user_prompt="Extract all tweets with author_handle, full_text, timestamp, likes"
        )
 
        for tweet in response.get("result", []):
            text = tweet.get("full_text", "")
            score = TextBlob(text).sentiment.polarity
            label = "positive" if score > 0.1 else ("negative" if score < -0.1 else "neutral")
            h = sha256(f"{tweet.get('author_handle')}:{text[:100]}".encode()).hexdigest()
 
            conn.execute(
                "INSERT OR IGNORE INTO mentions VALUES (?,?,?,?,?,?,?,?,?)",
                (h, target_name, tweet.get("author_handle"), text,
                 tweet.get("timestamp"), tweet.get("likes", 0),
                 score, label, datetime.utcnow().isoformat())
            )
 
    conn.commit()
 
    for target in TARGETS:
        row = conn.execute("""
            SELECT COUNT(*), ROUND(AVG(sentiment), 3),
                   SUM(CASE WHEN sentiment_label='negative' THEN 1 ELSE 0 END)
            FROM mentions WHERE target = ?
        """, (target,)).fetchone()
        neg_pct = (row[2] / row[0] * 100) if row[0] > 0 else 0
        alert = " !! SPIKE" if neg_pct > 30 else ""
        print(f"  {target}: {row[0]} mentions, sentiment={row[1]}, {neg_pct:.0f}% negative{alert}")
 
    conn.close()
    client.close()
 
collect_and_analyze()

Run this weekly via cron and pipe the output to Slack. You've got a competitive intelligence system in under 80 lines.
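Posting to Slack only takes an incoming-webhook POST with a `{"text": ...}` JSON body. A sketch of the payload builder (the row shape and webhook variable are assumptions to match the summary query above):

```python
import json

def build_slack_payload(rows):
    """rows: list of (target, mention_count, avg_sentiment, neg_pct) tuples."""
    lines = ["*Weekly competitive mention summary*"]
    for target, count, sentiment, neg_pct in rows:
        alert = " :rotating_light:" if neg_pct > 30 else ""
        lines.append(
            f"• {target}: {count} mentions, sentiment {sentiment:+.2f}, "
            f"{neg_pct:.0f}% negative{alert}"
        )
    return {"text": "\n".join(lines)}

payload = build_slack_payload([("your_product", 42, 0.18, 12.0)])
print(json.dumps(payload))
# Send it with: requests.post(WEBHOOK_URL, json=payload)
```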

Frequently Asked Questions

Is it legal to scrape tweets?

Scraping public tweets is generally legal in the US per the hiQ v. LinkedIn precedent. X's ToS prohibits it, but that's a civil matter, not criminal. Using a service like ScrapeGraphAI keeps the scraping infrastructure separate from your systems. If you're handling EU user data, GDPR applies — minimize PII storage and have a clear data retention policy.

Can I scrape tweets without the X API?

Yes. ScrapeGraphAI, Playwright, and snscrape all work without X API access. ScrapeGraphAI is the most reliable since it handles JavaScript rendering and anti-bot protections automatically.

How many tweets can I collect per day?

With the X API Basic plan ($100/mo), you're capped at 10,000 tweets per month. ScrapeGraphAI doesn't impose X-specific limits — throughput depends on your plan. For academic-scale collection, ScrapeGraphAI is orders of magnitude cheaper than X API Pro.

Can I scrape protected/private tweets?

No, and you shouldn't try. Protected tweets require authentication and the account owner has explicitly opted out of public visibility. Stick to public data.

Give your AI Agent superpowers with lightning-fast web data!