ScrapeGraphAI

Tweet Scraper: How to Extract X/Twitter Data in Python [2026]



Marco Vinciguerra

Tweet Scraper: How to Extract X/Twitter Data in Python

X (formerly Twitter) pumps out roughly 500 million posts a day. That's a staggering amount of unstructured text, engagement signals, and real-time opinion data sitting right there in public. If you know how to tap into it, you've got a firehose of market sentiment, trend signals, and competitive intel.

The problem? X doesn't make it easy. The official API now costs $100/month minimum and the free tier is borderline useless. snscrape broke (again). Selenium scripts rot faster than you can maintain them.

This guide walks you through building a tweet scraper that actually works in 2026 — from the official API to AI-powered extraction with ScrapeGraphAI — with real code you can run today.

Why Build a Tweet Scraper?

Tweet data is valuable because of the combination of text, timing, engagement metrics, and network context.

Financial Signal Detection — Hedge funds track specific accounts (earnings whispers, industry analysts, company insiders) and monitor for unusual posting patterns. A tweet scraper watching 500 finance accounts can detect sentiment shifts 15-30 minutes before they show up in price action.

Brand Crisis Detection — When a product breaks or a PR disaster unfolds, X is where it hits first. A twitter scraper monitoring your brand mentions gives your comms team a 20-minute head start over batch-processing social listening platforms.

Academic Research — Computational social science runs on tweet datasets. Studying misinformation spread, political polarization, language evolution, crisis communication — all of it needs large volumes of tweet data with metadata intact. Since X gutted academic API access in 2023, researchers have shifted to scraping. A well-built pipeline with proper deduplication can collect millions of tweets for a fraction of what the API costs.

Competitive Intelligence — Track what your competitors' users are saying. Monitor feature requests, complaints, praise. Watch how engagement patterns shift after product launches. This data feeds directly into product roadmaps if you're paying attention.

Is Tweet Scraping Legal?

US case law leans in your favor. The hiQ v. LinkedIn ruling (upheld in 2022) established that scraping publicly available data doesn't violate the Computer Fraud and Abuse Act.

X's Terms of Service say no. Their ToS prohibits automated data collection, but a ToS violation is a civil matter, not criminal.

GDPR is the real risk. If you're collecting data from EU-based users and storing personal identifiers, you need a lawful basis under GDPR.

Practical advice:

  • Stick to public data only — never attempt to scrape protected accounts
  • Don't store data longer than your use case requires
  • Using a third-party extraction service like ScrapeGraphAI adds a layer of separation between your infrastructure and X's servers
  • Rate-limit your requests — hammering their servers is how you get noticed
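If you manage requests yourself, the rate-limit advice above mostly reduces to adding jittered pauses between fetches so your traffic doesn't look machine-regular. A minimal sketch (the delay bounds are arbitrary defaults, not X-specific guidance):

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 3.0) -> float:
    """Sleep for base + U(0, jitter) seconds and return the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Between consecutive page fetches:
# for url in urls:
#     fetch(url)
#     polite_delay()
```

Randomizing the gap matters more than its exact size; fixed intervals are an easy bot signature.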

Approach 1: Official X API with Tweepy

Tweepy wraps X's API v2 in a clean Python interface. You'll need a developer account at developer.x.com and at least the Basic tier ($100/month) for meaningful access.

pip install tweepy

import tweepy
 
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
 
query = "artificial intelligence -is:retweet lang:en"
 
response = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=["created_at", "public_metrics", "author_id", "lang"],
    expansions=["author_id"],
    user_fields=["username", "public_metrics"]
)
 
users = {u.id: u for u in response.includes.get("users", [])}
 
tweets = []
for tweet in response.data or []:  # data is None when the query matches nothing
    author = users.get(tweet.author_id)
    tweets.append({
        "id": tweet.id,
        "text": tweet.text,
        "author": author.username if author else None,
        "created_at": tweet.created_at.isoformat(),
        "likes": tweet.public_metrics["like_count"],
        "retweets": tweet.public_metrics["retweet_count"],
        "replies": tweet.public_metrics["reply_count"],
        "impressions": tweet.public_metrics.get("impression_count", 0),
    })
 
for t in tweets[:5]:
    print(f"@{t['author']}: {t['text'][:80]}... | {t['likes']} likes")

The cost problem kills Tweepy for most use cases. Basic tier gives you 10,000 tweets/month for $100 — that's $0.01 per tweet. Pro tier ($5,000/month) gets you 1 million tweets with strict rate limits. For any serious data collection, the math doesn't work.
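The tier math is worth sanity-checking yourself:

```python
# Rough per-tweet cost of the X API tiers mentioned above
tiers = {
    "Basic": (100, 10_000),      # ($/month, tweets/month)
    "Pro": (5_000, 1_000_000),
}
for name, (dollars, tweets) in tiers.items():
    print(f"{name}: ${dollars / tweets:.4f} per tweet")
# Basic works out to $0.0100 per tweet, Pro to $0.0050
```

At either price point, collecting tweets at research scale (millions of posts) runs into five or six figures per year.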

Approach 2: snscrape and Playwright (Free but Fragile)

snscrape was the go-to free tweet scraper for years, scraping X's internal API endpoints without authentication. The reality in 2026: X keeps changing those endpoints, the library has 200+ open issues, and the last meaningful commit was months ago. It still works intermittently for one-off research grabs, but building a production pipeline on it is building on sand.

Playwright (browser automation) works by rendering X's React SPA in headless Chrome. It handles JavaScript rendering but it's slow (2-5 seconds per scroll), resource-hungry, and fragile — X changes their DOM structure regularly. You'll spend more time maintaining selectors than analyzing data. X also fingerprints headless browsers, so you'll need stealth plugins and proxy rotation.

Both approaches require significant maintenance effort. For anything beyond quick one-off grabs, you want something more reliable.

Approach 3: ScrapeGraphAI (AI-Powered Tweet Scraping)

ScrapeGraphAI takes a fundamentally different approach to twitter data extraction. Instead of targeting specific CSS selectors or API endpoints, you describe what you want in plain English. The AI handles rendering, navigation, anti-bot protections, and data structuring.

No browser to manage. No selectors to maintain. No API keys from X.

pip install scrapegraph-py

Scraping Tweets from a Profile

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/OpenAI",
    user_prompt="Extract the 10 most recent tweets with full text, timestamp, like count, retweet count, reply count, and view count"
)
 
tweets = response["result"]
for tweet in tweets:
    print(f"{tweet['timestamp']} | {tweet['text'][:80]}... | {tweet['likes']} likes")
 
client.close()

Keyword Search Extraction

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/search?q=web+scraping+python&src=typed_query&f=live",
    user_prompt="""Extract all visible tweets. For each tweet return:
    - author_handle
    - display_name
    - full_text
    - timestamp
    - likes
    - retweets
    - replies
    - views
    - has_media (true/false)
    - media_urls (list of image/video URLs if present)"""
)
 
tweets = response["result"]
print(f"Extracted {len(tweets)} tweets")
 
client.close()

Profile + Tweets Combined Extraction

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/ylecun",
    user_prompt="""Extract:
    1. Profile info: display name, handle, bio, follower count, following count, verified status
    2. The 5 most recent tweets with text, timestamp, and all engagement metrics
    3. Any pinned tweet if present"""
)
 
data = response["result"]
print(f"@{data.get('handle')} | {data.get('follower_count')} followers")
print(f"Bio: {data.get('bio')}")
 
client.close()

Threads and Quote Tweets

Threads and quote tweets are where the interesting context lives. ScrapeGraphAI handles these natively — just point it at a thread URL and describe what you want.

from scrapegraph_py import Client
 
client = Client(api_key="your-sgai-key")
 
response = client.smartscraper(
    website_url="https://x.com/username/status/1234567890",
    user_prompt="""This is a tweet thread. Extract:
    - The original tweet text and metrics
    - All replies from the original author (thread continuation)
    - Any quoted tweet text and author
    - Total engagement across the thread (sum of likes, retweets, replies)"""
)
 
thread = response["result"]
print(f"Thread length: {len(thread.get('thread_tweets', []))} posts")
print(f"Total likes across thread: {thread.get('total_likes', 'N/A')}")
 
client.close()

Example JSON Output

Here's what the structured output looks like from a ScrapeGraphAI tweet scraper extraction:

{
  "profile": {
    "display_name": "Yann LeCun",
    "handle": "ylecun",
    "bio": "VP & Chief AI Scientist at Meta. Professor at NYU. ACM Turing Award Laureate.",
    "followers": "695.4K",
    "following": "2,891",
    "verified": true
  },
  "pinned_tweet": null,
  "recent_tweets": [
    {
      "text": "New paper on V-JEPA 2.0. We trained a video prediction model that learns rich visual representations without any labels.",
      "timestamp": "2026-03-23T16:45:00Z",
      "likes": 8420,
      "retweets": 2150,
      "replies": 342,
      "views": "1.8M",
      "has_media": true,
      "is_thread": false
    }
  ]
}
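Note that X renders large counts as abbreviated strings ("695.4K", "1.8M") and the extraction returns them verbatim. A small normalizer (a sketch, assuming only K/M/B suffixes and comma separators) converts them to integers before storage:

```python
def parse_count(value) -> int:
    """Convert X-style counts like '695.4K', '1.8M', or '2,891' to an int."""
    if isinstance(value, (int, float)):
        return int(value)
    text = str(value).strip().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    if text and text[-1].upper() in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1].upper()])
    return int(float(text)) if text else 0

print(parse_count("695.4K"))  # 695400
print(parse_count("1.8M"))    # 1800000
print(parse_count("2,891"))   # 2891
```

Run this on follower and view counts before inserting them, so numeric comparisons and aggregations work in SQL.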

Comparison: Which Tweet Scraper Should You Use?

| Feature | ScrapeGraphAI | Tweepy (X API) | snscrape | Playwright |
|---|---|---|---|---|
| Cost | Free tier + usage | $100-5,000/mo | Free | Free |
| Setup time | 5 minutes | 30 minutes | 15 minutes | 1-2 hours |
| Auth required | SGAI key only | X developer account | None | None |
| JS rendering | Handled | N/A (API) | No | Yes |
| Maintenance | None | Low | Very high | Very high |
| Rate limits | Plan-based | Strict per-endpoint | Breaks unpredictably | Self-managed |
| Structured output | AI-structured JSON | Native JSON | Python objects | Manual parsing |
| Thread support | Yes | Yes (with pagination) | Partial | Fragile |
| Media extraction | Yes | Yes | Partial | Manual |
| Best for | Production pipelines | Real-time streaming | Quick one-off grabs | Custom edge cases |

For most use cases — brand monitoring, research data collection, competitive analysis — ScrapeGraphAI hits the sweet spot of reliability, cost, and zero maintenance. Tweepy wins if you specifically need real-time streaming. The rest are fallbacks.

Building a Tweet Data Pipeline

Scraping is step one. Here's how to build the complete pipeline from collection to storage.

Step 1: Collection with Batching

from scrapegraph_py import Client
import json
import time
from datetime import datetime
 
def collect_tweets(keywords, max_per_keyword=100):
    client = Client(api_key="your-sgai-key")
    all_tweets = []
 
    for keyword in keywords:
        search_url = f"https://x.com/search?q={keyword.replace(' ', '+')}&src=typed_query&f=live"
 
        response = client.smartscraper(
            website_url=search_url,
            user_prompt=f"Extract all visible tweets about '{keyword}'. Include author_handle, full_text, timestamp, likes, retweets, replies, views."
        )
 
        tweets = response.get("result", [])
        for tweet in tweets[:max_per_keyword]:
            tweet["keyword"] = keyword
            tweet["collected_at"] = datetime.utcnow().isoformat()
            all_tweets.append(tweet)
 
        time.sleep(2)
 
    client.close()
    return all_tweets
 
keywords = ["ScrapeGraphAI", "web scraping API", "AI data extraction"]
raw_tweets = collect_tweets(keywords)
 
with open("raw_tweets.json", "w") as f:
    json.dump(raw_tweets, f, indent=2)
 
print(f"Collected {len(raw_tweets)} tweets across {len(keywords)} keywords")

Step 2: Cleaning, Dedup, and Storage

import json
import re
import sqlite3
from hashlib import sha256
 
def clean_tweet_text(text):
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
 
def deduplicate_tweets(tweets):
    seen = set()
    unique = []
    for tweet in tweets:
        content_hash = sha256(
            f"{tweet.get('author_handle', '')}:{tweet.get('full_text', '')[:100]}".encode()
        ).hexdigest()
        if content_hash not in seen:
            seen.add(content_hash)
            tweet["content_hash"] = content_hash
            tweet["clean_text"] = clean_tweet_text(tweet.get("full_text", ""))
            unique.append(tweet)
    return unique
 
def init_db(db_path="tweets.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            content_hash TEXT PRIMARY KEY,
            author_handle TEXT,
            full_text TEXT,
            clean_text TEXT,
            timestamp TEXT,
            likes INTEGER DEFAULT 0,
            retweets INTEGER DEFAULT 0,
            replies INTEGER DEFAULT 0,
            views TEXT,
            keyword TEXT,
            collected_at TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_keyword ON tweets(keyword)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON tweets(timestamp)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_author ON tweets(author_handle)")
    conn.commit()
    return conn
 
def insert_tweets(conn, tweets):
    inserted = 0
    for tweet in tweets:
        cur = conn.execute("""
            INSERT OR IGNORE INTO tweets
            (content_hash, author_handle, full_text, clean_text, timestamp,
             likes, retweets, replies, views, keyword, collected_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            tweet.get("content_hash"),
            tweet.get("author_handle"),
            tweet.get("full_text"),
            tweet.get("clean_text"),
            tweet.get("timestamp"),
            tweet.get("likes", 0),
            tweet.get("retweets", 0),
            tweet.get("replies", 0),
            tweet.get("views"),
            tweet.get("keyword"),
            tweet.get("collected_at"),
        ))
        # rowcount is 0 when INSERT OR IGNORE skips a duplicate,
        # so inserted reflects rows actually written
        inserted += cur.rowcount
    conn.commit()
    return inserted
 
with open("raw_tweets.json") as f:
    raw = json.load(f)
 
cleaned = deduplicate_tweets(raw)
conn = init_db()
count = insert_tweets(conn, cleaned)
print(f"Deduped: {len(raw)} -> {len(cleaned)} | Inserted {count} into database")
conn.close()

Handling X's Anti-Bot Measures

X has gotten aggressive about blocking scrapers. Here's what to know.

Rate Limiting — X tracks request patterns per IP. If you're hitting their servers directly, you'll get rate-limited after a few hundred requests. Add random delays, rotate residential proxies, and limit concurrency. With ScrapeGraphAI, this is handled for you — the service manages its own proxy rotation and request pacing.

Login Walls — X increasingly shows login prompts to unauthenticated visitors, especially on search pages. This blocks snscrape and basic HTTP scrapers entirely. ScrapeGraphAI handles authentication barriers automatically.

JavaScript Rendering — X is a React SPA. Simple HTTP GET returns almost no tweet content. This eliminates requests/BeautifulSoup approaches entirely. You need either a headless browser, the official API, or a rendering service like ScrapeGraphAI.

Fingerprinting — X fingerprints browser environments. Headless Chrome has detectable characteristics (missing WebGL extensions, navigator properties) that X's bot detection picks up. If you're rolling your own Playwright setup, you'll need stealth plugins and constant maintenance. ScrapeGraphAI handles this internally.

Scheduling and Automation

A tweet scraper that runs once is a script. One that runs on a schedule is a pipeline.

#!/usr/bin/env python3
import sqlite3
import time
from datetime import datetime
from hashlib import sha256
from scrapegraph_py import Client
 
KEYWORDS = ["ScrapeGraphAI", "web scraping", "AI scraping"]
DB_PATH = "/var/data/tweets.db"
API_KEY = "your-sgai-key"
 
def run_collection():
    client = Client(api_key=API_KEY)
    conn = sqlite3.connect(DB_PATH)
 
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            content_hash TEXT PRIMARY KEY,
            author_handle TEXT,
            full_text TEXT,
            timestamp TEXT,
            likes INTEGER DEFAULT 0,
            retweets INTEGER DEFAULT 0,
            keyword TEXT,
            collected_at TEXT
        )
    """)
 
    total = 0
    for keyword in KEYWORDS:
        url = f"https://x.com/search?q={keyword.replace(' ', '+')}&f=live"
        try:
            response = client.smartscraper(
                website_url=url,
                user_prompt="Extract all visible tweets with author_handle, full_text, timestamp, likes, retweets"
            )
            for tweet in response.get("result", []):
                h = sha256(f"{tweet.get('author_handle')}:{tweet.get('full_text', '')[:100]}".encode()).hexdigest()
                cur = conn.execute(
                    "INSERT OR IGNORE INTO tweets VALUES (?,?,?,?,?,?,?,?)",
                    (h, tweet.get("author_handle"), tweet.get("full_text"),
                     tweet.get("timestamp"), tweet.get("likes", 0),
                     tweet.get("retweets", 0), keyword, datetime.utcnow().isoformat())
                )
                total += cur.rowcount  # 0 when the row was a duplicate
            time.sleep(3)
        except Exception as e:
            print(f"Error collecting '{keyword}': {e}")
 
    conn.commit()
    conn.close()
    client.close()
    print(f"[{datetime.utcnow().isoformat()}] Collected {total} tweets")
 
if __name__ == "__main__":
    run_collection()
Schedule it with cron to run every six hours:

# Run every 6 hours
0 */6 * * * /usr/bin/python3 /opt/scripts/collect_tweets.py >> /var/log/tweet_collector.log 2>&1

Troubleshooting Common Issues

Empty Results from Search Pages — X is likely showing a login wall or you're rate-limited. Switch to ScrapeGraphAI which handles login walls and rendering. If using Playwright, add a logged-in session cookie.

Inconsistent Engagement Metrics — This isn't a bug. X's counters are eventually consistent. Accept variance of 5-10% for metrics and use the latest value.
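If you re-scrape the same tweets across runs, a simple policy for those drifting counters is latest-wins: keep the most recently collected snapshot per tweet. A sketch, keyed on the same content hash used in the pipeline above and assuming snapshots arrive oldest-first:

```python
def merge_snapshots(snapshots):
    """Keep the latest metrics per tweet; snapshots ordered oldest-first."""
    latest = {}
    for snap in snapshots:
        latest[snap["content_hash"]] = snap  # later snapshots overwrite earlier ones
    return list(latest.values())

runs = [
    {"content_hash": "abc", "likes": 100, "collected_at": "2026-03-23T10:00:00Z"},
    {"content_hash": "abc", "likes": 112, "collected_at": "2026-03-23T16:00:00Z"},
]
merged = merge_snapshots(runs)
print(merged[0]["likes"])  # 112
```

Swap the overwrite for a `max()` on each metric if you care about peak engagement rather than the latest reading.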

Unicode and Emoji Handling — Always use UTF-8 encoding when writing to files:

with open("tweets.json", "w", encoding="utf-8") as f:
    json.dump(tweets, f, ensure_ascii=False, indent=2)

Real-World Example: Competitive Monitoring

Here's a concrete scenario. You're a product manager tracking mentions of your product and top competitors on X, analyzing sentiment weekly, and flagging negative spikes.

from scrapegraph_py import Client
from textblob import TextBlob
import sqlite3
from datetime import datetime
from hashlib import sha256
 
TARGETS = {
    "your_product": ["YourProduct", "@yourhandle"],
    "competitor_a": ["CompetitorA", "@compa"],
    "competitor_b": ["CompetitorB", "@compb"],
}
 
def collect_and_analyze():
    client = Client(api_key="your-sgai-key")
    conn = sqlite3.connect("competitive_intel.db")
 
    conn.execute("""
        CREATE TABLE IF NOT EXISTS mentions (
            hash TEXT PRIMARY KEY,
            target TEXT,
            author TEXT,
            text TEXT,
            timestamp TEXT,
            likes INTEGER,
            sentiment REAL,
            sentiment_label TEXT,
            collected_at TEXT
        )
    """)
 
    for target_name, keywords in TARGETS.items():
        query = " OR ".join(keywords)
        url = f"https://x.com/search?q={query}&f=live"
 
        response = client.smartscraper(
            website_url=url,
            user_prompt="Extract all tweets with author_handle, full_text, timestamp, likes"
        )
 
        for tweet in response.get("result", []):
            text = tweet.get("full_text", "")
            score = TextBlob(text).sentiment.polarity
            label = "positive" if score > 0.1 else ("negative" if score < -0.1 else "neutral")
            h = sha256(f"{tweet.get('author_handle')}:{text[:100]}".encode()).hexdigest()
 
            conn.execute(
                "INSERT OR IGNORE INTO mentions VALUES (?,?,?,?,?,?,?,?,?)",
                (h, target_name, tweet.get("author_handle"), text,
                 tweet.get("timestamp"), tweet.get("likes", 0),
                 score, label, datetime.utcnow().isoformat())
            )
 
    conn.commit()
 
    for target in TARGETS:
        row = conn.execute("""
            SELECT COUNT(*), ROUND(AVG(sentiment), 3),
                   SUM(CASE WHEN sentiment_label='negative' THEN 1 ELSE 0 END)
            FROM mentions WHERE target = ?
        """, (target,)).fetchone()
        neg_pct = (row[2] / row[0] * 100) if row[0] > 0 else 0
        alert = " !! SPIKE" if neg_pct > 30 else ""
        print(f"  {target}: {row[0]} mentions, sentiment={row[1]}, {neg_pct:.0f}% negative{alert}")
 
    conn.close()
    client.close()
 
collect_and_analyze()

Run this weekly via cron and pipe the output to Slack. You've got a competitive intelligence system in under 80 lines.
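Posting to Slack only takes an incoming-webhook POST with a `{"text": ...}` JSON body. A sketch of the payload builder (the row shape and webhook variable are assumptions to match the summary query above):

```python
import json

def build_slack_payload(rows):
    """rows: list of (target, mention_count, avg_sentiment, neg_pct) tuples."""
    lines = ["*Weekly competitive mention summary*"]
    for target, count, sentiment, neg_pct in rows:
        alert = " :rotating_light:" if neg_pct > 30 else ""
        lines.append(
            f"• {target}: {count} mentions, sentiment {sentiment:+.2f}, "
            f"{neg_pct:.0f}% negative{alert}"
        )
    return {"text": "\n".join(lines)}

payload = build_slack_payload([("your_product", 42, 0.18, 12.0)])
print(json.dumps(payload))
# Send it with: requests.post(WEBHOOK_URL, json=payload)
```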

Frequently Asked Questions

Is it legal to scrape tweets?

Scraping public tweets is generally legal in the US per the hiQ v. LinkedIn precedent. X's ToS prohibits it, but that's a civil matter, not criminal. Using a service like ScrapeGraphAI keeps the scraping infrastructure separate from your systems. If you're handling EU user data, GDPR applies — minimize PII storage and have a clear data retention policy.

Can I scrape tweets without the X API?

Yes. ScrapeGraphAI, Playwright, and snscrape all work without X API access. ScrapeGraphAI is the most reliable since it handles JavaScript rendering and anti-bot protections automatically.

How many tweets can I collect per day?

With the X API Basic plan ($100/mo), you're capped at 10,000 tweets per month. ScrapeGraphAI doesn't impose X-specific limits — throughput depends on your plan. For academic-scale collection, ScrapeGraphAI is orders of magnitude cheaper than X API Pro.

Can I scrape protected/private tweets?

No, and you shouldn't try. Protected tweets require authentication and the account owner has explicitly opted out of public visibility. Stick to public data.

Give your AI Agent superpowers with lightning-fast web data!