
Common Web Scraping Mistakes and How to Avoid Them

Marco Vinciguerra

Web scraping is powerful, but it's easy to make mistakes that waste time, damage your reputation, or worse—get you blocked by websites. Whether you're new to scraping or have been doing it for years, there are pitfalls almost everyone encounters.

In this article, we'll cover the most common mistakes developers make when scraping, why they're problems, and how to fix them.

Mistake 1: Scraping Without Checking Robots.txt and Terms of Service

The Problem

Many scrapers ignore the site's robots.txt file and terms of service. This can lead to:

  • Your IP getting permanently blocked
  • Legal issues (some sites actively pursue scrapers)
  • Damaging the site's performance
  • Wasting resources scraping data you shouldn't have

What NOT to Do

# ❌ Don't just start scraping without checking anything
import requests
from bs4 import BeautifulSoup
 
url = "https://example.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Just start scraping...

What TO Do Instead

# ✅ Check robots.txt first
import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup
 
def can_scrape_url(url, user_agent='MyBot/1.0'):
    """Check if URL can be scraped according to robots.txt"""
    rp = RobotFileParser()
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    
    try:
        rp.set_url(robots_url)
        rp.read()
        
        # Check if user-agent is allowed
        if not rp.can_fetch(user_agent, url):
            print(f"robots.txt forbids scraping {url}")
            return False
        
        # Get crawl delay if specified
        crawl_delay = rp.crawl_delay(user_agent)
        if crawl_delay:
            print(f"Crawl delay: {crawl_delay} seconds")
        
        return True
    except Exception as e:
        print(f"Error checking robots.txt: {e}")
        # Conservative approach: if we can't check, don't scrape
        return False
 
# Usage
url = "https://example.com/products"
if can_scrape_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Safe to scrape
else:
    print("Scraping not allowed for this URL")

Additional Check: Terms of Service

# ✅ Also review the site's terms
# Look for sections about:
# - Data usage restrictions
# - API availability (might be a better alternative)
# - Legal consequences for unauthorized scraping
# - Commercial use restrictions
 
# Best practice: Use the official API if available
# Example: Most major sites now have APIs
# - Twitter/X: Twitter API
# - Amazon: Product Advertising API
# - Google: Custom Search API

Mistake 2: Not Setting a Proper User-Agent

The Problem

Requests without a proper User-Agent header look like automated bots to servers. They're immediately suspicious and likely to get blocked or rate-limited. Many sites explicitly block requests with default User-Agent strings.

What NOT to Do

# ❌ Default requests User-Agent = suspicious
import requests
 
response = requests.get("https://example.com")
# Default User-Agent: python-requests/2.x.x
# Server sees this and thinks: BOT! Block it.

What TO Do Instead

# ✅ Use a realistic User-Agent
import requests
from bs4 import BeautifulSoup
 
# Option 1: Use a real browser User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
 
response = requests.get("https://example.com", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
 
# Option 2: Rotate User-Agents to avoid patterns
from itertools import cycle
 
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
 
user_agent_cycle = cycle(user_agents)
 
def scrape_url_with_rotation(url):
    headers = {'User-Agent': next(user_agent_cycle)}
    response = requests.get(url, headers=headers)
    return response
 
# Option 3: Use a library that generates realistic User-Agents for you
from user_agent import generate_user_agent
 
headers = {
    'User-Agent': generate_user_agent()
}
 
response = requests.get("https://example.com", headers=headers)
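
A realistic User-Agent helps, but some sites also check that the rest of the request looks like a real browser. A minimal sketch with a fuller set of browser-like headers (the exact values are illustrative, not required by any particular site):

# ✅ Optional: send a fuller set of browser-like headers
import requests

browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}

response = requests.get("https://example.com", headers=browser_headers)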

Mistake 3: Making Too Many Requests Too Quickly

The Problem

Scraping too aggressively overwhelms servers and gets you blocked. Servers can easily detect and block IPs making requests faster than any human could.

What NOT to Do

# ❌ Rapid-fire requests (will get blocked)
import requests
import time
 
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
 
for url in urls:
    response = requests.get(url)  # No delay!
    # Server: "1000 requests/second? That's a bot. BLOCKED."
    process_response(response)

What TO Do Instead

# ✅ Add delays between requests
import requests
import time
import random
 
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
 
for url in urls:
    response = requests.get(url)
    process_response(response)
    
    # Add random delay between 1-3 seconds
    time.sleep(random.uniform(1, 3))
 
# Better: Use concurrent requests with rate limiting
import asyncio
import aiohttp
 
async def scrape_with_rate_limiting(urls, min_delay=1, max_concurrent=5):
    """Scrape with rate limiting and concurrency"""
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def scrape_url(session, url):
        async with semaphore:
            await asyncio.sleep(random.uniform(min_delay, min_delay * 1.5))
            async with session.get(url) as response:
                return await response.text()
    
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    
    return results
 
# Usage: Scrape with 5 concurrent requests, min 1 second delay
results = asyncio.run(scrape_with_rate_limiting(urls, min_delay=1, max_concurrent=5))

Mistake 4: Ignoring JavaScript-Rendered Content

The Problem

Many modern websites render content with JavaScript. If you just fetch the page source, you won't see the data—it's generated after the page loads. Your scraper gets an empty page.

What NOT to Do

# ❌ Using requests on a JavaScript-heavy site
import requests
from bs4 import BeautifulSoup
 
url = "https://example.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
 
# Try to find products
products = soup.find_all('div', class_='product')
print(f"Found {len(products)} products")  
# Output: Found 0 products
# Why? They were loaded with JavaScript!

What TO Do Instead

# ✅ Use a tool that renders JavaScript
# Option 1: Use Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
 
def scrape_js_heavy_site(url):
    """Scrape site that requires JavaScript rendering"""
    driver = webdriver.Chrome()
    
    try:
        driver.get(url)
        
        # Wait for products to load (max 10 seconds)
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "product"))
        )
        
        # Now parse the rendered HTML
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        products = soup.find_all('div', class_='product')
        
        return products
    finally:
        driver.quit()
 
# Option 2: Use Playwright (modern, faster)
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
 
def scrape_with_playwright(url):
    """Scrape with Playwright"""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        
        page.goto(url)
        page.wait_for_load_state("networkidle")  # Wait for JS to finish
        
        html = page.content()
        soup = BeautifulSoup(html, 'html.parser')
        products = soup.find_all('div', class_='product')
        
        browser.close()
        return products
 
# Option 3: Use ScrapeGraphAI (handles JavaScript automatically)
from scrapegraphai.graphs import SmartScraperGraph
 
def scrape_with_ai(url, prompt):
    """ScrapeGraphAI handles JavaScript rendering"""
    graph_config = {
        "llm": {
            "model": "gpt-4",
            "api_key": "your-api-key",
        },
    }
    
    scraper = SmartScraperGraph(
        prompt=prompt,
        source=url,
        config=graph_config
    )
    
    result = scraper.run()
    return result
 
products = scrape_with_ai(
    "https://example.com/products",
    "Extract all products with their name, price, and rating"
)
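
Before reaching for a browser at all, it is worth confirming that the data really is rendered client-side. A rough diagnostic sketch ('div.product' stands in for whatever selector you expect to find):

# Quick check: is the data present in the raw HTML at all?
import requests
from bs4 import BeautifulSoup

def needs_js_rendering(url, expected_selector):
    """Heuristic: if the selector is missing from the raw HTML,
    the content is probably rendered with JavaScript."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    return len(soup.select(expected_selector)) == 0

if needs_js_rendering("https://example.com/products", "div.product"):
    print("Content likely rendered with JavaScript - use Selenium, Playwright, or ScrapeGraphAI")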

Mistake 5: Not Handling Errors and Timeouts

The Problem

Networks fail. Websites go down. Servers get overloaded. If your scraper doesn't handle these gracefully, it crashes and loses progress. You might also waste API credits retrying immediately.

What NOT to Do

# ❌ No error handling
import requests
 
urls = [f"https://example.com/page/{i}" for i in range(1, 1001)]
 
for url in urls:
    response = requests.get(url)  # What if this fails?
    data = response.json()        # What if response is not JSON?
    process(data)                 # What if process fails?
# Script crashes on first error, losing all progress

What TO Do Instead

# ✅ Comprehensive error handling with retries
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import time
import logging
 
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
 
def create_session_with_retries(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    """Create requests session with automatic retries"""
    session = session or requests.Session()
    
    retry_strategy = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
        allowed_methods=["GET", "HEAD"]
    )
    
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    
    return session
 
def scrape_url_safely(url, max_retries=3):
    """Scrape with error handling"""
    session = create_session_with_retries(retries=max_retries)
    
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()  # Raise error for bad status codes
        
        return response.json()
    
    except requests.exceptions.Timeout:
        logger.error(f"Timeout: {url}")
        return None
    
    except requests.exceptions.HTTPError as e:
        logger.error(f"HTTP Error {response.status_code}: {url}")
        return None
    
    except requests.exceptions.ConnectionError:
        logger.error(f"Connection Error: {url}")
        return None
    
    except ValueError as e:
        logger.error(f"Invalid JSON response: {url}")
        return None
    
    except Exception as e:
        logger.error(f"Unexpected error for {url}: {e}")
        return None
 
# Usage with progress tracking
urls = [f"https://example.com/api/item/{i}" for i in range(1, 1001)]
results = []
failed_urls = []
 
for i, url in enumerate(urls, 1):
    data = scrape_url_safely(url)
    
    if data:
        results.append(data)
        logger.info(f"[{i}/{len(urls)}] Successfully scraped {url}")
    else:
        failed_urls.append(url)
        logger.warning(f"[{i}/{len(urls)}] Failed to scrape {url}")
 
logger.info(f"Complete: {len(results)} successes, {len(failed_urls)} failures")
 
if failed_urls:
    logger.info(f"Failed URLs: {failed_urls}")

Mistake 6: Not Validating Scraped Data

The Problem

Scraped data is often messy—missing fields, unexpected formats, corrupted values. If you don't validate it, you end up with a database full of garbage data that's useless for analysis.

What NOT to Do

# ❌ No validation
from bs4 import BeautifulSoup
import json
 
def scrape_products(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    products = []
    for item in soup.find_all('div', class_='product'):
        product = {
            'name': item.find('h2').text if item.find('h2') else None,
            'price': item.find('span', class_='price').text,  # Will crash if missing
            'rating': item.find('span', class_='rating').text,
        }
        products.append(product)
    
    return products
 
# Result: Inconsistent data, crashes on missing fields, hard to debug

What TO Do Instead

# ✅ Use schema validation with Pydantic
from pydantic import BaseModel, validator, ValidationError
from typing import Optional
import requests
from bs4 import BeautifulSoup
import logging
 
logger = logging.getLogger(__name__)
 
class Product(BaseModel):
    """Product schema with validation"""
    name: str
    price: float
    rating: Optional[float] = None
    url: str
    in_stock: bool = True
    
    @validator('name')
    def name_not_empty(cls, v):
        if not v or not v.strip():
            raise ValueError('Product name cannot be empty')
        return v.strip()
    
    @validator('price')
    def price_positive(cls, v):
        if v < 0:
            raise ValueError('Price cannot be negative')
        return v
    
    @validator('rating')
    def rating_in_range(cls, v):
        if v is not None and not (0 <= v <= 5):
            raise ValueError('Rating must be between 0 and 5')
        return v
 
def scrape_products_with_validation(url):
    """Scrape and validate products"""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    products = []
    errors = []
    
    for i, item in enumerate(soup.find_all('div', class_='product')):
        try:
            # Extract raw data
            name = item.find('h2').text if item.find('h2') else None
            price_text = item.find('span', class_='price')
            price = float(price_text.text.replace('$', '')) if price_text else None
            
            rating_text = item.find('span', class_='rating')
            rating = float(rating_text.text) if rating_text else None
            
            # Create and validate
            product = Product(
                name=name,
                price=price,
                rating=rating,
                url=item.find('a')['href'] if item.find('a') else url,
                in_stock=item.find('span', class_='out-of-stock') is None
            )
            
            products.append(product.dict())
        
        except ValidationError as e:
            errors.append(f"Item {i}: {e}")
            logger.warning(f"Validation error for item {i}: {e}")
        
        except Exception as e:
            errors.append(f"Item {i}: Unexpected error: {e}")
            logger.error(f"Error scraping item {i}: {e}")
    
    return {
        'products': products,
        'errors': errors,
        'success_rate': f"{len(products)}/{len(soup.find_all('div', class_='product'))}"
    }
 
# Usage
result = scrape_products_with_validation("https://example.com/products")
print(f"Scraped: {result['success_rate']} products")
if result['errors']:
    print(f"Errors: {result['errors']}")

Mistake 7: Not Respecting Rate Limits and Getting Blocked

The Problem

Websites implement rate limiting and IP blocking to prevent abuse. Ignoring these signals leads to:

  • Your requests returning 429 (Too Many Requests)
  • Your IP getting banned
  • Wasting credits on failed requests

What NOT to Do

# ❌ Ignoring rate limit headers
import requests
 
response = requests.get("https://api.example.com/data")
# You get 429 Too Many Requests
# But you ignore it and keep making requests
# Result: IP banned

What TO Do Instead

# ✅ Respect rate limit headers and implement backoff
import requests
import time
from datetime import datetime, timedelta
 
class RateLimitedScraper:
    def __init__(self):
        self.rate_limit_reset = None
        self.requests_remaining = None
    
    def respect_rate_limits(self, response):
        """Extract and respect rate limit headers"""
        # Common rate limit headers
        if 'X-RateLimit-Remaining' in response.headers:
            self.requests_remaining = int(response.headers['X-RateLimit-Remaining'])
        
        if 'X-RateLimit-Reset' in response.headers:
            reset_time = int(response.headers['X-RateLimit-Reset'])
            self.rate_limit_reset = datetime.fromtimestamp(reset_time)
        
        if 'Retry-After' in response.headers:
            retry_after = int(response.headers['Retry-After'])
            print(f"Rate limited. Waiting {retry_after} seconds...")
            time.sleep(retry_after)
    
    def scrape_with_backoff(self, url, max_retries=3):
        """Scrape with exponential backoff"""
        base_delay = 1
        
        for attempt in range(max_retries):
            try:
                response = requests.get(url, timeout=10)
                
                # Handle rate limiting
                if response.status_code == 429:
                    delay = base_delay * (2 ** attempt)  # Exponential backoff
                    print(f"Rate limited. Backing off for {delay} seconds...")
                    time.sleep(delay)
                    continue
                
                # Handle server errors
                if response.status_code >= 500:
                    delay = base_delay * (2 ** attempt)
                    print(f"Server error {response.status_code}. Retrying in {delay} seconds...")
                    time.sleep(delay)
                    continue
                
                response.raise_for_status()
                
                # Respect rate limit headers
                self.respect_rate_limits(response)
                
                # If we're close to the limit, proactively wait
                if (self.requests_remaining is not None and self.requests_remaining < 10
                        and self.rate_limit_reset is not None):
                    wait_time = (self.rate_limit_reset - datetime.now()).total_seconds()
                    if wait_time > 0:
                        print(f"Approaching rate limit. Waiting {wait_time:.0f} seconds...")
                        time.sleep(wait_time)
                
                return response
            
            except requests.exceptions.RequestException as e:
                if attempt == max_retries - 1:
                    raise
                delay = base_delay * (2 ** attempt)
                print(f"Request failed: {e}. Retrying in {delay} seconds...")
                time.sleep(delay)
        
        raise Exception(f"Failed after {max_retries} attempts")
 
# Usage
scraper = RateLimitedScraper()
response = scraper.scrape_with_backoff("https://api.example.com/data")
print(response.json())

Mistake 8: Storing API Keys and Credentials in Code

The Problem

Hardcoding API keys or credentials in your code is a security risk. If your code gets pushed to GitHub or shared, attackers can use your keys to make unauthorized requests.

What NOT to Do

# ❌ NEVER do this
API_KEY = "sk-abc123def456ghi789jkl"
API_SECRET = "secret_password_123"
 
response = requests.get(
    "https://api.example.com/data",
    headers={"Authorization": f"Bearer {API_KEY}"}
)

What TO Do Instead

# ✅ Use environment variables
import os
import requests
from dotenv import load_dotenv
 
# Load from .env file (not committed to git)
load_dotenv()
 
API_KEY = os.getenv('API_KEY')
API_SECRET = os.getenv('API_SECRET')
 
if not API_KEY:
    raise ValueError("API_KEY not found in environment variables")
 
response = requests.get(
    "https://api.example.com/data",
    headers={"Authorization": f"Bearer {API_KEY}"}
)
 
# .env file (add to .gitignore):
# API_KEY=sk-abc123def456ghi789jkl
# API_SECRET=secret_password_123
 
# Or use a secrets management system:
# AWS Secrets Manager, HashiCorp Vault, etc.
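
If you'd rather not manage .env files yourself, a managed secrets store works too. A rough sketch using AWS Secrets Manager via boto3 (the secret name "scraper/api-key" is hypothetical):

# ✅ Example: fetch credentials from AWS Secrets Manager (sketch)
import json
import boto3

def get_secret(secret_name="scraper/api-key", region="us-east-1"):
    """Fetch a secret at runtime instead of hardcoding it."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

secrets = get_secret()
API_KEY = secrets["API_KEY"]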

Mistake 9: Scraping Without a Clear Purpose or Respect

The Problem

Scraping without purpose or respect can:

  • Damage the website's performance
  • Violate their terms of service
  • Result in legal action
  • Get you on their blocklist
  • Waste bandwidth and resources

What NOT to Do

# ❌ Scrape everything without purpose
# "Let me download their entire product catalog just in case"
# This is usually:
# 1. Against their ToS
# 2. Illegal in many jurisdictions
# 3. Unfair competition

What TO Do Instead

# ✅ Have a legitimate, ethical purpose
 
# Good reasons to scrape:
# - Market research (aggregate, anonymize data)
# - Personal use (backup your own data)
# - Academic research (with proper attribution)
# - Data journalism (with proper sourcing)
# - Public data collection (government data, public records)
 
# How to be respectful:
import requests
import time
import random
 
class EthicalScraper:
    def __init__(self, delay_range=(2, 5)):
        """
        Ethical scraper with built-in respect mechanisms
        
        Args:
            delay_range: (min, max) seconds between requests
        """
        self.delay_range = delay_range
        self.session = self._create_respectful_session()
    
    def _create_respectful_session(self):
        """Create session with respectful defaults"""
        session = requests.Session()
        session.headers.update({
            'User-Agent': 'MyCompany-DataCollector/1.0 (+http://mycompany.com/scraper)',
            'Accept-Encoding': 'gzip, deflate',
        })
        return session
    
    def scrape_url(self, url):
        """Scrape with delays and respect"""
        # Random delay between requests
        delay = random.uniform(*self.delay_range)
        time.sleep(delay)
        
        # Scrape
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        
        return response
    
    def log_activity(self, url, timestamp, status):
        """Log what you're scraping for transparency"""
        print(f"[{timestamp}] Scraped: {url} - Status: {status}")
 
# Usage: Clearly identify yourself and your purpose
scraper = EthicalScraper(delay_range=(3, 7))
 
try:
    response = scraper.scrape_url("https://example.com/public-data")
    scraper.log_activity(response.url, time.time(), response.status_code)
except Exception as e:
    print(f"Error: {e}")

Mistake 10: Not Monitoring and Adapting to Changes

The Problem

Websites constantly change their HTML structure, add new anti-bot measures, and update their infrastructure. A scraper that works today might break tomorrow. Without monitoring, you won't know until it's too late.

What NOT to Do

# ❌ Set it and forget it
# "I built this scraper in 2020, and it's still running..."
# Wrong. It's probably broken.
import requests
from bs4 import BeautifulSoup
 
def scrape_products(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    products = soup.find_all('div', class_='product-item')  # This selector breaks when the site updates
    # ...no check that anything was actually found, no alerting

What TO Do Instead

# ✅ Build monitoring and alerting
import requests
from bs4 import BeautifulSoup
import logging
import json
from datetime import datetime
 
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
 
class MonitoredScraper:
    def __init__(self):
        self.stats = {
            'total_runs': 0,
            'successful_scrapes': 0,
            'failed_scrapes': 0,
            'errors': [],
            'last_run': None,
        }
    
    def scrape_with_validation(self, url, selectors):
        """Scrape with validation of expected structure"""
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Validate expected structure exists
            for selector_name, selector in selectors.items():
                elements = soup.select(selector)
                
                if not elements:
                    logger.warning(f"No elements found for selector: {selector_name}")
                    logger.warning("Website structure may have changed!")
                    self.stats['errors'].append({
                        'timestamp': datetime.now().isoformat(),
                        'url': url,
                        'error': f'Missing selector: {selector_name}'
                    })
                    return None
            
            # Continue scraping if structure is valid
            self.stats['successful_scrapes'] += 1
            logger.info(f"Successfully scraped {url}")
            
            return soup
        
        except Exception as e:
            self.stats['failed_scrapes'] += 1
            self.stats['errors'].append({
                'timestamp': datetime.now().isoformat(),
                'url': url,
                'error': str(e)
            })
            logger.error(f"Error scraping {url}: {e}")
            return None
        
        finally:
            self.stats['total_runs'] += 1
            self.stats['last_run'] = datetime.now().isoformat()
    
    def save_stats(self, filename='scraper_stats.json'):
        """Save stats for monitoring"""
        with open(filename, 'w') as f:
            json.dump(self.stats, f, indent=2)
        logger.info(f"Stats saved to {filename}")
    
    def get_health_report(self):
        """Generate health report"""
        if self.stats['total_runs'] == 0:
            return "No runs yet"
        
        success_rate = (
            self.stats['successful_scrapes'] / 
            self.stats['total_runs'] * 100
        )
        
        report = f"""
Scraper Health Report:
- Total runs: {self.stats['total_runs']}
- Success rate: {success_rate:.1f}%
- Last run: {self.stats['last_run']}
- Total errors logged: {len(self.stats['errors'])}
        """
        return report
 
# Usage
scraper = MonitoredScraper()
 
selectors = {
    'products': 'div.product-item',
    'price': 'span.price',
    'rating': 'span.rating'
}
 
result = scraper.scrape_with_validation("https://example.com/products", selectors)
 
# Check health
print(scraper.get_health_report())
 
# Save stats for external monitoring
scraper.save_stats()
 
# Alert if structure breaks
if result is None:
    logger.critical("Website structure appears to have changed. Manual intervention needed.")
    # Send alert (email, Slack, etc.)
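
For the alert itself, a Slack incoming webhook is one of the simplest options; a sketch (the webhook URL is a placeholder you would create in Slack):

# Example: send an alert to a Slack incoming webhook (sketch)
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_alert(message):
    """Post a short alert message to Slack."""
    try:
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
    except requests.exceptions.RequestException as e:
        logger.error(f"Failed to send alert: {e}")

if result is None:
    send_alert("Scraper alert: website structure appears to have changed.")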

Mistake 11: Not Handling Dynamic URLs and Pagination

The Problem

Many websites use pagination, infinite scrolling, or dynamic URL generation. If you don't handle this, you only scrape a tiny fraction of the data.

What NOT to Do

# ❌ Only scraping the first page
url = "https://example.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
products = soup.find_all('div', class_='product')
# Gets only ~20 products when there are thousands

What TO Do Instead

# ✅ Handle pagination
import requests
from bs4 import BeautifulSoup
import logging
 
logger = logging.getLogger(__name__)
 
def scrape_all_pages(base_url, max_pages=None):
    """Scrape all pages with pagination"""
    all_products = []
    page = 1
    
    while True:
        if max_pages and page > max_pages:
            break
        
        # Construct paginated URL
        url = f"{base_url}?page={page}"
        
        try:
            logger.info(f"Scraping page {page}: {url}")
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.content, 'html.parser')
            products = soup.find_all('div', class_='product')
            
            if not products:
                logger.info(f"No products found on page {page}. Stopping.")
                break
            
            all_products.extend(products)
            logger.info(f"Found {len(products)} products on page {page}")
            
            # Check if there's a next page button
            next_button = soup.find('a', class_='next-page')
            if not next_button:
                logger.info("No next page button found. Pagination complete.")
                break
            
            page += 1
        
        except requests.exceptions.RequestException as e:
            logger.error(f"Error scraping page {page}: {e}")
            break
    
    return all_products
 
# Usage
products = scrape_all_pages("https://example.com/products", max_pages=None)
logger.info(f"Total products scraped: {len(products)}")
 
# Handle infinite scroll (requires JavaScript rendering)
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
 
def scrape_infinite_scroll(url, scroll_count=10):
    """Scrape site with infinite scroll"""
    driver = webdriver.Chrome()
    driver.get(url)
    
    all_items = []
    last_height = driver.execute_script("return document.body.scrollHeight")
    
    for i in range(scroll_count):
        # Scroll to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # Wait for new items to load
        
        # Get current items
        items = driver.find_elements(By.CLASS_NAME, "product")
        all_items = items
        
        # Check if we've reached the end
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            logger.info("Reached end of infinite scroll")
            break
        
        last_height = new_height
    
    driver.quit()
    return all_items
 
# Usage
products = scrape_infinite_scroll("https://example.com/products")
logger.info(f"Total products found: {len(products)}")

Mistake 12: Not Using or Checking for Available APIs

The Problem

Many sites offer official APIs that are faster, more reliable, and legal to use. Scraping the HTML when an API exists is often wasteful and violates terms of service.

What NOT to Do

# ❌ Scraping HTML when API exists
from bs4 import BeautifulSoup
import requests
 
# Spending hours building an HTML scraper for search results...
url = "https://twitter.com/search?q=python"
# ...when Twitter/X has an official API: https://developer.twitter.com/

What TO Do Instead

# ✅ Check for and use official APIs
# Common APIs to check:
#
# Twitter/X: https://developer.twitter.com/
# GitHub: https://docs.github.com/en/rest
# YouTube: https://developers.google.com/youtube
# Reddit: https://www.reddit.com/dev/api
# Amazon: Product Advertising API
# Google: Custom Search API, Maps API, etc.
# Weather: OpenWeatherMap, WeatherAPI, etc.
# News: NewsAPI, Guardian API
# E-commerce: Many have partner APIs
 
import os
import requests
 
# Example: Using NewsAPI instead of scraping news sites
def get_news_from_api():
    """Use official API instead of scraping"""
    api_key = "your-newsapi-key"
    url = "https://newsapi.org/v2/top-headlines"
    
    params = {
        'country': 'us',
        'apiKey': api_key
    }
    
    response = requests.get(url, params=params)
    articles = response.json()['articles']
    
    return articles
 
# Benefits of using APIs:
# - Faster and more reliable
# - Legal and compliant
# - Better support and documentation
# - Structured data (no parsing needed)
# - Rate limits are clear and respected
 
# Decision tree:
# 1. Does the site have an official API? USE IT
# 2. Is scraping allowed? Check robots.txt and the terms of service
# 3. Are you being respectful (delays, low volume)? Proceed with caution
# 4. Is it commercial/competitive scraping? High legal risk - prefer an official API

Quick Reference: Common Mistakes Checklist

Before you start scraping, check this list:

  • Checked robots.txt and terms of service
  • Set a proper User-Agent header
  • Implemented rate limiting and delays
  • Handled JavaScript-rendered content correctly
  • Added comprehensive error handling and retries
  • Validated scraped data with a schema
  • Implemented backoff for rate limiting
  • Stored credentials in environment variables, not code
  • Have a legitimate purpose for scraping
  • Monitor for website structure changes
  • Handle pagination correctly
  • Checked for an official API first
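
Some of these checks can be automated before every run. A rough pre-flight sketch (the helpers echo the earlier examples in this article and are not a real library):

# Pre-flight checks before scraping (sketch)
import os

def preflight_checks(url):
    """Run the automatable parts of the checklist."""
    checks = {
        'robots_txt_allows': can_scrape_url(url),          # Mistake 1
        'credentials_in_env': bool(os.getenv('API_KEY')),  # Mistake 8
    }
    for name, ok in checks.items():
        print(f"{'✅' if ok else '❌'} {name}")
    return all(checks.values())

if preflight_checks("https://example.com/products"):
    print("Pre-flight passed - remember delays, error handling, and validation")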

Conclusion: Scrape Responsibly

Most scraping problems stem from either:

  1. Technical mistakes (no error handling, ignoring JavaScript)
  2. Ethical mistakes (ignoring ToS, aggressive scraping)

By avoiding these 12 common mistakes, you'll build scrapers that are:

  • Reliable: Handle errors and recover gracefully
  • Respectful: Don't overwhelm servers or violate terms
  • Maintainable: Easy to debug when sites change
  • Legal: Compliant with ToS and regulations
  • Efficient: Get data quickly without wasting resources

Remember: The best scraper is one that works reliably, doesn't get blocked, and provides high-quality data. Take time to do it right.


What mistakes have you made when scraping? Share your lessons learned in the comments below. Learning from others' mistakes is the fastest way to improve.
