
ScrapeGraphAI in Finance: Automating Financial Data Collection

Learn how to scrape finance websites using ScrapeGraphAI. Discover the best tools and techniques for web scraping finance data.

Tutorials · 14 min read · By Marco Vinciguerra

Financial data is everywhere, but getting it into a usable format is a nightmare. Stock prices on Yahoo Finance look different from Bloomberg, SEC filings are buried in terrible HTML, and every broker displays portfolio data in their own special way.

I've spent way too many hours writing scrapers for financial sites, and they break constantly. Yahoo Finance changes their layout, a broker updates their authentication system, or the SEC website decides to load everything with JavaScript. By the time you fix one scraper, three others have stopped working.

Let's look at how to build financial data collection systems that actually keep working.

The Traditional Financial Scraping Nightmare

Here's what most people try when they need financial data:

Stock Prices from Yahoo Finance

python
import requests
from bs4 import BeautifulSoup
import re

def get_yahoo_stock_price(symbol):
    url = f"https://finance.yahoo.com/quote/{symbol}"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Try multiple selectors because Yahoo keeps changing them
    price_selectors = [
        'fin-streamer[data-field="regularMarketPrice"]',
        'span[data-reactid*="regularMarketPrice"]',
        'div[data-test="qsp-price"] span'
    ]
    
    for selector in price_selectors:
        price_elem = soup.select_one(selector)
        if price_elem:
            price_text = price_elem.text.strip()
            # Extract number from text like "$150.25"
            price_match = re.search(r'\d+\.?\d*', price_text.replace(',', ''))
            if price_match:
                return float(price_match.group())
    
    return None

# This breaks every few months when Yahoo updates their site

Company Financial Statements

python
def scrape_sec_filing(ticker, form_type="10-K"):
    # Navigate SEC EDGAR system
    search_url = f"https://www.sec.gov/cgi-bin/browse-edgar"
    
    # This is already getting complicated...
    params = {
        'action': 'getcompany',
        'CIK': ticker,
        'type': form_type,
        'dateb': '',
        'owner': 'exclude',
        'count': '10'
    }
    
    response = requests.get(search_url, params=params)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the latest filing link
    filing_links = soup.find_all('a', id='documentsbutton')
    if not filing_links:
        return None
    
    # Follow the link to get the actual document
    filing_url = "https://www.sec.gov" + filing_links[0]['href']
    # ... and this goes on for 50+ more lines

The problem is obvious - every financial site has different HTML structures, authentication requirements, and anti-bot protections. You spend more time maintaining scrapers than analyzing data.

ScrapeGraphAI Approach to Financial Data

As we saw in the traditional approach above, manual scraping is fragile and time-consuming. ScrapeGraphAI offers a completely different approach.

Instead of wrestling with selectors and site-specific quirks, describe what you need:

python
from scrapegraph_py import Client
from datetime import datetime

class FinancialDataCollector:
    def __init__(self, api_key):
        self.client = Client(api_key=api_key)
    
    def get_stock_data(self, symbol):
        # Works on any financial site
        response = self.client.smartscraper(
            website_url=f"https://finance.yahoo.com/quote/{symbol}",
            user_prompt=f"Get the current stock price, daily change, volume, market cap, P/E ratio, and 52-week high/low for {symbol}"
        )
        
        return response.get('result', {})
    
    def get_company_financials(self, ticker):
        # SEC filings made simple
        response = self.client.smartscraper(
            website_url=f"https://www.sec.gov/edgar/browse/?CIK={ticker}",
            user_prompt="Find the latest 10-K annual report and extract revenue, net income, total assets, and debt information"
        )
        
        return response.get('result', {})

# Usage
collector = FinancialDataCollector("your-api-key")

# Get real-time stock data
apple_data = collector.get_stock_data("AAPL")

# Get fundamental data
apple_financials = collector.get_company_financials("AAPL")
print(f"Revenue: {apple_financials.get('revenue')}")

Same code works across different financial sites because ScrapeGraphAI understands what financial data looks like.
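
To see that portability, here's a minimal sketch pointing the same kind of prompt at MarketWatch instead of Yahoo (the URL and prompt are illustrative; any quote page should behave the same way):

python
# Same prompt style, different site - no selectors to rewrite
response = collector.client.smartscraper(
    website_url="https://www.marketwatch.com/investing/stock/aapl",
    user_prompt="Get the current stock price, daily change, volume, market cap, and P/E ratio for AAPL"
)
print(response.get('result', {}))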

Real-World Financial Use Cases

Now that we've seen the ScrapeGraphAI approach, let's explore practical applications in finance. These examples show how to build robust financial data collection systems that work across multiple sources.

Portfolio Tracking

python
class PortfolioTracker:
    def __init__(self, api_key):
        self.client = Client(api_key=api_key)
        self.holdings = {}
    
    def add_holding(self, symbol, shares, cost_basis):
        self.holdings[symbol] = {
            'shares': shares,
            'cost_basis': cost_basis
        }
    
    def get_current_value(self):
        total_value = 0
        portfolio_data = []
        
        for symbol, holding in self.holdings.items():
            response = self.client.smartscraper(
                website_url=f"https://finance.yahoo.com/quote/{symbol}",
                user_prompt=f"Get current price and daily change percentage for {symbol}"
            )
            
            data = response.get('result', {})
            current_price = float(str(data.get('price', 0)).replace('$', '').replace(',', ''))
            
            position_value = current_price * holding['shares']
            total_cost = holding['cost_basis'] * holding['shares']
            gain_loss = position_value - total_cost
            
            portfolio_data.append({
                'symbol': symbol,
                'shares': holding['shares'],
                'current_price': current_price,
                'position_value': position_value,
                'cost_basis': holding['cost_basis'],
                'gain_loss': gain_loss,
                'gain_loss_pct': (gain_loss / total_cost) * 100
            })
            
            total_value += position_value
        
        return {
            'total_value': total_value,
            'positions': portfolio_data
        }

# Usage
tracker = PortfolioTracker("your-api-key")
tracker.add_holding("AAPL", 100, 150.00)
tracker.add_holding("GOOGL", 50, 2500.00)
tracker.add_holding("TSLA", 25, 200.00)

portfolio = tracker.get_current_value()
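
From there, printing a quick summary is straightforward (the keys match the dictionaries built in get_current_value above):

python
print(f"Total value: ${portfolio['total_value']:,.2f}")
for position in portfolio['positions']:
    print(f"{position['symbol']}: ${position['position_value']:,.2f} "
          f"({position['gain_loss_pct']:+.1f}%)")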

Market Screening

python
def screen_stocks(self, criteria):
    """Screen stocks based on financial criteria (method of FinancialDataCollector)"""
    screener_urls = [
        "https://finviz.com/screener.ashx",
        "https://finance.yahoo.com/screener",
        "https://www.marketwatch.com/tools/screener"
    ]
    
    all_results = []
    
    for url in screener_urls:
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt=f"Find stocks that meet these criteria: {criteria}. Return ticker symbols, company names, prices, and key metrics like P/E ratio and market cap."
            )
            
            results = response.get('result', [])
            if results:
                all_results.extend(results)
                break  # Found results, no need to try other screeners
                
        except Exception as e:
            print(f"Screener {url} failed: {e}")
    
    return all_results

# Usage
criteria = "P/E ratio under 15, market cap over $1B, revenue growth over 10%"
stocks = collector.screen_stocks(criteria)

for stock in stocks[:10]:  # Top 10 results
    print(f"{stock.get('ticker')}: {stock.get('company')} - {stock.get('price')}")

Economic Indicators

python
def get_economic_data(self):
    """Collect key economic indicators (method of FinancialDataCollector)"""
    sources = {
        "Federal Reserve": "https://www.federalreserve.gov/releases/h15/",
        "Bureau of Labor Statistics": "https://www.bls.gov/",
        "Treasury": "https://www.treasury.gov/resource-center/data-chart-center/"
    }
    
    economic_data = {}
    
    for source_name, url in sources.items():
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt="Extract current economic indicators including interest rates, unemployment rate, inflation rate, and GDP growth"
            )
            
            data = response.get('result', {})
            economic_data[source_name] = data
            
        except Exception as e:
            print(f"Failed to get data from {source_name}: {e}")
    
    return economic_data

# Get current economic indicators
econ_data = collector.get_economic_data()
fed_data = econ_data.get("Federal Reserve", {})
print(f"Federal Funds Rate: {fed_data.get('fed_funds_rate')}")
print(f"10-Year Treasury: {fed_data.get('10_year_treasury')}")

Earnings Calendar

python
def get_earnings_calendar(self, weeks_ahead=2):
    """Get upcoming earnings announcements (method of FinancialDataCollector)"""
    earnings_sites = [
        "https://finance.yahoo.com/calendar/earnings",
        "https://www.earningswhispers.com/calendar",
        "https://www.marketwatch.com/tools/earningscalendar"
    ]
    
    for site in earnings_sites:
        try:
            response = self.client.smartscraper(
                website_url=site,
                user_prompt=f"Get upcoming earnings announcements for the next {weeks_ahead} weeks. Include company ticker, company name, earnings date, estimated EPS, and previous EPS"
            )
            
            earnings = response.get('result', [])
            if earnings:
                return sorted(earnings, key=lambda x: x.get('date', ''))
                
        except Exception as e:
            print(f"Failed to get earnings from {site}: {e}")
    
    return []

# Get this week's earnings
upcoming_earnings = collector.get_earnings_calendar(1)

print("This Week's Earnings:")
for earning in upcoming_earnings[:10]:
    print(f"{earning.get('date')}: {earning.get('ticker')} - {earning.get('company')}")
    print(f"  Est. EPS: {earning.get('estimated_eps')}")

Advanced Financial Data Collection

Building on the basic use cases, here are more sophisticated approaches for professional financial data collection. These techniques add reliability, validation, and comprehensive coverage to your financial data systems.

Multi-Source Price Validation

python
import re

def get_validated_price(self, symbol):
    """Get price from multiple sources for validation (method of FinancialDataCollector)"""
    sources = [
        f"https://finance.yahoo.com/quote/{symbol}",
        f"https://www.marketwatch.com/investing/stock/{symbol}",
        f"https://www.google.com/finance/quote/{symbol}:NASDAQ"
    ]
    
    prices = []
    
    for url in sources:
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt=f"Get the current stock price for {symbol}"
            )
            
            result = response.get('result', {})
            price_str = str(result.get('price', ''))
            
            # Extract numeric price
            price_match = re.search(r'\d+\.?\d*', price_str.replace(',', ''))
            if price_match:
                price = float(price_match.group())
                prices.append(price)
                
        except Exception as e:
            print(f"Failed to get price from {url}: {e}")
    
    if not prices:
        return None
    
    # Return average if prices are close, otherwise flag discrepancy
    avg_price = sum(prices) / len(prices)
    max_deviation = max(abs(p - avg_price) for p in prices)
    
    if max_deviation > avg_price * 0.01:  # More than 1% difference
        print(f"Warning: Price discrepancy for {symbol}: {prices}")
    
    return avg_price
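
Usage, assuming the method is added to FinancialDataCollector like the others:

python
price = collector.get_validated_price("AAPL")
if price:
    print(f"Validated AAPL price: ${price:.2f}")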

Options Data

python
def get_options_data(self, symbol, expiration_date=None):
    """Get options chain data (method of FinancialDataCollector)"""
    response = self.client.smartscraper(
        website_url=f"https://finance.yahoo.com/quote/{symbol}/options",
        user_prompt=f"Get options data for {symbol} including call and put options with strike prices, bid/ask prices, volume, and open interest"
    )
    
    return response.get('result', {})

# Get AAPL options
aapl_options = collector.get_options_data("AAPL")
print("AAPL Call Options:")
for option in aapl_options.get('calls', [])[:5]:
    print(f"Strike: {option.get('strike')}, Bid: {option.get('bid')}, Ask: {option.get('ask')}")

Insider Trading Data

python
def get_insider_trades(self, symbol):
    """Get recent insider trading activity (method of FinancialDataCollector)"""
    insider_sites = [
        f"https://www.sec.gov/edgar/browse/?CIK={symbol}",
        f"https://finviz.com/quote.ashx?t={symbol}",
        f"https://www.nasdaq.com/market-activity/stocks/{symbol.lower()}/insider-activity"
    ]
    
    for site in insider_sites:
        try:
            response = self.client.smartscraper(
                website_url=site,
                user_prompt=f"Find recent insider trading activity for {symbol} including insider name, position, transaction type (buy/sell), number of shares, and transaction date"
            )
            
            trades = response.get('result', [])
            if trades:
                return trades
                
        except Exception as e:
            print(f"Failed to get insider data from {site}: {e}")
    
    return []

# Check insider activity
insider_trades = collector.get_insider_trades("TSLA")
print("Recent TSLA Insider Trades:")
for trade in insider_trades[:5]:
    print(f"{trade.get('date')}: {trade.get('insider')} - {trade.get('transaction')} {trade.get('shares')} shares")


Real-Time Market Monitoring

python
import schedule
import time
from datetime import datetime

class MarketMonitor:
    def __init__(self, api_key):
        self.client = Client(api_key=api_key)
        self.watchlist = []
        self.alerts = []
    
    def add_to_watchlist(self, symbol, alert_conditions=None):
        """Add stock to monitoring watchlist"""
        self.watchlist.append({
            'symbol': symbol,
            'conditions': alert_conditions or {},
            'last_price': None
        })
    
    def check_market_conditions(self):
        """Check market-wide conditions"""
        response = self.client.smartscraper(
            website_url="https://finance.yahoo.com",
            user_prompt="Get major market indices (S&P 500, Dow Jones, NASDAQ) with current values and daily changes"
        )
        
        market_data = response.get('result', {})
        
        # Check for significant market moves
        for index, data in market_data.items():
            if 'change_percent' in data:
                change_pct = float(str(data['change_percent']).replace('%', ''))
                if abs(change_pct) > 2.0:  # More than 2% move
                    self.alerts.append({
                        'type': 'market_move',
                        'message': f"{index} moved {change_pct:+.1f}%",
                        'timestamp': datetime.now()
                    })
        
        return market_data
    
    def monitor_watchlist(self):
        """Check all stocks in watchlist"""
        print(f"Monitoring {len(self.watchlist)} stocks...")
        
        for item in self.watchlist:
            symbol = item['symbol']
            conditions = item['conditions']
            
            try:
                response = self.client.smartscraper(
                    website_url=f"https://finance.yahoo.com/quote/{symbol}",
                    user_prompt=f"Get current price, daily change, and volume for {symbol}"
                )
                
                data = response.get('result', {})
                current_price = float(str(data.get('price', 0)).replace('$', '').replace(',', ''))
                
                # Check alert conditions
                if 'price_above' in conditions and current_price > conditions['price_above']:
                    self.alerts.append({
                        'type': 'price_alert',
                        'symbol': symbol,
                        'message': f"{symbol} rose above {conditions['price_above']}: {current_price}",
                        'timestamp': datetime.now()
                    })
                
                if 'price_below' in conditions and current_price < conditions['price_below']:
                    self.alerts.append({
                        'type': 'price_alert',
                        'symbol': symbol,
                        'message': f"{symbol} fell below {conditions['price_below']}: {current_price}",
                        'timestamp': datetime.now()
                    })
                
                # Check for unusual volume
                volume = data.get('volume', 0)
                if isinstance(volume, str):
                    volume = float(volume.replace(',', ''))
                
                # You'd typically compare to average volume here
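                # Sketch of an unusual-volume check, assuming you track a
                # per-symbol average ('avg_volume' is hypothetical here and
                # would need to be populated from prior checks):
                avg_volume = item.get('avg_volume')
                if avg_volume and volume > 2 * avg_volume:
                    self.alerts.append({
                        'type': 'volume_alert',
                        'symbol': symbol,
                        'message': f"{symbol} volume {volume:,.0f} is over 2x average",
                        'timestamp': datetime.now()
                    })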
                
                item['last_price'] = current_price
                
            except Exception as e:
                print(f"Failed to monitor {symbol}: {e}")
    
    def get_news_sentiment(self, symbol):
        """Get recent news and sentiment for a stock"""
        response = self.client.smartscraper(
            website_url=f"https://finance.yahoo.com/quote/{symbol}/news",
            user_prompt=f"Get recent news headlines about {symbol} and determine if the overall sentiment is positive, negative, or neutral"
        )
        
        return response.get('result', {})
    
    def start_monitoring(self, check_interval_minutes=5):
        """Start continuous market monitoring"""
        def run_checks():
            print(f"
--- Market Check at {datetime.now()} ---")
            
            # Check overall market
            market_data = self.check_market_conditions()
            
            # Check watchlist
            self.monitor_watchlist()
            
            # Process any new alerts
            if self.alerts:
                print(f"
New Alerts ({len(self.alerts)}):")
                for alert in self.alerts[-5:]:  # Show last 5 alerts
                    print(f"  {alert['type']}: {alert['message']}")
        
        # Schedule regular checks
        schedule.every(check_interval_minutes).minutes.do(run_checks)
        
        # Also check at market open/close
        schedule.every().day.at("09:30").do(run_checks)  # Market open
        schedule.every().day.at("16:00").do(run_checks)  # Market close
        
        print(f"Starting market monitoring (checking every {check_interval_minutes} minutes)")
        print("Press Ctrl+C to stop")
        
        while True:
            schedule.run_pending()
            time.sleep(60)

# Usage
monitor = MarketMonitor("your-api-key")

# Add stocks to watch
monitor.add_to_watchlist("AAPL", {"price_above": 200, "price_below": 180})
monitor.add_to_watchlist("TSLA", {"price_above": 300, "price_below": 250})
monitor.add_to_watchlist("GOOGL", {"price_above": 3000, "price_below": 2800})

# Start monitoring
monitor.start_monitoring(check_interval_minutes=10)

JavaScript Version for Trading Dashboards

For web-based applications and trading dashboards, the same prompt-based calls are available through ScrapeGraphAI's JavaScript/Node.js support. The patterns mirror the Python examples above, so any of them can be ported to a Node.js backend or a React-based trading interface.

Compliance and Best Practices

When building production financial data systems, it's crucial to follow best practices for reliability and compliance. These techniques ensure your real-time monitoring and advanced data collection systems work reliably in production environments.

Rate Limiting for Financial Sites

python
import time
from functools import wraps

def rate_limit(calls_per_minute=30):
    """Decorator to rate limit API calls"""
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            left_to_wait = min_interval - elapsed
            if left_to_wait > 0:
                time.sleep(left_to_wait)
            ret = func(*args, **kwargs)
            last_called[0] = time.time()
            return ret
        return wrapper
    return decorator

class RateLimitedFinanceCollector(FinancialDataCollector):
    @rate_limit(calls_per_minute=20)  # 20 calls per minute max
    def get_stock_data(self, symbol):
        return super().get_stock_data(symbol)
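
Usage is unchanged; calls are simply spaced out (at 20 calls per minute, roughly one every three seconds):

python
safe_collector = RateLimitedFinanceCollector("your-api-key")
for symbol in ["AAPL", "MSFT", "GOOGL"]:
    data = safe_collector.get_stock_data(symbol)  # throttled automatically
    print(symbol, data.get('price'))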

Data Validation

python
def validate_financial_data(data, symbol):
    """Validate scraped financial data"""
    errors = []
    
    # Check if price is reasonable
    if 'price' in data:
        try:
            price = float(str(data['price']).replace('$', '').replace(',', ''))
            if price <= 0 or price > 100000:  # Sanity check
                errors.append(f"Unrealistic price ")
        except ValueError:
            errors.append(f"Invalid price format for {symbol}: {data['price']}")
    
    # Check market cap
    if 'market_cap' in data:
        market_cap_str = str(data['market_cap'])
        if not any(suffix in market_cap_str.upper() for suffix in ['B', 'M', 'T']):
            errors.append(f"Unusual market cap format for {symbol}: {market_cap_str}")
    
    # Check P/E ratio
    if 'pe_ratio' in data:
        try:
            pe = float(data['pe_ratio'])
            if pe < 0 or pe > 1000:
                errors.append(f"Unusual P/E ratio for {symbol}: {pe}")
        except (ValueError, TypeError):
            pass  # P/E might be N/A for some stocks
    
    return errors

# Usage
data = collector.get_stock_data("AAPL")
validation_errors = validate_financial_data(data, "AAPL")

if validation_errors:
    print("Data validation warnings:")
    for error in validation_errors:
        print(f"  - {error}")

Error Recovery and Fallbacks

python
def get_stock_data_with_fallbacks(self, symbol):
    """Get stock data with multiple fallback sources (method of FinancialDataCollector)"""
    sources = [
        f"https://finance.yahoo.com/quote/{symbol}",
        f"https://www.marketwatch.com/investing/stock/{symbol}",
        f"https://www.google.com/finance/quote/{symbol}:NASDAQ"
    ]
    
    for i, url in enumerate(sources):
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt=f"Get stock data for {symbol} including price, change, volume, and market cap"
            )
            
            data = response.get('result', {})
            
            # Validate the data
            validation_errors = validate_financial_data(data, symbol)
            if not validation_errors:
                data['source'] = url
                return data
            else:
                print(f"Data validation failed for {url}: {validation_errors}")
                
        except Exception as e:
            print(f"Failed to get data from source {i+1}/{len(sources)}: {e}")
            if i < len(sources) - 1:
                time.sleep(2)  # Wait before trying next source
    
    return None  # All sources failed
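
Usage, again assuming the method lives on FinancialDataCollector:

python
data = collector.get_stock_data_with_fallbacks("AAPL")
if data:
    print(f"AAPL data from {data['source']}: {data.get('price')}")
else:
    print("All sources failed validation")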

Frequently Asked Questions

Is it legal to scrape financial websites?

Yes, ScrapeGraphAI is designed to respect website terms of service and robots.txt files. However, you should always:

  • Check the specific terms of service for each financial site
  • Implement appropriate rate limiting (see our rate limiting section)
  • Consider using official APIs when available for high-frequency trading applications
  • Review compliance requirements for your specific use case

How accurate is the financial data compared to official sources?

ScrapeGraphAI extracts data directly from the same sources you'd see in your browser, so accuracy depends on the source website. For critical applications, we recommend:

  • Cross-validating prices across multiple sources (see Multi-Source Price Validation)
  • Running sanity checks like those in the Data Validation section
  • Falling back to official APIs for numbers you cannot afford to get wrong

Can I use this for real-time trading?

While ScrapeGraphAI can provide real-time data, it's not designed for high-frequency trading (HFT) where millisecond delays matter. For trading applications:

  • Use official broker or exchange APIs for order execution and low-latency quotes
  • Use ScrapeGraphAI for research, screening, alerts, and portfolio monitoring, where second-level latency is acceptable

What's the difference between this and paid financial APIs?

ScrapeGraphAI advantages:

  • No per-API costs or rate limits
  • Works with any financial website
  • No need to learn multiple API formats
  • Automatic handling of site changes

Paid API advantages:

  • Guaranteed uptime and support
  • Structured data formats
  • Historical data access
  • Real-time streaming for HFT

For most applications, ScrapeGraphAI provides the best balance of flexibility and cost-effectiveness.

How do I handle rate limiting for financial sites?

See our detailed rate limiting section for implementation examples. Key strategies:

  • Implement delays between requests
  • Use multiple data sources to distribute load
  • Cache data when appropriate (see the sketch below)
  • Monitor for rate limit responses
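
Caching is easy to sketch: wrap the collector and reuse recent results for a short TTL. The wrapper below is illustrative, not part of the ScrapeGraphAI client:

python
import time

class CachedFinanceCollector:
    """Serve repeated quote requests from a short-lived cache (illustrative sketch)."""
    def __init__(self, collector, ttl_seconds=60):
        self.collector = collector
        self.ttl = ttl_seconds
        self._cache = {}  # symbol -> (fetched_at, data)

    def get_stock_data(self, symbol):
        cached = self._cache.get(symbol)
        if cached and time.time() - cached[0] < self.ttl:
            return cached[1]  # fresh enough, skip the network call
        data = self.collector.get_stock_data(symbol)
        self._cache[symbol] = (time.time(), data)
        return data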

Can I scrape cryptocurrency data?

Yes! ScrapeGraphAI works great for crypto data; the same prompt-based approach shown above applies. Popular crypto sources include:

  • CoinMarketCap
  • CoinGecko
  • Binance
  • Coinbase

What about SEC filings and regulatory data?

Absolutely! ScrapeGraphAI excels at extracting structured data from complex documents. See our SEC filings example for how to extract financial statements, insider trading data, and regulatory filings.

How do I validate the data quality?

We provide comprehensive data validation tools. Key validation checks:

  • Price reasonableness checks
  • Market cap format validation
  • P/E ratio sanity checks
  • Cross-source data comparison

Can I build a complete trading dashboard?

Yes! The building blocks above (portfolio tracking, monitoring, alerts) combine into a complete dashboard. You can build:

  • Real-time portfolio tracking
  • Market overview dashboards
  • Watchlist management
  • Alert systems

What programming languages are supported?

ScrapeGraphAI supports multiple languages:

  • Python: Full support with our Python client
  • JavaScript/Node.js: Complete API support
  • Other languages: REST API access available

How do I get started?

  1. Start with basic portfolio tracking (see the starter sketch below)
  2. Add market screening capabilities
  3. Implement real-time monitoring
  4. Scale up with advanced features
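
To make step 1 concrete, here's a minimal starter using the classes defined earlier (replace the placeholder API key):

python
collector = FinancialDataCollector("your-api-key")
tracker = PortfolioTracker("your-api-key")
tracker.add_holding("AAPL", 10, 150.00)

portfolio = tracker.get_current_value()
print(f"Portfolio value: ${portfolio['total_value']:,.2f}")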

The Bottom Line

Financial data collection used to mean maintaining dozens of fragile scrapers, each one breaking whenever a financial site updated their layout. You'd spend more time fixing scrapers than analyzing markets.

ScrapeGraphAI changes this completely. Instead of fighting with CSS selectors and site-specific authentication, you just describe what financial data you need. When Yahoo Finance redesigns their site, your code keeps working because it understands what stock prices and financial metrics look like.

The examples above cover everything from basic portfolio tracking to complex market monitoring systems. Start with simple portfolio tracking, add real-time monitoring as you need it, then scale up with multi-source validation for production-ready financial data systems.

For more information on getting started, check out our FAQ section or explore our other guides on web scraping with AI and building AI agents.