Financial data is everywhere, but getting it into a usable format is a nightmare. Stock prices on Yahoo Finance look different from Bloomberg, SEC filings are buried in terrible HTML, and every broker displays portfolio data in their own special way.
I've spent way too many hours writing scrapers for financial sites, and they break constantly. Yahoo Finance changes their layout, a broker updates their authentication system, or the SEC website decides to load everything with JavaScript. By the time you fix one scraper, three others have stopped working.
Let's look at how to build financial data collection systems that actually keep working.
Table of Contents
- The Traditional Financial Scraping Nightmare
- The ScrapeGraphAI Approach to Financial Data
- Real-World Financial Use Cases
- Advanced Financial Data Collection
- Real-Time Market Monitoring
- JavaScript Version for Trading Dashboards
- Compliance and Best Practices
- Frequently Asked Questions
- The Bottom Line
The Traditional Financial Scraping Nightmare
Here's what most people try when they need financial data:
Stock Prices from Yahoo Finance
```python
import requests
from bs4 import BeautifulSoup
import re

def get_yahoo_stock_price(symbol):
    url = f"https://finance.yahoo.com/quote/{symbol}"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Try multiple selectors because Yahoo keeps changing them
    price_selectors = [
        'fin-streamer[data-field="regularMarketPrice"]',
        'span[data-reactid*="regularMarketPrice"]',
        'div[data-test="qsp-price"] span'
    ]

    for selector in price_selectors:
        price_elem = soup.select_one(selector)
        if price_elem:
            price_text = price_elem.text.strip()
            # Extract a number from text like "$150.25"
            price_match = re.search(r'[\d,]+\.?\d*', price_text.replace(',', ''))
            if price_match:
                return float(price_match.group())
    return None

# This breaks every few months when Yahoo updates their site
```
Company Financial Statements
```python
def scrape_sec_filing(ticker, form_type="10-K"):
    # Navigate the SEC EDGAR system
    search_url = "https://www.sec.gov/cgi-bin/browse-edgar"
    # This is already getting complicated...
    params = {
        'action': 'getcompany',
        'CIK': ticker,
        'type': form_type,
        'dateb': '',
        'owner': 'exclude',
        'count': '10'
    }
    response = requests.get(search_url, params=params)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the latest filing link
    filing_links = soup.find_all('a', id='documentsbutton')
    if not filing_links:
        return None

    # Follow the link to get the actual document
    filing_url = "https://www.sec.gov" + filing_links[0]['href']
    # ... and this goes on for 50+ more lines
```
The problem is obvious: every financial site has its own HTML structure, authentication requirements, and anti-bot protections. You spend more time maintaining scrapers than analyzing data.
The ScrapeGraphAI Approach to Financial Data
As we saw in the traditional approach above, manual scraping is fragile and time-consuming. ScrapeGraphAI offers a completely different approach.
Instead of wrestling with selectors and site-specific quirks, describe what you need:
```python
from scrapegraph_py import Client

class FinancialDataCollector:
    def __init__(self, api_key):
        self.client = Client(api_key=api_key)

    def get_stock_data(self, symbol):
        # Works on any financial site
        response = self.client.smartscraper(
            website_url=f"https://finance.yahoo.com/quote/{symbol}",
            user_prompt=f"Get the current stock price, daily change, volume, market cap, P/E ratio, and 52-week high/low for {symbol}"
        )
        return response.get('result', {})

    def get_company_financials(self, ticker):
        # SEC filings made simple
        response = self.client.smartscraper(
            website_url=f"https://www.sec.gov/edgar/browse/?CIK={ticker}",
            user_prompt="Find the latest 10-K annual report and extract revenue, net income, total assets, and debt information"
        )
        return response.get('result', {})

# Usage
collector = FinancialDataCollector("your-api-key")

# Get real-time stock data
apple_data = collector.get_stock_data("AAPL")

# Get fundamental data
apple_financials = collector.get_company_financials("AAPL")
print(f"Revenue: {apple_financials.get('revenue')}")
```
The same code works across different financial sites because ScrapeGraphAI understands what financial data looks like.
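To make that concrete, here's a small sketch where only the URL changes between sources (the MarketWatch URL pattern is an assumption for illustration; adjust it to the sites you target):

```python
# Same extraction prompt, different site: only the URL changes.
# The MarketWatch URL pattern below is an assumption for illustration.
def get_stock_data_from(collector, symbol, url_template):
    response = collector.client.smartscraper(
        website_url=url_template.format(symbol=symbol),
        user_prompt=f"Get the current stock price, daily change, volume, and market cap for {symbol}"
    )
    return response.get('result', {})

yahoo = get_stock_data_from(collector, "AAPL", "https://finance.yahoo.com/quote/{symbol}")
marketwatch = get_stock_data_from(collector, "AAPL", "https://www.marketwatch.com/investing/stock/{symbol}")
```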
Real-World Financial Use Cases
Now that we've seen the ScrapeGraphAI approach, let's explore practical applications in finance. These examples show how to build robust financial data collection systems that work across multiple sources.
Portfolio Tracking
```python
from scrapegraph_py import Client

class PortfolioTracker:
    def __init__(self, api_key):
        self.client = Client(api_key=api_key)
        self.holdings = {}

    def add_holding(self, symbol, shares, cost_basis):
        self.holdings[symbol] = {
            'shares': shares,
            'cost_basis': cost_basis
        }

    def get_current_value(self):
        total_value = 0
        portfolio_data = []

        for symbol, holding in self.holdings.items():
            response = self.client.smartscraper(
                website_url=f"https://finance.yahoo.com/quote/{symbol}",
                user_prompt=f"Get current price and daily change percentage for {symbol}"
            )
            data = response.get('result', {})
            current_price = float(str(data.get('price', 0)).replace('$', '').replace(',', ''))

            position_value = current_price * holding['shares']
            total_cost = holding['cost_basis'] * holding['shares']
            gain_loss = position_value - total_cost

            portfolio_data.append({
                'symbol': symbol,
                'shares': holding['shares'],
                'current_price': current_price,
                'position_value': position_value,
                'cost_basis': holding['cost_basis'],
                'gain_loss': gain_loss,
                'gain_loss_pct': (gain_loss / total_cost) * 100
            })
            total_value += position_value

        return {
            'total_value': total_value,
            'positions': portfolio_data
        }

# Usage
tracker = PortfolioTracker("your-api-key")
tracker.add_holding("AAPL", 100, 150.00)
tracker.add_holding("GOOGL", 50, 2500.00)
tracker.add_holding("TSLA", 25, 200.00)

portfolio = tracker.get_current_value()
```
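A short report loop makes the returned structure readable; a minimal sketch:

```python
# Print a simple position report from the structure returned above
print(f"Total portfolio value: ${portfolio['total_value']:,.2f}")
for pos in portfolio['positions']:
    print(f"{pos['symbol']}: {pos['shares']} shares @ ${pos['current_price']:,.2f} "
          f"= ${pos['position_value']:,.2f} ({pos['gain_loss_pct']:+.1f}%)")
```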
Market Screening
```python
def screen_stocks(self, criteria):
    """Screen stocks based on financial criteria (add to FinancialDataCollector)"""
    screener_urls = [
        "https://finviz.com/screener.ashx",
        "https://finance.yahoo.com/screener",
        "https://www.marketwatch.com/tools/screener"
    ]

    all_results = []
    for url in screener_urls:
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt=f"Find stocks that meet these criteria: {criteria}. Return ticker symbols, company names, prices, and key metrics like P/E ratio and market cap."
            )
            results = response.get('result', [])
            if results:
                all_results.extend(results)
                break  # Found results, no need to try other screeners
        except Exception as e:
            print(f"Screener {url} failed: {e}")

    return all_results

# Usage
criteria = "P/E ratio under 15, market cap over $1B, revenue growth over 10%"
stocks = collector.screen_stocks(criteria)
for stock in stocks[:10]:  # Top 10 results
    print(f"{stock.get('ticker')}: {stock.get('name')} - {stock.get('price')}")
```
Economic Indicators
```python
def get_economic_data(self):
    """Collect key economic indicators (add to FinancialDataCollector)"""
    sources = {
        "Federal Reserve": "https://www.federalreserve.gov/releases/h15/",
        "Bureau of Labor Statistics": "https://www.bls.gov/",
        "Treasury": "https://www.treasury.gov/resource-center/data-chart-center/"
    }

    economic_data = {}
    for source_name, url in sources.items():
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt="Extract current economic indicators including interest rates, unemployment rate, inflation rate, and GDP growth"
            )
            data = response.get('result', {})
            economic_data[source_name] = data
        except Exception as e:
            print(f"Failed to get data from {source_name}: {e}")

    return economic_data

# Get current economic indicators
econ_data = collector.get_economic_data()
fed_data = econ_data.get("Federal Reserve", {})
print(f"Federal Funds Rate: {fed_data.get('fed_funds_rate')}")
print(f"10-Year Treasury: {fed_data.get('10_year_treasury')}")
```
Earnings Calendar
```python
def get_earnings_calendar(self, weeks_ahead=2):
    """Get upcoming earnings announcements (add to FinancialDataCollector)"""
    earnings_sites = [
        "https://finance.yahoo.com/calendar/earnings",
        "https://www.earningswhispers.com/calendar",
        "https://www.marketwatch.com/tools/earningscalendar"
    ]

    for site in earnings_sites:
        try:
            response = self.client.smartscraper(
                website_url=site,
                user_prompt=f"Get upcoming earnings announcements for the next {weeks_ahead} weeks. Include company ticker, company name, earnings date, estimated EPS, and previous EPS"
            )
            earnings = response.get('result', [])
            if earnings:
                return sorted(earnings, key=lambda x: x.get('date', ''))
        except Exception as e:
            print(f"Failed to get earnings from {site}: {e}")

    return []

# Get this week's earnings
upcoming_earnings = collector.get_earnings_calendar(1)
print("This Week's Earnings:")
for earning in upcoming_earnings[:10]:
    print(f"{earning.get('date')}: {earning.get('ticker')} - {earning.get('company')}")
    print(f"  Est. EPS: {earning.get('estimated_eps')}")
```
Advanced Financial Data Collection
Building on the basic use cases, here are more sophisticated approaches for professional financial data collection. These techniques add reliability, validation, and comprehensive coverage to your financial data systems.
Multi-Source Price Validation
```python
import re

def get_validated_price(self, symbol):
    """Get price from multiple sources for validation (add to FinancialDataCollector)"""
    sources = [
        f"https://finance.yahoo.com/quote/{symbol}",
        f"https://www.marketwatch.com/investing/stock/{symbol}",
        f"https://www.google.com/finance/quote/{symbol}:NASDAQ"
    ]

    prices = []
    for url in sources:
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt=f"Get the current stock price for {symbol}"
            )
            result = response.get('result', {})
            price_str = str(result.get('price', ''))
            # Extract the numeric price
            price_match = re.search(r'[\d,]+\.?\d*', price_str.replace(',', ''))
            if price_match:
                price = float(price_match.group())
                prices.append(price)
        except Exception as e:
            print(f"Failed to get price from {url}: {e}")

    if not prices:
        return None

    # Return the average if prices are close, otherwise flag the discrepancy
    avg_price = sum(prices) / len(prices)
    max_deviation = max(abs(p - avg_price) for p in prices)
    if max_deviation > avg_price * 0.01:  # More than 1% difference
        print(f"Warning: Price discrepancy for {symbol}: {prices}")

    return avg_price
```
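Usage is the same as the other collector methods:

```python
# Cross-checked price; returns None if every source failed
price = collector.get_validated_price("MSFT")
if price is not None:
    print(f"Validated MSFT price: ${price:,.2f}")
```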
Options Data
```python
def get_options_data(self, symbol, expiration_date=None):
    """Get options chain data (add to FinancialDataCollector)"""
    response = self.client.smartscraper(
        website_url=f"https://finance.yahoo.com/quote/{symbol}/options",
        user_prompt=f"Get options data for {symbol} including call and put options with strike prices, bid/ask prices, volume, and open interest"
    )
    return response.get('result', {})

# Get AAPL options
aapl_options = collector.get_options_data("AAPL")
print("AAPL Call Options:")
for option in aapl_options.get('calls', [])[:5]:
    print(f"Strike: {option.get('strike')}, Bid: {option.get('bid')}, Ask: {option.get('ask')}")
```
Insider Trading Data
```python
def get_insider_trades(self, symbol):
    """Get recent insider trading activity (add to FinancialDataCollector)"""
    insider_sites = [
        f"https://www.sec.gov/edgar/browse/?CIK={symbol}",
        f"https://finviz.com/quote.ashx?t={symbol}",
        f"https://www.nasdaq.com/market-activity/stocks/{symbol.lower()}/insider-activity"
    ]

    for site in insider_sites:
        try:
            response = self.client.smartscraper(
                website_url=site,
                user_prompt=f"Find recent insider trading activity for {symbol} including insider name, position, transaction type (buy/sell), number of shares, and transaction date"
            )
            trades = response.get('result', [])
            if trades:
                return trades
        except Exception as e:
            print(f"Failed to get insider data from {site}: {e}")

    return []

# Check insider activity
insider_trades = collector.get_insider_trades("TSLA")
print("Recent TSLA Insider Trades:")
for trade in insider_trades[:5]:
    print(f"{trade.get('date')}: {trade.get('insider')} - {trade.get('transaction')} {trade.get('shares')} shares")
```
Real-Time Market Monitoring
```python
import schedule
import time
from datetime import datetime
from scrapegraph_py import Client

class MarketMonitor:
    def __init__(self, api_key):
        self.client = Client(api_key=api_key)
        self.watchlist = []
        self.alerts = []

    def add_to_watchlist(self, symbol, alert_conditions=None):
        """Add stock to monitoring watchlist"""
        self.watchlist.append({
            'symbol': symbol,
            'conditions': alert_conditions or {},
            'last_price': None
        })

    def check_market_conditions(self):
        """Check market-wide conditions"""
        response = self.client.smartscraper(
            website_url="https://finance.yahoo.com",
            user_prompt="Get major market indices (S&P 500, Dow Jones, NASDAQ) with current values and daily changes"
        )
        market_data = response.get('result', {})

        # Check for significant market moves
        for index, data in market_data.items():
            if isinstance(data, dict) and 'change_percent' in data:
                change_pct = float(str(data['change_percent']).replace('%', ''))
                if abs(change_pct) > 2.0:  # More than 2% move
                    self.alerts.append({
                        'type': 'market_move',
                        'message': f"{index} moved {change_pct:+.1f}%",
                        'timestamp': datetime.now()
                    })
        return market_data

    def monitor_watchlist(self):
        """Check all stocks in watchlist"""
        print(f"Monitoring {len(self.watchlist)} stocks...")
        for item in self.watchlist:
            symbol = item['symbol']
            conditions = item['conditions']
            try:
                response = self.client.smartscraper(
                    website_url=f"https://finance.yahoo.com/quote/{symbol}",
                    user_prompt=f"Get current price, daily change, and volume for {symbol}"
                )
                data = response.get('result', {})
                current_price = float(str(data.get('price', 0)).replace('$', '').replace(',', ''))

                # Check alert conditions
                if 'price_above' in conditions and current_price > conditions['price_above']:
                    self.alerts.append({
                        'type': 'price_alert',
                        'symbol': symbol,
                        'message': f"{symbol} rose above {conditions['price_above']} (now {current_price})",
                        'timestamp': datetime.now()
                    })
                if 'price_below' in conditions and current_price < conditions['price_below']:
                    self.alerts.append({
                        'type': 'price_alert',
                        'symbol': symbol,
                        'message': f"{symbol} fell below {conditions['price_below']} (now {current_price})",
                        'timestamp': datetime.now()
                    })

                # Check for unusual volume
                volume = data.get('volume', 0)
                if isinstance(volume, str):
                    volume = float(volume.replace(',', ''))
                # You'd typically compare to average volume here

                item['last_price'] = current_price
            except Exception as e:
                print(f"Failed to monitor {symbol}: {e}")

    def get_news_sentiment(self, symbol):
        """Get recent news and sentiment for a stock"""
        response = self.client.smartscraper(
            website_url=f"https://finance.yahoo.com/quote/{symbol}/news",
            user_prompt=f"Get recent news headlines about {symbol} and determine if the overall sentiment is positive, negative, or neutral"
        )
        return response.get('result', {})

    def start_monitoring(self, check_interval_minutes=5):
        """Start continuous market monitoring"""
        def run_checks():
            print(f"\n--- Market Check at {datetime.now()} ---")
            # Check overall market
            self.check_market_conditions()
            # Check watchlist
            self.monitor_watchlist()
            # Report any new alerts
            if self.alerts:
                print(f"\nNew Alerts ({len(self.alerts)}):")
                for alert in self.alerts[-5:]:  # Show last 5 alerts
                    print(f"  {alert['type']}: {alert['message']}")

        # Schedule regular checks
        schedule.every(check_interval_minutes).minutes.do(run_checks)
        # Also check at market open/close
        schedule.every().day.at("09:30").do(run_checks)  # Market open
        schedule.every().day.at("16:00").do(run_checks)  # Market close

        print(f"Starting market monitoring (checking every {check_interval_minutes} minutes)")
        print("Press Ctrl+C to stop")
        while True:
            schedule.run_pending()
            time.sleep(60)

# Usage
monitor = MarketMonitor("your-api-key")

# Add stocks to watch
monitor.add_to_watchlist("AAPL", {"price_above": 200, "price_below": 180})
monitor.add_to_watchlist("TSLA", {"price_above": 300, "price_below": 250})
monitor.add_to_watchlist("GOOGL", {"price_above": 3000, "price_below": 2800})

# Start monitoring
monitor.start_monitoring(check_interval_minutes=10)
```
JavaScript Version for Trading Dashboards
For web-based applications and trading dashboards, the same financial data collection works from JavaScript. The ideas mirror the Python examples above and drop naturally into a React-based trading interface.
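Here's a minimal sketch of the idea; the REST endpoint path and auth header name below are assumptions, so verify them against the current ScrapeGraphAI API docs (or use the scrapegraph-js SDK instead):

```javascript
// Minimal fetch-based sketch -- endpoint path and header name are assumptions,
// verify against the current ScrapeGraphAI API documentation.
async function getStockData(apiKey, symbol) {
  const response = await fetch('https://api.scrapegraphai.com/v1/smartscraper', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'SGAI-APIKEY': apiKey, // assumed auth header
    },
    body: JSON.stringify({
      website_url: `https://finance.yahoo.com/quote/${symbol}`,
      user_prompt: `Get the current stock price, daily change, and volume for ${symbol}`,
    }),
  });
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  const data = await response.json();
  return data.result; // same shape as the Python client's response['result']
}

// Usage inside a React effect or a dashboard refresh loop
getStockData('your-api-key', 'AAPL').then(console.log).catch(console.error);
```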
Compliance and Best Practices
When building production financial data systems, it's crucial to follow best practices for reliability and compliance. These techniques ensure your real-time monitoring and advanced data collection systems work reliably in production environments.
Rate Limiting for Financial Sites
```python
import time
from functools import wraps

def rate_limit(calls_per_minute=30):
    """Decorator to rate limit API calls"""
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            left_to_wait = min_interval - elapsed
            if left_to_wait > 0:
                time.sleep(left_to_wait)
            ret = func(*args, **kwargs)
            last_called[0] = time.time()
            return ret
        return wrapper
    return decorator

class RateLimitedFinanceCollector(FinancialDataCollector):
    @rate_limit(calls_per_minute=20)  # 20 calls per minute max
    def get_stock_data(self, symbol):
        return super().get_stock_data(symbol)
```
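Swapping in the rate-limited subclass leaves the rest of your code unchanged:

```python
# Drop-in replacement: the decorator sleeps as needed between calls
collector = RateLimitedFinanceCollector("your-api-key")
for symbol in ["AAPL", "MSFT", "GOOGL"]:
    data = collector.get_stock_data(symbol)  # paced automatically
    print(symbol, data.get('price'))
```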
Data Validation
```python
def validate_financial_data(data, symbol):
    """Validate scraped financial data"""
    errors = []

    # Check that the price is reasonable
    if 'price' in data:
        try:
            price = float(str(data['price']).replace('$', '').replace(',', ''))
            if price <= 0 or price > 100000:  # Sanity check
                errors.append(f"Unrealistic price for {symbol}: {price}")
        except ValueError:
            errors.append(f"Invalid price format for {symbol}: {data['price']}")

    # Check market cap
    if 'market_cap' in data:
        market_cap_str = str(data['market_cap'])
        if not any(suffix in market_cap_str.upper() for suffix in ['B', 'M', 'T']):
            errors.append(f"Unusual market cap format for {symbol}: {market_cap_str}")

    # Check P/E ratio
    if 'pe_ratio' in data:
        try:
            pe = float(data['pe_ratio'])
            if pe < 0 or pe > 1000:
                errors.append(f"Unusual P/E ratio for {symbol}: {pe}")
        except (ValueError, TypeError):
            pass  # P/E might be N/A for some stocks

    return errors

# Usage
data = collector.get_stock_data("AAPL")
validation_errors = validate_financial_data(data, "AAPL")
if validation_errors:
    print("Data validation warnings:")
    for error in validation_errors:
        print(f"  - {error}")
```
Error Recovery and Fallbacks
```python
def get_stock_data_with_fallbacks(self, symbol):
    """Get stock data with multiple fallback sources (add to FinancialDataCollector)"""
    sources = [
        f"https://finance.yahoo.com/quote/{symbol}",
        f"https://www.marketwatch.com/investing/stock/{symbol}",
        f"https://www.google.com/finance/quote/{symbol}:NASDAQ"
    ]

    for i, url in enumerate(sources):
        try:
            response = self.client.smartscraper(
                website_url=url,
                user_prompt=f"Get stock data for {symbol} including price, change, volume, and market cap"
            )
            data = response.get('result', {})

            # Validate the data before trusting it
            validation_errors = validate_financial_data(data, symbol)
            if not validation_errors:
                data['source'] = url
                return data
            else:
                print(f"Data validation failed for {url}: {validation_errors}")
        except Exception as e:
            print(f"Failed to get data from source {i+1}/{len(sources)}: {e}")
            if i < len(sources) - 1:
                time.sleep(2)  # Wait before trying the next source

    return None  # All sources failed
```
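Putting the fallback chain to work:

```python
# Returns the first validated result, annotated with its source URL
data = collector.get_stock_data_with_fallbacks("NVDA")
if data:
    print(f"NVDA price {data.get('price')} (from {data['source']})")
else:
    print("All sources failed or returned invalid data")
```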
Frequently Asked Questions
Is ScrapeGraphAI legal for financial data collection?
Yes, ScrapeGraphAI is designed to respect website terms of service and robots.txt files. However, you should always:
- Check the specific terms of service for each financial site
- Implement appropriate rate limiting (see our rate limiting section)
- Consider using official APIs when available for high-frequency trading applications
- Review compliance requirements for your specific use case
How accurate is the financial data compared to official sources?
ScrapeGraphAI extracts data directly from the same sources you'd see in your browser, so accuracy depends on the source website. For critical applications, we recommend:
- Using multi-source validation to cross-check data
- Implementing data validation checks
- Comparing against official APIs when available
- Setting up alerts for unusual data patterns
Can I use this for real-time trading?
While ScrapeGraphAI can provide real-time data, it's not designed for high-frequency trading (HFT) where millisecond delays matter. For trading applications:
- Use the real-time monitoring features for alerts
- Implement proper error handling and fallbacks
- Consider latency requirements for your specific trading strategy
- Always test thoroughly before using with real money
What's the difference between this and paid financial APIs?
ScrapeGraphAI advantages:
- No per-API costs or rate limits
- Works with any financial website
- No need to learn multiple API formats
- Automatic handling of site changes
Paid API advantages:
- Guaranteed uptime and support
- Structured data formats
- Historical data access
- Real-time streaming for HFT
For most applications, ScrapeGraphAI provides the best balance of flexibility and cost-effectiveness.
How do I handle rate limiting for financial sites?
See our detailed rate limiting section for implementation examples. Key strategies:
- Implement delays between requests
- Use multiple data sources to distribute load
- Cache data when appropriate (see the sketch below)
- Monitor for rate limit responses
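A minimal caching sketch, assuming slightly stale quotes are acceptable for your use case:

```python
import time

_cache = {}  # symbol -> (timestamp, data)

def get_cached_stock_data(collector, symbol, max_age_seconds=300):
    """Serve recent data from the cache; scrape only when it's stale."""
    now = time.time()
    cached = _cache.get(symbol)
    if cached and now - cached[0] < max_age_seconds:
        return cached[1]
    data = collector.get_stock_data(symbol)
    _cache[symbol] = (now, data)
    return data
```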
Can I scrape cryptocurrency data?
Yes! ScrapeGraphAI works well for crypto data. Check out the cryptocurrency example in our JavaScript section, or the quick sketch after the list below. Popular crypto sources include:
- CoinMarketCap
- CoinGecko
- Binance
- Coinbase
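For example, a minimal sketch against CoinGecko (the URL slug pattern and returned field names are assumptions; verify against the page you target):

```python
# Minimal crypto example -- the CoinGecko URL pattern is an assumption
def get_crypto_data(collector, coin_slug):
    response = collector.client.smartscraper(
        website_url=f"https://www.coingecko.com/en/coins/{coin_slug}",
        user_prompt=f"Get the current price, 24h change, market cap, and trading volume for {coin_slug}"
    )
    return response.get('result', {})

btc = get_crypto_data(collector, "bitcoin")
print(f"BTC: {btc.get('price')} ({btc.get('24h_change')})")  # field names depend on the extraction
```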
What about SEC filings and regulatory data?
Absolutely! ScrapeGraphAI excels at extracting structured data from complex documents. See our SEC filings example for how to extract financial statements, insider trading data, and regulatory filings.
How do I validate the data quality?
We provide comprehensive data validation tools. Key validation checks:
- Price reasonableness checks
- Market cap format validation
- P/E ratio sanity checks
- Cross-source data comparison
Can I build a complete trading dashboard?
Yes! Check out our JavaScript trading dashboard example. You can build:
- Real-time portfolio tracking
- Market overview dashboards
- Watchlist management
- Alert systems
What programming languages are supported?
ScrapeGraphAI supports multiple languages:
- Python: Full support with our Python client
- JavaScript/Node.js: Complete API support
- Other languages: REST API access available
How do I get started?
- Start with basic portfolio tracking
- Add market screening capabilities
- Implement real-time monitoring
- Scale up with advanced features
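In code, the first step is only a few lines:

```python
# Step 1 in code: a minimal start, then grow into the patterns above
from scrapegraph_py import Client

client = Client(api_key="your-api-key")
result = client.smartscraper(
    website_url="https://finance.yahoo.com/quote/AAPL",
    user_prompt="Get the current stock price and daily change for AAPL"
)
print(result.get('result'))
```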
The Bottom Line
Financial data collection used to mean maintaining dozens of fragile scrapers, each one breaking whenever a financial site updated their layout. You'd spend more time fixing scrapers than analyzing markets.
ScrapeGraphAI changes this completely. Instead of fighting with CSS selectors and site-specific authentication, you just describe what financial data you need. When Yahoo Finance redesigns their site, your code keeps working because it understands what stock prices and financial metrics look like.
The examples above cover everything from basic portfolio tracking to complex market monitoring systems. Start with simple portfolio tracking, add real-time monitoring as you need it, then scale up with multi-source validation for production-ready financial data systems.
For more information on getting started, check out our FAQ section or explore our other guides on web scraping with AI and building AI agents.