What are Regex Patterns?

Last updated: Apr 5, 2025

Definition

Regular expressions (regex) are sequences of characters that define search patterns for matching text. In web scraping and data extraction, regex is used to find, validate, and extract specific pieces of information — like email addresses, phone numbers, prices, or dates — from raw text or HTML content.

Common Regex Patterns for Scraping

Email Addresses

[\w.-]+@[\w.-]+\.\w{2,}

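Matches addresses like user@example.com: one or more word characters, dots, or hyphens on each side of the @ symbol, ending in a dot and a top-level domain of at least two characters. It is a loose filter rather than a full validator; for example, it does not accept a + in the part before the @.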
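As a minimal sketch of how this pattern might be applied in Python, the standard-library re module can collect every match in a block of text. The sample text and the names EMAIL_RE and emails are illustrative assumptions, not part of any library.

```python
import re

# The email pattern from above, compiled once for reuse.
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w{2,}")

# Hypothetical scraped text.
text = "Contact sales@example.com or support@mail.example.org for help."

# findall returns every non-overlapping match, in order of appearance.
emails = EMAIL_RE.findall(text)
print(emails)  # ['sales@example.com', 'support@mail.example.org']
```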

Prices

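\$\d+(?:,\d{3})*(?:\.\d+)?

Captures dollar amounts like $79.99 or $1,299. Requiring three digits after each comma and at least one digit after the decimal point keeps trailing punctuation, as in "$1,299." at the end of a sentence, out of the match.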
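A short Python sketch of price extraction, for illustration. It uses a price pattern that requires three digits after each comma and a digit after any decimal point, so punctuation that follows an amount is left out of the match. The sample text and variable names are made up.

```python
import re

# Dollar amounts with optional thousands separators and decimals.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")

# Hypothetical scraped text with punctuation right after the amounts.
text = "Was $1,299. Now $79.99, or $5 with a coupon."

prices = PRICE_RE.findall(text)
print(prices)  # ['$1,299', '$79.99', '$5']

# Strip the symbol and separators to get numeric values.
amounts = [float(p.lstrip("$").replace(",", "")) for p in prices]
print(amounts)  # [1299.0, 79.99, 5.0]
```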

Phone Numbers

\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}

Matches various US phone formats: (555) 123-4567, 555.123.4567, 555-123-4567.
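The three formats listed can be checked with a small Python sketch. The normalization step (keeping digits only) is an added assumption about what a scraper might want downstream, and the names are illustrative.

```python
import re

# The US phone pattern from above.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

text = "Call (555) 123-4567, 555.123.4567, or 555-123-4567."

phones = PHONE_RE.findall(text)
print(phones)  # ['(555) 123-4567', '555.123.4567', '555-123-4567']

# Keep only the digits so differently formatted numbers compare equal.
normalized = ["".join(re.findall(r"\d", p)) for p in phones]
print(normalized)  # ['5551234567', '5551234567', '5551234567']
```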

URLs

https?://[\w.-]+(?:/[\w./-]*)?

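Captures HTTP and HTTPS URLs with their host and path. Query strings, fragments, and port numbers fall outside the character classes, so the match stops before them.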
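A quick Python sketch shows what this pattern does and does not capture. The URLs are placeholders; note that the second match stops at the ? because that character is not in the path character class.

```python
import re

# The URL pattern from above.
URL_RE = re.compile(r"https?://[\w.-]+(?:/[\w./-]*)?")

text = "Docs at https://example.com/docs/intro and http://example.org/search?q=regex"

urls = URL_RE.findall(text)
print(urls)  # ['https://example.com/docs/intro', 'http://example.org/search']
```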

Regex Strengths

Regex excels at extracting data that follows a predictable textual pattern. When you need to pull all phone numbers from a page of text, or validate that extracted emails are well-formed, regex is precise and efficient. It works on raw text without requiring HTML parsing or DOM structure.
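Validation, as opposed to extraction, usually means requiring the whole string to match. A minimal sketch using re.fullmatch, where the helper name is_well_formed_email and the candidate list are hypothetical:

```python
import re

EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w{2,}")

def is_well_formed_email(value: str) -> bool:
    """Return True if the whole (trimmed) string fits the email pattern."""
    return EMAIL_RE.fullmatch(value.strip()) is not None

# Hypothetical values produced by an earlier extraction step.
candidates = ["user@example.com", "not-an-email", "user@example", " admin@example.org "]

valid = [c.strip() for c in candidates if is_well_formed_email(c)]
print(valid)  # ['user@example.com', 'admin@example.org']
```

Using fullmatch rather than search avoids accepting a string that merely contains something email-like.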

Regex Limitations

Brittleness

Regex patterns are literal and inflexible. A price pattern expecting $ fails on pages using USD or localized formats. Minor variations in formatting break matches.
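The brittleness is easy to reproduce. The sketch below runs one dollar-only price pattern over a few made-up price strings; any pattern anchored on a literal $ behaves the same way on these inputs.

```python
import re

# A price pattern that only recognizes a leading dollar sign.
PRICE_RE = re.compile(r"\$[\d,]+\.?\d*")

# The same price written in different (hypothetical) page formats.
samples = ["$79.99", "USD 79.99", "79,99 €", "£79.99"]

results = {s: PRICE_RE.findall(s) for s in samples}
for sample, found in results.items():
    print(f"{sample!r}: {found}")
# Only '$79.99' produces a match; the other three return empty lists.
```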

HTML Is Not Regular

HTML is not a regular language: elements nest to arbitrary depth, and pairing an opening tag with its correct closing tag requires tracking that depth, which regular expressions cannot do in general. Attempting to parse HTML structure with regex (matching tags, extracting attributes across lines) leads to fragile, unmaintainable patterns. Use a proper parser for structural queries.
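To illustrate, the sketch below compares a naive regex with Python's standard-library html.parser on a nested snippet. The HTML string and the DivTextCollector class are made up for the example; a real scraper might prefer a fuller parsing library.

```python
import re
from html.parser import HTMLParser

html = "<div>outer <div>inner</div> tail</div>"

# A non-greedy regex pairs the first <div> with the first </div>,
# which belongs to the inner element.
naive = re.search(r"<div>(.*?)</div>", html).group(1)
print(naive)  # outer <div>inner

class DivTextCollector(HTMLParser):
    """Collects text that appears inside any <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

parser = DivTextCollector()
parser.feed(html)
parser.close()
parsed_text = "".join(parser.chunks)
print(parsed_text)  # outer inner tail
```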

Maintenance

Complex regex patterns are notoriously difficult to read, debug, and modify. What seems clear when written becomes cryptic months later.
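One common mitigation in Python, offered here as a suggestion rather than a cure, is the re.VERBOSE flag, which lets a pattern be split across lines with inline comments. The sketch rewrites the US phone pattern this way; the comments describing each part are our own labels.

```python
import re

# Same phone pattern as the compact one-liner, annotated piece by piece.
# In VERBOSE mode, whitespace and # comments outside character classes
# are ignored by the regex engine.
PHONE_VERBOSE = re.compile(
    r"""
    \(?\d{3}\)?   # area code, with optional parentheses
    [-.\s]?       # optional separator: hyphen, dot, or whitespace
    \d{3}         # next three digits
    [-.\s]?       # optional separator
    \d{4}         # last four digits
    """,
    re.VERBOSE,
)

PHONE_COMPACT = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

text = "Call (555) 123-4567 or 555.123.4567."
same = PHONE_VERBOSE.findall(text) == PHONE_COMPACT.findall(text)
print(same)  # True
```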

Regex and ScrapeGraphAI

While regex remains useful for post-processing validation, ScrapeGraphAI's AI extraction largely replaces the need for regex-based data extraction. The AI understands content semantically, extracting phone numbers, prices, and emails by meaning rather than character patterns — handling format variations naturally.