Definition
Link discovery is the process of identifying URLs within a web page that point to other pages. It is the mechanism by which web crawlers find new pages to visit — each crawled page yields links that expand the set of known URLs, which the crawler can then explore based on its configuration and strategy.
How Link Discovery Works
HTML Anchor Tags
The primary source of links is the `<a href="...">` element. Crawlers parse HTML to extract all `href` attribute values, which may be absolute URLs, relative paths, or protocol-relative references.
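As an illustrative sketch (not any particular crawler's implementation), Python's stdlib `html.parser` is enough to pull `href` values out of anchor tags:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects every href value found on <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="/about">About</a> <a href="https://example.com/">Home</a>')
print(parser.links)  # → ['/about', 'https://example.com/']
```

The extracted values are still raw: relative paths and protocol-relative references must be resolved before they can be fetched, as described below.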
Other Link Sources
- Form actions — `<form action="...">` points to pages accessible via form submission
- JavaScript-generated links — URLs constructed dynamically by JavaScript, visible only after rendering
- CSS references — `url()` values in stylesheets (typically for assets, not pages)
- Meta redirects — `<meta http-equiv="refresh" content="0;url=...">` directs to another page
- Sitemap references — links in XML sitemaps and sitemap index files
- robots.txt sitemap directives — sitemap URLs declared in the robots.txt file
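The last source is simple enough to sketch: `Sitemap:` directives in robots.txt are plain `key: value` lines, matched case-insensitively. The helper name below is hypothetical:

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Extract URLs from Sitemap: directives in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the FIRST colon only; the URL itself contains colons.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

robots = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml"""
print(sitemap_urls(robots))  # → ['https://example.com/sitemap.xml']
```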
URL Resolution
Discovered links must be resolved to absolute URLs. A relative link like /products/widget found on https://example.com/catalog/ must be resolved to https://example.com/products/widget. Base tags, protocol-relative URLs, and various path formats all require correct handling.
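In Python, `urllib.parse.urljoin` implements this resolution, covering the cases above (note it does not read `<base>` tags itself; a crawler must pass the base tag's `href` as the base when one is present):

```python
from urllib.parse import urljoin

base = "https://example.com/catalog/"

print(urljoin(base, "/products/widget"))       # root-relative path
# → https://example.com/products/widget
print(urljoin(base, "specials"))               # relative to the current directory
# → https://example.com/catalog/specials
print(urljoin(base, "//cdn.example.com/img"))  # protocol-relative: inherits https
# → https://cdn.example.com/img
```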
Filtering Discovered Links
Not every discovered URL should be crawled. Effective crawlers filter links based on:
- Domain scope — stay within the target domain or allowed domains
- URL patterns — include or exclude paths matching specific patterns
- File types — skip links to images, PDFs, or other non-HTML resources (unless specifically targeted)
- Already visited — deduplicate against the set of known URLs
- Depth limits — stop discovering links beyond the configured crawl depth
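The filters above can be combined into a single predicate. This is a minimal sketch; the function name, parameters, and default extension list are illustrative, not a specific library's API:

```python
from urllib.parse import urlparse

def should_crawl(url, *, allowed_domains, seen, depth, max_depth,
                 skip_extensions=(".jpg", ".png", ".pdf", ".css", ".js")):
    """Return True only if the URL passes every filter."""
    parsed = urlparse(url)
    if parsed.hostname not in allowed_domains:         # domain scope
        return False
    if parsed.path.lower().endswith(skip_extensions):  # non-HTML resources
        return False
    if url in seen:                                    # already visited
        return False
    if depth > max_depth:                              # depth limit
        return False
    return True

seen = {"https://example.com/"}
print(should_crawl("https://example.com/products",
                   allowed_domains={"example.com"},
                   seen=seen, depth=1, max_depth=3))  # → True
```

URL-pattern include/exclude rules would slot in as one more check, e.g. a regex match against `parsed.path`.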
Link Discovery in ScrapeGraphAI
ScrapeGraphAI's crawler performs intelligent link discovery that identifies navigation patterns, pagination links, and content URLs while filtering out irrelevant resources. Combined with configurable URL filters, this ensures crawls stay focused on the content that matters to your extraction goals.