Definition
Link discovery is the process of identifying URLs within a web page that point to other pages. It is the mechanism by which web crawlers find new pages to visit — each crawled page yields links that expand the set of known URLs, which the crawler can then explore based on its configuration and strategy.
How Link Discovery Works
HTML Anchor Tags
The primary source of links is the `<a href="...">` element. Crawlers parse HTML to extract all `href` attribute values, which may be absolute URLs, relative paths, or protocol-relative references.
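As an illustrative sketch (not any particular crawler's implementation), Python's stdlib `html.parser` is enough to pull `href` values out of anchor tags:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects every href value found on <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="/about">About</a> <a href="https://example.com/">Home</a>')
print(parser.links)  # → ['/about', 'https://example.com/']
```

The extracted values are still raw: relative paths and protocol-relative references must be resolved before they can be fetched, as described below.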
Other Link Sources
- Form actions — `<form action="...">` points to pages accessible via form submission
- JavaScript-generated links — URLs constructed dynamically by JavaScript, visible only after rendering
- CSS references — `url()` values in stylesheets (typically for assets, not pages)
- Meta redirects — `<meta http-equiv="refresh" content="0;url=...">` directs to another page
- Sitemap references — links in XML sitemaps and sitemap index files
- robots.txt sitemap directives — sitemap URLs declared in the robots.txt file
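The last source is simple enough to sketch: `Sitemap:` directives in robots.txt are plain `key: value` lines, matched case-insensitively. The helper name below is hypothetical:

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Extract URLs from Sitemap: directives in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the FIRST colon only; the URL itself contains colons.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

robots = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml"""
print(sitemap_urls(robots))  # → ['https://example.com/sitemap.xml']
```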
URL Resolution
Discovered links must be resolved to absolute URLs. A relative link like /products/widget found on https://example.com/catalog/ must be resolved to https://example.com/products/widget. Base tags, protocol-relative URLs, and various path formats all require correct handling.
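In Python, `urllib.parse.urljoin` implements this resolution, covering the cases above (note it does not read `<base>` tags itself; a crawler must pass the base tag's `href` as the base when one is present):

```python
from urllib.parse import urljoin

base = "https://example.com/catalog/"

print(urljoin(base, "/products/widget"))       # root-relative path
# → https://example.com/products/widget
print(urljoin(base, "specials"))               # relative to the current directory
# → https://example.com/catalog/specials
print(urljoin(base, "//cdn.example.com/img"))  # protocol-relative: inherits https
# → https://cdn.example.com/img
```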
Filtering Discovered Links
Not every discovered URL should be crawled. Effective crawlers filter links based on:
- Domain scope — stay within the target domain or allowed domains
- URL patterns — include or exclude paths matching specific patterns
- File types — skip links to images, PDFs, or other non-HTML resources (unless specifically targeted)
- Already visited — deduplicate against the set of known URLs
- Depth limits — stop discovering links beyond the configured crawl depth
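The filters above can be combined into a single predicate. This is a minimal sketch; the function name, parameters, and default extension list are illustrative, not a specific library's API:

```python
from urllib.parse import urlparse

def should_crawl(url, *, allowed_domains, seen, depth, max_depth,
                 skip_extensions=(".jpg", ".png", ".pdf", ".css", ".js")):
    """Return True only if the URL passes every filter."""
    parsed = urlparse(url)
    if parsed.hostname not in allowed_domains:         # domain scope
        return False
    if parsed.path.lower().endswith(skip_extensions):  # non-HTML resources
        return False
    if url in seen:                                    # already visited
        return False
    if depth > max_depth:                              # depth limit
        return False
    return True

seen = {"https://example.com/"}
print(should_crawl("https://example.com/products",
                   allowed_domains={"example.com"},
                   seen=seen, depth=1, max_depth=3))  # → True
```

URL-pattern include/exclude rules would slot in as one more check, e.g. a regex match against `parsed.path`.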
Link Discovery in ScrapeGraphAI
ScrapeGraphAI's crawler performs intelligent link discovery that identifies navigation patterns, pagination links, and content URLs while filtering out irrelevant resources. Combined with configurable URL filters, this ensures crawls stay focused on the content that matters to your extraction goals.