ScrapeGraphAI

What is Link Discovery?

Last updated: Apr 5, 2025

Definition

Link discovery is the process of identifying URLs within a web page that point to other pages. It is the mechanism by which web crawlers find new pages to visit — each crawled page yields links that expand the set of known URLs, which the crawler can then explore based on its configuration and strategy.

HTML Anchor Tags

The primary source of links is the <a href="..."> element. Crawlers parse HTML to extract all href attribute values, which may be absolute URLs, relative paths, or protocol-relative references.
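As a rough illustration of href extraction, the sketch below uses Python's standard html.parser module. The class name HrefCollector, the helper extract_hrefs, and the sample markup are hypothetical names chosen for this example. They are not part of any crawler's API.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects raw href values from <a> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return every non-empty href found on an <a> element."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical sample page with absolute, relative, and protocol-relative links.
sample = """
<a href="https://example.com/about">About</a>
<a href="/products/widget">Widget</a>
<a href="//cdn.example.com/page">CDN</a>
<a name="no-href">Anchor without href</a>
"""

print(extract_hrefs(sample))
```

The values come back exactly as written in the markup, so they still need to be resolved against the page URL before they can be fetched.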

Beyond anchor tags, crawlers can also discover URLs from these sources:

  • Form actions: <form action="..."> points to pages reachable via form submission
  • JavaScript-generated links: URLs constructed dynamically by JavaScript, visible only after rendering
  • CSS references: url() values in stylesheets (typically for assets, not pages)
  • Meta redirects: <meta http-equiv="refresh" content="0;url=..."> directs to another page
  • Sitemap references: links in XML sitemaps and sitemap index files
  • robots.txt sitemap directives: sitemap URLs declared in the robots.txt file
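The last two sources in the list can be sketched with the standard library alone. The example below is a minimal sketch, assuming the robots.txt and sitemap contents have already been downloaded as strings. The function names and the sample data are hypothetical, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset> and <sitemapindex>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return sitemap URLs declared with 'Sitemap:' lines in robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def locs_from_sitemap(xml_text):
    """Return (kind, urls), where kind is 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text.strip())
    kind = root.tag.replace(SITEMAP_NS, "")
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return kind, urls


# Hypothetical inputs for illustration.
robots = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n"
sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/widget</loc></url>
</urlset>
"""

print(sitemaps_from_robots(robots))
print(locs_from_sitemap(sitemap))
```

When the returned kind is 'sitemapindex', each URL points to another sitemap that would need to be fetched and parsed in turn.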

URL Resolution

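Discovered links must be resolved to absolute URLs. A relative link like /products/widget found on https://example.com/catalog/ must be resolved to https://example.com/products/widget. Several cases need care: a <base href="..."> tag changes the base URL for every relative link on the page, protocol-relative URLs (//host/path) inherit the scheme of the current page, and paths containing . or .. segments must be normalized.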
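One way to sketch this resolution step is with urljoin from Python's urllib.parse. The resolve helper and its base_href parameter are hypothetical names for this example. Dropping the #fragment is an extra choice made here to help deduplication, not something the text above requires.

```python
from urllib.parse import urldefrag, urljoin


def resolve(page_url, href, base_href=None):
    """Resolve a discovered href to an absolute URL.

    base_href is the value of the page's <base href="..."> tag, if any.
    """
    # A <base> tag may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, base_href) if base_href else page_url
    absolute = urljoin(base, href)
    # Assumption: fragments are dropped because they address the same document.
    return urldefrag(absolute).url


page = "https://example.com/catalog/"
print(resolve(page, "/products/widget"))        # root-relative path
print(resolve(page, "item?id=3"))               # path relative to the page
print(resolve(page, "../about"))                # parent-directory path
print(resolve(page, "//cdn.example.com/page"))  # protocol-relative URL
print(resolve(page, "widget", base_href="/shop/"))
```

The first call reproduces the example above: /products/widget on https://example.com/catalog/ resolves to https://example.com/products/widget.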

Link Filtering

Not every discovered URL should be crawled. Effective crawlers filter links based on:

  • Domain scope — stay within the target domain or allowed domains
  • URL patterns — include or exclude paths matching specific patterns
  • File types — skip links to images, PDFs, or other non-HTML resources (unless specifically targeted)
  • Already visited — deduplicate against the set of known URLs
  • Depth limits — stop discovering links beyond the configured crawl depth
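The criteria above can be combined into a single check. The following is a minimal sketch, not ScrapeGraphAI's actual implementation: the LinkFilter class, its field names, the extension list, and the default depth are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Assumed list of non-HTML extensions to skip; adjust to the crawl's goals.
SKIPPED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".pdf", ".zip")


@dataclass
class LinkFilter:
    allowed_domains: set
    include: list = field(default_factory=list)  # regexes; empty means allow all
    exclude: list = field(default_factory=list)  # regexes matched against the path
    max_depth: int = 2
    seen: set = field(default_factory=set)

    def accept(self, url, depth):
        """Return True if the absolute URL should be queued for crawling."""
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            return False  # mailto:, javascript:, tel:, and similar
        if parts.hostname not in self.allowed_domains:
            return False  # domain scope
        if depth > self.max_depth:
            return False  # depth limit
        if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
            return False  # file types
        if self.include and not any(re.search(p, parts.path) for p in self.include):
            return False  # URL patterns: nothing in the include list matched
        if any(re.search(p, parts.path) for p in self.exclude):
            return False  # URL patterns: an exclude pattern matched
        if url in self.seen:
            return False  # already visited
        self.seen.add(url)
        return True


link_filter = LinkFilter(allowed_domains={"example.com"}, exclude=[r"^/admin/"])
for candidate in [
    "https://example.com/products/widget",
    "https://example.com/products/widget",  # duplicate
    "https://other.com/page",
    "https://example.com/logo.png",
    "https://example.com/admin/users",
]:
    print(candidate, link_filter.accept(candidate, depth=1))
```

The deduplication check runs last so that a rejected URL is never recorded as seen. Exact string comparison only works as a dedupe key if URLs were normalized consistently during resolution.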

ScrapeGraphAI's crawler performs intelligent link discovery that identifies navigation patterns, pagination links, and content URLs while filtering out irrelevant resources. Combined with configurable URL filters, this ensures crawls stay focused on the content that matters to your extraction goals.