Definition
Crawl depth is the number of link hops away from the starting URL that a web crawler will follow before stopping. A depth of 0 means only the starting page is crawled. A depth of 1 includes the starting page plus all pages directly linked from it. A depth of 2 adds pages linked from those pages, and so on.
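In practice this is usually implemented as a breadth-first traversal that records the depth at which each URL was discovered and stops following links past the limit. The sketch below uses only Python's standard library; the function and class names are illustrative, not part of any particular tool.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_depth):
    """Breadth-first crawl that stops following links beyond max_depth.

    Depth 0 fetches only start_url; depth 1 adds the pages it links to, etc.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])  # each entry is (url, depth)
    pages = []

    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        pages.append((url, depth))

        if depth >= max_depth:
            continue  # do not follow links past the depth limit

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return pages
```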
Why Crawl Depth Matters
Resource Management
The number of pages to crawl can grow exponentially with depth. A site averaging 50 links per page yields up to 50 new pages at depth 1, 2,500 at depth 2, and 125,000 at depth 3. Without depth limits, crawlers can spiral into millions of pages, consuming time, bandwidth, and storage.
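The worst case is simply the branching factor raised to the depth, as this small calculation (assuming no shared or duplicate links) illustrates:

```python
branching = 50  # average links per page

# Upper bound on new pages discoverable at each depth
for depth in range(1, 4):
    print(f"depth {depth}: up to {branching ** depth:,} new pages")
# depth 1: up to 50 new pages
# depth 2: up to 2,500 new pages
# depth 3: up to 125,000 new pages
```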
Relevance
Content relevance often decreases with depth. The starting page is presumably the most relevant target. Pages one click away are closely related. By depth 3 or 4, you may be crawling tangentially related or entirely irrelevant content.
Completeness vs Efficiency
There is an inherent trade-off between thoroughness and efficiency. A shallow crawl is fast but may miss important content buried deeper in the site structure. A deep crawl is comprehensive but expensive and slow.
Choosing the Right Depth
The optimal crawl depth depends on the site structure and your goals:
- Depth 0 — single-page scraping, when you only need one specific page
- Depth 1 — collecting a page and its immediate links, common for category pages linking to product pages
- Depth 2-3 — exploring a section of a site, suitable for most content collection tasks
- Unlimited depth — full site archival or comprehensive crawling (use with careful filtering)
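Using the depth-limited `crawl` sketch from the Definition section (the URLs below are placeholders), these presets translate into straightforward calls:

```python
# Single page only (depth 0)
page = crawl("https://example.com/report", max_depth=0)

# A category page plus the product pages it links to (depth 1)
catalog = crawl("https://example.com/category/widgets", max_depth=1)

# A whole site section (depth 2-3); expect page counts to grow quickly
section = crawl("https://example.com/docs/", max_depth=3)
```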
Combining Depth with Filters
Depth alone is a blunt instrument. Effective crawlers combine depth limits with URL pattern filters, content type restrictions, and domain boundaries to target specific content regardless of its position in the link graph.
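One common pattern is a single predicate that combines the depth limit with domain boundaries and URL patterns, applied before a link is enqueued. The domains and patterns below are placeholders; in the earlier `crawl` sketch, such a predicate would replace the bare `link not in seen` check.

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "docs.example.com"}  # domain boundary
INCLUDE_PATTERN = re.compile(r"/blog/|/docs/")         # URL patterns to keep
EXCLUDE_PATTERN = re.compile(r"\.(pdf|zip|jpg|png)$")  # skip non-HTML assets


def should_follow(url, depth, max_depth=3):
    """Decide whether a discovered link is worth enqueueing."""
    if depth > max_depth:
        return False
    if urlparse(url).netloc not in ALLOWED_DOMAINS:
        return False
    if EXCLUDE_PATTERN.search(url):
        return False
    return bool(INCLUDE_PATTERN.search(url))
```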
Crawl Depth in ScrapeGraphAI
ScrapeGraphAI allows you to configure crawl depth when initiating crawl operations, giving you precise control over how extensively the crawler explores a site. Combined with URL filtering, this ensures efficient, targeted data collection.
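The snippet below is a sketch based on the scrapegraph-py SDK; the exact client and parameter names (`Client`, `crawl`, `depth`, `max_pages`) are assumptions here and should be verified against the current ScrapeGraphAI documentation.

```python
# Illustrative only: parameter names such as `depth` and `max_pages` are
# assumptions; confirm them against the current ScrapeGraphAI docs.
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

result = client.crawl(
    url="https://example.com/docs/",
    prompt="Extract the title and summary of each documentation page",
    depth=2,       # follow links up to two hops from the start URL
    max_pages=50,  # hard cap on the number of pages fetched
)
```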