Definition
Breadth-first crawling applies breadth-first search (BFS) to the web: the crawler visits all pages at the current link depth before proceeding to the next level. Starting from a seed URL, it first visits every page linked from that seed, then every page linked from those pages, and so on, exploring the site layer by layer.
How Breadth-First Crawling Works
The algorithm maintains a queue (first-in, first-out) of URLs to visit:
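1. Add the seed URL to the queue and mark it as seen
2. Dequeue the next URL and fetch the page
3. Extract all links from the page
4. Add each link that has not been seen before to the back of the queue and mark it as seen
5. Repeat from step 2 until the queue is empty or a crawl limit is reached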
This queue structure ensures that all depth-1 pages are visited before any depth-2 pages, all depth-2 before depth-3, and so on.
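The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: to stay runnable without network access it walks a hypothetical in-memory link graph (`SITE`), and the names `bfs_crawl`, `get_links`, and `max_pages` are assumptions introduced here. In a real crawler, `get_links` would fetch the page and parse its links.

```python
from collections import deque

def bfs_crawl(seed, get_links, max_pages=None):
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    queue = deque([(seed, 0)])   # FIFO queue of (url, depth)
    seen = {seed}                # URLs already queued or visited
    order = []
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break                # optional page budget reached
        url, depth = queue.popleft()
        order.append((url, depth))
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical site, represented as a mapping from page to outgoing links
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/"],
    "/b": ["/b1"],
    "/a1": [],
    "/b1": ["/a"],
}

order = bfs_crawl("/", lambda url: SITE.get(url, []))
print(order)
# [('/', 0), ('/a', 1), ('/b', 1), ('/a1', 2), ('/b1', 2)]
```

Marking a URL as seen when it is queued, rather than when it is fetched, keeps the same URL from entering the queue twice when several pages link to it.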
Advantages
Finds Important Pages First
Pages closer to the homepage or starting URL tend to be more important and content-rich. BFS naturally prioritizes these pages, so the crawl usually captures the most relevant content early.
Broader Coverage Early
If a crawl is interrupted or hits a budget limit, BFS provides the broadest possible coverage up to the current depth. You get a representative sample of the site rather than a deep dive into one branch.
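A small experiment can make the difference concrete. The sketch below runs the same budget-limited crawl twice over a hypothetical three-section site, once with a FIFO queue (breadth-first) and once with a LIFO stack (a simple depth-first variant). The site layout, the `crawl` function, and its `budget` parameter are all assumptions made for illustration.

```python
from collections import deque

def crawl(seed, links, budget, breadth_first=True):
    """Visit at most `budget` pages and return the visited URLs in order."""
    frontier = deque([seed])
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        # FIFO gives breadth-first order; LIFO gives depth-first order
        url = frontier.popleft() if breadth_first else frontier.pop()
        visited.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical site: three sections, each with a chain of nested pages
SITE = {
    "/": ["/news", "/docs", "/shop"],
    "/news": ["/news/1"], "/news/1": ["/news/2"], "/news/2": [],
    "/docs": ["/docs/1"], "/docs/1": ["/docs/2"], "/docs/2": [],
    "/shop": ["/shop/1"], "/shop/1": ["/shop/2"], "/shop/2": [],
}

bfs_pages = crawl("/", SITE, budget=4, breadth_first=True)
dfs_pages = crawl("/", SITE, budget=4, breadth_first=False)
print(bfs_pages)  # ['/', '/news', '/docs', '/shop']
print(dfs_pages)  # ['/', '/shop', '/shop/1', '/shop/2']
```

With a budget of four pages, the breadth-first run touches every top-level section, while the depth-first run spends the whole budget inside one branch.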
Effective for Shallow Sites
Sites with a flat structure (most content within 2-3 clicks of the homepage) are efficiently covered by BFS.
Disadvantages
Memory Usage
BFS requires storing the entire frontier: all discovered but unvisited URLs. Because each level can contain many times more pages than the one before it, this queue can grow to millions of URLs on large sites, consuming significant memory.
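A back-of-the-envelope estimate shows how quickly the frontier can grow. The sketch below assumes an idealized tree-shaped site where every page links to 20 pages not seen before, and roughly 100 bytes per stored URL; both figures are illustrative assumptions, and real sites have many repeated links that reduce the count.

```python
def frontier_upper_bound(branching_factor, depth):
    """Upper bound on the number of URLs queued at `depth`, assuming every
    page contributes `branching_factor` previously unseen links."""
    return branching_factor ** depth

BYTES_PER_URL = 100  # assumed average cost of storing one URL

for depth in range(1, 6):
    urls = frontier_upper_bound(20, depth)
    print(depth, urls, f"{urls * BYTES_PER_URL / 1e6:.3f} MB")
```

Under these assumptions the frontier holds 20 URLs at depth 1 but 3,200,000 at depth 5, about 320 MB for the URL strings alone.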
Slower for Deep Content
If your target content is buried deep in the site (e.g., archived articles accessible only through many levels of navigation), BFS must crawl through all shallower content first.
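The cost of reaching deep content can be estimated the same way. Assuming, purely for illustration, that every page links to a fixed number of previously unseen pages, the sketch below counts how many pages a breadth-first crawl must fetch before it reaches the first page at a given depth. The function name and the numbers are hypothetical.

```python
def pages_before_depth(branching_factor, depth):
    """Pages at depths 0 .. depth-1, all of which BFS fetches before
    it reaches the first page at `depth`."""
    return sum(branching_factor ** k for k in range(depth))

print(pages_before_depth(10, 5))  # 11111
```

With 10 new links per page, a target at depth 5 is reached only after 11,111 shallower pages have been fetched.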
Breadth-First Crawling in ScrapeGraphAI
ScrapeGraphAI's crawling engine uses breadth-first traversal as its default strategy, ensuring that the most accessible and typically most important pages are processed first. Combined with configurable depth limits and URL filters, this provides efficient, predictable coverage of target sites.
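To illustrate how a depth limit and a URL filter interact with breadth-first traversal in general, here is a generic sketch. It does not use or represent ScrapeGraphAI's API: `bfs_crawl`, `max_depth`, `allow`, the in-memory `SITE` graph, and the `example.com` URLs are all hypothetical names chosen for this example.

```python
from collections import deque
from urllib.parse import urlparse

def bfs_crawl(seed, get_links, max_depth, allow):
    """Breadth-first crawl that stops expanding at `max_depth` and only
    queues links accepted by the `allow` predicate."""
    queue = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # visit this page, but do not follow its links
        for link in get_links(url):
            if link not in seen and allow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

def docs_on_same_host(url):
    """Hypothetical filter: keep only /docs pages on example.com."""
    parts = urlparse(url)
    return parts.netloc == "example.com" and parts.path.startswith("/docs")

# Hypothetical link graph standing in for fetched pages
SITE = {
    "https://example.com/docs": [
        "https://example.com/docs/a",
        "https://example.com/blog",
        "https://other.example/docs",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a/deep"],
    "https://example.com/docs/a/deep": ["https://example.com/docs/a/deeper"],
}

pages = bfs_crawl(
    "https://example.com/docs",
    lambda url: SITE.get(url, []),
    max_depth=2,
    allow=docs_on_same_host,
)
print(pages)
```

The blog page and the off-site link are filtered out before they reach the queue, and the page at depth 2 is visited without its links being followed.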