Definition
robots.txt is a plain text file placed at the root of a website (e.g., example.com/robots.txt) that tells web crawlers and bots which parts of the site they may or may not access. It follows the Robots Exclusion Protocol, a convention in use since 1994 and formally standardized as RFC 9309 in 2022.
How robots.txt Works
The file contains directives organized by user agent (the identifier a bot presents). Each block specifies allowed and disallowed URL paths.
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
In this example, every bot except Googlebot is blocked from /admin/ and /private/, while Googlebot is granted full access: a crawler follows only the most specific User-agent group that matches it, so Googlebot ignores the `*` rules entirely. The Sitemap directive points crawlers to the site's XML sitemap for efficient URL discovery.
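These matching rules can be verified with Python's standard-library `urllib.robotparser`. The snippet below is a minimal sketch: the robots.txt content mirrors the example above, and the bot name `MyBot` and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content mirroring the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic bot falls under the `*` group: blocked from /admin/,
# allowed under /public/.
print(parser.can_fetch("MyBot", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("MyBot", "https://example.com/public/page"))     # True

# Googlebot matches its own group, which allows everything.
print(parser.can_fetch("Googlebot", "https://example.com/admin/settings"))  # True
```

Parsing from a string like this is handy for testing; in production you would point the parser at the live `/robots.txt` URL instead.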
Key Directives
- User-agent — specifies which bot the rules apply to (`*` means all bots)
- Disallow — blocks access to specified paths
- Allow — explicitly permits access (overrides broader Disallow rules)
- Sitemap — provides the location of the site's XML sitemap
- Crawl-delay — suggests a minimum delay between requests (not universally supported)
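Directives like Crawl-delay can also be read programmatically. A brief sketch, assuming a hypothetical robots.txt and bot name; note that `crawl_delay` returns `None` when the directive is absent:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that suggests 10 seconds between requests.
lines = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /search/",
]

parser = RobotFileParser()
parser.parse(lines)

# No group names MyBot specifically, so the `*` group applies.
delay = parser.crawl_delay("MyBot")
print(delay)  # 10
```

Since Crawl-delay is not universally supported, treat a `None` result as "no guidance given" and fall back to your own conservative request rate.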
robots.txt and Web Scraping
It is important to understand that robots.txt is advisory, not enforceable. It relies on crawlers voluntarily respecting its directives. Search engine crawlers like Googlebot consistently honor these rules because doing so is fundamental to their relationship with webmasters.
For web scraping, robots.txt represents an important ethical signal. Respecting these directives demonstrates good faith and helps maintain access to sites that might otherwise implement more aggressive anti-bot measures.
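That good-faith check can be folded into the fetch path itself. The helper below is a hypothetical sketch (the function name, bot name, and injectable `parser` parameter are all assumptions, not an established API): it consults robots.txt before each request and honors any suggested crawl delay.

```python
import time
import urllib.request
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def polite_fetch(url, user_agent="MyScraperBot", parser=None):
    """Fetch url only if robots.txt allows it (hypothetical helper).

    A pre-loaded RobotFileParser can be injected for testing; otherwise
    the target site's live robots.txt is downloaded and parsed.
    """
    if parser is None:
        parts = urlparse(url)
        parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        parser.read()  # download and parse the live robots.txt
    if not parser.can_fetch(user_agent, url):
        raise PermissionError(f"robots.txt disallows {url} for {user_agent}")
    delay = parser.crawl_delay(user_agent)
    if delay:
        time.sleep(delay)  # honor the suggested pacing between requests
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Raising instead of silently skipping makes disallowed URLs visible in your logs, so crawl plans can be corrected rather than quietly narrowed.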
ScrapeGraphAI and robots.txt
ScrapeGraphAI users should review a site's robots.txt directives when planning data collection. Understanding these guidelines helps you make informed decisions about which pages to target and how to structure your crawling patterns responsibly.