Definition
robots.txt is a plain text file placed at the root of a website (e.g., example.com/robots.txt) that tells web crawlers and bots which parts of the site they may or may not access. It follows the Robots Exclusion Protocol, a convention in use since 1994 and formally standardized as RFC 9309 in 2022.
How robots.txt Works
The file contains directives organized by user agent (the identifier a bot presents). Each block specifies allowed and disallowed URL paths.
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
In this example, every bot except Googlebot is blocked from /admin/ and /private/, while Googlebot is granted full access: a crawler follows only the most specific User-agent group that matches it, so Googlebot ignores the `*` rules entirely. The Sitemap directive points crawlers to the site's XML sitemap for efficient URL discovery.
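These matching rules can be verified with Python's standard-library `urllib.robotparser`. The snippet below is a minimal sketch: the robots.txt content mirrors the example above, and the bot name `MyBot` and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content mirroring the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic bot falls under the `*` group: blocked from /admin/,
# allowed under /public/.
print(parser.can_fetch("MyBot", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("MyBot", "https://example.com/public/page"))     # True

# Googlebot matches its own group, which allows everything.
print(parser.can_fetch("Googlebot", "https://example.com/admin/settings"))  # True
```

Parsing from a string like this is handy for testing; in production you would point the parser at the live `/robots.txt` URL instead.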
Key Directives
- User-agent — specifies which bot the rules apply to (`*` means all bots)
- Disallow — blocks access to specified paths
- Allow — explicitly permits access (overrides broader Disallow rules)
- Sitemap — provides the location of the site's XML sitemap
- Crawl-delay — suggests a minimum delay between requests (not universally supported)
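Directives like Crawl-delay can also be read programmatically. A brief sketch, assuming a hypothetical robots.txt and bot name; note that `crawl_delay` returns `None` when the directive is absent:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that suggests 10 seconds between requests.
lines = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /search/",
]

parser = RobotFileParser()
parser.parse(lines)

# No group names MyBot specifically, so the `*` group applies.
delay = parser.crawl_delay("MyBot")
print(delay)  # 10
```

Since Crawl-delay is not universally supported, treat a `None` result as "no guidance given" and fall back to your own conservative request rate.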
robots.txt and Web Scraping
It is important to understand that robots.txt is advisory, not enforceable. It relies on crawlers voluntarily respecting its directives. Search engine crawlers like Googlebot consistently honor these rules because doing so is fundamental to their relationship with webmasters.
For web scraping, robots.txt represents an important ethical signal. Respecting these directives demonstrates good faith and helps maintain access to sites that might otherwise implement more aggressive anti-bot measures.
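That good-faith check can be folded into the fetch path itself. The helper below is a hypothetical sketch (the function name, bot name, and injectable `parser` parameter are all assumptions, not an established API): it consults robots.txt before each request and honors any suggested crawl delay.

```python
import time
import urllib.request
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def polite_fetch(url, user_agent="MyScraperBot", parser=None):
    """Fetch url only if robots.txt allows it (hypothetical helper).

    A pre-loaded RobotFileParser can be injected for testing; otherwise
    the target site's live robots.txt is downloaded and parsed.
    """
    if parser is None:
        parts = urlparse(url)
        parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        parser.read()  # download and parse the live robots.txt
    if not parser.can_fetch(user_agent, url):
        raise PermissionError(f"robots.txt disallows {url} for {user_agent}")
    delay = parser.crawl_delay(user_agent)
    if delay:
        time.sleep(delay)  # honor the suggested pacing between requests
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Raising instead of silently skipping makes disallowed URLs visible in your logs, so crawl plans can be corrected rather than quietly narrowed.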
ScrapeGraphAI and robots.txt
ScrapeGraphAI users should review a site's robots.txt directives when planning data collection. Understanding these guidelines helps you make informed decisions about which pages to target and how to structure your crawling patterns responsibly.