What is robots.txt?

robots.txt is a plain text file placed at the root of a website (e.g. https://example.com/robots.txt) that tells web crawlers which pages or sections they can and cannot access. It follows the Robots Exclusion Protocol, originally proposed by Martijn Koster in 1994 and formally standardized as RFC 9309 in June 2022. When a crawler like Googlebot, Bingbot, or GPTBot visits a domain, it checks /robots.txt before requesting any other URL. The file uses simple directives (User-agent, Disallow, Allow, Sitemap, and Crawl-delay) to control which paths each bot can access. One important thing to understand: this is purely a courtesy mechanism. Well-behaved bots respect the file, but it provides no actual security or access control. Malicious crawlers will ignore it, and the file itself is publicly visible to anyone.

What does User-agent mean in robots.txt?

The User-agent directive identifies which crawler or bot the following rules apply to. Setting User-agent: * targets all bots with a single set of rules. You can also target specific bots by name. For example, User-agent: Googlebot applies rules only to Google's primary crawler, while User-agent: GPTBot targets OpenAI's web crawler. There is an important precedence rule here: as defined in RFC 9309 Section 2.2.1, when both a wildcard group and a bot-specific group exist, the crawler must follow only its own specific group and ignore the wildcard entirely. You can also stack multiple User-agent lines before a shared set of Disallow and Allow rules, which lets you apply identical restrictions to several bots at once without duplicating directives.
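
The group-selection rule can be sketched in a few lines of Python. This is a minimal illustration, not a full RFC 9309 parser: the helper names parse_groups and rules_for are hypothetical, and it assumes exact, case-insensitive matching of the bot name against each User-agent line.

```python
def parse_groups(text):
    """Split robots.txt text into (user_agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:                            # rules were seen, so a new group starts
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())         # stacked User-agent lines share one group
        elif key in ("allow", "disallow") and agents:
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(groups, bot_name):
    """Return the rules of the bot's own group(s), else those of the wildcard group(s)."""
    bot = bot_name.lower()
    matched = [rules for agents, rules in groups if bot in agents]
    if not matched:
        matched = [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in matched for rule in rules]


SAMPLE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

groups = parse_groups(SAMPLE)
print(rules_for(groups, "GPTBot"))    # [('disallow', '/')]
print(rules_for(groups, "Bingbot"))   # [('disallow', '/admin/')]
```

GPTBot gets only its own group and never sees the /admin/ rule from the wildcard group, while Bingbot, which has no group of its own, falls back to the wildcard rules.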

What does Disallow do?

Disallow tells the specified User-agent not to crawl the listed path and everything beneath it. For example, Disallow: /private/ blocks all URLs starting with /private/, including nested paths like /private/docs/report.pdf. An empty Disallow value (Disallow: with nothing after the colon) means no paths are blocked, so the bot can crawl the entire site. Disallow: / does the opposite and blocks everything. Path matching is case-sensitive and prefix-based, as specified in RFC 9309 Section 2.2.2. RFC 9309 Section 2.2.3 also defines two special characters, both supported by Google and Bing: the wildcard (*) within paths and the dollar sign ($) end-of-URL anchor. Together they allow patterns like Disallow: /*.pdf$ to block every URL ending in .pdf, regardless of which directory it sits in.
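
To make the matching rules concrete, here is a small Python sketch that translates a robots.txt path pattern into a regular expression. The function names are hypothetical, and the sketch assumes the path has already been normalized (no percent-encoding differences), which real crawlers handle separately.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern, path):
    """True if the pattern matches the path as a case-sensitive prefix."""
    if pattern == "":
        return False                      # an empty Disallow matches nothing
    # re.match anchors at the start of the path, which gives prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/private/", "/private/docs/report.pdf"))   # True
print(path_matches("/private/", "/Private/docs/report.pdf"))   # False (case-sensitive)
print(path_matches("/*.pdf$", "/files/manual.pdf"))            # True
print(path_matches("/*.pdf$", "/files/manual.pdf?page=2"))     # False ($ anchors the end)
```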

What does Allow do?

Allow overrides a Disallow rule for a more specific path. This lets you permit access to certain files or directories within an otherwise blocked area. For example, Disallow: /private/ combined with Allow: /private/public.html lets crawlers access that one file while blocking the rest of the directory. When Allow and Disallow patterns match the same URL, the most specific (longest matching) rule wins, as defined in RFC 9309 Section 2.2.2. Allow was not part of the original 1994 Robots Exclusion Protocol proposal, but it was formally included when RFC 9309 was published. All major crawlers support it, including Googlebot, Bingbot, Yandex, and AI crawlers like GPTBot and ClaudeBot.
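
The longest-match rule is easy to get wrong by reading the file top to bottom, so a short Python sketch may help. It is a simplified illustration: is_allowed is a hypothetical helper, it handles plain prefixes only (no * or $), and it resolves an exact tie in favor of Allow, as RFC 9309 recommends.

```python
def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", path_prefix) tuples for one bot."""
    best_len, allowed = -1, True              # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len, allowed = len(prefix), kind == "allow"
    return allowed


rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public.html"),
]

print(is_allowed(rules, "/private/public.html"))   # True: the longer Allow wins
print(is_allowed(rules, "/private/secret.html"))   # False: only the Disallow matches
print(is_allowed(rules, "/blog/"))                 # True: nothing matches
```

The order of the rules in the list does not matter; only the length of the matching prefix does.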

How does the Sitemap directive work in robots.txt?

The Sitemap directive tells crawlers the full URL of your XML sitemap, helping them discover pages that might not be linked from your main navigation. You can include multiple Sitemap lines for different files or sitemap indexes. Separate sitemaps for blog posts, product pages, and images are common. Place the Sitemap directive outside any User-agent block, since it applies globally to all crawlers. The Sitemap directive is not part of the core RFC 9309 specification, but Google, Bing, and Yandex all support it. Declaring your sitemap in robots.txt is a solid backup discovery method alongside direct submission through Google Search Console or Bing Webmaster Tools.
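
As an illustration, Python's standard library can list the Sitemap lines declared in a robots.txt file. The sketch below parses an in-memory sample with hypothetical example.com URLs; RobotFileParser.site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None when no Sitemap line is declared.
sitemaps = parser.site_maps()
for url in sitemaps or []:
    print(url)
```

Because the Sitemap lines sit outside the User-agent group, they are collected regardless of which bot the parser is later asked about.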

What is Crawl-delay?

Crawl-delay instructs a bot to wait the specified number of seconds between consecutive requests to your server. For example, Crawl-delay: 10 means the bot pauses 10 seconds between page fetches, reducing server load from aggressive crawling. Bingbot, Yandex, and several other crawlers support this directive. Googlebot, however, ignores Crawl-delay completely. Google retired the crawl rate limiter in Search Console in January 2024, and its documentation now recommends temporarily returning 500, 503, or 429 responses if Googlebot is overloading your server. Be careful with high values. At Crawl-delay: 30, a bot fetches at most 120 pages per hour, or 2,880 per day, so a single pass over a 100,000-page site takes about 35 days. Larger values stretch discovery and indexing out even further.
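
The cost of a given Crawl-delay is simple arithmetic. The Python sketch below assumes the bot sends exactly one request per delay interval and never crawls in parallel, which is a simplification; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def pages_per_hour(crawl_delay_seconds):
    """Upper bound on fetches per hour at one request per delay interval."""
    return 3600 / crawl_delay_seconds


def full_crawl_days(page_count, crawl_delay_seconds):
    """Days needed to fetch every page once."""
    return page_count * crawl_delay_seconds / SECONDS_PER_DAY


print(pages_per_hour(10))                       # 360.0
print(pages_per_hour(30))                       # 120.0
print(round(full_crawl_days(100_000, 30), 1))   # 34.7
```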

Is robots.txt the same as a noindex meta tag?

No, they serve fundamentally different purposes. Disallow in robots.txt prevents a crawler from fetching the page at all, but here is the catch: if another site links to that blocked URL, it can still appear in search results as a URL-only listing without a title or snippet. The noindex meta tag (or X-Robots-Tag HTTP header) works differently. It lets the crawler fetch and read the page but instructs it not to include the page in search results. There is a critical interaction between the two. If you Disallow a page in robots.txt, Googlebot cannot access it to read any noindex tag on that page, so the noindex gets effectively ignored. For reliable exclusion from search results, use noindex on pages that crawlers can access rather than relying on Disallow alone.
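
The interaction can be written out as a small decision function. This Python sketch only restates the cases described above; the function name and the outcome labels are hypothetical, not terms used by any search engine.

```python
def search_outcome(disallowed, has_noindex, linked_from_elsewhere):
    """Summarize what can happen to a URL in search results."""
    if disallowed:
        # The crawler never fetches the page, so a noindex on it is never read.
        if linked_from_elsewhere:
            return "may appear as a URL-only listing"
        return "not crawled"
    if has_noindex:
        return "crawled, excluded from results"
    return "crawled, eligible for indexing"


# Disallow plus noindex: the noindex is never seen.
print(search_outcome(disallowed=True, has_noindex=True, linked_from_elsewhere=True))
# noindex on a crawlable page: reliably excluded.
print(search_outcome(disallowed=False, has_noindex=True, linked_from_elsewhere=True))
```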

What are common robots.txt mistakes?

The most common mistake is blocking CSS and JavaScript files that search engines need to render your pages. This directly harms indexing quality and rankings. Another frequent one: using Disallow: / under User-agent: * accidentally blocks all crawlers from the entire site, and pages can disappear from search results within days. Serving robots.txt with Content-Type text/html instead of text/plain can cause parse failures, because RFC 9309 Section 2.3 requires the file to be served as UTF-8 encoded text/plain. Placing the file anywhere other than the domain root (e.g. /pages/robots.txt instead of /robots.txt) means crawlers will never find it. Other errors include using wildcards incorrectly, forgetting that path matching is case-sensitive, and treating robots.txt as a security measure. The file is publicly visible and provides zero access control.

Robots.txt Checker

Fetch, validate, and parse any domain's robots.txt file. See which bots are allowed or blocked, and inspect all crawl directives at a glance.

Domain lookup fetches the live file (1 credit). File upload and paste are free — analyzed locally in your browser.

“robots.txt is the internet's honor system. Bots don't have to obey it. Good ones do.”

Written by Ishan Karunaratne · Last reviewed: March 11, 2026

What Is robots.txt?

robots.txt is a plain text file located at the root of a website — for example, https://example.com/robots.txt. It uses the Robots Exclusion Protocol, standardized in RFC 9309 (published June 2022 by the IETF), to tell web crawlers which parts of your site they are and are not allowed to access. The protocol was originally proposed by Martijn Koster in 1994 and remained an informal convention for nearly three decades before being formally standardized.

Search engines like Google, Bing, and Yandex check robots.txt before crawling any page on your domain. According to Google's documentation, if the file returns HTTP 200, the rules within it are enforced. A 404 means the entire site is considered crawlable. A 5xx response causes Google to temporarily stop crawling the site. If the file is still unreachable after 30 days, Google falls back to the last cached copy of robots.txt or, if it has none, assumes there are no crawl restrictions. RFC 9309 §2.3.1 defines the corresponding access results for crawlers in general.
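
The status handling described above can be sketched as a small Python function. This is an illustration of the described behavior, not Google's implementation: the function name, parameters, and return labels are hypothetical, and the no-cache branch is an assumption based on Google's published documentation.

```python
def crawl_policy(status_code, days_failing=0, has_cached_copy=False):
    """Map a robots.txt fetch result to a crawl decision."""
    if status_code == 200:
        return "apply rules"
    if status_code == 404:
        return "crawl everything"              # no robots.txt means no restrictions
    if 500 <= status_code <= 599:
        if days_failing > 30:
            # Assumption: after 30 days, use the cached copy if one exists,
            # otherwise behave as if there were no restrictions.
            return "apply cached rules" if has_cached_copy else "crawl everything"
        return "pause crawling"
    return "not covered by this sketch"


print(crawl_policy(200))                                          # apply rules
print(crawl_policy(404))                                          # crawl everything
print(crawl_policy(503))                                          # pause crawling
print(crawl_policy(503, days_failing=31, has_cached_copy=True))   # apply cached rules
```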

A misconfigured robots.txt can have serious consequences. Accidentally setting Disallow: / on your wildcard user-agent will cause search engine bots to stop crawling your entire site, and pages can disappear from search results within days. Similarly, blocking CSS and JavaScript resources prevents search engines from rendering your pages, which can harm indexing and rankings even if the HTML itself is crawlable.

It's important to understand that robots.txt is an advisory protocol, not a security mechanism. It does not prevent access — anyone can view your robots.txt file, and malicious crawlers will ignore it entirely. For pages that must not be indexed, use the noindex meta tag or X-Robots-Tag HTTP header instead.

How Does This Checker Work?

This tool fetches your robots.txt file server-side and performs a comprehensive analysis based on RFC 9309 and search engine best practices:

HTTP status check. The file must return HTTP 200. A 404 means no robots.txt is present — crawlers will treat the entire site as allowed. A 5xx means the file is temporarily inaccessible, which may cause crawlers to pause (RFC 9309 §2.3).
Content-Type validation. The server must return Content-Type: text/plain. Per RFC 9309 §2.3, the file must be UTF-8 encoded plain text. If the server returns HTML instead, most bots will fail to parse it correctly.
HTML detection. Some servers return a 200 with an HTML error page instead of a proper robots.txt. This tool detects and flags that condition.
Directive parsing. The file is parsed into User-agent blocks, each with its Disallow, Allow, and Crawl-delay rules. Sitemap and Host declarations are also extracted. Duplicate rules, empty directives, and invalid sitemap URLs are flagged.
Sitemap reachability. Each declared Sitemap URL is checked with an HTTP HEAD request to verify it responds successfully. This confirms the URL is reachable — it does not validate the XML contents of the sitemap itself.
CMS detection. Common path patterns are analyzed to detect your CMS platform (WordPress, Next.js, Shopify, Drupal, etc.) and provide platform-specific recommendations for optimizing your robots.txt configuration.
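
The first three checks in the list above can be approximated in a few lines of Python. This is a rough sketch, not the checker's actual code: check_response is a hypothetical helper that takes values you have already fetched, and the HTML heuristic only inspects the first 200 characters of the body.

```python
def check_response(status, content_type, body):
    """Return a list of problems found in a robots.txt HTTP response."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")

    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"Content-Type is {media_type or 'missing'}, expected text/plain")

    # A 200 response can still carry an HTML error page instead of directives.
    head = body.lstrip()[:200].lower()
    if head.startswith("<!doctype") or "<html" in head:
        problems.append("body looks like HTML")
    return problems


print(check_response(200, "text/plain; charset=utf-8", "User-agent: *\nDisallow:\n"))
# []
print(check_response(200, "text/html", "<!DOCTYPE html><html><body>Not found</body></html>"))
# ['Content-Type is text/html, expected text/plain', 'body looks like HTML']
```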

You can also drag and drop a local robots.txt file or paste content from your clipboard to analyze it offline without using any credits. Local analysis performs all checks except HTTP status, Content-Type, and sitemap reachability.

What Directives Does robots.txt Support?

The robots.txt file supports several directives, each controlling a different aspect of crawler behavior. Below are all directives recognized by major search engines and AI crawlers, with references to the relevant specifications.

User-agent

Identifies which bot the following rules apply to. User-agent: * targets all bots. Use specific names like Googlebot or Bingbot for bot-specific rules. Per RFC 9309 §2.2.1, a crawler must use the most specific matching group — if both * and Googlebot groups exist, Googlebot only follows its own group.

Disallow

Blocks the bot from crawling the specified path and everything under it. Disallow: / blocks the entire site. Disallow: (empty value) allows all paths. Path matching is case-sensitive and prefix-based (RFC 9309 §2.2.2). The * wildcard and the $ end-of-URL anchor are defined in RFC 9309 §2.2.3 and supported by Google and Bing.

Allow

Overrides a Disallow rule for a more specific path. Useful when blocking a directory but allowing certain files within it. For example, Disallow: /private/ combined with Allow: /private/public.html lets crawlers access that one file. Formally defined in RFC 9309 §2.2.2 — when Allow and Disallow match the same URL, the most specific (longest) rule wins.

Sitemap

Declares the full URL of your XML sitemap. Multiple Sitemap lines are allowed — useful for sitemap index files or separate sitemaps for different content types. This directive is not part of the core RFC 9309 specification but is universally supported by Google, Bing, and Yandex. See Google's sitemap documentation for format requirements.

Crawl-delay

Requests the bot wait the specified number of seconds between requests. Supported by Bingbot, Yandex, and others. Not supported by Googlebot. Google retired the Search Console crawl rate limiter in January 2024 and now recommends temporarily returning 500, 503, or 429 responses to slow Googlebot down. Values over 30 seconds can significantly slow indexing and are flagged by this checker.

Host

A non-standard directive historically used by Yandex to specify the preferred domain (e.g. www vs non-www). Not recognized by Google or Bing. If you need to set a canonical domain, use rel="canonical" or 301 redirects instead, which are universally supported.

Which Crawlers Support Which Directives?

Not every crawler supports every robots.txt directive. The table below shows which directives are recognized by major search engine and AI crawlers, based on official documentation and observed behavior as of March 2026.

Directive	Googlebot	Bingbot	Yandex	GPTBot	ClaudeBot
User-agent	Yes	Yes	Yes	Yes	Yes
Disallow	Yes	Yes	Yes	Yes	Yes
Allow	Yes	Yes	Yes	Yes	Yes
Sitemap	Yes	Yes	Yes	—	—
Crawl-delay	No	Yes	Yes	—	—
Host	No	No	Yes	No	No
Wildcards (*,$)	Yes	Yes	No	—	—

A dash (—) indicates the directive is not relevant to that crawler's function. AI crawlers like GPTBot and ClaudeBot primarily respect User-agent, Disallow, and Allow directives.

How Do AI Crawlers Handle robots.txt?

With the rise of large language models (LLMs), a new generation of web crawlers has emerged. Most vendors run two or three separate bots for different jobs: a training crawler that learns from your content (GPTBot, ClaudeBot, Meta-ExternalAgent), a citation crawler that powers AI-search results and sends real referral traffic (OAI-SearchBot, PerplexityBot, Claude-SearchBot), and a user-triggered fetcher invoked only when a person clicks a citation or asks the model to read a URL (ChatGPT-User, Claude-User, Perplexity-User). Google and Apple add a fourth pattern: a robots.txt token (Google-Extended, Applebot-Extended) that is not a crawler at all, just an opt-out switch for model training.

Most respect robots.txt. You can block specific bots with a User-agent directive, for example User-agent: GPTBot followed by Disallow: /. The Generator tab includes a "Block AI Training Crawlers" preset that blocks the training bots and opt-out tokens while leaving citation and user-triggered bots allowed, so AI-search referrals still work. For the full implementation guide across robots.txt, Nginx, Apache, Cloudflare, and WordPress, see How to Block AI Bots. Before you decide whether you even should block them, Should You Block AI Bots? walks the strategic side.

Two distinctions matter. Training versus citation: blocking GPTBot has no effect on OAI-SearchBot, because they are different bots from the same company. If you want AI Overviews citations or ChatGPT search referrals, leave the citation crawlers allowed. Crawler versus token: Google-Extended and Applebot-Extended are not crawlers. Disallowing them opts your content out of Gemini and Apple Intelligence training without blocking Googlebot or Applebot. These are the only two user-agent names where "block" means "do not train on" rather than "do not crawl".
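
As a rough illustration of the training-versus-citation split, the Python sketch below builds a single robots.txt group that disallows a handful of training crawlers and opt-out tokens while leaving citation and user-triggered bots untouched. The agent list is a small sample taken from the names mentioned in this article, and block_agents is a hypothetical helper, not the Generator tab's actual preset.

```python
# Sample of training crawlers and opt-out tokens named in this article.
TRAINING_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]


def block_agents(agents):
    """Build one robots.txt group that disallows everything for the given agents."""
    lines = [f"User-agent: {name}" for name in agents]   # stacked User-agent lines
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


print(block_agents(TRAINING_AGENTS))
```

Citation crawlers such as OAI-SearchBot are not listed, so they continue to follow whatever the rest of the file says.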

Two crawlers do not reliably honor robots.txt. Bytespider (ByteDance / TikTok) is the documented non-compliant case and has been observed fetching paths it was told to skip; pair the robots.txt rule with a server- or CDN-level block. Perplexity-User may also ignore robots.txt because the fetch is user-initiated; if you want a hard block on user-triggered Perplexity fetches, enforce it at the web server.

The table below lists the major AI crawlers and tokens active as of 2026, their User-agent strings, role, and whether they respect robots.txt.

User-agent	Owner	Role	Purpose	Respects robots.txt
GPTBot	OpenAI	Training	Trains OpenAI models on your content	Yes
OAI-SearchBot	OpenAI	Citation	ChatGPT search index (sends referral traffic)	Yes
ChatGPT-User	OpenAI	User fetch	Fetches a URL when a ChatGPT user clicks a citation	Yes
ClaudeBot	Anthropic	Training	Trains Anthropic Claude on your content	Yes
anthropic-ai	Anthropic	Training	Legacy Anthropic training user-agent	Yes
Claude-SearchBot	Anthropic	Citation	Powers Claude citations	Yes
Claude-User	Anthropic	User fetch	Fetches a URL when a Claude user clicks a citation	Yes
PerplexityBot	Perplexity	Citation	Perplexity citation index (sends referral traffic)	Yes
Perplexity-User	Perplexity	User fetch	Fetches a URL when a Perplexity user requests it	Sometimes
Google-Extended	Google	Token	Opt-out token for Gemini training (not a crawler)	Yes (token only)
Applebot-Extended	Apple	Token	Opt-out token for Apple Intelligence training (not a crawler)	Yes (token only)
Bytespider	ByteDance	Training	ByteDance / TikTok AI training (block at server or CDN)	No
CCBot	Common Crawl	Training	Open dataset used by many AI labs to train models	Yes
Meta-ExternalAgent	Meta	Training	Meta AI training crawler	Yes
Amazonbot	Amazon	Training	Alexa and Amazon AI training	Yes
cohere-ai	Cohere	Training	Cohere LLM training	Yes
MistralAI-User	Mistral	User fetch	Fetches a URL when a Mistral user requests it	Yes
DuckAssistBot	DuckDuckGo	Citation	Powers DuckAssist citations	Yes

What Are robots.txt Best Practices?

A well-configured robots.txt file helps search engines and AI crawlers crawl your site efficiently while protecting server resources and keeping bots out of low-value paths. These best practices are based on RFC 9309 and official guidance from Google, Bing, and other major crawlers.

Always place robots.txt at the domain root. The file must be accessible at https://yourdomain.com/robots.txt — no subdirectory, no alternate filename. Crawlers only check this exact path.
Serve it as Content-Type: text/plain. If your server returns HTML (common with custom 404 pages or reverse proxies), crawlers cannot parse the directives and may treat the file as invalid.
Never block CSS, JavaScript, or image files that search engines need to render your pages. Google renders pages like a browser. Blocking render-critical resources causes indexing issues and can drop your rankings.
Use specific User-agent blocks for bot-specific rules. Rather than applying overly broad restrictions to all bots via User-agent: *, create targeted blocks for individual crawlers — especially when managing AI crawler access separately from search engines.
Declare your Sitemap URL. Adding Sitemap: https://yourdomain.com/sitemap.xml helps crawlers discover all your pages, even those not linked from your navigation.
Keep the file under 500 KiB. Google enforces a 500 KiB size limit. Content beyond this limit is ignored, which could leave critical rules unprocessed.
Do not rely on robots.txt for security. The file is publicly accessible and provides no access control. For pages with sensitive content, use authentication, server-side access rules, or the noindex meta tag.
Test changes before deploying. Use a robots.txt testing tool (like this one) or the robots.txt report in Google Search Console to verify that your changes don't accidentally block important pages.
Review your robots.txt periodically. As your site grows and your CMS changes, outdated rules can block new sections or allow paths that should be restricted. Audit the file at least quarterly.
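
A few of these practices are mechanical enough to check in a script before each deploy. The Python sketch below covers three of them (the size limit, a declared Sitemap, and the Crawl-delay threshold mentioned earlier). The lint function and its messages are hypothetical, and it is far less thorough than a full checker.

```python
MAX_BYTES = 500 * 1024          # 500 KiB size limit
MAX_CRAWL_DELAY = 30            # seconds; higher values slow indexing


def lint(text):
    """Return a list of warnings for a robots.txt file given as a string."""
    warnings = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        warnings.append("file exceeds 500 KiB; trailing rules may be ignored")

    # Strip comments and normalize case so directive names compare reliably.
    lines = [raw.split("#", 1)[0].strip().lower() for raw in text.splitlines()]
    if not any(line.startswith("sitemap:") for line in lines):
        warnings.append("no Sitemap declared")

    for line in lines:
        if line.startswith("crawl-delay:"):
            value = line.split(":", 1)[1].strip()
            if value.replace(".", "", 1).isdigit() and float(value) > MAX_CRAWL_DELAY:
                warnings.append(f"Crawl-delay {value} exceeds 30 seconds")
    return warnings


print(lint("User-agent: *\nDisallow:\n"))
# ['no Sitemap declared']
print(lint("User-agent: *\nCrawl-delay: 60\nSitemap: https://example.com/sitemap.xml\n"))
# ['Crawl-delay 60 exceeds 30 seconds']
```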

Every check this tool runs is also available via the robots.txt API with examples in cURL, JavaScript, Python, PHP, Ruby, and Java.

API docs

Built and maintained alongside this tool. Free, no signup required.

On-page SEO audit. 70+ ranking factors graded for any URL.
Redirect chain tracer. Every hop with status codes and TLS info.
HTTP header checker. Full timing waterfall plus headers and TLS details.
Page speed test. Core Web Vitals: LCP, INP, CLS.

Where Can You Learn More About robots.txt?

RFC 9309 — Robots Exclusion Protocol

Google Robots.txt Specification

Bing Robots.txt Guide

Google Sitemaps Documentation

Yandex Robots.txt Documentation

robotstxt.org
