Robots.txt Checker

Fetch, validate, and parse any domain's robots.txt file. See which bots are allowed or blocked, and inspect all crawl directives at a glance.

Domain lookup fetches the live file (1 credit). File upload and paste are free — analyzed locally in your browser.

robots.txt is the internet's honor system. Bots don't have to obey it. Good ones do.

Written by Ishan Karunaratne · Last reviewed:

What Is robots.txt?

robots.txt is a plain text file located at the root of a website — for example, https://example.com/robots.txt. It uses the Robots Exclusion Protocol, standardized in RFC 9309 (published June 2022 by the IETF), to tell web crawlers which parts of your site they are and are not allowed to access. The protocol was originally proposed by Martijn Koster in 1994 and remained an informal convention for nearly three decades before being formally standardized.

Search engines like Google, Bing, and Yandex check robots.txt before crawling any page on your domain. According to Google's documentation, if the file returns HTTP 200, the rules within it are enforced. A 404 means the entire site is considered crawlable. A 5xx response causes Google to temporarily stop crawling — and after 30 days of consecutive 5xx responses, Google treats the last cached version of robots.txt as authoritative (RFC 9309 §2.3).
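The status handling described above can be sketched as a small lookup. This is an illustrative simplification, not Google's actual logic: the function name and the policy labels are hypothetical, and real crawlers add retry and caching behavior that is omitted here.

```python
def crawl_policy_for_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a coarse crawl policy.

    The labels are made up for this sketch; they are not from any spec.
    """
    if 200 <= status < 300:
        return "apply-rules"       # parse the body and obey its directives
    if 300 <= status < 400:
        return "follow-redirect"   # fetch the redirect target instead
    if status == 429 or 500 <= status < 600:
        # Server-side trouble: back off rather than assume access.
        # (Treating 429 like a 5xx is an assumption based on Google's docs.)
        return "pause-crawling"
    if 400 <= status < 500:
        return "allow-all"         # file treated as absent, no restrictions
    return "unknown"


for code in (200, 404, 503):
    print(code, crawl_policy_for_status(code))
```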

A misconfigured robots.txt can have serious consequences. Accidentally setting Disallow: / for your wildcard user-agent will cause search engine bots to stop crawling your entire site, and pages can disappear from search results within days. Similarly, blocking CSS and JavaScript resources prevents search engines from rendering your pages, which can harm indexing and rankings even if the HTML itself is crawlable.

It's important to understand that robots.txt is an advisory protocol, not a security mechanism. It does not prevent access: anyone can view your robots.txt file, and malicious crawlers will ignore it entirely. For pages that must not be indexed, use the noindex meta tag or X-Robots-Tag HTTP header instead, and leave those pages crawlable, because a crawler that is blocked by robots.txt never fetches the page and so never sees the noindex signal.

How Does This Checker Work?

This tool fetches your robots.txt file server-side and performs a comprehensive analysis based on RFC 9309 and search engine best practices:

  1. HTTP status check. The file must return HTTP 200. A 404 means no robots.txt is present — crawlers will treat the entire site as allowed. A 5xx means the file is temporarily inaccessible, which may cause crawlers to pause (RFC 9309 §2.3).
  2. Content-Type validation. The server must return Content-Type: text/plain. Per RFC 9309 §2.3, the file must be UTF-8 encoded and served with the text/plain media type. If it returns HTML, most bots will fail to parse it correctly.
  3. HTML detection. Some servers return a 200 with an HTML error page instead of a proper robots.txt. This tool detects and flags that condition.
  4. Directive parsing. The file is parsed into User-agent blocks, each with its Disallow, Allow, and Crawl-delay rules. Sitemap and Host declarations are also extracted. Duplicate rules, empty directives, and invalid sitemap URLs are flagged.
  5. Sitemap reachability. Each declared Sitemap URL is checked with an HTTP HEAD request to verify it responds successfully. This confirms the URL is reachable — it does not validate the XML contents of the sitemap itself.
  6. CMS detection. Common path patterns are analyzed to detect your CMS platform (WordPress, Next.js, Shopify, Drupal, etc.) and provide platform-specific recommendations for optimizing your robots.txt configuration.
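The HTML detection and directive parsing steps above can be sketched roughly as follows. This is a minimal illustration, not this tool's actual implementation: the `Group` class and both function names are hypothetical, and error reporting (duplicate rules, invalid sitemap URLs) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One User-agent block and the rules that follow it (hypothetical type)."""
    user_agents: list = field(default_factory=list)
    rules: list = field(default_factory=list)  # (directive, value) pairs


def looks_like_html(body: str) -> bool:
    """Heuristic: does a 200 response contain an HTML page instead of rules?"""
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def parse_robots(body: str):
    """Split robots.txt text into User-agent groups plus Sitemap URLs."""
    groups, sitemaps = [], []
    current = None
    last_was_agent = False
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share one group.
            if not last_was_agent:
                current = Group()
                groups.append(current)
            current.user_agents.append(value.lower())
            last_was_agent = True
            continue
        if key == "sitemap":
            sitemaps.append(value)            # applies file-wide, not per group
            continue
        if key in ("allow", "disallow", "crawl-delay") and current is not None:
            current.rules.append((key, value))
        last_was_agent = False
    return groups, sitemaps
```

Calling `parse_robots` on a file with one `User-agent: *` block and one `Sitemap` line would return a single group and a one-element sitemap list.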

You can also drag and drop a local robots.txt file or paste content from your clipboard to analyze it offline without using any credits. Local analysis performs all checks except HTTP status, Content-Type, and sitemap reachability.

What Directives Does robots.txt Support?

The robots.txt file supports several directives, each controlling a different aspect of crawler behavior. Below are all directives recognized by major search engines and AI crawlers, with references to the relevant specifications.

User-agent

Identifies which bot the following rules apply to. User-agent: * targets all bots. Use specific names like Googlebot or Bingbot for bot-specific rules. Per RFC 9309 §2.2.1, a crawler must use the most specific matching group — if both * and Googlebot groups exist, Googlebot only follows its own group.
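A minimal sketch of that group selection, assuming the file has already been parsed into a dictionary keyed by user-agent name (the `select_group` helper and the dictionary shape are hypothetical):

```python
def select_group(groups: dict, product_token: str) -> list:
    """Return the rules for the named crawler, or the * group as a fallback."""
    token = product_token.lower()
    for name, rules in groups.items():
        if name.lower() == token:       # user-agent names compare case-insensitively
            return rules
    return groups.get("*", [])


groups = {
    "*": ["Disallow: /private/"],
    "Googlebot": ["Disallow: /drafts/"],
}

print(select_group(groups, "googlebot"))  # only the Googlebot group
print(select_group(groups, "Bingbot"))    # falls back to the * group
```

Note that the Googlebot result does not include the `/private/` rule: a crawler with its own group does not inherit the wildcard group's rules.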

Disallow

Blocks the bot from crawling the specified path and everything under it. Disallow: / blocks the entire site. Disallow: (empty value) allows all paths. Path matching is case-sensitive and prefix-based (RFC 9309 §2.2.2). The * wildcard and the $ end-of-URL anchor are defined in RFC 9309 §2.2.3 and are supported by Google and Bing.
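One way to picture prefix matching with `*` and `$` is to translate each path pattern into a regular expression. The sketch below is an assumption about how such matching can be implemented, not any crawler's real code; the helper names are hypothetical, and a caller would skip empty patterns, since an empty Disallow value matches nothing.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each * matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if the pattern matches the path from its first character."""
    # re.match anchors at the start, which gives prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/private/", "/private/report.html"))   # prefix match
print(path_matches("/Private/", "/private/report.html"))   # case differs
print(path_matches("/*.pdf$", "/docs/file.pdf"))           # wildcard + anchor
print(path_matches("/*.pdf$", "/docs/file.pdf?download=1"))
```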

Allow

Overrides a Disallow rule for a more specific path. Useful when blocking a directory but allowing certain files within it. For example, Disallow: /private/ combined with Allow: /private/public.html lets crawlers access that one file. Formally defined in RFC 9309 §2.2.2 — when Allow and Disallow match the same URL, the most specific (longest) rule wins.
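The longest-match rule can be sketched in a few lines. To stay short, this example uses plain prefix matching without wildcards, and the `is_allowed` helper and its rule format are hypothetical. On a tie in length it prefers Allow, which is the less restrictive choice.

```python
def is_allowed(rules: list, path: str) -> bool:
    """Resolve Allow/Disallow conflicts by picking the longest matching rule."""
    best_len, allowed = -1, True          # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue                      # empty or non-matching rule
        is_allow = directive.lower() == "allow"
        longer = len(pattern) > best_len
        tie_for_allow = len(pattern) == best_len and is_allow
        if longer or tie_for_allow:
            best_len, allowed = len(pattern), is_allow
    return allowed


rules = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public.html"),
]

print(is_allowed(rules, "/private/public.html"))  # Allow rule is longer
print(is_allowed(rules, "/private/secret.html"))  # only Disallow matches
print(is_allowed(rules, "/index.html"))           # nothing matches
```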

Sitemap

Declares the full URL of your XML sitemap. Multiple Sitemap lines are allowed — useful for sitemap index files or separate sitemaps for different content types. This directive is not part of the core RFC 9309 specification but is universally supported by Google, Bing, and Yandex. See Google's sitemap documentation for format requirements.
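As a quick illustration, Python's standard-library `urllib.robotparser` can extract Sitemap declarations through `RobotFileParser.site_maps()` (available since Python 3.8). The sample file and the `declared_sitemaps` helper below are made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""


def declared_sitemaps(robots_txt: str) -> list:
    """Return every Sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file declares no sitemaps.
    return parser.site_maps() or []


for url in declared_sitemaps(ROBOTS_TXT):
    print(url)
```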

Crawl-delay

Requests that the bot wait the specified number of seconds between requests. Supported by Bingbot, Yandex, and others. Not supported by Googlebot, which ignores the directive; Google retired the crawl rate setting in Search Console in January 2024, and its documentation now suggests temporarily returning 500, 503, or 429 responses when Googlebot needs to slow down. Values over 30 seconds can significantly slow indexing and are flagged by this checker.
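A rough sketch of reading Crawl-delay values and flagging large ones, using `RobotFileParser.crawl_delay()` from Python's standard library. The 30-second threshold comes from the text above; the sample file and the `delay_report` helper are hypothetical and do not reflect how this checker is implemented.

```python
from urllib.robotparser import RobotFileParser

MAX_DELAY_SECONDS = 30  # threshold mentioned in the text above

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 5

User-agent: *
Crawl-delay: 60
"""


def delay_report(robots_txt: str, agent: str):
    """Return (delay, flagged) for the given crawler name."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)     # None if no delay applies
    flagged = delay is not None and delay > MAX_DELAY_SECONDS
    return delay, flagged


for agent in ("Bingbot", "SomeOtherBot"):
    print(agent, delay_report(ROBOTS_TXT, agent))
```

Here the unnamed crawler falls back to the wildcard group, so it receives the 60-second delay and is flagged.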

Host

A non-standard directive historically used by Yandex to specify the preferred domain (e.g. www vs non-www). Not recognized by Google or Bing. If you need to set a canonical domain, use rel="canonical" or 301 redirects instead, which are universally supported.

Which Crawlers Support Which Directives?

Not every crawler supports every robots.txt directive. The table below shows which directives are recognized by major search engine and AI crawlers, based on official documentation and observed behavior as of March 2026.

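| Directive | Googlebot | Bingbot | Yandex | GPTBot | ClaudeBot |
| --- | --- | --- | --- | --- | --- |
| User-agent | Yes | Yes | Yes | Yes | Yes |
| Disallow | Yes | Yes | Yes | Yes | Yes |
| Allow | Yes | Yes | Yes | Yes | Yes |
| Sitemap | Yes | Yes | Yes | - | - |
| Crawl-delay | No | Yes | Yes | - | - |
| Host | No | No | Yes | No | No |
| Wildcards (*, $) | Yes | Yes | No | - | - |

A dash (-) indicates the directive is not relevant to that crawler's function. AI crawlers like GPTBot and ClaudeBot primarily respect User-agent, Disallow, and Allow directives.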

How Do AI Crawlers Handle robots.txt?

With the rise of large language models (LLMs), a new generation of web crawlers has emerged. Companies like OpenAI (GPTBot, ChatGPT-User), Anthropic (ClaudeBot, anthropic-ai), Google (Google-Extended), Meta (FacebookBot for AI training), and others now crawl the web to build training datasets and to power AI search features.

These crawlers generally respect robots.txt rules. You can block them using specific User-agent directives — for example, User-agent: GPTBot followed by Disallow: / prevents OpenAI from crawling your site. The Generator tab in this tool includes presets for blocking all known AI crawlers.

Note that blocking AI crawlers is separate from blocking search engine crawlers. You can allow Googlebot to index your site for search results while simultaneously blocking GPTBot from using your content for AI training. Each bot uses its own User-agent string and follows its own group in robots.txt.
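The separation between AI crawlers and search crawlers can be checked with Python's standard-library `urllib.robotparser`. The sample file, URLs, and `can_crawl` helper below are illustrative assumptions. Keep in mind that the stdlib parser's matching differs in some details from RFC 9309 (for example, it applies the first matching rule rather than the longest), so treat it as an approximation of how a given bot behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Ask the stdlib parser whether the named crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


article = "https://example.com/articles/1"
print(can_crawl(ROBOTS_TXT, "GPTBot", article))      # blocked by its own group
print(can_crawl(ROBOTS_TXT, "Googlebot", article))   # falls under the * group
print(can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/admin/"))
```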

The table below lists the major AI crawlers active as of 2026, their User-agent strings, and whether they respect robots.txt directives.

| User-agent | Owner | Purpose | Respects robots.txt |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data collection | Yes |
| OAI-SearchBot | OpenAI | ChatGPT search results | Yes |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | Yes |
| ClaudeBot | Anthropic | Training data and web features | Yes |
| PerplexityBot | Perplexity | AI-powered search answers | Yes |
| Google-Extended | Google | Gemini AI training (not Search) | Yes |
| Bytespider | ByteDance | TikTok / Douyin AI training | Yes |
| CCBot | Common Crawl | Open dataset used by many AI labs | Yes |

What Are robots.txt Best Practices?

A well-configured robots.txt file helps search engines and AI crawlers crawl your site efficiently while protecting server resources and keeping well-behaved bots away from paths you do not want crawled. It does not by itself keep a URL out of search results; that requires noindex. These best practices are based on RFC 9309 and official guidance from Google, Bing, and other major crawlers.

  1. Always place robots.txt at the domain root. The file must be accessible at https://yourdomain.com/robots.txt — no subdirectory, no alternate filename. Crawlers only check this exact path.
  2. Serve it as Content-Type: text/plain. If your server returns HTML (common with custom 404 pages or reverse proxies), crawlers cannot parse the directives and may treat the file as invalid.
  3. Never block CSS, JavaScript, or image files that search engines need to render your pages. Google renders pages like a browser. Blocking render-critical resources causes indexing issues and can drop your rankings.
  4. Use specific User-agent blocks for bot-specific rules. Rather than applying overly broad restrictions to all bots via User-agent: *, create targeted blocks for individual crawlers — especially when managing AI crawler access separately from search engines.
  5. Declare your Sitemap URL. Adding Sitemap: https://yourdomain.com/sitemap.xml helps crawlers discover all your pages, even those not linked from your navigation.
  6. Keep the file under 500 KiB. Google enforces a 500 KiB (512,000-byte) size limit. Content beyond this limit is ignored, which could leave critical rules unprocessed.
  7. Do not rely on robots.txt for security. The file is publicly accessible and provides no access control. For pages with sensitive content, use authentication, server-side access rules, or the noindex meta tag.
  8. Test changes before deploying. Use a robots.txt testing tool (like this one) or the robots.txt report in Google Search Console, which replaced the older robots.txt Tester, to verify that your changes don't accidentally block important pages.
  9. Review your robots.txt periodically. As your site grows and your CMS changes, outdated rules can block new sections or allow paths that should be restricted. Audit the file at least quarterly.
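Several of these practices can be checked mechanically before deploying. The sketch below covers only the Content-Type, size, and encoding points; the `lint_response` helper and its messages are hypothetical, and the size constant reflects the limit in Google's documentation (500 KiB, i.e. 512,000 bytes).

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB, the limit cited for Google


def lint_response(content_type: str, body: bytes) -> list:
    """Return a list of problems found in a fetched robots.txt response."""
    problems = []
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"unexpected Content-Type: {media_type or '(missing)'}")
    if len(body) > MAX_ROBOTS_BYTES:
        problems.append("file exceeds 500 KiB; later rules may be ignored")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems


print(lint_response("text/plain; charset=utf-8", b"User-agent: *\nDisallow:\n"))
print(lint_response("text/html", b"<html>Not found</html>"))
```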
