Robots.txt Checker
Fetch, validate, and parse any domain's robots.txt file. See which bots are allowed or blocked, and inspect all crawl directives at a glance.
Domain lookup fetches the live file (1 credit). File upload and paste are free — analyzed locally in your browser.
“robots.txt is the internet's honor system. Bots don't have to obey it. Good ones do.”
Written by Ishan Karunaratne
What Is robots.txt?
robots.txt is a plain text file located at the root of a website — for example, https://example.com/robots.txt. It uses the Robots Exclusion Protocol, standardized in RFC 9309 (published June 2022 by the IETF), to tell web crawlers which parts of your site they are and are not allowed to access. The protocol was originally proposed by Martijn Koster in 1994 and remained an informal convention for nearly three decades before being formally standardized.
Search engines like Google, Bing, and Yandex check robots.txt before crawling any page on your domain. According to Google's documentation, if the file returns HTTP 200, the rules within it are enforced. A 404 means the entire site is considered crawlable. A 5xx response causes Google to temporarily stop crawling — and after 30 days of consecutive 5xx responses, Google treats the last cached version of robots.txt as authoritative (RFC 9309 §2.3).
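The status handling described above can be sketched as a small lookup. This is an illustrative simplification, not Google's actual logic: the function name `crawl_policy` and the policy labels are hypothetical, and treating every 4xx like a 404 is an assumption (some crawlers handle 429 like a server error).

```python
def crawl_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a coarse crawl policy.

    The labels are made up for this sketch; real crawlers add caching,
    retries, and redirect handling on top of this.
    """
    if 200 <= status_code < 300:
        return "apply-rules"      # file fetched: parse it and obey its rules
    if 400 <= status_code < 500:
        return "allow-all"        # e.g. 404: no robots.txt, site is crawlable
    if 500 <= status_code < 600:
        return "pause-crawling"   # server error: back off and retry later
    return "unknown"              # redirects and anything else are out of scope


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, crawl_policy(code))
```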
A misconfigured robots.txt can have serious consequences. Accidentally setting Disallow: / for the wildcard user-agent tells search engine bots to stop crawling your entire site, and pages can disappear from search results within days. Similarly, blocking CSS and JavaScript resources prevents search engines from rendering your pages, which can harm indexing and rankings even if the HTML itself is crawlable.
It's important to understand that robots.txt is an advisory protocol, not a security mechanism. It does not prevent access: anyone can view your robots.txt file, and malicious crawlers will ignore it entirely. For pages that must not be indexed, use the noindex meta tag or X-Robots-Tag HTTP header instead, and leave those pages crawlable, because a crawler that is blocked from fetching a page never sees its noindex instruction.
How Does This Checker Work?
This tool fetches your robots.txt file server-side and performs a comprehensive analysis based on RFC 9309 and search engine best practices:
- HTTP status check. The file must return HTTP 200. A 404 means no robots.txt is present — crawlers will treat the entire site as allowed. A 5xx means the file is temporarily inaccessible, which may cause crawlers to pause (RFC 9309 §2.3).
- Content-Type validation. The server must return Content-Type: text/plain. Per RFC 9309 §2.3, crawlers should parse the file as UTF-8 encoded text. If it returns HTML, most bots will fail to parse it correctly.
- HTML detection. Some servers return a 200 with an HTML error page instead of a proper robots.txt. This tool detects and flags that condition.
- Directive parsing. The file is parsed into User-agent blocks, each with its Disallow, Allow, and Crawl-delay rules. Sitemap and Host declarations are also extracted. Duplicate rules, empty directives, and invalid sitemap URLs are flagged.
- Sitemap reachability. Each declared Sitemap URL is checked with an HTTP HEAD request to verify it responds successfully. This confirms the URL is reachable — it does not validate the XML contents of the sitemap itself.
- CMS detection. Common path patterns are analyzed to detect your CMS platform (WordPress, Next.js, Shopify, Drupal, etc.) and provide platform-specific recommendations for optimizing your robots.txt configuration.
You can also drag and drop a local robots.txt file or paste content from your clipboard to analyze it offline without using any credits. Local analysis performs all checks except HTTP status, Content-Type, and sitemap reachability.
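The directive-parsing step can be illustrated with a minimal sketch. This is not the tool's actual parser: the `Group` class and `parse_robots` function are hypothetical names, and the sketch skips duplicate detection, Host lines, and sitemap URL validation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    """One User-agent block and the rules that follow it."""
    user_agents: list = field(default_factory=list)
    rules: list = field(default_factory=list)   # ("allow" | "disallow", path)
    crawl_delay: Optional[float] = None


def parse_robots(text: str):
    """Split robots.txt text into User-agent groups and Sitemap URLs."""
    groups, sitemaps = [], []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments and whitespace
        if ":" not in line:
            continue                             # blank or malformed line
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share one group; otherwise start anew.
            if current is None or not last_was_agent:
                current = Group()
                groups.append(current)
            current.user_agents.append(value.lower())
            last_was_agent = True
            continue
        last_was_agent = False
        if key == "sitemap":
            sitemaps.append(value)               # Sitemap lines are group-independent
        elif current is not None and key in ("allow", "disallow"):
            current.rules.append((key, value))
        elif current is not None and key == "crawl-delay":
            try:
                current.crawl_delay = float(value)
            except ValueError:
                pass                             # ignore non-numeric delays
    return groups, sitemaps
```

Because it works on a string, a parser like this needs no network access, which is why the local checks can run without fetching anything.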
What Directives Does robots.txt Support?
The robots.txt file supports several directives, each controlling a different aspect of crawler behavior. Below are all directives recognized by major search engines and AI crawlers, with references to the relevant specifications.
User-agent
Identifies which bot the following rules apply to. User-agent: * targets all bots. Use specific names like Googlebot or Bingbot for bot-specific rules. Per RFC 9309 §2.2.1, a crawler must use the most specific matching group — if both * and Googlebot groups exist, Googlebot only follows its own group.
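As a rough illustration of group selection, the sketch below picks a crawler's own group when one exists and falls back to the wildcard group otherwise. The function name `select_group` and the dictionary layout (lower-cased user-agent token mapped to its rule lines) are assumptions made for this example.

```python
def select_group(groups: dict, product_token: str) -> list:
    """Return the rule lines for the most specific group matching a crawler.

    `groups` maps lower-cased user-agent tokens to lists of rule lines.
    A named group wins over "*"; with neither present, no rules apply.
    """
    token = product_token.lower()    # user-agent matching is case-insensitive
    if token in groups:
        return groups[token]
    return groups.get("*", [])


if __name__ == "__main__":
    groups = {
        "*": ["Disallow: /private/"],
        "googlebot": ["Disallow: /tmp/"],
    }
    print(select_group(groups, "Googlebot"))   # Googlebot's own group only
    print(select_group(groups, "Bingbot"))     # falls back to the wildcard group
```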
Disallow
Blocks the bot from crawling the specified path and everything under it. Disallow: / blocks the entire site. Disallow: (empty value) allows all paths. Path matching is case-sensitive and prefix-based (RFC 9309 §2.2.2). RFC 9309 §2.2.3 also defines the * wildcard and the $ end-of-pattern anchor, both of which Google and Bing support.
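One way to picture prefix matching with * and $ is to translate a rule path into a regular expression. This is a sketch under simplifying assumptions: `pattern_to_regex` and `path_matches` are hypothetical helpers, percent-encoding normalization is ignored, and an empty pattern is left for the caller to handle.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")          # a trailing $ pins the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each * stand for any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if `path` matches `pattern` from its first character (prefix match)."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(path_matches("/private/", "/private/a.html"))
    print(path_matches("/*.pdf$", "/docs/a.pdf?x=1"))
```

`re.match` only anchors at the start of the string, which is what gives the prefix behavior; the end is left open unless the pattern ends in $.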
Allow
Overrides a Disallow rule for a more specific path. Useful when blocking a directory but allowing certain files within it. For example, Disallow: /private/ combined with Allow: /private/public.html lets crawlers access that one file. Formally defined in RFC 9309 §2.2.2 — when Allow and Disallow match the same URL, the most specific (longest) rule wins.
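The longest-match rule can be sketched in a few lines. The helper `is_allowed` is hypothetical and uses plain prefix matching only (no wildcards); it also assumes that a tie between equally long Allow and Disallow rules goes to Allow, which is what RFC 9309 recommends.

```python
def is_allowed(rules: list, path: str) -> bool:
    """Decide access for `path` given (kind, pattern) rules of one group.

    `kind` is "allow" or "disallow". The longest matching pattern wins;
    on a tie in length, "allow" wins. With no matching rule, access is allowed.
    """
    best_len = -1
    allowed = True
    for kind, pattern in rules:
        if not pattern:
            continue                          # an empty value matches nothing
        if not path.startswith(pattern):
            continue                          # case-sensitive prefix match
        longer = len(pattern) > best_len
        tie_for_allow = len(pattern) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len = len(pattern)
            allowed = (kind == "allow")
    return allowed


if __name__ == "__main__":
    rules = [("disallow", "/private/"), ("allow", "/private/public.html")]
    print(is_allowed(rules, "/private/public.html"))
    print(is_allowed(rules, "/private/secret.html"))
```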
Sitemap
Declares the full URL of your XML sitemap. Multiple Sitemap lines are allowed — useful for sitemap index files or separate sitemaps for different content types. This directive is not part of the core RFC 9309 specification but is universally supported by Google, Bing, and Yandex. See Google's sitemap documentation for format requirements.
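For a quick look at which sitemaps a file declares, Python's standard `urllib.robotparser` exposes `site_maps()` (Python 3.8 and later). The wrapper name `declared_sitemaps` and the example URLs are illustrative assumptions; the sketch only lists the URLs and does not check that they are reachable.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list:
    """Return every Sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file declares no sitemaps.
    return parser.site_maps() or []


if __name__ == "__main__":
    text = (
        "User-agent: *\n"
        "Disallow:\n"
        "Sitemap: https://example.com/sitemap.xml\n"
        "Sitemap: https://example.com/news-sitemap.xml\n"
    )
    for url in declared_sitemaps(text):
        print(url)
```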
Crawl-delay
Requests that the bot wait the specified number of seconds between requests. Supported by Bingbot, Yandex, and others. Not supported by Googlebot, which ignores the directive and adjusts its crawl rate automatically; the legacy crawl-rate setting in Google Search Console was retired in January 2024. Values over 30 seconds can significantly slow indexing and are flagged by this checker.
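A check for excessive delays might look like the sketch below, which leans on `crawl_delay()` from Python's standard `urllib.robotparser` (available since Python 3.6). The function name `crawl_delay_report` and its 30-second default are assumptions modeled on the threshold mentioned above, not this checker's real code. Note that the standard parser only accepts whole-number delays.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_report(robots_txt: str, user_agent: str, max_delay: int = 30):
    """Return (delay, flagged) for one crawler.

    `delay` is None when no Crawl-delay applies to `user_agent`;
    `flagged` is True when the delay exceeds `max_delay` seconds.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return delay, (delay is not None and delay > max_delay)


if __name__ == "__main__":
    text = (
        "User-agent: bingbot\n"
        "Crawl-delay: 45\n"
        "\n"
        "User-agent: *\n"
        "Crawl-delay: 5\n"
    )
    print(crawl_delay_report(text, "bingbot"))
    print(crawl_delay_report(text, "SomeOtherBot"))
```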
Host
A non-standard directive historically used by Yandex to specify the preferred domain (e.g. www vs non-www). Not recognized by Google or Bing. If you need to set a canonical domain, use rel="canonical" or 301 redirects instead, which are universally supported.
Which Crawlers Support Which Directives?
Not every crawler supports every robots.txt directive. The table below shows which directives are recognized by major search engine and AI crawlers, based on official documentation and observed behavior as of March 2026.
| Directive | Googlebot | Bingbot | Yandex | GPTBot | ClaudeBot |
|---|---|---|---|---|---|
| User-agent | Yes | Yes | Yes | Yes | Yes |
| Disallow | Yes | Yes | Yes | Yes | Yes |
| Allow | Yes | Yes | Yes | Yes | Yes |
| Sitemap | Yes | Yes | Yes | — | — |
| Crawl-delay | No | Yes | Yes | — | — |
| Host | No | No | Yes | No | No |
| Wildcards (*,$) | Yes | Yes | Yes | — | — |
A dash (—) indicates the directive is not relevant to that crawler's function. AI crawlers like GPTBot and ClaudeBot primarily respect User-agent, Disallow, and Allow directives.
How Do AI Crawlers Handle robots.txt?
With the rise of large language models (LLMs), a new generation of web crawlers has emerged. Companies like OpenAI (GPTBot, ChatGPT-User), Anthropic (ClaudeBot, anthropic-ai), Google (Google-Extended), Meta (FacebookBot for AI training), and others now crawl the web to build training datasets and power AI search features.
These crawlers generally respect robots.txt rules. You can block them using specific User-agent directives — for example, User-agent: GPTBot followed by Disallow: / prevents OpenAI from crawling your site. The Generator tab in this tool includes presets for blocking all known AI crawlers.
Note that blocking AI crawlers is separate from blocking search engine crawlers. You can allow Googlebot to index your site for search results while simultaneously blocking GPTBot from using your content for AI training. Each bot uses its own User-agent string and follows its own group in robots.txt.
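To make the separation concrete, the sketch below builds a robots.txt that blocks a few AI crawlers while leaving everything else open, then checks the result with Python's standard `urllib.robotparser`. The helper names `build_robots_txt` and `can_crawl`, the short crawler list, and the example URL are assumptions for illustration; this is not the output of the Generator tab.

```python
from urllib.robotparser import RobotFileParser

# A small sample of AI crawler tokens; extend it to match your own policy.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def build_robots_txt(blocked_agents: list) -> str:
    """Block the listed agents site-wide and allow every other crawler."""
    lines = [f"User-agent: {agent}" for agent in blocked_agents]
    lines += ["Disallow: /", "", "User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the standard-library parser whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots_txt = build_robots_txt(AI_CRAWLERS)
    print(robots_txt)
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_crawl(robots_txt, agent, "https://example.com/article"))
```

Listing several User-agent lines above a single Disallow: / keeps the file short; one group per crawler works just as well if you expect the rules to diverge later.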
The table below lists the major AI crawlers active as of 2026, their User-agent strings, and whether they respect robots.txt directives.
| User-agent | Owner | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training data collection | Yes |
| OAI-SearchBot | OpenAI | ChatGPT search results | Yes |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | Yes |
| ClaudeBot | Anthropic | Training data and web features | Yes |
| PerplexityBot | Perplexity | AI-powered search answers | Yes |
| Google-Extended | Google | Gemini AI training (not Search) | Yes |
| Bytespider | ByteDance | TikTok / Douyin AI training | Yes |
| CCBot | Common Crawl | Open dataset used by many AI labs | Yes |
What Are robots.txt Best Practices?
A well-configured robots.txt file helps search engines and AI crawlers index your site efficiently while protecting server resources and preventing sensitive paths from appearing in search results. These best practices are based on RFC 9309 and official guidance from Google, Bing, and other major crawlers.
- Always place robots.txt at the domain root. The file must be accessible at https://yourdomain.com/robots.txt, with no subdirectory and no alternate filename. Crawlers only check this exact path.
- Serve it as Content-Type: text/plain. If your server returns HTML (common with custom 404 pages or reverse proxies), crawlers cannot parse the directives and may treat the file as invalid.
- Never block CSS, JavaScript, or image files that search engines need to render your pages. Google renders pages like a browser. Blocking render-critical resources causes indexing issues and can drop your rankings.
- Use specific User-agent blocks for bot-specific rules. Rather than applying overly broad restrictions to all bots via User-agent: *, create targeted blocks for individual crawlers, especially when managing AI crawler access separately from search engines.
- Declare your Sitemap URL. Adding Sitemap: https://yourdomain.com/sitemap.xml helps crawlers discover all your pages, even those not linked from your navigation.
- Keep the file under 500KB. Google enforces a 500KB size limit. Content beyond this limit is ignored, which could leave critical rules unprocessed.
- Do not rely on robots.txt for security. The file is publicly accessible and provides no access control. For pages with sensitive content, use authentication, server-side access rules, or the noindex meta tag.
- Test changes before deploying. Use a robots.txt testing tool (like this one) or the robots.txt report in Google Search Console to verify that your changes don't accidentally block important pages.
- Review your robots.txt periodically. As your site grows and your CMS changes, outdated rules can block new sections or allow paths that should be restricted. Audit the file at least quarterly.
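Several of these practices can be checked automatically before a deploy. The sketch below is one possible pre-flight check, not this tool's implementation: `lint_response` is a hypothetical name, the 500KB limit is interpreted as 500 × 1024 bytes (an assumption), and the HTML test is a simple heuristic on the start of the body.

```python
MAX_BYTES = 500 * 1024   # assumed reading of the 500KB limit


def lint_response(content_type: str, body: bytes) -> list:
    """Return a list of problems found in a robots.txt HTTP response."""
    problems = []
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"unexpected Content-Type: {media_type or '(missing)'}")
    if len(body) > MAX_BYTES:
        problems.append("file exceeds the size limit; trailing rules may be ignored")
    # Heuristic: an HTML error page usually starts with a doctype or <html> tag.
    head = body[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        problems.append("body looks like HTML, not robots.txt")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems


if __name__ == "__main__":
    print(lint_response("text/plain; charset=utf-8", b"User-agent: *\nDisallow:\n"))
    print(lint_response("text/html", b"<!DOCTYPE html><html><body>Not found</body></html>"))
```

Running a check like this in a deployment pipeline catches the common case where a reverse proxy or custom 404 page silently replaces the real file.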
Where Can You Learn More About robots.txt?
RFC 9309 — Robots Exclusion Protocol
The official IETF internet standard (published June 2022) defining robots.txt syntax, file access rules, precedence logic, and crawler behavior. The authoritative reference for all robots.txt implementations.
Google Robots.txt Specification
Google's implementation details including supported directives (User-agent, Allow, Disallow, Sitemap only), the 500KB file size limit, and wildcard pattern matching.
Bing Robots.txt Guide
Bing's official guide covering Crawl-delay support, BingBot-specific behavior, and how Bing processes robots.txt differently from Google.
Google Sitemaps Documentation
Best practices for XML sitemaps, sitemap indexes, the Sitemap directive in robots.txt, and how to submit sitemaps to Google Search Console.
Yandex Robots.txt Documentation
Yandex's robots.txt guide covering Host directive support, Clean-param, and Yandex-specific extensions not available in other search engines.
robotstxt.org
The original community resource for the Robots Exclusion Protocol, including the historical 1994 specification and practical usage examples.
What Other Tools Help With Crawl Configuration?
DNS Inspector
Look up DNS records (A, MX, TXT, NS, CNAME) for any domain.
HTTP Security Headers
Analyze a site's HTTP security headers and get a letter grade.
Security Scanner
Check if your domain is flagged for malware or phishing across 17 vendors.
WHOIS Lookup
Look up domain registration details, registrar, and expiry dates.
On-Page SEO Checker
Audit 70+ on-page SEO factors including robots.txt access, meta tags, and structured data.
Need this in code?
Every check this tool runs is also available via the robots.txt API with examples in cURL, JavaScript, Python, PHP, Ruby, and Java.