DNS Checker(beta)

How to Identify and Manage Web Crawlers: A Sysadmin's Guide to robots.txt, AI Bots, and SEO Crawlers

Ishan Karunaratne

Software Architect & Infrastructure Engineer

Last month I got a frantic message from a colleague: "Someone is hammering our production server — thousands of requests per minute from a single IP. Should I file an abuse report?" I looked up the IP with our IP Location tool, checked the access logs, and found the User-Agent string: Mozilla/5.0 (compatible; Screaming Frog SEO Spider/21.0). The marketing team had kicked off a full-site audit without telling IT.

This happens more than you would think. Sysadmins see an IP generating massive traffic, look it up, and immediately reach for the abuse report template from my complete guide to reporting IP abuse. But not every heavy hitter is malicious. Web crawlers, from SEO audit tools to AI training bots to search engine indexers, can generate traffic patterns that look almost identical to a denial-of-service attack. Filing an abuse report against Googlebot or your own company's SEO tool is not just embarrassing; it wastes the abuse desk's time and your own.

This guide covers how to identify what is actually crawling your server, the full landscape of crawlers you will encounter in 2025, how to manage them with robots.txt and server-level controls, and a decision framework for when to block, allow, or escalate to an abuse report.


How to Identify What Is Crawling Your Server

The single most important step before taking any action is checking the User-Agent string in your access logs. Legitimate crawlers identify themselves (or at least attempt to), so the User-Agent tells you who claims to be responsible. Treat it as a lead to verify, not as proof.

Extracting User-Agents from Access Logs

Nginx/Apache combined log format:

# List all unique User-Agents, sorted by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -30

# Find all requests from a specific IP and show their User-Agents
awk -v ip="198.51.100.47" '$1 == ip {split($0, a, "\""); print a[6]}' /var/log/nginx/access.log | sort -u

# Count requests per IP per hour (find the heavy hitters)
awk '{print $1, substr($4,2,14)}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

Quick one-liner to find bot traffic:

# Count User-Agents containing "bot", "crawler", "spider", "scraper", or "fetcher"
# (extract the User-Agent first so request paths such as /robots.txt do not match)
awk -F'"' '{print $6}' /var/log/nginx/access.log | grep -iE '(bot|crawler|spider|scraper|fetcher)' | sort | uniq -c | sort -rn
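
For logs that need more than a one-liner, the same extraction can be sketched in Python. This is a minimal sketch that assumes the standard combined log format; the names `LOG_LINE`, `BOT_HINT`, and `count_bot_agents`, along with the sample lines, are illustrative and not part of any tool mentioned in this guide.

```python
import re
from collections import Counter

# Combined log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
BOT_HINT = re.compile(r"bot|crawler|spider|scraper|fetcher", re.IGNORECASE)


def count_bot_agents(lines):
    """Count requests per User-Agent, keeping only agents that look like bots."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Only the User-Agent field is tested, so a path like /robots.txt cannot match.
        if match and BOT_HINT.search(match.group("agent")):
            counts[match.group("agent")] += 1
    return counts


# Made-up sample lines using documentation IP ranges.
sample = [
    '198.51.100.47 - - [10/Oct/2025:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0"',
    '198.51.100.47 - - [10/Oct/2025:13:55:37 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '203.0.113.9 - - [10/Oct/2025:13:55:38 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
]

for agent, hits in count_bot_agents(sample).most_common():
    print(hits, agent)
```

To run it against a real log, pass an open file object instead of `sample`, for example `count_bot_agents(open("/var/log/nginx/access.log"))`.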

Verifying Crawler Identity with IP Lookup

User-Agent strings can be spoofed. Anyone can send a request claiming to be Googlebot. To verify a crawler is genuine:

  1. Look up the IP with DNS Checker's IP Location tool — check if the owning organization matches the claimed crawler. A "Googlebot" request from an IP owned by a residential ISP in Romania is fake.

  2. Reverse DNS verification — legitimate search engine crawlers have verifiable PTR records:

# Step 1: Reverse DNS lookup
host 66.249.66.1
# Returns: crawl-66-249-66-1.googlebot.com

# Step 2: Forward DNS to confirm
host crawl-66-249-66-1.googlebot.com
# Returns: 66.249.66.1  ← matches, this is genuine Googlebot

# Combined verification script
dig -x 198.51.100.47 +short | xargs -I{} dig {} +short

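This two-step verification (reverse lookup, then forward lookup to confirm the IP matches) is the gold standard for confirming Google, Bing, and other major search engine crawlers. Both halves matter: the PTR hostname must fall under the crawler's own domain (googlebot.com or google.com for Googlebot), and the forward lookup must return the original IP. A forward match alone proves nothing, because anyone who controls reverse DNS for their own IP range can point a PTR record at a hostname they also control.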


SEO Crawlers: The Usual Suspects

SEO crawlers are the most common source of "is this an attack?" false alarms. They hit your server hard because they are designed to crawl entire sites quickly — sometimes thousands of pages in minutes.

| Crawler | User-Agent Contains | Operator | Typical Behavior |
|---|---|---|---|
| Screaming Frog | Screaming Frog SEO Spider | Desktop tool (run internally) | Aggressive, runs from your office IP or employee laptop |
| AhrefsBot | AhrefsBot | Ahrefs | Continuous backlink crawling, 5-10 req/sec |
| SemrushBot | SemrushBot | Semrush | SEO auditing and competitive analysis |
| DotBot | DotBot | Moz | Backlink and domain analysis |
| MJ12bot | MJ12bot | Majestic | Large-scale link index building |
| Sitebulb | Sitebulb | Desktop tool | Technical SEO auditing |
| Lumar | DeepCrawl or Lumar | Lumar (formerly DeepCrawl) | Enterprise site crawling |
| Botify | botify | Botify | Enterprise SEO platform |

How to Tell Internal vs. External Crawlers

The critical distinction is whether someone in your organization triggered the crawl:

  • Screaming Frog, Sitebulb — these are desktop applications. If the source IP is your office network or a VPN exit node, someone on your team is running it. Talk to your marketing or SEO team before filing anything.
  • AhrefsBot, SemrushBot, MJ12bot — these are third-party services crawling the open web. They typically respect robots.txt and have published IP ranges. If their crawl rate is causing problems, robots.txt or rate limiting is the appropriate response, not an abuse report.
  • Botify, Lumar — enterprise tools that may be run by an agency working for your company. Check with whoever manages your SEO.
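
The internal-versus-external check can be automated with a short lookup against your own address ranges. This is a minimal sketch: the networks in `INTERNAL_NETWORKS` are placeholders (a documentation range and a private range), and `is_internal` is a hypothetical helper name. Substitute your real office and VPN egress networks.

```python
import ipaddress

# Placeholder ranges: replace with your real office / VPN egress networks.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),  # example office egress range
    ipaddress.ip_network("10.0.0.0/8"),      # example private range behind a VPN
]


def is_internal(ip_string):
    """Return True if the address falls inside any known internal network."""
    address = ipaddress.ip_address(ip_string)
    return any(address in network for network in INTERNAL_NETWORKS)


print(is_internal("203.0.113.25"))   # inside the example office range
print(is_internal("198.51.100.47"))  # outside every listed range
```

If the source of a Screaming Frog or Sitebulb crawl comes back as internal, that is the cue to talk to your own team before doing anything else.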

AI Crawlers: The New Heavy Hitters

AI crawlers are the fastest-growing category of bot traffic and the most likely to trigger abuse report instincts. They hit harder than traditional search engine crawlers because many operate at aggressive rates and lack politeness delays, and some ignore robots.txt entirely.

| Crawler | User-Agent Contains | Operator | Purpose |
|---|---|---|---|
| GPTBot | GPTBot | OpenAI | Training data collection |
| ChatGPT-User | ChatGPT-User | OpenAI | Real-time browsing for ChatGPT |
| OAI-SearchBot | OAI-SearchBot | OpenAI | SearchGPT web search |
| Google-Extended | Google-Extended (robots.txt token only; Google fetches with its regular crawler User-Agents, so this string does not appear in access logs) | Google | Gemini AI training |
| ClaudeBot | ClaudeBot | Anthropic | Claude AI training |
| anthropic-ai | anthropic-ai | Anthropic | Anthropic web fetcher |
| Claude-Web | Claude-Web | Anthropic | Claude web access |
| Meta-ExternalAgent | Meta-ExternalAgent | Meta | AI training crawling |
| FacebookBot | FacebookBot | Meta | Link previews and AI |
| Applebot-Extended | Applebot-Extended (robots.txt token only; the crawling itself is done by Applebot) | Apple | Apple Intelligence training |
| Amazonbot | Amazonbot | Amazon | Alexa and AI services |
| PerplexityBot | PerplexityBot | Perplexity | AI search engine |
| Bytespider | Bytespider | ByteDance | TikTok/AI training |
| CCBot | CCBot | Common Crawl | Open dataset for AI training |
| Diffbot | Diffbot | Diffbot | Web data extraction |
| YouBot | YouBot | You.com | AI search engine |
| Omgilibot | Omgilibot | Webz.io | Data-as-a-service for AI |

Why AI Crawlers Hit Harder Than Search Engines

Traditional search engine crawlers like Googlebot are designed for long-term relationships with websites. They implement polite crawl rates, back off when servers return 429 or 503 status codes, and adjust their crawl budget based on server responsiveness.

Many AI crawlers operate differently:

  • Aggressive crawl rates — some AI crawlers send 50-100+ requests per second without built-in politeness delays
  • Full-site scraping — instead of indexing a few pages, they try to download everything for training data
  • Ignoring signals — some crawlers do not respect Crawl-delay directives or slow down in response to HTTP 429 responses
  • Bandwidth impact — a single AI crawler can consume 10-50 GB of bandwidth per day on a medium-sized site, compared to 1-5 GB for Googlebot

I have seen AI crawlers generate enough traffic to cause visible CPU spikes on production servers. But this is not malicious behavior — it is aggressive but legitimate crawling. The correct response is robots.txt or server-level rate limiting, not an abuse report. If a crawler continues after you have blocked it in robots.txt and at the server level, that is a different story — and that does warrant an abuse report.
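
Before deciding how hard to push back, it helps to measure what each crawler actually costs you. The sketch below totals requests and response bytes per crawler from combined-format log lines. The token list is a small illustrative subset of the crawlers named in this guide, and `TAIL` and `traffic_by_crawler` are hypothetical names.

```python
import re
from collections import defaultdict

# A few tokens from this guide; extend the list to suit your own logs.
CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "Bytespider", "CCBot", "Googlebot", "AhrefsBot"]

# Status, response size, referer, and User-Agent at the end of a combined-format line.
TAIL = re.compile(r'" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"$')


def traffic_by_crawler(lines):
    """Return {token: [request_count, total_bytes]} for known crawler tokens."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = TAIL.search(line.rstrip("\n"))
        if not match:
            continue  # skip lines that are not in combined format
        _status, size, agent = match.groups()
        for token in CRAWLER_TOKENS:
            if token.lower() in agent.lower():
                totals[token][0] += 1
                totals[token][1] += 0 if size == "-" else int(size)
                break  # count each request once
    return dict(totals)


# Made-up sample lines for illustration.
sample_lines = [
    '192.0.2.10 - - [10/Oct/2025:13:55:36 +0000] "GET /a HTTP/1.1" 200 1000 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '192.0.2.10 - - [10/Oct/2025:13:55:37 +0000] "GET /b HTTP/1.1" 200 500 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '192.0.2.77 - - [10/Oct/2025:13:55:38 +0000] "GET /c HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for token, (requests, size) in traffic_by_crawler(sample_lines).items():
    print(f"{token}: {requests} requests, {size} bytes")
```

Run over a day of logs, this gives you your own numbers to compare against the rough bandwidth figures above.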


Search Engine Crawlers (Do Not Block These)

Search engine crawlers drive your organic traffic. Blocking them means your site disappears from search results.

CrawlerUser-Agent ContainsOperatorVerification Method
GooglebotGooglebotGoogleReverse DNS → *.googlebot.com or *.google.com
BingbotbingbotMicrosoftReverse DNS → *.search.msn.com
YandexBotYandexBotYandexReverse DNS → *.yandex.com or *.yandex.net
BaiduspiderBaiduspiderBaiduReverse DNS → *.baidu.com or *.baidu.jp
DuckDuckBotDuckDuckBotDuckDuckGoPublished IP ranges

Verification Technique

Always verify before trusting a search engine User-Agent string. The reverse-then-forward DNS technique catches impersonators:

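# Verify Googlebot
ip="66.249.66.1"
ptr=$(dig -x "$ip" +short)
echo "PTR: $ptr"
# Should be: crawl-66-249-66-1.googlebot.com.

fwd=$(dig "$ptr" +short)
echo "Forward: $fwd"
# Should match the original IP: 66.249.66.1

# Genuine only if the PTR is under googlebot.com or google.com
# AND the forward lookup returns the original IP; anything else is fake
case "$ptr" in
    *.googlebot.com.|*.google.com.) printf '%s\n' "$fwd" | grep -qxF "$ip" && echo "genuine" || echo "fake" ;;
    *) echo "fake" ;;
esac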

If you are seeing high traffic from a "Googlebot" that fails this verification, that is an impersonator and worth investigating further. Fake Googlebot traffic can indicate scraping, competitive intelligence gathering, or actual malicious activity.
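
The same reverse-then-forward check can be wrapped in a reusable function. This is a sketch under a few assumptions: the hostname suffixes are copied from the verification table above, the function and variable names are illustrative, and the resolver functions are passed in as parameters so the logic can be exercised offline with stub data. DuckDuckBot is left out because it is verified against published IP ranges, not DNS.

```python
import socket

# Hostname suffixes copied from the verification table above.
TRUSTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
    "YandexBot": (".yandex.com", ".yandex.net"),
    "Baiduspider": (".baidu.com", ".baidu.jp"),
}


def system_reverse(ip):
    """PTR lookup through the system resolver; None if there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def system_forward(hostname):
    """Set of addresses the hostname resolves to; empty on failure."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return set()


def verify_crawler(ip, crawler, reverse=system_reverse, forward=system_forward):
    """True only if the PTR is in a trusted domain AND resolves back to ip."""
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(TRUSTED_SUFFIXES.get(crawler, ())):
        return False  # PTR is not under the crawler's own domain
    return ip in forward(hostname)


# Offline demonstration with stub resolvers (made-up data, no network needed).
stub_ptr = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com.",
    "198.51.100.47": "crawl.googlebot.com.scraper.example.",
}
stub_a = {"crawl-66-249-66-1.googlebot.com": {"66.249.66.1"}}

for candidate in ("66.249.66.1", "198.51.100.47"):
    ok = verify_crawler(candidate, "Googlebot", stub_ptr.get, lambda h: stub_a.get(h, set()))
    print(candidate, "genuine" if ok else "fake")
```

Calling `verify_crawler(ip, "Googlebot")` without the last two arguments uses live DNS. The suffix check is what stops a look-alike hostname such as the second stub entry from passing.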


How to Manage Crawlers with robots.txt

The robots.txt file is the standard mechanism for telling crawlers what they can and cannot access on your site. It lives at the root of your domain (e.g., https://example.com/robots.txt) and uses a simple directive syntax.

robots.txt Basics

# Allow all crawlers to access everything (default if no robots.txt exists)
User-agent: *
Allow: /

# Block a specific crawler from everything
User-agent: GPTBot
Disallow: /

# Block a crawler from specific paths
User-agent: SemrushBot
Disallow: /admin/
Disallow: /api/
Allow: /

# Set a crawl rate (seconds between requests)
User-agent: AhrefsBot
Crawl-delay: 10

Key directives:

  • User-agent — which crawler this rule applies to (* means all)
  • Disallow — paths the crawler should not access
  • Allow — paths the crawler can access (overrides Disallow for more specific paths)
  • Crawl-delay — seconds to wait between requests (not supported by all crawlers)
  • Sitemap — URL of your XML sitemap
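
Before deploying a robots.txt, you can check how a parser reads it. The sketch below feeds the example rules above to Python's standard-library `urllib.robotparser`. One caveat: this parser applies the first matching rule within a group, while some crawlers (Google among them) apply the most specific rule, so results can differ for overlapping Allow and Disallow paths.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this section, without the comments.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: SemrushBot
Disallow: /admin/
Disallow: /api/
Allow: /

User-agent: AhrefsBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("GPTBot", "/pricing"),
    ("SemrushBot", "/admin/users"),
    ("SemrushBot", "/blog/"),
    ("Googlebot", "/pricing"),
]
for agent, path in checks:
    allowed = parser.can_fetch(agent, "https://example.com" + path)
    print(f"{agent:12} {path:14} {'allowed' if allowed else 'blocked'}")

print("AhrefsBot crawl delay:", parser.crawl_delay("AhrefsBot"))
```

This only tells you what a compliant crawler should do with the file. It says nothing about whether a given crawler actually complies.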

Block All AI Crawlers

If you want to opt out of AI crawling entirely (training bots and AI search or browsing agents alike) while keeping search engines and SEO tools working:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: Omgilibot
Disallow: /

# Allow search engines (default: everything allowed)
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

Sitemap: https://example.com/sitemap.xml
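
Because this list changes often, it can be easier to generate the file from a single list than to edit it by hand. This is a minimal sketch: `AI_CRAWLER_TOKENS` repeats the tokens from the block above, `build_robots_txt` is a hypothetical helper, and the catch-all `User-agent: *` group stands in for the explicit search engine entries.

```python
# Tokens copied from the robots.txt example above.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "Google-Extended", "ClaudeBot",
    "anthropic-ai", "Claude-Web", "Meta-ExternalAgent", "Applebot-Extended",
    "Amazonbot", "PerplexityBot", "Bytespider", "CCBot", "Diffbot", "YouBot",
    "Omgilibot",
]


def build_robots_txt(blocked_tokens, sitemap_url=None):
    """Render one Disallow group per blocked token, then allow everyone else."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in blocked_tokens]
    groups.append("User-agent: *\nAllow: /")
    text = "\n\n".join(groups) + "\n"
    if sitemap_url:
        text += f"\nSitemap: {sitemap_url}\n"
    return text


print(build_robots_txt(AI_CRAWLER_TOKENS, "https://example.com/sitemap.xml"))
```

Keeping the token list in version control also gives you a history of which crawlers you blocked and when.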

Rate-Limit Aggressive SEO Crawlers

If you do not want to block SEO crawlers entirely but need them to slow down:

User-agent: AhrefsBot
Crawl-delay: 10

User-agent: SemrushBot
Crawl-delay: 10

User-agent: MJ12bot
Crawl-delay: 30

User-agent: DotBot
Crawl-delay: 10

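Crawl-delay: 10 tells the crawler to wait 10 seconds between requests. Support varies: Bingbot honors Crawl-delay, but Googlebot ignores it. Google also retired the Search Console crawl-rate setting in early 2024, so the documented way to slow Googlebot is to temporarily return 429 or 503 responses.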

Limitations of robots.txt

This is the most important thing to understand: robots.txt is voluntary. It is a convention, not a technical enforcement mechanism. Well-behaved crawlers obey it; malicious scrapers and some aggressive bots ignore it entirely.

robots.txt does not:

  • Prevent access (it is publicly readable at /robots.txt, so it can actually reveal paths you might prefer to keep less visible)
  • Enforce rate limits (a crawler that ignores the file will ignore Crawl-delay too)
  • Block direct URL requests (users and bots can still access Disallowed URLs directly)

When robots.txt is not enough, you need server-level controls.


Beyond robots.txt: Server-Level Controls

When a crawler ignores your robots.txt or you need hard enforcement, server-level controls are the answer.

Nginx Rate Limiting by User-Agent

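# Build the rate-limit key: empty for normal visitors, client IP for known aggressive crawlers.
# nginx does not count requests whose key is empty, so only the listed bots are limited.
map $http_user_agent $bot_limit_key {
    default             "";
    "~*AhrefsBot"       $binary_remote_addr;
    "~*SemrushBot"      $binary_remote_addr;
    "~*MJ12bot"         $binary_remote_addr;
    "~*GPTBot"          $binary_remote_addr;
    "~*ClaudeBot"       $binary_remote_addr;
    "~*Bytespider"      $binary_remote_addr;
    "~*CCBot"           $binary_remote_addr;
    "~*PerplexityBot"   $binary_remote_addr;
}

limit_req_zone $bot_limit_key zone=bot_limit:10m rate=1r/s;

server {
    location / {
        # Applies only to requests with a non-empty $bot_limit_key
        limit_req zone=bot_limit burst=5 nodelay;
        limit_req_status 429;
        # ... your normal config
    }
}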
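
To build intuition for what rate=1r/s with burst=5 nodelay means in practice, here is a simplified leaky-bucket model in Python. It is only an approximation written for illustration, not nginx's actual implementation, and the class name `BotRateLimiter` is made up.

```python
class BotRateLimiter:
    """Simplified leaky-bucket model: steady rate plus a burst allowance."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.burst = burst
        self.excess = 0.0       # requests currently held above the steady rate
        self.last_seen = None   # timestamp of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last_seen is None:
            self.last_seen = now
            return True
        elapsed = now - self.last_seen
        # The bucket drains at `rate` per second; each request adds one unit.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False  # over the burst allowance: rejected
        self.excess = excess
        self.last_seen = now
        return True


limiter = BotRateLimiter(rate_per_second=1, burst=5)
accepted = sum(limiter.allow(now=0.0) for _ in range(10))
print("accepted out of 10 simultaneous requests:", accepted)
print("one second later:", limiter.allow(now=1.0), limiter.allow(now=1.0))
```

In this model, a bot that sends 10 requests at once gets 6 through (the first request plus the burst of 5) and the rest are rejected. After that it is held to roughly one request per second.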

Apache Rate Limiting

# Block specific crawlers entirely
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [F,L]

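# Or, as an alternative to the 403 rule above, return 429 Too Many Requests to every
# matching request (a soft block; these rules match on User-Agent and do not measure request rate)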
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [R=429,L]

Cloudflare Bot Management

If you use Cloudflare, their Bot Management and WAF rules provide the easiest server-level controls:

  1. Bot Fight Mode (Free plans) and Super Bot Fight Mode (Pro plans and above): automatically challenge or block traffic that Cloudflare categorizes as "automated"
  2. WAF Custom Rules — create rules that match specific User-Agent strings and block, challenge, or rate-limit them
  3. AI Bot blocking — Cloudflare added a one-click toggle to block known AI crawlers in 2024

Example Cloudflare WAF rule expression:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot")

Set the action to "Block" or "Managed Challenge" depending on how aggressively you want to handle these.

Firewall Rules for Known CIDR Ranges

Some AI crawlers publish their IP ranges, which lets you block at the network level regardless of User-Agent:

# OpenAI published ranges (check their documentation for current ranges)
# Block GPTBot at the firewall
sudo ufw deny from 20.15.240.0/24 comment "OpenAI GPTBot"
sudo ufw deny from 20.15.241.0/24 comment "OpenAI GPTBot"

# Or with iptables
iptables -A INPUT -s 20.15.240.0/24 -j DROP -m comment --comment "OpenAI GPTBot"

When to use firewall-level blocks:

  • The crawler ignores robots.txt
  • The crawler spoofs its User-Agent to bypass User-Agent-based rules
  • You need to block before the request even reaches your web server (saves CPU)
  • The crawler is generating enough traffic to impact server performance
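
When an operator publishes many ranges, typing firewall rules by hand invites typos. The sketch below validates a list of CIDR blocks, merges adjacent ones, and prints the matching ufw commands for review. The function name is hypothetical and the two ranges are simply the example values used above, so check the operator's current documentation before applying anything.

```python
import ipaddress


def ufw_deny_commands(cidrs, label):
    """Validate CIDR strings, merge adjacent blocks, and render ufw deny rules."""
    # ip_network raises ValueError for malformed input or stray host bits.
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    merged = ipaddress.collapse_addresses(networks)
    return [f'sudo ufw deny from {network} comment "{label}"' for network in merged]


# Example ranges from this section; verify against current published ranges.
for command in ufw_deny_commands(["20.15.240.0/24", "20.15.241.0/24"], "OpenAI GPTBot"):
    print(command)
```

The two adjacent /24 blocks collapse into a single /23 rule, which keeps the rule set shorter. The script only prints commands, so nothing changes on the firewall until you have reviewed and run them.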

The Decision Framework: Block, Allow, or Report?

Here is the framework I use when I see an IP generating heavy traffic on a server:

Step 1: Identify the Source

Check the User-Agent in your access logs and look up the IP with the IP Location tool. This immediately tells you if it is a known crawler, which organization owns the IP, and whether the User-Agent matches the IP owner.

Step 2: Classify the Crawler

Based on what you find:

| Situation | Action | Details |
|---|---|---|
| Known search engine (Googlebot, Bingbot, etc.) | Allow | These drive your traffic. If Bingbot is crawling too aggressively, adjust crawl control in Bing Webmaster Tools. Google retired the Search Console crawl-rate setting in early 2024, so slow Googlebot by temporarily returning 429 or 503 responses. Do not file an abuse report. |
| SEO tool from your own company | Talk to your team | Screaming Frog, Sitebulb, or a Botify crawl triggered by your marketing or SEO team is internal traffic. Coordinate crawl schedules to avoid impacting production. Definitely do not file an abuse report. |
| Third-party SEO crawler (AhrefsBot, SemrushBot) | Rate-limit or block via robots.txt | These respect robots.txt. Set a Crawl-delay or block specific paths. An abuse report is not appropriate here. |
| AI training crawler (GPTBot, ClaudeBot, CCBot) | Block via robots.txt | Do this if you do not want your content used for AI training. Most AI crawlers respect robots.txt. If they do not, escalate to server-level blocks. |
| Crawler ignoring robots.txt | Server-level block | Use Nginx/Apache rules or firewall rules. If the behavior is persistent and the operator is unresponsive to contact, this may warrant an abuse report. |
| Unknown bot with malicious behavior patterns | Investigate and potentially report | If the traffic does not have a legitimate crawler User-Agent, the IP does not belong to a known crawling service, and the behavior looks like scraping, DDoS, or reconnaissance, that is when you file an abuse report using the complete guide to reporting IP abuse. |
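
The classification above can also be written down as a small function, which is handy if you want to annotate alerts automatically. This is only a sketch of the decision logic: the `Observation` fields, the category names, and the "keep monitoring" fallback are my own illustrative choices, and the check for an unverified search engine User-Agent reflects the impersonation warning from the verification section.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    category: str  # "search", "seo_desktop", "seo_third_party", "ai", or "unknown"
    identity_verified: bool = False      # passed reverse-then-forward DNS
    internal_source: bool = False        # source IP is your office or VPN
    ignores_robots_txt: bool = False
    operator_unresponsive: bool = False
    looks_malicious: bool = False        # scraping, DDoS, or reconnaissance patterns


def recommend(obs):
    """Map an observation to the action suggested by the decision framework."""
    if obs.internal_source:
        return "talk to your team"
    if obs.category == "search":
        return "allow" if obs.identity_verified else "investigate impersonation"
    if obs.ignores_robots_txt:
        if obs.operator_unresponsive:
            return "server-level block; consider an abuse report"
        return "server-level block"
    if obs.category in ("seo_third_party", "ai"):
        return "rate-limit or block via robots.txt"
    if obs.looks_malicious:
        return "investigate; consider an abuse report"
    return "keep monitoring"  # fallback added for this sketch


print(recommend(Observation("seo_desktop", internal_source=True)))
print(recommend(Observation("ai", ignores_robots_txt=True)))
```

Treat the output as a prompt for a human decision. An abuse report in particular should always be reviewed by a person first.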

When an Abuse Report Is Warranted

To be clear, there are situations where crawling behavior does cross the line into abuse:

  • The crawler is ignoring robots.txt and you have contacted the operator without response
  • The traffic is causing service degradation and the operator will not implement rate limiting
  • The crawler is spoofing User-Agent strings to impersonate legitimate bots
  • The IP does not belong to any known crawling service and the behavior is clearly automated scraping
  • The "crawler" is actually performing port scanning or brute-force attacks alongside its requests

Should You Block AI Crawlers? Pros and Cons

This is the question every sysadmin is wrestling with in 2025. There is no universally right answer — it depends on your organization's priorities.

Reasons to Block

  • Bandwidth and server load — AI crawlers can consume significant resources. If you are paying for bandwidth, this has a direct cost.
  • Content used without consent — your content may be used to train models that compete with you or generate derivative content without attribution.
  • No direct benefit — unlike search engine crawlers, AI training crawlers do not drive traffic to your site.
  • Legal uncertainty — the legal status of AI training on web content is still being litigated in multiple jurisdictions.

Reasons to Allow

  • AI search visibility — Perplexity, ChatGPT with browsing, and Google AI Overviews are becoming significant traffic sources. Blocking their crawlers may reduce your visibility in AI-powered search.
  • Future discoverability — as AI becomes a primary way people find information, content that has been crawled and indexed by AI systems may have better long-term reach.
  • Selective blocking is possible — you can block training crawlers (GPTBot, Google-Extended) while allowing search-adjacent ones (ChatGPT-User, OAI-SearchBot, PerplexityBot).

My Recommendation

Block the pure training crawlers (GPTBot, Google-Extended, ClaudeBot, Bytespider, CCBot, Meta-ExternalAgent) via robots.txt. Consider allowing the search-adjacent crawlers (ChatGPT-User, PerplexityBot, OAI-SearchBot) if AI search traffic matters to your business. Monitor your access logs monthly to stay on top of new crawlers as they appear — this landscape changes fast.


This article was researched and structured by the author with AI assistance for drafting and technical verification.

About the Author

Ishan Karunaratne

Software Architect & Infrastructure Engineer

US Army veteran with a B.S. in Information Technology, CompTIA A+, Network+, and Security+ certified. 20+ years building and securing web infrastructure.

  • B.S. Information Technology (Online Systems)
  • CompTIA A+ (2009)
  • CompTIA Network+ (2009)
  • CompTIA Security+ (2009)
  • US Army Veteran (Operation Iraqi Freedom)
