Last month I got a frantic message from a colleague: "Someone is hammering our production server — thousands of requests per minute from a single IP. Should I file an abuse report?" I looked up the IP with our IP Location tool, checked the access logs, and found the User-Agent string: Mozilla/5.0 (compatible; Screaming Frog SEO Spider/21.0). The marketing team had kicked off a full-site audit without telling IT.
This happens more than you would think. Sysadmins see an IP generating massive traffic, look it up, and immediately reach for the abuse report template from my complete guide to reporting IP abuse. But not every heavy hitter is malicious. Web crawlers — from SEO audit tools to AI training bots to search engine indexers — can generate traffic patterns that look almost identical to a denial-of-service attack. Filing an abuse report against Googlebot or your own company's SEO tool is not just embarrassing; it wastes the ISP's abuse desk time and your own.
This guide covers how to identify what is actually crawling your server, the full landscape of crawlers you will encounter in 2025, how to manage them with robots.txt and server-level controls, and a decision framework for when to block, allow, or escalate to an abuse report.
How to Identify What Is Crawling Your Server
The single most important step before taking any action is checking the User-Agent string in your access logs. Every legitimate crawler identifies itself (or at least attempts to), and the User-Agent tells you exactly who is responsible.
Extracting User-Agents from Access Logs
Nginx/Apache combined log format:
# List all unique User-Agents, sorted by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -30
# Find all requests from a specific IP and show their User-Agents
awk -v ip="198.51.100.47" '$1 == ip {split($0, a, "\""); print a[6]}' /var/log/nginx/access.log | sort -u
# Count requests per IP per hour (find the heavy hitters)
awk '{print $1, substr($4,2,14)}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
Quick one-liner to find bot traffic:
# Extract all requests with "bot" or "crawler" or "spider" in User-Agent
grep -iE '(bot|crawler|spider|scraper|fetcher)' /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
Verifying Crawler Identity with IP Lookup
User-Agent strings can be spoofed. Anyone can send a request claiming to be Googlebot. To verify a crawler is genuine:
- Look up the IP with DNSChkr's IP Location tool — check if the owning organization matches the claimed crawler. A "Googlebot" request from an IP owned by a residential ISP in Romania is fake.
- Reverse DNS verification — legitimate search engine crawlers have verifiable PTR records:
# Step 1: Reverse DNS lookup
host 66.249.66.1
# Returns: crawl-66-249-66-1.googlebot.com
# Step 2: Forward DNS to confirm
host crawl-66-249-66-1.googlebot.com
# Returns: 66.249.66.1 ← matches, this is genuine Googlebot
# Combined verification script
dig -x 198.51.100.47 +short | xargs -I{} dig {} +short
This two-step verification (reverse lookup, then forward lookup to confirm the IP matches) is the gold standard for confirming Google, Bing, and other major search engine crawlers.
SEO Crawlers: The Usual Suspects
SEO crawlers are the most common source of "is this an attack?" false alarms. They hit your server hard because they are designed to crawl entire sites quickly — sometimes thousands of pages in minutes.
| Crawler | User-Agent Contains | Operator | Typical Behavior |
|---|---|---|---|
| Screaming Frog | Screaming Frog SEO Spider | Desktop tool (run internally) | Aggressive, runs from your office IP or employee laptop |
| AhrefsBot | AhrefsBot | Ahrefs | Continuous backlink crawling, 5-10 req/sec |
| SemrushBot | SemrushBot | Semrush | SEO auditing and competitive analysis |
| DotBot | DotBot | Moz | Backlink and domain analysis |
| MJ12bot | MJ12bot | Majestic | Large-scale link index building |
| Sitebulb | Sitebulb | Desktop tool | Technical SEO auditing |
| Lumar | DeepCrawl or Lumar | Lumar (formerly DeepCrawl) | Enterprise site crawling |
| Botify | botify | Botify | Enterprise SEO platform |
How to Tell Internal vs. External Crawlers
The critical distinction is whether someone in your organization triggered the crawl:
- Screaming Frog, Sitebulb — these are desktop applications. If the source IP is your office network or a VPN exit node, someone on your team is running it. Talk to your marketing or SEO team before filing anything.
- AhrefsBot, SemrushBot, MJ12bot — these are third-party services crawling the open web. They typically respect robots.txt and have published IP ranges. If their crawl rate is causing problems, robots.txt or rate limiting is the appropriate response, not an abuse report.
- Botify, Lumar — enterprise tools that may be run by an agency working for your company. Check with whoever manages your SEO.
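A quick way to make the internal-vs-external call is to compare your top talkers against your office egress range. A minimal sketch, assuming the combined log format; the `203.0.113.` prefix and the log path are placeholders to substitute with your own:

```shell
# top_talkers LOGFILE OFFICE_PREFIX - list the ten busiest client IPs and
# flag the ones coming from your own office network
top_talkers() {
  awk '{print $1}' "$1" | sort | uniq -c | sort -rn | head -10 |
  while read -r count ip; do
    case "$ip" in
      "$2"*) origin="INTERNAL - ask your SEO team before reporting" ;;
      *)     origin="external" ;;
    esac
    printf '%8s %-16s %s\n' "$count" "$ip" "$origin"
  done
}

# 203.0.113. is a placeholder; substitute your office/VPN egress prefix
if [ -r /var/log/nginx/access.log ]; then
  top_talkers /var/log/nginx/access.log "203.0.113."
fi
```

A plain prefix match is deliberately crude — it keeps the check dependency-free, but if your office egress is not prefix-aligned you will want a proper CIDR-matching tool instead.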
AI Crawlers: The New Heavy Hitters
AI crawlers are the fastest-growing category of bot traffic and the most likely to trigger abuse report instincts. They hit harder than traditional search engine crawlers because many operate at aggressive rates, lack politeness delays, and some ignore robots.txt entirely.
| Crawler | User-Agent Contains | Operator | Purpose |
|---|---|---|---|
| GPTBot | GPTBot | OpenAI | Training data collection |
| ChatGPT-User | ChatGPT-User | OpenAI | Real-time browsing for ChatGPT |
| OAI-SearchBot | OAI-SearchBot | OpenAI | SearchGPT web search |
| Google-Extended | Google-Extended | Google | Gemini AI training |
| ClaudeBot | ClaudeBot | Anthropic | Claude AI training |
| anthropic-ai | anthropic-ai | Anthropic | Anthropic web fetcher |
| Claude-Web | Claude-Web | Anthropic | Claude web access |
| Meta-ExternalAgent | Meta-ExternalAgent | Meta | AI training crawling |
| FacebookBot | FacebookBot | Meta | Link previews and AI |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training |
| Amazonbot | Amazonbot | Amazon | Alexa and AI services |
| PerplexityBot | PerplexityBot | Perplexity | AI search engine |
| Bytespider | Bytespider | ByteDance | TikTok/AI training |
| CCBot | CCBot | Common Crawl | Open dataset for AI training |
| Diffbot | Diffbot | Diffbot | Web data extraction |
| YouBot | YouBot | You.com | AI search engine |
| Omgilibot | Omgilibot | Webz.io | Data-as-a-service for AI |
Why AI Crawlers Hit Harder Than Search Engines
Traditional search engine crawlers like Googlebot are designed for long-term relationships with websites. They implement polite crawl rates, back off when servers return 429 or 503 status codes, and adjust their crawl budget based on server responsiveness.
Many AI crawlers operate differently:
- Aggressive crawl rates — some AI crawlers send 50-100+ requests per second without built-in politeness delays
- Full-site scraping — instead of indexing a few pages, they try to download everything for training data
- Ignoring signals — some crawlers do not respect Crawl-delay directives or slow down in response to HTTP 429 responses
- Bandwidth impact — a single AI crawler can consume 10-50 GB of bandwidth per day on a medium-sized site, compared to 1-5 GB for Googlebot
I have seen AI crawlers generate enough traffic to cause visible CPU spikes on production servers. But this is not malicious behavior — it is aggressive but legitimate crawling. The correct response is robots.txt or server-level rate limiting, not an abuse report. If a crawler continues after you have blocked it in robots.txt and at the server level, that is a different story — and that does warrant an abuse report.
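To put numbers on the bandwidth impact for your own site, you can sum response bytes per bot User-Agent straight from the access log. A sketch assuming the combined log format, where the status code and byte count sit between the request and referrer quotes:

```shell
# bot_bandwidth LOGFILE - total bytes served per bot User-Agent,
# largest first (combined log format assumed)
bot_bandwidth() {
  awk -F'"' '
    $6 ~ /[Bb]ot|[Cc]rawler|[Ss]pider/ {
      split($3, s, " ")            # s[1] = status, s[2] = response bytes
      bytes[$6] += s[2]
    }
    END { for (ua in bytes) printf "%12d  %s\n", bytes[ua], ua }
  ' "$1" | sort -rn
}

if [ -r /var/log/nginx/access.log ]; then
  bot_bandwidth /var/log/nginx/access.log
fi
```

Run it daily against rotated logs and you get a trend line per crawler, which is far more persuasive in a "should we block this?" discussion than a single snapshot.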
Search Engine Crawlers (Do Not Block These)
Search engine crawlers drive your organic traffic. Blocking them means your site disappears from search results.
| Crawler | User-Agent Contains | Operator | Verification Method |
|---|---|---|---|
| Googlebot | Googlebot | Google | Reverse DNS → *.googlebot.com or *.google.com |
| Bingbot | bingbot | Microsoft | Reverse DNS → *.search.msn.com |
| YandexBot | YandexBot | Yandex | Reverse DNS → *.yandex.com or *.yandex.net |
| Baiduspider | Baiduspider | Baidu | Reverse DNS → *.baidu.com or *.baidu.jp |
| DuckDuckBot | DuckDuckBot | DuckDuckGo | Published IP ranges |
Verification Technique
Always verify before trusting a search engine User-Agent string. The reverse-then-forward DNS technique catches impersonators:
# Verify Googlebot
ip="66.249.66.1"
ptr=$(dig -x $ip +short)
echo "PTR: $ptr"
# Should be: crawl-66-249-66-1.googlebot.com.
fwd=$(dig $ptr +short)
echo "Forward: $fwd"
# Should match the original IP: 66.249.66.1
# If forward DNS matches original IP → genuine
# If it doesn't match or PTR doesn't resolve → fake
If you are seeing high traffic from a "Googlebot" that fails this verification, that is an impersonator and worth investigating further. Fake Googlebot traffic can indicate scraping, competitive intelligence gathering, or actual malicious activity.
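The manual check above can be wrapped in a small script. The decision logic is kept separate from the dig calls so the two-step rule stays explicit; the IP shown is from Google's published crawl range, and the function treats a missing or mismatched PTR as fake:

```shell
# Decide whether an IP claiming to be Googlebot is genuine, given its PTR
# record and the forward-resolved IP of that PTR
is_genuine_googlebot() {
  ip="$1"; ptr="$2"; fwd="$3"
  case "$ptr" in
    *.googlebot.com.|*.google.com.)
      [ "$fwd" = "$ip" ] && return 0 ;;
  esac
  return 1   # PTR missing, wrong domain, or forward lookup mismatch
}

# Gather the inputs with dig and apply the rule
ip="66.249.66.1"
if command -v dig >/dev/null 2>&1; then
  ptr=$(dig -x "$ip" +short 2>/dev/null || true)
  fwd=$(dig "$ptr" +short 2>/dev/null || true)
  if is_genuine_googlebot "$ip" "$ptr" "$fwd"; then
    echo "$ip: genuine Googlebot"
  else
    echo "$ip: FAILED verification - treat as an impersonator"
  fi
fi
```

Loop this over every IP in your logs that claims a Googlebot User-Agent and you have an impersonator report in a few lines.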
How to Manage Crawlers with robots.txt
The robots.txt file is the standard mechanism for telling crawlers what they can and cannot access on your site. It lives at the root of your domain (e.g., https://example.com/robots.txt) and uses a simple directive syntax.
robots.txt Basics
# Allow all crawlers to access everything (default if no robots.txt exists)
User-agent: *
Allow: /
# Block a specific crawler from everything
User-agent: GPTBot
Disallow: /
# Block a crawler from specific paths
User-agent: SemrushBot
Disallow: /admin/
Disallow: /api/
Allow: /
# Set a crawl rate (seconds between requests)
User-agent: AhrefsBot
Crawl-delay: 10
Key directives:
- User-agent — which crawler this rule applies to (* means all)
- Disallow — paths the crawler should not access
- Allow — paths the crawler can access (overrides Disallow for more specific paths)
- Crawl-delay — seconds to wait between requests (not supported by all crawlers)
- Sitemap — URL of your XML sitemap
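One subtlety worth illustrating is the Allow/Disallow precedence: under the Robots Exclusion Protocol (RFC 9309), the longest matching rule wins. The paths here are illustrative:

```text
User-agent: *
Disallow: /reports/
Allow: /reports/public/

# /reports/q3-internal.pdf   -> blocked   (only the Disallow matches)
# /reports/public/index.html -> crawlable (the longer Allow rule wins)
```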
Block All AI Crawlers
If you want to opt out of AI training while keeping search engines and SEO tools working:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: Omgilibot
Disallow: /
# Allow search engines (default: everything allowed)
User-agent: Googlebot
Allow: /
User-agent: bingbot
Allow: /
Sitemap: https://example.com/sitemap.xml
Rate-Limit Aggressive SEO Crawlers
If you do not want to block SEO crawlers entirely but need them to slow down:
User-agent: AhrefsBot
Crawl-delay: 10
User-agent: SemrushBot
Crawl-delay: 10
User-agent: MJ12bot
Crawl-delay: 30
User-agent: DotBot
Crawl-delay: 10
Crawl-delay: 10 asks the crawler to wait 10 seconds between requests. Note that Googlebot ignores Crawl-delay entirely; Google sets its own crawl rate based on server responsiveness. Bingbot does honor the directive, though Bing recommends the crawl-control settings in Bing Webmaster Tools instead.
Limitations of robots.txt
This is the most important thing to understand: robots.txt is voluntary. It is a convention, not a technical enforcement mechanism. Well-behaved crawlers obey it; malicious scrapers and some aggressive bots ignore it entirely.
robots.txt does not:
- Prevent access (it is publicly readable at /robots.txt, so it can actually reveal paths you might prefer to keep less visible)
- Enforce rate limits (a crawler that ignores the file will ignore Crawl-delay too)
- Block direct URL requests (users and bots can still access Disallowed URLs directly)
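One way to catch crawlers that ignore the file is to cross-check your access log against your own Disallow rules. A sketch assuming the combined log format, with /admin/ standing in for whatever path you Disallow:

```shell
# violations LOGFILE PREFIX - count requests to a Disallowed path prefix,
# broken down by bot User-Agent
violations() {
  awk -F'"' -v prefix="$2" '
    $6 ~ /[Bb]ot|[Cc]rawler|[Ss]pider/ {
      split($2, req, " ")                 # req[2] is the request path
      if (index(req[2], prefix) == 1) seen[$6]++
    }
    END { for (ua in seen) printf "%6d  %s\n", seen[ua], ua }
  ' "$1" | sort -rn
}

# Example: which bots fetched anything under /admin/ despite the Disallow?
if [ -r /var/log/nginx/access.log ]; then
  violations /var/log/nginx/access.log /admin/
fi
```

Any crawler that shows up here after your robots.txt has been in place for a few days is a candidate for the server-level controls below.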
When robots.txt is not enough, you need server-level controls.
Beyond robots.txt: Server-Level Controls
When a crawler ignores your robots.txt or you need hard enforcement, server-level controls are the answer.
Nginx Rate Limiting by User-Agent
# Create a rate limit zone for known aggressive crawlers
map $http_user_agent $is_bot {
default 0;
"~*AhrefsBot" 1;
"~*SemrushBot" 1;
"~*MJ12bot" 1;
"~*GPTBot" 1;
"~*ClaudeBot" 1;
"~*Bytespider" 1;
"~*CCBot" 1;
"~*PerplexityBot" 1;
}
# Bots get their IP as the rate-limit key; everyone else gets an empty
# key, which limit_req skips entirely
map $is_bot $limit_key {
0 "";
1 $binary_remote_addr;
}
limit_req_zone $limit_key zone=bot_limit:10m rate=1r/s;
server {
location / {
# Applies only to requests with a non-empty $limit_key, i.e. identified bots
limit_req zone=bot_limit burst=5 nodelay;
# ... your normal config
}
}
Apache Rate Limiting
# Block specific crawlers entirely
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [F,L]
# Or return 429 Too Many Requests
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [R=429,L]
Cloudflare Bot Management
If you use Cloudflare, their Bot Management and WAF rules provide the easiest server-level controls:
- Super Bot Fight Mode (free plans) — automatically challenges or blocks bots categorized as "automated"
- WAF Custom Rules — create rules that match specific User-Agent strings and block, challenge, or rate-limit them
- AI Bot blocking — Cloudflare added a one-click toggle to block known AI crawlers in 2024
Example Cloudflare WAF rule expression:
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot")
Set the action to "Block" or "Managed Challenge" depending on how aggressively you want to handle these.
Firewall Rules for Known CIDR Ranges
Some AI crawlers publish their IP ranges, which lets you block at the network level regardless of User-Agent:
# OpenAI published ranges (check their documentation for current ranges)
# Block GPTBot at the firewall
sudo ufw deny from 20.15.240.0/24 comment "OpenAI GPTBot"
sudo ufw deny from 20.15.241.0/24 comment "OpenAI GPTBot"
# Or with iptables (match modules go before the jump target)
iptables -A INPUT -s 20.15.240.0/24 -m comment --comment "OpenAI GPTBot" -j DROP
When to use firewall-level blocks:
- The crawler ignores robots.txt
- The crawler spoofs its User-Agent to bypass User-Agent-based rules
- You need to block before the request even reaches your web server (saves CPU)
- The crawler is generating enough traffic to impact server performance
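When an operator publishes ranges in a machine-readable file, you can generate the firewall rules instead of typing them. This sketch assumes a {"prefixes":[{"ipv4Prefix":"..."}]} layout modeled on Googlebot's published ranges file; check each operator's documentation for the real URL and format, and review the output before applying it:

```shell
# ranges_to_ufw RANGES.json LABEL - print (not apply) a ufw rule for every
# ipv4Prefix entry in a crawler's published ranges file. The JSON shape is
# an assumption; verify it against the operator's docs first.
ranges_to_ufw() {
  grep -o '"ipv4Prefix"[^,}]*' "$1" |
  sed 's/.*"\([0-9./]*\)"$/\1/' |
  while read -r cidr; do
    printf 'ufw deny from %s comment "%s"\n' "$cidr" "$2"
  done
}

# Usage: download the ranges file, inspect the generated rules, then pipe
# to sh only once you are happy with them:
#   ranges_to_ufw gptbot-ranges.json "OpenAI GPTBot"
```

Printing rather than applying is deliberate: a malformed ranges file should never translate directly into firewall rules on a production host.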
The Decision Framework: Block, Allow, or Report?
Here is the framework I use when I see an IP generating heavy traffic on a server:
Step 1: Identify the Source
Check the User-Agent in your access logs and look up the IP with the IP Location tool. This immediately tells you if it is a known crawler, which organization owns the IP, and whether the User-Agent matches the IP owner.
Step 2: Classify the Crawler
Based on what you find:
Known search engine (Googlebot, Bingbot, etc.) → Allow. These drive your traffic. If they are crawling too aggressively, adjust your crawl budget in Google Search Console or Bing Webmaster Tools. Do not file an abuse report.
SEO tool from your own company → Talk to your team. Screaming Frog, Sitebulb, or a Botify crawl triggered by your marketing or SEO team is internal traffic. Coordinate crawl schedules to avoid impacting production. Definitely do not file an abuse report.
Third-party SEO crawler (AhrefsBot, SemrushBot) → Rate-limit or block via robots.txt. These respect robots.txt. Set a Crawl-delay or block specific paths. An abuse report is not appropriate here.
AI training crawler (GPTBot, ClaudeBot, CCBot) → Block via robots.txt if you do not want your content used for AI training. Most AI crawlers respect robots.txt. If they do not, escalate to server-level blocks.
Crawler ignoring robots.txt → Server-level block. Use Nginx/Apache rules or firewall rules. If the behavior is persistent and the operator is unresponsive to contact, this may warrant an abuse report.
Unknown bot with malicious behavior patterns → Investigate and potentially report. If the traffic does not have a legitimate crawler User-Agent, the IP does not belong to a known crawling service, and the behavior looks like scraping, DDoS, or reconnaissance — that is when you file an abuse report using the complete guide to reporting IP abuse.
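The classification above can be condensed into a first-pass triage helper. This is a simplification of the framework, not a replacement for it — a matching User-Agent still needs the IP verification from step 1, since anything here can be spoofed:

```shell
# triage UA - map a User-Agent string to the first-pass response from the
# decision framework (simplified; verify the source IP before acting)
triage() {
  case "$1" in
    *Googlebot*|*bingbot*)
      echo "allow: verify via reverse DNS, tune crawl rate in webmaster tools" ;;
    *"Screaming Frog"*|*Sitebulb*)
      echo "internal desktop tool: talk to your SEO team" ;;
    *AhrefsBot*|*SemrushBot*|*MJ12bot*|*DotBot*)
      echo "rate-limit: robots.txt Crawl-delay or server rules" ;;
    *GPTBot*|*ClaudeBot*|*CCBot*|*Bytespider*)
      echo "block via robots.txt; escalate to server-level if ignored" ;;
    *)
      echo "unknown: verify IP ownership before considering an abuse report" ;;
  esac
}

triage "Mozilla/5.0 (compatible; GPTBot/1.0)"
# -> block via robots.txt; escalate to server-level if ignored
```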
When an Abuse Report Is Warranted
To be clear, there are situations where crawling behavior does cross the line into abuse:
- The crawler is ignoring robots.txt and you have contacted the operator without response
- The traffic is causing service degradation and the operator will not implement rate limiting
- The crawler is spoofing User-Agent strings to impersonate legitimate bots
- The IP does not belong to any known crawling service and the behavior is clearly automated scraping
- The "crawler" is actually performing port scanning or brute-force attacks alongside its requests
Should You Block AI Crawlers? Pros and Cons
This is the question every sysadmin is wrestling with in 2025. There is no universally right answer — it depends on your organization's priorities.
Reasons to Block
- Bandwidth and server load — AI crawlers can consume significant resources. If you are paying for bandwidth, this has a direct cost.
- Content used without consent — your content may be used to train models that compete with you or generate derivative content without attribution.
- No direct benefit — unlike search engine crawlers, AI training crawlers do not drive traffic to your site.
- Legal uncertainty — the legal status of AI training on web content is still being litigated in multiple jurisdictions.
Reasons to Allow
- AI search visibility — Perplexity, ChatGPT with browsing, and Google AI Overviews are becoming significant traffic sources. Blocking their crawlers may reduce your visibility in AI-powered search.
- Future discoverability — as AI becomes a primary way people find information, content that has been crawled and indexed by AI systems may have better long-term reach.
- Selective blocking is possible — you can block training crawlers (GPTBot, Google-Extended) while allowing search-adjacent ones (ChatGPT-User, OAI-SearchBot, PerplexityBot).
My Recommendation
Block the pure training crawlers (GPTBot, Google-Extended, ClaudeBot, Bytespider, CCBot, Meta-ExternalAgent) via robots.txt. Consider allowing the search-adjacent crawlers (ChatGPT-User, PerplexityBot, OAI-SearchBot) if AI search traffic matters to your business. Monitor your access logs monthly to stay on top of new crawlers as they appear — this landscape changes fast.
