How to Identify and Manage Web Crawlers: A Sysadmin's Guide to robots.txt, AI Bots, and SEO Crawlers

Last month I got a frantic message from a colleague: "Someone is hammering our production server — thousands of requests per minute from a single IP. Should I file an abuse report?" I looked up the IP with our IP Location tool, checked the access logs, and found the User-Agent string: Mozilla/5.0 (compatible; Screaming Frog SEO Spider/21.0). The marketing team had kicked off a full-site audit without telling IT.

This happens more than you would think. Sysadmins see an IP generating massive traffic, look it up, and immediately reach for the abuse report template from my complete guide to reporting IP abuse. But not every heavy hitter is malicious. Web crawlers — from SEO audit tools to AI training bots to search engine indexers — can generate traffic patterns that look almost identical to a denial-of-service attack. Filing an abuse report against Googlebot or your own company's SEO tool is not just embarrassing; it wastes the ISP's abuse desk time and your own.

This guide covers how to identify what is actually crawling your server, the full landscape of crawlers you will encounter in 2025, how to manage them with robots.txt and server-level controls, and a decision framework for when to block, allow, or escalate to an abuse report.

How to Identify What Is Crawling Your Server

The single most important step before taking any action is checking the User-Agent string in your access logs. Legitimate crawlers identify themselves there, so the User-Agent tells you who claims to be responsible. Because the string can be spoofed, treat it as a lead to verify rather than proof.

Extracting User-Agents from Access Logs

Nginx/Apache combined log format:

# List all unique User-Agents, sorted by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -30

# Find all requests from a specific IP and show their User-Agents
awk -v ip="198.51.100.47" '$1 == ip {split($0, a, "\""); print a[6]}' /var/log/nginx/access.log | sort -u

# Count requests per IP per hour (find the heavy hitters)
awk '{print $1, substr($4,2,14)}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

Quick one-liner to find bot traffic:

# Count requests whose User-Agent contains "bot", "crawler", "spider", etc.
# (matches on the User-Agent field only, so URLs such as /robots.txt are not counted)
awk -F'"' 'tolower($6) ~ /bot|crawler|spider|scraper|fetcher/ {print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

Verifying Crawler Identity with IP Lookup

User-Agent strings can be spoofed. Anyone can send a request claiming to be Googlebot. To verify a crawler is genuine:

Look up the IP with DNS Checker's IP Location tool — check if the owning organization matches the claimed crawler. A "Googlebot" request from an IP owned by a residential ISP in Romania is fake.
Reverse DNS verification — legitimate search engine crawlers have verifiable PTR records:

# Step 1: Reverse DNS lookup
host 66.249.66.1
# Returns: crawl-66-249-66-1.googlebot.com

# Step 2: Forward DNS to confirm
host crawl-66-249-66-1.googlebot.com
# Returns: 66.249.66.1  ← matches, this is genuine Googlebot

# Combined verification script
dig -x 198.51.100.47 +short | xargs -I{} dig {} +short

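This two-step verification (reverse lookup, then forward lookup to confirm the IP matches) is the gold standard for confirming Google, Bing, and other major search engine crawlers. Also check that the PTR hostname falls under the crawler's own domain (googlebot.com or google.com for Googlebot): anyone who controls reverse DNS for their own IP block can make the round trip match under a domain they own.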

SEO Crawlers: The Usual Suspects

SEO crawlers are the most common source of "is this an attack?" false alarms. They hit your server hard because they are designed to crawl entire sites quickly — sometimes thousands of pages in minutes.

Crawler	User-Agent Contains	Operator	Typical Behavior
Screaming Frog	`Screaming Frog SEO Spider`	Desktop tool (run internally)	Aggressive, runs from your office IP or employee laptop
AhrefsBot	`AhrefsBot`	Ahrefs	Continuous backlink crawling, 5-10 req/sec
SemrushBot	`SemrushBot`	Semrush	SEO auditing and competitive analysis
DotBot	`DotBot`	Moz	Backlink and domain analysis
MJ12bot	`MJ12bot`	Majestic	Large-scale link index building
Sitebulb	`Sitebulb`	Desktop tool	Technical SEO auditing
Lumar	`DeepCrawl` or `Lumar`	Lumar (formerly DeepCrawl)	Enterprise site crawling
Botify	`botify`	Botify	Enterprise SEO platform

How to Tell Internal vs. External Crawlers

The critical distinction is whether someone in your organization triggered the crawl:

Screaming Frog, Sitebulb — these are desktop applications. If the source IP is your office network or a VPN exit node, someone on your team is running it. Talk to your marketing or SEO team before filing anything.
AhrefsBot, SemrushBot, MJ12bot — these are third-party services crawling the open web. They typically respect robots.txt and have published IP ranges. If their crawl rate is causing problems, robots.txt or rate limiting is the appropriate response, not an abuse report.
Botify, Lumar — enterprise tools that may be run by an agency working for your company. Check with whoever manages your SEO.

AI Crawlers: The New Heavy Hitters

AI crawlers are the fastest-growing category of bot traffic and the most likely to trigger abuse report instincts. They hit harder than traditional search engine crawlers: many operate at aggressive rates without politeness delays, and some ignore robots.txt entirely.

Crawler	User-Agent Contains	Operator	Purpose
GPTBot	`GPTBot`	OpenAI	Training data collection
ChatGPT-User	`ChatGPT-User`	OpenAI	Real-time browsing for ChatGPT
OAI-SearchBot	`OAI-SearchBot`	OpenAI	SearchGPT web search
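Google-Extended	`Google-Extended`	Google	Gemini AI training control (robots.txt token only; crawling uses Google's regular user agents, so it does not appear in access logs)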
ClaudeBot	`ClaudeBot`	Anthropic	Claude AI training
anthropic-ai	`anthropic-ai`	Anthropic	Anthropic web fetcher
Claude-Web	`Claude-Web`	Anthropic	Claude web access
Meta-ExternalAgent	`Meta-ExternalAgent`	Meta	AI training crawling
FacebookBot	`FacebookBot`	Meta	Link previews and AI
Applebot-Extended	`Applebot-Extended`	Apple	Apple Intelligence training control (robots.txt token only; the crawling itself is done by Applebot)
Amazonbot	`Amazonbot`	Amazon	Alexa and AI services
PerplexityBot	`PerplexityBot`	Perplexity	AI search engine
Bytespider	`Bytespider`	ByteDance	TikTok/AI training
CCBot	`CCBot`	Common Crawl	Open dataset for AI training
Diffbot	`Diffbot`	Diffbot	Web data extraction
YouBot	`YouBot`	You.com	AI search engine
Omgilibot	`Omgilibot`	Webz.io	Data-as-a-service for AI

Why AI Crawlers Hit Harder Than Search Engines

Traditional search engine crawlers like Googlebot are designed for long-term relationships with websites. They implement polite crawl rates, back off when servers return 429 or 503 status codes, and adjust their crawl budget based on server responsiveness.

Many AI crawlers operate differently:

Aggressive crawl rates — some AI crawlers send 50-100+ requests per second without built-in politeness delays
Full-site scraping — instead of indexing a few pages, they try to download everything for training data
Ignoring signals — some crawlers do not respect Crawl-delay directives or slow down in response to HTTP 429 responses
Bandwidth impact — a single AI crawler can consume 10-50 GB of bandwidth per day on a medium-sized site, compared to 1-5 GB for Googlebot

I have seen AI crawlers generate enough traffic to cause visible CPU spikes on production servers. But this is not malicious behavior — it is aggressive but legitimate crawling. The correct response is robots.txt or server-level rate limiting, not an abuse report. If a crawler continues after you have blocked it in robots.txt and at the server level, that is a different story — and that does warrant an abuse report.
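To check bandwidth figures like these against your own traffic, you can total the response bytes per User-Agent. The sketch below assumes the combined log format used earlier in this guide; the function name bytes_per_agent is only illustrative.

```shell
# Sum response bytes and request counts per User-Agent.
# Reads combined-format access log lines on stdin.
# Output columns (tab-separated): total bytes, requests, User-Agent.
bytes_per_agent() {
  awk -F'"' '
    {
      split($3, f, " ")        # $3 looks like: " 200 5120 " (status, bytes)
      bytes[$6] += f[2]        # a "-" byte count is treated as 0
      hits[$6]++
    }
    END {
      for (ua in bytes) printf "%.0f\t%d\t%s\n", bytes[ua], hits[ua], ua
    }
  ' | sort -rn
}

# Usage (the log path is an example):
#   bytes_per_agent < /var/log/nginx/access.log | head -20
```

Run it over a single day's log to get a per-crawler daily total you can compare with your normal traffic.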

Search Engine Crawlers (Do Not Block These)

Search engine crawlers drive your organic traffic. Blocking them means your site disappears from search results.

Crawler	User-Agent Contains	Operator	Verification Method
Googlebot	`Googlebot`	Google	Reverse DNS → `.googlebot.com` or `.google.com`
Bingbot	`bingbot`	Microsoft	Reverse DNS → `*.search.msn.com`
YandexBot	`YandexBot`	Yandex	Reverse DNS → `.yandex.com` or `.yandex.net`
Baiduspider	`Baiduspider`	Baidu	Reverse DNS → `.baidu.com` or `.baidu.jp`
DuckDuckBot	`DuckDuckBot`	DuckDuckGo	Published IP ranges

Verification Technique

Always verify before trusting a search engine User-Agent string. The reverse-then-forward DNS technique catches impersonators:

# Verify Googlebot
ip="66.249.66.1"
ptr=$(dig -x "$ip" +short | head -n 1)
echo "PTR: $ptr"
# Should be: crawl-66-249-66-1.googlebot.com.
# (the hostname must end in googlebot.com or google.com)

fwd=$(dig "$ptr" +short)
echo "Forward: $fwd"
# Should match the original IP: 66.249.66.1

# If the PTR is in Google's domain and forward DNS matches the original IP → genuine
# If it doesn't match or PTR doesn't resolve → fake

If you are seeing high traffic from a "Googlebot" that fails this verification, that is an impersonator and worth investigating further. Fake Googlebot traffic can indicate scraping, competitive intelligence gathering, or actual malicious activity.
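The same check can be wrapped in a function that takes the expected domains from the table above and returns a simple pass or fail. This is a minimal sketch, not an official verification tool: the names ptr_in_domain and verify_crawler_ip are hypothetical, it assumes dig is installed, and it only handles IPv4 (A records).

```shell
# Succeeds if hostname $1 (trailing dot allowed) ends in one of the
# domains given as the remaining arguments.
ptr_in_domain() (
  host=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  host=${host%.}
  shift
  for domain in "$@"; do
    case "$host" in
      *."$domain") exit 0 ;;
    esac
  done
  exit 1
)

# Succeeds only if the PTR for IP $1 is inside one of the expected
# domains AND that hostname resolves back to the same IP.
verify_crawler_ip() (
  ip=$1
  shift
  ptr=$(dig -x "$ip" +short | head -n 1)
  [ -n "$ptr" ] || exit 1
  ptr_in_domain "$ptr" "$@" || exit 1
  dig "$ptr" +short | grep -qxF "$ip"
)

# Usage:
#   verify_crawler_ip 66.249.66.1 googlebot.com google.com \
#     && echo "genuine" || echo "fake or unverifiable"
```

Checking the domain before the forward lookup matters: a round trip that matches under an unrelated domain proves nothing about who operates the crawler.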

How to Manage Crawlers with robots.txt

The robots.txt file is the standard mechanism for telling crawlers what they can and cannot access on your site. It lives at the root of your domain (e.g., https://example.com/robots.txt) and uses a simple directive syntax.

robots.txt Basics

# Allow all crawlers to access everything (default if no robots.txt exists)
User-agent: *
Allow: /

# Block a specific crawler from everything
User-agent: GPTBot
Disallow: /

# Block a crawler from specific paths
User-agent: SemrushBot
Disallow: /admin/
Disallow: /api/
Allow: /

# Set a crawl rate (seconds between requests)
User-agent: AhrefsBot
Crawl-delay: 10

Key directives:

User-agent — which crawler this rule applies to (* means all)
Disallow — paths the crawler should not access
Allow — paths the crawler can access (overrides Disallow for more specific paths)
Crawl-delay — seconds to wait between requests (not supported by all crawlers)
Sitemap — URL of your XML sitemap
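To see how these directives group together, it can help to list the rules that apply to one crawler. The following is a simplified sketch, not a full robots.txt parser: robots_rules_for is a hypothetical helper, it matches the User-agent value exactly (case-insensitive), and it does not fall back to the * group when the crawler has no group of its own.

```shell
# Print the rules from the group(s) that name the given crawler.
# Reads robots.txt on stdin. Consecutive User-agent lines share one group.
robots_rules_for() {
  awk -v agent="$1" '
    BEGIN { agent = tolower(agent) }
    {
      sub(/#.*/, "")                       # strip comments
      colon = index($0, ":")
      if (colon == 0) next                 # blank or malformed line
      field = tolower(substr($0, 1, colon - 1))
      value = substr($0, colon + 1)
      gsub(/^[ \t]+|[ \t\r]+$/, "", field)
      gsub(/^[ \t]+|[ \t\r]+$/, "", value)
      if (field == "sitemap") next         # not part of any group
      if (field == "user-agent") {
        if (!in_agents) matched = 0        # first User-agent line opens a new group
        in_agents = 1
        if (tolower(value) == agent) matched = 1
      } else {
        in_agents = 0
        if (matched) print field ": " value
      }
    }'
}

# Usage (URL is a placeholder):
#   curl -s https://example.com/robots.txt | robots_rules_for SemrushBot
```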

Block All AI Crawlers

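The list below blocks AI training crawlers as well as AI search and browsing agents (ChatGPT-User, OAI-SearchBot, PerplexityBot), while leaving search engines and SEO tools untouched. Drop the search-adjacent entries if you want to stay visible in AI-powered search: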

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: Omgilibot
Disallow: /

# Allow search engines (default: everything allowed)
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

Sitemap: https://example.com/sitemap.xml
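Because the list of AI crawlers keeps changing, it may be easier to keep the names in one place and generate the blocks. A small sketch; generate_block_rules is a hypothetical helper and the output file name is only an example.

```shell
# Print a "block everything" robots.txt group for each crawler name given.
generate_block_rules() {
  for agent in "$@"; do
    printf 'User-agent: %s\nDisallow: /\n\n' "$agent"
  done
}

# Usage: write the generated groups to a fragment you merge into robots.txt
#   generate_block_rules GPTBot ClaudeBot CCBot Bytespider > ai-block.txt
```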

Rate-Limit Aggressive SEO Crawlers

If you do not want to block SEO crawlers entirely but need them to slow down:

User-agent: AhrefsBot
Crawl-delay: 10

User-agent: SemrushBot
Crawl-delay: 10

User-agent: MJ12bot
Crawl-delay: 30

User-agent: DotBot
Crawl-delay: 10

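Crawl-delay: 10 asks the crawler to wait 10 seconds between requests. Bingbot honors Crawl-delay, but Googlebot ignores it. Google retired Search Console's crawl rate limiter in January 2024; to slow Googlebot, temporarily return 500, 503, or 429 responses, which it treats as a signal to back off.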

Limitations of robots.txt

This is the most important thing to understand: robots.txt is voluntary. It is a convention, not a technical enforcement mechanism. Well-behaved crawlers obey it; malicious scrapers and some aggressive bots ignore it entirely.

robots.txt does not:

Prevent access (it is publicly readable at /robots.txt, so it can actually reveal paths you might prefer to keep less visible)
Enforce rate limits (a crawler that ignores the file will ignore Crawl-delay too)
Block direct URL requests (users and bots can still access Disallowed URLs directly)

When robots.txt is not enough, you need server-level controls.
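Since compliance is voluntary, the access log is where you find out whether a crawler actually honors your rules. A minimal sketch, assuming the combined log format; disallowed_hits is a hypothetical helper that counts requests from a given User-Agent substring to a path prefix you have disallowed.

```shell
# Count requests per client IP where the User-Agent contains $1
# (case-insensitive) and the requested path starts with $2.
# Reads combined-format access log lines on stdin.
disallowed_hits() {
  awk -F'"' -v ua="$1" -v prefix="$2" '
    BEGIN { ua = tolower(ua) }
    index(tolower($6), ua) {
      split($2, req, " ")                  # $2 is: METHOD PATH PROTOCOL
      if (index(req[2], prefix) == 1) {
        split($1, pre, " ")                # first field before the quotes is the IP
        hits[pre[1]]++
      }
    }
    END { for (ip in hits) print hits[ip], ip }
  ' | sort -rn
}

# Usage (the log path is an example):
#   disallowed_hits GPTBot /admin/ < /var/log/nginx/access.log
```

Crawlers can cache robots.txt for a while, so a few hits shortly after you change the file are not proof of non-compliance. Sustained hits days later are.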

Beyond robots.txt: Server-Level Controls

When a crawler ignores your robots.txt or you need hard enforcement, server-level controls are the answer.

Nginx Rate Limiting by User-Agent

# Map known aggressive crawlers to a rate-limit key (their client IP).
# All other clients get an empty key, which limit_req does not count.
map $http_user_agent $bot_limit_key {
    default             "";
    "~*AhrefsBot"       $binary_remote_addr;
    "~*SemrushBot"      $binary_remote_addr;
    "~*MJ12bot"         $binary_remote_addr;
    "~*GPTBot"          $binary_remote_addr;
    "~*ClaudeBot"       $binary_remote_addr;
    "~*Bytespider"      $binary_remote_addr;
    "~*CCBot"           $binary_remote_addr;
    "~*PerplexityBot"   $binary_remote_addr;
}

limit_req_zone $bot_limit_key zone=bot_limit:10m rate=1r/s;

server {
    location / {
        # Only requests with a non-empty $bot_limit_key are limited
        limit_req zone=bot_limit burst=5 nodelay;
        limit_req_status 429;
        # ... your normal config
    }
}

Apache Rate Limiting

# Block specific crawlers entirely
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [F,L]

# Or, instead of the 403 rule above, return 429 Too Many Requests (use one rule or the other, not both)
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [R=429,L]
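Before relying on a User-Agent rule, it is worth confirming that it behaves as intended. A rough sketch using curl's -A option to send a crawler's User-Agent string; probe_as_bot is a hypothetical helper and https://example.com/ stands in for your own site. It only exercises User-Agent matching from your own IP, so it says nothing about IP-based rules.

```shell
# Send N requests (default 10) with the given User-Agent and
# print how many times each HTTP status code came back.
probe_as_bot() (
  url=$1
  ua=$2
  count=${3:-10}
  i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -A "$ua" -w '%{http_code}\n' "$url"
    i=$((i + 1))
  done | sort | uniq -c
)

# Usage (run only against servers you operate):
#   probe_as_bot https://example.com/ "GPTBot" 20
#   probe_as_bot https://example.com/ "Mozilla/5.0" 20   # control: a normal browser UA
```

With a block rule in place you would expect the bot run to show only 403 or 429 responses while the control run still shows 200s.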

Cloudflare Bot Management

If you use Cloudflare, its bot protection features and WAF rules are the easiest way to enforce limits at the edge, before requests reach your server:

Bot Fight Mode (free plans) and Super Bot Fight Mode (Pro and Business plans): automatically challenge or block traffic that Cloudflare classifies as automated
WAF Custom Rules: create rules that match specific User-Agent strings and block, challenge, or rate-limit them
AI bot blocking: Cloudflare added a one-click toggle to block known AI crawlers in 2024

Example Cloudflare WAF rule expression:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot")

Set the action to "Block" or "Managed Challenge" depending on how aggressively you want to handle these.

Firewall Rules for Known CIDR Ranges

Some AI crawlers publish their IP ranges, which lets you block at the network level regardless of User-Agent:

# OpenAI published ranges (check their documentation for current ranges)
# Block GPTBot at the firewall
sudo ufw deny from 20.15.240.0/24 comment "OpenAI GPTBot"
sudo ufw deny from 20.15.241.0/24 comment "OpenAI GPTBot"

# Or with iptables
iptables -A INPUT -s 20.15.240.0/24 -j DROP -m comment --comment "OpenAI GPTBot"

When to use firewall-level blocks:

The crawler ignores robots.txt
The crawler spoofs its User-Agent to bypass User-Agent-based rules
You need to block before the request even reaches your web server (saves CPU)
The crawler is generating enough traffic to impact server performance
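Published ranges change over time, so hard-coding them tends to go stale. One option is to generate the rules from a list you refresh periodically. The sketch below assumes you have already saved the operator's current IPv4 ranges to a plain-text file, one CIDR per line; that file, its name, and the function ufw_rules_from_cidrs are all hypothetical. It prints the commands instead of running them so you can review them first.

```shell
# Read CIDR ranges on stdin and print one ufw deny command per range.
# Lines that do not look like an IPv4 CIDR (comments, blanks) are skipped.
# The pattern is a loose format check, not full address validation.
ufw_rules_from_cidrs() (
  label=$1
  grep -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}/[0-9]{1,2}$' |
  while read -r cidr; do
    printf 'ufw deny from %s comment "%s"\n' "$cidr" "$label"
  done
)

# Usage (file name is an example): review the output, then pipe it to a root shell
#   ufw_rules_from_cidrs "OpenAI GPTBot" < gptbot-ranges.txt
#   ufw_rules_from_cidrs "OpenAI GPTBot" < gptbot-ranges.txt | sudo sh
```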

The Decision Framework: Block, Allow, or Report?

Here is the framework I use when I see an IP generating heavy traffic on a server:

Step 1: Identify the Source

Check the User-Agent in your access logs and look up the IP with the IP Location tool. This immediately tells you if it is a known crawler, which organization owns the IP, and whether the User-Agent matches the IP owner.

Step 2: Classify the Crawler

Based on what you find:

Known search engine (Googlebot, Bingbot, etc.) → Allow. These drive your traffic. If they are crawling too aggressively, use Crawl Control in Bing Webmaster Tools for Bingbot; for Googlebot, temporarily return 503 or 429 responses, since Search Console's crawl rate setting was retired in 2024. Do not file an abuse report.

SEO tool from your own company → Talk to your team. Screaming Frog, Sitebulb, or a Botify crawl triggered by your marketing or SEO team is internal traffic. Coordinate crawl schedules to avoid impacting production. Definitely do not file an abuse report.

Third-party SEO crawler (AhrefsBot, SemrushBot) → Rate-limit or block via robots.txt. These respect robots.txt. Set a Crawl-delay or block specific paths. An abuse report is not appropriate here.

AI training crawler (GPTBot, ClaudeBot, CCBot) → Block via robots.txt if you do not want your content used for AI training. Most AI crawlers respect robots.txt. If they do not, escalate to server-level blocks.

Crawler ignoring robots.txt → Server-level block. Use Nginx/Apache rules or firewall rules. If the behavior is persistent and the operator is unresponsive to contact, this may warrant an abuse report.

Unknown bot with malicious behavior patterns → Investigate and potentially report. If the traffic does not have a legitimate crawler User-Agent, the IP does not belong to a known crawling service, and the behavior looks like scraping, DDoS, or reconnaissance — that is when you file an abuse report using the complete guide to reporting IP abuse.
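The classification step can be sketched as a small lookup on the User-Agent string. This is an illustration of the framework above, not a complete bot database: classify_ua is a hypothetical helper, the crawler names come from the tables earlier in this guide, and the result is only a starting point because User-Agents can be spoofed.

```shell
# Map a User-Agent string to a category and a suggested first action.
classify_ua() {
  ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ua" in
    *googlebot*|*bingbot*|*yandexbot*|*baiduspider*|*duckduckbot*)
      echo "search-engine: allow, after verifying the IP via reverse DNS" ;;
    *"screaming frog"*|*sitebulb*)
      echo "desktop-seo-tool: check with your own team first" ;;
    *ahrefsbot*|*semrushbot*|*mj12bot*|*dotbot*)
      echo "seo-crawler: rate-limit or block via robots.txt" ;;
    *gptbot*|*claudebot*|*ccbot*|*bytespider*|*meta-externalagent*)
      echo "ai-crawler: block via robots.txt, escalate to server-level rules if ignored" ;;
    *)
      echo "unknown: investigate the IP owner and request pattern" ;;
  esac
}

# Usage:
#   classify_ua "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
```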

When an Abuse Report Is Warranted

To be clear, there are situations where crawling behavior does cross the line into abuse:

The crawler is ignoring robots.txt and you have contacted the operator without response
The traffic is causing service degradation and the operator will not implement rate limiting
The crawler is spoofing User-Agent strings to impersonate legitimate bots
The IP does not belong to any known crawling service and the behavior is clearly automated scraping
The "crawler" is actually performing port scanning or brute-force attacks alongside its requests

Should You Block AI Crawlers? Pros and Cons

This is the question every sysadmin is wrestling with in 2025. There is no universally right answer — it depends on your organization's priorities.

Reasons to Block

Bandwidth and server load — AI crawlers can consume significant resources. If you are paying for bandwidth, this has a direct cost.
Content used without consent — your content may be used to train models that compete with you or generate derivative content without attribution.
No direct benefit — unlike search engine crawlers, AI training crawlers do not drive traffic to your site.
Legal uncertainty — the legal status of AI training on web content is still being litigated in multiple jurisdictions.

Reasons to Allow

AI search visibility — Perplexity, ChatGPT with browsing, and Google AI Overviews are becoming significant traffic sources. Blocking their crawlers may reduce your visibility in AI-powered search.
Future discoverability — as AI becomes a primary way people find information, content that has been crawled and indexed by AI systems may have better long-term reach.
Selective blocking is possible — you can block training crawlers (GPTBot, Google-Extended) while allowing search-adjacent ones (ChatGPT-User, OAI-SearchBot, PerplexityBot).

My Recommendation

Block the pure training crawlers (GPTBot, Google-Extended, ClaudeBot, Bytespider, CCBot, Meta-ExternalAgent) via robots.txt. Consider allowing the search-adjacent crawlers (ChatGPT-User, PerplexityBot, OAI-SearchBot) if AI search traffic matters to your business. Monitor your access logs monthly to stay on top of new crawlers as they appear — this landscape changes fast.

Frequently Asked Questions

This article was researched and structured by the author with AI assistance for drafting and technical verification.


How do I block AI crawlers in robots.txt?

Does robots.txt actually stop AI crawlers from accessing my site?

How can I tell if heavy traffic is from a crawler or a DDoS attack?

Should I block Googlebot or other search engine crawlers?

How do I verify that a crawler claiming to be Googlebot is genuine?

What is the difference between Crawl-delay in robots.txt and server-level rate limiting?

Can I block AI crawlers but still appear in AI-powered search results?

When should I file an abuse report against a web crawler instead of just blocking it?
