The Complete Guide to AI Crawlers

Every AI web crawler explained: what they do, who operates them, and exactly how to control access via robots.txt. Covers all 14 major AI crawlers active in 2026.

What Are AI Crawlers?

AI crawlers are automated programs that visit websites to collect content for artificial intelligence systems. Every major AI company operates at least one, or publishes a robots.txt token that controls AI use of crawled content: OpenAI has GPTBot, Anthropic has ClaudeBot, Perplexity has PerplexityBot, Google has Google-Extended, and so on.

These crawlers serve three distinct purposes:

  • Training data collection. GPTBot, ClaudeBot, CCBot and others crawl the web to build training datasets for AI models. This largely determines whether a model "knows" about your site without having to look it up.
  • Real-time browsing. ChatGPT-User and Claude-Web fetch specific pages when a user asks the AI to look something up mid-conversation. These are the ones that enable live citations.
  • AI search indexing. OAI-SearchBot and PerplexityBot build search indexes for AI-powered search engines. Think of them like Googlebot, but for AI search.
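
The three categories can be sketched as a small lookup table. This is only an illustration: the function name and the sample User-Agent strings are made up, and the matching is a plain case-insensitive substring test rather than anything the crawler operators specify. Also keep in mind that some tokens, such as Google-Extended, are robots.txt control tokens and may never show up in a request header at all.

```python
# Category of each user-agent token, as listed in this guide.
CRAWLER_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "anthropic-ai": "training",
    "Google-Extended": "training",
    "CCBot": "training",
    "cohere-ai": "training",
    "Bytespider": "training",
    "Diffbot": "training",
    "FacebookBot": "training",
    "Applebot-Extended": "training",
    "ChatGPT-User": "browsing",
    "Claude-Web": "browsing",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
}


def classify_user_agent(user_agent_header):
    """Return (token, category) for the first known token found, else None.

    Hypothetical helper: a case-insensitive substring match against the
    User-Agent header, which is an assumption, not an official rule.
    """
    lowered = user_agent_header.lower()
    for token, category in CRAWLER_CATEGORIES.items():
        if token.lower() in lowered:
            return token, category
    return None


# Illustrative header values, not the exact strings any crawler sends.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 Firefox/120.0"))
```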

Most AI crawler operators say they honor robots.txt, which gives you practical control over which crawlers can access your site. Compliance is voluntary, though, so robots.txt is a request rather than an enforcement mechanism. The bigger challenge is that most website owners don't know these crawlers exist, and many sites accidentally block them through overly broad robots.txt rules.
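
To see how an overly broad rule catches AI crawlers by accident, here is a minimal sketch using Python's standard urllib.robotparser. The function name and the sample robots.txt are hypothetical, and Python's matching rules are a stand-in: each crawler uses its own parser, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# The 14 user-agent tokens covered in this guide.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "anthropic-ai", "PerplexityBot", "Google-Extended", "CCBot", "cohere-ai",
    "Bytespider", "Diffbot", "FacebookBot", "Applebot-Extended",
]


def audit_robots_txt(robots_txt, path="/"):
    """Return {token: allowed} for every AI crawler, per Python's parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in AI_CRAWLERS}


# Hypothetical "search engines only" setup: Googlebot is allowed,
# everything else falls through to the wildcard group.
SEARCH_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

blocked = [agent for agent, ok in audit_robots_txt(SEARCH_ONLY).items() if not ok]
print(f"{len(blocked)} of {len(AI_CRAWLERS)} AI crawlers blocked")
```

None of the AI crawlers is named in that file, yet every one of them is caught by the wildcard group.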

At a glance:

  • 14 major AI crawlers active in 2026
  • 3 categories: training, browsing, search
  • All 14 can be addressed by name in robots.txt

Training Crawlers

These crawlers collect content to train AI models. Blocking them prevents the AI from learning from your content in future training runs.

GPTBot (OpenAI)

Training data collection for ChatGPT and GPT models

This is the big one for ChatGPT visibility. If you block GPTBot, your site won't be included in future model training. Worth noting: ChatGPT may still know about your brand from older training data, but it won't pick up new content.

Recommendation: Allow if you want AI visibility
User-agent: GPTBot

ClaudeBot (Anthropic)

Training data collection for Claude models

Same idea as GPTBot but for Anthropic's Claude. Block it and your content won't appear in future Claude training data. Claude might still know about your site from Common Crawl or earlier training runs though.

Recommendation: Allow if you want Claude visibility
User-agent: ClaudeBot

anthropic-ai (Anthropic)

General Anthropic crawler for AI research

Anthropic's older, general-purpose crawler. You'll see it less often than ClaudeBot, but it still shows up. If you're allowing ClaudeBot, allow this one too for consistency.

Recommendation: Allow alongside ClaudeBot
User-agent: anthropic-ai

Google-Extended (Google)

Training data for Gemini and other Google AI products

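Important distinction: this is NOT Googlebot. Google-Extended is a robots.txt control token rather than a separate crawler with its own user-agent string, and blocking it has no effect on your Google Search rankings. It controls whether Google uses your content for Gemini model training and grounding. It does not control AI Overviews, which are part of Google Search and follow your Googlebot rules.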

Recommendation: Allow if you want Gemini visibility
User-agent: Google-Extended
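
A quick way to convince yourself that the two tokens are independent at the robots.txt level is to run a rule through Python's standard urllib.robotparser. This is a sketch under assumptions: the /pricing path is made up, and Python's parser only approximates how Google evaluates the file. It shows rule matching only and says nothing about rankings.

```python
from urllib.robotparser import RobotFileParser

# Block only the AI-use token; Googlebot is not mentioned anywhere.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "/pricing" is a hypothetical path used for illustration.
googlebot_ok = parser.can_fetch("Googlebot", "/pricing")
extended_ok = parser.can_fetch("Google-Extended", "/pricing")

print("Googlebot allowed:", googlebot_ok)        # True
print("Google-Extended allowed:", extended_ok)   # False
```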

CCBot (Common Crawl Foundation)

Open web archive used by many AI companies for training

This one flies under the radar but it's arguably the most important. Common Crawl is an open web archive whose data made up a large share of the training sets for GPT-3 and the original LLaMA, and many AI companies still pull from it. Blocking CCBot means you're less likely to appear in the training data of any model built on that archive, including many open-source ones.

Recommendation: Allow for broad AI visibility
User-agent: CCBot

cohere-ai (Cohere)

Training data for Cohere's enterprise AI models

Cohere builds AI for enterprise use cases like internal search and document processing. Less consumer-facing than ChatGPT or Claude, but if your content targets B2B audiences, Cohere's models may surface it.

Recommendation: Allow if targeting enterprise AI
User-agent: cohere-ai

Bytespider (ByteDance, TikTok's parent company)

Web crawling for ByteDance AI products

ByteDance (TikTok's parent company) runs this crawler for its AI products. It is known for aggressive request volume. Many Western sites block it, since ByteDance's AI products primarily serve the Chinese market.

Recommendation: Block if not targeting Chinese AI market
User-agent: Bytespider
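
If you want to judge request volume for yourself before deciding, counting hits per crawler in your access log is a reasonable start. The sketch below assumes combined log format and uses made-up log lines; the function name and token list are illustrative, and the substring match would also count a request whose path or referrer happens to contain a crawler name.

```python
from collections import Counter

# Tokens to look for; extend with any other crawler names from this guide.
AI_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")


def count_ai_requests(log_lines):
    """Count requests per AI crawler token found anywhere in each log line."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one crawler
    return counts


# Made-up log lines; the IP addresses come from documentation-only ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.5 - - [01/Jan/2026:00:00:02 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.9 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2026:00:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/120.0"',
]

for token, hits in count_ai_requests(SAMPLE_LOG).most_common():
    print(f"{token}: {hits}")
```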

Diffbot (Diffbot)

Structured data extraction and knowledge graph building

Diffbot doesn't train a chatbot. Instead, it builds a structured knowledge graph of the web by extracting facts, entities, and relationships from pages. Other AI products then use this data for things like entity recognition.

Recommendation: Allow for knowledge graph inclusion
User-agent: Diffbot

FacebookBot (Meta)

Web crawling for Meta AI products (Llama, Meta AI)

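Collects training data for Meta's AI models, including Llama. It is often assumed to generate link previews on Facebook and Instagram as well, but Meta's crawler documentation attributes link previews to a separate user agent, facebookexternalhit. Check your server logs to see which agent fetches your shared links before blocking either one, so you don't break your social media card previews by accident.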

Recommendation: Allow (verify social preview behavior before blocking)
User-agent: FacebookBot

Applebot-Extended (Apple)

Training data for Apple Intelligence and Siri

Same logic as Google-Extended: this is separate from the main Applebot. Blocking Applebot-Extended won't affect Safari Suggestions or Siri web results. It only stops Apple from using your content to train Apple Intelligence features.

Recommendation: Allow for Apple Intelligence visibility
User-agent: Applebot-Extended

Browsing Crawlers

These crawlers fetch pages in real-time when users ask AI assistants to browse the web. Blocking them prevents live citations of your content.

ChatGPT-User (OpenAI)

Real-time web browsing when ChatGPT users click 'Browse'

When someone asks ChatGPT to look something up, this is the crawler that fetches the page. Block it and ChatGPT can't read your site during live conversations, so you lose out on real-time citations.

Recommendation: Allow if you want real-time citations
User-agent: ChatGPT-User

Claude-Web (Anthropic)

Real-time web access for Claude conversations

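Claude's equivalent of ChatGPT-User. It fetches pages in real time when someone asks Claude to look something up. If you allow ClaudeBot for training but block Claude-Web, Claude will know about your site but can't visit it during conversations. Anthropic's current documentation also lists a Claude-User agent for user-initiated fetches, so check your logs to see which token actually reaches your site.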

Recommendation: Allow for real-time citations
User-agent: Claude-Web

AI Search Crawlers

These crawlers build search indexes for AI-powered search engines. Blocking them removes your site from AI search results entirely.

OAI-SearchBot (OpenAI)

Powers ChatGPT's search feature and SearchGPT

SearchGPT was the prototype name for what is now ChatGPT's built-in search, and this crawler builds its index. Block it and your site disappears from ChatGPT search results entirely. If you care about being found through ChatGPT's search feature, this one matters.

Recommendation: Allow for search visibility
User-agent: OAI-SearchBot

PerplexityBot (Perplexity AI)

Web indexing for Perplexity's AI search engine

Perplexity is an AI search engine that always links to its sources. If you're blocked, you won't show up in Perplexity results at all. Given that Perplexity actually sends traffic back to your site (unlike pure training crawlers), this one's usually worth allowing.

Recommendation: Allow for Perplexity citations
User-agent: PerplexityBot

AI Crawler Comparison

| Crawler | Operator | Type | Purpose | Recommendation |
|---|---|---|---|---|
| GPTBot | OpenAI | training | Training data collection for ChatGPT and GPT models | Allow if you want AI visibility |
| ChatGPT-User | OpenAI | browsing | Real-time web browsing when ChatGPT users click 'Browse' | Allow if you want real-time citations |
| OAI-SearchBot | OpenAI | search | Powers ChatGPT's search feature and SearchGPT | Allow for search visibility |
| ClaudeBot | Anthropic | training | Training data collection for Claude models | Allow if you want Claude visibility |
| Claude-Web | Anthropic | browsing | Real-time web access for Claude conversations | Allow for real-time citations |
| anthropic-ai | Anthropic | training | General Anthropic crawler for AI research | Allow alongside ClaudeBot |
| PerplexityBot | Perplexity AI | search | Web indexing for Perplexity's AI search engine | Allow for Perplexity citations |
| Google-Extended | Google | training | Training data for Gemini and other Google AI products | Allow if you want Gemini visibility |
| CCBot | Common Crawl Foundation | training | Open web archive used by many AI companies for training | Allow for broad AI visibility |
| cohere-ai | Cohere | training | Training data for Cohere's enterprise AI models | Allow if targeting enterprise AI |
| Bytespider | ByteDance (TikTok) | training | Web crawling for ByteDance AI products | Block if not targeting Chinese AI market |
| Diffbot | Diffbot | training | Structured data extraction and knowledge graph building | Allow for knowledge graph inclusion |
| FacebookBot | Meta | training | Web crawling for Meta AI products (Llama, Meta AI) | Allow (verify social preview behavior before blocking) |
| Applebot-Extended | Apple | training | Training data for Apple Intelligence and Siri | Allow for Apple Intelligence visibility |

robots.txt Examples

Allow all AI crawlers (recommended for most sites)

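If your robots.txt does not mention a crawler, and has no User-agent: * group that disallows it, the crawler is allowed by default. You only need explicit rules if you previously blocked crawlers. This configuration explicitly allows ten of the crawlers covered above; add matching groups for cohere-ai, Diffbot, FacebookBot, and Bytespider if you want those as well: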

# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Applebot-Extended
Allow: /
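
If you would rather generate the file than maintain it by hand, a more compact form is possible: the Robots Exclusion Protocol (RFC 9309) lets several User-agent lines share one group of rules. The sketch below is one way to build such a group. The function name is hypothetical, and it is worth confirming that the crawlers you care about handle grouped User-agent lines before relying on the compact form.

```python
from urllib.robotparser import RobotFileParser


def build_group(user_agents, directive, path="/"):
    """Return one robots.txt group: several User-agent lines sharing one rule."""
    if directive not in ("Allow", "Disallow"):
        raise ValueError("directive must be 'Allow' or 'Disallow'")
    lines = [f"User-agent: {agent}" for agent in user_agents]
    lines.append(f"{directive}: {path}")
    return "\n".join(lines) + "\n"


# The ten tokens from the example above.
ALLOWED = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "anthropic-ai", "PerplexityBot", "Google-Extended", "CCBot",
    "Applebot-Extended",
]

robots_txt = "# Allow all AI crawlers\n" + build_group(ALLOWED, "Allow")
print(robots_txt)

# Sanity check with Python's parser that a grouped Disallow applies to
# every listed agent and to nobody else.
parser = RobotFileParser()
parser.parse(build_group(["GPTBot", "CCBot"], "Disallow").splitlines())
print(parser.can_fetch("GPTBot", "/"), parser.can_fetch("ClaudeBot", "/"))
```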

Allow browsing and search, block training

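Allow AI assistants to cite your content in real time, but prevent your content from being used in model training. The example blocks four training crawlers; add a Disallow group for each of the other training crawlers listed above (such as anthropic-ai or Applebot-Extended) if you want to cover them too: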

# Allow real-time browsing
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

# Allow AI search
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training data collection
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
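
A mixed policy like this is easy to break with a later edit, so it can help to write the intent down and compare it against the file. The sketch below does that with Python's standard urllib.robotparser. The function name and the EXPECTED table are hypothetical, and Python's parser only approximates how each crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The configuration from the example above.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

# Intended policy: True means the crawler should be able to fetch pages.
EXPECTED = {
    "ChatGPT-User": True, "Claude-Web": True,
    "OAI-SearchBot": True, "PerplexityBot": True,
    "GPTBot": False, "ClaudeBot": False,
    "CCBot": False, "Google-Extended": False,
}


def policy_mismatches(robots_txt, expected, path="/"):
    """Return {agent: actual} for every agent whose access differs from expected."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    actual = {agent: parser.can_fetch(agent, path) for agent in expected}
    return {agent: ok for agent, ok in actual.items() if ok != expected[agent]}


# An empty dict means the file matches the intended policy.
print(policy_mismatches(ROBOTS_TXT, EXPECTED))
```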

Block all AI crawlers

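If you want to prevent all AI access to your content, use the configuration below. It covers twelve of the crawlers in this guide; add groups for cohere-ai and FacebookBot to cover all 14: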

# Block all AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Check Which AI Crawlers Can Access Your Site

BotView checks your robots.txt against all 14 AI crawlers and shows you exactly what is allowed and blocked. The scan is free, takes about 30 seconds, and requires no account.

Frequently Asked Questions

What is an AI crawler?

An AI crawler is a bot that visits websites to collect content for AI systems. Think of it like Googlebot, but instead of indexing pages for search results, AI crawlers are grabbing content to train language models (GPTBot, ClaudeBot), power AI search engines (PerplexityBot), or let AI assistants browse the web in real-time (ChatGPT-User).

How do AI crawlers differ from Googlebot?

Googlebot crawls to index pages for Google Search results. AI crawlers serve different purposes: some collect training data (GPTBot, ClaudeBot), some power AI search engines (PerplexityBot, OAI-SearchBot), and some enable real-time browsing (ChatGPT-User, Claude-Web). Blocking Googlebot affects your Google rankings. Blocking AI crawlers affects whether AI systems can access or learn from your content, but does not affect traditional search rankings.

Should I block or allow AI crawlers?

It depends on your goals. If you want your content cited by AI assistants, appearing in AI search results, or included in AI training data, allow them. If you are concerned about AI companies using your content without compensation, you can block specific crawlers. Most businesses benefit from AI visibility. You can also selectively allow some crawlers while blocking others.

How do I check which AI crawlers can access my site?

Use BotView to scan your website. We check your robots.txt against all 14 major AI crawlers and show you exactly which ones are allowed and which are blocked. The scan is free and takes about 30 seconds.

Does blocking AI crawlers affect my Google ranking?

No. AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) are completely separate from Googlebot. Blocking AI crawlers has zero effect on your Google Search rankings. Google-Extended is the one token here that Google itself operates, but it controls AI training usage, not search indexing, so blocking it does not affect your Google rankings either.

What is the difference between training crawlers and browsing crawlers?

Training crawlers (GPTBot, ClaudeBot, CCBot) collect content to train AI models. This happens in bulk and affects the AI's general knowledge. Browsing crawlers (ChatGPT-User, Claude-Web) fetch specific pages in real-time when a user asks the AI to look something up. Blocking training crawlers prevents future learning; blocking browsing crawlers prevents live access.

How do I allow all AI crawlers at once?

If your robots.txt doesn't mention a crawler at all, it's allowed by default. So the simplest approach is to just not block them. The problem comes if you have a broad rule like 'User-agent: * / Disallow: /' that blocks everything. In that case, you need to add specific Allow rules for each AI crawler you want to let through.
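
Here is a small sketch of that situation, again using Python's standard urllib.robotparser as a stand-in for a crawler's own parser. The helper name is made up. It shows that a group naming a specific crawler takes precedence over the wildcard group for that crawler, while every other crawler still falls under the wildcard.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, path="/"):
    """Return True if Python's parser would let this agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# A broad rule that blocks every crawler.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"

# The same rule plus a specific group that lets GPTBot through.
WITH_EXCEPTION = "User-agent: GPTBot\nAllow: /\n\n" + BLOCK_EVERYTHING

print(allowed(BLOCK_EVERYTHING, "GPTBot"))   # False
print(allowed(WITH_EXCEPTION, "GPTBot"))     # True
print(allowed(WITH_EXCEPTION, "ClaudeBot"))  # False
```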

Can AI crawlers see JavaScript-rendered content?

Most can't. They see the raw HTML your server sends back, not the final page after JavaScript runs. If your site is built with React, Vue, or Angular and renders content client-side, AI crawlers might see a blank page. This catches a lot of people off guard. BotView shows you a side-by-side comparison of what humans see vs what crawlers see, so you can spot this instantly.
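
You can approximate what a non-rendering crawler sees by extracting the visible text from the raw HTML your server returns. The sketch below uses Python's standard html.parser. The class and function names and both sample pages are made up, and the fetch itself is left out: paste in whatever your server sends before any JavaScript runs.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text outside of script, style, and template elements."""

    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html):
    """Return the text a crawler that does not run JavaScript could read."""
    extractor = VisibleTextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return " ".join(extractor.chunks)


# Hypothetical responses: one server-rendered page, one client-rendered shell.
SERVER_RENDERED = "<html><body><h1>Pricing</h1><p>Plans start at $10.</p></body></html>"
CLIENT_RENDERED = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(repr(visible_text(SERVER_RENDERED)))  # 'Pricing Plans start at $10.'
print(repr(visible_text(CLIENT_RENDERED)))  # ''
```

An empty result for a page that looks full in your browser is a strong hint that the content only exists after JavaScript runs.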
