CCBot & robots.txt

Allow or Block Common Crawl

Common Crawl's open dataset is a common source of training data for AI models. Control whether your content is included.

What is CCBot?

CCBot is the web crawler run by Common Crawl, a non-profit organization that maintains one of the largest open web datasets in existence. Since 2008, Common Crawl has archived petabytes of web content, and this data is freely available for anyone to download and use.

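This matters for AI visibility because many large AI models are trained partly on data derived from Common Crawl. The published training details for GPT-3 (OpenAI) and LLaMA (Meta) both list Common Crawl as a major source, and many open-source models rely on it as well. Vendors of newer models such as GPT-4, Claude, and Gemini disclose less about their training data. Blocking CCBot is effectively blocking a shared pipeline to multiple AI systems at once.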

CCBot identifies itself with the user-agent token CCBot and respects robots.txt rules.
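To see CCBot activity in your own server logs, you can match on that token. The sketch below is a minimal illustration, not Common Crawl's own tooling: the helper name `is_ccbot` is hypothetical, and the sample header value `CCBot/2.0 (https://commoncrawl.org/faq/)` is an assumption about what the full User-Agent string looks like.

```python
import re

# Assumption: CCBot requests carry a User-Agent header containing the
# token "CCBot", for example "CCBot/2.0 (https://commoncrawl.org/faq/)".
CCBOT_PATTERN = re.compile(r"\bCCBot\b", re.IGNORECASE)


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the CCBot token."""
    return bool(CCBOT_PATTERN.search(user_agent or ""))


if __name__ == "__main__":
    samples = [
        "CCBot/2.0 (https://commoncrawl.org/faq/)",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ]
    for ua in samples:
        print(is_ccbot(ua), ua)
```

Keep in mind that a User-Agent header can be spoofed, so a match only shows what the client claims to be.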

Why CCBot Has Outsized Impact

Blocking GPTBot only blocks OpenAI's crawler, and blocking ClaudeBot only blocks Anthropic's. Blocking CCBot, by contrast, keeps your content out of future releases of a dataset that any of them can download, along with dozens of smaller AI companies and open-source projects.

robots.txt Syntax for CCBot

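Copy and paste these examples into your robots.txt file. Keep the lines of each group together, because some parsers treat a blank line as the end of a group.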

Allow CCBot (Recommended for AI visibility)

# Allow Common Crawl
User-agent: CCBot
Allow: /

Allows your content to be included in Common Crawl's open dataset, which feeds AI model training across multiple platforms.

Block CCBot

# Block Common Crawl
User-agent: CCBot
Disallow: /

Prevents future crawls. Note: Common Crawl archives past data indefinitely, so historical snapshots of your site may still exist in older datasets.
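Before deploying the block rule, you can check how a parser reads it. This is a minimal sketch using Python's standard-library `urllib.robotparser`. It shows how that parser interprets the rules, which is not necessarily identical to how CCBot itself does. The helper `can_crawl` and the `example.com` URLs are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# The block rule from above, written as one contiguous group.
BLOCK_RULES = """\
# Block Common Crawl
User-agent: CCBot
Disallow: /
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ccbot_allowed = can_crawl(BLOCK_RULES, "CCBot", "https://example.com/blog/post")
other_allowed = can_crawl(BLOCK_RULES, "Googlebot", "https://example.com/blog/post")
print(ccbot_allowed, other_allowed)  # False True
```

The rule applies only to the CCBot group, so other crawlers are unaffected.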

Partial Access with Crawl Delay

# Allow CCBot with rate limiting
User-agent: CCBot
Allow: /blog/
Allow: /docs/
Disallow: /private/
Disallow: /api/
Crawl-delay: 10

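CCBot respects Crawl-delay, which sets the number of seconds to wait between requests. In this example, Common Crawl can collect your public content at no more than one request every 10 seconds, while /private/ and /api/ stay off-limits.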
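The partial-access rules can be checked the same way. The sketch below again uses Python's standard-library `urllib.robotparser` as a stand-in parser, which is an assumption and not CCBot's own implementation. The test paths under `example.com` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

PARTIAL_RULES = """\
User-agent: CCBot
Allow: /blog/
Allow: /docs/
Disallow: /private/
Disallow: /api/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(PARTIAL_RULES.splitlines())

# Hypothetical paths used only to exercise each rule.
for path in ("/blog/post", "/private/report", "/api/v1/users", "/about"):
    allowed = parser.can_fetch("CCBot", "https://example.com" + path)
    print(f"{path}: {'allowed' if allowed else 'blocked'}")

# Crawl-delay value for the CCBot group, in seconds.
print("delay:", parser.crawl_delay("CCBot"))
```

Note that `/about` is reported as allowed: paths that match no rule stay crawlable by default. To restrict CCBot to `/blog/` and `/docs/` only, a catch-all `Disallow: /` would have to be added after the Allow lines.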

When to Allow vs Block CCBot

Allow CCBot When...

✓ You want maximum AI visibility across all platforms
✓ You want your content in open research datasets
✓ You publish open-access or educational content
✓ You support the open data ecosystem

Block CCBot When...

✗ You want maximum control over AI training usage
✗ Your content is premium or subscription-based
✗ You have strict copyright or licensing terms
✗ You want to limit crawl load on your server

Frequently Asked Questions

What is CCBot and Common Crawl?

CCBot is the web crawler operated by Common Crawl, a non-profit that maintains an open dataset of web pages. This dataset is freely available and is a common source of training data for large AI models, including ones from OpenAI and Meta. Blocking CCBot doesn't just affect one AI; it affects the training data pipeline for many.

Which AI models use Common Crawl data?

Many major AI models are trained partly on Common Crawl data. GPT-3 (OpenAI) and LLaMA (Meta) both documented it as a major training source, and many open-source models use it too. Vendors of newer models such as GPT-4, Claude (Anthropic), and Gemini (Google) disclose less about their training data. Common Crawl is one of the largest publicly available web datasets, containing petabytes of web pages collected since 2008.

If I block GPTBot but allow CCBot, can OpenAI still use my content?

Potentially, yes. If CCBot crawls your content and includes it in the Common Crawl dataset, that data is publicly available for anyone to use — including OpenAI. For maximum control, you would need to block both GPTBot and CCBot. However, Common Crawl data may already include historical snapshots of your site.
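A combined rule for both crawlers can share one group with two User-agent lines. The sketch below checks such a group with Python's standard-library `urllib.robotparser`. The parser choice and the `example.com` URL are assumptions for illustration, and Googlebot is included only to show that unlisted crawlers are unaffected.

```python
from urllib.robotparser import RobotFileParser

# One group that applies to both GPTBot and CCBot.
BOTH_BLOCKED = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BOTH_BLOCKED.splitlines())

# Map each crawler token to whether it may fetch the site root.
results = {
    agent: parser.can_fetch(agent, "https://example.com/")
    for agent in ("GPTBot", "CCBot", "Googlebot")
}
print(results)  # {'GPTBot': False, 'CCBot': False, 'Googlebot': True}
```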

Does blocking CCBot affect my Google Search ranking?

No. CCBot is operated by Common Crawl, which is completely independent from Google. Blocking CCBot has no impact on Googlebot, Google Search ranking, or Google's index. It only affects whether your content appears in Common Crawl's open dataset.

How often does CCBot crawl websites?

Common Crawl runs major crawls approximately monthly, collecting billions of pages per crawl. CCBot respects robots.txt rules and the Crawl-delay directive. The full dataset is released publicly after each crawl cycle and is archived indefinitely.
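Because released crawls stay available, you may want to look up whether a given crawl already contains pages from your domain. The sketch below only builds a lookup URL and makes no network request. It assumes Common Crawl's public URL index is served at `index.commoncrawl.org` with a `<crawl-id>-index` endpoint that accepts `url` and `output` parameters. The crawl ID `CC-MAIN-2024-10` is used purely as an example, and the function name is hypothetical. Verify the endpoint and the current crawl IDs against Common Crawl's own documentation before relying on this.

```python
from urllib.parse import urlencode

# Assumption: host of Common Crawl's public URL index service.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, domain: str) -> str:
    """Build a lookup URL for all captures under `domain` in one crawl.

    Both the endpoint layout and the parameter names are assumptions;
    this function only formats a string and performs no request.
    """
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


print(build_index_query("CC-MAIN-2024-10", "example.com"))
```

Opening the resulting URL in a browser or fetching it with an HTTP client should, if the assumptions hold, return one JSON record per captured page.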

Related robots.txt Guides

GPTBot & robots.txt — OpenAI / ChatGPT
ClaudeBot & robots.txt — Anthropic / Claude
PerplexityBot & robots.txt — Perplexity AI
Google-Extended & robots.txt — Google / Gemini
Check all 14 AI crawlers at once
