GPTBot vs ClaudeBot vs CCBot: Most-Blocked AI Crawler?

Jun 13, 2026

When a site decides to block AI, which company's crawler does it refuse first? The sealed data gives a clear ranking. Across the 107 prominent websites that returned a readable robots.txt on June 13, 2026, Common Crawl is blocked by 40 sites, Anthropic by 39, ByteDance by 37, and OpenAI by 35. The "AI crawler" is not blocked as one thing — each operator faces a different reception.

This report measures blocking at the operator level: a site is counted as blocking a company if its robots.txt names any one of that company's crawler user-agents with a Disallow: / rule. That is a deliberately broad test, because several operators run more than one bot — a training crawler, a search-grounding fetcher, a live user-action agent — and blocking any of them signals refusal of that company.
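The operator-level test described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for this report: the function names and the simplified group parsing are assumptions, and the user-agent lists cover only the Anthropic and OpenAI names cited later in this report. A wildcard `User-agent: *` group is deliberately ignored, because the test requires a site to name the crawler.

```python
# Hypothetical operator map, limited to user-agents named in this report.
OPERATOR_AGENTS = {
    "Anthropic": ["ClaudeBot", "anthropic-ai", "Claude-Web",
                  "Claude-User", "Claude-SearchBot"],
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
}


def explicitly_blocked_agents(robots_txt: str) -> set[str]:
    """Return lowercase user-agents that are named with a 'Disallow: /' rule."""
    blocked: set[str] = set()
    group: list[str] = []   # user-agents of the group currently being read
    in_rules = False        # True once the group has at least one rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a rule was seen, so a new group starts
                group, in_rules = [], False
            group.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(agent for agent in group if agent != "*")
    return blocked


def blocked_operators(robots_txt: str) -> set[str]:
    """An operator counts as blocked if any one of its user-agents is blocked."""
    blocked = explicitly_blocked_agents(robots_txt)
    return {
        operator
        for operator, agents in OPERATOR_AGENTS.items()
        if any(agent.lower() in blocked for agent in agents)
    }


sample = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""
print(sorted(blocked_operators(sample)))  # ['Anthropic', 'OpenAI']
```

The sketch treats any `Disallow: /` as a block even if the same group also carries `Allow` lines, which is a simplification a production parser would need to revisit.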

All figures are verbatim counts from robots.txt files fetched and sealed point-in-time across a curated set of 122 prominent sites in 10 categories; 107 returned a parseable file. Percentages are over those 107. The numbers will not change as sites later edit their policies.

The Operator Leaderboard

Ranked by the number of sites blocking at least one of each company's crawlers, the 12 tracked operators sort like this:

Operator	Sites Blocking	Block Rate
Common Crawl	40	37.4%
Anthropic	39	36.4%
ByteDance	37	34.6%
OpenAI	35	32.7%
Meta	35	32.7%
Apple	31	29.0%
Diffbot	30	28.0%
Perplexity	29	27.1%
Cohere	27	25.2%
Google	25	23.4%
Amazon	22	20.6%
Mistral	12	11.2%
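Each block rate in the table is the count of blocking sites divided by the 107 sites with a readable robots.txt, shown to one decimal place. A short sketch of that arithmetic follows; the variable and function names are illustrative, while the counts are copied from the table.

```python
READABLE_SITES = 107  # sites that returned a parseable robots.txt

# Sites blocking at least one of each operator's crawlers (from the table).
blocking_sites = {
    "Common Crawl": 40, "Anthropic": 39, "ByteDance": 37, "OpenAI": 35,
    "Meta": 35, "Apple": 31, "Diffbot": 30, "Perplexity": 29,
    "Cohere": 27, "Google": 25, "Amazon": 22, "Mistral": 12,
}


def block_rate(sites_blocking: int, total: int = READABLE_SITES) -> str:
    """Format the share of sites blocking as a percentage with one decimal."""
    return f"{100 * sites_blocking / total:.1f}%"


for operator, count in blocking_sites.items():
    print(f"{operator}\t{count}\t{block_rate(count)}")
```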

The headline rivalry, OpenAI versus Anthropic, is closer than the popular narrative suggests. Anthropic is blocked by 39 sites (36.4%) and OpenAI by 35 (32.7%), a four-site gap. Both trail Common Crawl, whose single crawler, CCBot, is blocked by 40 sites (37.4%) and remains the crawler that site owners most associate with their content feeding open training corpora.

"At the operator level, Common Crawl (40 sites, 37.4%) and Anthropic (39 sites, 36.4%) are the two most-blocked AI operators across the 107 prominent sites with a readable robots.txt on June 13, 2026."

Why Anthropic Edges Out OpenAI Here

The four-site spread between Anthropic and OpenAI is small, but the shape of it is informative. Anthropic operates the widest fleet of named crawlers in this dataset — ClaudeBot, anthropic-ai, Claude-Web, Claude-User, and Claude-SearchBot — so a site that wants to be thorough about excluding Anthropic has more user-agents to name, and the operator-level test catches any of them. OpenAI's footprint in robots.txt is more concentrated around GPTBot, OAI-SearchBot, and ChatGPT-User.

That is why the single-bot ranking and the operator ranking disagree slightly. Measured as a lone user-agent, ClaudeBot is blocked by 38 sites and GPTBot by 33. Rolled up to the operator, Anthropic reaches 39 and OpenAI 35, because the secondary user-agents add coverage. The lesson for anyone reading these numbers: "is GPTBot blocked" and "is OpenAI blocked" are different questions with different answers.
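To make the difference between the two questions concrete, here is a small sketch using three hypothetical sites (the domains and their block lists are invented for illustration and are not drawn from the sealed set). A site that names only a secondary user-agent counts toward the operator total but not toward the GPTBot total.

```python
OPENAI_AGENTS = {"GPTBot", "OAI-SearchBot", "ChatGPT-User"}

# Hypothetical sites mapped to the OpenAI user-agents each blocks outright.
sites = {
    "site-a.example": {"GPTBot", "OAI-SearchBot", "ChatGPT-User"},
    "site-b.example": {"ChatGPT-User"},  # names only the user-action agent
    "site-c.example": set(),             # names no OpenAI crawler
}

# "Is GPTBot blocked?" counts only sites naming that one user-agent.
gptbot_count = sum("GPTBot" in blocked for blocked in sites.values())

# "Is OpenAI blocked?" counts sites naming any OpenAI user-agent.
operator_count = sum(bool(blocked & OPENAI_AGENTS) for blocked in sites.values())

print(gptbot_count, operator_count)  # 1 2
```

In the report's data the same effect accounts for the difference between GPTBot's 33 sites and OpenAI's 35, and between ClaudeBot's 38 and Anthropic's 39.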

The Quiet End of the List

The bottom of the leaderboard is as revealing as the top. Mistral is blocked by only 12 sites (11.2%), less than a third of Common Crawl's 37.4% rate. Amazon (20.6%) and Google (23.4%) also sit well below the leaders.

Least-Blocked Operators	Sites Blocking	Block Rate
Mistral	12	11.2%
Amazon	22	20.6%
Google	25	23.4%
Cohere	27	25.2%

Two forces explain the low end. The first is awareness: a site can only block a user-agent it knows to name, and newer or lower-profile crawlers simply have not made it onto most block lists yet. Mistral's 11.2% likely reflects that it is not yet on site owners' radar rather than a deliberate welcome. The second is entanglement: Google-Extended's relatively low 23.4% block rate reflects how reluctant site owners are to do anything that might feel adjacent to their Google Search relationship, even though Google-Extended is a separate, search-independent control. Amazon's low rate may have a similar cause, since Amazonbot may serve functions that site owners do not want to lose.

The Crawler-Level Detail

For teams that manage their own robots.txt, the operator roll-up is less actionable than the specific bot names. Here is the per-bot block count for the nine highest-profile crawlers:

AI Crawler	Operator	Sites Blocking	Block Rate
CCBot	Common Crawl	40	37.4%
ClaudeBot	Anthropic	38	35.5%
Bytespider	ByteDance	37	34.6%
GPTBot	OpenAI	33	30.8%
Applebot-Extended	Apple	31	29.0%
Meta-ExternalAgent	Meta	30	28.0%
PerplexityBot	Perplexity	29	27.1%
Google-Extended	Google	25	23.4%
Amazonbot	Amazon	22	20.6%

If you maintain a site and want to mirror what the most defensive prominent publishers do, this is the order in which they most often block: CCBot first, then ClaudeBot and Bytespider, then GPTBot. If you run AI retrieval or research workflows and want to know which crawlers face the most closed doors, the same ranking tells you where source coverage will be thinnest.

Which Sites Refuse the Most Operators

The operator leaderboard is an aggregate; the per-domain record underneath it shows where the refusals concentrate. A small number of sites refuse all or nearly all of the nine headline crawlers, while a majority (59 of 107, about 55%) refuse none.

Posture toward AI operators	Sites	Examples (from the sealed set)
Refuse all 9 headline crawlers	3	bbc.com, bloomberg.com, usatoday.com
Refuse 8 of 9	15	nytimes.com, cnn.com, forbes.com, wired.com, congress.gov
Refuse none	59	reuters.com, wsj.com, wikipedia.org, github.com, walmart.com, cdc.gov

The three sites that refuse all nine headline crawlers (bbc.com, bloomberg.com, and usatoday.com) are the maximalists, and they are all news organizations whose entire business is the reporting an AI assistant would otherwise summarize away. Right behind them, 15 sites refuse eight of the nine, almost always letting a single search-grounding crawler through so they stay discoverable in AI search while staying out of training runs.

At the other extreme, 59 of the 107 readable-policy sites refuse no operator at all. This is the quiet center of gravity in the operator data: the median prominent site, across retail, finance, travel, education, and government, blocks zero AI operators. reuters.com and wsj.com sit here despite being premium news brands, as do wikipedia.org, github.com, walmart.com, and government domains such as cdc.gov (though not every government site: congress.gov refuses eight of the nine). When you read "Anthropic is blocked by 39 sites," the necessary complement is that most prominent sites block neither Anthropic nor anyone else.
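The posture buckets above can be reproduced by counting, per site, how many of the nine headline crawlers are refused. The sketch below assumes each site's blocked user-agents have already been extracted; the example domains and their block lists are hypothetical, and the "refuse 1 to 7" label is added here only to catch sites outside the three buckets the table reports.

```python
from collections import Counter

# The nine headline crawlers from the crawler-level table.
HEADLINE_CRAWLERS = {
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Applebot-Extended",
    "Meta-ExternalAgent", "PerplexityBot", "Google-Extended", "Amazonbot",
}


def posture(blocked_agents) -> str:
    """Bucket a site by how many headline crawlers it refuses."""
    refused = len(HEADLINE_CRAWLERS & set(blocked_agents))
    if refused == len(HEADLINE_CRAWLERS):
        return "refuse all 9"
    if refused == len(HEADLINE_CRAWLERS) - 1:
        return "refuse 8 of 9"
    if refused == 0:
        return "refuse none"
    return "refuse 1 to 7"


# Hypothetical sites, not taken from the sealed set.
sites = {
    "maximalist.example": set(HEADLINE_CRAWLERS),
    "selective.example": HEADLINE_CRAWLERS - {"PerplexityBot"},
    "partial.example": {"CCBot", "GPTBot"},
    "open.example": set(),
}

tally = Counter(posture(blocked) for blocked in sites.values())
for bucket, count in tally.items():
    print(bucket, count)
```

If the three reported buckets are mutually exclusive, they cover 77 of the 107 sites (3 + 15 + 59), which by subtraction leaves 30 sites refusing between one and seven of the nine.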

Why the Operator View Is the One That Matters

It is tempting to track a single bot — usually GPTBot, because it is the most recognizable name — and treat its block rate as "the AI blocking number." This dataset shows why that is a mistake. Every major operator now runs more than one crawler, each with a different job: a training scraper, a search-grounding fetcher, and increasingly a live user-action agent that visits a page because a person asked an assistant to. Blocking one of those is a fundamentally different decision from blocking all of them, and a single-bot metric collapses that distinction.

The Anthropic-versus-OpenAI gap is the clearest illustration. As single user-agents, ClaudeBot (38) and GPTBot (33) are five sites apart; rolled up to the operator, Anthropic (39) and OpenAI (35) are four apart. The numbers move because secondary user-agents add coverage: one site blocks an Anthropic crawler without blocking ClaudeBot, and two sites block an OpenAI crawler without blocking GPTBot. Anyone making a sourcing or policy decision off "is GPTBot blocked" is answering a narrower question than they think they are.

For teams that maintain their own robots.txt, the practical takeaway is to decide at the operator level and then enumerate every one of that operator's user-agents, not just its flagship. A block list that names GPTBot but omits OAI-SearchBot and ChatGPT-User, or names ClaudeBot but omits anthropic-ai and Claude-User, is a partial block that may not do what its author intends. The most defensive publishers in this set are thorough precisely because they treat the company, not the bot, as the unit of decision.

The same logic applies in reverse for anyone building retrieval or research automations. Under this report's test, an operator counted as blocked on a given site has at least one of its crawlers refused there, and on the most defensive sites most or all of them. The practical question is therefore which companies your pipeline can rely on for a given source, checked across every user-agent that company sends, rather than a single user-agent string.
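For robots.txt maintainers, the "decide per operator, then enumerate every user-agent" advice can be sketched as a small generator. This is an illustrative helper, not a tool from this report: the function name is hypothetical, and the user-agent lists are limited to the names this report mentions, so they should be checked against each operator's current documentation before use.

```python
# Hypothetical operator map, limited to user-agents named in this report.
OPERATOR_AGENTS = {
    "Anthropic": ["ClaudeBot", "anthropic-ai", "Claude-Web",
                  "Claude-User", "Claude-SearchBot"],
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
}


def render_block(operator: str) -> str:
    """Render one robots.txt group that refuses every listed user-agent."""
    agents = OPERATOR_AGENTS[operator]
    # Several User-agent lines may share the rules that follow them.
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


print(render_block("OpenAI"))
```

Generating the group from one operator map, rather than hand-editing the file, makes it harder to add a flagship bot and forget its siblings.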

Put This Data to Work

The operator leaderboard is a moving target, and the movement is the signal. The day a major publisher adds or drops a specific company's crawler is a leading indicator — of a licensing deal, a policy shift, or a legal posture change. Catching that day requires monitoring, not a one-time look.

This is precisely the kind of recurring intelligence US Tech Automations sets up for operations and marketing teams. An automation specialist can build a workflow that re-fetches these robots.txt files on a cadence, diffs each operator's status per domain, and alerts the content owner or analyst the moment a crawler's access changes. For a RevOps or data team running retrieval pipelines, US Tech Automations can keep your allowed-source list synchronized with each operator's real, current access posture — so your automations never depend on a source that has quietly closed. The same US Tech Automations sealed-fetch-and-diff approach behind this research is reusable for any team that needs to track competitor or supplier behavior on a schedule.
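The re-fetch-and-diff idea described above might be sketched as follows. This is an illustration under stated assumptions, not US Tech Automations' implementation: it assumes each snapshot has already been reduced to a mapping from domain to the set of blocked operators, and the function name, domains, and data are hypothetical.

```python
def diff_snapshots(previous: dict, current: dict) -> list:
    """List (domain, operator, change) tuples between two snapshots."""
    changes = []
    for domain in sorted(set(previous) | set(current)):
        before = previous.get(domain, set())
        after = current.get(domain, set())
        for operator in sorted(after - before):
            changes.append((domain, operator, "newly blocked"))
        for operator in sorted(before - after):
            changes.append((domain, operator, "no longer blocked"))
    return changes


# Hypothetical snapshots taken on two different days.
yesterday = {
    "news.example": {"OpenAI", "Anthropic"},
    "shop.example": set(),
}
today = {
    "news.example": {"Anthropic"},   # OpenAI dropped from the block list
    "shop.example": {"ByteDance"},   # ByteDance added
}

for change in diff_snapshots(yesterday, today):
    print(change)
```

Each tuple in the result is a candidate alert; how it is delivered (email, chat, ticket) is left out of the sketch.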

Frequently Asked Questions

Why does Anthropic rank above OpenAI when GPTBot and ClaudeBot are close?
ClaudeBot already leads GPTBot as a single user-agent, 38 sites to 33. This leaderboard counts operators, not single bots, so secondary user-agents add to both totals: Anthropic reaches 39 sites and OpenAI 35, which narrows the gap from five sites to four. Anthropic also runs more named crawlers, so a thorough block list has more of them to name.

Is Common Crawl an "AI company"?
Common Crawl is a nonprofit that publishes a large open web-crawl archive, which has been a major training-data source for many LLMs. Site operators treat CCBot as an AI-training proxy, which is why it tops the block list at 37.4%.

Why is Mistral blocked so rarely?
Mistral's 11.2% rate most likely reflects lower awareness (its user-agent is simply not yet on most sites' block lists) rather than a deliberate decision to welcome it. Block lists lag crawler awareness.

Does a low block rate mean a crawler is welcome?
Not necessarily. A low rate can mean genuine acceptance, or it can mean site owners have not gotten around to naming that bot yet. The data measures explicit refusal, not explicit welcome.

Why is Google-Extended blocked less than CCBot or ClaudeBot?
At 23.4%, Google-Extended sits well below the leaders, largely because site owners are reluctant to take any action that feels adjacent to their Google Search relationship — even though Google-Extended is a separate, search-independent control that does not affect indexing. That caution keeps its block rate lower than its training-data role alone would predict.

Key Takeaways

At the operator level, Common Crawl (40 sites, 37.4%) and Anthropic (39, 36.4%) are the two most-blocked AI operators across 107 prominent sites with a readable robots.txt.
OpenAI is blocked by 35 sites (32.7%), four behind Anthropic — closer than the popular framing suggests.
Operator-level and single-bot rankings differ: ClaudeBot alone is blocked by 38 sites and GPTBot by 33, but Anthropic's wider crawler fleet lifts its operator total to 39.
Mistral (11.2%), Amazon (20.6%), and Google (23.4%) are the least-blocked, driven by low awareness and reluctance to disturb search relationships.
Every figure is a verbatim count from robots.txt sealed point-in-time on June 13, 2026.


Cite this report

US Tech Automations Research, 2026-06 edition. “GPTBot vs ClaudeBot vs CCBot: Most-Blocked AI Crawler?” https://ustechautomations.com/resources/blog/gptbot-vs-claudebot-vs-ccbot-most-blocked-ai-crawler-2026

Sealed snapshot sha256: 741353c4304216ee
