
How Many Top Sites Block CCBot? Sealed robots.txt Data

Jun 13, 2026

Key Takeaways

40 of 107 top sites block CCBot.

CCBot is blocked at a 37.4% rate across 107 sites.

48 of 107 sites block at least one AI crawler.

Of the 122 prominent sites in our starting universe, 107 returned a parseable robots.txt file. Of those 107, exactly 40 block CCBot — a block rate of 37.4%. That places CCBot at the top of the nine-crawler leaderboard tracked in this edition. No other measured bot is refused by more of these sites.

CCBot is the crawler operated by Common Crawl, which publishes its archive of web content as a public dataset. That dataset is widely used as a training source for large language models built by many different organizations. This downstream reach is why publishers often treat CCBot as a proxy for broad AI training exposure, even when they have no direct contractual relationship with Common Crawl.

Of 107 prominent sites with a parseable robots.txt, 40 block CCBot — a 37.4% block rate, the highest across all 9 crawlers we measured in June 2026.

48 of 107 sites (44.9%) block at least one AI crawler; CCBot's block rate of 37.4% sits below that corpus-wide line.

What Is CCBot and Why Do Publishers Block It

CCBot is operated by Common Crawl, a nonprofit organization that archives large portions of the web and publishes that archive as an open dataset. Because the dataset is openly available, any organization building a language model can use it as training data without negotiating directly with Common Crawl or with individual publishers.

That open-dataset model creates an asymmetry: a publisher who wants to withhold their content from AI training cannot target every downstream model individually. Blocking CCBot in robots.txt is a single lever that can, in principle, reduce exposure across many training pipelines that rely on the Common Crawl archive.

The degree to which that lever is effective depends entirely on whether a given crawler respects robots.txt instructions, which is an honor-system standard. We discuss that limitation in detail in the FAQ below.

Across the operator leaderboard, Common Crawl is blocked by 40 sites — the highest operator count in our measurement. That figure is identical to the CCBot count because Common Crawl operates a single primary crawler under this user-agent string, so the operator and bot numbers coincide here.

For comparison, Anthropic — whose ClaudeBot is the second-highest-blocked individual crawler — is blocked by 39 sites as an operator. You can read the companion report on how many sites block ClaudeBot to see how that crawler-level picture differs from the operator view.

Methodology

US Tech Automations Research collected publicly accessible robots.txt files from 122 prominent sites on June 13, 2026. A site was included in the measurement universe if it was well known and had public web content. We fetched each robots.txt, parsed the file, and checked whether a User-agent: CCBot or User-agent: * group contained a Disallow: / rule (or an equivalent broad disallow) that would apply to CCBot.
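The check described above can be sketched in Python with the standard library's urllib.robotparser. This is a minimal illustration, not the parser used for this report: the function name blocks_bot, the example.com URL, and the sample rule strings are all assumptions, and the sketch treats a site as blocking only when the site root is disallowed for the bot.

```python
from urllib.robotparser import RobotFileParser


def blocks_bot(robots_txt: str, bot: str = "CCBot",
               site: str = "https://example.com") -> bool:
    """Return True if the given robots.txt text disallows `bot` at the site root.

    Hypothetical helper: the report's own parsing rules are not published.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network fetch is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(bot, site + "/")


# Sample rule sets (assumptions for illustration only).
explicit_block = "User-agent: CCBot\nDisallow: /\n"
wildcard_block = "User-agent: *\nDisallow: /\n"
partial_block = "User-agent: CCBot\nDisallow: /private/\n"

for name, text in [("explicit", explicit_block),
                   ("wildcard", wildcard_block),
                   ("partial", partial_block)]:
    print(name, blocks_bot(text))
```

The explicit and wildcard samples are reported as blocking, while the partial sample is not, because only a subdirectory is disallowed there. Whether a narrower disallow should count as an "equivalent broad disallow" is a judgment call that this sketch does not attempt to settle.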

Of the 122 sites in the universe, 107 returned a parseable robots.txt file. The remaining 15 either returned no robots.txt or returned one that could not be parsed. Those 15 sites are excluded from the denominator used throughout this report; all percentages are calculated against 107.

Every count in this report is a verbatim read from the raw text of a public file; nothing is estimated, modeled, or extrapolated. The snapshot was sealed with sha256 hash 741353c4304216ee, and the data window is a single point in time: June 13, 2026.
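The report does not describe how the seal was computed. As one hypothetical approach, a snapshot of fetched files could be serialized in a canonical order and hashed with SHA-256, as sketched below. The function name seal_snapshot, the JSON serialization, and the sample records are assumptions. A full SHA-256 hex digest has 64 characters, so the 16-character value quoted in this report may be a shortened form; that too is an assumption.

```python
import hashlib
import json


def seal_snapshot(records: dict[str, str]) -> str:
    """Return a SHA-256 hex digest over a snapshot of {site: robots.txt text}.

    Hypothetical scheme: keys are sorted so the digest does not depend on
    the order in which sites were fetched.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Sample snapshot (placeholder sites, not data from this report).
snapshot = {
    "site-b.example": "User-agent: *\nDisallow:\n",
    "site-a.example": "User-agent: CCBot\nDisallow: /\n",
}
digest = seal_snapshot(snapshot)
print(digest)       # full 64-character hex digest
print(digest[:16])  # a shortened display form (assumption)
```

Any change to any stored file changes the digest, which is what lets a reader confirm that later analysis ran against the same snapshot.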

Metric	Value
Sites in starting universe	122
Sites with a parseable robots.txt	107
Sites blocking CCBot	40
Block rate	37.4%

Sites That Block CCBot

The 40 sites that explicitly block CCBot in their robots.txt span news, media, technology, e-commerce, entertainment, professional networks, and government resources. The breadth of the category representation signals that concern about AI-training crawls is not confined to one sector.

In the news and media segment, blockers include nytimes.com, washingtonpost.com, theguardian.com, bbc.com, cnn.com, apnews.com, bloomberg.com, forbes.com, businessinsider.com, theatlantic.com, usatoday.com, newsweek.com, vox.com, rollingstone.com, variety.com, hollywoodreporter.com, and billboard.com. These outlets derive significant value from their original reporting and have consistently been among the first movers on AI-crawler policy.

Technology and research publications blocking CCBot include techcrunch.com, theverge.com, wired.com, arstechnica.com, cnet.com, zdnet.com, mashable.com, gizmodo.com, venturebeat.com, and dictionary.com. That last site is a useful reminder that reference content — not just journalism — is treated as a protected asset.

The health segment is represented by webmd.com and healthline.com. Among professional and user-generated platforms, goodreads.com, amazon.com, ebay.com, linkedin.com, and tumblr.com all block CCBot. The media and review category is rounded out by vimeo.com, tripadvisor.com, and yelp.com, and the entertainment vertical adds espn.com. Government content appears too: congress.gov blocks CCBot, a choice that reflects policy rather than commercial interest.

Notable allowers in this snapshot include reuters.com, wsj.com, latimes.com, wikipedia.org, britannica.com, merriam-webster.com, and github.com. Financial sites such as marketwatch.com and morningstar.com also permit CCBot. Many e-commerce destinations — walmart.com, target.com, bestbuy.com, etsy.com, homedepot.com, wayfair.com, ikea.com, nordstrom.com, nike.com, and shopify.com — do not block CCBot. Government portals including cdc.gov, medlineplus.gov, usa.gov, irs.gov, sec.gov, whitehouse.gov, census.gov, nasa.gov, and uspto.gov likewise allow it.

Streaming and entertainment services netflix.com, spotify.com, youtube.com, and hulu.com all appear in the allower list, as do travel aggregators expedia.com, booking.com, airbnb.com, kayak.com, marriott.com, and hilton.com. Educational institutions mit.edu, harvard.edu, stanford.edu, coursera.org, and edx.org do not block CCBot.

Cross-Bot Leaderboard (all 107 sites)

Measuring nine crawlers against the same corpus of 107 sites produces a directly comparable leaderboard. CCBot holds the top position. The table below shows counts and rates for all 9 bots. For the detailed story on a specific crawler, follow the sibling links in the text.

Bot	Sites Blocking	Block Rate
CCBot	40	37.4%
ClaudeBot	38	35.5%
Bytespider	37	34.6%
GPTBot	33	30.8%
Applebot-Extended	31	29.0%
Meta-ExternalAgent	30	28.0%
PerplexityBot	29	27.1%
Google-Extended	25	23.4%
Amazonbot	22	20.6%
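The Block Rate column can be reproduced from the Sites Blocking counts and the 107-site denominator. The short Python sketch below does that arithmetic; the counts are copied from the table, while the names TOTAL_SITES, blocking_counts, and block_rate are illustrative.

```python
TOTAL_SITES = 107  # sites with a parseable robots.txt

# Counts copied from the leaderboard table above.
blocking_counts = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}


def block_rate(count: int, total: int = TOTAL_SITES) -> float:
    """Percentage of sites blocking a bot, rounded to one decimal place."""
    return round(100 * count / total, 1)


rates = {bot: block_rate(n) for bot, n in blocking_counts.items()}
for bot, rate in rates.items():
    print(f"{bot}: {rate:.1f}%")
```

The same function applied to the 48 sites that block at least one AI crawler gives the 44.9% figure quoted elsewhere in this report.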

The spread from CCBot at 40 down to Amazonbot at 22 is 18 sites, or 16.8 percentage points. The top three (CCBot, ClaudeBot, and Bytespider) are clustered within three sites of each other. The middle tier (GPTBot, Applebot-Extended, Meta-ExternalAgent, and PerplexityBot) is a modest step down. Google-Extended and Amazonbot trail the pack. Note that CCBot being at the top does not mean publishers specifically target it more aggressively than other bots; it may reflect that CCBot was among the earliest AI-related crawlers to attract policy attention, and many robots.txt files carry historically accumulated rules.

If you want to compare the training-crawler picture with ClaudeBot's profile, see how many sites block ClaudeBot. For the ByteDance crawler story, see how many sites block Bytespider.

Operator Leaderboard (all 107 sites)

One operator may run more than one crawler, so blocking all crawlers from an operator requires multiple user-agent rules. The table below aggregates by operator — counting a site once if it blocks at least one crawler from that organization. A high operator count means that publishers are signaling concern about that organization's AI access, regardless of which specific crawler is named.

Rank	Operator	Sites Blocking
1	Common Crawl	40
2	Anthropic	39
3	ByteDance	37
4	OpenAI	35
4	Meta	35
6	Apple	31
7	Diffbot	30
8	Perplexity	29
9	Cohere	27
10	Google	25
11	Amazon	22
12	Mistral	12

Common Crawl leads at 40, which is the same as the CCBot count because Common Crawl operates a single primary crawler token. Anthropic is close behind at 39 — a combined view of ClaudeBot and any other Anthropic-labeled tokens. At the bottom, Mistral appears in only 12 sites' disallow lists, suggesting that publishers are aware of its crawlers but treat them as a lower priority.
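The operator roll-up can be illustrated with a short Python sketch: a site counts once for an operator if it blocks any token mapped to that operator. The token-to-operator mapping and the sample sites below are assumptions for illustration; the report does not list which tokens were aggregated under each operator.

```python
from collections import defaultdict

# Illustrative mapping only; the report's full token list is not published.
BOT_OPERATOR = {
    "CCBot": "Common Crawl",
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
}


def operator_counts(blocked_by_site: dict[str, set[str]]) -> dict[str, int]:
    """Count sites per operator, counting each site at most once per operator."""
    sites_per_operator: dict[str, set[str]] = defaultdict(set)
    for site, bots in blocked_by_site.items():
        for bot in bots:
            operator = BOT_OPERATOR.get(bot)
            if operator is not None:
                sites_per_operator[operator].add(site)
    return {op: len(sites) for op, sites in sites_per_operator.items()}


# Placeholder sites, not data from this report.
sample = {
    "site-a.example": {"CCBot", "ClaudeBot", "anthropic-ai"},
    "site-b.example": {"anthropic-ai", "GPTBot"},
    "site-c.example": set(),
}
counts = operator_counts(sample)
print(counts)
```

In this sample, Anthropic is counted for two sites even though only one of them names ClaudeBot, which mirrors how an operator count can exceed the count for that operator's best-known crawler.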

The operator perspective matters for decision-making. A publisher concerned about any model trained on Common Crawl data needs only one robots.txt rule. A publisher concerned about all major AI operators would need rules covering every row in this table. You can explore individual crawler profiles — including GPTBot from OpenAI and Applebot-Extended from Apple — using the sibling links in this series.

Frequently Asked Questions

Q: Does blocking CCBot in robots.txt actually prevent Common Crawl from archiving the site?

A: Not with certainty. robots.txt is an honor-system standard. Common Crawl has historically stated that it respects robots.txt disallow directives, but compliance is not technically enforced. A publisher can verify whether a crawler respected their rules only by analyzing server logs. The robots.txt entry is the signal of intent, not a firewall.
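As a rough illustration of the server-log check mentioned above, the Python sketch below scans access-log lines for requests whose user-agent contains a given token. It assumes logs in the common "combined" format; the regular expression, the function name bot_requests, and the sample log lines are all assumptions. A user-agent string is self-reported, so a match is evidence rather than proof of who made the request.

```python
import re

# Matches the request, status, size, referer, and user-agent fields of a
# combined-format access log line (format is an assumption).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_requests(log_lines, token: str = "CCBot") -> list[tuple[str, int]]:
    """Return (path, status) for each request whose user-agent contains `token`."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and token.lower() in match.group("agent").lower():
            hits.append((match.group("path"), int(match.group("status"))))
    return hits


# Sample lines for illustration only.
sample_log = [
    '203.0.113.5 - - [13/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '198.51.100.7 - - [13/Jun/2026:10:00:05 +0000] "GET /news/story HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [13/Jun/2026:10:00:09 +0000] "GET /news/story HTTP/1.1" 200 5120 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]

hits = bot_requests(sample_log)
# Requests for anything other than robots.txt are the ones worth reviewing
# when the site disallows the bot entirely.
content_fetches = [hit for hit in hits if hit[0] != "/robots.txt"]
print(content_fetches)
```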

Q: Why does CCBot have the highest block rate when Common Crawl is a nonprofit?

A: The nonprofit status of Common Crawl does not determine how downstream model builders use the data. Because the archive is openly available, many commercial training pipelines ingest it. Publishers appear to reason that blocking the source archive reduces downstream exposure across multiple models simultaneously, making CCBot a high-priority target even though Common Crawl itself does not build consumer products.

Q: What does it mean that 48 of 107 sites (44.9%) block at least one AI crawler?

A: It means that close to half of these prominent sites have taken a policy position on at least one AI crawler. CCBot's block rate of 37.4% sits below that corpus-wide 44.9% figure, which is expected: any site that blocks CCBot also counts toward the at-least-one total. A site that blocks CCBot might or might not block other bots. This report does not count per-site combinations; it counts per bot across all 107 sites.
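The difference between a per-bot rate and the at-least-one rate can be shown with a small Python sketch. The four sample sites are placeholders, not data from this report, and the function name share_blocking is illustrative.

```python
from typing import Optional


def share_blocking(blocked_by_site: dict[str, set[str]],
                   bot: Optional[str] = None) -> float:
    """Percentage of sites blocking `bot`, or any bot at all when bot is None."""
    total = len(blocked_by_site)
    if bot is None:
        matching = sum(1 for bots in blocked_by_site.values() if bots)
    else:
        matching = sum(1 for bots in blocked_by_site.values() if bot in bots)
    return round(100 * matching / total, 1)


# Placeholder sites for illustration.
sample_sites = {
    "site-a.example": {"CCBot", "GPTBot"},
    "site-b.example": {"GPTBot"},
    "site-c.example": set(),
    "site-d.example": {"CCBot"},
}

any_rate = share_blocking(sample_sites)
ccbot_rate = share_blocking(sample_sites, "CCBot")
print(any_rate, ccbot_rate)
```

Because every site that blocks a given bot is also a site that blocks at least one bot, the per-bot rate can equal the at-least-one rate but never exceed it.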

Q: How should an SEO or content team interpret the allower list?

A: The allower list does not mean those sites endorse AI crawling. It means their robots.txt does not contain a broad disallow rule that applies to CCBot. Absence of a block is not an affirmative permission statement. Some sites may simply not have updated their robots.txt to address AI crawlers. Others may have made a deliberate choice to remain crawlable. You cannot infer intent from absence of a rule.

Q: How often will this data be updated?

A: This report reflects a point-in-time snapshot sealed June 13, 2026 (sha 741353c4304216ee). robots.txt files change continuously — a site could add or remove a rule at any time. US Tech Automations Research publishes updated editions periodically; subscribe or monitor the Closing Web series for refreshed snapshots.

Put AI-Access Data to Work

An SEO lead managing a portfolio of publisher clients needs to know — quickly and at scale — whether any site in the portfolio has changed its AI-crawler policy and whether those changes align with the client's content-protection strategy. Manually checking robots.txt across dozens of domains every week is error-prone and time-consuming.

A retrieval or data engineer building a training or RAG pipeline needs to know which high-value sources are off-limits before fetching content, not after a legal review surfaces the issue six months later.

US Tech Automations automates this monitoring: scheduled crawlers check robots.txt across a defined site portfolio, parse disallow rules per bot, and surface changes as structured alerts — no manual checking required. Explore the agentic workflow platform to see how continuous AI-access monitoring fits your stack.
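To make the idea of surfacing changes concrete, here is a minimal Python sketch that compares two snapshots of per-site blocked bots and emits one structured record per change. It is not the platform's implementation; the function name policy_changes, the record fields, and the sample sites are assumptions.

```python
def policy_changes(previous: dict[str, set[str]],
                   current: dict[str, set[str]]) -> list[dict[str, str]]:
    """List block rules added or removed between two snapshots."""
    changes = []
    for site in sorted(set(previous) | set(current)):
        before = previous.get(site, set())
        after = current.get(site, set())
        for bot in sorted(after - before):
            changes.append({"site": site, "bot": bot, "change": "block_added"})
        for bot in sorted(before - after):
            changes.append({"site": site, "bot": bot, "change": "block_removed"})
    return changes


# Placeholder snapshots for illustration.
last_week = {"site-a.example": {"CCBot"}, "site-b.example": set()}
this_week = {"site-a.example": set(), "site-b.example": {"CCBot", "GPTBot"}}

alerts = policy_changes(last_week, this_week)
for alert in alerts:
    print(alert)
```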

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Top Sites Block CCBot? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-ccbot-2026

Sealed snapshot sha256: 741353c4304216ee
