Who Blocks Common Crawl's CCBot? 40 of 107 Top Sites
Common Crawl is a nonprofit that publicly archives the web. Its bot, CCBot, is one of the most widely known crawlers on the internet — and one of the most widely blocked. Across 107 prominent sites with parseable robots.txt files in our June 13, 2026 snapshot, 40 of them have added a CCBot Disallow rule.
40 of 107 sites block CCBot — the highest count among 12 tracked operators.
Common Crawl's data is the raw material that trained many of the frontier AI models now generating controversy, and publishers increasingly see CCBot as the upstream feeder of AI training pipelines and are blocking it accordingly. This report breaks down who is blocking it, which industries lead, and what that means for teams that depend on web-sourced data.
Snapshot Methodology
US Tech Automations fetched robots.txt files from 122 prominent sites on June 13, 2026. Of those 122, 107 returned a parseable file; all percentages in this report are computed over that 107-site base. The snapshot is point-in-time and sealed: nothing is estimated, modeled, or extrapolated. Every count in this report is taken verbatim from public robots.txt directives as they existed on that date.
The snapshot sha is 741353c4304216ee, which pins the exact state of the dataset. robots.txt is an honor-system standard — it measures a site operator's stated intent, not a technical firewall. These numbers will not change as sites later edit their files; they describe a specific moment in time.
The 122-site panel spans 10 content categories and 21 tracked bot user-agents across 12 AI operators. Across the full corpus, 48 of 107 sites (44.9%) block some AI crawler. An additional 20 of 107 (18.7%) have adopted llms.txt, and 9 sites (8.4%) earned "star" status for the most comprehensive AI-access restrictions observed.
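The percentages quoted in this section follow directly from the raw counts over the 107-site base. The short Python sketch below reproduces that arithmetic; the variable and function names are illustrative and are not part of the published dataset.

```python
# Counts quoted in this report. The base is the number of sites
# that returned a parseable robots.txt file.
PARSEABLE_SITES = 107

reported_counts = {
    "blocks some AI crawler": 48,
    "has adopted llms.txt": 20,
    "earned star status": 9,
    "blocks CCBot": 40,
}


def share_of_base(count: int, base: int = PARSEABLE_SITES) -> float:
    """Return count as a percentage of base, rounded to one decimal place."""
    return round(100 * count / base, 1)


shares = {label: share_of_base(n) for label, n in reported_counts.items()}

for label, pct in shares.items():
    print(f"{label}: {reported_counts[label]} of {PARSEABLE_SITES} ({pct}%)")
```

Keeping the base in one named constant makes it harder to compute a share over the 122-site panel by mistake.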
How Often Common Crawl Is Refused
Unlike OpenAI or Anthropic, Common Crawl operates a single user-agent: CCBot. That simplicity makes the per-bot and operator-level numbers identical — 40 sites block CCBot, and 40 sites block Common Crawl. There is no sub-agent distinction, no search crawler vs. training crawler split.
| Common Crawl User-Agent | Sites Blocking (of 107) |
|---|---|
| CCBot | 40 |
The single-agent structure means every one of those 40 blocks is a direct, unambiguous statement: this site does not want Common Crawl archiving its content. There is no nuance available — no way to allow the "search" CCBot while blocking the "training" one, because CCBot is a single entity. Publishers who block CCBot have made an all-or-nothing decision.
CCBot: 40 blocks across 107 sites; every block is an all-or-nothing refusal.
That absoluteness may partly explain why CCBot's block count (40) exceeds Anthropic's (39) and substantially exceeds OpenAI's (35). With OpenAI's 3 agents, some publishers may allow OAI-SearchBot while blocking GPTBot. CCBot offers no such split, so a publisher that objects to any use of its content by Common Crawl has to block CCBot in full.
Sealed finding: 40 of 107 top sites (37.4%) block CCBot — the highest operator-level block rate among 12 AI-adjacent operators tracked in this corpus as of June 13, 2026.
The structural simplicity of CCBot creates a useful research baseline. Because there is only one user-agent to track, the 40-site count is a clean signal: these publishers have made an explicit, targeted policy choice against Common Crawl specifically, with no ambiguity about which function they are restricting.
Sealed finding: CCBot accounts for 40 of 40 blocks in the Common Crawl operator total — the most concentrated operator-level block pattern in this corpus, with no sub-agent distribution.
Which Industries Block Common Crawl
News publishers are the dominant resisters, with 13 sites in that category blocking CCBot. Tech follows with 9 — notably higher than Tech counts for OpenAI (5) or Anthropic (7). Reference and Entertainment each contribute 5 blockers. Social adds 3, Retail and Travel 2 each, and Government 1.
| Category | Sites Blocking CCBot |
|---|---|
| News | 13 |
| Tech | 9 |
| Reference | 5 |
| Entertainment | 5 |
| Social | 3 |
| Retail | 2 |
| Travel | 2 |
| Government | 1 |
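As a quick consistency check, the category counts in the table can be summed and compared with the 40-site headline figure. The sketch below does that in Python and also derives each category's share of the 40 blockers; those shares are computed here for illustration and are not quoted figures from the sealed dataset.

```python
# Category counts copied from the table above.
ccbot_blockers_by_category = {
    "News": 13,
    "Tech": 9,
    "Reference": 5,
    "Entertainment": 5,
    "Social": 3,
    "Retail": 2,
    "Travel": 2,
    "Government": 1,
}

total_blockers = sum(ccbot_blockers_by_category.values())

# Each category's share of all CCBot blockers (derived, not a reported figure).
category_share = {
    category: round(100 * count / total_blockers, 1)
    for category, count in ccbot_blockers_by_category.items()
}

# Combined share of the two leading categories.
news_and_tech_share = round(
    100
    * (ccbot_blockers_by_category["News"] + ccbot_blockers_by_category["Tech"])
    / total_blockers,
    1,
)

print(f"Total CCBot blockers across categories: {total_blockers}")
print(f"News + Tech share of blockers: {news_and_tech_share}%")
```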
Tech registers 9 CCBot blockers — higher than OpenAI (5) or Anthropic (7).
Tech's 9-site blocking count stands out. In the OpenAI and Anthropic reports, Tech logged 5 and 7 blockers, respectively. CCBot hits 9, the highest Tech category count in this comparison set. Properties like Wired, Ars Technica, CNET, ZDNet, Mashable, The Verge, TechCrunch, VentureBeat, and Gizmodo have all added CCBot blocks. Many of these outlets write extensively about AI, which may sharpen awareness of how archived content feeds into training pipelines.
News at 13 follows the same pattern seen across the whole Closing Web corpus: journalism outlets treat their archives as their core asset. Reference sites (5 blockers) include health and financial information properties where AI-generated summaries displace high-value organic traffic. For a comparison with how Anthropic crawlers fare in the same categories, see who blocks Anthropic ClaudeBot by industry.
Entertainment's 5 blockers — Rolling Stone, Variety, Hollywood Reporter, Billboard, ESPN — are consistent with the pattern observed for other operators. These properties hold deep original-content archives and treat them as proprietary assets. Social's 3 blockers (LinkedIn, Tumblr, Vimeo) reflect user-content platforms where third-party consent questions arise.
The Named Sites That Block Common Crawl
All 40 sites that block CCBot are named in the sealed dataset. The table below highlights 12 of them, weighted toward those with the highest overall headline-crawlers-blocked scores.
| Site | Category | Headline Crawlers Blocked (of 9) |
|---|---|---|
| bbc.com | News | 9 |
| bloomberg.com | News | 9 |
| usatoday.com | News | 9 |
| nytimes.com | News | 8 |
| cnn.com | News | 8 |
| wired.com | Tech | 8 |
| arstechnica.com | Tech | 8 |
| ebay.com | Retail | 8 |
| congress.gov | Government | 8 |
| rollingstone.com | Entertainment | 8 |
| theguardian.com | News | 7 |
| washingtonpost.com | News | 7 |
The familiar top tier — BBC, Bloomberg, USA Today with 9 headline bots each, then NYT, CNN, Wired, Ars Technica at 8 — appears again. These organizations have adopted comprehensive AI-crawling restrictions and are not making exceptions for Common Crawl.
VentureBeat (4 headline bots) and Gizmodo (3) are notable additions to the Tech blockers that do not appear in the Anthropic or OpenAI named-blocker lists for this corpus. That suggests CCBot specifically triggers blocks at sites that are otherwise more permissive with other operators.
Goodreads (2 headline bots) blocks CCBot but sits at the lower end of the spectrum, with very selective AI restrictions overall. Business Insider (3) and Dictionary.com (3) are moderate restrictors. Beyond the 12 sites in the table, the full 40-site list also includes The Atlantic (8), Forbes (8), Vox (7), Newsweek (7), The Verge (7), Healthline (7), Amazon (7), and others. For comparison with how OpenAI crawlers are treated by this same publisher set, see who blocks OpenAI GPTBot and the per-agent breakdown.
Per-Industry Analysis: What Drives the CCBot Lead
The CCBot total of 40 exceeds that of every other operator in this 12-operator corpus. Three factors likely explain the lead. First, CCBot has a longer operational history than crawlers like GPTBot or ClaudeBot, giving it more time to accumulate robots.txt entries. Second, corpora derived from Common Crawl, including C4 and OSCAR, are publicly documented as training sources for many major language models, making CCBot a visible upstream target.
Third, the single-agent structure removes the option for selective blocking. A publisher who wants to block OpenAI training but allow OpenAI search indexing can do so by targeting GPTBot alone. No equivalent split exists for CCBot. The all-or-nothing structure drives some publishers to block CCBot who might otherwise maintain partial access policies.
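The difference between a split policy and an all-or-nothing one can be illustrated with Python's standard-library robots.txt parser. The robots.txt text below is a hypothetical example, not taken from any site in the corpus, and the sketch assumes that "blocking" means the agent is disallowed from the site root; the report does not spell out its exact matching rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block OpenAI's training crawler, allow its
# search crawler, and block Common Crawl's single agent outright.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: CCBot
Disallow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], path: str = "/") -> dict[str, bool]:
    """Map each agent to True if the policy disallows it from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, path) for agent in agents}


result = blocked_agents(SAMPLE_ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "CCBot"])
for agent, is_blocked in result.items():
    print(f"{agent}: {'blocked' if is_blocked else 'allowed'}")
```

Under this sample policy the OpenAI agents receive different answers, while CCBot can only be allowed or blocked as a whole. Note that `urllib.robotparser` evaluates rules in file order, so its verdict on more complex files may differ from crawlers that apply longest-match precedence.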
The News category leads CCBot blocking at 13 sites, and the Tech category follows at 9. Both are categories where publishers treat original text as a primary product, which gives them the strongest incentive to restrict the single bot most associated with historical training datasets. For a News publisher, the presence of a CCBot rule can be an informative signal about its broader AI-data stance.
Put This Data to Work
If you are a data engineering lead or research-pipeline architect who depends on Common Crawl dumps for model training, retrieval augmentation, or large-scale text analysis, the 40-site block list is directly relevant to data provenance questions. A publisher blocking CCBot at the robots.txt level signals non-consent for archival.
US Tech Automations builds monitoring workflows that track robots.txt policy changes on your behalf. The concrete application: maintain a domain watchlist of content sources your organization depends on, run a nightly fetch-and-parse job, and receive a structured diff the moment CCBot policy changes on any monitored domain.
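The diff step of such a workflow might look roughly like the sketch below. It is a minimal illustration and not the production workflow described above: it assumes the nightly job has already fetched each domain's robots.txt text into a dictionary, it treats "blocked" as disallowed from the site root, and the `PolicyChange` and `diff_snapshots` names are hypothetical.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass(frozen=True)
class PolicyChange:
    """One domain whose CCBot policy flipped between two snapshots."""
    domain: str
    was_blocked: bool
    now_blocked: bool


def blocks_agent(robots_txt: str, agent: str = "CCBot", path: str = "/") -> bool:
    """Return True if the robots.txt text disallows `agent` from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> list[PolicyChange]:
    """Compare two {domain: robots.txt text} snapshots and list CCBot policy flips."""
    changes = []
    # Only domains present in both snapshots can be compared.
    for domain in sorted(previous.keys() & current.keys()):
        before = blocks_agent(previous[domain])
        after = blocks_agent(current[domain])
        if before != after:
            changes.append(PolicyChange(domain, before, after))
    return changes


# Hypothetical watchlist with two placeholder domains.
yesterday = {
    "example.com": "User-agent: *\nAllow: /\n",
    "example.org": "User-agent: CCBot\nDisallow: /\n",
}
today = {
    "example.com": "User-agent: CCBot\nDisallow: /\n",
    "example.org": "User-agent: CCBot\nDisallow: /\n",
}

changes = diff_snapshots(yesterday, today)
for change in changes:
    print(change)
```

Storing the raw robots.txt text alongside each fetch timestamp, rather than only the derived verdict, keeps the record re-checkable if the matching rule is later refined.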
For legal and compliance teams navigating AI training data questions, robots.txt state is increasingly cited in litigation around consent. Knowing that a given publisher explicitly blocked CCBot as of a specific date, backed by a timestamped, automated record, is the kind of audit trail that matters in discovery.
Frequently Asked Questions
Q: Is Common Crawl a commercial AI company?
A: No. Common Crawl is a nonprofit that publishes free web archive datasets. However, because those datasets have been used to train many commercial AI models, publishers have come to associate CCBot with the broader AI training ecosystem — which is reflected in the 40-site block count.
Q: Does blocking CCBot stop my content from appearing in AI training data?
A: Partially. Blocking CCBot prevents future Common Crawl crawls from including your content. It does not remove content from existing Common Crawl snapshots, and it does not affect crawlers operated by other AI companies (GPTBot, ClaudeBot, etc.), which must be blocked separately. Each operator's user-agent needs its own robots.txt group unless you rely on a wildcard (User-agent: *) rule, which also applies to any other crawler that lacks a group of its own.
Q: Does blocking CCBot affect my Google Search ranking?
A: No. CCBot has no relationship to Googlebot or Google-Extended. Blocking CCBot has zero effect on your Google Search indexing or ranking. See who blocks Google-Extended and how that count compares for the Google-specific picture.
Q: Why is Common Crawl blocked more than OpenAI or Anthropic?
A: Several factors: CCBot has a longer history, making it a more established target; it is associated with training datasets that predate many operators' own crawlers; and because it runs a single user-agent with no function split, publishers who want to block AI training have no reason to allow it selectively. All 40 blocks are absolute.
Q: Will these 40 blockers change over time?
A: Yes. robots.txt files are live documents. The 40-site figure is a sealed point-in-time reading from June 13, 2026. Some sites may add CCBot blocks; others may remove them (for instance, if they reach a licensing agreement). Automated monitoring is the only way to track the drift. US Tech Automations can build that workflow for your domain list.
Key Takeaways
40 of 107 top sites block CCBot — the highest operator-level block rate among 12 AI-adjacent operators in this corpus.
CCBot operates as a single user-agent with no sub-agent split, making every block an unambiguous, all-or-nothing policy statement.
News (13 sites) drives the most resistance, followed by Tech (9) — Tech is notably higher for CCBot than for OpenAI (5) or Anthropic (7).
BBC, Bloomberg, and USA Today each block 9 of 9 tracked headline bots; CCBot is part of a comprehensive AI-access lockdown at the top of the publisher tier.
48 of 107 sites (44.9%) block some AI crawler; 40 of those 48 block CCBot, making it the most broadly targeted bot in this corpus.
The sealed snapshot sha 741353c4304216ee pins the exact dataset; nothing is derived or estimated from secondary sources.
Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).
Cite this report
US Tech Automations Research, 2026-06 edition. “Who Blocks Common Crawl's CCBot? 40 of 107 Top Sites.” https://ustechautomations.com/resources/blog/who-blocks-common-crawl-ccbot-2026
Sealed snapshot sha256: 741353c4304216ee