How Many Top Sites Block ClaudeBot? Sealed robots.txt Data

Jun 13, 2026

Key Takeaways

38 of 107 top sites block ClaudeBot.

ClaudeBot is blocked at a 35.5% rate across 107 sites.

48 of 107 sites block at least one AI crawler.

Of the 122 prominent sites in our starting universe, 107 returned a parseable robots.txt file. Of those 107, exactly 38 block ClaudeBot — a block rate of 35.5%. Among the 9 crawlers measured in this edition, that places ClaudeBot in the second position, just two sites behind the leader CCBot and one site ahead of Bytespider.

ClaudeBot is the Anthropic web crawler. Anthropic uses ClaudeBot to gather web content to support the development of its Claude AI models. As an operator, Anthropic is blocked by 39 sites — one more than the per-bot count of 38 — indicating that a small number of sites use a broader Anthropic-level block that catches crawler tokens beyond ClaudeBot alone.

Of 107 prominent sites with a parseable robots.txt, 38 block ClaudeBot — a 35.5% block rate, second highest across all 9 crawlers measured in June 2026.

Anthropic as an operator is refused access by 39 of 107 sites; 48 of 107 (44.9%) block at least one AI crawler of any kind.


What Is ClaudeBot and Why Do Publishers Block It

ClaudeBot is the Anthropic web crawler. It identifies itself to web servers and robots.txt parsers by the user-agent string ClaudeBot. When a site includes a User-agent: ClaudeBot directive followed by a broad disallow, it is specifically targeting Anthropic data-collection activity rather than making a blanket AI statement.

Publishers who block ClaudeBot often do so alongside other AI crawlers, but not universally. The cross-bot patterns in this snapshot reveal that some sites draw a distinction between different operators: a site might block Anthropic while allowing another AI company, or vice versa. The nuanced picture is visible only when you measure all 9 bots against the same corpus, which this edition does.

ClaudeBot belongs to Anthropic, whose crawlers collectively are blocked by 39 sites — second highest among operators. That operator-level view matters for publishers who want to account for all Anthropic-labeled activity rather than a single user-agent string. For the full operator comparison, see the leaderboard table in this report.

For a look at how the top-ranked CCBot from Common Crawl compares, see how many sites block CCBot.


Methodology

US Tech Automations Research collected publicly accessible robots.txt files from 122 prominent sites on June 13, 2026. For each site, we fetched the file, parsed its directives, and checked whether ClaudeBot would be covered by a broad disallow instruction — either via a User-agent: ClaudeBot section or a catch-all User-agent: * section with a broad disallow.
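
The check described above can be approximated with Python's standard `urllib.robotparser`. This is a minimal sketch, not the pipeline used for this report: the helper name `blocks_crawler` and the sample robots.txt bodies are hypothetical, and it assumes "broad disallow" means the site root `/` is disallowed for the crawler.

```python
from urllib.robotparser import RobotFileParser


def blocks_crawler(robots_txt: str, token: str = "ClaudeBot") -> bool:
    """Return True if `token` may not fetch the site root under these rules.

    Assumption: "broad disallow" is read here as "the root path / is
    disallowed for this user-agent"; the report may apply a different test.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() checks groups that name a matching user-agent first and
    # falls back to the catch-all "User-agent: *" group when none matches.
    return not parser.can_fetch(token, "/")


# Hypothetical robots.txt bodies, not taken from any site in the corpus.
NAMED_BLOCK = "User-agent: ClaudeBot\nDisallow: /\n"
CATCH_ALL_BLOCK = "User-agent: *\nDisallow: /\n"
PARTIAL_RULES = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: *\nDisallow: /private/\n"
)

for label, body in [
    ("named", NAMED_BLOCK),
    ("catch-all", CATCH_ALL_BLOCK),
    ("partial", PARTIAL_RULES),
]:
    print(label, blocks_crawler(body))
```

The first two samples count as blocks. The third does not, because ClaudeBot falls through to a catch-all group that only disallows `/private/`.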

Of the 122 sites, 107 returned a parseable robots.txt file. The remaining 15 are excluded from all percentages and counts. Every count in this report is a raw read from the text of public files, sealed June 13, 2026 with snapshot sha 741353c4304216ee; nothing is estimated, modeled, or extrapolated, and each percentage is a count divided by 107.

| Metric | Value |
| --- | --- |
| Sites in starting universe | 122 |
| Sites with a parseable robots.txt | 107 |
| Sites blocking ClaudeBot | 38 |
| Block rate | 35.5% |

Sites That Block ClaudeBot

The 38 sites that block ClaudeBot span news, entertainment, technology media, health, e-commerce, professional networking, user-generated content, reviews and travel, personal finance, and government. ClaudeBot is not singled out by one content category; it appears in disallow lists across the full range of prominent web publishers.

Major news outlets blocking ClaudeBot include nytimes.com, washingtonpost.com, theguardian.com, bbc.com, cnn.com, apnews.com, bloomberg.com, forbes.com, businessinsider.com, theatlantic.com, usatoday.com, latimes.com, newsweek.com, and vox.com. The entertainment and culture publications rollingstone.com, variety.com, hollywoodreporter.com, and billboard.com also block ClaudeBot.

Technology media on the blocker list includes techcrunch.com, theverge.com, wired.com, arstechnica.com, cnet.com, zdnet.com, and mashable.com. In the health space, webmd.com and healthline.com both block ClaudeBot. The professional network linkedin.com blocks it, as do e-commerce destinations amazon.com and ebay.com.

User-generated and social content platforms quora.com, tumblr.com, medium.com, and vimeo.com all appear in the blocker list. Review and travel platforms tripadvisor.com and yelp.com block ClaudeBot, and investopedia.com covers the personal-finance reference segment. Government content is represented by congress.gov.

Notable allowers — sites with a parseable robots.txt that do not block ClaudeBot — include reuters.com, wsj.com, time.com, and the broad set of e-commerce destinations walmart.com, target.com, bestbuy.com, etsy.com, homedepot.com, wayfair.com, ikea.com, nordstrom.com, and nike.com. Financial services sites chase.com, bankofamerica.com, wellsfargo.com, fidelity.com, paypal.com, nerdwallet.com, bankrate.com, morningstar.com, marketwatch.com, fool.com, and coinbase.com all allow ClaudeBot. Government portals including cdc.gov, medlineplus.gov, usa.gov, irs.gov, sec.gov, whitehouse.gov, census.gov, nasa.gov, and uspto.gov do not block ClaudeBot. Educational institutions mit.edu, harvard.edu, stanford.edu, coursera.org, and edx.org are in the allower list as well.

Entertainment platforms netflix.com, spotify.com, youtube.com, hulu.com, and espn.com also allow ClaudeBot in this snapshot. Tech platforms github.com, reddit.com, pinterest.com, substack.com, wordpress.com, blogger.com, and twitch.tv permit it too. The technology publications gizmodo.com, venturebeat.com, engadget.com, and slashdot.org do not block ClaudeBot either.


Cross-Bot Leaderboard (all 107 sites)

The table below shows block counts and rates for all 9 measured crawlers across the same 107-site corpus, letting you compare ClaudeBot directly with its peers.

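| Bot | Sites Blocking | Block Rate |
| --- | --- | --- |
| CCBot | 40 | 37.4% |
| ClaudeBot | 38 | 35.5% |
| Bytespider | 37 | 34.6% |
| GPTBot | 33 | 30.8% |
| Applebot-Extended | 31 | 29.0% |
| Meta-ExternalAgent | 30 | 28.0% |
| PerplexityBot | 29 | 27.1% |
| Google-Extended | 25 | 23.4% |
| Amazonbot | 22 | 20.6% |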

ClaudeBot sits two sites behind CCBot and one site ahead of Bytespider. The top three crawlers occupy a tight band of 37 to 40 sites. GPTBot, Applebot-Extended, Meta-ExternalAgent, and PerplexityBot form a middle tier at 29 to 33 sites. Google-Extended (25) and Amazonbot (22) trail the field.
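
The block rates in the leaderboard are each crawler's count divided by the 107 sites with a parseable robots.txt. The sketch below recomputes them from the published counts; the names `BLOCK_COUNTS` and `block_rate` are illustrative, and rounding to one decimal place is an assumption about how the published rates were formatted.

```python
# Block counts copied from the leaderboard table in this report.
SITES_WITH_ROBOTS = 107
BLOCK_COUNTS = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}


def block_rate(count: int, total: int = SITES_WITH_ROBOTS) -> float:
    """Percentage of sites blocking a crawler, rounded to one decimal."""
    return round(100 * count / total, 1)


# Rank crawlers from most blocked to least blocked.
ranked = sorted(BLOCK_COUNTS.items(), key=lambda kv: kv[1], reverse=True)
for rank, (bot, count) in enumerate(ranked, start=1):
    print(f"{rank}. {bot}: {count}/{SITES_WITH_ROBOTS} = {block_rate(count):.1f}%")
```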

For a deeper look at the ByteDance crawler in third place, see how many sites block Bytespider. For the fourth-ranked OpenAI crawler, see how many sites block GPTBot.


Operator Leaderboard (all 107 sites)

This table counts, by operator, the number of sites that block at least one crawler from that organization. An operator with multiple crawler tokens will show a count equal to or higher than any single bot count.

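| Rank | Operator | Sites Blocking |
| --- | --- | --- |
| 1 | Common Crawl | 40 |
| 2 | Anthropic | 39 |
| 3 | ByteDance | 37 |
| 4 | OpenAI | 35 |
| 4 | Meta | 35 |
| 6 | Apple | 31 |
| 7 | Diffbot | 30 |
| 8 | Perplexity | 29 |
| 9 | Cohere | 27 |
| 10 | Google | 25 |
| 11 | Amazon | 22 |
| 12 | Mistral | 12 |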

Anthropic ranks second at 39 sites — one more than the ClaudeBot-specific count of 38 — indicating that one additional site blocks an Anthropic token other than ClaudeBot. This is a small difference but methodologically important: a publisher who wants to block all Anthropic activity needs to ensure every Anthropic user-agent string is covered.
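
The way an operator count can exceed a single-bot count is easy to see on toy data. The sketch below is illustrative only. The site names are hypothetical, and the token-to-operator map is an assumption: `anthropic-ai` stands in for "an Anthropic token other than ClaudeBot", since the report does not say which token the extra site blocks.

```python
from collections import defaultdict

# Assumption: an illustrative token-to-operator map, not the report's own.
TOKEN_OPERATOR = {
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "CCBot": "Common Crawl",
}

# Hypothetical per-site results: site -> set of blocked crawler tokens.
BLOCKED_TOKENS = {
    "site-a.example": {"ClaudeBot", "CCBot"},
    "site-b.example": {"anthropic-ai"},
    "site-c.example": set(),
}


def sites_blocking_operator(blocked, token_operator):
    """Map each operator to the sites blocking at least one of its tokens."""
    by_operator = defaultdict(set)
    for site, tokens in blocked.items():
        for token in tokens:
            operator = token_operator.get(token)
            if operator is not None:
                by_operator[operator].add(site)
    return dict(by_operator)


operator_sites = sites_blocking_operator(BLOCKED_TOKENS, TOKEN_OPERATOR)
claudebot_sites = {s for s, t in BLOCKED_TOKENS.items() if "ClaudeBot" in t}
print(len(operator_sites["Anthropic"]), len(claudebot_sites))
```

In this toy corpus, `site-b.example` blocks an Anthropic token but not ClaudeBot, so it raises the operator count without raising the ClaudeBot count.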

Mistral at the bottom with 12 sites shows that not all AI operators are treated equally by publishers. The gap between 39 (Anthropic) and 12 (Mistral) is substantial and likely reflects a combination of crawl volume, public awareness, and how long each organization has been active in the web-crawling space.


Frequently Asked Questions

Q: Why would a site block ClaudeBot but allow GPTBot, or vice versa?

A: Blocking decisions in robots.txt are made independently per operator or per crawler token. A publisher might have updated their file to block an early-announced crawler and not yet added rules for later entrants. They might also make deliberate distinctions based on licensing negotiations, press coverage, or their assessment of each organization. The data in this snapshot reflects the state of those rules on June 13, 2026, not the reasoning behind them.

Q: Does blocking ClaudeBot stop Anthropic from accessing the site entirely?

A: No. robots.txt is an honor-system protocol. It signals a site owner preference; it does not technically prevent access. A compliant crawler respects the disallow directives it reads. Whether any given crawler is fully compliant can only be verified by analyzing server logs, not by inspecting robots.txt alone.
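
A first pass at that log analysis might look like the sketch below. It assumes access logs in the common "combined" format, and the sample lines, the helper name `crawler_hits`, and the user-agent strings are all hypothetical. User-agent strings can be spoofed, so a match is a lead to investigate, not proof of which operator made the request.

```python
import re

# Assumption: "combined" log format, where the user-agent is the last
# quoted field on the line. Adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def crawler_hits(log_lines, token="ClaudeBot", disallowed_prefix="/"):
    """Return paths requested by `token` under a disallowed prefix."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue
        path = match["path"]
        # Fetching robots.txt itself is expected of a compliant crawler.
        if path == "/robots.txt":
            continue
        if token.lower() in match["agent"].lower() and path.startswith(disallowed_prefix):
            hits.append(path)
    return hits


# Hypothetical log lines using documentation-range IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [13/Jun/2026:10:00:00 +0000] "GET /articles/1 HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.9 - - [13/Jun/2026:10:00:05 +0000] "GET /about HTTP/1.1" '
    '200 312 "-" "Mozilla/5.0 (compatible; ExampleBrowser)"',
]
print(crawler_hits(SAMPLE_LOG))
```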

Q: What is the corpus-wide benchmark for AI blocking?

A: 48 of 107 sites (44.9%) block at least one AI crawler. ClaudeBot's block rate of 35.5% sits below that corpus-wide figure, as it must, because every site that blocks ClaudeBot also counts toward the at-least-one total. A site that blocks ClaudeBot has company, but blocking ClaudeBot alone would still leave the site open to the other 8 crawlers in this edition.

Q: How does the operator count of 39 differ from the ClaudeBot count of 38?

A: The operator count captures any site that blocks at least one crawler attributed to Anthropic. If a site blocks a different Anthropic-specific user-agent string but not ClaudeBot, it is counted in the operator total but not in the ClaudeBot-specific count. The one-site gap reflects exactly that scenario somewhere in the corpus.


How to Read This Number

A block count is a measure of stated policy, not of enforcement. When a site lists ClaudeBot in a disallow rule, it has published an intent: it does not want Anthropic to fetch its pages with that user-agent. Whether Anthropic honors that intent is a separate question that robots.txt alone cannot answer. The value of a cross-site count is that it turns many individual policy decisions into one comparable signal: a way to see whether blocking ClaudeBot is a fringe practice or a mainstream one among prominent publishers.

Reading the figure well also means respecting what it excludes. The count reflects only sites that returned a parseable robots.txt file, and only rules that cover the user-agent token ClaudeBot, whether in a ClaudeBot section or in a catch-all section as described in the methodology. A site that blocks Anthropic under a different token, or with a server-side rule invisible to a robots.txt read, is not counted here. The number is therefore a floor on stated objection, not a ceiling. That deliberately conservative reading matches the sealed-data discipline of this series, where nothing is estimated, modeled, or extrapolated.

Put AI-Access Data to Work

A publisher RevOps lead overseeing a portfolio of content properties needs to know whether AI crawlers are respecting stated policies — and whether competitors are changing their robots.txt rules in ways that signal a shift in industry norms. Spot-checking individual domains by hand does not scale across a large portfolio.

US Tech Automations can automate that monitoring: agentic workflows fetch, parse, and diff robots.txt files on a schedule, then route structured change alerts to the right team for review. No manual checking, no missed policy changes. Explore the agentic workflow platform to see how this fits your content and compliance stack.
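
The diff step of such a monitor can be sketched with Python's standard `difflib`. This is a generic illustration under stated assumptions, not the US Tech Automations implementation: the helper name `robots_changes` and both snapshots are hypothetical, and fetching, scheduling, and alert routing are left out.

```python
import difflib


def robots_changes(old_text: str, new_text: str) -> list[str]:
    """Return added ("+ ") and removed ("- ") non-blank lines between snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [
        line
        for line in diff
        # Keep only additions and removals, and skip blank-line changes.
        if line.startswith(("+ ", "- ")) and line[2:].strip()
    ]


# Hypothetical snapshots of one site's robots.txt on two dates.
PREVIOUS = "User-agent: *\nDisallow: /private/\n"
UPDATED = (
    "User-agent: *\nDisallow: /private/\n\n"
    "User-agent: ClaudeBot\nDisallow: /\n"
)

for change in robots_changes(PREVIOUS, UPDATED):
    print(change)
```

An empty result means the file is unchanged between snapshots, so no alert would be needed for that site.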


Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).


Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Top Sites Block ClaudeBot? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-claudebot-2026

Sealed snapshot sha256: 741353c4304216ee

Machine-readable data: CSV · JSON · All research & methodology

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.