How Many Top Websites Block AI Crawlers in 2026?

Jun 13, 2026

How many of the web's most prominent sites are slamming the door on AI crawlers? The straightforward answer from the sealed data: of 122 prominent public websites we checked on June 13, 2026, 107 returned a parseable robots.txt — and 48 of those, or 44.9%, block at least one major AI crawler outright.

That 44.9% is the headline number of this report. It means nearly half of the major sites that publish a machine-readable access policy now name at least one AI bot and tell it, explicitly, to stay out. A few years ago that share was effectively zero. The "open web" that trained the first generation of large language models is, site by site, quietly closing.

This post covers a curated, named set of 122 prominent websites across 10 categories: news, tech, reference, retail, social, finance, travel, education, government, and entertainment. It is not a claim about "the top N sites on the internet," and it is not a random sample. Every figure here is a verbatim count from robots.txt files actually fetched on the snapshot date and sealed as a point-in-time record; the numbers will not change as sites later edit their files. A site is counted as "blocking" a bot only when its robots.txt contains a user-agent group that names that bot exactly and includes a Disallow: / rule. Percentages are taken over the 107 sites that returned a readable robots.txt, not all 122 checked.
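To make the counting rule concrete, here is a minimal Python sketch of the test described above. It is an illustration, not the code used to produce the snapshot: the function names parse_groups and blocks_bot are hypothetical, the parser reads only User-agent, Allow, and Disallow lines, and matching bot names case-insensitively is an assumption on our part.

```python
def parse_groups(robots_txt: str) -> list[tuple[list[str], list[tuple[str, str]]]]:
    """Split robots.txt text into (user_agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def blocks_bot(robots_txt: str, bot: str) -> bool:
    """True if a group naming `bot` exactly contains `Disallow: /`."""
    bot = bot.lower()  # assumption: names compare case-insensitively
    return any(
        bot in agents and ("disallow", "/") in rules
        for agents, rules in parse_groups(robots_txt)
    )


# Illustrative file, not taken from any site in the snapshot.
sample = """
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""
```

Under this rule the sample counts as blocking GPTBot and CCBot but not ClaudeBot, because the wildcard group neither names ClaudeBot nor disallows the whole site.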

What the Sealed Snapshot Measured

The snapshot tracks 21 distinct AI-crawler user-agents operated by 12 companies, across 10 content categories. The point of widening the net past the obvious names is that "blocking AI" is not one decision — a site can wave through a search-grounding bot while slamming the door on a training scraper, and the snapshot captures that granularity.

Metric	Value
Prominent sites checked	122
Sites with a parseable robots.txt	107
Content categories	10
AI crawler user-agents tracked	21
Operating companies tracked	12
Sites blocking at least one major AI crawler	48
Share blocking at least one major AI crawler	44.9%
Sites serving an llms.txt file	20
Share serving an llms.txt file	18.7%

Two numbers anchor the whole picture. The 44.9% block rate shows how widespread refusal has become; the 18.7% llms.txt adoption rate shows the opposite reflex — a minority of sites actively inviting and guiding AI systems rather than walling them out. The web is not converging on one answer. It is splitting.

"As of June 13, 2026, 48 of the 107 prominent sites with a readable robots.txt — 44.9% — block at least one major AI crawler, while just 20 sites (18.7%) publish an llms.txt to guide AI systems instead."

The Per-Crawler Picture

Aggregate block rates hide the fact that some crawlers are far more frequently refused than others. Ranked by how many of the 107 sites block each one, the nine highest-profile AI crawlers sort like this:

AI Crawler	Operator	Sites Blocking	Block Rate
CCBot	Common Crawl	40	37.4%
ClaudeBot	Anthropic	38	35.5%
Bytespider	ByteDance	37	34.6%
GPTBot	OpenAI	33	30.8%
Applebot-Extended	Apple	31	29.0%
Meta-ExternalAgent	Meta	30	28.0%
PerplexityBot	Perplexity	29	27.1%
Google-Extended	Google	25	23.4%
Amazonbot	Amazon	22	20.6%

CCBot, the Common Crawl bot whose archives have seeded a large share of public LLM training corpora, is the single most-blocked crawler at 37.4%. The likely reason: Common Crawl is the operator that site owners most associate with their content ending up in someone else's training run, so its crawler draws the most Disallow rules. ClaudeBot (35.5%) and Bytespider (34.6%) sit just behind it.

The spread between the most-blocked (CCBot, 37.4%) and least-blocked (Amazonbot, 20.6%) crawler is meaningful — a 16.8-point gap. Site operators are not blocking "AI" as an undifferentiated category; they are making bot-by-bot calls based on who they trust, who they fear training on their content, and which crawler sends referral traffic back.
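The rates and the gap quoted above can be reproduced from the counts in the table. The short Python sketch below does the arithmetic; the variable and function names are ours, and rounding to one decimal place is an assumption that matches how the table is presented.

```python
READABLE_SITES = 107  # sites that returned a parseable robots.txt

# Counts copied from the per-crawler table in this report.
blocking_counts = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}


def block_rate(count: int, denominator: int = READABLE_SITES) -> float:
    """Share of readable-policy sites, as a percentage with one decimal."""
    return round(100 * count / denominator, 1)


rates = {bot: block_rate(n) for bot, n in blocking_counts.items()}

# Gap between the most-blocked and least-blocked headline crawler.
spread = round(max(rates.values()) - min(rates.values()), 1)
```

The same helper reproduces the headline figure: block_rate(48) gives the 44.9% share of sites blocking at least one major AI crawler.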

The Blanket-Block Minority

There is a separate, harsher signal hiding in the data: the wildcard block. When a site puts Disallow: / under User-agent: *, it is telling every compliant crawler — search engines and AI bots alike — to stay out of its main content.

Block type	Sites	Share of 107
Block at least one major AI crawler (named)	48	44.9%
Wildcard `Disallow: /` under `User-agent: *`	9	8.4%

Only 9 sites (8.4%) take that blanket approach. The far more common pattern is surgical: leave the general-purpose and search crawlers alone, and name specific AI bots to exclude. If all nine wildcard sites also fall inside the blocking group, that leaves 39 of the 48 blockers following the surgical pattern. That distinction matters for anyone trying to read the trend. The dominant behavior in 2026 is not "close the site." It is "stay in Google, stay out of the training set."
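For readers who want to check a file for the wildcard pattern themselves, a rough Python sketch follows. The function name has_wildcard_block is hypothetical, and the check is deliberately narrow: it looks only for Disallow: / inside a group that lists User-agent: *, and it does not try to weigh any Allow lines that might soften the rule.

```python
def has_wildcard_block(robots_txt: str) -> bool:
    """True if a `User-agent: *` group contains `Disallow: /`."""
    agents, seen_rule = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                agents, seen_rule = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


# Illustrative files, not taken from any site in the snapshot.
blanket = "User-agent: *\nDisallow: /\n"
surgical = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"
```

Here blanket matches the wildcard pattern, while surgical does not: it shuts out one named bot and leaves everyone else only a restricted path.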

The Hardest Closers: Sites That Block Everything

Inside the 48 blockers, there is a spectrum from surgical to absolute. The sealed per-domain record lets us rank sites by how many of the nine headline crawlers each one refuses.

Posture	Sites	Examples (from the sealed set)
Block all 9 headline crawlers	3	bbc.com, bloomberg.com, usatoday.com
Block 8 of 9	15	nytimes.com, cnn.com, forbes.com, theatlantic.com, wired.com, congress.gov
Block 0 of 9	59	reuters.com, wsj.com, wikipedia.org, github.com, walmart.com, cdc.gov

Only three sites in the entire curated set (bbc.com, bloomberg.com, and usatoday.com) block every one of the nine highest-profile AI crawlers. A second, larger tier of 15 sites blocks eight of the nine, typically letting a single search-grounding crawler through; this tier is dominated by major news and tech publishers like nytimes.com, cnn.com, forbes.com, wired.com, and arstechnica.com, plus congress.gov.

The most striking number, though, is at the other end: 59 of the 107 readable-policy sites block none of the nine headline crawlers at all. That majority includes household names you might expect to be defensive: reuters.com and wsj.com among the news brands, wikipedia.org and britannica.com in reference, and a long bench of retailers, banks, travel sites, and government domains, from walmart.com and chase.com to cdc.gov and census.gov (congress.gov, which blocks eight of the nine, is the notable government exception). The blocking story is real, but it is concentrated in a specific slice of the web, not evenly spread across it.
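The posture tiers above come from counting, per site, how many of the nine headline crawlers are blocked. A minimal Python sketch of that tally is below. The function name posture_histogram and the three example domains are hypothetical, and the sample data is invented purely to show the mechanics; it is not drawn from the sealed per-domain record.

```python
from collections import Counter

# The nine headline crawlers named in this report.
HEADLINE_BOTS = [
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Applebot-Extended",
    "Meta-ExternalAgent", "PerplexityBot", "Google-Extended", "Amazonbot",
]


def posture_histogram(blocked_by_site: dict[str, set[str]]) -> Counter:
    """Count sites by how many headline bots each one blocks."""
    headline = set(HEADLINE_BOTS)
    # Bots outside the headline nine are ignored by the intersection.
    return Counter(len(bots & headline) for bots in blocked_by_site.values())


# Invented example data on reserved .test domains.
example = {
    "news.example.test": set(HEADLINE_BOTS),
    "shop.example.test": set(),
    "blog.example.test": {"CCBot", "GPTBot", "SomeOtherBot"},
}
histogram = posture_histogram(example)
```

In this toy input one site blocks all nine, one blocks none, and one blocks two, which is the same shape of breakdown the posture table reports for the real set.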

Why This Series Exists and Why It Can't Be Rebuilt Later

A robots.txt file is a destructive record. When a site changes its mind — adds GPTBot to the block list, or removes it after signing a licensing deal — the previous version is simply overwritten. There is no public, per-domain archive of who blocked which AI crawler on which date. Once today passes, today's access posture is gone unless someone captured it.

That is the entire reason for sealing this snapshot now. The 44.9% figure is a fact about June 13, 2026 that no one will be able to reconstruct in 2027. As AI-training litigation, content-licensing marketplaces, and agent-driven traffic all scale, a timestamped, hashed record of how the web's access policy shifted month over month becomes the kind of ground truth that can't be backfilled.

The snapshot is sealed with a sha256 hash over its full set of measured values, so the numbers in this report are tamper-evident: anyone can recompute the hash and confirm the figures were not edited after the fact. That property is what turns a one-time crawl into a citable record. A claim like "44.9% of prominent sites blocked at least one AI crawler in June 2026" is only useful if it is anchored to a fixed, verifiable measurement — and that is exactly what the seal provides.
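As a rough illustration of how a seal like this can work, the Python sketch below hashes a small set of values after serializing them in a fixed order. Everything about the serialization is an assumption: this report does not specify how its measured values are encoded before hashing, so the function name seal, the JSON canonical form, and the four-field snapshot are illustrative only, and the resulting digest is not expected to match the published one.

```python
import hashlib
import json


def seal(values: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding."""
    # Sorted keys and fixed separators make the encoding independent
    # of dictionary insertion order and whitespace choices.
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# A tiny stand-in for the full set of measured values.
snapshot = {
    "date": "2026-06-13",
    "sites_checked": 122,
    "sites_readable": 107,
    "sites_blocking": 48,
}
digest = seal(snapshot)

# Changing any single value produces a different digest.
tampered = seal({**snapshot, "sites_blocking": 49})
```

The design point is the canonical form: two parties can only agree on a hash if they agree byte for byte on what is being hashed, which is why key order and separators are pinned down before the digest is taken.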

Put This Data to Work

For an operations leader or a marketing team, the access-policy map is not trivia; it is a live input to two decisions. First, your own posture: if you publish content, your robots.txt is now a strategic document, not boilerplate, and most teams have never deliberately set it. Second, your competitive and sourcing intelligence: knowing which sources are open to AI grounding and which are walled off shapes which sources automated research and retrieval pipelines can draw on reliably and in line with each site's stated policy.

This is the kind of monitoring US Tech Automations builds for operators who would rather not check 100 robots.txt files by hand every month. An automation specialist can stand up a workflow that fetches and diffs these access policies on a schedule, flags the day a key competitor or data source changes its AI-crawler rules, and routes that signal to the right owner. The same US Tech Automations playbook that powers our sealed-data research (scheduled fetch, parse, diff, alert) is exactly what a content or RevOps team needs to keep its own AI-access policy intentional rather than accidental. If you run retrieval or research automations, US Tech Automations can help you keep your source lists aligned with what each site actually permits.
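The diff step of such a workflow can be quite small. The Python sketch below compares two snapshots of "which bots does each site block" and reports what changed. It is a simplified illustration, not a description of any production system: the function name diff_blocked, the snapshot shape (site mapped to a set of bot names), and the example domains are all hypothetical, and fetching, parsing, and alerting are left out.

```python
def diff_blocked(
    previous: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """Report, per site, which bots were newly blocked or unblocked."""
    changes = {}
    for site in sorted(previous.keys() | current.keys()):
        before = previous.get(site, set())
        after = current.get(site, set())
        added, removed = after - before, before - after
        if added or removed:  # skip sites whose policy did not move
            changes[site] = {"newly_blocked": added, "unblocked": removed}
    return changes


# Invented snapshots on reserved .test domains.
last_month = {
    "news.example.test": {"GPTBot"},
    "shop.example.test": {"CCBot"},
}
this_month = {
    "news.example.test": {"GPTBot", "ClaudeBot"},
    "shop.example.test": set(),
    "blog.example.test": set(),
}
changes = diff_blocked(last_month, this_month)
```

In this toy run the news site adds ClaudeBot, the shop drops its CCBot block, and the newly tracked blog produces no alert because it blocks nothing.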

Frequently Asked Questions

What does it mean for a site to "block" an AI crawler?
In this report, a site blocks a crawler when its public robots.txt contains a user-agent group naming that exact bot with a Disallow: / rule. That is the standard, voluntary mechanism crawlers are expected to honor; it is not a technical firewall.

Does blocking in robots.txt actually stop the crawler?
robots.txt is an honor-system standard. Well-behaved crawlers from major AI companies generally respect it, but it is a request, not an enforcement mechanism. The signal it carries is the site operator's intent, which is what this dataset measures.
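To see what "honoring" the file looks like from the crawler's side, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt text, the bot name SomeOtherBot, and the example.com URLs are made up for illustration; real crawlers may apply their own matching rules, so treat this as one plausible reading rather than how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: shut out GPTBot, restrict everyone else slightly.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse from text, no network fetch

# A compliant crawler asks before requesting a page.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/articles/1")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/articles/1")
other_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/x")
```

Nothing in this exchange is enforced by the server. The crawler reads the policy and decides for itself whether to proceed, which is why the dataset is best read as a record of operator intent.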

Why are percentages based on 107 sites and not 122?
Of the 122 prominent sites checked, 107 returned a parseable robots.txt. The 15 that did not (no file, an error response, or an unreadable format) are excluded from the rate denominators so the percentages reflect only sites that actually published a policy.

Is 44.9% representative of the whole web?
No. This is a curated set of prominent sites across 10 categories, deliberately weighted toward high-profile publishers, retailers, and reference sources. Smaller sites are not measured here, so the figure says nothing about how they behave. Treat 44.9% as a measure of prominent-site behavior, not the web at large.

How often does this change?
Frequently and without notice. That is the point of sealing a dated snapshot — the access posture captured here is specific to June 13, 2026 and will drift as sites revise their files.

What's the difference between blocking one AI crawler and blocking all of them?
A large one. Most blockers in this set are surgical — they name a specific bot like CCBot or ClaudeBot and leave search and general-purpose crawlers alone. Only a few take an absolute stance: three sites (bbc.com, bloomberg.com, usatoday.com) block all nine headline crawlers, while 59 sites block none of them. The headline 44.9% counts any site that blocks at least one, so it spans everything from a single targeted exclusion to a near-total wall.

Key Takeaways

Of 122 prominent sites checked on June 13, 2026, 107 had a parseable robots.txt, and 48 (44.9%) block at least one major AI crawler.
CCBot (37.4%) is the most-blocked crawler, followed by ClaudeBot (35.5%) and Bytespider (34.6%); Amazonbot (20.6%) is the least-blocked of the nine headline bots.
Only 9 sites (8.4%) use a blanket Disallow: / wildcard; the dominant pattern is surgical, AI-specific exclusion that leaves search crawlers untouched.
A countertrend exists: 20 sites (18.7%) publish an llms.txt to guide AI systems rather than block them.
robots.txt is a destructive record with no public per-domain history, which is why this sealed, timestamped snapshot captures something that cannot be reconstructed later.

Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Top Websites Block AI Crawlers in 2026?” https://ustechautomations.com/resources/blog/how-many-top-websites-block-ai-crawlers-2026

Sealed snapshot sha256: 741353c4304216ee

Machine-readable data: CSV · JSON · All research & methodology