How Many Sites Block Google-Extended? robots.txt Data

Jun 13, 2026

Key Takeaways

25 of 107 top sites block Google-Extended.

Google-Extended is blocked at a 23.4% rate across 107 sites.

48 of 107 sites block at least one AI crawler.

Google-Extended is the Google AI-training opt-out token. It is a distinct user-agent token that publishers can use specifically to exclude their content from Google's AI training pipeline without affecting their regular Google Search indexing. That separation is meaningful: a site can permit Googlebot for search while blocking Google-Extended for AI purposes. This report documents how 107 prominent sites have exercised that opt-out as of June 13, 2026.

25 of 107 sites with a parseable robots.txt file block Google-Extended as of June 13, 2026.

Google-Extended at 23.4% is the second-lowest block rate among the 9 measured crawlers, above only Amazonbot.

What Is Google-Extended and Why Does It Matter?

Google-Extended is the Google AI-training opt-out token: a robots.txt user-agent token that Google introduced specifically to let publishers control whether their content contributes to Google AI model training and products. It is not the standard Googlebot used for web indexing and search ranking, and it does not crawl under a separate HTTP user-agent string of its own; it acts as a control token that Google's existing crawlers honor. Blocking Google-Extended leaves a site fully visible to Google Search, while signaling that the publisher does not want their content used in AI training contexts.

That design makes Google-Extended functionally different from crawler tokens that serve more than one purpose. Where an operator uses a single bot for both AI-specific and general-purpose crawling, blocking that bot affects both. With Google-Extended, publishers can make a narrowly targeted policy decision: yes to search, no to AI training. Publishers who have adopted this approach appear in the blocker list.
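The yes-to-search, no-to-AI-training policy can be sketched with a hypothetical robots.txt and Python's standard-library `urllib.robotparser`. The file contents and the example.com URL are illustrative assumptions, and the standard-library parser applies its own matching rules, which may differ in detail from how Google interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: general crawling allowed, Google-Extended blocked.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL used only for the check.
url = "https://example.com/article"
search_allowed = parser.can_fetch("Googlebot", url)
ai_training_allowed = parser.can_fetch("Google-Extended", url)

print(search_allowed, ai_training_allowed)  # True False
```

The Google-Extended group only matches that token, so Googlebot falls through to the wildcard group and stays allowed.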

This report is grounded entirely in a point-in-time read of public robots.txt files. The data covers 122 prominent sites measured on June 13, 2026. Of those, 107 returned a parseable robots.txt file. Every count is a verbatim read from the sealed snapshot (sha 741353c4304216ee); nothing is estimated, modeled, or extrapolated.

Snapshot Summary: Sites, Coverage, and Block Rate

The snapshot covered 122 prominent sites spanning news, e-commerce, professional platforms, social media, finance, travel, education, government, and entertainment verticals.

Metric	Count
Sites in scope	122
Sites with a parseable robots.txt	107
Sites blocking Google-Extended	25
Block rate for Google-Extended	23.4%

At 23.4%, the Google-Extended block rate is the second lowest in the nine-bot dataset, sitting above only Amazonbot. It falls substantially below the corpus-wide any-AI-block benchmark of 44.9% (48 of 107 sites). Put differently, 23 of the 48 sites that block at least one AI crawler, just under half, have not added Google-Extended to their disallow rules.
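The percentages above follow directly from the reported counts. The short sketch below reproduces them and also shows why the choice of denominator matters; the in-scope rate is computed here purely for illustration and is not a figure the report publishes.

```python
# Counts reported in this article.
SITES_IN_SCOPE = 122
PARSEABLE_SITES = 107
GOOGLE_EXTENDED_BLOCKERS = 25
ANY_AI_BLOCKERS = 48


def percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


google_extended_rate = percent(GOOGLE_EXTENDED_BLOCKERS, PARSEABLE_SITES)
any_ai_rate = percent(ANY_AI_BLOCKERS, PARSEABLE_SITES)
# Illustration only: same count over all in-scope sites, including unparseable ones.
in_scope_rate = percent(GOOGLE_EXTENDED_BLOCKERS, SITES_IN_SCOPE)

print(google_extended_rate, any_ai_rate, in_scope_rate)  # 23.4 44.9 20.5
```

Using the 107 parseable sites as the denominator keeps sites with no readable policy from being counted as non-blockers.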

Several factors may contribute to the relatively lower blocking rate. Publishers may be hesitant to conflict with Google's AI ecosystem given the search-indexing relationship. They may also not be as familiar with Google-Extended as a distinct opt-out token compared to more prominently discussed crawlers like GPTBot or CCBot.

Who Blocks Google-Extended? The Named Sites

The following 25 sites explicitly block Google-Extended in their robots.txt files as of June 13, 2026.

News and media: nytimes.com, bbc.com, cnn.com, bloomberg.com, theatlantic.com, usatoday.com, vox.com, theverge.com, wired.com, arstechnica.com, rollingstone.com, variety.com, hollywoodreporter.com, billboard.com

Technology press and business: techcrunch.com

Finance: investopedia.com, nerdwallet.com

E-commerce: amazon.com

Professional and social platforms: linkedin.com, tumblr.com, vimeo.com

Reviews: yelp.com

Entertainment: hulu.com

Sports: espn.com

Government: congress.gov

The news and media vertical again makes up the largest share of this list, with 14 of the 25 blockers, consistent with the patterns seen across the other crawlers in this edition. Notable inclusions are hulu.com and espn.com, streaming and sports media properties, alongside finance sites investopedia.com and nerdwallet.com.

A substantial group of prominent sites does not block Google-Extended. The allower group includes washingtonpost.com, theguardian.com, reuters.com, apnews.com, wsj.com, forbes.com, businessinsider.com, latimes.com, time.com, newsweek.com, github.com, engadget.com, cnet.com, zdnet.com, mashable.com, gizmodo.com, venturebeat.com, wikipedia.org, webmd.com, healthline.com, reddit.com, pinterest.com, shopify.com, walmart.com, target.com, bestbuy.com, etsy.com, netflix.com, spotify.com, youtube.com, and government sites including nasa.gov, census.gov, and irs.gov.

The presence of washingtonpost.com and theguardian.com in the allower group is notable, since both appear in other bots' blocker lists. It suggests that blocking Google-Extended is a more selective decision than blocking some other AI crawlers, and that some publishers who block other AI bots have so far left the Google token alone. For a comparison with a crawler that attracts more blocks in the media sector, see the PerplexityBot report.

Cross-Bot Leaderboard: Where Google-Extended Ranks

The table below shows blocking counts for all 9 crawlers measured in this snapshot, across all 107 sites with a parseable robots.txt file.

Bot	Sites Blocking	Block Rate
CCBot	40	37.4%
ClaudeBot	38	35.5%
Bytespider	37	34.6%
GPTBot	33	30.8%
Applebot-Extended	31	29.0%
Meta-ExternalAgent	30	28.0%
PerplexityBot	29	27.1%
Google-Extended	25	23.4%
Amazonbot	22	20.6%

Google-Extended at 25 blocks ranks eighth among the 9 crawlers — the second-lowest position, above only Amazonbot at 22. The gap from Google-Extended to PerplexityBot above it is four sites, a meaningful step rather than a rounding difference.
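As a cross-check on the leaderboard, the sketch below recomputes block rates and rank positions from the counts in the table, using the 107-site denominator. It is a verification aid written for this article, not the tooling used to produce the snapshot.

```python
# Blocking counts copied from the cross-bot leaderboard table.
BLOCK_COUNTS = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}
PARSEABLE_SITES = 107

# Rank bots from most blocked to least blocked.
ranked = sorted(BLOCK_COUNTS.items(), key=lambda item: item[1], reverse=True)
rank_of = {bot: position for position, (bot, _) in enumerate(ranked, start=1)}
rates = {bot: round(100 * n / PARSEABLE_SITES, 1) for bot, n in BLOCK_COUNTS.items()}
gap_to_perplexity = BLOCK_COUNTS["PerplexityBot"] - BLOCK_COUNTS["Google-Extended"]

for bot, count in ranked:
    print(f"{rank_of[bot]}\t{bot}\t{count}\t{rates[bot]:.1f}%")
```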

The top of the table (CCBot at 40, ClaudeBot at 38, Bytespider at 37) represents crawlers primarily associated with AI training dataset construction. The lower part of the table (Applebot-Extended, Meta-ExternalAgent, PerplexityBot, Google-Extended, Amazonbot) leans toward crawlers tied to AI product and inference use cases, which appear to attract somewhat fewer targeted blocks. The clustering is not perfect (Google-Extended, for example, is itself a training opt-out token), but the pattern is visible in the data.

Operator Leaderboard: Google in Context

The operator leaderboard aggregates blocks at the company level across all 107 sites.

Rank	Operator	Sites Blocking (any crawler)
1	Common Crawl	40
2	Anthropic	39
3	ByteDance	37
4	OpenAI	35
4	Meta	35
6	Apple	31
7	Diffbot	30
8	Perplexity	29
9	Cohere	27
10	Google	25
11	Amazon	22
12	Mistral	12

Google ranks tenth among the 12 operators at 25, the same as the Google-Extended bot count. That alignment occurs when an operator's only measured crawler is the one named in the bot-level analysis. In other words, the 25 sites blocking Google-Extended account for the full Google AI-crawler footprint in this snapshot.
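The operator-level count can be read as a set union: a site counts against an operator if it blocks any of that operator's crawlers. The sketch below illustrates why an operator total can exceed its best-known bot's count while a single-crawler operator's total cannot. The domains, the blocker sets, and the `OtherAnthropicBot` token are hypothetical placeholders, not data from the snapshot.

```python
# Hypothetical per-crawler blocker sets (placeholder domains, not report data).
BLOCKERS = {
    "ClaudeBot": {"a.example", "b.example"},
    "OtherAnthropicBot": {"b.example", "c.example"},  # hypothetical second token
    "Google-Extended": {"a.example"},
}

# Assumed operator-to-crawler mapping, for illustration only.
OPERATOR_CRAWLERS = {
    "Anthropic": ["ClaudeBot", "OtherAnthropicBot"],
    "Google": ["Google-Extended"],
}


def operator_blockers(operator: str) -> set[str]:
    """Sites that block at least one crawler belonging to the operator."""
    sites: set[str] = set()
    for crawler in OPERATOR_CRAWLERS[operator]:
        sites |= BLOCKERS.get(crawler, set())
    return sites


anthropic_total = len(operator_blockers("Anthropic"))
google_total = len(operator_blockers("Google"))
print(anthropic_total, google_total)  # 3 1
```

With one crawler mapped to Google, the union is just that crawler's blocker set, which is the alignment described above.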

Common Crawl leads at 40 and Anthropic follows at 39 — both notably above Google. Mistral at 12 sits at the bottom of the operator table. For a view of a crawler that attracts the fewest blocks in this edition, the Amazonbot report covers that in detail.

Methodology and Data Integrity

This report is part of the US Tech Automations Closing Web research edition. The methodology is consistent across all reports in this edition:

A list of 122 prominent sites was assembled, covering major verticals with meaningful web traffic and publisher diversity.
Each site's robots.txt was fetched and parsed on June 13, 2026. Sites that returned a parseable robots.txt file formed the analysis denominator of 107.
For each named AI crawler, the parser checked for a Disallow rule targeting the root path or a broad pattern. A site is counted as "blocking" if its robots.txt instructs the named crawler to avoid at least the root.
All counts are verbatim. Nothing is estimated, modeled, or extrapolated. The snapshot is sealed under sha 741353c4304216ee.
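The counting rule in the steps above can be approximated with Python's standard-library `urllib.robotparser`. This is a minimal sketch under stated assumptions, not the parser used for the snapshot: the robots.txt bodies and domains are hypothetical, only the root-path check is implemented (the "broad pattern" case is not), and the standard-library parser does not implement every wildcard rule that production crawlers support.

```python
from urllib.robotparser import RobotFileParser

# A subset of the measured crawler tokens, kept short for the example.
AI_TOKENS = ["GPTBot", "CCBot", "Google-Extended"]

# Hypothetical robots.txt bodies keyed by placeholder domains (not report data).
SAMPLE_ROBOTS = {
    "site-a.example": "User-agent: Google-Extended\nDisallow: /\n",
    "site-b.example": "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n",
    "site-c.example": "User-agent: *\nDisallow: /private/\n",
}


def blocks_root(robots_txt: str, token: str) -> bool:
    """True if the robots.txt text tells `token` not to fetch the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, "/")


# Sites blocking each token, and sites blocking at least one token.
per_token = {
    token: sorted(s for s, body in SAMPLE_ROBOTS.items() if blocks_root(body, token))
    for token in AI_TOKENS
}
any_ai = sorted(
    s for s, body in SAMPLE_ROBOTS.items()
    if any(blocks_root(body, token) for token in AI_TOKENS)
)

print(per_token)
print(any_ai)
```

In this sample, site-c.example restricts only a subdirectory for all agents, so it is not counted as blocking any AI crawler.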

A key methodological note specific to Google-Extended: blocking this token does not affect a site's relationship with Googlebot or its presence in Google Search results. The opt-out is specific to AI training and AI product use. When interpreting the 25-site blocker list, it is worth noting that these sites are likely still indexed by Google Search — the block is narrower than it might appear.

The robots.txt standard is advisory. Crawler operators can choose to honor or ignore Disallow rules. This report records stated policy only. For the cross-bot comparison context, the Meta-ExternalAgent report covers the adjacent rank position.

Frequently Asked Questions

Q: Does blocking Google-Extended affect a site's Google Search rankings?

A: No. Google-Extended is a separate robots.txt user-agent token from Googlebot, which is the crawler responsible for Search indexing and ranking. Blocking Google-Extended does not instruct Googlebot to stop crawling. A site that adds Google-Extended to its robots.txt disallow list remains fully eligible for Google Search indexing and ranking while opting out of AI training use specifically.

Q: Why does Google rank tenth as an operator when it is one of the largest technology companies?

A: Operator rankings in this snapshot reflect how many sites have chosen to block at least one crawler associated with each operator — they do not reflect company size, revenue, or market influence. Google ranks tenth because fewer of the 107 measured sites have added Google-Extended to their disallow rules compared to operators like Common Crawl, Anthropic, or ByteDance. One possible factor is that Google introduced Google-Extended as an explicit opt-out mechanism, which may signal to some publishers that the default (not blocking) is acceptable.

Q: Is 23.4% a low block rate in the context of this dataset?

A: It is the second lowest in the nine-bot measurement set, above only Amazonbot at 20.6%. The corpus-wide benchmark is 44.9%, meaning 44.9% of measured sites block at least one AI crawler, so 23.4% sits well below that any-block rate. However, blocking is not the only policy option. Some publishers may be in licensing negotiations, using other technical controls, or simply monitoring before deciding. This data captures only what is declared in robots.txt files.

Q: How does Google-Extended compare to other AI-training-specific crawlers?

A: CCBot, which leads the leaderboard at 40 blocks and 37.4%, is also primarily associated with AI training data. The gap between CCBot and Google-Extended is substantial. One difference is that CCBot is operated by Common Crawl, a nonprofit data provider whose outputs are widely used in AI model training by multiple parties. Publishers may perceive that use case as more open-ended and respond more aggressively. Google-Extended by contrast is tied to a single named company with a defined product scope.

Q: Could the block rate for Google-Extended increase in future snapshots?

A: This report does not forecast future policy changes. It records only the state on June 13, 2026. Publisher AI-access policies are actively evolving, and both the number of blocking sites and the specific sites in each category can change. For a companion view of publisher behavior toward a different operator, see the ClaudeBot report.

Put AI-Access Data to Work

A retrieval or data engineer building a content-sourcing pipeline needs to know which sites permit AI access before including them in an ingestion workflow. Checking robots.txt manually for each source is feasible at small scale — not at scale across a corpus of hundreds of domains, and not when publisher policies change without notice.

US Tech Automations builds agentic workflows that automate this kind of AI-access monitoring: scheduled robots.txt reads across a defined source list, change detection when disallow rules are added or removed, and structured output that feeds directly into ingestion pipeline logic. When a publisher adds Google-Extended to their disallow list, the workflow surfaces that change before it becomes a compliance issue.
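The change-detection step can be pictured as a diff between two snapshots of blocked tokens per site. The sketch below is one assumed way to structure it, with hypothetical domains and a hypothetical `diff_blocks` helper; it does not describe the actual US Tech Automations workflow.

```python
def diff_blocks(
    previous: dict[str, set[str]], current: dict[str, set[str]]
) -> list[dict[str, str]]:
    """Compare two {site: blocked_tokens} snapshots and list what changed."""
    changes = []
    for site in sorted(set(previous) | set(current)):
        before = previous.get(site, set())
        after = current.get(site, set())
        for token in sorted(after - before):
            changes.append({"site": site, "token": token, "change": "added"})
        for token in sorted(before - after):
            changes.append({"site": site, "token": token, "change": "removed"})
    return changes


# Hypothetical snapshots keyed by placeholder domains (not report data).
yesterday = {"a.example": {"GPTBot"}, "b.example": {"CCBot", "Google-Extended"}}
today = {"a.example": {"GPTBot", "Google-Extended"}, "b.example": {"CCBot"}}

changes = diff_blocks(yesterday, today)
for change in changes:
    print(change)
```

Each record is a plain dictionary, so the output can be serialized and passed to whatever ingestion logic consumes it.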

Explore agentic monitoring workflows on the platform to see how this type of AI-access tracking integrates with content operations and data engineering workflows.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Sites Block Google-Extended? robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-google-extended-2026

Sealed snapshot sha256: 741353c4304216ee
