Who Blocks Google-Extended? 25 of 107 Top Sites Do

Jun 13, 2026

Google operates Google-Extended as a dedicated user-agent for AI training and product improvement — explicitly separate from the standard Googlebot that drives search indexing. That distinction gave publishers a clean mechanism to opt out of AI data use without risking their search rankings. As of June 13, 2026, 25 of 107 prominent sites with parseable robots.txt files have done exactly that.

25 of 107 sites block Google-Extended — the lowest among the 4 largest operators.

The 23.4% block rate is the lowest among the large commercial AI operators tracked in this corpus. Common Crawl sits at 40 sites blocked, Anthropic at 39 sites, and OpenAI at 35 sites. Google-Extended's lower resistance rate likely reflects a calculation by many publishers: blocking Google-Extended carries no search penalty, but some publishers may still hesitate to antagonize the company that drives the bulk of their organic traffic — even when those two relationships are now technically decoupled.

Snapshot Methodology

US Tech Automations fetched robots.txt files from 122 prominent sites on June 13, 2026. Of those 122, 107 returned a parseable file; all percentages in this report are computed over that 107-site base. The snapshot is point-in-time and sealed — nothing is estimated, modeled, or extrapolated. Every numeral in this report is a verbatim count from public robots.txt directives as they existed on that date.

The snapshot sha is 741353c4304216ee, which pins the exact state of the dataset. robots.txt is an honor-system standard — it measures a site operator's stated intent, not a technical firewall. These numbers will not change as sites later edit their files; they describe a specific moment in time.
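The report does not say how the snapshot identifier is derived. As a minimal sketch of the general idea of pinning a dataset with a content hash, the example below digests a set of robots.txt files in a fixed order. The function name `snapshot_digest`, the SHA-256 choice, the 16-character truncation, and the sample hosts are all assumptions for illustration, not the method used for this report.

```python
import hashlib

def snapshot_digest(files: dict[str, str], length: int = 16) -> str:
    """Return a short, order-independent digest of a set of robots.txt files.

    `files` maps a hostname to the raw robots.txt text fetched from it.
    The truncation length is an assumption chosen to resemble the
    identifier quoted in the report.
    """
    hasher = hashlib.sha256()
    # Sort by hostname so the digest does not depend on fetch order.
    for host in sorted(files):
        hasher.update(host.encode("utf-8"))
        hasher.update(b"\0")  # separator so host and body cannot run together
        hasher.update(files[host].encode("utf-8"))
        hasher.update(b"\0")
    return hasher.hexdigest()[:length]

# Hypothetical two-site snapshot, for illustration only.
before = {
    "example.com": "User-agent: Google-Extended\nDisallow: /\n",
    "example.org": "User-agent: *\nAllow: /\n",
}
# Same snapshot with one file edited.
after = {**before, "example.org": "User-agent: *\nDisallow: /\n"}

print(snapshot_digest(before))
print(snapshot_digest(before) == snapshot_digest(after))  # False: one file changed
```

Any later edit to any file yields a different digest, which is what lets a reader confirm that two analyses refer to the same frozen inputs.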

The 122-site panel spans 10 content categories and 21 tracked bot user-agents across 12 AI operators. Across the full corpus, 48 of 107 sites (44.9%) block some AI crawler. An additional 20 of 107 (18.7%) have adopted llms.txt, and 9 sites (8.4%) earned "star" status for the most comprehensive AI-access restrictions in the dataset.
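The percentages quoted so far can be reproduced from the raw counts. The sketch below assumes only the figures stated in this report, with the 107 parseable files as the denominator; the names `PARSEABLE_SITES`, `counts`, and `rate` are illustrative.

```python
# Counts quoted in this report. The denominator is the 107 parseable files,
# not the 122 sites that were fetched.
PARSEABLE_SITES = 107

counts = {
    "Google-Extended": 25,
    "OpenAI": 35,
    "Anthropic": 39,
    "Common Crawl": 40,
    "any AI crawler": 48,
    "llms.txt adopted": 20,
    "star status": 9,
}

def rate(count: int, base: int = PARSEABLE_SITES) -> float:
    """Percentage of the base, rounded to one decimal place."""
    return round(100 * count / base, 1)

for label, count in counts.items():
    print(f"{label}: {count} of {PARSEABLE_SITES} = {rate(count)}%")
```

Running it gives 23.4% for Google-Extended, 44.9% for any AI crawler, 18.7% for llms.txt adoption, and 8.4% for star status, matching the figures in the text.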

How Often Google-Extended Is Refused

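Google-Extended is a single robots.txt user-agent token. It is a control token only: Google documents no separate HTTP user-agent string for it, and the crawling itself is done by Google's existing crawlers. As with CCBot, there is no sub-agent split between training and search crawling, because Google created Google-Extended specifically so publishers could opt out of AI use without touching Googlebot. The per-bot and operator counts are therefore identical at 25.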

Google User-Agent	Sites Blocking (of 107)
Google-Extended	25

Google-Extended: 25 of 107 sites block it; every block is a clear AI-training opt-out.

The single-agent design is significant. It means every block is unambiguous: the site has chosen to opt out of Google AI training while maintaining its Google Search relationship. Of the 48 sites in this corpus that block any AI crawler at all, 25 have extended that policy to Google-Extended. The remaining 23 AI-blocking sites have drawn lines against other operators while leaving Google-Extended alone.
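To make the opt-out concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample robots.txt, the helper name `is_blocked`, and the rule that "blocked" means the root path is disallowed are assumptions for illustration; the report does not spell out its own classification rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse Google-Extended, leave everything else open.
SAMPLE_ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

def is_blocked(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """True if `agent` may not fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)

print(is_blocked(SAMPLE_ROBOTS_TXT, "Google-Extended"))  # True
print(is_blocked(SAMPLE_ROBOTS_TXT, "Googlebot"))        # False
```

The two results differ because the Google-Extended group applies only to that token, while Googlebot falls through to the wildcard group. A production classifier would also need a policy for partial-path rules, which this sketch ignores.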

That split likely reflects mixed incentives. Some publishers may be in licensing discussions with Google. Others may have added AI blocks reactively, focusing on the operators most visibly associated with generative AI concerns, without updating their policy for Google-Extended. The result is a block count that is real but lower than peers.

Sealed finding: 25 of 107 top sites (23.4%) blocked Google-Extended as of June 13, 2026 — the lowest operator-level block rate among the 4 largest AI operators in this corpus.

The comparison across the 4 largest operators places the Google-Extended figure in sharp relief. Google-Extended at 25 sits well below Common Crawl's CCBot at 40 and Anthropic at 39, the two most-blocked operators in the corpus, and also below OpenAI at 35. The lower count is not simply a reflection of fewer AI-blocking sites overall: 48 sites block at least one AI crawler, and only 25 of those 48 block Google-Extended.

Sealed finding: Of 48 total AI-blocking sites in this corpus, 25 have blocked Google-Extended and 23 have not. That means 23 of 48 (47.9%), nearly half of the sites that block any AI crawler, currently make an exception for Google.

Which Industries Block Google-Extended

News leads the category breakdown with 7 sites blocking Google-Extended, followed closely by Entertainment at 6. Tech contributes 4 blockers, Social 3. Finance, Government, Reference, Retail, and Travel each register 1 blocker.

Category	Sites Blocking Google-Extended
News	7
Entertainment	6
Tech	4
Social	3
Finance	1
Reference	1
Retail	1
Government	1
Travel	1

Entertainment (6 sites) nearly matches News (7) in Google-Extended blocking volume.
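As a quick consistency check, the category counts in the table can be tallied to confirm they sum to the 25 blockers and to see how concentrated the blocking is. The dictionary below copies the table; the variable names are illustrative.

```python
# Category counts copied from the table above.
blockers_by_category = {
    "News": 7, "Entertainment": 6, "Tech": 4, "Social": 3,
    "Finance": 1, "Reference": 1, "Retail": 1, "Government": 1, "Travel": 1,
}

total = sum(blockers_by_category.values())
print(total)  # 25

# Combined count for the two largest categories (News and Entertainment).
top_two = sum(sorted(blockers_by_category.values(), reverse=True)[:2])
print(f"{top_two} of {total} = {100 * top_two / total:.0f}%")  # 13 of 25 = 52%
```

News and Entertainment together account for just over half of all Google-Extended blockers, while five categories contribute a single site each.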

The News count of 7 is the most striking divergence from the pattern established by other operators in this corpus. News publishers are the most aggressive resisters across the corpus generally, yet only 7 News sites have added a Google-Extended block. The explanation is likely strategic: a media company that depends on Google Search for referral traffic may calculate that the risks of appearing adversarial to Google outweigh the benefits of blocking its AI training agent.

Entertainment's 6 blockers place it nearly at parity with News for this operator — an unusual configuration. Properties like Rolling Stone, Variety, Hollywood Reporter, Billboard, ESPN, and Hulu have rich content archives and strong IP protection instincts. Entertainment's blocking posture toward Google-Extended appears to be driven more by content-value protection than by traffic-dependency logic.

Finance appears for the first time as a named category in this dataset, with NerdWallet as its sole representative. Finance was not a blocking category for OpenAI, Anthropic, or Common Crawl in this corpus. For a broader comparison of category patterns across operators, see who blocks Common Crawl CCBot and the full category breakdown.

The Named Sites That Block Google-Extended

All 25 sites blocking Google-Extended are named in the sealed dataset. The table below shows 12, sorted by their overall headline-crawlers-blocked score.

Site	Category	Headline Crawlers Blocked (of 9)
bbc.com	News	9
bloomberg.com	News	9
usatoday.com	News	9
nytimes.com	News	8
cnn.com	News	8
theatlantic.com	News	8
wired.com	Tech	8
arstechnica.com	Tech	8
congress.gov	Government	8
rollingstone.com	Entertainment	8
variety.com	Entertainment	8
hollywoodreporter.com	Entertainment	8

The top tier looks familiar: BBC, Bloomberg, USA Today at 9 headline bots each, followed by NYT, CNN, The Atlantic, Wired, and Ars Technica at 8. These are organizations that have adopted comprehensive AI-crawling restrictions and are not making exceptions for Google.

The list also has notable absences compared to other operator reports. Forbes, which blocks both GPTBot and ClaudeBot in this corpus, does not appear among the Google-Extended blockers. Healthline and TripAdvisor, which block several other operators, also do not block Google-Extended here.

Twelve of the remaining 13 sites on the full list are, in descending order of score: Billboard (8 headline bots), Vox (7), LinkedIn (7), Amazon (7), TechCrunch (6), Tumblr (6), Yelp (6), Vimeo (5), Investopedia (4), ESPN (4), Hulu (2), and NerdWallet (1). Hulu and NerdWallet represent the tail of the distribution: sites that block Google-Extended specifically but have relatively narrow AI-blocking policies overall.

For a side-by-side view of how OpenAI crawlers compare on these same publishers, see who blocks OpenAI GPTBot and how the named-site lists overlap. The overlap between the two operator block lists reveals which publishers have adopted a universal AI-access stance versus a selective one.

Per-Industry Analysis: The Google-Extended Pattern

The most significant finding in the category breakdown is the compressed distribution. For most operators, News leads by a substantial margin. For Google-Extended, the margin between News (7) and Entertainment (6) is just 1 site. That compression reflects two forces pulling in opposite directions.

News publishers are the most sensitive to AI training across the corpus — they block all other operators at much higher rates. The fact that only 7 News sites have blocked Google-Extended suggests that the remaining News publishers have made a deliberate exception. The traffic-dependency hypothesis is the most plausible explanation: many major news outlets derive significant referral volume from Google Search and may perceive any action against a Google product as a relationship risk, even when the two agents are technically separate.

Entertainment's 6 blockers — Rolling Stone, Variety, Hollywood Reporter, Billboard, ESPN, Hulu — are properties with deep original-content archives. They have less dependency on Google Search for revenue than news outlets, and more to gain from protecting their content archives from AI training ingestion without the same traffic-risk calculus.

Finance (1 site) and the 1-site counts across Reference, Retail, Government, and Travel reflect the composition of the 107-site panel rather than a finding about those industries broadly. The Finance entry — NerdWallet — is notable because it does not appear in the Common Crawl, Anthropic, or OpenAI named-blocker lists. NerdWallet has adopted a specific posture toward Google-Extended that it does not apply to all operators.

Put This Data to Work

For an SEO director or content strategy lead at a publisher weighing AI access policy, the Google-Extended data point is particularly actionable. The 25-site block count — versus 39 for Anthropic and 40 for Common Crawl — shows that publishers are not treating all AI operators equally. Many are drawing distinctions based on commercial relationships, traffic dependency, or the perceived use of their content.

US Tech Automations can help you build a competitive intelligence workflow around this signal. If you are tracking how competitor publishers are adjusting their AI access policies, a scheduled robots.txt monitoring system can surface the moment a peer outlet adds or removes a Google-Extended block. That intelligence has direct implications for content strategy and licensing decisions.

The 25 Google-Extended blockers as of June 13, 2026 will drift over time. Google licensing activity with publishers is ongoing, and both additions and removals are plausible. Automated tracking is the only reliable way to know when the policy landscape shifts.
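A minimal sketch of the comparison step in such a tracker: given the set of blocking sites from two snapshots, report which sites added or removed a block. The function name `diff_blockers` and the example hosts are hypothetical, and fetching and parsing the files is assumed to happen elsewhere.

```python
def diff_blockers(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Compare two sets of blocking hostnames and report what changed."""
    return {
        "added": sorted(current - previous),    # newly blocking
        "removed": sorted(previous - current),  # no longer blocking
    }

# Hypothetical snapshots of sites blocking Google-Extended on two dates.
yesterday = {"example.com", "example.org"}
today = {"example.org", "example.net"}

changes = diff_blockers(yesterday, today)
print(changes)  # {'added': ['example.net'], 'removed': ['example.com']}
```

Keeping each snapshot as a plain set of hostnames makes the daily comparison a pair of set differences, which is easy to log and alert on.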

Frequently Asked Questions

Q: Does blocking Google-Extended affect my Google Search ranking?

A: No. Google-Extended is explicitly separate from Googlebot, which powers Search indexing. Google designed it this way so publishers could opt out of AI training without any search ranking consequence. Blocking Google-Extended will not remove your pages from Google Search.

Q: If I block Google-Extended, does that block Gemini from using my content?

A: Blocking Google-Extended affects data ingestion for Google AI training and product improvement. It does not prevent users from asking Gemini to summarize a URL they manually provide, and it does not retroactively remove content from existing datasets. It affects future crawls from the blocked date forward.

Q: Why does Google-Extended have a lower block rate than Anthropic or Common Crawl?

A: Several factors are likely. Google Search dependency creates a perceived relationship risk even though the two agents are decoupled. Google-Extended was introduced as a clean, dedicated opt-out, which may reduce the urgency of blocking it through broad wildcard rules. Some publishers may also be in active licensing discussions. The sealed count is 25 of 107, versus 39 for Anthropic and 40 for Common Crawl.

Q: Is Finance a new blocking category in this dataset?

A: In this corpus, Finance appears as a blocking category only for Google-Extended (1 site: NerdWallet). It is absent from the OpenAI, Anthropic, and Common Crawl category breakdowns for this dataset. See who blocks Anthropic ClaudeBot for the Anthropic category breakdown to confirm Finance does not appear there.

Q: How does Google-Extended compare to the other 3 large AI operators on block count?

A: Common Crawl leads at 40 sites blocked, followed by Anthropic at 39, OpenAI at 35, and Google-Extended at 25 — the lowest of the four. Among the 48 sites that block any AI crawler, only 25 have reached Google-Extended.

Key Takeaways

25 of 107 top sites block Google-Extended — the lowest operator-level block rate among the 4 largest AI operators in this corpus.
Google-Extended runs as a single user-agent with no sub-agent split; every block is an unambiguous opt-out from Google AI training.
Entertainment (6 sites) nearly matches News (7 sites) in blocking volume — an unusual pattern that does not appear for other operators in this corpus.
Finance appears as a blocking category uniquely for Google-Extended (NerdWallet, 1 headline bot), absent from the OpenAI, Anthropic, and Common Crawl breakdowns.
48 of 107 sites (44.9%) block some AI crawler; 25 of those 48 have extended the restriction to Google-Extended while 23 have not.
The sealed snapshot sha 741353c4304216ee pins the exact dataset; nothing is derived or estimated from secondary sources.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Who Blocks Google-Extended? 25 of 107 Top Sites Do.” https://ustechautomations.com/resources/blog/who-blocks-google-extended-2026

Sealed snapshot sha256: 741353c4304216ee

Machine-readable data: CSV · JSON · All research & methodology