How Many Sites Block Meta-ExternalAgent? robots.txt Data

Jun 13, 2026

Key Takeaways

30 of 107 top sites (28%) block Meta-ExternalAgent.

48 of 107 sites block at least one AI crawler.

Meta-ExternalAgent is Meta's AI crawler, one of several crawlers the company operates. The figure reported here counts how often that specific user-agent token is disallowed in the robots.txt files of the 107 prominent sites in this snapshot. That count gives publishers, engineers, and SEO professionals a concrete benchmark for where access policy toward Meta's AI crawler currently stands among prominent sites.

30 of 107 sites with a parseable robots.txt file block Meta-ExternalAgent as of June 13, 2026.

Meta crawlers collectively face blocks at 35 of the 107 sites measured in this sealed snapshot.

What Is Meta-ExternalAgent and Why Does It Matter?

Meta-ExternalAgent is the Meta AI crawler — a robot that indexes publicly accessible web content for Meta's AI-related training and retrieval systems. Publishers who want to control whether their content enters Meta's AI pipeline use the User-agent: Meta-ExternalAgent token in their robots.txt files to signal that policy.

Unlike a general-purpose search crawler, Meta-ExternalAgent is targeted: it exists specifically to feed AI workloads. That distinction matters to publishers who may be comfortable with search indexing but unwilling to contribute to AI training or answer-engine systems without compensation or a licensing agreement. The robots.txt standard is the only broadly recognized self-service mechanism those publishers have.

This report is grounded entirely in a point-in-time read of public robots.txt files. The data was collected across 122 prominent sites on June 13, 2026. Of those 122, 107 returned a parseable robots.txt file. The remaining 15 sites were excluded from blocking calculations. Nothing is estimated, modeled, or extrapolated — every figure in this post is a verbatim count from that sealed snapshot (sha 741353c4304216ee).

Snapshot Summary: Sites, Coverage, and Block Rate

The snapshot covered 122 prominent sites across news, e-commerce, social platforms, finance, travel, education, government, and entertainment verticals. Of those 122, 107 provided a parseable robots.txt file. That is the denominator for all block-rate calculations in this report.

Metric	Count
Sites in scope	122
Sites with a parseable robots.txt	107
Sites blocking Meta-ExternalAgent	30
Block rate for Meta-ExternalAgent	28%

The 28% block rate for Meta-ExternalAgent sits below the corpus-wide benchmark of 44.9% (48 of 107 sites blocking at least one AI crawler). Because every site that blocks Meta-ExternalAgent also counts toward that benchmark, 18 of the 48 sites that block some AI crawler have not added Meta-ExternalAgent to their disallow rules. Whether that reflects deliberate policy or a slower update cadence is not something this snapshot can determine: the data records only what is present in robots.txt, not the intent behind it.
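
As a quick check on the arithmetic behind these figures, the sketch below recomputes the two rates and the gap between them from the counts reported in this post. The variable and function names are illustrative and are not part of the snapshot's tooling.

```python
# Counts taken from the snapshot summary in this report.
PARSEABLE_SITES = 107
META_EXTERNALAGENT_BLOCKERS = 30
ANY_AI_CRAWLER_BLOCKERS = 48


def rate(count: int, total: int = PARSEABLE_SITES) -> float:
    """Return a block rate as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


meta_rate = rate(META_EXTERNALAGENT_BLOCKERS)
any_rate = rate(ANY_AI_CRAWLER_BLOCKERS)

# Every Meta-ExternalAgent blocker also blocks "at least one AI crawler",
# so the difference is the number of blockers that skip this token.
blockers_without_meta = ANY_AI_CRAWLER_BLOCKERS - META_EXTERNALAGENT_BLOCKERS

print(f"Meta-ExternalAgent block rate: {meta_rate}%")
print(f"Any-AI-crawler block rate: {any_rate}%")
print(f"Blockers that do not list Meta-ExternalAgent: {blockers_without_meta}")
```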

The 28% figure also places Meta-ExternalAgent at sixth position among the 9 crawlers measured in this edition, as shown in the cross-bot leaderboard below.

Who Blocks Meta-ExternalAgent? The Named Sites

The following 30 sites explicitly block Meta-ExternalAgent in their robots.txt files as of June 13, 2026. This list is complete as recorded in the snapshot — naming these sites reflects only what their robots.txt files declare, not editorial commentary on their AI policies.

News and media: nytimes.com, washingtonpost.com, theguardian.com, bbc.com, bloomberg.com, forbes.com, theatlantic.com, usatoday.com, vox.com, theverge.com, wired.com, arstechnica.com, cnet.com, zdnet.com, mashable.com, gizmodo.com, rollingstone.com, variety.com, hollywoodreporter.com, billboard.com

Health: healthline.com

E-commerce and retail: amazon.com, ebay.com

Professional and social platforms: linkedin.com, tumblr.com, medium.com

Reviews and local: tripadvisor.com, yelp.com

Education: coursera.org

Government: congress.gov

The breadth of that list is notable: it spans multiple verticals and includes both large consumer platforms and premium editorial brands. News organizations, particularly those that depend on licensing and subscription revenue, are concentrated in the blocker list: 20 of the 30 sites fall under news and media.

Several prominent sites do not block Meta-ExternalAgent, according to this snapshot. Sites in the allower group include cnn.com, reuters.com, apnews.com, wsj.com, businessinsider.com, latimes.com, time.com, newsweek.com, github.com, techcrunch.com, engadget.com, venturebeat.com, wikipedia.org, webmd.com, goodreads.com, reddit.com, pinterest.com, shopify.com, walmart.com, target.com, bestbuy.com, expedia.com, booking.com, airbnb.com, netflix.com, spotify.com, youtube.com, nasa.gov, and others.

The presence of reddit.com and wikipedia.org in the allower group, alongside major retail properties and streaming platforms, is a reminder that blocking Meta-ExternalAgent is an opt-out rather than a default. Sites that block it have explicitly added the token; under the robots.txt standard, a crawler that no rule addresses is permitted.
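
To illustrate the difference between an explicit block and silence, here is a minimal sketch using Python's standard urllib.robotparser module. The two robots.txt bodies and the example.com URL are made up for illustration and are not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names the token explicitly.
BLOCKING = """\
User-agent: Meta-ExternalAgent
Disallow: /
"""

# Hypothetical robots.txt that blocks a different crawler and says
# nothing about Meta-ExternalAgent.
SILENT = """\
User-agent: GPTBot
Disallow: /
"""


def allows(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Report whether the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked_when_named = not allows(BLOCKING, "Meta-ExternalAgent")
allowed_when_silent = allows(SILENT, "Meta-ExternalAgent")
print(blocked_when_named, allowed_when_silent)
```

One caveat: a blanket `User-agent: *` group with `Disallow: /` would also apply to Meta-ExternalAgent in this parser, so "silence" here means that neither a named group nor a wildcard group restricts the crawler.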

Cross-Bot Leaderboard: Where Meta-ExternalAgent Ranks

The table below shows blocking counts for all 9 crawlers measured in this snapshot, across all 107 sites with a parseable robots.txt file. Meta-ExternalAgent sits at rank six.

Bot	Sites Blocking	Block Rate
CCBot	40	37.4%
ClaudeBot	38	35.5%
Bytespider	37	34.6%
GPTBot	33	30.8%
Applebot-Extended	31	29.0%
Meta-ExternalAgent	30	28.0%
PerplexityBot	29	27.1%
Google-Extended	25	23.4%
Amazonbot	22	20.6%

CCBot, operated by Common Crawl, leads with 40 blocks. ClaudeBot (Anthropic) follows at 38. At 30 blocks, Meta-ExternalAgent sits in the lower-middle of the table: behind the five crawlers ranked above it, one block ahead of PerplexityBot, and clearly ahead of Google-Extended (25) and Amazonbot (22).

The spread from 40 (CCBot) to 22 (Amazonbot) is substantial. It may reflect variation in how publishers perceive the different operators and their AI use cases, though the snapshot records only the rules, not the reasons behind them. For detailed per-bot breakdowns, see the sibling reports: how many sites block CCBot and how many sites block PerplexityBot.
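
For readers who want to reproduce the leaderboard, the sketch below recomputes each block rate and the rank of Meta-ExternalAgent from the counts in the table. The counts are copied from this report; the variable names are illustrative.

```python
# Blocking counts copied from the leaderboard table (107 parseable sites).
PARSEABLE_SITES = 107
BLOCK_COUNTS = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}

# Sort by number of blocking sites, highest first.
ranked = sorted(BLOCK_COUNTS.items(), key=lambda item: item[1], reverse=True)

# Block rate as a percentage of parseable sites, rounded to one decimal.
rates = {bot: round(100 * count / PARSEABLE_SITES, 1) for bot, count in ranked}

# Rank is the 1-based position in the sorted list.
meta_rank = [bot for bot, _ in ranked].index("Meta-ExternalAgent") + 1

for position, (bot, count) in enumerate(ranked, start=1):
    print(f"{position}. {bot}: {count} sites ({rates[bot]}%)")
```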

Operator Leaderboard: Meta in Context

Beyond looking at individual bots, this snapshot also tallies blocks at the operator level — counting how many sites block at least one crawler associated with each operator. The operator leaderboard covers 12 operators across all 107 sites.

Rank	Operator	Sites Blocking (any crawler)
1	Common Crawl	40
2	Anthropic	39
3	ByteDance	37
4	OpenAI	35
4	Meta	35
6	Apple	31
7	Diffbot	30
8	Perplexity	29
9	Cohere	27
10	Google	25
11	Amazon	22
12	Mistral	12

Meta ties OpenAI at 35, five more than the 30 sites that block Meta-ExternalAgent specifically. The operator tally counts a site once if it blocks any crawler associated with Meta, so the gap comes from sites that block some other Meta token, one not measured in this nine-bot subset, without listing Meta-ExternalAgent. A site that blocks both Meta-ExternalAgent and another Meta crawler still counts only once.
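
The relationship between a bot-level count and an operator-level count can be sketched with sets. Everything in this example is hypothetical: the site names are placeholders and "other-meta-token" stands in for whichever additional Meta tokens the operator tally includes.

```python
# Hypothetical data: which sites block which token. Not snapshot data.
blocks_by_token = {
    "Meta-ExternalAgent": {"a.example", "b.example", "c.example"},
    "other-meta-token": {"c.example", "d.example"},  # placeholder token name
}

# Bot-level count: sites that block this one token.
bot_count = len(blocks_by_token["Meta-ExternalAgent"])

# Operator-level count: sites that block at least one of the operator's
# tokens. A site that blocks several tokens is still counted once.
operator_sites = set().union(*blocks_by_token.values())
operator_count = len(operator_sites)

# Sites that explain the gap: they block another token but not this one.
gap_sites = operator_sites - blocks_by_token["Meta-ExternalAgent"]

print(bot_count, operator_count, sorted(gap_sites))
```

In this toy data the operator count exceeds the bot count by one, and the extra site is the one that blocks only the placeholder token. The same logic would account for the gap between 35 and 30 in the snapshot.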

Mistral sits at the bottom of the operator leaderboard with 12 blocking sites. For a companion view of where Google ranks at the bot level, see the Google-Extended report.

Methodology and Data Integrity

This report is part of the US Tech Automations Closing Web research edition, which systematically monitors AI-access policy across a curated set of prominent sites. The methodology:

A list of 122 prominent sites was assembled across verticals representing meaningful web traffic and publisher diversity.
Each site's robots.txt was fetched and parsed on June 13, 2026. Sites that returned a parseable robots.txt file formed the analysis denominator of 107.
For each of 9 named AI crawlers, the parser checked for a Disallow rule covering the root path or a broad pattern. A site is counted as "blocking" if its robots.txt instructs the named crawler to avoid at least the root.
All counts are verbatim — nothing is estimated, modeled, or extrapolated. The snapshot is sealed under sha 741353c4304216ee.
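
The classification step described above could be approximated as follows. This is only a sketch built on Python's standard urllib.robotparser, not the parser used to produce the snapshot. The sample robots.txt is invented, and how the real pipeline treats wildcard groups or partial-path rules is an open assumption here.

```python
from urllib.robotparser import RobotFileParser

# The nine crawler tokens measured in this edition.
CRAWLERS = [
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Applebot-Extended",
    "Meta-ExternalAgent", "PerplexityBot", "Google-Extended", "Amazonbot",
]


def blocked_crawlers(robots_txt: str) -> list[str]:
    """Return the measured crawlers that may not fetch the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumption: a crawler counts as "blocked" when "/" is disallowed for it.
    return [bot for bot in CRAWLERS if not parser.can_fetch(bot, "/")]


# Hypothetical robots.txt, not taken from any site in the snapshot.
SAMPLE = """\
User-agent: GPTBot
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Disallow: /private/
"""

result = blocked_crawlers(SAMPLE)
print(result)
```

Summing the per-site results across all parseable files would then give the per-bot counts. Note that this parser applies a `User-agent: *` group to any crawler without its own group, which may or may not match the rule the snapshot applied.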

The robots.txt standard is advisory. A crawler operator can technically ignore Disallow directives. This report records stated policy, not guaranteed enforcement. Publishers who want binding restrictions must use other technical or legal mechanisms. That said, robots.txt remains the primary public signal of AI-access intent used across the industry.

For a broader view of the any-AI-crawler blocking picture, the how many sites block GPTBot report provides useful context on the most widely recognized AI crawler token.

Frequently Asked Questions

Q: Does blocking Meta-ExternalAgent in robots.txt actually stop it?

A: Not necessarily. The robots.txt standard is an honor system — it signals to a crawler that it should not access certain paths, but there is no technical enforcement mechanism. A crawler that chooses to ignore robots.txt can still fetch the content. Blocking in robots.txt is a policy declaration, not a technical lock. Publishers who require guaranteed enforcement need additional technical controls.

Q: Why do some sites block Meta-ExternalAgent but still allow other Meta crawlers?

A: Different crawler tokens serve different purposes. A publisher might permit a general-purpose Meta crawler for search or social-link-preview purposes while specifically blocking the token associated with AI training or retrieval. The robots.txt file is per-user-agent, so granular opt-out by function is possible. This snapshot measures only the Meta-ExternalAgent token.
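
A small sketch of that per-user-agent granularity, using Python's standard urllib.robotparser. The robots.txt below is hypothetical, and facebookexternalhit appears only as an example of a second Meta token; it is not one of the crawlers measured in this snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the AI token is blocked while a second Meta token,
# used here only as an illustration, is explicitly allowed.
ROBOTS_TXT = """\
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: facebookexternalhit
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"
decisions = {
    agent: parser.can_fetch(agent, url)
    for agent in ("Meta-ExternalAgent", "facebookexternalhit")
}
print(decisions)
```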

Q: How does the 28% block rate compare to the broader corpus benchmark?

A: The corpus-wide benchmark is that 48 of 107 sites (44.9%) block at least one AI crawler. Meta-ExternalAgent at 28% falls well below that line: 18 of those 48 sites block some AI crawler without targeting Meta-ExternalAgent. Whether that will change as publishers update their policies is not something this point-in-time snapshot can predict.

Q: What does the operator count of 35 mean relative to the bot count of 30?

A: The operator count of 35 for Meta means that 35 of the 107 sites block at least one crawler associated with Meta. The bot count of 30 means that 30 sites specifically block Meta-ExternalAgent. The difference of five reflects sites that block a different Meta-associated crawler token, one not measured in this nine-bot subset, but do not list Meta-ExternalAgent in their disallow rules.

Q: Is this data updated continuously?

A: No. This is a point-in-time snapshot collected on June 13, 2026. The Closing Web edition publishes periodic snapshots rather than continuous monitoring. A robots.txt file can change at any time, and a site that blocks Meta-ExternalAgent today may change its policy in either direction. For context on how the lower end of the blocking spectrum behaves, see the Amazonbot report.

Put AI-Access Data to Work

An SEO or content operations lead managing a publisher portfolio needs to know, at scale, which sites have changed their AI-access policy since the last snapshot — not as a one-time lookup, but as an ongoing monitoring workflow. Reading robots.txt files manually across dozens or hundreds of domains is not sustainable.

US Tech Automations builds agentic workflows that automate exactly this kind of monitoring: scheduled crawls of robots.txt across a defined site list, diff detection when tokens are added or removed, and alerting routed to the right stakeholder. Instead of a quarterly manual audit, the workflow runs on a defined cadence and surfaces changes as they happen.
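
As a rough illustration of the diff-detection step, and not a description of the platform's actual implementation, two snapshots can be compared site by site. The site names and the data shape (a mapping from site to the set of blocked tokens) are assumptions made for this sketch.

```python
def diff_blocks(
    previous: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """Return, per site, the tokens added to or removed from the block list."""
    changes = {}
    for site in sorted(previous.keys() | current.keys()):
        before = previous.get(site, set())
        after = current.get(site, set())
        added, removed = after - before, before - after
        # Only report sites whose set of blocked tokens actually changed.
        if added or removed:
            changes[site] = {"added": added, "removed": removed}
    return changes


# Hypothetical snapshots keyed by placeholder site names.
previous = {
    "news.example": {"GPTBot"},
    "shop.example": {"CCBot", "Meta-ExternalAgent"},
}
current = {
    "news.example": {"GPTBot", "Meta-ExternalAgent"},
    "shop.example": {"CCBot"},
    "blog.example": set(),
}

changes = diff_blocks(previous, current)
for site, change in changes.items():
    print(site, change)
```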

Explore agentic monitoring workflows on the platform to see how this kind of AI-access tracking can be operationalized across a full site portfolio.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Sites Block Meta-ExternalAgent? robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-meta-externalagent-2026

Sealed snapshot sha256: 741353c4304216ee