Who Blocks Meta's AI Crawler? 35 of 107 Top Sites Do

Jun 13, 2026

Meta AI crawlers are blocked by 35 of 107 prominent sites in our June 2026 robots.txt snapshot — a 32.7% refusal rate. That number reflects at least one of Meta's tracked user-agents being listed in a site's robots.txt with a Disallow: / directive. Meta enters this dataset as the only operator with both a dedicated AI crawler (Meta-ExternalAgent) and a legacy social bot (FacebookBot) tracked separately, which creates a layered picture of how publishers respond to Meta's web presence.

"Blocking" means a site's robots.txt explicitly names one of Meta's user-agents and instructs compliant crawlers to stay out. This is a stated-intent measure, not a technical enforcement mechanism. A site operator who wants to exclude Meta's AI systems has to name the right user-agents; the data below shows how many have done exactly that.

All figures are verbatim counts from public robots.txt files fetched and sealed point-in-time on June 13, 2026, across a curated set of 122 prominent sites; 107 returned a parseable robots.txt and percentages are over those 107. robots.txt is an honor-system standard — it measures the site operator's stated intent, not a technical firewall. These numbers will not change as sites later edit their files.
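As a rough illustration of the "blocked" definition used here, the Python sketch below parses a robots.txt body into user-agent groups and reports which of Meta's three tracked strings are explicitly named with Disallow: /. The helper names (parse_groups, named_meta_blocks) and the sample file are hypothetical, and matching names by exact case is an assumption about how the study counts the two capitalizations separately; the report does not publish its parser.

```python
META_AGENTS = ("Meta-ExternalAgent", "meta-externalagent", "FacebookBot")


def parse_groups(robots_txt):
    """Map each named user-agent to the Disallow paths in its group."""
    rules = {}
    current = []        # agents named by the group being read
    in_rules = False    # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                current, in_rules = [], False
            current.append(value)
            rules.setdefault(value, [])
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow":
                for agent in current:
                    rules[agent].append(value)
    return rules


def named_meta_blocks(robots_txt):
    """Return the Meta user-agents explicitly named with 'Disallow: /'."""
    rules = parse_groups(robots_txt)
    return [agent for agent in META_AGENTS if "/" in rules.get(agent, [])]


# Hypothetical robots.txt body used only for illustration.
SAMPLE = """\
User-agent: Meta-ExternalAgent
User-agent: meta-externalagent
Disallow: /

User-agent: *
Disallow: /private/
"""

blocked = named_meta_blocks(SAMPLE)
print(blocked)
```

On the sample, the function returns the two Meta-ExternalAgent spellings and omits FacebookBot, which the file never names. The wildcard group is deliberately ignored, since the measure counts only sites that name Meta's agents.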

How Often Meta Is Refused

Meta appears in this dataset under three tracked user-agent strings that belong to two crawlers. Meta-ExternalAgent and its lowercase variant meta-externalagent each appear in 30 block entries; FacebookBot appears in 24. The operator total is 35, meaning some sites block only FacebookBot, some block only Meta-ExternalAgent, and some block both. Deduplication at the domain level produces the 35-site figure, so a site counts once no matter how many of the three strings it names.

User-Agent	Sites Blocking
Meta-ExternalAgent	30
meta-externalagent	30
FacebookBot	24
Operator total (deduplicated)	35
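
The per-bot counts and the operator total together pin down the overlap between the two crawlers. Below is a minimal sketch of that inclusion-exclusion arithmetic, assuming the two Meta-ExternalAgent capitalizations cover the same 30 sites (the report treats this as very likely but does not confirm it). The helper name overlap_breakdown is hypothetical.

```python
def overlap_breakdown(a_count, b_count, union_count):
    """Split a deduplicated total into 'both', 'only A', and 'only B'."""
    both = a_count + b_count - union_count   # inclusion-exclusion
    return {
        "both": both,
        "only_a": a_count - both,
        "only_b": b_count - both,
    }


# A = Meta-ExternalAgent (30 sites), B = FacebookBot (24 sites), union = 35.
breakdown = overlap_breakdown(30, 24, 35)
print(breakdown)
```

Under that assumption the split works out to 19 sites blocking both crawlers, 11 blocking only Meta-ExternalAgent, and 5 blocking only FacebookBot.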

The dual listing of Meta-ExternalAgent and meta-externalagent reflects defensive practice by site operators. RFC 9309 specifies that crawlers match the user-agent token case-insensitively, but operators cannot be sure every parser does so, and sites that want to be thorough list both forms. The 30 blocks for each variant are almost certainly the same 30 sites using both capitalizations, counted once at the operator level.
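
Whether the second spelling is needed depends on the parser reading the file. As one data point, Python's standard-library urllib.robotparser lowercases both sides before comparing, so a single mixed-case rule already covers a lowercase request. The snippet is only a sketch of that one parser's behavior; other crawlers may differ, which is the usual argument for listing both forms.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# A robots.txt that names only the mixed-case spelling.
parser.parse([
    "User-agent: Meta-ExternalAgent",
    "Disallow: /",
])

mixed_allowed = parser.can_fetch("Meta-ExternalAgent", "/article")
lower_allowed = parser.can_fetch("meta-externalagent", "/article")
other_allowed = parser.can_fetch("FacebookBot", "/article")

print(mixed_allowed, lower_allowed, other_allowed)  # prints: False False True
```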

"35 of 107 prominent sites — 32.7% — block at least one of Meta's AI crawlers, with Meta-ExternalAgent appearing in 30 of those robots.txt files as of June 13, 2026."

FacebookBot's 24-site count is notable. Some of these blocks were almost certainly in place as a reaction to social media scraping concerns that predate Meta's generative AI products. The co-occurrence of FacebookBot and Meta-ExternalAgent blocks on the same site suggests that publishers who distrust Meta's social crawler tend to extend that distrust to its newer AI system as well.

Meta-ExternalAgent: 30 site blocks. FacebookBot: 24. Operator total: 35.

"With an operator total of 35, the per-bot counts (30 + 24 - 35) imply that 19 sites block both crawlers, 11 block only Meta-ExternalAgent, and 5 block only FacebookBot, assuming the two capitalizations cover the same 30 sites. That is a clear signal that Meta's dual-crawler structure creates uneven policy coverage across publishers."

Compare this profile with the ByteDance Bytespider data, where a single crawler string produces a clean 37-site count with no deduplication complexity.

Which Industries Block Meta

News publishers account for 11 of the 35 blocks, the largest single-category contribution. Tech media follows with 8, and the Social category — an interesting self-referential data point — adds 4. Entertainment (5) and Travel (2) round out the mid-tier categories.

Category	Sites Blocking Meta Crawlers
News	11
Tech	8
Entertainment	5
Social	4
Travel	2
Retail	2
Reference	1
Education	1
Government	1

Social platforms blocking a social media company's crawler is a sharp signal. LinkedIn, Tumblr, Medium, and Vimeo are all in this category, and all four appear in the named blockers list. The logic is competitive: these platforms generate user content that Meta's models could absorb, and they have strong incentives to prevent that data from training a rival's AI.

The Education category (1 site — Coursera) and Government (1 site — congress.gov) illustrate that resistance to Meta's crawlers extends to civic and academic institutions, not just commercial publishers. The Travel category contributing 2 sites — TripAdvisor and Yelp — likely reflects review-content protection: user-generated reviews are high-value training data.

The Reference category adds 1 site. The Retail category contributes 2, both of which are large consumer platforms with substantial product-description and review content that Meta's AI could absorb for commerce use cases.

The Named Sites That Block Meta

The following 12 sites are the blockers with the highest headline-crawler counts among the 35. "Headline crawlers blocked" counts how many of the 9 headline AI bots a given site restricts, a measure of how comprehensively a site has locked down AI access.

Site	Category	Headline Crawlers Blocked (of 9)
bbc.com	News	9
bloomberg.com	News	9
usatoday.com	News	9
nytimes.com	News	8
cnn.com	News	8
wired.com	Tech	8
arstechnica.com	Tech	8
ebay.com	Retail	8
rollingstone.com	Entertainment	8
congress.gov	Government	8
linkedin.com	Social	7
tripadvisor.com	Travel	7

BBC, Bloomberg, and USA Today each carry a maximum headline-crawler block score of 9, confirming that Meta's exclusion at these properties is part of a policy stance against AI scrapers broadly, not a targeted anti-Meta decision. The Verge, Healthline, Forbes, The Atlantic, and others in the full 35-site list similarly block Meta as one component of a multi-operator exclusion strategy.

The complete 35-site list also includes: The Washington Post, The Guardian, Newsweek, Vox, The Verge, Healthline, Amazon, TechCrunch, Tumblr, Medium, Yelp, Vimeo, ESPN, Gizmodo, and Coursera. Together they span 9 of the 10 content categories tracked in this study. For a lower-block-rate comparison from the same snapshot, see the Apple Applebot-Extended report, where 31 sites block a single user-agent.

News (11) and Tech (8) produce 19 of 35 total Meta crawler blocks.

The breadth of the 35-site named-blocker set is itself meaningful. A site like coursera.org (Education, 1 headline block) blocking Meta's AI crawler suggests that even platforms whose content is designed for broad access have concluded that feeding course material into a rival's AI system crosses a line. ESPN (Entertainment, 4 headline blocks) reflects similar reasoning about sports commentary and statistics as proprietary assets. The 9-category reach of Meta's blocker set indicates that publisher resistance is not only a media-industry phenomenon; it is a cross-sector policy response.

Methodology and Data Integrity

The Closing Web snapshot was assembled by fetching the public /robots.txt file from each of the 122 curated domains on June 13, 2026. Each file was parsed into a structured user-agent-to-disallow map and sealed under snapshot sha 741353c4304216ee. Only sites that returned an HTTP 200 response with a parseable robots.txt were included in the denominator; 107 of 122 qualified.
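
The filter-and-seal step might look roughly like the following. This is a sketch, not the report's pipeline: the function names, the canonical-JSON serialization, and the choice to hash the whole snapshot with SHA-256 are all assumptions, and the report does not say how its 16-character identifier is derived.

```python
import hashlib
import json


def qualifies(status_code, parsed_map):
    """Denominator rule from the text: HTTP 200 and a parseable file."""
    return status_code == 200 and parsed_map is not None


def seal(snapshot):
    """Hash a canonical JSON form so identical data gives an identical digest."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical fetch results: domain -> (HTTP status, parsed map or None).
fetched = {
    "a.example": (200, {"Meta-ExternalAgent": ["/"]}),
    "b.example": (404, None),
    "c.example": (200, {"FacebookBot": ["/"], "GPTBot": ["/"]}),
}

# Keep only the domains that count toward the denominator, then seal.
snapshot = {d: m for d, (status, m) in fetched.items() if qualifies(status, m)}
digest = seal(snapshot)
print(len(snapshot), digest)
```

Sorting keys before hashing means the digest depends only on the data, not on the order in which domains were fetched.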

Every figure in this report is a verbatim count drawn from that sealed snapshot. Nothing is estimated, modeled, or extrapolated. Operator block counts are deduplicated at the domain level so that a site blocking both Meta-ExternalAgent and FacebookBot counts as 1 toward the 35-site total. The per-bot counts (30 and 24) are not additive and are not summed in any headline figure.
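
Domain-level deduplication amounts to a set union over the per-agent blocker lists. A toy sketch with made-up domains (the real per-site lists are in the sealed snapshot, not reproduced here):

```python
# Hypothetical per-agent blocker sets; domains are placeholders.
blocks_by_agent = {
    "Meta-ExternalAgent": {"a.example", "b.example"},
    "meta-externalagent": {"a.example", "b.example"},
    "FacebookBot": {"b.example", "c.example"},
}

# Operator total: each domain counts once, however many agents it names.
operator_domains = set().union(*blocks_by_agent.values())
operator_total = len(operator_domains)

# Summing per-agent counts double-counts overlapping domains.
per_agent_sum = sum(len(domains) for domains in blocks_by_agent.values())

print(operator_total, per_agent_sum)
```

The gap between the two numbers is why the per-bot counts cannot simply be added to reach an operator figure.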

The snapshot also tracks 20 sites that have published an llms.txt file (18.7% of the 107-site denominator) and 9 starred sites (8.4%) representing the most prominent properties in the corpus. All 9 starred sites appear in Meta's 35-site named-blocker list.

Put This Data to Work

Understanding Meta's AI crawler footprint matters for at least two audiences: content publishers deciding whether to update their own robots.txt, and data teams building competitive intelligence pipelines around AI web access.

For a content or SEO lead, the key question is whether your site's robots.txt currently addresses Meta-ExternalAgent — both capitalizations — in addition to the more-discussed GPTBot and ClaudeBot. Many robots.txt templates predate Meta's AI crawler launch and will miss it. US Tech Automations can automate a regular audit of your robots.txt against a current list of AI user-agents, flagging gaps and generating a corrected file for review.
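
A basic version of that audit can be sketched in a few lines. The function names and the sample file are hypothetical, the wanted list is just the agents mentioned above, and the check only looks at whether an agent is named at all, not at what its rules say.

```python
import re

# Matches "User-agent: <token>" at the start of a line, ignoring case.
UA_LINE = re.compile(
    r"^[ \t]*user-agent[ \t]*:[ \t]*([^#\s]+)",
    re.IGNORECASE | re.MULTILINE,
)


def named_agents(robots_txt):
    """Return the user-agent tokens named in the file, with case preserved."""
    return set(UA_LINE.findall(robots_txt))


def audit(robots_txt, wanted):
    """List wanted agents the file never names, plus a stanza to review."""
    present = named_agents(robots_txt)
    missing = [agent for agent in wanted if agent not in present]
    stanza = "".join(f"User-agent: {agent}\n" for agent in missing)
    if missing:
        stanza += "Disallow: /\n"
    return missing, stanza


WANTED = ["GPTBot", "ClaudeBot", "Meta-ExternalAgent", "meta-externalagent"]

# Hypothetical existing file that predates Meta's AI crawler.
EXISTING = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

missing, stanza = audit(EXISTING, WANTED)
print(missing)
print(stanza)
```

The generated stanza is a suggestion for human review rather than something to append automatically, since a blanket Disallow: / may be broader than a given site intends.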

For retrieval-pipeline and data-infrastructure engineers, the more interesting signal is the delta: which sites changed their Meta-crawler policy recently? US Tech Automations builds scheduled fetch-and-diff workflows that pull /robots.txt on a configurable cadence, parse the user-agent-to-disallow map into structured JSON, and surface changes via Slack alert or webhook. The same pipeline can track all 12 operators from this study in parallel.
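
The diff step of such a workflow could be sketched as below, assuming each fetch has already been parsed into a user-agent-to-disallow map. The function name, the change-record shape, and the example maps are illustrative; fetching, scheduling, and the Slack or webhook delivery are left out.

```python
import json


def diff_policies(old, new):
    """Compare two user-agent -> disallow-path maps and list the changes."""
    changes = []
    for agent in sorted(set(old) | set(new)):
        before = set(old.get(agent, []))
        after = set(new.get(agent, []))
        for path in sorted(after - before):
            changes.append({"agent": agent, "change": "added", "path": path})
        for path in sorted(before - after):
            changes.append({"agent": agent, "change": "removed", "path": path})
    return changes


# Hypothetical maps from two consecutive fetches of the same site.
previous = {"GPTBot": ["/"], "FacebookBot": ["/"]}
current = {"GPTBot": ["/"], "Meta-ExternalAgent": ["/"]}

changes = diff_policies(previous, current)
payload = json.dumps({"site": "a.example", "changes": changes}, indent=2)
print(payload)
```

Emitting one record per agent and path keeps the payload easy to filter downstream, for example to alert only on changes that touch Meta's agents.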

The FacebookBot co-block pattern is also machine-parseable: if a site blocks FacebookBot, it has a higher probability of also blocking Meta-ExternalAgent. Encoding correlation rules like this into a policy-inference layer gives you a richer signal than a simple presence/absence flag. See the Perplexity crawler report for another multi-user-agent operator to compare against.
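
One way to quantify the co-block pattern is as a conditional probability and a lift over the base rate. The sketch below uses the report's counts (24 FacebookBot blockers, 30 Meta-ExternalAgent blockers, 107 sites); the overlap of 19 is not a published figure but is derived as 30 + 24 - 35, assuming both Meta-ExternalAgent spellings cover the same sites. The helper co_block_lift is hypothetical.

```python
def co_block_lift(both, a_total, b_total, n_sites):
    """Return P(B | A), P(B), and their ratio (the lift)."""
    p_b_given_a = both / a_total
    p_b = b_total / n_sites
    return p_b_given_a, p_b, p_b_given_a / p_b


# A = FacebookBot blockers, B = Meta-ExternalAgent blockers.
derived_overlap = 30 + 24 - 35   # assumption-based, see the note above
p_given, p_base, lift = co_block_lift(derived_overlap, 24, 30, 107)

print(f"P(ExternalAgent | FacebookBot) = {p_given:.3f}")
print(f"P(ExternalAgent)               = {p_base:.3f}")
print(f"lift                           = {lift:.2f}")
```

With these inputs, roughly 79% of FacebookBot blockers also block Meta-ExternalAgent, against a base rate of about 28% across the corpus.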

Frequently Asked Questions

Q: Does blocking Meta-ExternalAgent actually stop Meta from crawling my site?

A: robots.txt is an honor system. Listing Meta-ExternalAgent in a Disallow directive signals your preference, but it does not technically prevent access. Meta's compliant crawlers are expected to respect it; the robots.txt file itself is not a firewall.

Q: Why are there three Meta user-agents in this study?

A: Meta operates both a legacy social crawler (FacebookBot) and a newer AI-focused crawler (Meta-ExternalAgent). The dataset tracks both because they serve distinct purposes and sites may block one but not the other. The dual capitalization reflects inconsistent case-sensitivity handling across different site operators.

Q: Does blocking Meta's AI crawlers affect my Facebook social shares?

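A: FacebookBot as tracked in this study is Meta's social crawler, which the study associates with generating link previews in Facebook and Instagram. If you block FacebookBot via robots.txt, link previews for your content in Facebook feeds may degrade. Meta's developer documentation attributes link-preview fetching to a separate user-agent, facebookexternalhit, so check which agents your rules actually name before assuming an effect on previews. Blocking Meta-ExternalAgent alone should not affect social previews.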

Q: Does this hurt my search SEO?

A: No. Google's Googlebot, Microsoft's Bingbot, and Apple's Applebot use separate user-agent strings for organic search indexing. Blocking Meta-ExternalAgent or FacebookBot has no known effect on Google Search rankings.

Q: How current is this data?

A: All figures come from a single fetch on June 13, 2026, sealed under snapshot sha 741353c4304216ee. robots.txt files can be updated at any time; this report reflects only the state on that specific date.

Key Takeaways

35 of 107 sites with parseable robots.txt (32.7%) block at least one Meta AI crawler as of June 13, 2026.
Meta-ExternalAgent is blocked at 30 sites; FacebookBot at 24 — some sites block one, some block both.
News (11 sites) and Tech (8 sites) account for more than half of all Meta crawler blocks.
Social platforms — LinkedIn, Tumblr, Medium, Vimeo — are among the blockers, reflecting competitive data-protection logic.
48 of the 107-site corpus block at least one AI crawler (44.9%); Meta's 32.7% places it mid-tier among the 12 tracked operators.
9 of the 10 content categories in the study contain at least one Meta blocker, confirming broad cross-sector resistance.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Who Blocks Meta's AI Crawler? 35 of 107 Top Sites Do.” https://ustechautomations.com/resources/blog/who-blocks-meta-ai-crawler-2026

Sealed snapshot sha256: 741353c4304216ee
