Who Blocks Diffbot? 30 of 107 Top Sites Do

Jun 13, 2026

As of June 13, 2026, 30 of the 107 top sites that returned a parseable robots.txt file explicitly named the Diffbot user-agent and paired it with at least one Disallow directive. That places Diffbot above both the cohere-ai crawler (27 sites) and Amazonbot (22 sites) in block count among the 12 AI operators tracked in this corpus.

"Blocking" in this report means a site's robots.txt file explicitly names the Diffbot user-agent string and pairs it with a disallow rule covering at least one path. It does not mean a technical firewall is in place — only that the site has stated its policy of excluding Diffbot from crawling.
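
The definition above can be made concrete with a short sketch. The Python function below is an illustration only, not the pipeline used for this report: the name `explicitly_blocks` and its simplified group handling are assumptions. It returns True only when a group that names the agent carries a non-empty Disallow rule, so a wildcard `User-agent: *` group never counts as a block.

```python
def explicitly_blocks(robots_txt: str, agent: str = "diffbot") -> bool:
    """Return True if a group naming `agent` contains a non-empty Disallow rule."""
    agent = agent.lower()
    group_agents: list[str] = []  # user-agents named by the current group
    in_rules = False              # True once the current group has listed a rule

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            # An empty "Disallow:" means "allow everything", so it is ignored.
            if field == "disallow" and value and agent in group_agents:
                return True
    return False


named = "User-agent: *\nDisallow: /private\n\nUser-agent: Diffbot\nDisallow: /\n"
wildcard_only = "User-agent: *\nDisallow: /\n"
print(explicitly_blocks(named))          # True
print(explicitly_blocks(wildcard_only))  # False
```

The match on the agent name is exact and case-insensitive, which is a design choice of this sketch: a site that shuts out every crawler through the wildcard group has not named Diffbot and would not be counted.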

Methodology and Data Integrity

This report is based on a point-in-time fetch of public robots.txt files from 122 prominent sites, sealed on June 13, 2026. Of those 122 sites, 107 returned a parseable robots.txt file. All counts in this report are verbatim from the sealed snapshot and reflect stated access policy as of that date only.

Every figure in this report is a verbatim count from the snapshot: nothing is estimated, modeled, or extrapolated. No percentages have been derived from sub-groups after the fact. The methodology treats each disallow rule as a binary presence: either the Diffbot user-agent appears with a disallow path in a site's robots.txt file, or it does not.

robots.txt is an honor-system standard that communicates a site operator's stated intent. It is not a technical enforcement mechanism. A compliant crawler will respect the directives; a non-compliant one will not. These numbers will not change as sites later edit their files — the snapshot is sealed.
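
To illustrate what "compliant" means in practice, here is a minimal sketch of the check a well-behaved crawler could run before each request, using Python's standard-library `urllib.robotparser`. The sample file, the `ExampleBot` agent name, and the example.com URLs are hypothetical; nothing here describes how Diffbot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Diffbot is refused everywhere, others only under /admin/.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether `agent` is permitted to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


article = "https://example.com/articles/1"
print(may_fetch(ROBOTS_TXT, "Diffbot", article))     # False
print(may_fetch(ROBOTS_TXT, "ExampleBot", article))  # True
```

Nothing stops a crawler from skipping this check, which is why the counts in this report describe stated policy and not enforcement.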

How Often Diffbot Is Refused

Diffbot operates a single crawler user-agent under its own brand name. The block count is therefore one-to-one: each of the 30 blocking sites named the Diffbot user-agent string directly.

User-Agent	Sites Blocking
Diffbot	30

For broader corpus context: 48 of 107 sites block at least one AI crawler of any kind, and 20 sites publish an llms.txt or equivalent structured access file. At 30 sites, Diffbot sits above the cohere-ai crawler (27 sites) and Amazonbot (22 sites), the two operators this report gives comparison counts for.

30 of the 107 top sites with parseable robots.txt files had explicitly blocked Diffbot as of June 13, 2026. News and Tech categories account for 20 of those 30 blocks combined.

Diffbot is blocked by 30 of 107 sites, or 28% of the panel.

Which Industries Block Diffbot

News and Tech publishers drive the largest share of Diffbot blocks. This pattern is consistent with Diffbot's core commercial value: it specializes in structured-data extraction from text-rich pages, and news and tech articles are exactly the high-density content that extraction systems are built to consume.

Category	Sites Blocking Diffbot
News	12
Tech	8
Entertainment	4
Reference	2
Social	2
Government	1
Retail	1

News leads with 12 blockers, followed by Tech at 8. Together they account for 20 of the 30 blocks. Entertainment adds 4 more: rollingstone.com, variety.com, hollywoodreporter.com, and billboard.com. That pattern is consistent with concern among music, film, and entertainment publishers that structured extraction of their catalog and interview content feeds AI products without licensing arrangements.

Reference sites (2 blocks: healthline.com and quora.com) and Social platforms (2 blocks: linkedin.com and vimeo.com) add a smaller share. Government at 1 — congress.gov — represents a clear policy decision to restrict automated structured extraction from official legislative content.

The Retail block (amazon.com) is notable given the dual role amazon.com plays as both a content platform and a business rival to AI-extraction services that index product data.
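
The category counts above can be reproduced from the named-site list published in this report. The sketch below is one way to do that tally in Python; the variable names are illustrative, and the site and category pairs are copied from the report's own table of 30 blockers.

```python
from collections import Counter

# (site, category) pairs copied from the report's table of named sites.
BLOCKERS = [
    ("bbc.com", "News"), ("bloomberg.com", "News"), ("usatoday.com", "News"),
    ("nytimes.com", "News"), ("cnn.com", "News"), ("forbes.com", "News"),
    ("theatlantic.com", "News"), ("wired.com", "Tech"), ("arstechnica.com", "Tech"),
    ("cnet.com", "Tech"), ("zdnet.com", "Tech"), ("mashable.com", "Tech"),
    ("congress.gov", "Government"), ("rollingstone.com", "Entertainment"),
    ("variety.com", "Entertainment"), ("hollywoodreporter.com", "Entertainment"),
    ("billboard.com", "Entertainment"), ("washingtonpost.com", "News"),
    ("newsweek.com", "News"), ("vox.com", "News"), ("theverge.com", "Tech"),
    ("healthline.com", "Reference"), ("amazon.com", "Retail"),
    ("linkedin.com", "Social"), ("techcrunch.com", "Tech"),
    ("quora.com", "Reference"), ("vimeo.com", "Social"),
    ("businessinsider.com", "News"), ("latimes.com", "News"), ("gizmodo.com", "Tech"),
]

# Count how many blocking sites fall into each category.
by_category = Counter(category for _, category in BLOCKERS)

for category, count in by_category.most_common():
    print(f"{category}\t{count}")

news_and_tech = by_category["News"] + by_category["Tech"]
print(f"News + Tech: {news_and_tech} of {len(BLOCKERS)}")
```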

Content-as-product publishers are the clear drivers. Outlets whose revenue model depends on exclusive access to their text — whether paywalled news or licensed entertainment content — have a direct economic incentive to block a service designed to extract and structure that content at scale.

9 sites in this corpus — including bbc.com, bloomberg.com, and usatoday.com — block all 9 headline AI crawlers tracked. Diffbot is one of those 9 headline bots, so broad-spectrum blockers account for a meaningful portion of its 30-site total.

The Named Sites That Block Diffbot

All 30 blocking sites are identified in the sealed snapshot. The table below shows all 30, ordered by headline-crawlers-blocked count.

Site	Category	Headline Crawlers Blocked (of 9)
bbc.com	News	9
bloomberg.com	News	9
usatoday.com	News	9
nytimes.com	News	8
cnn.com	News	8
forbes.com	News	8
theatlantic.com	News	8
wired.com	Tech	8
arstechnica.com	Tech	8
cnet.com	Tech	8
zdnet.com	Tech	8
mashable.com	Tech	8
congress.gov	Government	8
rollingstone.com	Entertainment	8
variety.com	Entertainment	8
hollywoodreporter.com	Entertainment	8
billboard.com	Entertainment	8
washingtonpost.com	News	7
newsweek.com	News	7
vox.com	News	7
theverge.com	Tech	7
healthline.com	Reference	7
amazon.com	Retail	7
linkedin.com	Social	7
techcrunch.com	Tech	6
quora.com	Reference	5
vimeo.com	Social	5
businessinsider.com	News	3
latimes.com	News	3
gizmodo.com	Tech	3

Sites at a headline score of 9 — bbc.com, bloomberg.com, usatoday.com — block all 9 headline AI crawlers tracked in the corpus. Diffbot is caught in a blanket AI-access policy at those properties, not a targeted decision against Diffbot specifically.

At the other end, businessinsider.com, latimes.com, and gizmodo.com each score 3 of 9 headline crawlers blocked. These are partial blockers that have made more selective choices. Their presence on the Diffbot list — despite not running a broad-spectrum deny policy — is a signal of more deliberate operator-specific blocking.
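
The tiers described above can be tallied directly from the table. The sketch below groups the 30 sites by headline score; the tier cut-offs (a score of 7 or more as "broad", 3 or less as "selective") are assumptions made for illustration, not thresholds defined by the snapshot.

```python
# Headline-crawler scores (of 9) copied from the report's table of 30 sites.
SITES_BY_SCORE = {
    9: ["bbc.com", "bloomberg.com", "usatoday.com"],
    8: ["nytimes.com", "cnn.com", "forbes.com", "theatlantic.com", "wired.com",
        "arstechnica.com", "cnet.com", "zdnet.com", "mashable.com", "congress.gov",
        "rollingstone.com", "variety.com", "hollywoodreporter.com", "billboard.com"],
    7: ["washingtonpost.com", "newsweek.com", "vox.com", "theverge.com",
        "healthline.com", "amazon.com", "linkedin.com"],
    6: ["techcrunch.com"],
    5: ["quora.com", "vimeo.com"],
    3: ["businessinsider.com", "latimes.com", "gizmodo.com"],
}


def count_where(predicate) -> int:
    """Count the sites whose headline score satisfies `predicate`."""
    return sum(len(sites) for score, sites in SITES_BY_SCORE.items() if predicate(score))


total = count_where(lambda score: True)         # every site in the table
blanket = count_where(lambda score: score == 9) # block all 9 headline crawlers
broad = count_where(lambda score: score >= 7)   # assumed cut-off for "broad"
selective = count_where(lambda score: score <= 3)  # assumed cut-off for "selective"

print(total, blanket, broad, selective)  # 30 3 24 3
```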

Broad-spectrum blocking is the dominant posture among Diffbot's blockers: 24 of the 30 sites in the table block at least 7 of the 9 headline crawlers. Sites that block Diffbot almost universally also block OpenAI's GPTBot and Anthropic's ClaudeBot. The exceptions, partial blockers like gizmodo.com, represent a more surgical policy choice.

Corpus-Wide Access Policy Context

Beyond Diffbot's own 30-site total, this sealed snapshot captures several corpus-level figures that help locate Diffbot's block rate in context. Of the 122 sites checked, 107 returned a parseable robots.txt. Of those 107, 48 block at least one AI crawler of any kind.

20 sites publish an llms.txt or equivalent structured access file, a signal of proactive AI-access policy that goes beyond robots.txt alone. Only 9 sites (8.4% of the 107 with a parseable robots.txt) block all 9 headline AI crawlers tracked, the most restrictive tier. Diffbot's 30-site total places it clearly above the field median among the 12 tracked operators.
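
The two percentages quoted in this report, 28% for Diffbot and 8.4% for the block-everything tier, both use the 107 parseable sites as the denominator, not the 122 sites checked. A quick arithmetic check, with a hypothetical helper name:

```python
PARSEABLE_SITES = 107  # sites that returned a parseable robots.txt


def share_of_panel(count: int, total: int = PARSEABLE_SITES) -> float:
    """Percentage of the panel, rounded to one decimal place."""
    return round(100 * count / total, 1)


diffbot_share = share_of_panel(30)   # sites that block Diffbot
block_all_share = share_of_panel(9)  # sites that block all 9 headline crawlers

print(diffbot_share)    # 28.0
print(block_all_share)  # 8.4
```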

For teams evaluating retrieval pipelines, the practical implication is that 30 of the highest-profile sites on the web have stated their intent to exclude Diffbot. How many more of the 107 sites will eventually add that rule is not something this snapshot can answer — it records stated policy on June 13, 2026, not trajectory.

Put This Data to Work

For a content operations or RevOps lead at an enterprise that licenses Diffbot for internal knowledge pipelines, this data tells a clear story: 30 prominent sites have opted out, and the list skews heavily toward news and tech — exactly the content categories that make Diffbot valuable.

The more actionable problem is that robots.txt is a living document. A site that allows Diffbot today may block it next quarter with no announcement. US Tech Automations builds scheduled robots.txt monitoring pipelines that fetch, diff, and alert on policy changes for the specific user-agents a retrieval stack depends on.

US Tech Automations can wire monitoring directly into an incident workflow: a Slack alert, a Jira ticket, or a webhook to a data-ingestion service. The result is a retrieval stack that degrades gracefully and visibly instead of silently.

For teams evaluating whether to build on Diffbot at all, pairing block-rate data with a periodic re-fetch — run every 30 days, diffed against a baseline — gives a living picture of access risk. US Tech Automations designs and automates exactly this kind of compliance monitoring for content and data engineering teams.
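
As a rough illustration of the compare step in a fetch, diff, and alert loop, the sketch below checks a stored baseline robots.txt against a newer copy and reports which watched user-agents gained or lost access to a set of probe URLs. It is a minimal sketch under stated assumptions, not a description of any production pipeline: the function names, sample files, and probe URL are hypothetical, and fetching, scheduling, and alert delivery are left out.

```python
import difflib
from urllib.robotparser import RobotFileParser


def _parse(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def access_changes(baseline: str, current: str, agents, probe_urls):
    """Return (agent, url, was_allowed, now_allowed) for every answer that flipped."""
    before, after = _parse(baseline), _parse(current)
    changes = []
    for agent in agents:
        for url in probe_urls:
            was_allowed = before.can_fetch(agent, url)
            now_allowed = after.can_fetch(agent, url)
            if was_allowed != now_allowed:
                changes.append((agent, url, was_allowed, now_allowed))
    return changes


def readable_diff(baseline: str, current: str) -> str:
    """Unified diff of the two files, suitable for pasting into an alert."""
    return "\n".join(difflib.unified_diff(
        baseline.splitlines(), current.splitlines(),
        fromfile="baseline", tofile="current", lineterm=""))


# Hypothetical baseline and re-fetched copy of one site's robots.txt.
BASELINE = "User-agent: *\nDisallow: /admin/\n"
CURRENT = BASELINE + "\nUser-agent: Diffbot\nDisallow: /\n"
PROBE = "https://example.com/news/story"

changes = access_changes(BASELINE, CURRENT, ["Diffbot", "GPTBot"], [PROBE])
if changes:
    print(readable_diff(BASELINE, CURRENT))
    print(changes)
```

Comparing effective access for probe URLs, instead of alerting on any textual change, keeps cosmetic edits such as reordered comments from raising an alert. The trade-off is that a change affecting only paths outside the probe set goes unnoticed, so the probe list needs to reflect the paths a pipeline actually depends on.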

Frequently Asked Questions

Q: Does blocking Diffbot in robots.txt actually stop it from crawling?

A: Not technically. robots.txt is an honor-system protocol — it communicates a site's stated preferences but does not enforce them at a network level. A compliant crawler will respect the rules; a non-compliant one will not. The 30 sites in this report have stated their intent; whether that intent is enforced is a separate question.

Q: Does blocking Diffbot affect Google search rankings?

A: No. Google uses its own distinct crawlers — Googlebot and related agents — which are entirely separate from Diffbot. Blocking Diffbot in robots.txt has no effect on Google indexing or search ranking.

Q: Why do so many news sites block Diffbot specifically?

A: Diffbot's core product is structured-data extraction from web pages: turning articles into clean JSON objects with author, date, body text, and other fields. News publishers whose revenue depends on exclusive access to that content have a clear business reason to opt out, particularly as AI products built on extracted content compete with their own traffic and licensing revenue.

Q: Is this snapshot current?

A: This report reflects a point-in-time fetch sealed on June 13, 2026 (snapshot sha 741353c4304216ee). robots.txt files change over time; these numbers reflect stated policy as of that date only. They will not be updated as sites modify their files.

Q: How many AI operators were tracked in this study?

A: 12 operators and 21 individual bot user-agents were tracked across 122 prominent sites in 10 content categories. 107 of those 122 sites returned a parseable robots.txt file.

Q: How does Diffbot compare to other operators in the corpus?

A: Among the 12 operators tracked, Diffbot at 30 blocks sits above the cohere-ai crawler (27 sites) and Amazonbot (22 sites). The highest-blocked operators in this corpus are those with the broadest public profile among web publishers managing AI-access policy.

Key Takeaways

30 of 107 sites with parseable robots.txt files block Diffbot as of June 13, 2026.
News publishers alone account for 12 of those 30 blocks — the single largest category.
Tech adds 8 more, meaning News and Tech together represent 20 of Diffbot's 30 refusals.
48 of 107 sites block at least one AI crawler; Diffbot at 30 sites sits above the mid-tier operators in this corpus.
Broad-spectrum blockers — sites like bbc.com and bloomberg.com that block all 9 headline AI crawlers — are heavily overrepresented among Diffbot's refusers.
businessinsider.com, latimes.com, and gizmodo.com are partial blockers scoring 3 of 9, yet still include Diffbot — a signal of deliberate operator-level targeting, not blanket policy.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Who Blocks Diffbot? 30 of 107 Top Sites Do.” https://ustechautomations.com/resources/blog/who-blocks-diffbot-2026

Sealed snapshot sha256: 741353c4304216ee
