Do News Sites Block AI Crawlers? Sealed robots.txt Data

Jun 13, 2026

Who This Is For

This report is for SEO directors, content strategy leads, and news-industry RevOps teams who need to understand how the news publishing sector positions itself in relation to AI crawlers. If you manage or audit a publisher's robots.txt file, or if you track how AI companies access editorial content, these sealed figures are your baseline for June 2026.

TL;DR

14 of the 17 News sites with a parseable robots.txt block at least one AI crawler, an 82.4% rate. This is the highest blocking rate across all 10 content categories measured in this corpus, 13.2 percentage points ahead of second-ranked Tech (69.2%).

The News Category Finding

Of the 20 News sites checked, 17 returned a parseable robots.txt file. Of those 17, 14 block at least one AI crawler — a category block rate of 82.4%.

Three sites (reuters.com, wsj.com, and time.com) returned a robots.txt file that contained no rules blocking any of the 21 tracked AI crawlers. Three additional sites (npr.org, politico.com, and axios.com) returned no parseable robots.txt file at all. No News site in this corpus published an llms.txt file.

The 14 blocking sites are: nytimes.com, washingtonpost.com, theguardian.com, bbc.com, cnn.com, apnews.com, bloomberg.com, forbes.com, businessinsider.com, theatlantic.com, usatoday.com, latimes.com, newsweek.com, and vox.com.

The breadth of the blocking list is notable. It spans wire services, national broadsheets, digital-native publishers, and specialist business titles. The blockers represent a wide cross-section of editorial models, not just a single segment of the news landscape.

Methodology

US Tech Automations Research fetched the public robots.txt file for each of the 122 sites in this corpus on June 13, 2026, using standard HTTP GET requests. Each response was parsed for User-agent directives targeting any of 21 tracked AI crawler identifiers across 12 operators. A site is counted as "blocking" if at least one Disallow rule for at least one AI crawler token applies to a non-empty path.

No robots.txt was inferred, modeled, or extrapolated — only responses with parseable content were counted. Sites returning an error, redirect, or non-parseable content are categorized as "no robots.txt." The full snapshot is sealed under sha 741353c4304216ee.
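The "parseable" versus "no robots.txt" split described above can be sketched as a small classifier. This is an illustration under stated assumptions, not the pipeline behind the sealed snapshot: the function name `classify_fetch`, the choice to treat only HTTP 200 as success, and the definition of "parseable" as "contains at least one recognizable directive line" are all hypothetical. The caller is assumed to pass the status of a request that did not follow redirects.

```python
from typing import Optional

# Assumption: a response is "parseable" if it has at least one of these fields.
KNOWN_FIELDS = ("user-agent", "disallow", "allow", "sitemap")


def classify_fetch(status: Optional[int], body: Optional[str]) -> str:
    """Classify one fetch result as 'parseable' or 'no robots.txt'."""
    # Errors (status None), redirects (3xx), and other non-200 codes
    # all fall into the "no robots.txt" bucket.
    if status != 200 or not body:
        return "no robots.txt"
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key = line.split(":", 1)[0].strip().lower()
        if key in KNOWN_FIELDS:
            return "parseable"
    # A 200 response with no directives (for example an HTML error page).
    return "no robots.txt"
```

Under this sketch, an HTML "not found" page served with status 200 would still be counted as "no robots.txt", which matches the rule that only parseable content is counted.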

The parsing logic applied a strict literal token match: each tracked crawler user-agent string was compared against the User-agent directives in the robots.txt. A wildcard (User-agent: *) rule was treated as a fallback only where no crawler-specific directive was present, and it never counted toward the block total. A site is classified as blocking only when an explicit Disallow directive targets at least one tracked crawler token.

Sites that restrict only non-AI crawlers were not counted as blocking. Nothing is estimated, modeled, or extrapolated — all figures are verbatim counts from the sealed June 13, 2026 snapshot.
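The classification rule in this section can be sketched in a few lines. This is a minimal illustration, not the code used for the sealed snapshot: the function names and the three-token `TRACKED` set are hypothetical stand-ins (the report's full list of 21 identifiers is not reproduced here), and lowercasing tokens before comparison is an assumption.

```python
from typing import Dict, List, Set

# Hypothetical subset of tracked AI crawler tokens, lowercased for comparison.
TRACKED: Set[str] = {"gptbot", "claudebot", "ccbot"}


def parse_groups(text: str) -> List[Dict]:
    """Split robots.txt text into groups of user-agents and Disallow paths."""
    groups: List[Dict] = []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share one group.
            if not last_was_agent:
                current = {"agents": set(), "disallow": []}
                groups.append(current)
            current["agents"].add(value.lower())
            last_was_agent = True
        elif key in ("allow", "disallow"):
            if current is not None and key == "disallow":
                current["disallow"].append(value)
            last_was_agent = False
        # Other fields (such as Sitemap) are ignored for grouping.
    return groups


def blocked_tokens(text: str, tracked: Set[str] = TRACKED) -> Set[str]:
    """Tracked tokens named explicitly in a group with a non-empty Disallow."""
    blocked: Set[str] = set()
    for group in parse_groups(text):
        if any(path for path in group["disallow"]):
            # Wildcard-only groups contribute nothing: "*" is not a tracked token.
            blocked |= group["agents"] & tracked
    return blocked


def is_blocking(text: str) -> bool:
    """A site is 'blocking' if at least one tracked token is blocked."""
    return bool(blocked_tokens(text))
```

For example, a file that disallows `/private` for `*`, disallows `/` for `GPTBot` and `CCBot`, and gives `ClaudeBot` an empty `Disallow:` would yield only `gptbot` and `ccbot` as blocked tokens. The empty path and the wildcard group do not count.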

News Category Summary Table

| Metric | Value |
| --- | --- |
| Sites checked | 20 |
| Sites with parseable robots.txt | 17 |
| Sites blocking at least one AI crawler | 14 |
| Category block rate | 82.4% |
| Sites with llms.txt | None |
| Sites with no parseable robots.txt | 3 |

Cross-Category Ranking

The News category ranks first among all 10 content categories measured. The table below shows all categories from the sealed snapshot, ordered by block rate.

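| Rank | Category | Sites | With robots.txt | Any block | Block rate |
| --- | --- | --- | --- | --- | --- |
| 1 | News | 20 | 17 | 14 | 82.4% |
| 2 | Tech | 15 | 13 | 9 | 69.2% |
| 3 | Entertainment | 9 | 9 | 6 | 66.7% |
| 4 | Reference | 14 | 11 | 6 | 54.5% |
| 5 | Social | 10 | 10 | 4 | 40.0% |
| 6 | Travel | 9 | 9 | 3 | 33.3% |
| 7 | Finance | 12 | 11 | 2 | 18.2% |
| 8 | Retail | 15 | 12 | 2 | 16.7% |
| 9 | Education | 9 | 7 | 1 | 14.3% |
| 10 | Government | 9 | 8 | 1 | 12.5% |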

News sits about 37.5 percentage points above the corpus-wide average. Across all 107 sites with a parseable robots.txt, 48 block at least one AI crawler, a corpus-wide rate of 44.9%. The News category at 82.4% is roughly 1.8 times that rate.
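The rates in the ranking table can be re-derived from the raw counts. The sketch below is only a worked check of the arithmetic: the counts are copied from the table, while the variable and function names are illustrative.

```python
# (category, sites checked, with parseable robots.txt, blocking at least one AI crawler)
ROWS = [
    ("News", 20, 17, 14),
    ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6),
    ("Reference", 14, 11, 6),
    ("Social", 10, 10, 4),
    ("Travel", 9, 9, 3),
    ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1),
    ("Government", 9, 8, 1),
]


def block_rate(blocking: int, with_robots: int) -> float:
    """Block rate as a percentage of sites with a parseable robots.txt."""
    return 100 * blocking / with_robots


rates = {name: round(block_rate(b, w), 1) for name, _, w, b in ROWS}

total_sites = sum(row[1] for row in ROWS)        # 122
total_with_robots = sum(row[2] for row in ROWS)  # 107
total_blocking = sum(row[3] for row in ROWS)     # 48

corpus_rate = block_rate(total_blocking, total_with_robots)
news_rate = block_rate(14, 17)

gap_points = round(news_rate - corpus_rate, 1)   # percentage-point gap
ratio = round(news_rate / corpus_rate, 2)        # News rate relative to corpus rate

print(rates["News"], round(corpus_rate, 1), gap_points, ratio)
# prints: 82.4 44.9 37.5 1.84
```

The denominator in every rate is the number of sites with a parseable robots.txt, not the number of sites checked, which is why the corpus-wide rate uses 107 rather than 122.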

You can compare the News findings against the Tech category report and the Reference category report for adjacent context.

Most-Blocked Operators and Bots (Corpus-Wide, All 107 Sites)

The leaderboard figures below reflect counts across all 107 sites with a parseable robots.txt in the full 122-site corpus, not News-specific counts. They show which AI operators face the most restrictions corpus-wide.

Most-Blocked Operators (all 107 sites)

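| Operator | Sites blocking their crawlers |
| --- | --- |
| Common Crawl | 40 |
| Anthropic | 39 |
| ByteDance | 37 |
| OpenAI | 35 |
| Meta | 35 |
| Apple | 31 |
| Diffbot | 30 |
| Perplexity | 29 |
| Cohere | 27 |
| Google | 25 |
| Amazon | 22 |
| Mistral | 12 |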

Common Crawl, Anthropic, and ByteDance each face restrictions from more than a third of the 107 sites with a parseable robots.txt (37.4%, 36.4%, and 34.6% respectively). Mistral, at 12, occupies the tail. These figures are operator-level aggregates: a single operator may run multiple crawlers, all of which are counted.
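One way to produce operator-level aggregates like these is to map each crawler token to its operator and count each site at most once per operator. The sketch below assumes that counting rule, which the report does not spell out. The token-to-operator mapping is a small hypothetical subset, and the site names and blocked-token sets are invented for illustration rather than taken from the snapshot.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical subset of a token-to-operator mapping (tokens lowercased).
TOKEN_OPERATOR: Dict[str, str] = {
    "gptbot": "OpenAI",
    "chatgpt-user": "OpenAI",
    "claudebot": "Anthropic",
    "ccbot": "Common Crawl",
}


def operator_site_counts(blocked_by_site: Dict[str, Set[str]]) -> Counter:
    """Count, per operator, the sites that block at least one of its crawlers."""
    counts: Counter = Counter()
    for tokens in blocked_by_site.values():
        # A set ensures a site blocking two crawlers from one operator counts once.
        operators = {TOKEN_OPERATOR[t] for t in tokens if t in TOKEN_OPERATOR}
        counts.update(operators)
    return counts


# Invented example input: site -> tracked tokens it blocks.
example = {
    "site-a.example": {"gptbot", "chatgpt-user", "ccbot"},
    "site-b.example": {"claudebot", "ccbot"},
    "site-c.example": set(),
}
counts = operator_site_counts(example)
# counts["Common Crawl"] is 2, counts["OpenAI"] is 1, counts["Anthropic"] is 1
```

Under this rule, a site that blocks both OpenAI tokens still adds only one to OpenAI's total, so an operator's count can never exceed the number of sites measured.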

For contrast, the Retail category report covers a sector where the blocking rate is far lower (16.7%).

Site-Level Analysis

The 14 blocking News sites form a cross-section that spans print legacies, digital-native publishers, and wire services. nytimes.com, washingtonpost.com, and theguardian.com have long operated on subscription and licensing models for content; their robots.txt positions are consistent with formal content-rights policies, although robots.txt alone does not reveal the reasoning behind a rule.

bloomberg.com and businessinsider.com represent paywalled financial and business news. Both block AI crawlers, consistent with protecting premium editorial content from unlicensed training data extraction.

Digital-native outlets such as vox.com also appear in the blocking list, alongside long-established magazine titles such as theatlantic.com. Blocking is therefore not purely a legacy-publisher behavior.

apnews.com, a wire service whose content is licensed to other publishers, also blocks. This is significant: the blocking decision here extends to content that is itself already in wide distribution under license arrangements, suggesting the concern is not only about traffic diversion but about training data licensing control.

The three allowers — reuters.com, wsj.com, time.com — each have robots.txt files but have chosen not to restrict AI crawlers via that mechanism. This does not mean they permit all AI use; other technical or contractual controls may apply. The robots.txt file is only one layer of access governance.

Automation Bridge

For SEO directors and content-strategy leads at publishing organizations, the signal here is clear: most of the News sites in this corpus (14 of the 17 with a parseable file) have taken a formal robots.txt position on AI crawler access. Monitoring that position, not just for your own site but across the competitive landscape, is an ongoing operational task.

Manually checking robots.txt files at scale is impractical when you need to track changes across dozens of domains over time. US Tech Automations builds automated workflows that can schedule, run, and parse robots.txt checks at any cadence, flag changes, and route alerts to the right stakeholder. If your team needs a robots.txt monitoring workflow, this is precisely the automation problem US Tech Automations solves.
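The core of a change-detection check like this can be sketched as a fingerprint comparison between two snapshots. This is a generic illustration, not a description of any US Tech Automations workflow: the function names, the whitespace normalization, and the example domains are assumptions, and the fetching and alert-routing steps are left out.

```python
import hashlib
from typing import Dict, List


def fingerprint(text: str) -> str:
    """SHA-256 of robots.txt text after normalizing line endings and trailing spaces."""
    lines = text.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return domains whose robots.txt fingerprint is new or differs.

    previous maps domain -> stored fingerprint from the last run;
    current maps domain -> freshly fetched robots.txt text.
    """
    changed = []
    for domain, text in current.items():
        if previous.get(domain) != fingerprint(text):
            changed.append(domain)
    return sorted(changed)


# Illustrative run with invented domains.
stored = {"site-a.example": fingerprint("User-agent: *\nDisallow:")}
fetched = {
    "site-a.example": "User-agent: *\r\nDisallow:\r\n",   # same rules, CRLF endings
    "site-b.example": "User-agent: GPTBot\nDisallow: /",  # not seen before
}
changed = detect_changes(stored, fetched)
# changed is ["site-b.example"]
```

Normalizing before hashing avoids alerts for cosmetic differences such as line-ending changes. A production check would likely also diff the parsed rules so that an alert can say which crawler's access changed.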

For context on how a less defensive sector compares, the Social media category report shows a markedly different posture with a 40% block rate.

Key Takeaways

  • 14 of 17 News sites with a parseable robots.txt block at least one AI crawler — 82.4%.

  • News ranks first among all 10 categories measured in the June 2026 snapshot.

  • The corpus-wide baseline is 48 of 107 sites (44.9%); News is nearly double that rate.

  • 3 News sites — reuters.com, wsj.com, time.com — have robots.txt files but do not block any tracked AI crawler.

  • 3 News sites — npr.org, politico.com, axios.com — have no parseable robots.txt file.

  • No News site in this corpus publishes an llms.txt file.

  • Corpus-wide, Common Crawl is the most-blocked operator (40 of 107 sites), followed by Anthropic (39) and ByteDance (37).

  • Nothing in this report is estimated, modeled, or extrapolated — all counts come from the sealed June 13, 2026 snapshot.

FAQ

Q: Does blocking a crawler in robots.txt actually stop it?

A: No. robots.txt is an honor-system standard. A crawler that ignores the Disallow directive faces no technical barrier — it can still fetch the page. The file signals a site's stated preference, but enforcement depends entirely on the crawler operator choosing to respect it.
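To make the honor-system point concrete, the sketch below shows the check a well-behaved crawler would run before fetching a page, using Python's standard-library `urllib.robotparser`. The rules text, the `SomeOtherBot` name, and the example.com URL are illustrative. Nothing in the protocol forces a crawler to run this check at all.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block one AI crawler token, allow everyone else.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, rules_text: str = RULES) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_allowed = may_fetch("GPTBot", "https://example.com/story")       # False
other_allowed = may_fetch("SomeOtherBot", "https://example.com/story")  # True
```

The result is advisory: a compliant crawler skips the URL when the answer is `False`, while a non-compliant one can simply never call the check.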

Q: Why do three News sites with a robots.txt file still allow all AI crawlers?

A: A robots.txt file can be completely permissive: it may contain rules only for traditional search bots, or it may simply contain no AI-specific Disallow directives. reuters.com, wsj.com, and time.com returned parseable robots.txt files, but none of those files blocked any of the 21 tracked AI crawlers. A permissive robots.txt is a different case from a missing file, although the file alone does not show whether the permissive stance is deliberate.

Q: What does it mean that no News site publishes an llms.txt file?

A: llms.txt is an emerging voluntary standard that provides AI systems with structured guidance about a site, rather than a crawl restriction. The absence of llms.txt across all 20 News sites in this corpus means that, as of June 13, 2026, the News category is using robots.txt as its primary AI-governance signal — not the newer structured-guidance format. This may evolve as the standard matures.

Q: Why are npr.org, politico.com, and axios.com counted as "no robots.txt"?

A: These 3 sites returned no parseable robots.txt response at the time of the snapshot. This could reflect a missing file, a server configuration returning a different status code, or a redirect that the crawler did not follow to a parseable file. In this pipeline, nothing is estimated, modeled, or extrapolated — only confirmed parseable responses are included in the blocking count.

Q: What is the corpus-wide snapshot this data comes from?

A: The June 2026 Closing Web snapshot covers 122 sites across 10 categories. Of those, 107 returned a parseable robots.txt. The snapshot was sealed on June 13, 2026 under sha 741353c4304216ee. All figures in this report are verbatim counts from that sealed file — no modeling or estimation applied.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do News Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-news-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 741353c4304216ee

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.