Do Science Sites Block AI Crawlers? Sealed robots.txt Data

Jun 14, 2026

Science publishing sits at an ideological crossroads with AI: openness and knowledge sharing are foundational values of the scientific enterprise, yet the economic models of major journals depend heavily on restricting access to their archives. That tension shows up directly in the robots.txt data. Five of 10 Science sites block at least one AI crawler in our June 2026 sealed snapshot, a rate of 50% that puts the category exactly on the line between the blocking-majority categories and the open-majority ones.

A robots.txt file is the standard mechanism through which a website communicates crawl permissions to automated bots; it is an honor-system protocol, not a technical access control. This report presents verbatim counts from a sealed snapshot of public robots.txt files collected on June 14, 2026 across 260 sites and 24 categories. The data is content-addressed with sha 834f1e2f07af24fd. To be explicit, nothing is estimated, modeled, or extrapolated — every count is a direct read from the sealed file.
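As a minimal sketch of what such crawl permissions look like in practice, the snippet below parses a hypothetical robots.txt with Python's standard-library urllib.robotparser. The file contents and the example.com URL are invented for illustration and are not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for illustration only
# (not copied from any site in the snapshot).
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# can_fetch() answers: "would a well-behaved bot with this name fetch this URL?"
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/articles/1")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/articles/1")

print(gptbot_allowed, other_allowed)  # False True
```

Nothing in the file enforces the answer. A crawler that never consults robots.txt is not technically stopped, which is why the protocol is an honor system rather than an access control.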

What Makes Science Sites a Revealing Case

The split in Science is exactly even, with five blockers and five allowers, and the two groups map onto recognizable structural differences in the science-media landscape. The blockers are the outlets that carry subscription, paywalled, or premium content: nature.com, science.org, scientificamerican.com, livescience.com, and newscientist.com. These are properties where at least some original reporting or journal-linked content sits behind a commercial boundary, and where AI crawlers harvesting that content for training or retrieval would undermine that revenue.

The five allowers are the sites where the publishing model is either advertising-funded, nonprofit, or structurally oriented toward wide dissemination: sciencedaily.com, phys.org, smithsonianmag.com, nationalgeographic.com, and popsci.com. The signal here is not that these sites are indifferent to AI access — it is that their incentive structure does not create the same pressure to gate content. A press-release aggregator like sciencedaily.com has different interests than a journal-affiliated publisher like nature.com.

The 50% rate is not the most dramatic finding in the 24-category landscape — Gaming leads at 88.9% and Nonprofit sits at 0% — but it is one of the most interpretable. The Science category is genuinely split in a way that reflects a real underlying debate in the publishing world.

Key Takeaways

Five of 10 Science sites block at least one AI crawler. That 50% block rate ranks Science ninth of 24 categories, directly below the eight categories in which a majority of sites block.

The corpus-wide block rate across 223 sites with a parseable robots.txt is 46.6%. Science at 50% sits 3.4 percentage points above the corpus average.

All 10 Science sites returned a parseable robots.txt file — full robots.txt coverage for the category in this snapshot.

The five blockers (nature.com, science.org, scientificamerican.com, livescience.com, and newscientist.com) are all properties that carry subscription, paywalled, or premium content. The five allowers (sciencedaily.com, phys.org, smithsonianmag.com, nationalgeographic.com, and popsci.com) are advertising-funded, nonprofit, or wide-dissemination properties.

CCBot, operated by Common Crawl, is the most-blocked bot in the corpus: 85 of the 223 sites with a parseable robots.txt block it.

Science Sites: The Snapshot

Metric	Count
Science sites checked	10
Sites with a parseable robots.txt	10
Sites blocking at least one AI crawler	5
Block rate	50%

All 10 sites returned a parseable robots.txt. That full coverage is notable: across all 260 sites in the corpus, 223 returned a parseable file. The Science category had no gaps in this snapshot.

Five of 10 Science sites block at least one AI crawler. The category lands at 50%, 3.4 percentage points above the corpus-wide rate of 46.6%.

The blockers: nature.com, science.org, scientificamerican.com, livescience.com, and newscientist.com. The allowers: sciencedaily.com, phys.org, smithsonianmag.com, nationalgeographic.com, and popsci.com. No Science site in our set was missing a robots.txt file.

How Science Compares Across All 24 Categories

Category	Sites Checked	With robots.txt	Blocking	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Reference	14	11	6	54.5%
Science	10	10	5	50%
Automotive	10	9	4	44.4%
HomeGarden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40%
Sports	10	10	4	40%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Legal	10	7	2	28.6%
RealEstate	10	7	2	28.6%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Nonprofit	10	6	0	0%

Science at 50% sits between Reference (54.5%) and Automotive (44.4%). The eight categories above it (Gaming, News, Food, Tech, Entertainment, Healthcare, Music, and Reference) all have a blocking majority. All 15 categories below it have an open majority. Science, split exactly evenly, is the boundary between the two groups.
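The placement can be checked directly from the table rows. The sketch below copies the 24 rows from the table above, sums them, and ranks the categories by block rate, using sites with a parseable robots.txt as the denominator. The variable names are ours and purely illustrative.

```python
# (category, sites checked, with parseable robots.txt, blocking),
# copied from the table above.
ROWS = [
    ("Gaming", 9, 9, 8), ("News", 20, 17, 14), ("Food", 10, 10, 7),
    ("Tech", 15, 13, 9), ("Entertainment", 9, 9, 6), ("Healthcare", 10, 9, 6),
    ("Music", 10, 9, 6), ("Reference", 14, 11, 6), ("Science", 10, 10, 5),
    ("Automotive", 10, 9, 4), ("HomeGarden", 10, 9, 4), ("Fashion", 9, 7, 3),
    ("Social", 10, 10, 4), ("Sports", 10, 10, 4), ("Jobs", 10, 8, 3),
    ("Travel", 9, 9, 3), ("Weather", 10, 6, 2), ("Legal", 10, 7, 2),
    ("RealEstate", 10, 7, 2), ("Finance", 12, 11, 2), ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1), ("Government", 9, 8, 1), ("Nonprofit", 10, 6, 0),
]

total_checked = sum(row[1] for row in ROWS)
total_parseable = sum(row[2] for row in ROWS)
total_blocking = sum(row[3] for row in ROWS)
corpus_rate = round(100 * total_blocking / total_parseable, 1)

# Per-category block rate: blocking sites / sites with a parseable robots.txt.
rates = {name: 100 * blocking / parseable for name, _, parseable, blocking in ROWS}
ranked = sorted(rates, key=rates.get, reverse=True)
science_rank = ranked.index("Science") + 1
majority_blocking = [name for name in ranked if rates[name] > 50]
majority_open = [name for name in ranked if rates[name] < 50]

print(total_checked, total_parseable, total_blocking, corpus_rate)   # 260 223 104 46.6
print(science_rank, len(majority_blocking), len(majority_open))      # 9 8 15
```

The totals reproduce the 260 sites checked, the 223 parseable files, and the 46.6% corpus-wide rate quoted elsewhere in this report.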

This placement is consistent with the dual-model structure of science publishing: premium journal-affiliated properties behave like the content-protective high-blocking categories, while open-access and ad-supported properties behave more like Finance or Retail.

Science's 50% block rate places it ninth of 24 categories, on the boundary between blocking-majority and open-majority categories. The blockers are subscription and premium-content outlets; the allowers are open-access and ad-supported outlets.

The Blocker/Allower Divide in Depth

The blockers represent a coherent group. nature.com and science.org host some of the most cited scientific journals in the world; their archives are primary sources for AI systems seeking authoritative scientific information. scientificamerican.com has a long archive of expert-written popular science alongside paywalled premium content. livescience.com and newscientist.com are high-traffic science news properties with ad-supported models that also carry some premium content. All five have reasons to limit AI training access to their text.

The allowers represent a different distribution model. sciencedaily.com functions primarily as an aggregator of university and research institution press releases — content that institutions have already released for public dissemination. phys.org operates similarly. smithsonianmag.com, nationalgeographic.com, and popsci.com are broad-audience science publications backed by institutions or large media companies where reach and discoverability may outweigh the concern about AI training.

The distinction matters for anyone building AI systems that need authoritative science content. The five allower sites are accessible under an honor-system reading of robots.txt. The five blockers are not. See how the same tension plays out differently in other content-rich categories: our Gaming report shows a much more lopsided picture at 88.9%, and the Music report, at 66.7%, lands moderately above Science.

Corpus-Wide Bot and Operator Counts

The following figures cover all 223 sites with a parseable robots.txt in the June 2026 snapshot — not just Science sites.

Bots blocked most often (across all 223 sites):

Bot	Sites Blocking It	Share of Corpus
CCBot	85	38.1%
ClaudeBot	74	33.2%
Bytespider	69	30.9%
GPTBot	64	28.7%
Meta-ExternalAgent	63	28.3%
PerplexityBot	60	26.9%
Applebot-Extended	60	26.9%
Google-Extended	57	25.6%
Amazonbot	50	22.4%

Operators blocked most often (across all 223 sites):

Operator	Sites Blocking Them
Common Crawl	85
Anthropic	80
Meta	73
ByteDance	69
OpenAI	66
Perplexity	60
Apple	60
Google	57
Cohere	56
Diffbot	55
Amazon	50
Mistral	21

CCBot (Common Crawl) at 85 sites is the most-blocked bot. Anthropic at 80 sites is the second-most-blocked operator. The broad spread across 12 operators indicates that site owners are updating robots.txt to address a growing field of AI actors, not just the two or three most prominent ones. Mistral at 21 sites sits well below the pack, which may reflect its more recent emergence as an active crawler.
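The "Share of Corpus" column in the bot table is each count divided by the 223 sites with a parseable robots.txt. A short sketch of that arithmetic, with the counts copied from the table and illustrative variable names:

```python
PARSEABLE_SITES = 223  # sites with a parseable robots.txt in the snapshot

# Sites blocking each bot, copied from the bot table above.
BOT_BLOCK_COUNTS = {
    "CCBot": 85,
    "ClaudeBot": 74,
    "Bytespider": 69,
    "GPTBot": 64,
    "Meta-ExternalAgent": 63,
    "PerplexityBot": 60,
    "Applebot-Extended": 60,
    "Google-Extended": 57,
    "Amazonbot": 50,
}

# Share of corpus = sites blocking the bot / parseable sites, as a percentage.
shares = {
    bot: round(100 * count / PARSEABLE_SITES, 1)
    for bot, count in BOT_BLOCK_COUNTS.items()
}

for bot, share in shares.items():
    print(f"{bot}: {share}%")
```

Note that the denominator is 223, not the 260 sites checked, because sites without a parseable file are excluded from every rate in this report.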

For science-publishing contexts specifically, the blockers in this category are likely targeting the most-blocked operators, such as Common Crawl, Anthropic, and OpenAI (first, second, and fifth on the operator table), since those represent the training pipelines most likely to ingest journal content.

Methodology

US Tech Automations fetched robots.txt files from 260 prominent web domains across 24 content categories on June 14, 2026. Parsing evaluated each file against a fixed set of 9 AI crawler user-agent strings drawn from publicly documented bot identities. The snapshot is content-addressed with sha 834f1e2f07af24fd. Nothing is estimated, modeled, or extrapolated. A site is counted as "blocking" when it disallows at least one of the 9 tracked bots for at least one path.
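The "blocking" rule can be sketched as a small classifier. This is an illustration under stated assumptions, not the parser used for the snapshot: it assumes the nine tracked bots are the nine listed in the bot table, that only groups naming a bot explicitly count (a wildcard `User-agent: *` group does not), and that an empty `Disallow:` line blocks nothing. The function name and the sample file are hypothetical.

```python
# Assumption: the nine tracked user-agents are the nine in the bot table.
TRACKED_BOTS = (
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Meta-ExternalAgent",
    "PerplexityBot", "Applebot-Extended", "Google-Extended", "Amazonbot",
)

def blocked_bots(robots_txt, tracked=TRACKED_BOTS):
    """Return the tracked bots with at least one non-empty Disallow rule
    in a group that names them explicitly."""
    blocked = set()
    current_agents = []      # user-agents of the group being read
    reading_agents = False   # True while consecutive User-agent lines accumulate
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:            # a new group starts here
                current_agents = []
                reading_agents = True
            current_agents.append(value.lower())
        else:
            reading_agents = False
            if field == "disallow" and value:  # empty Disallow blocks nothing
                for bot in tracked:
                    if bot.lower() in current_agents:
                        blocked.add(bot)
    return blocked

# Hypothetical file for illustration only.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow:

User-agent: *
Disallow: /admin
"""

result = blocked_bots(SAMPLE)
is_blocking = bool(result)
print(sorted(result), is_blocking)  # ['CCBot', 'GPTBot'] True
```

Whether wildcard groups should count is a design choice. Counting them would mark almost any site with a routine `Disallow: /admin` rule as blocking, so this sketch leaves them out.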

The collection process:

Fetch. Each domain root was queried for its robots.txt. Domains with no file or a server error were recorded as no-robots and excluded from the block-rate denominator.
Parse. The file was decomposed into user-agent blocks and checked for Disallow directives covering each of the 9 tracked AI bots.
Seal. The full dataset was hashed on June 14, 2026, producing content address 834f1e2f07af24fd — immutable after this point.
Aggregate. All counts were computed directly from the sealed file with no estimation or interpolation.
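The Seal and Aggregate steps above can be sketched as follows. The serialization is an assumption: the report gives only a 16-character hex address labeled sha256, so this sketch hashes a canonical JSON dump and keeps the first 16 hex characters, which may not match the actual sealing scheme. The record layout, field names, and example domains are hypothetical.

```python
import hashlib
import json

def seal(records):
    """Content address: SHA-256 of a canonical JSON dump, truncated to 16 hex chars.
    Assumption: the real snapshot may serialize and truncate differently."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

def block_rate(records):
    """Return (blocking, parseable, rate %). No-robots sites are excluded
    from the denominator, as described in the Fetch step."""
    parseable = [r for r in records if r["status"] == "ok"]
    blocking = [r for r in parseable if r["blocked_bots"]]
    if not parseable:
        return 0, 0, None
    return len(blocking), len(parseable), round(100 * len(blocking) / len(parseable), 1)

# Hypothetical records for illustration only.
records = [
    {"domain": "a.example", "status": "ok", "blocked_bots": ["GPTBot"]},
    {"domain": "b.example", "status": "ok", "blocked_bots": []},
    {"domain": "c.example", "status": "no-robots", "blocked_bots": []},
]

address = seal(records)
summary = block_rate(records)
print(address, summary)
```

Because the address is derived from the content, any later change to a record produces a different address, which is what makes a sealed snapshot verifiable.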

For a broader view of AI-crawler access policies across categories where content economics differ, see our Fashion report and Home and Garden report.

Frequently Asked Questions

Q: Is the 50% block rate in Science unusually split compared to other categories?

A: Yes. Most categories in the June 2026 snapshot have a lopsided outcome: either a clear blocking majority (Gaming at 88.9%, News at 82.4%) or a clear open majority (Finance at 18.2%, Nonprofit at 0%). Science is the only category split exactly evenly, and the split maps directly onto a real structural difference between subscription-model and open-access publishing.

Q: Why would a science site that promotes knowledge sharing block AI crawlers?

A: Knowledge sharing and commercial access control are not mutually exclusive in science publishing. nature.com and science.org promote open scientific communication but rely on subscription revenue for sustainability. Blocking AI crawlers from their archives protects a commercial boundary, not a censorship one. Their content remains available to human readers who subscribe.

Q: Does a site being on the allower list mean its content is freely usable for AI training?

A: Not necessarily. robots.txt reflects what a site chooses to communicate under an honor system — it is not a legal license. A site may allow crawlers in robots.txt but still assert copyright over its content through terms of service or other legal instruments. The two frameworks are independent.

Q: How does Science compare to other research-adjacent categories?

A: Reference (54.5%) and Education (14.3%) are the closest topical comparisons. Reference, which includes encyclopedias and reference databases, blocks slightly more often than Science. Education, which includes university and learning-platform sites, blocks far less often. Science at 50% sits between them in a way that reflects its dual identity as both a research-publication category and a popular-audience category.

Q: What happens if a blocked site removes its disallow between now and the next snapshot?

A: The sealed snapshot captures the state on June 14, 2026 only. Any changes after that date are not reflected here. The methodology is point-in-time by design: each snapshot is independently sealed and verifiable. Comparing across sealed snapshots over time is the right way to detect policy drift.

Put AI-Access Data to Work

Science sits at 50% and is cleanly divided along a structural fault line. That division makes it one of the most actionable categories in the corpus for three distinct audiences.

A content strategy lead at a science-adjacent AI product — say, a health information platform or a scientific search tool — needs to know which authoritative sources it can draw on under robots.txt guidelines. The five allowers in this snapshot represent potential retrieval sources; the five blockers do not. The right workflow is to re-crawl this Science domain set on a monthly cadence and alert immediately when any of the five allowers adds a new disallow, or when any blocker changes which operators it gates. A single policy shift by nature.com or nationalgeographic.com changes the retrieval landscape.

A publisher RevOps lead at a subscription science outlet should monitor whether the allower sites move toward blocking as the AI-access conversation matures. If smithsonianmag.com or popsci.com begins adding disallows, that signals a shift in the open-access segment's risk calculus — and a reason to audit your own policy against the emerging norm. Re-crawl monthly; alert on any new disallow token in the allower group.

A data-pipeline or retrieval engineer building a science knowledge base benefits from a weekly check of each site's robots.txt state, flagged against a diff of the sealed baseline. The 50/50 split means your accessible universe could shrink or grow with any single site change. Monitoring that boundary automatically prevents a silent data-access regression in your pipeline.
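A minimal sketch of such a baseline diff, assuming each snapshot has been reduced to a mapping from domain to the set of bots it disallows. The function name, the alert wording, and the `.example` domains are hypothetical; no real site's policy is represented here.

```python
def diff_policies(baseline, current):
    """Compare two {domain: set of blocked bots} mappings and return alert strings."""
    alerts = []
    for domain in sorted(baseline.keys() | current.keys()):
        before = baseline.get(domain, set())
        after = current.get(domain, set())
        for bot in sorted(after - before):
            alerts.append(f"{domain}: new disallow for {bot}")
        for bot in sorted(before - after):
            alerts.append(f"{domain}: disallow removed for {bot}")
    return alerts

# Hypothetical data for illustration only.
baseline = {
    "allower.example": set(),
    "blocker.example": {"GPTBot", "CCBot"},
}
current = {
    "allower.example": {"GPTBot"},   # an allower starts blocking
    "blocker.example": {"GPTBot"},   # a blocker stops gating one bot
}

alerts = diff_policies(baseline, current)
for alert in alerts:
    print(alert)
# allower.example: new disallow for GPTBot
# blocker.example: disallow removed for CCBot
```

Comparing sets per domain surfaces both directions of change, so a pipeline learns when its accessible universe shrinks and when it grows.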

US Tech Automations automates scheduled robots.txt monitoring across your target domain set, routes change alerts to the right policy owner, and maintains a live AI-access policy dashboard — so you track drift from the sealed baseline without manual spot-checks.

This snapshot of Science sites is one slice of a wider dataset; read how many top websites block AI crawlers for the cross-industry view.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 834f1e2f07af24fd).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Science Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-science-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 834f1e2f07af24fd

Machine-readable data: CSV · JSON · All research & methodology