Do Reference Sites Block AI Crawlers? Sealed robots.txt Data
Who This Is For
This report is for content-strategy teams, SEO directors, and data-licensing teams at reference publishers, health-information platforms, and knowledge-base operators. If your organization publishes encyclopedic, definitional, or medical reference content, the sealed figures below give you the sector baseline for June 2026 AI crawler posture.
TL;DR
6 of 11 Reference sites with a parseable robots.txt block at least one AI crawler — a 54.5% rate. The Reference category ranks fourth across all 10 content categories in this corpus. It sits above the corpus-wide average of 44.9% but well below the top-tier News (82.4%) and Tech (69.2%) categories. The category is split almost evenly between blockers and allowers.
The Reference Category Finding
Of the 14 Reference sites checked, 11 returned a parseable robots.txt file. Of those 11, 6 block at least one AI crawler — a category block rate of 54.5%.
Five sites — wikipedia.org, britannica.com, merriam-webster.com, medlineplus.gov, and cdc.gov — have robots.txt files that do not restrict any of the 21 tracked AI crawlers. Three sites — mayoclinic.org, nih.gov, and imdb.com — returned no parseable robots.txt at all. No Reference site in this corpus publishes an llms.txt file.
The 6 blocking sites are: dictionary.com, investopedia.com, webmd.com, healthline.com, goodreads.com, and quora.com.
The allower group itself contains two kinds of publisher. wikipedia.org, britannica.com, and merriam-webster.com are among the most-trafficked reference destinations on the web, and their robots.txt files leave all 21 tracked AI crawlers unrestricted. medlineplus.gov and cdc.gov are government-funded public-health resources with explicit mandates for broad information access.
Methodology
US Tech Automations Research fetched the public robots.txt file for each of the 122 sites in this corpus on June 13, 2026, using standard HTTP GET requests. Each response was parsed for User-agent directives targeting any of 21 tracked AI crawler identifiers across 12 operators. A site is counted as "blocking" if at least one Disallow rule for at least one AI crawler token applies to a non-empty path.
Only responses with parseable content are counted. Sites returning a non-success status code or non-parseable content are categorized as "no robots.txt." The snapshot is sealed under sha 741353c4304216ee. Nothing is estimated, modeled, or extrapolated; every figure in this report is a verbatim count from the sealed file.
The parsing approach applied a strict literal match: each of the 21 tracked crawler user-agent strings was compared against the User-agent directives in the robots.txt. Wildcard rules were applied as a fallback only when no crawler-specific directive existed. No probabilistic inference or modeling was used to fill gaps — if a parseable robots.txt was not present, the site was recorded as having no robots.txt and was excluded from the block-rate calculation.
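The literal-match rule with wildcard fallback described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names and the three-token `TRACKED_TOKENS` list are hypothetical (the report tracks 21 identifiers but does not list them), and the sketch ignores `Allow` overrides and path wildcards.

```python
from typing import Dict, List

# Hypothetical subset of tracked tokens; the report does not enumerate its 21 identifiers.
TRACKED_TOKENS = ["GPTBot", "ClaudeBot", "CCBot"]


def parse_groups(robots_txt: str) -> Dict[str, List[str]]:
    """Map each lowercased User-agent token to the Disallow paths in its group."""
    groups: Dict[str, List[str]] = {}
    current: List[str] = []  # agents named by the group being read
    in_rules = False         # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow":
                for agent in current:
                    groups[agent].append(value)
    return groups


def blocks(groups: Dict[str, List[str]], token: str) -> bool:
    """True if a Disallow rule with a non-empty path applies to this token."""
    rules = groups.get(token.lower())
    if rules is None:  # no crawler-specific group: fall back to the wildcard group
        rules = groups.get("*", [])
    return any(path != "" for path in rules)


def site_blocks_any(robots_txt: str, tokens: List[str] = TRACKED_TOKENS) -> bool:
    """A site counts as "blocking" if at least one tracked token is blocked."""
    groups = parse_groups(robots_txt)
    return any(blocks(groups, token) for token in tokens)
```

Note that an empty `Disallow:` value is treated as no restriction, which matches the "non-empty path" condition in the blocking definition.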
Reference Category Summary Table
| Metric | Value |
|---|---|
| Sites checked | 14 |
| Sites with parseable robots.txt | 11 |
| Sites blocking at least one AI crawler | 6 |
| Category block rate | 54.5% |
| Sites with llms.txt | none |
| Sites with no robots.txt | 3 |
Cross-Category Ranking
The Reference category ranks fourth among all 10 content categories measured. The full cross-category table from the sealed snapshot is below.
| Rank | Category | Sites | With robots.txt | Any block | Block rate |
|---|---|---|---|---|---|
| 1 | News | 20 | 17 | 14 | 82.4% |
| 2 | Tech | 15 | 13 | 9 | 69.2% |
| 3 | Entertainment | 9 | 9 | 6 | 66.7% |
| 4 | Reference | 14 | 11 | 6 | 54.5% |
| 5 | Social | 10 | 10 | 4 | 40.0% |
| 6 | Travel | 9 | 9 | 3 | 33.3% |
| 7 | Finance | 12 | 11 | 2 | 18.2% |
| 8 | Retail | 15 | 12 | 2 | 16.7% |
| 9 | Education | 9 | 7 | 1 | 14.3% |
| 10 | Government | 9 | 8 | 1 | 12.5% |
Reference at 54.5% is above the corpus-wide rate of 44.9% (48 of 107 sites), but it sits in a middle tier — trailing News, Tech, and Entertainment but leading Social, Travel, Finance, Retail, Education, and Government.
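The rates in the cross-category table can be re-derived from its count columns. The short Python check below copies the counts from the table verbatim; the helper name `block_rate` and the tuple layout are illustrative choices, not part of the sealed data.

```python
# (category, sites checked, with parseable robots.txt, blocking at least one AI crawler)
ROWS = [
    ("News", 20, 17, 14),
    ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6),
    ("Reference", 14, 11, 6),
    ("Social", 10, 10, 4),
    ("Travel", 9, 9, 3),
    ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1),
    ("Government", 9, 8, 1),
]


def block_rate(blocking: int, parseable: int) -> float:
    """Block rate as a percentage of sites with a parseable robots.txt."""
    return round(100 * blocking / parseable, 1)


total_sites = sum(row[1] for row in ROWS)
total_parseable = sum(row[2] for row in ROWS)
total_blocking = sum(row[3] for row in ROWS)

# Rank categories by block rate, highest first.
ranked = sorted(ROWS, key=lambda row: row[3] / row[2], reverse=True)
for rank, (name, _, parseable, blocking) in enumerate(ranked, start=1):
    print(f"{rank:>2}  {name:<14} {block_rate(blocking, parseable)}%")

print(f"Corpus: {total_blocking} of {total_parseable} = "
      f"{block_rate(total_blocking, total_parseable)}%")
```

The denominator is the number of sites with a parseable robots.txt (107), not the number of sites checked (122), which is why sites without a parseable file do not lower a category's rate.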
For adjacent category context, the Tech category report covers the second-ranked sector, and the Social media category report covers the fifth-ranked sector — a useful range to bracket where Reference falls.
Across all 107 corpus sites, 48 block at least one AI crawler — a 44.9% rate. Reference at 54.5% is above the corpus average by nearly 10 points.
Most-Blocked Operators and Bots (Corpus-Wide, All 107 Sites)
These figures are corpus-wide counts across all 107 sites with a parseable robots.txt — not Reference-specific. They identify which operators face the broadest blocking across the full dataset.
Most-Blocked Operators (all 107 sites)
| Operator | Sites blocking their crawlers |
|---|---|
| Common Crawl | 40 |
| Anthropic | 39 |
| ByteDance | 37 |
| OpenAI | 35 |
| Meta | 35 |
| Apple | 31 |
| Diffbot | 30 |
| Perplexity | 29 |
| Cohere | 27 |
| (operator name missing in source) | 25 |
| Amazon | 22 |
| Mistral | 12 |
Common Crawl leads with 40 sites blocking its crawlers across the corpus. Anthropic (39) and ByteDance (37) follow. OpenAI and Meta are each blocked by 35 sites. Mistral, at 12, is at the tail. These are global counts — not specific to the Reference category.
Anthropic is blocked by 39 of the 107 sites measured across the full corpus.
Site-Level Analysis
The 6 blocking Reference sites span several content types. dictionary.com is a commercial definition publisher with an ad-supported model; restricting AI crawlers is consistent with protecting the unique-visitor traffic that drives that revenue. investopedia.com operates a similar model for financial definitions and explainers, and its blocking position aligns with that of a commercial knowledge publisher.
The health-information sites present a more complex picture. webmd.com and healthline.com both block AI crawlers. These platforms produce large volumes of editorial health content and operate under advertising and partnership revenue models. goodreads.com, a book-review and reading-list platform, also blocks. quora.com, a community-driven Q&A platform, blocks despite relying on user-generated content — which historically has been treated as more openly accessible.
The five allowers stand apart. wikipedia.org, britannica.com, and merriam-webster.com, three of the most authoritative reference destinations on the web, do not restrict any of the 21 tracked AI crawlers. medlineplus.gov and cdc.gov are government-funded public-health resources. Their permissive posture is consistent with mandates for broad information access that are distinct from commercially motivated access-governance decisions.
Three sites — mayoclinic.org, nih.gov, and imdb.com — have no parseable robots.txt in this snapshot. This does not imply permission; it means no machine-readable access-governance signal was found at the standard location at the time the snapshot was captured on June 13, 2026. Each of these 3 sites represents a distinct type of reference resource: a clinical information provider, a federal health research agency, and an entertainment-adjacent reference database. Their absence from both the blocker and allower lists is a gap in the signal, not a policy statement.
Automation Bridge
For SEO directors and content-strategy teams at reference publishers, the near-even split between blockers and allowers in this category creates a monitoring challenge. Understanding which peers block, which allow, and whether that changes over time requires systematic observation — not a one-time manual check.
US Tech Automations builds automated pipelines that schedule robots.txt fetches across defined site lists, parse for AI-crawler directives, detect changes, and route alerts to the appropriate team members. If your organization manages AI-access policy or competes in the reference information space, that ongoing monitoring is exactly the kind of workflow US Tech Automations delivers.
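The change-detection step in such a workflow can be approximated by fingerprinting each fetched file and comparing it with the previous run. The sketch below is a generic illustration, not US Tech Automations' implementation; `fingerprint`, `detect_changes`, and the dictionary layout are assumptions, and fetching, scheduling, and alert routing are left out.

```python
import hashlib
from typing import Dict, List


def fingerprint(robots_txt: str) -> str:
    """SHA-256 of the file after stripping trailing whitespace from each line."""
    normalized = "\n".join(line.rstrip() for line in robots_txt.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: Dict[str, str], fetched: Dict[str, str]) -> List[str]:
    """Return sites whose robots.txt fingerprint differs from the stored one.

    previous maps site -> fingerprint from the last run.
    fetched maps site -> robots.txt body from the current run.
    Sites not seen before are reported as changed.
    """
    changed = []
    for site, body in sorted(fetched.items()):
        if previous.get(site) != fingerprint(body):
            changed.append(site)
    return changed
```

Normalizing trailing whitespace before hashing avoids alerts on cosmetic edits; a stricter monitor could instead compare the parsed AI-crawler directives so that only policy changes trigger an alert.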
For comparison, the News category report shows the most restrictive end of the spectrum at 82.4%, while the lower-blocking categories provide context for how different content models approach the question.
Key Takeaways
6 of 11 Reference sites with a parseable robots.txt block at least one AI crawler — 54.5%.
Reference ranks fourth across all 10 categories in the June 2026 sealed snapshot.
The corpus-wide baseline is 48 of 107 sites (44.9%); Reference is about 10 points above that.
Five Reference sites — wikipedia.org, britannica.com, merriam-webster.com, medlineplus.gov, cdc.gov — allow all tracked AI crawlers.
3 Reference sites — mayoclinic.org, nih.gov, imdb.com — have no parseable robots.txt.
No Reference site in this corpus publishes an llms.txt file.
Corpus-wide, Common Crawl is blocked by 40 of 107 sites; Anthropic by 39.
Nothing in this report is estimated, modeled, or extrapolated — all counts are from the sealed June 13, 2026 snapshot.
FAQ
Q: Does blocking a crawler in robots.txt actually stop it?
A: No. robots.txt is an honor-system standard. A Disallow directive has no technical enforcement mechanism — a non-compliant crawler can still fetch the page. The file states the site operator's preference; compliance is at the discretion of the crawler operator.
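To make the honor-system point concrete, the sketch below shows how a compliant crawler might consult robots.txt before fetching, using Python's standard-library `urllib.robotparser`. The wrapper name `is_allowed` and the sample file are illustrative; the check runs entirely on the crawler's side, so a crawler that skips it is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative file that disallows one crawler token everywhere.
SAMPLE = "User-agent: GPTBot\nDisallow: /\n"

if __name__ == "__main__":
    print(is_allowed(SAMPLE, "GPTBot", "https://example.com/page"))
    print(is_allowed(SAMPLE, "SomeOtherBot", "https://example.com/page"))
```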
Q: Why would wikipedia.org and britannica.com allow AI crawlers when many peers block?
A: The sealed data shows that wikipedia.org and britannica.com both have robots.txt files that do not restrict any of the 21 tracked AI crawlers. The reasons behind any site's access decision are not part of the sealed snapshot; this report records only what the robots.txt file states, so it cannot confirm whether a permissive file reflects a deliberate choice or an oversight.
Q: What explains the absence of llms.txt across all Reference sites?
A: No Reference site in this corpus published a parseable llms.txt file as of June 13, 2026. Across the full corpus of 107 sites with a robots.txt, 20 publish an llms.txt — a corpus-wide rate of 18.7%. The Reference category, with none of the 14 sites publishing one, is below even that modest average. The llms.txt standard is newer than robots.txt and adoption is still in its early stages.
Q: Why are mayoclinic.org, nih.gov, and imdb.com in the "no robots.txt" group?
A: These 3 sites returned no parseable robots.txt response at the time of the snapshot. In this pipeline, nothing is estimated, modeled, or extrapolated — only confirmed parseable responses are counted. The absence of a robots.txt file is not a permission; it simply means no machine-readable crawl-governance signal was found at the standard location.
Q: How do health-information sites compare to encyclopedic reference sites on AI access?
A: Within the Reference category, the sealed data shows a split: webmd.com and healthline.com block AI crawlers, while medlineplus.gov and cdc.gov allow them. In this four-site sample, the commercially funded health-information publishers take a restrictive posture and the publicly funded resources remain open. This is an observed pattern, not a causal finding; no causal explanation is available from the robots.txt file alone.
Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).
Cite this report
US Tech Automations Research, 2026-06 edition. “Do Reference Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-reference-sites-block-ai-crawlers-2026
Sealed snapshot sha256: 741353c4304216ee