Do Healthcare Sites Block AI Crawlers? Sealed robots.txt Data

Jun 13, 2026

6 of 9 Healthcare sites block at least one AI crawler.

Healthcare sites block AI crawlers at a 66.7% rate.

72 of 157 sites block at least one AI crawler across the corpus.

Key Takeaways

6 of 9 Healthcare sites with a parseable robots.txt block at least one AI crawler.

Healthcare sits well above the corpus-wide average of 45.9%, placing it among the most restrictive categories in the June 2026 edition of the Closing Web report. The pattern is notable: consumer-facing health information portals account for most of the blocks, while the academic medical centers and the policy research site in the sample remain open. The exception is who.int, an intergovernmental body that appears in the blocker column.

66.7% of Healthcare sites with robots.txt block at least one AI crawler.

This report covers 10 Healthcare sites in total. Of those, 9 returned a parseable robots.txt file; 1 — hhs.gov — returned no robots.txt at all. Among the 9 sites with parseable files, 6 are actively blocking at least one named AI crawler. That is a 66.7% block rate for this category.

Corpus-wide, 72 of 157 sites block at least one AI crawler — a 45.9% rate.

Healthcare clears that benchmark by roughly 21 percentage points. The three sites that returned parseable robots.txt files without blocking any tracked crawler (clevelandclinic.org, hopkinsmedicine.org, and kff.org) represent a distinct posture: their published rules leave AI-crawler access open. Whether that openness is a deliberate policy choice is not something the files themselves record.


What the Data Covers

This report is one installment of the US Tech Automations Closing Web series, which examines how publishers across 16 content categories configure their robots.txt files with respect to AI crawlers.

The June 2026 EXPANDED edition checked 182 sites across 16 categories. Of those, 157 returned a parseable robots.txt file. The snapshot was sealed on June 13, 2026 (sha 9ceca3bdf0dfeaca). Nothing is estimated, modeled, or extrapolated: every figure in this report is a verbatim count from a public robots.txt file read at that moment.

The methodology is deliberately narrow: we fetch the root-level robots.txt, parse it for named AI-crawler User-agent strings, and record whether any Disallow rule applies to at least one of the 9 crawlers tracked in this edition. We do not infer intent, guess at legal strategy, or estimate traffic impact. The data is what it is.
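The parsing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's actual tooling: the function name `blocked_bots` is hypothetical, the bot names are taken from the leaderboard later in this report, and the sketch assumes that only groups naming a tracked bot count (a wildcard `User-agent: *` group is ignored) and that any non-empty Disallow value counts as a block.

```python
# The 9 crawlers tracked in this edition (names from the bot leaderboard).
TRACKED_BOTS = [
    "CCBot", "ClaudeBot", "GPTBot", "Bytespider", "PerplexityBot",
    "Meta-ExternalAgent", "Applebot-Extended", "Google-Extended", "Amazonbot",
]


def blocked_bots(robots_txt: str) -> set[str]:
    """Return the tracked bots named in a group that has a non-empty Disallow.

    Assumption: wildcard groups are not counted, and Allow overrides are
    not evaluated. This is a sketch, not a full robots.txt matcher.
    """
    tracked = {name.lower(): name for name in TRACKED_BOTS}
    blocked: set[str] = set()
    agents: list[str] = []   # user-agents of the group currently being read
    in_rules = False         # True once the current group has seen a rule

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:     # a User-agent after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:
                blocked.update(tracked[a] for a in agents if a in tracked)
    return blocked


SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin
"""

print(sorted(blocked_bots(SAMPLE)))   # ['CCBot', 'GPTBot']
```

Under these assumptions a site is recorded as "blocking" when the returned set is non-empty.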

For Healthcare specifically, the 10 sites checked span a range of publisher types: commercial drug-information portals, independent health news properties, professional clinical reference databases, academic medical center hubs, global public health bodies, and a government health department. That diversity gives the 66.7% figure some breadth, though the sample is small: with 9 parseable files, each site moves the rate by about 11 percentage points.


Site-by-Site Breakdown

The table below shows the four numeric fields for the Healthcare category as recorded in the sealed snapshot.

| Category | Sites Checked | With robots.txt | Blocking Any AI Crawler | Block Rate |
| --- | --- | --- | --- | --- |
| Healthcare | 10 | 9 | 6 | 66.7% |

The 6 Blockers

drugs.com operates one of the highest-traffic drug-information databases on the web. Its robots.txt reflects a clear decision to restrict AI training crawlers from the content that drives that traffic.

medicalnewstoday.com publishes health news that sits close to the peer-reviewed literature. A block on AI crawlers is consistent with treating its editorial content as a proprietary asset rather than freely licensable training data.

who.int, the World Health Organization, is a notable entrant in the blocker column. A global intergovernmental body publishing detailed public health data and guidance has chosen to restrict at least one of the tracked AI crawlers, a posture with potential policy implications beyond commercial considerations.

verywellhealth.com is a consumer-focused health information site with a large library of condition and treatment articles. Blocking AI crawlers is consistent with protecting a high-volume content investment.

everydayhealth.com covers similar consumer health territory. Its block configuration parallels verywellhealth.com and suggests a category-level commercial rationale rather than site-specific edge cases.

medscape.com serves a professional clinical audience — physicians, nurses, pharmacists. Its decision to block AI crawlers may reflect the sensitivity of its credentialing content as much as commercial interest in its editorial library.

The Open Sites

clevelandclinic.org and hopkinsmedicine.org both returned parseable robots.txt files without disallowing any of the 9 tracked AI crawlers. These are two of the most recognized academic medical centers in the United States, and both appear to treat broad web access — including AI crawler access — as consistent with their educational mission.

kff.org (Kaiser Family Foundation) is a nonpartisan health policy research organization. Its open posture aligns with its mission of broad public access to health policy data and analysis.

The No-robots Site

hhs.gov — the U.S. Department of Health and Human Services — returned no robots.txt file at all. Absence of a robots.txt means no explicit disallow rules exist; crawlers that respect the standard encounter no stated restriction.


How Healthcare Compares Across All 16 Categories

The table below shows all 16 categories from the sealed snapshot, sorted by block rate. Healthcare shares the 66.7% rate with Entertainment.

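| Category | Sites Checked | With robots.txt | Blocking Any | Block Rate |
| --- | --- | --- | --- | --- |
| News | 20 | 15 | 13 | 86.7% |
| Food | 10 | 10 | 7 | 70% |
| Tech | 15 | 13 | 9 | 69.2% |
| Entertainment | 9 | 9 | 6 | 66.7% |
| Healthcare | 10 | 9 | 6 | 66.7% |
| Reference | 14 | 11 | 6 | 54.5% |
| Automotive | 10 | 9 | 4 | 44.4% |
| Social | 10 | 10 | 4 | 40% |
| Sports | 10 | 10 | 4 | 40% |
| Travel | 9 | 9 | 3 | 33.3% |
| Legal | 10 | 7 | 2 | 28.6% |
| RealEstate | 10 | 7 | 2 | 28.6% |
| Finance | 12 | 11 | 2 | 18.2% |
| Retail | 15 | 12 | 2 | 16.7% |
| Education | 9 | 7 | 1 | 14.3% |
| Government | 9 | 8 | 1 | 12.5% |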

Healthcare lands in the top tier — well above the corpus-wide 45.9% rate and substantially above the lower half of the table. News, Food, and Tech are the only categories with higher block rates. Categories like Government (12.5%) and Education (14.3%) sit far below Healthcare in restricting AI access.
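The block rates in the table follow directly from the counts: each rate is the number of blocking sites divided by the number of sites with a parseable robots.txt, not by the number of sites checked. The short sketch below recomputes them from the table's own figures; the variable and function names are illustrative only.

```python
# (category, sites checked, with robots.txt, blocking any AI crawler)
ROWS = [
    ("News", 20, 15, 13), ("Food", 10, 10, 7), ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6), ("Healthcare", 10, 9, 6),
    ("Reference", 14, 11, 6), ("Automotive", 10, 9, 4),
    ("Social", 10, 10, 4), ("Sports", 10, 10, 4), ("Travel", 9, 9, 3),
    ("Legal", 10, 7, 2), ("RealEstate", 10, 7, 2), ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2), ("Education", 9, 7, 1), ("Government", 9, 8, 1),
]


def block_rate(blocking: int, with_robots: int) -> float:
    """Percentage of sites with a parseable robots.txt that block."""
    return round(100 * blocking / with_robots, 1)


rates = {name: block_rate(blocking, with_robots)
         for name, _checked, with_robots, blocking in ROWS}

total_checked = sum(row[1] for row in ROWS)
total_with_robots = sum(row[2] for row in ROWS)
total_blocking = sum(row[3] for row in ROWS)
corpus_rate = block_rate(total_blocking, total_with_robots)

print(rates["Healthcare"])                               # 66.7
print(total_checked, total_with_robots, total_blocking)  # 182 157 72
print(corpus_rate)                                       # 45.9
```

The column sums reproduce the corpus totals quoted throughout the report: 182 sites checked, 157 with a parseable file, and 72 blocking.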

The contrast between Healthcare and Legal is worth noting: both are heavily regulated, high-stakes content verticals, yet Legal lands at 28.6% while Healthcare lands at 66.7%. For a more detailed look at the Legal category, see Do Legal Sites Block AI Crawlers? Sealed robots.txt Data.

For another perspective on how media-adjacent categories compare, see the Sports category report — Sports sits at 40%, notably below Healthcare despite its similarly large content libraries.


Which AI Crawlers Are Most Commonly Blocked — Across All 157 Sites

The bot and operator leaderboards below reflect counts across the full 157-site corpus, not just Healthcare. They show which AI crawlers publishers are most likely to name and block in their robots.txt files.

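| Bot Name | Sites Blocking (of 157) | Block Rate |
| --- | --- | --- |
| CCBot | 58 | 36.9% |
| ClaudeBot | 53 | 33.8% |
| GPTBot | 45 | 28.7% |
| Bytespider | 44 | 28% |
| PerplexityBot | 42 | 26.8% |
| Meta-ExternalAgent | 39 | 24.8% |
| Applebot-Extended | 39 | 24.8% |
| Google-Extended | 37 | 23.6% |
| Amazonbot | 31 | 19.7% |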

CCBot (Common Crawl) leads the list, named by 58 of the 157 sites tracked. ClaudeBot (Anthropic) is second at 53. CCBot has been part of robots.txt discussions for longer than the newer vendor-specific bots, and the two tend to appear together in broad disallow configurations.

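The operator view follows. It counts sites by the company behind the crawler rather than by user-agent string, so the figures do not map one-to-one onto the bot table: Anthropic shows 55 against ClaudeBot's 53, and Cohere, Diffbot, and Mistral appear here although none of their crawlers is in the 9-bot leaderboard.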

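| Operator | Sites Blocking (of 157) |
| --- | --- |
| Common Crawl | 58 |
| Anthropic | 55 |
| OpenAI | 47 |
| Meta | 45 |
| ByteDance | 44 |
| Perplexity | 42 |
| Apple | 39 |
| Google | 37 |
| Cohere | 36 |
| Diffbot | 36 |
| Amazon | 31 |
| Mistral | 15 |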

Mistral, at 15 sites, is the least-blocked operator in the corpus. Operators like Cohere and Diffbot — less prominent in mainstream AI discourse — are still blocked by 36 sites each, suggesting that operators publishing any kind of named crawler user-agent string get added to block lists once administrators start configuring restrictions.

Across all 157 sites in the corpus, CCBot is named in 58 disallow configurations — the most of any single crawler.

Healthcare sites block AI crawlers at 66.7%, placing the category above the corpus-wide rate of 45.9% by a notable margin.


Frequently Asked Questions

Q: Why do Healthcare sites block AI crawlers at such a high rate?

A: The sealed data does not state reasons — only the fact of blocking. Based on what can be observed, most of the 6 blocking sites are commercial publishers with large, proprietary article libraries. Protecting editorial content from being used as training data without compensation is a common stated concern across publisher categories. The presence of who.int among the blockers is notable because it is not commercial, which suggests that at least some blocking reflects policy or data-governance concerns rather than purely commercial ones.

Q: Does a robots.txt block actually prevent AI crawlers from accessing the content?

A: No. robots.txt is an honor-system standard. It signals a site owner's preference; it does not technically enforce it. A crawler that ignores robots.txt directives can still fetch the page. The data here records only declared intent as of June 13, 2026, not observed crawler behavior.
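To make the honor-system point concrete, here is roughly what a compliant crawler does before fetching a page, sketched with Python's standard `urllib.robotparser`. The sample rules, the `SomeOtherBot` name, and the example.org URLs are invented for illustration and are not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the parsed rules whether this user-agent may fetch this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(SAMPLE, "GPTBot", "https://example.org/article"))          # False
print(may_fetch(SAMPLE, "SomeOtherBot", "https://example.org/article"))    # True
print(may_fetch(SAMPLE, "SomeOtherBot", "https://example.org/private/x"))  # False
print(may_fetch("", "GPTBot", "https://example.org/article"))              # True
```

The check runs entirely on the crawler's side. Nothing on the server stops a client that never performs it, which is why this report describes declared intent rather than enforcement.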

Q: Why did hhs.gov not have a robots.txt file?

A: The sealed data records only what was returned at the root-level robots.txt path on June 13, 2026. No robots.txt was found for hhs.gov. A missing file means there is no explicit instruction for any crawler — AI or otherwise. This does not necessarily mean the site supports AI crawling; it means no configuration was in place at that moment.

Q: What does it mean that clevelandclinic.org and hopkinsmedicine.org allow AI crawlers?

A: These two sites returned parseable robots.txt files with no disallow directives targeting the 9 tracked AI crawlers. The files do not say whether that is a deliberate choice or a default left unchanged. For large academic medical centers whose stated mission includes broad public education, open web access may be seen as consistent with that mission. The sealed data records the fact; interpretation beyond that is outside the scope of this report.

Q: How often does this data change?

A: The June 2026 edition is a point-in-time snapshot sealed on June 13, 2026. Robots.txt files change frequently — a site that allows access today can add a disallow tomorrow. The sha 9ceca3bdf0dfeaca identifies this exact snapshot. Future editions will re-crawl and produce a new sha; changes between editions will be visible by comparing snapshots.
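The report does not describe how the snapshot sha is derived. As an illustration of how a sealed snapshot can be fingerprinted so that any later change is detectable, the sketch below hashes a canonical JSON serialization with SHA-256 and keeps the first 16 hex characters. The serialization, the truncation length, the function name, and the example hostnames are all assumptions made for this example.

```python
import hashlib
import json


def snapshot_fingerprint(snapshot: dict[str, str], length: int = 16) -> str:
    """Hash a {site: robots.txt text} mapping into a short hex fingerprint.

    Assumption: keys are sorted and separators fixed so that the same
    content always serializes to the same bytes.
    """
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical contents, not data from the sealed snapshot.
june = {
    "site-a.example": "User-agent: GPTBot\nDisallow: /\n",
    "site-b.example": "User-agent: *\nDisallow:\n",
}
later = dict(june)
later["site-b.example"] = "User-agent: GPTBot\nDisallow: /\n"

print(snapshot_fingerprint(june) == snapshot_fingerprint(dict(reversed(june.items()))))  # True
print(snapshot_fingerprint(june) == snapshot_fingerprint(later))                         # False
```

Two editions with identical file contents produce the same fingerprint regardless of key order, and a single changed rule produces a different one.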


Methodology Note

US Tech Automations Research fetched the robots.txt file at the canonical root path for each of the 10 Healthcare sites on June 13, 2026. Each file was parsed for User-agent strings matching the 9 AI crawlers in this edition. A site is recorded as "blocking" if any Disallow rule applies to at least one of those crawlers. The full corpus covered 182 sites across 16 categories; 157 returned a parseable file. Nothing is estimated, modeled, or extrapolated — counts are verbatim from the sealed snapshot (sha 9ceca3bdf0dfeaca).

For the Food and Automotive category findings, see Do Food Sites Block AI Crawlers? and Do Automotive Sites Block AI Crawlers?.


Put AI-Access Data to Work

This data is not a one-time read — it is the basis for recurring operational workflows. Three profiles get the most value from monitoring Healthcare robots.txt configurations over time:

An SEO or content strategist working in health publishing needs to know when competitor sites change their AI-crawler stance. If everydayhealth.com or medscape.com removes a block, that signals a shift in content-access philosophy that could affect how AI-generated summaries of health content flow to users. A weekly re-crawl of Healthcare sites with an alert on any configuration change gives the strategist an early signal — not a quarterly surprise.

A publisher RevOps lead managing licensing or syndication relationships with health publishers should track which sites are open vs. closed to AI access. Blocking configurations can signal that a site is preparing to negotiate licensing for AI training or retrieval. A change from "allow" to "block" on kff.org or hopkinsmedicine.org would be a meaningful operational trigger.

A retrieval or data engineer building a health-domain RAG pipeline needs to know which of the 10 sites are accessible for indexing under the robots.txt honor system. The current state — 3 allow, 1 no-file, 6 block — defines the compliant crawl surface. That surface can shrink overnight. An automated weekly check against this category, with a diff alert on any new Disallow entry, keeps the pipeline configuration current without manual auditing.
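The diff alert mentioned in these profiles reduces to a set comparison between two crawls. The sketch below is one way to express it, assuming each crawl has already been reduced to a mapping from site to the set of blocked bots. The function name and the hostnames are hypothetical, and the sample values are invented rather than drawn from the snapshot.

```python
def diff_blocks(previous: dict[str, set[str]],
                current: dict[str, set[str]]) -> dict[str, dict[str, set[str]]]:
    """Report, per site, which bots were newly blocked or unblocked."""
    changes: dict[str, dict[str, set[str]]] = {}
    for site in sorted(set(previous) | set(current)):
        before = previous.get(site, set())
        after = current.get(site, set())
        added, removed = after - before, before - after
        if added or removed:
            changes[site] = {"added": added, "removed": removed}
    return changes


# Invented example crawls.
last_week = {
    "site-a.example": set(),
    "site-b.example": {"GPTBot", "CCBot"},
    "site-c.example": {"ClaudeBot"},
}
this_week = {
    "site-a.example": {"GPTBot"},      # newly blocks one bot
    "site-b.example": {"CCBot"},       # dropped one block
    "site-c.example": {"ClaudeBot"},   # unchanged, so not reported
}

for site, change in diff_blocks(last_week, this_week).items():
    print(site, "added:", sorted(change["added"]),
          "removed:", sorted(change["removed"]))
# site-a.example added: ['GPTBot'] removed: []
# site-b.example added: [] removed: ['GPTBot']
```

Reporting removals as well as additions covers both triggers described above: a site that starts blocking and a site that lifts a block.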

US Tech Automations automates exactly this kind of monitoring — scheduled crawler checks, configuration diffs, and alert routing — through its agentic workflow platform. See how agentic workflows handle recurring data monitoring.


Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 9ceca3bdf0dfeaca).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Healthcare Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-healthcare-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 9ceca3bdf0dfeaca

Machine-readable data: CSV · JSON · All research & methodology

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.