Do Legal Sites Block AI Crawlers? Sealed robots.txt Data

Jun 13, 2026

2 of 7 Legal sites block at least one AI crawler.

Legal sites block AI crawlers at a 28.6% rate.

72 of 157 sites block at least one AI crawler across the corpus.

Key Takeaways

2 of 7 Legal sites with a parseable robots.txt block at least one AI crawler.

Legal is among the more open categories in the June 2026 Closing Web edition, despite being a high-stakes content vertical: it ties with RealEstate for the fifth-lowest block rate of the 16 categories. The 28.6% block rate places Legal well below the corpus-wide average of 45.9%, a striking finding given that legal content is often considered sensitive, proprietary, or professionally regulated.

28.6% of Legal sites with robots.txt block at least one AI crawler.

Of 10 Legal sites checked, 7 returned a parseable robots.txt file. The remaining 3 — findlaw.com, americanbar.org, and courtlistener.com — returned no robots.txt at all. Of the 7 sites with parseable files, only 2 are actively blocking any of the 9 tracked AI crawlers.

Corpus-wide, 72 of 157 sites block any AI crawler — a 45.9% rate.

Legal sits substantially below that figure. The dominant posture in this category is open access: the government legal publishers and free legal databases in the sample publish no Disallow rule for any tracked AI crawler, leaving access governed by the robots.txt honor system.

What the Data Covers

This report is one installment of the US Tech Automations Closing Web series, which examines how publishers across 16 content categories configure their robots.txt files with respect to AI crawlers.

The June 2026 EXPANDED edition checked 182 sites across 16 categories. Of those, 157 returned a parseable robots.txt file. The snapshot was sealed on June 13, 2026 with sha 9ceca3bdf0dfeaca — and nothing is estimated, modeled, or extrapolated. Every figure in this report is a verbatim count from a public robots.txt file read at that exact moment in time.

The methodology is deliberately narrow: we fetch the root-level robots.txt for each site, parse it for named AI-crawler User-agent strings, and record whether any Disallow rule applies to at least one of the 9 crawlers tracked in this edition. We do not infer legal strategy, estimate liability exposure, or speculate on regulatory motivation. The data is exactly what was found.
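The parsing step described above can be sketched in Python. This is a minimal illustration, not the pipeline behind the sealed snapshot: the function name `blocked_crawlers` is hypothetical, and the 9 crawler names are the ones listed in this report's bot leaderboard. The sketch also assumes two things the methodology does not spell out: only groups that name a tracked crawler count (a wildcard `User-agent: *` group is ignored), and an empty `Disallow:` value blocks nothing.

```python
# The 9 tracked crawlers, as named in this report's bot leaderboard.
TRACKED = (
    "CCBot", "ClaudeBot", "GPTBot", "Bytespider", "PerplexityBot",
    "Meta-ExternalAgent", "Applebot-Extended", "Google-Extended", "Amazonbot",
)


def blocked_crawlers(robots_txt: str) -> set[str]:
    """Return the tracked crawlers that have a non-empty Disallow rule.

    Assumption: wildcard groups are ignored; only named groups count.
    """
    tracked = {name.lower(): name for name in TRACKED}
    blocked: set[str] = set()
    agents: list[str] = []   # user-agents of the group being read
    in_rules = False         # True once the current group has a rule line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # empty Disallow allows all
                blocked.update(tracked[a] for a in agents if a in tracked)
    return blocked


def is_blocking(robots_txt: str) -> bool:
    """A site counts as blocking if at least one tracked crawler is disallowed."""
    return bool(blocked_crawlers(robots_txt))
```

Under these assumptions, a file whose only restrictive group is `User-agent: *` would be recorded as not blocking. Counting wildcard groups instead would be a different, broader definition.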

For Legal, the 10 sites checked span commercial legal information providers, nonprofit legal databases, professional bar associations, federal government court and regulatory sites, and independent legal journalism. The diversity of that set makes the overall openness of the category more meaningful — it is not an artifact of a single publisher type dominating the sample.

Site-by-Site Breakdown

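The table below shows the four fields recorded for the Legal category in the sealed snapshot: sites checked, sites with a parseable robots.txt, sites blocking any AI crawler, and the resulting block rate.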

Category	Sites Checked	With robots.txt	Blocking Any AI Crawler	Block Rate
Legal	10	7	2	28.6%

The 2 Blockers

nolo.com is a commercial legal self-help publisher with a large library of consumer-facing legal guides, forms, and explanations. Its decision to block AI crawlers is consistent with the commercial-publisher pattern seen across the broader corpus: proprietary editorial content is viewed as a valuable asset not to be freely used for AI training.

scotusblog.com is an independent legal journalism site focused on the U.S. Supreme Court. Its block configuration places it among the restrictive minority in the Legal category. As an independent editorial operation with a highly specific focus, protecting its coverage and analysis from AI training use appears to reflect an intentional content-protection stance.

The Open Sites

law.cornell.edu (Legal Information Institute at Cornell) returned a parseable robots.txt without blocking any of the 9 tracked crawlers. LII is a widely cited free-access legal database that has long emphasized broad public access to legal information. Its open robots.txt posture is consistent with that mission.

justia.com provides free access to federal and state case law, statutes, and regulations. Its open configuration is similarly mission-aligned — Justia exists to make legal information freely accessible, and allowing AI crawlers follows from that principle.

supremecourt.gov is the official website of the U.S. Supreme Court. It returned a parseable robots.txt with no AI-crawler disallows. A government court site publishing a robots.txt with no restrictions on the tracked AI crawlers is a notable data point in this edition.

justice.gov is the U.S. Department of Justice website. Like supremecourt.gov, it returned a parseable file without blocking any of the 9 tracked crawlers, reflecting a generally open federal government posture toward web access.

ecfr.gov publishes the Electronic Code of Federal Regulations — a core reference for federal regulatory text. Its open configuration is consistent with the public-access mandate of federal regulatory publishing.

The No-robots Sites

Three sites returned no robots.txt file: findlaw.com, americanbar.org, and courtlistener.com. Absence of a robots.txt means no explicit disallow directives exist under the standard. Crawlers encounter no stated restriction. This is a distinct posture from actively allowing — it is silence rather than permission — but under robots.txt convention it is treated as no restriction.

How Legal Compares Across All 16 Categories

The table below shows all 16 categories from the sealed snapshot, sorted by block rate. Legal is tied with RealEstate at 28.6% — both are notably below the corpus average.

Category	Sites Checked	With robots.txt	Blocking Any	Block Rate
News	20	15	13	86.7%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Reference	14	11	6	54.5%
Automotive	10	9	4	44.4%
Social	10	10	4	40%
Sports	10	10	4	40%
Travel	9	9	3	33.3%
Legal	10	7	2	28.6%
RealEstate	10	7	2	28.6%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%

Legal ties RealEstate in the lower-middle tier. The categories at the top of the table (News at 86.7%, Tech at 69.2%, Healthcare at 66.7%) all show substantially more protective postures. The four categories below Legal (Finance, Retail, Education, Government) are even more open, suggesting a broad tier of verticals where commercial content-protection instincts are weaker, in some cases because mission-driven or government-affiliated publishers dominate.
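The corpus-wide figures can be re-derived from the category table. The short Python sketch below sums the published counts and lists the categories whose block rate falls below the corpus rate. The row values are copied from the table above; the variable and function names are illustrative only.

```python
# (category, sites checked, with robots.txt, blocking any AI crawler)
ROWS = [
    ("News", 20, 15, 13), ("Food", 10, 10, 7), ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6), ("Healthcare", 10, 9, 6),
    ("Reference", 14, 11, 6), ("Automotive", 10, 9, 4),
    ("Social", 10, 10, 4), ("Sports", 10, 10, 4), ("Travel", 9, 9, 3),
    ("Legal", 10, 7, 2), ("RealEstate", 10, 7, 2), ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2), ("Education", 9, 7, 1), ("Government", 9, 8, 1),
]


def rate(blocking: int, with_file: int) -> float:
    """Block rate in percent, rounded to one decimal place."""
    return round(100 * blocking / with_file, 1)


checked = sum(row[1] for row in ROWS)
with_file = sum(row[2] for row in ROWS)
blocking = sum(row[3] for row in ROWS)
corpus_rate = rate(blocking, with_file)

# Categories whose block rate is below the corpus-wide rate.
below_average = [name for name, _, w, b in ROWS if rate(b, w) < corpus_rate]

print(checked, with_file, blocking, corpus_rate)  # 182 157 72 45.9
print(below_average)
```

The sums reproduce the 182 sites checked, 157 parseable files, and 72 blocking sites reported elsewhere in this report, and Legal appears among the 10 categories below the corpus rate.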

The contrast with Healthcare is particularly sharp. Both are high-stakes content areas with significant professional and regulatory dimensions, yet Healthcare lands at 66.7% and Legal at 28.6%. For the full Healthcare picture, see “Do Healthcare Sites Block AI Crawlers? Sealed robots.txt Data.”

For a comparison with another below-average category, see the Sports category report, where 4 of 10 sites block at a 40% rate.

Which AI Crawlers Are Most Commonly Blocked — Across All 157 Sites

The bot leaderboard below reflects counts across the full 157-site corpus, not the Legal category in isolation. It shows which crawlers publishers name most often in Disallow directives.

Bot Name	Sites Blocking (of 157)	Block Rate
CCBot	58	36.9%
ClaudeBot	53	33.8%
GPTBot	45	28.7%
Bytespider	44	28%
PerplexityBot	42	26.8%
Meta-ExternalAgent	39	24.8%
Applebot-Extended	39	24.8%
Google-Extended	37	23.6%
Amazonbot	31	19.7%

CCBot leads with 58 sites, followed by ClaudeBot at 53 and GPTBot at 45. These three tend to appear together in restrictive configurations. Amazonbot trails the group at 31.

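The operator-level view of the same corpus follows. It appears to cover more user agents than the 9-crawler leaderboard: it lists operators absent from that table (Cohere, Diffbot, Mistral), and some operator counts exceed the matching bot counts (Anthropic at 55 versus ClaudeBot at 53, for example).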

Operator	Sites Blocking (of 157)
Common Crawl	58
Anthropic	55
OpenAI	47
Meta	45
ByteDance	44
Perplexity	42
Apple	39
Google	37
Cohere	36
Diffbot	36
Amazon	31
Mistral	15

Anthropic is the second most-blocked operator at 55, behind Common Crawl at 58. Mistral, at 15, is the least blocked. The gap between the top and bottom of this list may reflect uneven awareness of different AI operators among the site administrators who configure robots.txt files: some operators have simply been discussed more prominently in publishing circles.

Across all 157 sites in the corpus, Common Crawl is the most-blocked operator — named in Disallow configurations at 58 sites.

Legal sites block AI crawlers at 28.6%, well below the corpus-wide average of 45.9%, making this one of the more open content categories in the June 2026 edition.

Frequently Asked Questions

Q: Why is the Legal category so open to AI crawlers given the sensitivity of legal content?

A: The sealed data records only the fact of configuration, not the reasoning. But the composition of the open sites offers context: the sites with a parseable robots.txt that do not block include 3 federal government sites (supremecourt.gov, justice.gov, ecfr.gov) and 2 free-access legal databases (law.cornell.edu, justia.com). These are mission-driven open-access publishers. Commercial legal publishers may simply be underrepresented in this sample, or may not yet have updated their robots.txt files to target AI crawlers by name.

Q: What does it mean that 3 Legal sites have no robots.txt at all?

A: findlaw.com, americanbar.org, and courtlistener.com returned no robots.txt file at the root path on June 13, 2026. Under the robots.txt convention, absence of a file means no restrictions are stated, so crawlers that honor the standard would treat the site as unrestricted. This differs from an explicit allow: it is the absence of any signal rather than a positive declaration of openness.

Q: Does a 28.6% block rate mean Legal sites are safe for AI training data use?

A: No. This report reflects only the robots.txt honor-system signal. Robots.txt is not a legal instrument. Sites that do not block AI crawlers in robots.txt may still have terms of service, copyright claims, or other legal instruments that restrict use of their content for AI training. The 28.6% figure describes configuration, not legal permissions.

Q: Could the Legal category block rate increase in future editions?

A: The sealed data covers a single point in time: June 13, 2026. Robots.txt files can change at any moment. The presence of a commercial publisher, nolo.com, in the blocking group, alongside the absence of blocking by free-access and government publishers, suggests that if more commercial legal information sites update their configurations, the category rate could shift substantially. With only 7 sites in the denominator, each additional blocker moves the rate by about 14 percentage points. Future editions will capture those changes.

Q: How is the robots.txt block rate calculated for this category?

A: The block rate is the count of sites with parseable robots.txt that disallow at least one tracked AI crawler, divided by the total count of sites with parseable robots.txt. For Legal: 2 blocking divided by 7 with parseable robots.txt equals 28.6%. Sites with no robots.txt (3 in this category) are excluded from the denominator. Nothing is estimated, modeled, or extrapolated.
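The calculation above can be written out as a short Python sketch using the Legal sites named in this report. The posture labels (`blocks`, `open`, `no_file`) and the function name `block_rate` are illustrative, not part of the published dataset.

```python
from typing import Iterable, Optional

# Posture of each Legal site, as described in this report.
LEGAL_SITES = {
    "nolo.com": "blocks",
    "scotusblog.com": "blocks",
    "law.cornell.edu": "open",
    "justia.com": "open",
    "supremecourt.gov": "open",
    "justice.gov": "open",
    "ecfr.gov": "open",
    "findlaw.com": "no_file",
    "americanbar.org": "no_file",
    "courtlistener.com": "no_file",
}


def block_rate(postures: Iterable[str]) -> Optional[float]:
    """Percent of sites with a parseable robots.txt that block a tracked crawler.

    Sites with no robots.txt are excluded from the denominator.
    Returns None when no site has a parseable file.
    """
    with_file = [p for p in postures if p != "no_file"]
    if not with_file:
        return None
    blocking = sum(p == "blocks" for p in with_file)
    return round(100 * blocking / len(with_file), 1)


print(block_rate(LEGAL_SITES.values()))  # 28.6
```

Counting the 3 no-file sites as non-blocking would instead give 2 of 10, or 20%, which is why the choice of denominator is stated explicitly.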

Methodology Note

US Tech Automations Research fetched the robots.txt file at the canonical root path for each of the 10 Legal sites on June 13, 2026. Each file was parsed for User-agent strings matching the 9 AI crawlers in this edition. A site is recorded as "blocking" if any Disallow rule applies to at least one of those crawlers. The full corpus covered 182 sites across 16 categories; 157 returned a parseable file. Nothing is estimated, modeled, or extrapolated — counts are verbatim from the sealed snapshot (sha 9ceca3bdf0dfeaca).

For comparison with other professional-information verticals, see “Do Healthcare Sites Block AI Crawlers?” and “Do Real Estate Sites Block AI Crawlers?”

Put AI-Access Data to Work

The Legal category robots.txt landscape is not static — even though 28.6% blocking today represents an open posture, configurations change. Three operational profiles have a direct use for monitoring Legal site AI-access configurations on a recurring basis:

An SEO or content strategist at a legal publisher needs to know when competitors change their AI-crawler stance. If nolo.com expands its block to additional crawlers, or if justia.com adds a disallow for the first time, that signals a competitive shift in how legal content is being positioned for AI training markets. A weekly automated re-crawl of all 10 Legal sites, with a configuration-diff alert, gives the strategist an early view of market sentiment changes — not a lagging indicator.

A RevOps or business development lead in legal tech who is building data licensing deals with legal publishers needs a real-time picture of who is open vs. closed. A site that moves from no robots.txt to an active block is signaling that it is preparing to assert control — a trigger to accelerate outreach before the window closes.

A retrieval or data engineer building a legal-domain knowledge graph or RAG pipeline needs a weekly confirmation that the compliant crawl surface has not changed. With 3 sites missing a robots.txt file and 5 more publishing no Disallow rule for the tracked crawlers, the accessible surface is relatively large (8 of the 10 sites checked), but it can shrink overnight. An automated check with a diff on each site removes the need for manual re-auditing and keeps the pipeline configuration current.
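The configuration-diff check that all three profiles rely on can be sketched with Python's standard `difflib` module. This is a minimal illustration under stated assumptions, not a description of any product: the function names `robots_diff` and `added_disallows` are hypothetical, the two snapshots are assumed to have been fetched and stored already, and a missing robots.txt is represented as `None`.

```python
import difflib
from typing import Optional


def robots_diff(previous: Optional[str], current: Optional[str]) -> list[str]:
    """Return unified-diff lines between two robots.txt snapshots.

    Assumption: None means the site served no robots.txt, which is
    compared here as an empty file.
    """
    old = [] if previous is None else previous.splitlines()
    new = [] if current is None else current.splitlines()
    return list(
        difflib.unified_diff(old, new, "previous", "current", lineterm="")
    )


def added_disallows(diff_lines: list[str]) -> list[str]:
    """Pick out newly added Disallow lines, the usual trigger for an alert."""
    added = []
    for line in diff_lines:
        if line.startswith("+") and not line.startswith("+++"):
            text = line[1:].strip()
            if text.lower().startswith("disallow"):
                added.append(text)
    return added


# Example: a site moves from no robots.txt to an active block.
diff = robots_diff(None, "User-agent: GPTBot\nDisallow: /\n")
if added_disallows(diff):
    print("\n".join(diff))
```

An empty diff means the configuration is unchanged since the last check. A production monitor would also need to record which User-agent group each new Disallow belongs to before deciding whether the change affects a tracked crawler.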

US Tech Automations automates exactly this kind of ongoing monitoring — scheduled robots.txt fetches, configuration diffs, and alert routing without manual oversight. See how agentic workflows handle recurring access monitoring.

For the whole-web baseline behind the Legal category, see our national study on how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 9ceca3bdf0dfeaca).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Legal Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-legal-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 9ceca3bdf0dfeaca

Machine-readable data: CSV · JSON · All research & methodology