Do Nonprofit Sites Block AI Crawlers? Sealed robots.txt Data

Jun 14, 2026

The Nonprofit category sits alone at the very bottom of the June 2026 Closing Web corpus: no nonprofit site we checked blocks any tracked AI crawler. Of 10 nonprofit sites, 6 returned a parseable robots.txt file, and 0 of those 6 block any tracked AI crawler, a block rate of 0%. That is an exact zero, not a rounding artifact or a near-zero result.

The Nonprofit category is the only one in the entire 24-category sweep to record a 0% block rate.

0 of 6 Nonprofit sites with a parseable robots.txt block any AI crawler — a 0% block rate.

This report uses only data read verbatim from the sealed snapshot. To be explicit, nothing is estimated, modeled, or extrapolated. Every figure comes directly from public robots.txt files sealed June 14, 2026 under snapshot sha 834f1e2f07af24fd.

Why 0% Is the Real Signal

The absence of any AI-crawler blocking among the nonprofit sites in this panel is itself informative — and the reasons are worth considering carefully rather than treating the zero as a default.

The 6 sites with parseable robots.txt files (worldwildlife.org, habitat.org, aclu.org, salvationarmyusa.org, feedingamerica.org, and unitedway.org) represent a cross-section of the nonprofit world: environmental conservation, affordable housing, civil liberties, social services, domestic hunger relief, and federated fundraising. These are organizations whose missions depend on public awareness and mobilization. A robots.txt block on an AI crawler restricts content visibility in AI-generated results, which for an advocacy or cause-marketing mission runs directly counter to organizational goals.

"0 of 6 Nonprofit sites with parseable robots.txt files blocked any AI crawler as of June 14, 2026 — the only category in the 24-category corpus to record a 0% block rate."

Consider what a block would accomplish for a site like feedingamerica.org or habitat.org. Their content — explanations of food insecurity, donation appeals, program descriptions, volunteer recruitment — is designed to reach the widest possible audience. Being surfaced in AI-generated answers when someone asks about hunger relief or housing assistance is arguably better-than-organic distribution for a mission-driven organization. The economic incentives that push commercial editorial publishers toward blocking (training-data monetization, licensing, traffic attribution) simply do not apply in the same way when the organization's measure of success is cause awareness and donor conversion, not content paywall revenue.

The 4 sites that returned no robots.txt file — redcross.org, unicef.org, charitywater.org, and doctorswithoutborders.org — are excluded from blocking counts entirely. A missing robots.txt does not imply blocking; compliant crawlers treat a missing file as permission to crawl all paths. Their absence from the file-present pool reinforces the pattern of minimal barriers rather than contradicting it.
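The pooling rule described above can be sketched in Python. This is a minimal illustration rather than the report's actual pipeline: `FetchResult` and `split_pools` are hypothetical names, and the file contents are placeholders, not the real robots.txt text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchResult:
    domain: str
    robots_txt: Optional[str]  # None means no robots.txt file was returned


def split_pools(results):
    """Separate sites with a file (the block-rate denominator) from sites without one."""
    with_file = [r.domain for r in results if r.robots_txt is not None]
    no_file = [r.domain for r in results if r.robots_txt is None]
    return with_file, no_file


# Placeholder text only; the real file contents are not reproduced here.
PLACEHOLDER = "# file contents omitted in this sketch"

panel = [
    FetchResult("worldwildlife.org", PLACEHOLDER),
    FetchResult("habitat.org", PLACEHOLDER),
    FetchResult("aclu.org", PLACEHOLDER),
    FetchResult("salvationarmyusa.org", PLACEHOLDER),
    FetchResult("feedingamerica.org", PLACEHOLDER),
    FetchResult("unitedway.org", PLACEHOLDER),
    FetchResult("redcross.org", None),
    FetchResult("unicef.org", None),
    FetchResult("charitywater.org", None),
    FetchResult("doctorswithoutborders.org", None),
]

with_file, no_file = split_pools(panel)
print(len(with_file), len(no_file))  # 6 4
```

Only the `with_file` pool feeds the block-rate denominator; the `no_file` sites are reported separately and never counted as blocking.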

0% is the lowest block rate in the entire June 2026 corpus, 12.5 percentage points below Government at 12.5%.

Why This Differs From Every Other Category

The contrast with the rest of the corpus is striking. Gaming sits at 88.9%. News at 82.4%. Even sectors with commercial incentives toward openness — Finance at 18.2%, Education at 14.3%, Government at 12.5% — still record at least one blocker. Nonprofit is the only category where the count for blocking sites is zero.

The structural explanation is that nonprofit organizations:

Do not have a content-licensing revenue model that AI training threatens.
Often have explicit mandates around public access and information dissemination.
Benefit more from AI visibility (cause discovery, donation intent) than they risk losing from it (there is no subscription to be bypassed, no salary-data product to be extracted).
Tend to maintain leaner web operations, meaning active tracking and response to emerging AI-crawler standards is less resourced than at commercial publishers.

All four factors push toward allowing. The result is a unanimous 0% across the 6 sites with parseable robots.txt files.

"Across all 223 corpus sites with parseable robots.txt files, 104 block at least one AI crawler — a 46.6% rate. Nonprofit is the only category to record zero blocks."

How All 24 Categories Compare

Category	Sites Checked	With robots.txt	Blocking Any AI	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Reference	14	11	6	54.5%
Science	10	10	5	50%
Automotive	10	9	4	44.4%
Home & Garden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40%
Sports	10	10	4	40%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Legal	10	7	2	28.6%
Real Estate	10	7	2	28.6%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Nonprofit	10	6	0	0%

The table makes the Nonprofit result concrete: it is the last row, and it is the only row with a zero in the blocking count. The next-closest categories — Government (12.5%), Education (14.3%), Retail (16.7%) — are all still nonzero. Nonprofit is a genuine outlier at the low end, not a near-miss.
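Each block rate in the table is the blocking count divided by the number of sites with a parseable robots.txt file. A short Python sketch using the table's own counts (the `block_rate` helper name is illustrative) reproduces the per-category and corpus-wide figures:

```python
# (category, sites checked, with robots.txt, blocking any AI crawler), copied from the table
ROWS = [
    ("Gaming", 9, 9, 8), ("News", 20, 17, 14), ("Food", 10, 10, 7),
    ("Tech", 15, 13, 9), ("Entertainment", 9, 9, 6), ("Healthcare", 10, 9, 6),
    ("Music", 10, 9, 6), ("Reference", 14, 11, 6), ("Science", 10, 10, 5),
    ("Automotive", 10, 9, 4), ("Home & Garden", 10, 9, 4), ("Fashion", 9, 7, 3),
    ("Social", 10, 10, 4), ("Sports", 10, 10, 4), ("Jobs", 10, 8, 3),
    ("Travel", 9, 9, 3), ("Weather", 10, 6, 2), ("Legal", 10, 7, 2),
    ("Real Estate", 10, 7, 2), ("Finance", 12, 11, 2), ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1), ("Government", 9, 8, 1), ("Nonprofit", 10, 6, 0),
]


def block_rate(blocking: int, with_file: int) -> float:
    """Block rate in percent, rounded to one decimal place."""
    return round(100 * blocking / with_file, 1)


total_with_file = sum(row[2] for row in ROWS)
total_blocking = sum(row[3] for row in ROWS)
rates = {name: block_rate(blocking, with_file) for name, _, with_file, blocking in ROWS}

print(total_blocking, total_with_file, block_rate(total_blocking, total_with_file))
# 104 223 46.6
print(rates["Government"] - rates["Nonprofit"])
# 12.5
```

The denominator is the "With robots.txt" column, not "Sites Checked", which is why Nonprofit is 0 of 6 rather than 0 of 10.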

Corpus-Wide Bot and Operator Blocking

The tables below report bot- and operator-level blocking across all 223 sites with parseable robots.txt files. These counts are not Nonprofit-specific — they describe the full corpus.

Bot	Sites Blocking It (of 223)	Block Rate
CCBot	85	38.1%
ClaudeBot	74	33.2%
Bytespider	69	30.9%
GPTBot	64	28.7%
Meta-ExternalAgent	63	28.3%
PerplexityBot	60	26.9%
Applebot-Extended	60	26.9%
Google-Extended	57	25.6%
Amazonbot	50	22.4%

CCBot remains the most blocked individual crawler across the corpus: 85 of 223 sites carry a disallow for it. None of those 85 blocks comes from the Nonprofit category; they are concentrated at the high-blocking end of the spectrum documented in the Gaming sites report.

Operator	Sites Blocking Them (of 223)
Common Crawl	85
Anthropic	80
Meta	73
ByteDance	69
OpenAI	66
Perplexity	60
Apple	60
Google	57
Cohere	56
Diffbot	55
Amazon	50
Mistral	21

For operators and AI product teams, the Nonprofit category represents the cleanest access signal in the corpus: no expressed honor-system barrier from any of the 6 sites with robots.txt files. That does not substitute for reviewing terms of service or assessing any applicable data agreements, but at the robots.txt level, the sector is uniformly open as of June 14, 2026.

Mistral faces the fewest blocks corpus-wide — 21 of 223 sites — likely reflecting the recency of its crawler relative to CCBot or GPTBot.

Key Takeaways

0 of 6 Nonprofit sites block any AI crawler — a 0% block rate, the lowest in the 24-category corpus.
The 6 sites with parseable robots.txt files are worldwildlife.org, habitat.org, aclu.org, salvationarmyusa.org, feedingamerica.org, and unitedway.org — all allowing all tracked crawlers.
redcross.org, unicef.org, charitywater.org, and doctorswithoutborders.org returned no robots.txt file in this snapshot.
The corpus-wide block rate is 46.6%; Nonprofit is the only category at 0%.
The 0% is computed only over sites that publish a parseable robots.txt file, so it is not an artifact of missing files; the likely explanation is mission alignment with AI visibility.
39 of 223 sites across the corpus have deployed llms.txt — a 17.5% adoption rate for that newer standard.
The 24-category distribution runs from Gaming at 88.9% to Nonprofit at 0%; 11 of the 24 categories fall between 28.6% and 50%.

Frequently Asked Questions

Q: Does 0% mean every nonprofit site we checked explicitly welcomes AI crawlers?

A: No. It means 0 of the 6 sites with parseable robots.txt files carry an explicit disallow directive for any tracked AI crawler. Absence of a block is an honor-system signal, not an affirmative invitation. It means the sites have not used robots.txt to express a crawling objection; it does not address terms of service, data reuse rights, or any other form of consent. Nonprofit organizations can still restrict AI access through other mechanisms — access control, legal terms, or authentication.
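One way to check a robots.txt file for this kind of directive is Python's standard `urllib.robotparser`. The sketch below is illustrative and rests on two assumptions: `TRACKED_BOTS` lists only the nine crawlers named in the bot table (the operator table suggests the report tracks more), and a bot counts as "blocked" when it may not fetch the root path. The report's exact counting rule is not stated and may differ, for example in how it treats wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the nine crawlers from the bot table; the report may track more.
TRACKED_BOTS = [
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Meta-ExternalAgent",
    "PerplexityBot", "Applebot-Extended", "Google-Extended", "Amazonbot",
]


def blocked_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the tracked bots that this robots.txt text forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in TRACKED_BOTS if not parser.can_fetch(bot, path)]


# Illustrative file text, not taken from any site in the panel.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(sample))                          # ['GPTBot']
print(blocked_bots("User-agent: *\nDisallow:\n"))    # []
```

Under this rule a site counts toward the blocking total when `blocked_bots` returns a non-empty list; an empty `Disallow:` line permits everything, so it produces no block.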

Q: What about the 4 sites with no robots.txt file — redcross.org, unicef.org, charitywater.org, doctorswithoutborders.org?

A: A missing robots.txt file means the site has not published explicit crawling instructions. Per the robots exclusion protocol, compliant crawlers treat this as permission to crawl all accessible paths. These 4 sites are excluded from the blocking count because there is no file to assess. Their absence from the robots.txt pool is consistent with the category-wide pattern of minimal AI access barriers, but they are counted as neither blocking nor allowing in the sealed figures.

Q: Could nonprofit sites change their policy in future editions?

A: Yes. robots.txt files can be updated at any time, and the organizations in this panel could add AI-crawler disallows in response to internal policy decisions, board guidance, or sector-level consensus. The 0% figure is sealed to June 14, 2026. Re-crawling the panel in future snapshots will detect any change. Monitoring the Nonprofit category for a first-ever blocker would be an early signal that the sector's relationship with AI access is shifting. See the Jobs category report for how a sector with more commercial pressure handles a similar structural dynamic.

Q: Why does Nonprofit have the lowest block rate while Government — which might seem similarly mission-oriented — blocks at 12.5%?

A: Government is a different case. Exactly 1 of the 8 Government sites with a parseable robots.txt file blocks an AI crawler, which yields the 12.5% rate. Government agencies may carry more conservative IT policies shaped by legal counsel, data-sovereignty concerns, or compliance frameworks that have no analog in the nonprofit sector. Nonprofit organizations in this panel have cause-awareness missions where maximum reach is a success metric; government agencies often have access-management obligations that do not apply to advocacy organizations. The difference in block rates is consistent with that structural distinction, although with 8 and 6 sites in the two pools it comes down to a single blocking site.

Q: What is a robots.txt file, and why does it matter here?

A: A robots.txt file is a plain-text file placed at the root of a domain that signals crawling preferences to compliant automated agents. It is a voluntary standard: publishers cannot force compliance, but legitimate search engines and AI crawlers generally respect it as a matter of good-faith operating practice and, increasingly, legal risk management. A sealed-snapshot methodology reads these files verbatim on a specific date, hashes the result, and counts blocks without inference.

That makes the figures verifiable and reproducible at the stated sha.
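The sealing step can be illustrated with a short Python sketch. The report does not describe how its snapshot sha is derived, so everything here is an assumption: the `seal` function name, the choice to hash domains in sorted order, and the null-byte separators. It will not reproduce the published value; it only shows why a hash over verbatim file contents makes later changes detectable.

```python
import hashlib


def seal(files: dict[str, str]) -> str:
    """Hash a {domain: robots.txt text} mapping into one SHA-256 hex digest.

    Assumed scheme: domains are processed in sorted order so the result does
    not depend on crawl order, and a null byte separates each field.
    """
    digest = hashlib.sha256()
    for domain in sorted(files):
        digest.update(domain.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[domain].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


# Placeholder domains and contents, for illustration only.
snapshot = {"a.example": "User-agent: *\nAllow: /\n", "b.example": "User-agent: *\nDisallow:\n"}
reordered = dict(reversed(list(snapshot.items())))
edited = {**snapshot, "b.example": "User-agent: GPTBot\nDisallow: /\n"}

print(seal(snapshot) == seal(reordered))  # True
print(seal(snapshot) == seal(edited))     # False
```

Anyone holding the same verbatim files and the same serialization rule can recompute the digest and compare it with the published one.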

Put AI-Access Data to Work

Three recurring workflows apply directly to what the Nonprofit sealed data shows.

SEO and content strategists at mission-driven organizations can use the 0% block rate as a benchmark — the sector standard as of June 2026 is universal openness to AI crawling. A concrete recurring job: re-crawl the 10 Nonprofit panel sites monthly and alert on the first appearance of any AI-crawler disallow directive. That alert is a signal to examine whether peer organizations have made a coordinated policy shift that your organization should assess. The trigger is any change from zero to nonzero in the category block count.
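The alert trigger for that recurring job could look like the following Python sketch. It assumes each crawl has already been reduced to a mapping from domain to the set of AI crawlers that site blocks; the `new_blocks` function name and the `.example` domains are hypothetical, and the "later" snapshot is invented for illustration, not observed data.

```python
def new_blocks(previous: dict[str, set[str]], current: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per domain, the bots blocked in `current` that were not blocked in `previous`."""
    alerts = {}
    for domain, bots in current.items():
        added = bots - previous.get(domain, set())
        if added:
            alerts[domain] = added
    return alerts


# Baseline: an all-open panel, like the Nonprofit category in this snapshot.
baseline = {"site-a.example": set(), "site-b.example": set(), "site-c.example": set()}

# Hypothetical later crawl in which one site adds a disallow for GPTBot.
later = {"site-a.example": set(), "site-b.example": {"GPTBot"}, "site-c.example": set()}

print(new_blocks(baseline, baseline))  # {}
print(new_blocks(baseline, later))     # {'site-b.example': {'GPTBot'}}
```

A non-empty result against an all-open baseline is exactly the zero-to-nonzero change in the category block count described above.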

Data and research leads at AI companies and nonprofits can use this category data as evidence in access-policy discussions. The Nonprofit 0% result, cross-walked against sectors like Gaming at 88.9% and News at 82.4%, illustrates that AI-access blocking is a deliberate commercial and mission-based choice rather than a sector default. Publishing this kind of transparent sector-by-sector analysis — the approach US Tech Automations takes with the Closing Web series — provides the shared factual baseline that policy conversations need. See the Weather category report for the sector at 33.3%, sitting between Nonprofit and the corpus median.

Retrieval and data pipeline engineers building systems that ingest nonprofit content — for cause-matching, charitable research, or donor intelligence — face the clearest access map of any category: no honor-system barriers from any of the 6 sites with robots.txt files in this panel as of June 14, 2026. A scheduled re-crawl converts this clean snapshot into an ongoing compliance signal, alerting the moment any site in the panel changes its policy.

US Tech Automations automates this monitoring: scheduled robots.txt crawls, change-diff alerting, and a per-category policy dashboard that flags the moment a previously-open site adds a disallow directive — covering the Nonprofit panel and every other category in the corpus. Learn how the platform handles continuous AI-access monitoring.

This snapshot of Nonprofit sites is one slice of a wider dataset; read how many top websites block AI crawlers for the cross-industry view.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 834f1e2f07af24fd).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Nonprofit Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-nonprofit-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 834f1e2f07af24fd
