Do Government Sites Block AI Crawlers? Sealed Data

Jun 13, 2026

Government websites hold some of the most authoritative public data on the internet: legislation, tax guidance, securities filings, census records, weather data, scientific research, and patent databases. These are sources that AI systems actively seek out for grounding, citation, and knowledge retrieval. The question of whether government agencies restrict AI-crawler access is therefore a meaningful one for anyone building or monitoring AI-powered information pipelines.

The sealed data from June 2026 gives a clear answer: government is the most open category in this corpus.

Only 1 of 8 Government sites with a parseable robots.txt blocks any AI crawler — a 12.5% rate.

That is the lowest block rate across all 10 categories in this Closing Web edition. It sits far below the corpus-wide rate of 44.9% (48 of 107 sites) and reflects a consistent, near-universal posture of AI-crawler openness across the US federal sites sampled.

All data in this report comes from a sealed snapshot of public robots.txt files. Nothing is estimated, modeled, or extrapolated. Every figure is a direct verbatim count from the snapshot sealed June 13, 2026 (sha 741353c4304216ee).

What the Government Data Shows

Of the 9 Government sites checked, 8 returned a parseable robots.txt. One site — weather.gov — returned no robots.txt at all. Of the 8 parseable sites, exactly 1 blocks any AI crawler.

Metric	Count
Government sites checked	9
Sites with parseable robots.txt	8
Sites blocking at least one AI crawler	1
Block rate	12.5%
Sites with no robots.txt	1

The sole blocker is congress.gov. The 7 non-blockers are usa.gov, irs.gov, sec.gov, whitehouse.gov, census.gov, nasa.gov, and uspto.gov.

7 of 8 Government sites with a parseable robots.txt impose no restrictions on any known AI crawler.

No Government site in this corpus maintains an llms.txt file. That absence contrasts with categories like Finance (schwab.com, paypal.com, coinbase.com) and Education (coursera.org, edx.org, khanacademy.org, duolingo.com), where several sites have voluntarily published AI-access declarations. Government agencies have not adopted that convention.

Government has the lowest AI-crawler block rate of all 10 categories in this corpus — 12.5%.

The Sole Blocker: congress.gov

congress.gov is the sole Government site in this corpus that restricts any AI crawler. It hosts the full text of US legislation, congressional records, bill status, and member information — one of the most comprehensive publicly funded legal databases in the world.

The decision to restrict AI crawlers from congress.gov is notable precisely because the data is public information, funded by taxpayers and meant to be accessible. The restriction is therefore unlikely to be a secrecy measure and more plausibly a crawl-management choice: agencies may restrict certain automated access patterns to manage server load, to prevent bulk scraping, or to assert some control over how the content enters training pipelines.

What the sealed data records is the binary fact: congress.gov returns at least one Disallow directive covering at least one known AI-crawler bot string. The specific bots named are not broken out per-site in this sealed dataset.

The Non-Blocker Majority: Seven Federal Agencies

Seven federal sites — usa.gov, irs.gov, sec.gov, whitehouse.gov, census.gov, nasa.gov, and uspto.gov — impose no AI-crawler restrictions in their robots.txt. This group spans the breadth of what the federal government publishes online.

irs.gov hosts tax forms, publication libraries, and filing guidance. sec.gov hosts financial disclosures, EDGAR filings, and enforcement records. census.gov hosts demographic data, economic surveys, and population statistics. nasa.gov hosts scientific data, mission documentation, and research publications. uspto.gov hosts patent records and trademark filings.

The open posture of these agencies is consistent with the principle that publicly funded government information should be widely accessible. That principle predates AI crawlers: it underlies the Freedom of Information Act, the data.gov initiative, and the general practice of publishing federal data without access paywalls. Extending that openness to AI crawlers may not be a deliberate new policy so much as the default application of an existing value.

whitehouse.gov and usa.gov similarly impose no restrictions. Both serve largely as navigation hubs and official communication portals for federal information; restricting AI access would reduce the discoverability of official government positions, which would be counter to the communications function of those sites.

weather.gov serves no robots.txt at all, meaning crawlers receive no machine-readable guidance from that domain.

Cross-Category Rankings

Government ranks last of 10 categories in AI-crawler block rate.

Category	Sites Checked	With robots.txt	Any Blocker	Block Rate
News	20	17	14	82.4%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Reference	14	11	6	54.5%
Social	10	10	4	40.0%
Travel	9	9	3	33.3%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%

The contrast between Government (12.5%) and the top category, News (82.4%), captures the width of the spectrum in this corpus. News publishers have broadly moved to protect original journalism from AI training; government agencies have broadly left their public information accessible to crawlers.

Education (14.3%) and Government (12.5%) are numerically close: each has exactly 1 blocking site in its parseable set. The gap comes only from the denominators. Education has 7 parseable sites (1/7 = 14.3%) and Government has 8 (1/8 = 12.5%), so a single site changing its policy would reorder the two. Both categories represent the open end of the AI-access spectrum.
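The rates in the table can be re-derived from its own counts. The following is a minimal Python sketch of that arithmetic; the names `ROWS` and `block_rate` are illustrative and not part of the report's tooling, and the rounding to one decimal place is assumed from how the table is printed.

```python
# Rows copied from the cross-category table:
# (category, sites checked, with parseable robots.txt, blocking any AI crawler)
ROWS = [
    ("News", 20, 17, 14),
    ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6),
    ("Reference", 14, 11, 6),
    ("Social", 10, 10, 4),
    ("Travel", 9, 9, 3),
    ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1),
    ("Government", 9, 8, 1),
]


def block_rate(blockers: int, parseable: int) -> float:
    """Blocking sites as a percentage of parseable sites, one decimal place."""
    return round(100 * blockers / parseable, 1)


# Per-category rates; the denominator is parseable sites, not sites checked.
rates = {name: block_rate(blockers, parseable)
         for name, _checked, parseable, blockers in ROWS}

# Corpus-wide totals implied by the table.
total_checked = sum(row[1] for row in ROWS)
total_parseable = sum(row[2] for row in ROWS)
total_blockers = sum(row[3] for row in ROWS)
corpus_rate = block_rate(total_blockers, total_parseable)

print(rates["Education"], rates["Government"], corpus_rate)
```

The column totals come to 122 sites checked, 107 parseable, and 48 blocking, which matches the corpus-wide figures quoted elsewhere in this report.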

The Finance report tracks a similarly low block rate at 18.2%, reflecting that transactional and institutional sites across multiple sectors share this open posture.

Corpus-Wide Operator Leaderboard (All 107 Sites)

These counts span all 107 parseable sites in the corpus and are not specific to Government. They show which AI operators face the most resistance across the corpus.

AI Operator	Sites Blocking (of 107)
Common Crawl	40
Anthropic	39
ByteDance	37
OpenAI	35
Meta	35
Apple	31
Diffbot	30
Perplexity	29
Cohere	27
Google	25
Amazon	22
Mistral	12

Common Crawl leads the 12 operators tracked, with blocks on 40 of the 107 sites. Anthropic follows at 39 and ByteDance at 37. These counts include every category; because congress.gov is the only Government blocker, Government adds at most one site to any operator's total.
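A per-operator tally like the one above can be sketched as follows. This is an illustration under stated assumptions rather than the report's actual pipeline: the bot-to-operator mapping is a small subset chosen for the example (the report's 21 bot strings are not listed here), the site names are placeholders, and a site is assumed to count once per operator even if it blocks several of that operator's bots.

```python
from collections import Counter

# Illustrative subset of bot strings mapped to operators (lowercased).
# Assumption: the report's full list of 21 strings is not reproduced here.
BOT_OPERATOR = {
    "gptbot": "OpenAI",
    "chatgpt-user": "OpenAI",
    "claudebot": "Anthropic",
    "ccbot": "Common Crawl",
}


def operator_leaderboard(blocked_by_site: dict) -> list:
    """Count, per operator, how many sites block at least one of its bots."""
    counts = Counter()
    for bots in blocked_by_site.values():
        # Collapse bots to operators first so each site counts once per operator.
        operators = {BOT_OPERATOR[b] for b in bots if b in BOT_OPERATOR}
        counts.update(operators)
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Placeholder input, not data from the sealed snapshot.
toy = {
    "a.example": {"gptbot", "chatgpt-user", "ccbot"},
    "b.example": {"ccbot"},
    "c.example": set(),
}
print(operator_leaderboard(toy))
```

In the toy input, a.example blocks two OpenAI bots but adds only 1 to the OpenAI count, which is the per-site counting assumption described above.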

48 of 107 sites in the corpus block at least one AI crawler — a 44.9% rate.

Government at 12.5% sits far below that line. For teams building AI systems that rely on government data as a knowledge source, this corpus suggests the robots.txt layer is largely not a barrier in the current landscape.

Methodology

US Tech Automations Research fetched the robots.txt file for each of the 122 sites in the Closing Web corpus on June 13, 2026. Each response was categorized as parseable (a robots.txt file was returned and could be parsed), absent (no robots.txt file was served), or error. For the 107 parseable responses, we checked for 21 known AI-crawler bot strings across 12 operators. A site is counted as "blocking" if any Disallow directive covers "/" under any AI-crawler user-agent.
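The blocking rule described above can be sketched in a few lines of Python. This is a minimal reading of the rule, not the parser used for the report. It assumes that "covers /" means a literal `Disallow: /` line, that only groups naming an AI bot explicitly are considered (a wildcard `User-agent: *` group is ignored), and that the three bot strings in `AI_BOTS` stand in for the report's full list, which is not reproduced here.

```python
# Illustrative subset of AI-crawler bot strings, lowercased for matching.
AI_BOTS = {"gptbot", "claudebot", "ccbot"}


def blocked_ai_bots(robots_txt: str, bots=AI_BOTS) -> set:
    """Return the AI bots that have 'Disallow: /' in a group naming them."""
    blocked = set()
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # True once the current group has reached a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(a for a in agents if a in bots)
    return blocked


def is_blocking(robots_txt: str) -> bool:
    """A site counts as blocking if at least one AI bot is fully disallowed."""
    return bool(blocked_ai_bots(robots_txt))
```

Under these assumptions, a file that disallows only a sub-path for an AI bot, or that restricts only the wildcard group, is not counted as blocking. A stricter or looser reading of the rule would change those two cases.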

The snapshot is point-in-time and sealed — nothing is estimated, modeled, or extrapolated. weather.gov is excluded from the blocking rate because it returned no robots.txt file. The snapshot is sealed at sha 741353c4304216ee. All figures in this report are verbatim counts from that snapshot.
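The report does not describe how the seal is computed. One common way to produce such an identifier, shown here purely as an assumption, is to hash a canonical serialization of the snapshot with SHA-256 and keep a 16-character hex prefix. The function name `seal` and the snapshot layout are hypothetical.

```python
import hashlib
import json


def seal(snapshot: dict, length: int = 16) -> str:
    """Hash a snapshot deterministically and return a short hex prefix.

    Assumption: canonical JSON (sorted keys, compact separators) is used so
    that the same data always produces the same bytes, and so the same hash.
    """
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Placeholder snapshot: domain -> raw robots.txt text (not real data).
snapshot = {"a.example": "User-agent: *\nDisallow:\n", "b.example": ""}
print(seal(snapshot))
```

Any change to the stored robots.txt text changes the seal, which is what lets a reader confirm that published figures were computed from one fixed snapshot.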

No Government site in this corpus maintains an llms.txt file; that field returned empty for all 9 Government domains.

Who This Is For

This report is relevant for:

AI product and retrieval teams building systems that ground answers in government data and need to verify access policy
Compliance and legal teams at AI companies monitoring whether federal sites restrict crawler access
Government digital teams benchmarking their robots.txt posture against peer agencies
Policy researchers studying how open-government principles interact with AI-access conventions
SEO and data teams tracking policy changes in the government domain category

For organizations that rely on government data — regulatory filings, census records, patent databases — understanding the current robots.txt posture is the baseline. Knowing when that posture changes is the operational requirement.

Automating AI-Access Monitoring for Government Sites

Government sites may be open today, but robots.txt files are updated without announcement. A federal agency could add AI-crawler restrictions in response to policy guidance, legal interpretation, or administrative decision at any time. For teams whose data pipelines depend on government sources, a change to the robots.txt of irs.gov, sec.gov, or uspto.gov is a meaningful operational event.

US Tech Automations builds automation workflows that schedule robots.txt fetches across domain watchlists, parse changes in Disallow directives for specific bot strings, and route alerts to the relevant teams. For a compliance team at an AI company, the goal is not to check government robots.txt once — it is to monitor continuously and surface changes before they affect production systems.
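The change-detection step in such a workflow can be sketched as a comparison of two snapshots. The code below is a hypothetical illustration, not the US Tech Automations implementation: `PolicyChange` and `diff_watchlist` are invented names, the domains are placeholders, and each snapshot is assumed to be a mapping from domain to the set of AI bot strings that domain blocks. Scheduling, fetching, and alert routing are outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyChange:
    """One domain whose set of blocked AI bots differs between snapshots."""
    domain: str
    newly_blocked: frozenset
    newly_allowed: frozenset


def diff_watchlist(previous: dict, current: dict) -> list:
    """Compare two {domain: blocked bot set} snapshots and list the changes."""
    changes = []
    for domain in sorted(set(previous) | set(current)):
        # A domain missing from one snapshot is treated as blocking nothing.
        before = frozenset(previous.get(domain, ()))
        after = frozenset(current.get(domain, ()))
        if before != after:
            changes.append(PolicyChange(domain, after - before, before - after))
    return changes


# Hypothetical snapshots; these are not observed policies of any real site.
previous = {"agency-a.example": set(), "agency-b.example": {"ccbot"}}
current = {"agency-a.example": {"gptbot"}, "agency-b.example": {"ccbot"}}

for change in diff_watchlist(previous, current):
    print(change.domain, sorted(change.newly_blocked), sorted(change.newly_allowed))
```

Comparing sets of blocked bots rather than raw file text keeps alerts focused on policy changes, since cosmetic edits to a robots.txt file (comments, ordering) would not alter the sets.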

The same workflow framework that tracks Government sites applies equally to the Education sites in this corpus and to high-blocking categories like News, where 82.4% of sites with a parseable robots.txt block at least one AI crawler. The automation is category-agnostic; the value is in the persistent monitoring rather than the one-time snapshot.

For a RevOps or data-engineering team integrating government data sources into retrieval pipelines, the robots.txt layer is currently a green light. Maintaining visibility into when that changes is where automation earns its keep.

Key Takeaways

Of 9 Government sites checked, 8 returned a parseable robots.txt. Only 1 of those 8 blocks any AI crawler — a 12.5% rate.
The sole blocker is congress.gov.
The 7 non-blockers are usa.gov, irs.gov, sec.gov, whitehouse.gov, census.gov, nasa.gov, and uspto.gov.
weather.gov serves no robots.txt at all.
No Government site in this corpus maintains an llms.txt file.
Government ranks last of 10 categories — far below the 44.9% corpus-wide average.
The near-universal open posture is consistent with the principle that publicly funded government data should be widely accessible.

Government is the most AI-accessible category in the entire Closing Web corpus as of June 2026.

FAQ

Q: Why is congress.gov the only Government site that blocks AI crawlers?

A: The sealed data records the policy, not the stated reason. congress.gov hosts the complete text of US legislation and congressional records, a uniquely high-value knowledge corpus. It is possible that the Disallow directives exist to manage server load from bulk automated access, or to assert some control over how the content enters training pipelines. The other 7 Government sites cover tax guidance, financial disclosures, census data, scientific research, and patent records, all of which are served with no AI-crawler restrictions in robots.txt.

Q: Does an open robots.txt mean government content is free to use in AI training?

A: No. robots.txt signals whether the site operator permits a crawler to access the content. It does not address copyright, licensing, or terms of use. Works of the US federal government are generally not eligible for copyright protection under US law, but the specific legal status varies by content type, agency, and use case. robots.txt openness and legal authorization to use content for AI training are separate questions; robots.txt data tells you only about the crawl-access policy, not the downstream use rights.

Q: What does the absence of llms.txt across all Government sites mean?

A: No Government site in this corpus publishes an llms.txt file. llms.txt is a voluntary, emerging convention where sites declare AI-access preferences in a structured plain-text format. The absence of llms.txt does not indicate restriction; it may simply mean government agencies have not adopted a convention that is still nascent. For the 7 non-blockers, the unrestricted robots.txt is the stronger signal, and it points to openness.

Q: How does Government compare to Education in this corpus?

A: Both categories sit at the open end of the spectrum. Education has a 14.3% block rate (1 of 7 parseable sites blocking — coursera.org). Government has a 12.5% rate (1 of 8 parseable sites blocking — congress.gov). The structural drivers are similar: both sectors have historically valued broad public access to information. The Education report covers the full detail, including why the commercial model of coursera.org differentiates it from universities like mit.edu and harvard.edu.

Q: Could government robots.txt policy change?

A: Yes. robots.txt files are not static. Federal agencies update their files in response to policy decisions, administrative guidance, and legal interpretations. This snapshot is sealed June 13, 2026 — it reflects point-in-time policy. Future changes would not be reflected here. Monitoring for changes as they happen requires automated crawl-diff workflows, not periodic manual checks.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Government Sites Block AI Crawlers? Sealed Data.” https://ustechautomations.com/resources/blog/do-government-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 741353c4304216ee