AI-Crawler Blocking by Industry: News vs Retail (2026)

Jun 13, 2026

Does AI-crawler blocking depend on what business you're in? Dramatically. The sealed data shows news sites blocking AI crawlers at 82.4% while government sites sit at 12.5% — a gap of nearly 70 points between the most-closed and most-open industries. The decision to wall off AI is not a tech-wide consensus; it tracks almost perfectly with whether a site's content is its product.

This report groups a curated set of 122 prominent websites into 10 categories and measures, within each, the share that block at least one major AI crawler. A site counts as blocking when its robots.txt names a major AI bot with Disallow: /. Rates are computed over the sites in each category that returned a parseable robots.txt — so a category's denominator is its readable-policy sites, not its total. All counts are verbatim from files fetched and sealed point-in-time on June 13, 2026.
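The blocking rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline behind the report: the report does not enumerate its major AI bots, so the `ASSUMED_AI_BOTS` list and both function names are hypothetical, and the parser handles only the basic `User-agent` / `Disallow` grouping.

```python
# Hypothetical bot list: the report does not say which crawlers it counts as "major".
ASSUMED_AI_BOTS = {"gptbot", "google-extended", "ccbot", "claudebot"}


def blocked_ai_bots(robots_txt: str, bots=ASSUMED_AI_BOTS) -> set[str]:
    """Return the assumed AI bots that a robots.txt names with 'Disallow: /'."""
    blocked = set()
    group_agents = []   # user-agents of the group currently being read
    in_rules = False    # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                # Only bots named explicitly count; a wildcard "*" group does not.
                blocked.update(a for a in group_agents if a in bots)
    return blocked


def site_blocks_ai(robots_txt: str) -> bool:
    """A site counts as blocking when at least one assumed AI bot is fully disallowed."""
    return bool(blocked_ai_bots(robots_txt))
```

Under this reading, a file that only says `User-agent: *` with `Disallow: /` does not count as blocking AI crawlers, because it names no AI bot; whether the report treats that case the same way is an assumption.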

The Industry Ranking

Sorted by the share of readable-policy sites that block at least one major AI crawler:

Category	Sites	With robots.txt	Blocking ≥1 AI Crawler	Block Rate
News	20	17	14	82.4%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Reference	14	11	6	54.5%
Social	10	10	4	40.0%
Travel	9	9	3	33.3%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
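The block rates can be re-derived from the counts in the table. The sketch below copies the counts as published and divides blockers by readable-policy sites; the names `ROWS` and `block_rate` are illustrative only.

```python
# (category, total sites, sites with a readable robots.txt, sites blocking >= 1 AI crawler)
ROWS = [
    ("News", 20, 17, 14),
    ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6),
    ("Reference", 14, 11, 6),
    ("Social", 10, 10, 4),
    ("Travel", 9, 9, 3),
    ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1),
    ("Government", 9, 8, 1),
]


def block_rate(blocking: int, readable: int) -> float:
    """Share of readable-policy sites that block, as a percentage with one decimal."""
    return round(100 * blocking / readable, 1)


rates = {name: block_rate(blocking, readable) for name, _, readable, blocking in ROWS}
total_sites = sum(sites for _, sites, _, _ in ROWS)
gap = round(rates["News"] - rates["Government"], 1)

for name, rate in rates.items():
    print(f"{name:<14}{rate:>6.1f}%")
print(f"Total sites: {total_sites}, News vs Government gap: {gap} points")
```

The computed rates match the Block Rate column, the site counts sum to 122, and the gap between News and Government comes out at 69.9 points.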

The ranking sorts into three broad tiers. At the top, content-as-product industries (News at 82.4%, Tech at 69.2%, Entertainment at 66.7%) block aggressively. In the middle, Reference (54.5%) and Social (40%) hedge. At the bottom, the transactional and public-good industries (Finance at 18.2%, Retail at 16.7%, Education at 14.3%, Government at 12.5%) barely block at all. Travel (33.3%) falls between the middle and bottom tiers.

"News sites block AI crawlers at 82.4% — 14 of 17 with a readable robots.txt — versus 12.5% for government and 16.7% for retail, a near-70-point industry gap as of June 13, 2026."

Why News Blocks and Retail Doesn't

The pattern follows a single logic: industries block AI crawlers in direct proportion to how much their revenue depends on the content being the destination rather than a means to one.

For a news publisher, the article is the product. If an AI assistant ingests the reporting and answers the reader's question directly, the reader never visits, never sees an ad, never hits a subscription wall. That is an existential threat, and 82.4% of news sites have responded by naming AI crawlers and blocking them. Tech publishers (69.2%) and entertainment sites (66.7%) face the same dynamic — their words, reviews, and listings are the asset being substituted.

For a retailer, the calculus inverts. A retailer wants to be the answer an AI shopping assistant surfaces; being cited by an AI agent is a path to a sale, not a substitute for one. Blocking the crawler would mean disappearing from exactly the surface where future purchases get decided. So retail blocks at just 16.7%. Finance (18.2%) reasons similarly — being the recommended answer to "best high-yield savings account" is worth far more than protecting the page copy. Government (12.5%) and education (14.3%) round out the bottom because their mandate is public access; CDC, Census, and university pages are published precisely to be found and used, including by machines.

The two middle-tier categories are the most interesting because they are genuinely conflicted.

Reference sites land at 54.5% — 6 of 11. This category contains both the great public knowledge commons (which exist to be read by anything) and commercial reference brands (which monetize their compiled answers and want to protect them). That split pulls the category to the middle. Social sits at 40% — 4 of 10. User-generated-content platforms have the most to gain from being the corpus everyone trains on and the most to lose from giving away their users' contributions for free, and that tension shows up as a near-even split.

The takeaway from the middle tier is that "block or allow" is not a settled question even within an industry. It is being decided platform by platform, and the 40–55% band is where the live argument is happening. These are also the categories most worth watching over time: a shift in the reference or social rate would be an early indicator of where the broader consensus is heading, well before the more entrenched top and bottom tiers move.

The Named Sites Behind the Rates

The category rates are easier to trust when you can see the specific sites driving them. The sealed per-domain record names exactly who blocks and who does not.

At the top of the News tier, the blocking is near-universal and often near-total:

News site	Headline crawlers blocked (of 9)
bbc.com	9
bloomberg.com	9
usatoday.com	9
nytimes.com	8
cnn.com	8
forbes.com	8
washingtonpost.com	7
theguardian.com	7
newsweek.com	7

bbc.com, bloomberg.com, and usatoday.com block all nine headline crawlers, and nytimes.com, cnn.com, and forbes.com block eight of nine. Even reputationally cautious publishers like theguardian.com, washingtonpost.com, and newsweek.com block seven. The News category's 82.4% is not one or two outliers — it is the prevailing behavior of the segment.

Retail is the mirror image:

Retail site	Headline crawlers blocked (of 9)
ebay.com	8
amazon.com	7
walmart.com	0
target.com	0
bestbuy.com	0
homedepot.com	0
nike.com	0

Of the readable-policy retailers, only two block any headline crawler: amazon.com (which blocks seven) and ebay.com (which blocks eight). Every other major retailer in the set — walmart.com, target.com, bestbuy.com, homedepot.com, wayfair.com, ikea.com, nordstrom.com, etsy.com, and nike.com — blocks none. The two retail blockers are the two largest marketplaces, both of which run their own large ad and search businesses; the pure retailers stay wide open because being the answer an AI shopping assistant returns is worth more to them than protecting page copy.

Government and education make the openness even starker. Among the government domains in the set (cdc.gov, census.gov, nasa.gov, irs.gov, sec.gov, usa.gov, whitehouse.gov, uspto.gov, and congress.gov), only congress.gov blocks, and it blocks eight headline crawlers. That single blocker among the 8 readable-policy government sites produces the category's 12.5% rate. In education, MIT, Harvard, Stanford, edX, and Duolingo all block zero. Public-mission sites publish to be read by anything, and the data shows it.

What This Means If You Operate a Site

The industry pattern is a useful benchmark, but it is not a prescription. The right answer depends on your specific funnel:

If your content is your product (publishing, media, premium research), the 82.4% news rate reflects a defensive posture worth understanding — though even there, the choice is increasingly nuanced between blocking training crawlers and allowing search-grounding ones.
If you sell something the content points toward (retail, finance, services), the 16.7% retail and 18.2% finance rates reflect a deliberate choice to stay discoverable by AI assistants, because being the surfaced answer drives revenue.
If your mission is reach (government, education, nonprofits), low blocking aligns with the goal of maximum machine-readable availability.

Reading the Category Data Responsibly

Because the per-category samples are modest (between 7 and 17 readable-policy sites each), the rates are best treated as directional within this curated set rather than as precise population estimates. A single site changing its robots.txt moves a category's rate by roughly 6 to 14 points, depending on the category's size, which is exactly why a one-time snapshot is less valuable than a tracked trend. What gives the ranking its weight is not any individual rate but the shape of the whole distribution: the same logic (content-as-product blocks, content-as-funnel stays open) holds from the top of the ranking to the bottom. When a pattern is that consistent across separate segments, it is unlikely to be noise.
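To see how sensitive each rate is, the sketch below computes how far a category's block rate moves when one readable-policy site flips its policy. The readable-site counts come from the ranking table; the function name `one_site_swing` is illustrative.

```python
# Readable-policy site counts per category, copied from the ranking table.
READABLE = {
    "News": 17, "Tech": 13, "Entertainment": 9, "Reference": 11, "Social": 10,
    "Travel": 9, "Finance": 11, "Retail": 12, "Education": 7, "Government": 8,
}


def one_site_swing(readable_sites: int) -> float:
    """Percentage points the block rate moves if a single site flips its policy."""
    return round(100 / readable_sites, 1)


swings = {name: one_site_swing(n) for name, n in READABLE.items()}

for name, swing in sorted(swings.items(), key=lambda item: item[1]):
    print(f"{name:<14}{swing:>5.1f} points per site")
```

News, the largest category, moves 5.9 points per site, while Education, the smallest, moves 14.3 points, so a one-site change in a small category is larger than the gap between several adjacent rows in the ranking.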

It is also worth separating two questions that get conflated. "Does this category block AI?" and "Should I block AI?" are different. The category rate tells you what your peers are doing, which is useful context for a board conversation or a competitive scan. It does not tell you what is right for your specific funnel, your specific content, or your specific risk tolerance around training-data inclusion versus AI-answer discoverability. A retailer that copies the news playbook would be blocking itself out of the surfaces where its customers increasingly start their purchases; a premium publisher that copies the retail playbook might be giving away the reporting that is its only product. The benchmark is an input, not an instruction.

Put This Data to Work

The most valuable version of this dataset is not the snapshot — it is the trend within your category. When the news block rate climbs from 82.4% to 90%, or when a retail competitor breaks ranks and starts blocking, that movement is a strategic signal you want to catch early.

US Tech Automations builds exactly this kind of category-level monitoring for marketing and operations teams. An automation specialist can configure a workflow that tracks AI-crawler policy across your competitive set, rolls the results up by industry, and alerts the content or growth owner when your category's posture shifts. For a content strategist deciding whether to keep AI crawlers in or out, US Tech Automations can turn a one-time audit into a standing dashboard. And for an e-commerce or RevOps team that wants to be surfaced by AI assistants, US Tech Automations can verify on a schedule that your own robots.txt is not accidentally blocking the crawlers you most want to reach you — a more common and more costly mistake than teams realize.
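A self-check of the kind described above can be approximated with Python's standard-library `urllib.robotparser`. This is a rough sketch and not US Tech Automations' workflow: the `WANTED_AGENTS` list and the function name `accidentally_blocked` are assumptions chosen for illustration, and the check uses the parser's own matching rules, so a wildcard `User-agent: *` block counts here even though it names no AI bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical list of crawlers a site wants to stay reachable for.
WANTED_AGENTS = ["GPTBot", "Google-Extended"]


def accidentally_blocked(robots_txt: str, agents=WANTED_AGENTS, path="/") -> list[str]:
    """Return the wanted agents that this robots.txt would stop from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network fetch
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /\n"
    print(accidentally_blocked(robots))
```

In the example file, the wildcard rule shuts out every agent that lacks its own group, so Google-Extended is reported as blocked while GPTBot, which has an explicit `Allow: /`, is not. Running the same check on a schedule against the live file is one way to catch an unintended change.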

Frequently Asked Questions

Why do news sites block AI crawlers so much more than retailers?
Because a news article is the product itself — if an AI answers the reader directly, the publisher loses the visit and the revenue. A retailer's content points toward a purchase, so being surfaced by an AI assistant helps rather than substitutes. That difference produces the 82.4% versus 16.7% gap.

Are the category rates comparable given different sample sizes?
Each category's rate is computed over its own readable-policy sites, shown in the table. Sample sizes are modest (7–17 per category), so treat the rates as directional within this curated set, not as precise population estimates.

Could a low government block rate just mean those sites have no robots.txt?
No — the denominator is sites that returned a parseable robots.txt. Government's 12.5% is 1 of 8 sites that published a policy, so it reflects deliberate openness, not a missing file.

Does blocking hurt my SEO?
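Blocking AI-specific user agents like GPTBot or Google-Extended does not affect traditional search indexing. GPTBot is a separate crawler from the search bots, and Google-Extended is a robots.txt control token rather than a crawler, which Google documents as having no effect on Search inclusion or ranking. The trade-off is about AI-answer visibility and training inclusion, not classic search rank.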

Why do the two biggest retailers block when the rest don't?
In this set, amazon.com and ebay.com are the only retail blockers, and both run large in-house advertising and search businesses with their own reasons to control crawler access. The pure retailers (walmart.com, target.com, bestbuy.com, and others) stay open because being surfaced by an AI shopping assistant is a path to a sale, not a threat to one.

Where is the policy still genuinely unsettled?
In the middle tiers. Reference (54.5%) and Social (40%) split almost evenly between blocking and allowing, because they contain both content owners who want to protect compiled answers and platforms that benefit from being widely cited. Those are the categories where the next year of policy changes is most likely to show up.

Key Takeaways

AI-crawler blocking splits sharply by industry: News (82.4%), Tech (69.2%), and Entertainment (66.7%) block heavily; Government (12.5%), Education (14.3%), Retail (16.7%), and Finance (18.2%) barely do.
The dividing line is whether the content is the product (block) or points toward a transaction (stay open).
Reference (54.5%) and Social (40%) form a genuinely conflicted middle where the block-or-allow question is still being argued.
The benchmark is useful, but the right posture depends on your funnel — discoverability versus content protection.
All figures are verbatim counts from robots.txt sealed point-in-time on June 13, 2026, over a curated set of 122 prominent sites.

Cite this report

US Tech Automations Research, 2026-06 edition. “AI-Crawler Blocking by Industry: News vs Retail (2026).” https://ustechautomations.com/resources/blog/ai-crawler-blocking-by-industry-news-vs-retail-2026

Sealed snapshot sha256: 741353c4304216ee

About the Author

Garrett Mullins

Workflow Specialist

Helping businesses leverage automation for operational efficiency.
