Who Blocks MistralAI-User? Just 12 of 107 Sites

Jun 13, 2026

As of June 13, 2026, only 12 of the 107 top sites that returned a parseable robots.txt file had blocked the MistralAI-User crawler — the lowest figure among all 12 operators tracked in this corpus. The next-lowest operator, Amazonbot, stands at 22 sites — nearly double the Mistral total.

That low number is not evidence that publishers are comfortable with the MistralAI-User crawler. It is more likely evidence of an awareness lag: site operators building their robots.txt blocklists have prioritized the highest-profile AI companies first, and Mistral has not yet reached the same policy attention threshold at most publishing organizations.

"Blocking" means a site's robots.txt explicitly names the MistralAI-User user-agent string and applies a Disallow directive to at least one path. It reflects stated intent, not technical enforcement.

Methodology and Data Integrity

This report is based on a point-in-time fetch of public robots.txt files from 122 prominent sites, sealed on June 13, 2026. Of those 122, 107 returned a parseable robots.txt. All counts in this report are verbatim from the sealed snapshot and reflect stated access policy as of that date only.

Nothing in this report is estimated, modeled, or extrapolated, and no percentages have been derived from sub-groups after the fact. The methodology treats each disallow rule as a binary presence: either the MistralAI-User user-agent appears with a disallow path in a site's robots.txt, or it does not.
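The binary-presence rule can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the pipeline used for the report: the function name `names_agent_with_disallow` is hypothetical, and the choice to ignore wildcard (`*`) groups and empty `Disallow:` values is one reading of "explicitly names."

```python
def names_agent_with_disallow(robots_txt: str, agent: str) -> bool:
    """Return True if a group that names `agent` carries a non-empty Disallow path.

    Hypothetical helper: wildcard groups do not count, because the rule
    described above requires the agent to be named explicitly.
    """
    target = agent.lower()
    group_agents = []   # user-agents named by the group currently being read
    in_rules = False    # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value and target in group_agents:
                return True
    return False


# Hypothetical robots.txt bodies, written for illustration only.
explicit = "User-agent: GPTBot\nUser-agent: MistralAI-User\nDisallow: /\n"
wildcard_only = "User-agent: *\nDisallow: /\n"
empty_rule = "User-agent: MistralAI-User\nDisallow:\n"

print(names_agent_with_disallow(explicit, "MistralAI-User"))       # True
print(names_agent_with_disallow(wildcard_only, "MistralAI-User"))  # False
print(names_agent_with_disallow(empty_rule, "MistralAI-User"))     # False
```

Under this reading, a site that blocks every crawler through a wildcard rule alone would not count as a MistralAI-User blocker, which matches the report's emphasis on explicit naming.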

robots.txt is an honor-system standard that communicates a site operator's stated intent. It is not a technical enforcement mechanism. These numbers will not change as sites later edit their files — the snapshot is sealed.

12 of 107 sites with parseable robots.txt files block MistralAI-User as of June 13, 2026 — the fewest of any operator in this corpus.

How Often MistralAI-User Is Refused

Mistral AI is tracked in this corpus under a single crawler user-agent: MistralAI-User. All 12 blocks are attributed to that one agent.

User-Agent | Sites Blocking
MistralAI-User | 12

Compare this to the broader corpus: 48 of 107 sites block at least one AI crawler of any kind. Mistral at 12 sites means that even among the sites actively managing AI-access policy, 36 of those 48 have not written a rule targeting MistralAI-User specifically.

Just 12 of 107 sites block MistralAI-User — 11.2% of the panel.

Mistral is the least-blocked of 12 tracked AI operators.

48 of 107 sites block at least one AI crawler.

MistralAI-User at 12 blockers is the lowest-blocked operator in this corpus — below Amazonbot at 22 and the cohere-ai crawler at 27.

Which Industries Block MistralAI-User

Mistral's category breakdown is the most concentrated in this corpus: only 4 categories block it at all, compared to 7 or 8 for higher-profile operators like Diffbot and the cohere-ai crawler. Tech actually leads the block count — an unusual pattern.

Category | Sites Blocking MistralAI-User
Tech | 6
News | 4
Government | 1
Retail | 1

Tech leads with 6 sites — more than News, which contributes only 4. Across other operators in this corpus, News almost always holds the top position. The reversal here likely reflects that tech-oriented publishers running broad-spectrum AI blocklists had MistralAI-User in their comprehensive deny lists, while many news publishers have not yet updated their robots.txt files to include it.

The 4 News blockers are usatoday.com, cnn.com, theatlantic.com, and vox.com. Notably absent from the News column: bbc.com, bloomberg.com, nytimes.com, forbes.com, washingtonpost.com, and newsweek.com — all of which appear as blockers for multiple other operators. The gap strongly suggests those publishers have not yet added MistralAI-User to their existing blocklists.

Entertainment, Social, Reference, and Travel — all present in Cohere and Diffbot category distributions — contribute zero blocks for Mistral. The 4-category footprint underscores how early-stage Mistral's policy recognition is compared to operators with a larger US-market presence.

Tech (6 sites) leads News (4 sites) in MistralAI-User blocks — the only operator in this corpus where Tech outranks News in the category distribution.

The Named Sites That Block MistralAI-User

All 12 blocking sites are identified in the sealed snapshot. Unlike the other operators in this batch, the full named list fits in a single table without selection.

Site | Category | Headline Crawlers Blocked (of 9)
usatoday.com | News | 9
cnn.com | News | 8
theatlantic.com | News | 8
wired.com | Tech | 8
arstechnica.com | Tech | 8
cnet.com | Tech | 8
zdnet.com | Tech | 8
mashable.com | Tech | 8
congress.gov | Government | 8
vox.com | News | 7
theverge.com | Tech | 7
amazon.com | Retail | 7

Every one of the 12 sites blocking MistralAI-User also blocks multiple other AI operators. None appear to be singling out Mistral — they are broad-spectrum blockers who have included MistralAI-User in their comprehensive deny lists.

usatoday.com sits at a headline score of 9, blocking all 9 headline AI crawlers tracked, so MistralAI-User is simply one entry in an exhaustive deny policy. The other 11 blocking sites all score 7 or 8 of 9, which also places them firmly in the broad-spectrum camp.
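The category and headline-score figures can be re-derived from the named-site table with a short tally. The sketch below copies the 12 rows from that table; the variable names are illustrative only.

```python
from collections import Counter

# (site, category, headline crawlers blocked out of 9), copied from the table above.
BLOCKERS = [
    ("usatoday.com", "News", 9),
    ("cnn.com", "News", 8),
    ("theatlantic.com", "News", 8),
    ("wired.com", "Tech", 8),
    ("arstechnica.com", "Tech", 8),
    ("cnet.com", "Tech", 8),
    ("zdnet.com", "Tech", 8),
    ("mashable.com", "Tech", 8),
    ("congress.gov", "Government", 8),
    ("vox.com", "News", 7),
    ("theverge.com", "Tech", 7),
    ("amazon.com", "Retail", 7),
]

# Count blocking sites per category and per headline score.
by_category = Counter(category for _, category, _ in BLOCKERS)
by_score = Counter(score for _, _, score in BLOCKERS)

print(len(BLOCKERS))                            # 12
print(by_category.most_common())                # Tech 6, News 4, then Government 1 and Retail 1
print(sorted(by_score.items(), reverse=True))   # [(9, 1), (8, 8), (7, 3)]
```

The tally reproduces the category table (Tech 6, News 4, Government 1, Retail 1) and shows that every blocker scores at least 7 of 9.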

The practical implication is that Mistral's low block count is almost certainly not a sign of publisher tolerance toward its data practices. It is a sign that most other publishers have not yet updated their robots.txt files with MistralAI-User. The 12 sites that have blocked it are precisely those running the most proactive, comprehensive AI-access policies in the corpus.

9 sites in this corpus block all 9 headline AI crawlers; usatoday.com is the only one of those 9 that also blocks MistralAI-User.

Corpus-Wide Access Policy Context

Of the 122 sites checked in this sealed snapshot, 107 returned a parseable robots.txt. Of those 107, 48 block at least one AI crawler of any kind. 20 sites publish an llms.txt or equivalent structured access file. Only 9 sites (8.4% of the 107 with a parseable robots.txt) block all 9 headline AI crawlers tracked.

Mistral at 12 blockers sits far below the field. The gap between Mistral's 12 and the 48-site any-AI-crawler total is made up of sites that actively restrict at least one operator but have not yet added a MistralAI-User rule. As Mistral's web-crawler footprint grows and webmaster awareness increases, those sites represent the most likely source of future block count growth.

Comparing across the 12 operators: Amazonbot at 22 sites already has nearly double Mistral's count. The cohere-ai crawler at 27 and Diffbot at 30 sit substantially higher. The trajectory for MistralAI-User is most likely toward higher block rates as awareness grows; the 12-site baseline is a starting point, not a steady state.
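The shares and gaps quoted in this section follow directly from the published counts. A short worked calculation, using only figures stated in this report (the `share` helper and variable names are illustrative):

```python
PARSEABLE_SITES = 107      # sites that returned a parseable robots.txt
ANY_AI_BLOCKERS = 48       # sites blocking at least one AI crawler
ALL_NINE_BLOCKERS = 9      # sites blocking all 9 headline crawlers
BLOCKS = {"MistralAI-User": 12, "Amazonbot": 22, "cohere-ai": 27, "Diffbot": 30}


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)


mistral_share = share(BLOCKS["MistralAI-User"], PARSEABLE_SITES)
all_nine_share = share(ALL_NINE_BLOCKERS, PARSEABLE_SITES)
gap_to_any_ai = ANY_AI_BLOCKERS - BLOCKS["MistralAI-User"]
amazon_ratio = round(BLOCKS["Amazonbot"] / BLOCKS["MistralAI-User"], 2)

print(mistral_share)    # 11.2
print(all_nine_share)   # 8.4
print(gap_to_any_ai)    # 36
print(amazon_ratio)     # 1.83
```

The 36-site gap is the pool of sites that already restrict at least one AI operator but have no MistralAI-User rule, and the 1.83 ratio is what "nearly double" refers to.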

Put This Data to Work

For a data or content engineering team evaluating Mistral-powered retrieval products, the 12-site block count is a best-case snapshot that is unlikely to stay this low. The sites that have blocked MistralAI-User are concentrated among the most comprehensive AI-policy actors in the corpus, which means that as awareness grows, the next wave of adopters will likely include mid-tier news publishers, entertainment platforms, and reference sites that already block other operators.

US Tech Automations recommends treating the current 12-site baseline as a starting point for a monitoring workflow, not a steady state. A scheduled robots.txt fetch against target content domains (run monthly, diffed against a baseline, and alerting on any MistralAI-User rule additions) gives a leading indicator of access policy shifts before they affect retrieval quality.

US Tech Automations builds and maintains exactly these kinds of policy-monitoring pipelines. The workflow is straightforward: fetch, diff, parse for specific user-agent rules, alert. The value is organizational: retrieval teams learn about policy changes from a dashboard, not from a degraded evaluation run months after the change happened.
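The fetch, diff, parse, alert loop described above might look roughly like the following. This is a sketch under assumptions, not the production pipeline: the function names (`fetch_robots`, `agent_rules`, `new_rules_alert`) are hypothetical, the alert is just a returned string, and scheduling and baseline storage are left out.

```python
import urllib.request
from typing import Optional

AGENT = "MistralAI-User"  # user-agent string named in this report


def fetch_robots(domain: str, timeout: float = 10.0) -> str:
    """Fetch step: download a domain's robots.txt (errors propagate to the caller)."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def agent_rules(robots_txt: str, agent: str = AGENT) -> set:
    """Parse step: collect Allow/Disallow lines from groups that name `agent`."""
    target = agent.lower()
    rules = set()
    group_agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if target in group_agents:
                rules.add(f"{field}: {value}")
    return rules


def new_rules_alert(domain: str, baseline_txt: str, current_txt: str) -> Optional[str]:
    """Diff and alert steps: report rules for AGENT that the baseline lacked."""
    added = agent_rules(current_txt) - agent_rules(baseline_txt)
    if not added:
        return None
    return f"{domain}: new {AGENT} rules: {', '.join(sorted(added))}"


# Hypothetical baseline and current files for an example domain (no network call).
BASELINE = "User-agent: GPTBot\nDisallow: /\n"
CURRENT = BASELINE + "\nUser-agent: MistralAI-User\nDisallow: /\n"
alert = new_rules_alert("example.com", BASELINE, CURRENT)
print(alert)  # example.com: new MistralAI-User rules: disallow: /
```

In a scheduled job, `fetch_robots` would supply the current text and the previous month's copy would serve as the baseline; comparing parsed rule sets rather than raw text avoids alerting on unrelated edits to the file.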

For an enterprise team planning a Mistral-based RAG product against publicly crawled content, now is the right time to set up that monitoring layer — while the block count is low and the baseline is clean.

Frequently Asked Questions

Q: Why does Mistral have the lowest block rate among the 12 operators tracked?

A: The most likely explanation is an awareness lag. Site operators building AI-access blocklists have prioritized the highest-profile US-headquartered operators first. The 12 sites that do block MistralAI-User are the most proactive broad-spectrum blockers in the corpus — they block 7 to 9 of the 9 headline AI crawlers tracked.

Q: Does a low block count mean Mistral has better access to web content?

A: In practical terms as of June 13, 2026 — yes. Fewer sites have explicitly restricted MistralAI-User than any other tracked operator. But that gap is expected to close as awareness grows, particularly among publishers who already block comparable operators. The 12-site baseline is a starting point.

Q: Does blocking MistralAI-User prevent Mistral from using content?

A: Only under the honor-system robots.txt protocol. A compliant crawler will respect the directive; the standard does not enforce compliance technically or legally. Stating the intent in robots.txt is the first step — technical enforcement, if desired, requires additional measures.
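To make the honor-system point concrete, here is a minimal sketch of the check a compliant crawler would run before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `is_refused` helper, and the example URL are hypothetical; nothing in the protocol forces a crawler to run this check or to obey its result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: MistralAI-User
Disallow: /

User-agent: *
Allow: /
"""


def is_refused(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the robots.txt text asks `agent` not to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


page = "https://example.com/article"
mistral_refused = is_refused(SAMPLE_ROBOTS_TXT, "MistralAI-User", page)
other_refused = is_refused(SAMPLE_ROBOTS_TXT, "SomeOtherBot", page)
print(mistral_refused, other_refused)  # True False
```

A crawler that skips this check still receives the page unless the server separately refuses the request, which is why robots.txt expresses intent rather than enforcement.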

Q: Is this data current?

A: This report is a point-in-time snapshot sealed on June 13, 2026 (snapshot sha 741353c4304216ee). The figures reflect robots.txt policy as of that date; they will not be updated as sites modify their files.

Q: How does Mistral compare to other operators in this study?

A: MistralAI-User at 12 blockers is the fewest of the 12 operators tracked. The next-lowest is Amazonbot at 22 sites. At the corpus level, 48 of 107 sites block at least one AI crawler of any kind, so Mistral's 12 amount to exactly a quarter of that total.

Q: What categories account for zero MistralAI-User blocks?

A: Entertainment, Social, Reference, and Travel all contribute zero blocks for MistralAI-User as of June 13, 2026. Those same categories each contribute at least 1 block for the cohere-ai crawler, underscoring how early-stage Mistral's policy recognition is relative to other operators.

Key Takeaways

  • Only 12 of 107 sites with parseable robots.txt files block MistralAI-User as of June 13, 2026 — the lowest figure among all 12 operators in this corpus.

  • Tech (6 sites) leads News (4 sites) in Mistral's category distribution — an unusual reversal from the pattern seen across every other operator in this batch.

  • All 12 blockers are broad-spectrum AI-access restrictors; none appear to be singling out Mistral specifically.

  • The low count most likely reflects an awareness lag, not publisher comfort with Mistral's data practices — the trajectory is toward higher block rates as the operator gains profile.

  • 48 of 107 sites block at least one AI crawler; MistralAI-User at 12 sits far below the field, amounting to exactly a quarter of that any-AI-crawler total.

  • Entertainment, Social, Reference, and Travel each contribute zero MistralAI-User blocks, leaving a 4-category footprint (Tech, News, Government, Retail) that is smaller than that of any other operator in this corpus.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Who Blocks MistralAI-User? Just 12 of 107 Sites.” https://ustechautomations.com/resources/blog/who-blocks-mistral-ai-crawler-2026

Sealed snapshot sha256: 741353c4304216ee

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.