Who Blocks MistralAI-User? Just 12 of 107 Sites
As of June 13, 2026, only 12 of the 107 top sites that returned a parseable robots.txt file had blocked the MistralAI-User crawler — the lowest figure among all 12 operators tracked in this corpus. The next-lowest operator, Amazonbot, stands at 22 sites — nearly double the Mistral total.
That low number is not evidence that publishers are comfortable with the MistralAI-User crawler. It is more likely evidence of an awareness lag: site operators building their robots.txt blocklists have prioritized the highest-profile AI companies first, and Mistral has not yet reached the same policy attention threshold at most publishing organizations.
"Blocking" means a site's robots.txt explicitly names the MistralAI-User user-agent and attaches a Disallow rule covering at least one path. It reflects stated intent, not technical enforcement.
Methodology and Data Integrity
This report is based on a point-in-time fetch of public robots.txt files from 122 prominent sites, sealed on June 13, 2026. Of those 122, 107 returned a parseable robots.txt. All counts in this report are verbatim from the sealed snapshot and reflect stated access policy as of that date only.
Nothing is estimated, modeled, or extrapolated, and no percentages have been derived from sub-groups after the fact. The methodology treats each disallow rule as a binary presence: either the MistralAI-User user-agent appears with a disallow path in a site's robots.txt, or it does not.
robots.txt is an honor-system standard that communicates a site operator's stated intent. It is not a technical enforcement mechanism. These numbers will not change as sites later edit their files — the snapshot is sealed.
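The binary presence test can be sketched in Python. This is a minimal illustration, not the pipeline used for this report: the function name `names_agent_with_disallow` and the sample file are hypothetical, and the parser assumes the usual robots.txt group layout (one or more User-agent lines followed by rule lines). A wildcard `User-agent: *` group is deliberately not counted, because the definition requires the agent to be named explicitly.

```python
def names_agent_with_disallow(robots_txt: str, agent: str = "MistralAI-User") -> bool:
    """Return True if a group that names `agent` carries a non-empty Disallow rule."""
    target = agent.lower()
    group_agents = []   # user-agents named by the group currently being read
    in_rules = False    # True once the current group has seen an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rule lines starts a new group.
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            # An empty "Disallow:" value restricts nothing, so it does not count.
            if field == "disallow" and value and target in group_agents:
                return True
    return False


# Hypothetical sample file, for illustration only.
SAMPLE = """
User-agent: GPTBot
User-agent: MistralAI-User
Disallow: /

User-agent: *
Disallow: /private/
"""

print(names_agent_with_disallow(SAMPLE))
print(names_agent_with_disallow("User-agent: *\nDisallow: /\n"))
```

The first call reports a block because MistralAI-User shares a group with a `Disallow: /` rule; the second does not, because only the wildcard group is restricted.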
12 of 107 sites with parseable robots.txt files block MistralAI-User as of June 13, 2026 — the fewest of any operator in this corpus.
How Often MistralAI-User Is Refused
Mistral AI is tracked in this corpus under a single crawler user-agent: MistralAI-User. All 12 blocks are attributed to that one agent.
| User-Agent | Sites Blocking |
|---|---|
| MistralAI-User | 12 |
Compare this to the broader corpus: 48 of 107 sites block at least one AI crawler of any kind. Mistral at 12 sites means that even among the sites actively managing AI-access policy, most (36 of the 48) have not written a rule targeting MistralAI-User specifically.
Only 12 of 107 sites with a parseable robots.txt had blocked MistralAI-User as of June 13, 2026 — making Mistral the least-blocked of the 12 AI operators tracked in this corpus.
Just 12 of 107 sites block MistralAI-User — 11.2% of the panel.
Mistral is the least-blocked of 12 tracked AI operators.
48 of 107 sites block at least one AI crawler.
MistralAI-User, at 12 blockers, is the least-blocked operator in this corpus, below Amazonbot at 22 and the cohere-ai crawler at 27.
Which Industries Block MistralAI-User
Mistral's category breakdown is the most concentrated in this corpus: only 4 categories block it at all, compared to 7 or 8 for higher-profile operators like Diffbot and the cohere-ai crawler. Tech actually leads the block count — an unusual pattern.
| Category | Sites Blocking MistralAI-User |
|---|---|
| Tech | 6 |
| News | 4 |
| Government | 1 |
| Retail | 1 |
Tech leads with 6 sites — more than News, which contributes only 4. Across other operators in this corpus, News almost always holds the top position. The reversal here likely reflects that tech-oriented publishers running broad-spectrum AI blocklists had MistralAI-User in their comprehensive deny lists, while many news publishers have not yet updated their robots.txt files to include it.
The 4 News blockers are usatoday.com, cnn.com, theatlantic.com, and vox.com. Notably absent from the News column: bbc.com, bloomberg.com, nytimes.com, forbes.com, washingtonpost.com, and newsweek.com — all of which appear as blockers for multiple other operators. The gap strongly suggests those publishers have not yet added MistralAI-User to their existing blocklists.
Entertainment, Social, Reference, and Travel — all present in Cohere and Diffbot category distributions — contribute zero blocks for Mistral. The 4-category footprint underscores how early-stage Mistral's policy recognition is compared to operators with a larger US-market presence.
Tech (6 sites) leads News (4 sites) in MistralAI-User blocks — the only operator in this corpus where Tech outranks News in the category distribution.
The Named Sites That Block MistralAI-User
All 12 blocking sites are identified in the sealed snapshot. Unlike the lists for the other operators in this batch, Mistral's is short enough to fit in a single table without selection.
| Site | Category | Headline Crawlers Blocked (of 9) |
|---|---|---|
| usatoday.com | News | 9 |
| cnn.com | News | 8 |
| theatlantic.com | News | 8 |
| wired.com | Tech | 8 |
| arstechnica.com | Tech | 8 |
| cnet.com | Tech | 8 |
| zdnet.com | Tech | 8 |
| mashable.com | Tech | 8 |
| congress.gov | Government | 8 |
| vox.com | News | 7 |
| theverge.com | Tech | 7 |
| amazon.com | Retail | 7 |
Every one of the 12 sites blocking MistralAI-User also blocks multiple other AI operators. None appear to be singling out Mistral — they are broad-spectrum blockers who have included MistralAI-User in their comprehensive deny lists.
usatoday.com has a headline score of 9: it blocks all 9 headline AI crawlers tracked, so MistralAI-User is simply one entry in an exhaustive deny policy. The other 11 blocking sites all score 7 or 8 of 9, which also places them firmly in the broad-spectrum camp.
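The category and headline-score figures quoted in this report can be re-derived from the named-site table. The sketch below copies the 12 table rows into a Python list (the variable names are illustrative, not part of the published dataset) and tallies them.

```python
from collections import Counter

# (site, category, headline crawlers blocked out of 9), copied from the table.
BLOCKERS = [
    ("usatoday.com", "News", 9),
    ("cnn.com", "News", 8),
    ("theatlantic.com", "News", 8),
    ("wired.com", "Tech", 8),
    ("arstechnica.com", "Tech", 8),
    ("cnet.com", "Tech", 8),
    ("zdnet.com", "Tech", 8),
    ("mashable.com", "Tech", 8),
    ("congress.gov", "Government", 8),
    ("vox.com", "News", 7),
    ("theverge.com", "Tech", 7),
    ("amazon.com", "Retail", 7),
]

# Count blocking sites per category and per headline score.
by_category = Counter(category for _, category, _ in BLOCKERS)
by_score = Counter(score for _, _, score in BLOCKERS)

for category, count in by_category.most_common():
    print(f"{category}: {count}")
for score in sorted(by_score, reverse=True):
    print(f"{score} of 9: {by_score[score]} site(s)")
```

The tally gives Tech 6, News 4, Government 1, and Retail 1, with one site at 9 of 9, eight at 8, and three at 7.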
The practical implication is that Mistral's low block count is almost certainly not a sign of publisher tolerance toward its data practices. It is a sign that most publishers, including many that already block other operators, have not yet added MistralAI-User to their robots.txt files. The 12 sites that have blocked it all run proactive, comprehensive AI-access policies, each blocking at least 7 of the 9 headline crawlers.
9 sites in this corpus block all 9 headline AI crawlers; usatoday.com is the only one of those 9 that also blocks MistralAI-User.
Corpus-Wide Access Policy Context
Of the 122 sites checked in this sealed snapshot, 107 returned a parseable robots.txt. Of those 107, 48 block at least one AI crawler of any kind. 20 sites publish an llms.txt or equivalent structured access file. Only 9 sites (8.4% of the 107 with a parseable robots.txt) block all 9 headline AI crawlers tracked.
Mistral at 12 blockers sits far below the field. The gap between Mistral's 12 and the 48-site any-AI-crawler total is made up of sites that actively restrict at least one operator but have not yet added a MistralAI-User rule. As Mistral's web-crawler footprint grows and webmaster awareness increases, those sites represent the most likely source of future block count growth.
Comparing across the 12 operators: Amazonbot at 22 sites already has nearly double Mistral's count. The cohere-ai crawler at 27 and Diffbot at 30 sit substantially higher. The trajectory for MistralAI-User is almost certainly toward higher block rates as awareness grows; the 12-site baseline is a starting point, not a steady state.
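The percentages and the "nearly double" comparison in this report follow from the raw counts by simple arithmetic. The sketch below reproduces them against the 107-site panel; the `share` helper is a hypothetical name, and the 44.9% figure is derived here rather than quoted from the snapshot.

```python
PARSEABLE_SITES = 107  # sites that returned a parseable robots.txt

COUNTS = {
    "block MistralAI-User": 12,
    "block at least one AI crawler": 48,
    "block all 9 headline AI crawlers": 9,
}


def share(count: int, total: int = PARSEABLE_SITES) -> str:
    """Format count/total as a percentage with one decimal place."""
    return f"{100 * count / total:.1f}%"


for label, count in COUNTS.items():
    print(f"{count} of {PARSEABLE_SITES} {label}: {share(count)}")

# Amazonbot (22 sites) relative to MistralAI-User (12 sites).
ratio = 22 / 12
print(f"Amazonbot / MistralAI-User: {ratio:.2f}x")
```

The shares come out at 11.2%, 44.9%, and 8.4%, and the Amazonbot ratio at about 1.83, which is what "nearly double" refers to.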
Put This Data to Work
For a data or content engineering team evaluating Mistral-powered retrieval products, the 12-site block count is a best-case figure that is unlikely to stay this low. The sites that have blocked MistralAI-User are concentrated among the most comprehensive AI-policy actors in the corpus, which means that as awareness grows, the next wave of blockers will likely include mid-tier news publishers, entertainment platforms, and reference sites that already block other operators.
US Tech Automations recommends treating the current 12-site baseline as the starting point for a monitoring workflow, not as a steady state. A scheduled robots.txt fetch against target content domains (run monthly, diffed against a baseline, alerting on any MistralAI-User rule additions) gives a leading indicator of access policy shifts before they affect retrieval quality.
US Tech Automations builds and maintains exactly these kinds of policy-monitoring pipelines. The workflow is straightforward: fetch, diff, parse for specific user-agent rules, alert. The value is organizational: retrieval teams learn about policy changes from a dashboard, not from a degraded evaluation run months after the change happened.
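The fetch, diff, parse, alert loop can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the production pipeline: `fetch_robots`, `blocks_agent`, and `newly_blocking` are hypothetical names, the baseline is assumed to be a simple mapping of domain to "blocked at baseline", and error handling, scheduling, and alert delivery are left out.

```python
import urllib.request
from typing import Callable, Dict, List

AGENT = "MistralAI-User"


def fetch_robots(domain: str, timeout: float = 10.0) -> str:
    """Fetch https://<domain>/robots.txt as text (requires network access)."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def blocks_agent(robots_txt: str, agent: str = AGENT) -> bool:
    """True if a group naming `agent` has a non-empty Disallow rule."""
    target, group, in_rules = agent.lower(), [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                 # rule lines ended the previous group
                group, in_rules = [], False
            group.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value and target in group:
                return True
    return False


def newly_blocking(
    baseline: Dict[str, bool],
    fetch: Callable[[str], str] = fetch_robots,
) -> List[str]:
    """Return domains that did not block AGENT at baseline but do now."""
    alerts = []
    for domain, blocked_before in baseline.items():
        blocked_now = blocks_agent(fetch(domain))
        if blocked_now and not blocked_before:
            alerts.append(domain)
    return alerts


# Offline demonstration with a stub fetcher and made-up files.
FAKE_FILES = {
    "example.com": "User-agent: MistralAI-User\nDisallow: /\n",
    "example.org": "User-agent: *\nDisallow: /tmp/\n",
}
print(newly_blocking({"example.com": False, "example.org": False},
                     fetch=FAKE_FILES.__getitem__))
```

Passing the fetcher as a parameter keeps the diff logic testable without network access; a scheduled job would call `newly_blocking` with the default `fetch_robots` and route any returned domains to whatever alerting channel the team already uses.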
For an enterprise team planning a Mistral-based RAG product against publicly crawled content, now is the right time to set up that monitoring layer — while the block count is low and the baseline is clean.
Frequently Asked Questions
Q: Why does Mistral have the lowest block rate among the 12 operators tracked?
A: The most likely explanation is an awareness lag. Site operators building AI-access blocklists have prioritized the highest-profile US-headquartered operators first. The 12 sites that do block MistralAI-User are among the most proactive broad-spectrum blockers in the corpus: each blocks 7 to 9 of the 9 headline AI crawlers tracked.
Q: Does a low block count mean Mistral has better access to web content?
A: In practical terms as of June 13, 2026 — yes. Fewer sites have explicitly restricted MistralAI-User than any other tracked operator. But that gap is expected to close as awareness grows, particularly among publishers who already block comparable operators. The 12-site baseline is a starting point.
Q: Does blocking MistralAI-User prevent Mistral from using content?
A: Only under the honor-system robots.txt protocol. A compliant crawler will respect the directive; the standard does not enforce compliance technically or legally. Stating the intent in robots.txt is the first step — technical enforcement, if desired, requires additional measures.
Q: Is this data current?
A: This report is a point-in-time snapshot sealed on June 13, 2026 (snapshot sha 741353c4304216ee). The figures reflect robots.txt policy as of that date; they will not be updated as sites modify their files.
Q: How does Mistral compare to other operators in this study?
A: MistralAI-User at 12 blockers is the fewest of the 12 operators tracked. The next-lowest is Amazonbot at 22 sites. At the corpus level, 48 of 107 sites block at least one AI crawler of any kind — Mistral at 12 represents well under a third of that total.
Q: What categories account for zero MistralAI-User blocks?
A: Entertainment, Social, Reference, and Travel all contribute zero blocks for MistralAI-User as of June 13, 2026. Those same categories each contribute at least 1 block for the cohere-ai crawler, underscoring how early-stage Mistral's policy recognition is relative to other operators.
Key Takeaways
Only 12 of 107 sites with parseable robots.txt files block MistralAI-User as of June 13, 2026 — the lowest figure among all 12 operators in this corpus.
Tech (6 sites) leads News (4 sites) in Mistral's category distribution — an unusual reversal from the pattern seen across every other operator in this batch.
All 12 blockers are broad-spectrum AI-access restrictors; none appear to be singling out Mistral specifically.
The low count most likely reflects an awareness lag, not publisher comfort with Mistral's data practices — the trajectory is toward higher block rates as the operator gains profile.
48 of 107 sites block at least one AI crawler; MistralAI-User at 12 sits far below the field, representing well under a third of that any-AI-crawler total.
Entertainment, Social, Reference, and Travel each contribute zero MistralAI-User blocks, leaving Mistral with a 4-category footprint (Tech, News, Government, Retail) that is smaller than any other operator's in this corpus.
Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).
Cite this report
US Tech Automations Research, 2026-06 edition. “Who Blocks MistralAI-User? Just 12 of 107 Sites.” https://ustechautomations.com/resources/blog/who-blocks-mistral-ai-crawler-2026
Sealed snapshot sha256: 741353c4304216ee