Who Blocks Cohere's AI Crawler? 27 of 107 Sites

Jun 13, 2026

As of June 13, 2026, 27 of the 107 top sites that returned a parseable robots.txt file had blocked the cohere-ai crawler user-agent. That places the cohere-ai crawler in the mid-upper tier of the 12 AI operators tracked in this corpus — above Amazonbot (22 sites) but below Diffbot (30 sites).

"Blocking" means a site's robots.txt explicitly names the cohere-ai user-agent and includes a Disallow: / directive covering at least one path. It does not mean a technical firewall or legal restriction — only that the site has stated its access policy excludes this crawler.

Methodology and Data Integrity

This report is based on a point-in-time fetch of public robots.txt files from 122 prominent sites, sealed on June 13, 2026. Of those 122, 107 returned a parseable robots.txt. All counts in this report are verbatim from the sealed snapshot and reflect stated access policy as of that date only.

Nothing is estimated, modeled, or extrapolated, and no percentages have been derived from sub-groups after the fact. The methodology treats each disallow rule as a binary presence: either the cohere-ai user-agent appears with a disallow path, or it does not.

robots.txt is an honor-system standard that communicates a site operator's stated intent. It is not a technical enforcement mechanism. These numbers will not change as sites later edit their files — the snapshot is sealed.

How Often the cohere-ai Crawler Is Refused

Cohere is tracked in this corpus under a single crawler user-agent: cohere-ai. All 27 blocks are attributed to that one agent.

User-Agent	Sites Blocking
cohere-ai	27

For corpus context: 48 of 107 sites block at least one AI crawler of any kind, and 20 sites publish an llms.txt or equivalent structured access file. Cohere at 27 sites means roughly 1 in 4 of the most prominent web properties have stated their intent to opt out of its crawler — a meaningful access constraint for retrieval pipelines built on publicly crawled content.

27 of 107 sites block the cohere-ai crawler — 25.2% of the panel.

News leads Cohere blocks at 10 sites; Tech follows at 8.

9 of 107 sites block all 9 headline AI crawlers.

The 27 blocks break down as News (10), Tech (8), Entertainment (4), and 5 other categories with 1 block each.

Which Industries Block the cohere-ai Crawler

News (10) and Tech (8) together account for the majority of Cohere's 27 total blocks. That concentration follows the same pattern seen across higher-block-rate operators: outlets that treat text as a primary product have the strongest economic incentive to restrict automated extraction.

Category	Sites Blocking cohere-ai
News	10
Tech	8
Entertainment	4
Government	1
Retail	1
Reference	1
Social	1
Travel	1
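As a quick consistency check, the category counts can be summed and compared with the 27-site total. The sketch below copies the counts from the table; the variable names are illustrative only. It reports counts rather than percentages, in keeping with the methodology.

```python
# Category counts copied from the table above.
blocks_by_category = {
    "News": 10, "Tech": 8, "Entertainment": 4, "Government": 1,
    "Retail": 1, "Reference": 1, "Social": 1, "Travel": 1,
}

total = sum(blocks_by_category.values())
news_plus_tech = blocks_by_category["News"] + blocks_by_category["Tech"]
is_majority = news_plus_tech * 2 > total
single_block_categories = [c for c, n in blocks_by_category.items() if n == 1]

print(total)                         # 27
print(news_plus_tech, is_majority)   # 18 True
print(len(single_block_categories))  # 5
```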

News leads at 10 sites and Tech follows at 8, together the dominant share of all cohere-ai blocks. Entertainment publishers add 4 more — rollingstone.com, variety.com, hollywoodreporter.com, and billboard.com — music, film, and entertainment trade publications whose content libraries have licensing value independent of any AI product.

The remaining 5 categories contribute 1 block each, covering Government (congress.gov), Retail (amazon.com), Reference (investopedia.com), Social (linkedin.com), and Travel (tripadvisor.com).

The Travel entry is notable: tripadvisor.com appearing as a blocker shows that AI-access restrictions in this snapshot reach review and user-generated-content platforms, not only editorial publishers. A retrieval pipeline targeting hospitality or travel content should treat this as a meaningful data point, not an outlier.

The Government block (congress.gov) covers official legislative content. With 8 of 9 headline crawlers blocked, it reads as a broad-spectrum policy rather than an operator-specific one. For teams building retrieval pipelines that depend on structured news content, the 10-site News total is the most operationally significant figure in this distribution.

4 Entertainment publishers (rollingstone.com, variety.com, hollywoodreporter.com, and billboard.com) each block cohere-ai and score 8 of 9 on headline crawlers blocked, indicating broad-spectrum AI-access policies rather than targeting of Cohere specifically.

The Named Sites That Block the cohere-ai Crawler

All 27 blocking sites are identified in the sealed snapshot. The table below shows all 27, ordered by headline-crawlers-blocked count.

Site	Category	Headline Crawlers Blocked (of 9)
bbc.com	News	9
bloomberg.com	News	9
usatoday.com	News	9
nytimes.com	News	8
cnn.com	News	8
forbes.com	News	8
theatlantic.com	News	8
wired.com	Tech	8
arstechnica.com	Tech	8
cnet.com	Tech	8
zdnet.com	Tech	8
mashable.com	Tech	8
congress.gov	Government	8
rollingstone.com	Entertainment	8
variety.com	Entertainment	8
hollywoodreporter.com	Entertainment	8
billboard.com	Entertainment	8
newsweek.com	News	7
vox.com	News	7
theverge.com	Tech	7
amazon.com	Retail	7
linkedin.com	Social	7
tripadvisor.com	Travel	7
apnews.com	News	6
techcrunch.com	Tech	6
investopedia.com	Reference	4
gizmodo.com	Tech	3

The top three (bbc.com, bloomberg.com, and usatoday.com) each block all 9 headline AI crawlers tracked in the corpus. At those properties the cohere-ai crawler is caught by a blanket AI-access policy rather than by a decision targeting Cohere specifically.

At the lower end, investopedia.com scores 4 of 9 and gizmodo.com scores 3 of 9. Their presence in the cohere-ai blockers list despite not running comprehensive deny policies is worth flagging for teams building financial-domain or tech-news retrieval products. Financial content with high SEO value, such as definitions, explainers, and investment guides, is under explicit AI-access restriction at investopedia.com in this snapshot.
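The tiers described above can be tallied directly from the named-sites table. The sketch below copies the 27 rows as whitespace-separated text and counts them by headline score and by category; the names `TABLE`, `by_score`, and `by_category` are illustrative.

```python
from collections import Counter

# (site, category, headline crawlers blocked) rows copied from the table above.
TABLE = """
bbc.com News 9
bloomberg.com News 9
usatoday.com News 9
nytimes.com News 8
cnn.com News 8
forbes.com News 8
theatlantic.com News 8
wired.com Tech 8
arstechnica.com Tech 8
cnet.com Tech 8
zdnet.com Tech 8
mashable.com Tech 8
congress.gov Government 8
rollingstone.com Entertainment 8
variety.com Entertainment 8
hollywoodreporter.com Entertainment 8
billboard.com Entertainment 8
newsweek.com News 7
vox.com News 7
theverge.com Tech 7
amazon.com Retail 7
linkedin.com Social 7
tripadvisor.com Travel 7
apnews.com News 6
techcrunch.com Tech 6
investopedia.com Reference 4
gizmodo.com Tech 3
"""

rows = [line.split() for line in TABLE.strip().splitlines()]
by_score = Counter(int(score) for _, _, score in rows)
by_category = Counter(category for _, category, _ in rows)

# Sites per headline score, highest tier first.
for score in sorted(by_score, reverse=True):
    print(score, by_score[score])
```

The category tally from this table matches the category breakdown earlier in the report (News 10, Tech 8, Entertainment 4, and 1 each for the other five).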

The presence of tripadvisor.com extends the blocker profile beyond editorial content into review and catalog data. Comparing Cohere's named blockers to those of the Anthropic ClaudeBot crawler reveals substantial overlap among the broad-spectrum blockers, though the absolute counts differ.

Corpus-Wide Access Policy Context

Of the 122 sites checked in this sealed snapshot, 107 returned a parseable robots.txt. Of those 107, 48 block at least one AI crawler of any kind, and 20 publish an llms.txt or equivalent structured access file. Only 9 sites (8.4% of the 107) block all 9 headline AI crawlers tracked, the most restrictive tier.

Cohere at 27 sites sits clearly in the upper half of the 12 operators tracked. The gap between 27 and the 48 sites that block any AI crawler at all represents 21 sites that actively restrict some operators but have not yet added a cohere-ai rule. Whether those sites will do so in future updates is not something this snapshot can answer — it records stated policy on a single date.
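The derived figures quoted in this report follow from four counts. The short sketch below reproduces them; the variable names are illustrative, and the subtraction assumes every cohere-ai blocker is also counted among the 48 sites that block at least one AI crawler.

```python
# Counts taken from the sealed snapshot as reported in this article.
parseable_sites = 107
cohere_blockers = 27
any_ai_blockers = 48
all_nine_blockers = 9

cohere_share = cohere_blockers / parseable_sites
all_nine_share = all_nine_blockers / parseable_sites
# Sites that restrict some AI crawler but carry no cohere-ai rule.
no_cohere_rule = any_ai_blockers - cohere_blockers

print(f"{cohere_share:.1%}")    # 25.2%
print(f"{all_nine_share:.1%}")  # 8.4%
print(no_cohere_rule)           # 21
```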

The 9 sites that block all 9 headline AI crawlers (8.4% of the 107 parseable sites) represent the most restrictive access tier. In the named-sites table, three cohere-ai blockers (bbc.com, bloomberg.com, and usatoday.com) score 9 of 9, so 3 of Cohere's 27 blocks are shown coming from that tier. A further 14 sites score 8 of 9 and 6 score 7 of 9, which points to near-blanket AI-access policies rather than rules aimed at Cohere. Only the lower-scoring entries, such as investopedia.com (4 of 9) and gizmodo.com (3 of 9), name cohere-ai without a comprehensive deny policy, which suggests the crawler is recognizable enough to draw deliberate attention from at least a few publishers.

Put This Data to Work

Enterprise teams using Cohere retrieval APIs to build RAG pipelines over web-crawled content face a measurable access gap. 27 prominent sites — including major news sources and a travel platform — have declared their intent to opt out. For retrieval-pipeline engineers, understanding which source domains fall inside that 27-site block list is the first step in scoping retrieval reliability.
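One way to scope a source list against the 27-site block list is a simple host lookup. The sketch below is an assumption-laden illustration: `split_sources` and `host_of` are hypothetical helpers, a leading "www." is treated as the same site, and other subdomains are not assumed to be covered because robots.txt applies per host.

```python
from urllib.parse import urlparse

# The 27 cohere-ai blockers named in this report's sealed snapshot.
COHERE_BLOCKERS = frozenset({
    "bbc.com", "bloomberg.com", "usatoday.com", "nytimes.com", "cnn.com",
    "forbes.com", "theatlantic.com", "wired.com", "arstechnica.com",
    "cnet.com", "zdnet.com", "mashable.com", "congress.gov",
    "rollingstone.com", "variety.com", "hollywoodreporter.com",
    "billboard.com", "newsweek.com", "vox.com", "theverge.com",
    "amazon.com", "linkedin.com", "tripadvisor.com", "apnews.com",
    "techcrunch.com", "investopedia.com", "gizmodo.com",
})


def host_of(url: str) -> str:
    """Lower-cased host with a leading 'www.' removed (URL must include a scheme)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def split_sources(urls):
    """Split URLs into those on the block list and those not listed."""
    blocked, unlisted = [], []
    for url in urls:
        (blocked if host_of(url) in COHERE_BLOCKERS else unlisted).append(url)
    return blocked, unlisted


blocked, unlisted = split_sources([
    "https://www.bbc.com/news",
    "https://example.org/article",
])
print(blocked)   # ['https://www.bbc.com/news']
print(unlisted)  # ['https://example.org/article']
```

An "unlisted" result only means the host is not among the 27 named blockers; it says nothing about sites outside the 107-site panel.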

US Tech Automations builds automated robots.txt monitoring workflows that track per-operator access policy across target content domains. When a site like apnews.com or techcrunch.com updates its cohere-ai block rule, the monitoring system surfaces that change — before it degrades retrieval quality silently.

For RevOps leads evaluating Cohere as a vendor, crawl-access data belongs in procurement due-diligence alongside latency benchmarks and pricing tiers. US Tech Automations can help build a lightweight access-monitoring layer that runs on a scheduled interval and surfaces policy drift before it becomes a support ticket.
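A scheduled access-monitoring check could, in its simplest form, compare two saved copies of a site's robots.txt. The sketch below is not the US Tech Automations workflow; it is a minimal illustration using Python's standard `urllib.robotparser`, with hypothetical function names `root_allowed` and `policy_drift`. Note that `can_fetch` also applies wildcard (`*`) rules, so it measures effective access to the site root rather than the report's stricter "explicitly named" definition.

```python
from urllib.robotparser import RobotFileParser


def root_allowed(robots_txt: str, agent: str = "cohere-ai") -> bool:
    """True if `agent` may fetch '/' under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "/")


def policy_drift(old_txt: str, new_txt: str, agent: str = "cohere-ai") -> str:
    """Describe how root access for `agent` changed between two snapshots."""
    before = root_allowed(old_txt, agent)
    after = root_allowed(new_txt, agent)
    if before == after:
        return "unchanged"
    return "newly blocked" if before else "newly allowed"


old_snapshot = "User-agent: *\nAllow: /\n"
new_snapshot = "User-agent: cohere-ai\nDisallow: /\n"
print(policy_drift(old_snapshot, new_snapshot))  # newly blocked
```

Fetching and storing the snapshots on a schedule is left out; any storage layout or interval would be a design choice for the team running the check.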

Frequently Asked Questions

Q: Does blocking cohere-ai in robots.txt prevent Cohere from crawling a site?

A: It communicates intent under the robots.txt honor-system protocol. A compliant crawler will respect the rule; the standard does not enforce compliance at a network level. The 27 sites counted here have stated that intent; technical enforcement is a separate matter.

Q: Does blocking an AI crawler hurt Google search rankings?

A: No. Google crawlers — Googlebot and related agents — are distinct from cohere-ai and all other AI product crawlers. Blocking cohere-ai has no effect on Google indexing or search rankings.

Q: Is the cohere-ai crawler the same as Cohere API products?

A: They are related but separate systems. The cohere-ai user-agent in robots.txt refers to web-crawling infrastructure used to build or refresh training data and retrieval indices. Cohere API products serve customer applications and are separate from the crawler itself.

Q: Is this snapshot current?

A: This report reflects a point-in-time fetch sealed on June 13, 2026 (snapshot sha 741353c4304216ee). robots.txt files change; these numbers are fixed to that date and will not be updated.

Q: How many operators and sites were included?

A: 12 operators and 21 individual crawler user-agents were tracked across 122 prominent sites in 10 content categories; 107 of those sites returned a parseable robots.txt.

Q: How does cohere-ai compare to the lowest-blocked operator in this corpus?

A: The least-blocked operator in this corpus is [MistralAI-User at 12 sites](/resources/blog/who-blocks-mistral-ai-crawler-2026). The cohere-ai crawler at 27 sites has more than double that block count, reflecting a larger recognized web footprint among webmasters actively managing AI-access policy.

Key Takeaways

27 of 107 sites with parseable robots.txt files block the cohere-ai crawler as of June 13, 2026.
News accounts for 10 of those 27 blocks and Tech for 8 — together the majority of the total.
Entertainment (4 sites) and Travel (1 site: tripadvisor.com) extend blocking beyond editorial news into review and catalog content.
48 of 107 sites block at least one AI crawler; cohere-ai at 27 sites sits in the mid-upper range of the 12 operators tracked.
The named blockers overlap heavily with broad-spectrum AI-access restrictors — the cohere-ai crawler is caught in general AI-access policy at most of the 27 sites rather than facing operator-specific opposition.
investopedia.com as a Reference-category blocker shows that AI-crawler restrictions in this snapshot extend to financial content platforms, not just editorial news outlets.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Who Blocks Cohere's AI Crawler? 27 of 107 Sites.” https://ustechautomations.com/resources/blog/who-blocks-cohere-ai-crawler-2026

Sealed snapshot sha256: 741353c4304216ee