
Do Cannabis Sites Block AI Crawlers? 1 of 8 Do

Jun 14, 2026

The cannabis web is almost entirely open to AI crawlers. Of the 10 Cannabis sites we checked, 8 returned a parseable robots.txt, and just 1 of those disallows at least one AI user-agent, a 12.5% block rate (1 of 8). For a regulated, compliance-heavy industry, that openness is the headline, and it runs counter to what you might expect from a sector so cautious everywhere else.

A robots.txt file is the plain-text rulebook a site posts at its root to tell crawlers which paths they may fetch. We read each Cannabis site's file literally on June 14, 2026, and logged only what it declares about AI user-agents. The result: one gatekeeper, seven open doors, and two sites that publish no policy at all.

1 of 8 Cannabis sites blocks at least one AI crawler.

Who Gates the Crawlers in Cannabis

The single blocker is leafly.com. Its robots.txt disallows at least one AI user-agent. Every other Cannabis site we checked with a published policy allows the crawlers we track.

The allowers are the larger group: weedmaps.com, hightimes.com, marijuanamoment.net, thecannabisindustry.org, ganjapreneur.com, cannabisnow.com, and norml.org all return a robots.txt that lets every AI crawler we track through. Two more, cannabisbusinesstimes.com and mjbizdaily.com, returned no parseable robots.txt. That leaves no published rule for a crawler to read, which is not the same as a deliberate block.

One Cannabis site — leafly.com — disallows at least one AI user-agent.

There is an interesting mix among the allowers. You have a consumer marketplace, several trade-news outlets, an industry association, and an advocacy nonprofit, all leaving the door open. For trade publishers and advocacy groups, being summarized and cited in AI answers extends reach — exactly the posture a movement-driven sector benefits from.

The advocacy angle is especially telling. An organization like norml.org exists to shift public understanding and policy; having its positions surfaced in AI answers is not a risk to manage but a goal to pursue. The same logic applies to an industry association whose job is to legitimize and explain the sector. When the players most invested in shaping the narrative are also the most open to crawlers, the category's permissive posture starts to look intentional rather than accidental. A robots.txt file cannot prove intent, but on this reading these sites are not forgetting to write a directive; they are choosing reach.

leafly.com is the instructive counterexample. As a consumer-facing platform with a large structured database of products, strains, and dispensary listings, it has something to protect that a news outlet or an advocacy group does not: a proprietary catalog that competitors and aggregators would happily ingest. That is the most common profile for a lone blocker in an otherwise open vertical — the one business whose core asset is data rather than narrative. Its decision to disallow at least one AI user-agent fits that profile cleanly, and it is the single data point in this category most worth watching for change.

What a 12.5% Block Rate Actually Means

At 12.5%, Cannabis is one of the lowest-blocking categories in this edition. Corpus-wide, 177 of 542 sites block at least one AI crawler — a 32.7% rate — so Cannabis sits far below the line. Only a thin tail of categories blocks less.

That gap is the story. In a sector defined by regulation, age-gating, and banking friction, you might expect aggressive crawler control. Instead, the data shows the opposite: cannabis media behaves like an open-knowledge vertical, not a locked-down one. The compliance burden these companies carry clearly lives in payments and licensing, not in robots.txt.

Cannabis posts a 12.5% AI-crawler block rate, well below the corpus-wide 32.7%.
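The arithmetic behind both rates is worth spelling out, because the denominator is the number of sites that returned a parseable robots.txt, not the number of sites checked. A minimal Python sketch, where `block_rate` is a hypothetical helper name and the counts are the ones quoted in this report:

```python
def block_rate(blockers: int, with_robots_txt: int) -> float:
    """Percent of sites with a parseable robots.txt that disallow at least one AI crawler."""
    if with_robots_txt <= 0:
        raise ValueError("need at least one site with a parseable robots.txt")
    return round(100 * blockers / with_robots_txt, 1)


cannabis_rate = block_rate(1, 8)     # 1 blocker among 8 Cannabis sites with a policy
corpus_rate = block_rate(177, 542)   # corpus-wide counts from this edition

print(cannabis_rate, corpus_rate)    # 12.5 32.7
```

Dividing by all 10 Cannabis sites checked would give 10.0% instead; the report's 12.5% uses only the 8 sites with a published policy.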

For context on the other extreme, the most-gated categories are nowhere near Cannabis. The mini-table below shows the top blockers against the most open ones.

Category	Sites	With robots.txt	Block at least one	Block rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Marketing	10	10	1	10.0%
Productivity	10	10	1	10.0%

Where Cannabis Sits Among Its Nearest Neighbors

The focused window below centers on Cannabis and the categories immediately above and below it in the block-rate ranking. Education sits just above at 14.3%; Government, Crypto, Books, and Pharma share Cannabis's exact 12.5% rate; Religion, Insurance, Cybersecurity, and Coffee fall just below at 11.1%. This is the company Cannabis keeps: verticals that overwhelmingly leave crawlers alone.

Category	Sites	With robots.txt	Block at least one	Block rate
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Crypto	9	8	1	12.5%
Books	9	8	1	12.5%
Pharma	9	8	1	12.5%
Cannabis	10	8	1	12.5%
Religion	10	9	1	11.1%
Insurance	10	9	1	11.1%
Cybersecurity	10	9	1	11.1%
Coffee	10	9	1	11.1%

The band is strikingly uniform: a single blocker each. Cannabis is not unusual within this group — it is a textbook example of a vertical where one cautious property gates crawlers while the rest stay open. If you publish in cannabis, the practical read is that allowing crawlers is the prevailing industry norm.

That uniformity carries a lesson about how to read the snapshot. A category with one blocker is not really a "blocking category" at all; it is an open category with a single outlier. Cannabis, Government, Crypto, Books, and Pharma all share that shape, and grouping them by their identical rate makes the resemblance obvious.
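Recomputing the rates from the table's own counts makes the point concrete: every category in the window has exactly one blocker, so the rates differ only because the number of sites with a robots.txt differs (7, 8, or 9). A short sketch, with `group_by_rate` as a hypothetical helper name:

```python
from collections import defaultdict

# (category, sites with robots.txt, sites blocking at least one) from the table above.
ROWS = [
    ("Education", 7, 1), ("Government", 8, 1), ("Crypto", 8, 1),
    ("Books", 8, 1), ("Pharma", 8, 1), ("Cannabis", 8, 1),
    ("Religion", 9, 1), ("Insurance", 9, 1), ("Cybersecurity", 9, 1),
    ("Coffee", 9, 1),
]


def group_by_rate(rows):
    """Group category names by block rate, rounded to one decimal place."""
    groups = defaultdict(list)
    for category, with_robots_txt, blockers in rows:
        rate = round(100 * blockers / with_robots_txt, 1)
        groups[rate].append(category)
    return dict(groups)


groups = group_by_rate(ROWS)
for rate in sorted(groups, reverse=True):
    print(f"{rate}%: {', '.join(groups[rate])}")
```

The three rate bands that come out (14.3%, 12.5%, 11.1%) reflect sample size alone, which is why ranking within the window says little.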

The interesting analytical move is not to rank these categories against each other — the rates are too close to mean much — but to identify the one site in each that breaks from its peers and ask why. In Cannabis, that site is leafly.com, and the why is most likely about protecting a proprietary product-and-strain database rather than any sector-wide caution. A comparably premium-catalog vertical, the watch and horology web, shows its own concentrated blocking pattern for similar catalog-protection reasons.

It is also worth weighing what the two no-policy sites mean for the category's real openness. cannabisbusinesstimes.com and mjbizdaily.com — both trade-news properties — published no parseable robots.txt at all. That leaves the bulk of the category's authoritative coverage either explicitly open or simply unguarded, which reinforces the headline: cannabis media, for all its regulatory weight, is one of the more reachable verticals on the AI web. The compliance instincts that govern payments and licensing in this industry plainly do not extend to crawler policy.

Cannabis sits far below the corpus-wide 32.7% AI-crawler block rate.

How the Snapshot Was Sealed

We fetch each site's robots.txt directly, parse it for AI user-agent directives, and seal the result to a content hash so the figures cannot change after the fact. This edition covers 645 sites overall, of which 542 returned a parseable robots.txt, across 64 content categories. Every figure in this report is a verbatim count from that sealed file — nothing is estimated, modeled, or extrapolated.
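The report does not describe how its seal is computed, so the following is only a sketch of the general idea: serialize the per-site results in a canonical form and hash the bytes, so that any later change to a figure produces a different digest. The record fields, the sample data, and the `seal` function name are all assumptions for illustration.

```python
import hashlib
import json


def seal(records):
    """Return a SHA-256 hex digest over a canonical JSON encoding of the records."""
    # Sort records and keys so the digest does not depend on input order.
    ordered = sorted(records, key=lambda r: r["domain"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical records; the real field names and values are not given in the report.
snapshot = [
    {"domain": "site-b.example", "blocked_agents": ["GPTBot"]},
    {"domain": "site-a.example", "blocked_agents": []},
]
digest = seal(snapshot)

# Changing a single field yields a different digest.
tampered = seal([{**snapshot[0], "blocked_agents": []}, snapshot[1]])
```

A full SHA-256 digest is 64 hexadecimal characters; comparing a freshly computed digest with the published one is what lets a reader confirm the figures have not moved.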

Definitions keep it honest. "Blocks at least one AI crawler" means the file disallows one or more AI user-agents, not necessarily all of them. A site with no robots.txt is recorded as having no policy, not as a blocker. And robots.txt is an honor-system standard — a directive is a request, not an enforced firewall.
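These three outcomes (blocks at least one, allows all, no policy) can be sketched with Python's standard `urllib.robotparser`. This is an illustration, not the report's parser: the agent list is a small subset named in this report, the function names are hypothetical, and treating a crawler as blocked when it may not fetch the site root `/` is an assumption.

```python
from typing import List, Optional
from urllib.robotparser import RobotFileParser

# Illustrative subset; the full list of tracked AI user-agents is not given here.
AI_AGENTS = ("CCBot", "ClaudeBot", "GPTBot")


def blocked_agents(robots_txt: str) -> List[str]:
    """AI user-agents that this robots.txt text bars from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, "/")]


def classify(robots_txt: Optional[str]) -> str:
    """Map a fetched robots.txt (or None when no parseable file exists) to a label."""
    if robots_txt is None:
        return "no policy"  # a missing file is silence, not a disallow
    if blocked_agents(robots_txt):
        return "blocks at least one"
    return "allows all tracked"


sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(classify(sample), blocked_agents(sample))
```

Under this assumption a blanket `User-agent: *` disallow of `/` would also count as blocking each AI agent; whether the report counts such files that way is not stated.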

Across all 542 sites, the most-disallowed crawler is CCBot at 133 sites (24.5%), followed by ClaudeBot at 114 (21.0%) and GPTBot at 108 (19.9%). Counted by operator rather than by individual user-agent, Common Crawl leads at 133, followed by Anthropic at 125 and OpenAI at 113. Separately, 117 of 542 sites (21.6%) publish an llms.txt file. The corpus-wide operator picture is below.

Operator	Sites disallowing
Common Crawl	133
Anthropic	125
OpenAI	113
Meta	110
ByteDance	106
Perplexity	80

These corpus-wide totals frame what the lone Cannabis blocker is most likely reaching for: the same broad-coverage operators that dominate disallow lists across every vertical.
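The operator totals can exceed the single-crawler counts (Anthropic at 125 versus ClaudeBot at 114), presumably because an operator can run more than one user-agent and each site is counted once per operator. A minimal sketch of that roll-up, assuming a hypothetical agent-to-operator mapping and made-up site data:

```python
from collections import Counter

# Assumed mapping for illustration; the report does not list its own mapping.
AGENT_TO_OPERATOR = {
    "CCBot": "Common Crawl",
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
}


def sites_per_operator(blocked_by_site):
    """Count sites disallowing at least one user-agent of each operator."""
    counts = Counter()
    for agents in blocked_by_site.values():
        # A set counts a site once per operator, however many of its agents are listed.
        counts.update({AGENT_TO_OPERATOR[a] for a in agents if a in AGENT_TO_OPERATOR})
    return counts


# Made-up sites, not data from the report.
blocked_by_site = {
    "site-a.example": {"ClaudeBot", "anthropic-ai"},
    "site-b.example": {"anthropic-ai"},
    "site-c.example": {"ClaudeBot", "GPTBot"},
}
per_operator = sites_per_operator(blocked_by_site)
claudebot_sites = sum("ClaudeBot" in agents for agents in blocked_by_site.values())
```

In this toy data ClaudeBot is disallowed on 2 sites while Anthropic as an operator is disallowed on 3, the same shape as the gap in the table.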

Frequently Asked Questions

Q: Does a robots.txt block actually stop an AI crawler?

A: No. robots.txt is an honor-system standard. A compliant crawler reads the file and obeys it, but the directive is a request, not an enforced control. We report what each Cannabis site declares, not what any crawler ultimately does.

Q: How many Cannabis sites block AI crawlers here?

A: Of 10 Cannabis sites checked, 8 returned a parseable robots.txt and 1 of those — leafly.com — disallows at least one AI user-agent. That works out to a 12.5% block rate within the category.

Q: Why would a regulated industry like cannabis leave crawlers open?

A: The industry's compliance burden sits in payments, licensing, and age-gating, not in robots.txt. Many cannabis sites are trade outlets and advocacy groups whose reach grows when AI answers cite them, so gating crawlers would work against their goals.

Q: What about cannabisbusinesstimes.com and mjbizdaily.com?

A: Both returned no parseable robots.txt, so there is no published rule for a crawler to read. We count them as having no policy, not as blockers. A missing file is silence, not a disallow.

Q: How does Cannabis compare to the rest of the snapshot?

A: Corpus-wide, 177 of 542 sites block at least one AI crawler, a 32.7% rate. At 12.5%, Cannabis is one of the most open categories, sitting alongside Government, Crypto, Books, and Pharma.

Key Takeaways

Cannabis is among the most open verticals in this edition. Of 8 sites with a published policy, 1 blocks at least one AI crawler and 7 allow every crawler we track. The 12.5% block rate sits far under the corpus-wide 32.7% line, clustering Cannabis with other single-blocker categories rather than with heavily gated categories such as Gaming and News.

Corpus-wide, 177 of 542 sites block at least one AI crawler.

For anyone tracking AI access in cannabis media, the actionable question is whether leafly.com stays the sole gatekeeper — or whether a peer follows. For how adjacent open verticals behave, see our companion reads on whether space publishers gate AI crawlers and where comics sites land on AI access.

Put AI-Access Data to Work

This report is a point-in-time count; its practical value lies in detecting drift from it. Three kinds of buyers can act on these sealed figures.

A cannabis-compliance data analyst (the person at a multistate operator or a platform like weedmaps.com responsible for monitoring how the category presents itself online) should track whether leafly.com remains the only blocker. That means re-crawling the 8 Cannabis domains weekly and routing an alert the moment a second site adds an AI user-agent token to its disallow list, since a shift toward gating would change content-syndication strategy across the sector.

A trade-publisher audience-growth lead at an outlet like marijuanamoment.net should monitor its own policy and its competitors' weekly, so a newly added disallow is caught before it quietly removes the brand from AI answers. A retrieval-product engineer building a cannabis-knowledge feature should watch which of the 8 domains stays crawlable so sourcing stays current.
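The weekly check described above reduces to a set difference between two snapshots of per-domain disallow lists. A minimal sketch, assuming each snapshot is a mapping from domain to the set of disallowed AI user-agents; the function name and the sample data are hypothetical:

```python
def new_disallows(previous, current):
    """Return (domain, agent) pairs disallowed in `current` but not in `previous`."""
    alerts = []
    for domain in sorted(current):
        # Domains absent from the previous snapshot are treated as having blocked nothing.
        added = current[domain] - previous.get(domain, set())
        alerts.extend((domain, agent) for agent in sorted(added))
    return alerts


# Made-up snapshots for illustration, not data from the report.
last_week = {"site-a.example": {"GPTBot"}, "site-b.example": set()}
this_week = {"site-a.example": {"GPTBot"}, "site-b.example": {"CCBot", "GPTBot"}}

alerts = new_disallows(last_week, this_week)
print(alerts)  # [('site-b.example', 'CCBot'), ('site-b.example', 'GPTBot')]
```

Because missing domains default to an empty set, a site that publishes a robots.txt for the first time is flagged the same way as one that extends an existing disallow list.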

US Tech Automations runs that surveillance as scheduled robots.txt and llms.txt crawls with change alerts and an AI-access policy dashboard, turning a token added to a disallow list into a routed notification rather than a manual audit. Automate AI-access monitoring with agentic workflows.

For the whole-web baseline behind the Cannabis category, see our national study on how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha eb8a3956a17595bc).


Cite this report

US Tech Automations Research, 2026-06 edition. “Do Cannabis Sites Block AI Crawlers? 1 of 8 Do.” https://ustechautomations.com/resources/blog/do-cannabis-sites-block-ai-crawlers-2026

Sealed snapshot sha256: eb8a3956a17595bc
