Do Crypto Sites Block AI Crawlers? Sealed robots.txt Data

Jun 14, 2026

Crypto is one of the most data-hungry verticals on the web — price feeds, wallet activity, on-chain analytics — yet when we checked 9 major Crypto sites against public robots.txt files, only 1 of the 8 that returned a parseable file chose to block any AI crawler at all.

That single dissenter is coindesk.com; the remaining sites — binance.com, cointelegraph.com, kraken.com, crypto.com, blockchain.com, kucoin.com, and gemini.com — allow every major AI crawler without restriction. The block rate of 12.5% places Crypto near the bottom of all 32 categories we surveyed, well below the corpus-wide rate of 42% (123 of 293 sites block at least one AI crawler).

A robots.txt file is a public plain-text signal in a website's root directory that tells crawlers which paths they may or may not access. It is a convention, not a technical lock — a crawler that ignores these rules can still access the content. With that in mind, the unusually open posture in Crypto takes on a specific meaning: most of these platforms apparently see AI-crawler indexing as neutral or beneficial to their distribution goals.
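To make the convention concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The sample file and the `example.com` URLs are invented for illustration and are not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide, and keeps every
# other crawler out of /admin/ only.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# What a compliant crawler would conclude for each request.
gptbot_ok = parser.can_fetch("GPTBot", "https://example.com/markets/btc")
other_ok = parser.can_fetch("SomeOtherBot", "https://example.com/markets/btc")
admin_ok = parser.can_fetch("SomeOtherBot", "https://example.com/admin/login")

print(gptbot_ok, other_ok, admin_ok)
```

The parser only reports what a rule-following crawler would do. Nothing in the file stops a crawler that never consults it.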

Key Takeaways

1 of 8 Crypto sites with a parseable robots.txt blocks any AI crawler.

The Crypto block rate of 12.5% is well below the corpus average of 42%.

coindesk.com is the only Crypto site that restricts AI crawler access.

Across all 293 sites in the corpus, 123 block at least one AI crawler.

48 of the 293 sites (16.4%) have deployed an llms.txt file.

Crypto shares its 12.5% block rate with Government; only Religion (11.1%), Nonprofit (0%), and Streaming (0%) rank lower. The contrast with high-blocking verticals like Gaming (88.9%) and News (82.4%) is stark. These findings come from US Tech Automations Research sealed snapshot a5ca246fbdc79954, recorded June 14, 2026.

The Crypto AI-Access Picture: Who Gates and Who Allows

The story in Crypto is almost entirely one of openness. Eight of the nine checked sites returned a parseable robots.txt file, and of those eight, seven — binance.com, cointelegraph.com, kraken.com, crypto.com, blockchain.com, kucoin.com, and gemini.com — place no restrictions on any known AI crawler. Only coindesk.com takes a different approach.

Of 8 Crypto sites with a parseable robots.txt, just 1 restricts AI crawlers — a 12.5% block rate that sits far below the 42% corpus average.

Why would coindesk.com block while its peers remain open? As a publisher-first brand generating editorial revenue from exclusive reporting, it has more in common structurally with the high-blocking News category (82.4%) than with exchange or analytics platforms. Publishers that monetize attention tend to be protective of their content; platforms that benefit from distribution tend to leave the door open.

The site etherscan.io returned no parseable robots.txt file at all. Its absence from the blocking count is a data artifact, not a policy statement — we cannot know its intent from a missing file. The analysis covers only what the sealed snapshot captured.

etherscan.io returned no parseable robots.txt in the June 14, 2026 snapshot — its AI-access posture is unknown from this data alone.

The openness of the exchange and analytics platforms makes intuitive sense. Exchanges like binance.com, kraken.com, kucoin.com, and gemini.com operate in a context where greater data visibility and indexing by AI systems may support user acquisition and credibility. On-chain analytics platforms like blockchain.com may similarly benefit from broad distribution of their data. Content sites like cointelegraph.com lean into discovery as a primary traffic channel.

Where Crypto Sits Across the 32-Category Corpus

The table below shows all 32 categories ranked by block rate, using sealed data from the June 14, 2026 snapshot across 339 sites.

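| Category | Sites Checked | With robots.txt | Blocking Any AI | Block Rate |
|---|---|---|---|---|
| Gaming | 9 | 9 | 8 | 88.9% |
| News | 20 | 17 | 14 | 82.4% |
| Food | 10 | 10 | 7 | 70% |
| Tech | 15 | 13 | 9 | 69.2% |
| Entertainment | 9 | 9 | 6 | 66.7% |
| Healthcare | 10 | 9 | 6 | 66.7% |
| Music | 10 | 9 | 6 | 66.7% |
| Parenting | 10 | 8 | 5 | 62.5% |
| Reference | 14 | 11 | 6 | 54.5% |
| Science | 10 | 10 | 5 | 50% |
| Automotive | 10 | 9 | 4 | 44.4% |
| HomeGarden | 10 | 9 | 4 | 44.4% |
| Fashion | 9 | 7 | 3 | 42.9% |
| Social | 10 | 10 | 4 | 40% |
| Sports | 10 | 10 | 4 | 40% |
| Fitness | 10 | 10 | 4 | 40% |
| Photography | 10 | 10 | 4 | 40% |
| Jobs | 10 | 8 | 3 | 37.5% |
| Travel | 9 | 9 | 3 | 33.3% |
| Weather | 10 | 6 | 2 | 33.3% |
| Legal | 10 | 7 | 2 | 28.6% |
| RealEstate | 10 | 7 | 2 | 28.6% |
| Pets | 10 | 7 | 2 | 28.6% |
| Crafts | 10 | 8 | 2 | 25% |
| Finance | 12 | 11 | 2 | 18.2% |
| Retail | 15 | 12 | 2 | 16.7% |
| Education | 9 | 7 | 1 | 14.3% |
| Government | 9 | 8 | 1 | 12.5% |
| Crypto | 9 | 8 | 1 | 12.5% |
| Religion | 10 | 9 | 1 | 11.1% |
| Nonprofit | 10 | 6 | 0 | 0% |
| Streaming | 10 | 10 | 0 | 0% |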

Crypto shares its 12.5% position with Government and sits just above Religion (11.1%), Nonprofit (0%), and Streaming (0%). The pattern suggests that transactional and infrastructure-adjacent categories — exchanges, government portals, e-commerce platforms — tend toward openness, while content publishers and gaming platforms tend toward restriction.
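As a quick consistency check, the table's counts can be summed and the block rates recomputed. The sketch below copies the four count columns from the table; the `block_rate` helper and variable names are our own, and the denominator is the number of sites with a parseable robots.txt, not the number of sites checked.

```python
# (category, sites checked, with robots.txt, blocking any AI), copied from the table.
ROWS = [
    ("Gaming", 9, 9, 8), ("News", 20, 17, 14), ("Food", 10, 10, 7),
    ("Tech", 15, 13, 9), ("Entertainment", 9, 9, 6), ("Healthcare", 10, 9, 6),
    ("Music", 10, 9, 6), ("Parenting", 10, 8, 5), ("Reference", 14, 11, 6),
    ("Science", 10, 10, 5), ("Automotive", 10, 9, 4), ("HomeGarden", 10, 9, 4),
    ("Fashion", 9, 7, 3), ("Social", 10, 10, 4), ("Sports", 10, 10, 4),
    ("Fitness", 10, 10, 4), ("Photography", 10, 10, 4), ("Jobs", 10, 8, 3),
    ("Travel", 9, 9, 3), ("Weather", 10, 6, 2), ("Legal", 10, 7, 2),
    ("RealEstate", 10, 7, 2), ("Pets", 10, 7, 2), ("Crafts", 10, 8, 2),
    ("Finance", 12, 11, 2), ("Retail", 15, 12, 2), ("Education", 9, 7, 1),
    ("Government", 9, 8, 1), ("Crypto", 9, 8, 1), ("Religion", 10, 9, 1),
    ("Nonprofit", 10, 6, 0), ("Streaming", 10, 10, 0),
]


def block_rate(blocking: int, with_robots: int) -> float:
    """Percent of sites with a parseable robots.txt that block any AI crawler."""
    return round(100 * blocking / with_robots, 1)


total_sites = sum(row[1] for row in ROWS)
total_with_robots = sum(row[2] for row in ROWS)
total_blocking = sum(row[3] for row in ROWS)
rates = {name: block_rate(blocking, with_robots)
         for name, _, with_robots, blocking in ROWS}

print(total_sites, total_with_robots, total_blocking)
print(rates["Crypto"], block_rate(total_blocking, total_with_robots))
```

The column totals come to 339 sites checked, 293 with a parseable file, and 123 blocking, which matches the corpus-level figures quoted elsewhere in this report.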

Corpus-Wide Bots and Operators Most Frequently Blocked

Even though Crypto itself shows very little blocking, the broader corpus tells a different story. The table below covers all 293 sites with parseable robots.txt files across the full 339-site snapshot.

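| AI Bot | Sites Blocking (of 293) | Block Rate |
|---|---|---|
| CCBot | 97 | 33.1% |
| ClaudeBot | 87 | 29.7% |
| Bytespider | 75 | 25.6% |
| GPTBot | 74 | 25.3% |
| Meta-ExternalAgent | 70 | 23.9% |
| PerplexityBot | 68 | 23.2% |
| Applebot-Extended | 67 | 22.9% |
| Google-Extended | 66 | 22.5% |
| Amazonbot | 56 | 19.1% |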

CCBot (Common Crawl) is the single most-blocked bot, named by 33.1% of the 293 sites. ClaudeBot (Anthropic) follows at 29.7%. Bytespider (ByteDance) is third at 25.6%, with GPTBot (OpenAI) close behind at 25.3%. Amazonbot is the least-blocked bot in the table at 19.1%.

| Operator | Sites Blocking (of 293) |
|---|---|
| Common Crawl | 97 |
| Anthropic | 93 |
| Meta | 80 |
| OpenAI | 77 |
| ByteDance | 75 |
| Perplexity | 69 |
| Apple | 67 |
| Google | 66 |
| Cohere | 63 |
| Diffbot | 60 |
| Amazon | 56 |
| Mistral | 23 |

The operator-level table shows that Common Crawl (97 sites) and Anthropic (93 sites) are the two most-blocked operators in the corpus, followed by Meta (80) and OpenAI (77). Mistral, with only 23 blocking sites, is the least-blocked operator in the table. These figures span all 32 categories; the 1 blocking Crypto site represents a narrow slice of this broader pattern.
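Some operator totals exceed the matching per-bot figures (Anthropic at 93 sites versus 87 for ClaudeBot, for example). One plausible reading is that an operator counts as blocked when any of its crawler agents is blocked. The sketch below illustrates that counting rule; the agent-to-operator mapping and the per-site data are hypothetical, since the report's reference list is not reproduced here.

```python
# Hypothetical mapping from operator to the agent tokens it runs.
OPERATOR_AGENTS = {
    "Anthropic": {"ClaudeBot", "anthropic-ai"},
    "OpenAI": {"GPTBot", "ChatGPT-User"},
}

# Hypothetical per-site results: domain -> AI agents subject to a Disallow rule.
site_blocks = {
    "site-a.example": {"ClaudeBot"},
    "site-b.example": {"anthropic-ai"},
    "site-c.example": {"ClaudeBot", "GPTBot"},
    "site-d.example": set(),
}


def sites_blocking_agent(agent: str) -> int:
    """Count sites that block one specific agent token."""
    return sum(agent in blocked for blocked in site_blocks.values())


def sites_blocking_operator(operator: str) -> int:
    """Count sites that block at least one agent run by the operator."""
    agents = OPERATOR_AGENTS[operator]
    return sum(bool(blocked & agents) for blocked in site_blocks.values())


print(sites_blocking_agent("ClaudeBot"), sites_blocking_operator("Anthropic"))
```

In this toy data, two sites block ClaudeBot but three block some Anthropic agent, so the operator count is the larger of the two.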

How the Snapshot Was Sealed

This report draws on a single point-in-time crawl of public robots.txt files, sealed June 14, 2026 under snapshot sha a5ca246fbdc79954. US Tech Automations Research collected the files programmatically, parsed each user-agent block, and flagged any site where at least one known AI crawler user-agent was subject to a Disallow rule. The data covers 339 sites across 32 content categories.

The methodology is strictly observational. We read only what is publicly posted; nothing is estimated, modeled, or extrapolated. Every figure in this report appears verbatim in the sealed snapshot. A site that lacks a robots.txt contributes to the sites count but not the withRobots count; it is not treated as a blocker or an allower. A site is flagged as blocking if any known AI crawler user-agent is subject to at least one Disallow rule; the level of restriction (path-level versus full-site) is not differentiated here.
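The flagging rule can be sketched as a small parser. This is our own minimal reading of the rule, not the pipeline's actual code: it treats a site as blocking an agent when that agent is named in a user-agent group containing a non-empty Disallow. The `AI_AGENTS` set is a subset taken from the bot table above, and we assume a wildcard (`*`) group does not count because it names no AI crawler.

```python
# Subset of the agents in the bot table, lowercased for matching.
AI_AGENTS = {"ccbot", "claudebot", "bytespider", "gptbot", "google-extended"}


def blocked_ai_agents(robots_txt: str, ai_agents=AI_AGENTS) -> set[str]:
    """Return the known AI agents named in a group with a non-empty Disallow."""
    blocked = set()
    group_agents = []   # user-agents of the group being read
    in_rules = False    # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # empty Disallow allows everything
                blocked.update(a for a in group_agents if a in ai_agents)
    return blocked


def blocks_any_ai(robots_txt: str) -> bool:
    """Site-level flag: True if at least one known AI agent is restricted."""
    return bool(blocked_ai_agents(robots_txt))
```

As in the report, a path-level Disallow and a full-site Disallow are flagged identically.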

Steps in the sealed-snapshot process:

  1. Crawl. Each domain's /robots.txt endpoint is fetched programmatically on the snapshot date.

  2. Parse. Each file is parsed for user-agent blocks and Disallow directives. Known AI crawler strings are matched against a fixed reference list.

  3. Seal. The raw collected files are content-hashed and stored in an append-only log, producing the snapshot sha.

  4. Aggregate. Per-domain results are grouped by category; block rates are computed from sealed counts only.
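The seal and aggregate steps above can be sketched as follows. The hashing scheme is an assumption (SHA-256 over the domains in sorted order), since the report does not describe how the snapshot sha is derived. The `sites`, `withRobots`, and `blocking` keys mirror the counts described in the methodology.

```python
import hashlib
from collections import defaultdict


def seal(files: dict[str, bytes]) -> str:
    """Hash the raw files in a fixed (sorted) order so the digest is reproducible."""
    digest = hashlib.sha256()
    for domain in sorted(files):
        digest.update(domain.encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(files[domain]).digest())
    return digest.hexdigest()


def aggregate(results):
    """Group per-domain results by category.

    results: iterable of (category, blocks_ai) pairs, where blocks_ai is
    True, False, or None when no parseable robots.txt was returned.
    """
    counts = defaultdict(lambda: {"sites": 0, "withRobots": 0, "blocking": 0})
    for category, blocks_ai in results:
        row = counts[category]
        row["sites"] += 1
        if blocks_ai is None:      # counted as a site, excluded from the rate
            continue
        row["withRobots"] += 1
        row["blocking"] += int(blocks_ai)
    return dict(counts)


# Crypto as reported: 7 open sites, 1 blocker, 1 with no parseable file.
crypto = aggregate([("Crypto", False)] * 7 + [("Crypto", True), ("Crypto", None)])
print(crypto["Crypto"])
```

Keeping `None` distinct from `False` is what lets a site with no parseable file count toward the sites total without being treated as either a blocker or an allower.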

Because this is a cross-sectional snapshot, there are no trend claims here. Whether these postures shift over time is something only future snapshots can answer. For context on how methodology applies consistently across the batch, see the companion Do Pet Sites Block AI Crawlers? report, which uses the same sealed-data process.

Frequently Asked Questions

Q: Why does Crypto have such a low block rate compared to categories like News?

A: The dominant sites in Crypto are exchanges and analytics platforms — businesses whose core value is broad data distribution and discovery. Publisher-first brands like coindesk.com resemble the News category in their incentives, and coindesk.com is indeed the one Crypto blocker. Exchange and analytics platforms, by contrast, generally benefit from AI indexing rather than fearing it.

Q: Does a robots.txt disallow actually prevent AI crawlers from reading the content?

A: No. robots.txt is an honor-system standard. Any crawler that ignores the directive can still fetch the content. The signal matters for tracking intent and policy posture, not for measuring technical enforcement. Some AI operators publicly commit to following robots.txt; others do not make that commitment explicitly.

Q: What does etherscan.io having no parseable robots.txt file mean?

A: It means we cannot characterize its AI-access posture from this snapshot. Crawlers conventionally treat a missing robots.txt as imposing no restrictions, but that default is not a stated policy: the absence may reflect a configuration oversight or a deliberate choice to defer to default crawler behavior. We report it in the sites count but exclude it from the block-rate calculation.
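One way a pipeline might separate these cases is by the HTTP outcome of the fetch. The sketch below is an assumption about how "no parseable file" could be determined, not a description of this report's crawler. The status ranges follow the Robots Exclusion Protocol (RFC 9309), which treats 4xx responses as "unavailable" and 5xx responses as "unreachable". The `has_directive` test is a hypothetical heuristic.

```python
def has_directive(body: str) -> bool:
    """Heuristic: a real robots.txt has at least one User-agent line."""
    return any(line.strip().lower().startswith("user-agent:")
               for line in body.splitlines())


def classify_fetch(status: int, body: str = "") -> str:
    """Classify the final response for a /robots.txt request."""
    if 200 <= status < 300:
        # A 200 can still carry an HTML error page instead of rules.
        return "parseable" if has_directive(body) else "unparseable"
    if 400 <= status < 500:
        return "unavailable"    # RFC 9309: crawler may access any resource
    if 500 <= status < 600:
        return "unreachable"    # RFC 9309: crawler must assume full disallow
    return "other"


print(classify_fetch(200, "User-agent: *\nDisallow:"))
print(classify_fetch(404))
```

Under this sketch, only the "parseable" outcome would count toward the withRobots denominator.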

Q: Is 12.5% a surprisingly low block rate for a financial-data-adjacent vertical?

A: It is lower than the corpus average (42%), but the comparison depends on how you categorize Crypto. Finance sites show an 18.2% block rate, which is also below the corpus average. Both categories lean toward openness relative to content-publisher verticals. The notable outlier is coindesk.com, which behaves more like a media outlet than a financial platform — and blocks accordingly.

Q: How does llms.txt fit into this picture?

A: llms.txt is a newer proposed convention that lets sites describe their content for large language models in a structured format. Of the 293 sites with a parseable robots.txt, 48 (16.4%) have deployed an llms.txt file. This report covers only robots.txt blocking; llms.txt adoption is a separate signal tracked at the corpus level.

Put AI-Access Data to Work

The Crypto category's 12.5% block rate is a point-in-time anchor — but the value for practitioners lies in detecting when that posture shifts. Three audiences have concrete recurring workflows here.

An SEO or content-strategy lead at a competing publication or Crypto media brand monitors whether coindesk.com tightens its blocking over time or whether open sites like cointelegraph.com add restrictions. The trigger: re-crawl this category weekly and alert the moment any Crypto domain adds a new AI-crawler disallow. The cadence matters because a single high-authority site shifting policy can signal a sector-wide move.
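The alert trigger amounts to a set difference between two crawls. A minimal sketch, assuming each crawl is stored as a mapping from domain to the set of AI agents it restricts; the function name and the `.example` domains are placeholders, not snapshot data.

```python
def new_blocks(previous: dict[str, set[str]],
               current: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return, per domain, the AI agents restricted now but not in the prior crawl."""
    alerts = {}
    for domain, blocked_now in current.items():
        added = blocked_now - previous.get(domain, set())
        if added:
            alerts[domain] = sorted(added)
    return alerts


# Placeholder data for two consecutive weekly crawls.
last_week = {"news.example": {"GPTBot"}, "exchange.example": set()}
this_week = {"news.example": {"GPTBot", "CCBot"}, "exchange.example": set()}

alerts = new_blocks(last_week, this_week)
print(alerts)
```

Domains that appear for the first time are compared against an empty set, so any restriction they carry is reported as new.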

A publisher RevOps lead benchmarking their own site against the category checks whether their robots.txt posture aligns with the sector norm — open — or diverges toward the coindesk.com pattern. The recurring job is a monthly diff of their own robots.txt against the sealed snapshot to confirm no unintentional changes have been deployed by engineering. An alert when internal policy drifts from intended posture prevents silent AI-access changes.
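The monthly diff can be done with the standard library's `difflib`. This is a sketch under the assumption that the intended baseline and the live file are both available as strings; the file labels are illustrative.

```python
import difflib


def robots_diff(baseline: str, live: str) -> list[str]:
    """Return unified-diff lines between the intended and the live robots.txt."""
    return list(difflib.unified_diff(
        baseline.splitlines(),
        live.splitlines(),
        fromfile="baseline/robots.txt",
        tofile="live/robots.txt",
        lineterm="",
    ))


baseline = "User-agent: *\nDisallow: /admin/"
live = "User-agent: *\nDisallow: /admin/\n\nUser-agent: GPTBot\nDisallow: /"

drift = robots_diff(baseline, live)
print("\n".join(drift))
```

An empty list means the live file still matches the intended policy; any output is a candidate for an alert.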

A retrieval or data-pipeline engineer building a Crypto knowledge base needs to know which sites actively restrict training crawlers and which remain open. The practical job: maintain a live allowlist of Crypto domains, updated whenever a new snapshot is sealed, so training pipelines skip blockers without manual review.
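An allowlist of this kind could be derived directly from per-domain results. The sketch below is illustrative: the function name, data shape, and `.example` domains are assumptions. It takes the conservative choice of leaving out domains whose posture is unknown (no parseable robots.txt), represented here as `None`.

```python
from typing import Optional


def build_allowlist(blocked_by_domain: dict[str, Optional[set[str]]],
                    crawler: str) -> list[str]:
    """List domains with a parseable robots.txt that do not restrict `crawler`."""
    wanted = crawler.lower()
    return sorted(
        domain
        for domain, blocked in blocked_by_domain.items()
        if blocked is not None and wanted not in {a.lower() for a in blocked}
    )


# Placeholder results: one open site, one blocker, one unknown.
results = {
    "exchange.example": set(),
    "news.example": {"GPTBot", "CCBot"},
    "explorer.example": None,   # no parseable robots.txt
}

allowlist = build_allowlist(results, "gptbot")
print(allowlist)
```

Whether unknown domains belong on the list is a policy decision; excluding them keeps the pipeline from treating a missing file as permission.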

US Tech Automations automates this monitoring with scheduled robots.txt recrawls, change-diffing, and alerting pipelines — so your team sees policy shifts the day they happen rather than discovering them in a quarterly audit. See how the platform handles this at /platform/agentic-workflows.

For context on how the Fitness and Parenting verticals handle AI access — categories with meaningfully higher block rates — see Do Fitness Sites Block AI Crawlers? and Do Parenting Sites Block AI Crawlers?. The contrast in posture across these verticals is part of what makes the corpus-level view useful.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha a5ca246fbdc79954).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Crypto Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-crypto-sites-block-ai-crawlers-2026

Sealed snapshot sha256: a5ca246fbdc79954

Machine-readable data: CSV · JSON · All research & methodology

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.