
Do Podcast Sites Block AI Crawlers? 2 of 10 Do

Jun 14, 2026

The podcast web is one of the most open slices in this snapshot. Of the 10 Podcast sites we checked, every one returned a parseable robots.txt, and only 2 of them disallow even a single AI crawler. The hosting platforms — the infrastructure of the medium — almost uniformly leave the door open.

2 of 10 Podcast sites block at least one AI crawler.

A robots.txt file is the plain-text rulebook a site publishes telling automated crawlers which paths they may fetch. We read those files directly; nothing is estimated, modeled, or extrapolated. At a 20% block rate, Podcast sits well below the 31.9% corpus-wide line, making it one of the more permissive categories in the set.

What makes this slice distinctive is who the two blockers are. In most low-block categories the holdouts are media publishers protecting original text. Here the opposite is true: the publishers and directories stay open, and the only sites that gate are software tools. That inversion says something about the medium. Podcasting's reference layer is not an article archive a model would want to ingest — it is feeds, metadata, and audio — so the usual reason to block barely applies, and the rate falls accordingly.

Who Gates the Crawlers Here

Only two sites carry an AI-crawler disallow rule: podbean.com and descript.com. Both are platform tools — a hosting service and an editing product — rather than the pure-directory or news-style names that usually dominate the blockers in other categories.

Eight sites allow every crawler we tested: buzzsprout.com, libsyn.com, transistor.fm, simplecast.com, captivate.fm, riverside.fm, podnews.net, and podcastinsights.com. That open group is almost the entire hosting-and-distribution stack of the industry plus its trade press — businesses that benefit from being discoverable wherever listeners and answer engines look.

That the blockers are tools rather than pure hosts is the most telling detail. podbean.com runs both a hosting service and a consumer listening app, and descript.com is primarily an audio and video editor; both have product surfaces and app areas that a disallow rule may be guarding, distinct from public show pages. The pure hosts (buzzsprout.com, libsyn.com, transistor.fm and the rest) exist to push episodes outward, so gating crawlers would cut against their core function. For a category where the dominant content sites reach the opposite conclusion, see the woodworking report, which shows half of its publishers gating.

Podcast Site	Blocks an AI Crawler?
podbean.com	Yes
descript.com	Yes
buzzsprout.com	No
libsyn.com	No
transistor.fm	No
simplecast.com	No
captivate.fm	No
riverside.fm	No
podnews.net	No
podcastinsights.com	No

The podcast hosting stack is almost uniformly open to AI crawlers.

Podcast sites post a 20% AI-crawler block rate.

What This Block Rate Actually Means

A 20% rate in an infrastructure category is telling. Podcast hosts make money when shows reach audiences, and increasingly those audiences arrive through search and AI answer surfaces. Gating those crawlers would work against the platform's own growth, so the open default here is a business decision, not an oversight.

The two that do block are tools — a host and an editor — that may be guarding app-area paths or product content rather than show pages. Even so, the category's center of gravity is wide open, which is the opposite posture from the news and gaming verticals at the top of the corpus.

There is a structural reason podcasting leans this way. The medium is built on open syndication: an RSS feed is meant to be fetched by anything that asks, and the whole distribution model assumes machines will read and redistribute episode metadata freely. A platform whose entire premise is "publish once, reach everywhere" has little appetite for telling AI crawlers to stay out. The 20% rate is less a deliberate openness campaign than the natural posture of an industry whose plumbing was designed for automated fetching from the start.

That also means a change here would be unusually meaningful. If a major host flipped to gating AI crawlers, it would signal a real shift in how the industry views answer engines — a move away from "reach everywhere" toward protecting hosted content. A single-day count cannot show that shift; only watching the number over time can.

How Podcast Compares to the Other Categories

Across the snapshot, 196 of 614 sites with a published policy block at least one AI crawler — a 31.9% corpus rate. Podcast's 20% lands well below that line. The focused window places it among neighbors: Skiing runs slightly higher, while Finance and Retail sit just below.

Category	Sites With robots.txt	Block at Least One	Block Rate
Space	8	2	25%
BoardGames	8	2	25%
HR	9	2	22.2%
Skiing	9	2	22.2%
Podcasts	10	2	20%
Finance	11	2	18.2%
Retail	12	2	16.7%
Education	7	1	14.3%

The neighbors tell a consistent story. Podcast sits between Skiing and HR just above and Finance and Retail just below — a band of service, platform, and infrastructure categories that all run under the corpus average. None of these are media-heavy verticals, and that is the common thread: where the crawlable page is a doorway rather than the product, gating stays low.
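The rates in the table above are plain division, and the arithmetic can be re-run as a check. The snippet below is a minimal sketch that recomputes each block rate from the published counts and shows the gap to the 31.9% corpus line; the names ROWS and block_rate are illustrative, not part of the report's tooling.

```python
# (category, sites with robots.txt, sites blocking at least one AI crawler)
# Counts are copied from the comparison table in this report.
ROWS = [
    ("Space", 8, 2), ("BoardGames", 8, 2), ("HR", 9, 2), ("Skiing", 9, 2),
    ("Podcasts", 10, 2), ("Finance", 11, 2), ("Retail", 12, 2), ("Education", 7, 1),
]
CORPUS_BLOCKERS, CORPUS_SITES = 196, 614


def block_rate(blockers: int, sites: int) -> float:
    """Percentage of sites that block, rounded to one decimal place."""
    return round(100 * blockers / sites, 1)


corpus = block_rate(CORPUS_BLOCKERS, CORPUS_SITES)
for name, sites, blockers in ROWS:
    rate = block_rate(blockers, sites)
    # Signed gap in percentage points relative to the corpus-wide rate.
    print(f"{name:<11}{rate:>5}%  {rate - corpus:+.1f} pts vs corpus")
```

With denominators this small, one site changing its file moves a category by 8 to 14 points, which is worth keeping in mind when comparing neighbors.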

The corpus as a whole runs from a heavily gated top to a fully open floor.

Category	Sites With robots.txt	Block at Least One	Block Rate
Gaming	9	8	88.9%
Food	10	7	70%
Logistics	8	0	0%
Tea	10	0	0%

Which Bots Are Blocked Most

When a podcast platform does write a disallow rule, it tends to name the same crawlers that lead the corpus-wide picture. The bot leaderboard across all 614 sites shows CCBot in front, with the major model-builder agents close behind.

Bot	Sites Blocking (all 614 sites)
CCBot	145
ClaudeBot	124
GPTBot	121
Bytespider	118
Meta-ExternalAgent	105

CCBot leads because Common Crawl's archive feeds many downstream datasets, making it the highest-leverage single bot to disallow. ClaudeBot and GPTBot follow, so the two podcast blockers are likely naming this front tier rather than fringe agents, although the table above records only whether each site blocks, not which bots it names.

The ordering matters for anyone deciding what to monitor: a site that disallows only the top few user-agents covers the crawlers that other sites most often choose to block, while the long tail of lesser bots accounts for far fewer blocks corpus-wide. Block counts measure published policy, not crawl traffic, so they say nothing about how often each bot actually visits. Watching whether the leading bots gain or shed blocks is the most efficient way to read the direction of the whole field. The yoga report reads the operator-level cut of the same data, and the board-game breakdown covers another low-block category where two of its sites publish no policy at all.
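The leaderboard counts can also be read as shares of the 614 sites with a policy. The sketch below restates the published counts as percentages; the names BLOCK_COUNTS and share are illustrative. It also makes one structural point visible: the five counts sum to far more than the 196 blocking sites, so many blockers must name several bots at once.

```python
# Counts copied from the bot leaderboard in this report.
BLOCK_COUNTS = {
    "CCBot": 145,
    "ClaudeBot": 124,
    "GPTBot": 121,
    "Bytespider": 118,
    "Meta-ExternalAgent": 105,
}
TOTAL_SITES = 614      # sites with a parseable robots.txt
BLOCKING_SITES = 196   # sites blocking at least one AI crawler


def share(count: int, total: int = TOTAL_SITES) -> float:
    """Percentage of all policy-publishing sites that block a given bot."""
    return round(100 * count / total, 1)


# Highest block count first.
leaderboard = sorted(BLOCK_COUNTS.items(), key=lambda item: item[1], reverse=True)
for bot, count in leaderboard:
    print(f"{bot:<20}{count:>4}  {share(count):>5}% of sites")

# The per-bot counts overlap: their sum exceeds the number of blocking sites.
print(sum(BLOCK_COUNTS.values()), "bot-level blocks across", BLOCKING_SITES, "sites")
```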

Across all 614 sites, CCBot is the single most-blocked bot at 145.

How the Snapshot Was Sealed

Our research team fetched each site's robots.txt at one point in time, parsed the user-agent and disallow directives, and recorded which AI crawlers were named. Every figure follows the honesty rule: nothing is estimated, modeled, or extrapolated. A site counts as a blocker only when its own file disallows a known AI user-agent on any path.
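The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline behind the report: the function name blocked_ai_bots is hypothetical, the AI_BOTS set is limited to the five agents in the leaderboard, wildcard groups are ignored because the rule speaks of named AI user-agents, and an empty Disallow is treated as no block.

```python
# Assumed list of "known AI user-agents": the five from the report's leaderboard.
AI_BOTS = {"ccbot", "claudebot", "gptbot", "bytespider", "meta-externalagent"}


def blocked_ai_bots(robots_txt: str) -> set[str]:
    """Return the known AI user-agents that have at least one non-empty Disallow."""
    blocked: set[str] = set()
    agents: list[str] = []   # user-agents of the group currently being read
    in_rules = False         # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:     # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:   # an empty Disallow blocks nothing
                blocked.update(a for a in agents if a in AI_BOTS)
    return blocked


SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow:

User-agent: *
Disallow: /admin/
"""

print(sorted(blocked_ai_bots(SAMPLE)))  # ['ccbot', 'gptbot']
```

Under these assumptions a site is a blocker when the returned set is non-empty. A stricter or looser rule (for example, counting a wildcard Disallow: /) would change the totals, which is why the rule has to be fixed before sites are compared.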

The corpus spans 725 sites checked, 614 with a parseable robots.txt, across 72 categories. Separately, 141 sites publish an llms.txt file (23% of the 614 with a robots.txt), a newer convention for declaring AI-access intent. The snapshot is content-addressed under sha 77d0521dc8809a6c, so every count here can be reproduced exactly.
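Content addressing means the identifier is derived from the data itself, so any change to the data changes the identifier. The sketch below shows one way this can work, assuming a canonical JSON encoding hashed with SHA-256 and shortened to 16 hex characters. The encoding, the record fields, and the function name snapshot_id are assumptions for illustration; the report does not describe how its own identifier is computed, and this code will not reproduce it.

```python
import hashlib
import json


def snapshot_id(records: list[dict], length: int = 16) -> str:
    """Hash a canonical JSON encoding of the records and keep a hex prefix."""
    # Sorting records and keys makes the encoding independent of input order.
    canonical = json.dumps(
        sorted(records, key=lambda record: record["site"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Hypothetical record shape: one entry per site with its yes/no result.
records = [
    {"site": "podbean.com", "blocks_ai_crawler": True},
    {"site": "libsyn.com", "blocks_ai_crawler": False},
]

same_data_reordered = list(reversed(records))
changed = [
    {"site": "podbean.com", "blocks_ai_crawler": True},
    {"site": "libsyn.com", "blocks_ai_crawler": True},
]

print(snapshot_id(records) == snapshot_id(same_data_reordered))  # True
print(snapshot_id(records) == snapshot_id(changed))              # False
```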

Corpus-wide, 196 of 614 sites block at least one AI crawler.

Because robots.txt is editable in seconds, this 20% is a single-day reading. Either current blocker could open and any of the eight allowers could close, so the durable value is in re-reading the file on a schedule rather than in the one-day count.
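Re-reading on a schedule comes down to comparing two reads of the same sites. The sketch below assumes each read is stored as a mapping from site to the set of AI bots it disallows, and reports what was added or removed per site. The function name policy_drift and the example.com hosts are hypothetical, and the bot assignments are invented sample data, not findings from this report.

```python
def policy_drift(
    baseline: dict[str, set[str]], latest: dict[str, set[str]]
) -> dict[str, dict[str, set[str]]]:
    """Per site, list bots newly blocked and newly unblocked since the baseline."""
    changes: dict[str, dict[str, set[str]]] = {}
    for site in sorted(baseline.keys() | latest.keys()):
        before = baseline.get(site, set())
        after = latest.get(site, set())
        if before != after:
            changes[site] = {"added": after - before, "removed": before - after}
    return changes


# Invented sample reads for two placeholder hosts.
BASELINE = {"host-a.example.com": {"gptbot"}, "host-b.example.com": set()}
LATEST = {"host-a.example.com": set(), "host-b.example.com": {"ccbot", "gptbot"}}

for site, change in policy_drift(BASELINE, LATEST).items():
    print(site, "added:", sorted(change["added"]), "removed:", sorted(change["removed"]))
```

In this sample the category still has one blocker out of two in both reads, yet both sites changed policy. A rate alone would hide that, which is the case for diffing per site rather than per category.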

Frequently Asked Questions

Q: Why is the podcast block rate so low?

A: At 20%, Podcast sits well under the 31.9% corpus rate. The category is dominated by hosting and distribution platforms whose growth depends on shows being discoverable across search and AI surfaces, so an open policy aligns with their business. Only 2 of the 10 sites — both tools — gate any crawler.

Q: Does a disallow rule actually stop a crawler?

A: No. robots.txt is an honor-system standard. Compliant crawlers respect it, but the file enforces nothing on its own. Hard enforcement needs server-side blocking. This report measures stated policy, not whether every bot obeys it.
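As an illustration of what server-side blocking can look like, the sketch below wraps a Python WSGI application and answers 403 when the request's User-Agent contains a listed token. It is a minimal example under stated assumptions: the token list and the names block_ai_agents and hello_app are hypothetical, and because a User-Agent header can be spoofed, real deployments usually combine this with other checks at the proxy or network layer.

```python
# Assumed tokens, lowercased; the five agents from the report's leaderboard.
AI_AGENT_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider", "meta-externalagent")


def block_ai_agents(app):
    """WSGI middleware: answer 403 when the User-Agent contains a listed token."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in AI_AGENT_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in application used only to demonstrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


application = block_ai_agents(hello_app)
```

Unlike a robots.txt rule, this refuses the request regardless of whether the crawler chooses to comply, but only for clients that identify themselves honestly.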

Q: Why do podbean.com and descript.com block when the pure hosts do not?

A: Both are platform tools that may be guarding app-area or product paths rather than show pages. The pure hosting and directory names — buzzsprout.com, libsyn.com, and the rest — leave everything open because discoverability is their product.

Q: How is this different from a re-query of the same sites?

A: A sealed snapshot is content-addressed and frozen, so the exact counts can be reproduced under sha 77d0521dc8809a6c. A fresh re-query would read whatever the files say that day. The value of monitoring is comparing the sealed baseline against later reads to catch policy changes.

Q: What does the llms.txt count add to this picture?

A: Across the corpus, 141 sites publish an llms.txt file — 23% of those with a robots.txt. It is a newer convention for declaring how AI systems may use a site's content, separate from the disallow rules in robots.txt. For a discoverability-driven category like podcasting, it is one more published signal worth tracking alongside the block rate.

Q: Is a 20% block rate stable or likely to climb?

A: This is one snapshot day, so we make no trend claim. What we can say is that the openness fits the medium's open-syndication design, which suggests a stable baseline rather than a category mid-shift. Only repeated sealed reads over time could confirm whether the figure moves.

Put AI-Access Data to Work

A product manager at a podcast-hosting platform such as buzzsprout.com or transistor.fm can run this as a standing competitive watch: re-crawl podbean.com, descript.com, and every peer host weekly, and get alerted the moment one adds or removes an AI-crawler disallow. Whether hosted show pages feed AI answer engines is a direct lever on creator discoverability and on the platform's pitch. A podcast-network growth marketer can monitor whether the show pages it relies on stay open to AI surfaces. A retrieval-systems engineer can watch the corpus bot leaderboard for threshold shifts in CCBot or GPTBot blocks.

Each is a recurring, automatable job: the snapshot count anchors it, and the value is detecting drift on a fixed cadence. US Tech Automations automates that monitoring with scheduled robots.txt and llms.txt crawls, change alerts, and an AI-access policy dashboard.

Key Takeaways

Podcast is among the most open categories in the snapshot: 2 of 10 sites block an AI crawler, a 20% rate well below the 31.9% corpus line. The hosting and distribution stack stays open because discoverability is its product; only two platform tools gate. As an editable single-day reading, the durable insight is watching it change — the recurring monitoring US Tech Automations runs.

For the whole-web baseline behind the Podcast category, see our national study on how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 77d0521dc8809a6c).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Podcast Sites Block AI Crawlers? 2 of 10 Do.” https://ustechautomations.com/resources/blog/do-podcast-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 77d0521dc8809a6c
