Do Music Sites Block AI Crawlers? Sealed robots.txt Data

Jun 14, 2026

Music platforms carry two of the most legally sensitive classes of content on the web: lyrics and user-generated audio metadata. Both are the subject of active licensing disputes, and both are prime targets for AI training pipelines that want to understand cultural output at scale. Given that context, the fact that 6 of 9 Music sites block at least one AI crawler — a block rate of 66.7% — is not surprising. What is worth examining is the one site with no robots.txt at all, and the three that allow every crawler despite the prevailing industry caution.

A robots.txt file is a plain-text file at a website root that communicates crawl permissions to automated bots under an honor-system protocol. This report presents verbatim counts from a sealed snapshot of public robots.txt files collected on June 14, 2026, across 260 sites and 24 categories. The snapshot is content-addressed with sha 834f1e2f07af24fd. To be explicit, nothing is estimated, modeled, or extrapolated — every count is a direct read from that sealed file.

The Outlier: discogs.com Has No robots.txt

The most unusual data point in the Music category is discogs.com. Discogs returned no parseable robots.txt file in our snapshot. That means it is neither a blocker nor an allower in the standard sense: in the absence of a robots.txt, the default under the robots.txt protocol is that all paths are open to all crawlers. But unlike a site that explicitly publishes an open robots.txt, discogs.com communicates nothing — crawlers make their own inference about what is permitted.
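The default-open behavior can be illustrated with Python's standard-library urllib.robotparser. This is a minimal sketch, not the parser used for the snapshot: the URL path and the sample rules are hypothetical, and an empty body stands in for "no rules published".

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate an already-downloaded robots.txt body for one crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/release/123"  # hypothetical path, for illustration

# No rules at all: the parser falls back to "allowed".
empty_allows = can_fetch("", "GPTBot", URL)

# An explicit disallow changes the answer for the named crawler only.
rules = "User-agent: GPTBot\nDisallow: /\n"
gptbot_allowed = can_fetch(rules, "GPTBot", URL)
ccbot_allowed = can_fetch(rules, "CCBot", URL)

print(empty_allows, gptbot_allowed, ccbot_allowed)  # True False True
```

The parser returns the same answer for "no rules" and "rules that do not mention this crawler", which is why a missing file carries no explicit signal.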

Discogs hosts one of the most extensive music metadata databases on the web: release data, label information, format details, and contributor-sourced discographies across millions of recordings. That breadth makes it particularly interesting to AI training pipelines. The absence of a parseable robots.txt does not mean discogs.com has a policy of AI openness. The snapshot records only that no parseable file came back, so the file may be missing, unmaintained, or simply not deployed with AI crawlers in mind.

The 66.7% rate is computed across the 9 sites that returned a parseable robots.txt — not across all 10 Music sites. discogs.com is counted separately.
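The denominator choice can be checked with a few lines of Python. The site statuses below are the ones reported in this article; the function name and the status labels ("block", "allow", "no-robots") are illustrative choices, not the snapshot's actual schema.

```python
def block_rate(statuses: dict[str, str]) -> float:
    """Percent of sites with a parseable robots.txt that block >= 1 tracked bot."""
    with_file = [s for s in statuses.values() if s != "no-robots"]
    blocking = sum(1 for s in with_file if s == "block")
    return round(100 * blocking / len(with_file), 1)


MUSIC = {
    "pitchfork.com": "block",
    "genius.com": "block",
    "bandcamp.com": "block",
    "soundcloud.com": "block",
    "songkick.com": "block",
    "residentadvisor.net": "block",
    "last.fm": "allow",
    "stereogum.com": "allow",
    "ultimate-guitar.com": "allow",
    "discogs.com": "no-robots",
}

rate = block_rate(MUSIC)

# For contrast only: the rate if discogs.com were kept in the denominator.
blockers = sum(1 for s in MUSIC.values() if s == "block")
rate_if_all_counted = round(100 * blockers / len(MUSIC), 1)

print(rate)                 # 66.7
print(rate_if_all_counted)  # 60.0
```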

Key Takeaways

6 of 9 Music sites with a parseable robots.txt block at least one AI crawler. That 66.7% rate puts Music in a three-way tie with Entertainment and Healthcare.

The corpus-wide block rate across all 223 sites with a parseable robots.txt is 46.6%. Music at 66.7% sits well above that corpus average.

discogs.com is the one Music site in our snapshot that returned no parseable robots.txt file.

The three allowers — last.fm, stereogum.com, and ultimate-guitar.com — each allow all tracked AI crawlers under an honor-system reading of their robots.txt.

CCBot (Common Crawl) is blocked by 85 of the 223 sites with a parseable robots.txt, making it the most-blocked bot in the full corpus.

Lyrics-focused and catalog-focused music sites are among the blockers: genius.com and bandcamp.com are two of the 6, alongside the editorial site pitchfork.com, the streaming platform soundcloud.com, and the event-listing platforms songkick.com and residentadvisor.net.

Music Sites: The Snapshot

Metric	Count
Music sites checked	10
Sites with a parseable robots.txt	9
Sites blocking at least one AI crawler	6
Block rate (of sites with robots.txt)	66.7%

Of the 10 Music sites we checked, 9 returned a parseable robots.txt. discogs.com was the exception. Of the 9 with a file, 6 blocked at least one AI crawler.

The 6 blockers: pitchfork.com, genius.com, bandcamp.com, soundcloud.com, songkick.com, and residentadvisor.net. The 3 allowers: last.fm, stereogum.com, and ultimate-guitar.com.

6 of 9 Music sites with a parseable robots.txt block at least one AI crawler — a 66.7% rate that exceeds the corpus-wide 46.6%.

The blockers span an interesting range of business models. genius.com holds what may be the most defensible blocking case: lyric attribution is the subject of active licensing agreements with labels, and training AI systems on Genius lyric databases has been a flashpoint in the music-AI debate. bandcamp.com hosts independent artist pages including detailed track listings, album artwork credits, and in some cases full-text liner notes. soundcloud.com holds user-uploaded audio metadata. pitchfork.com carries decades of editorial coverage. songkick.com and residentadvisor.net carry event and ticketing data.

The 3 allowers have looser IP exposure. last.fm is a scrobbling and recommendation platform whose core data is user listening behavior — publicly disclosed by definition. stereogum.com is a music blog without the dense catalog or lyric assets that make blocking attractive. ultimate-guitar.com hosts user-submitted guitar tabs, a category that has its own complex copyright status but that the site has apparently chosen not to gate via robots.txt.

How Music Compares Across All 24 Categories

Category	Sites Checked	With robots.txt	Blocking	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70.0%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Reference	14	11	6	54.5%
Science	10	10	5	50.0%
Automotive	10	9	4	44.4%
HomeGarden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40.0%
Sports	10	10	4	40.0%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Legal	10	7	2	28.6%
RealEstate	10	7	2	28.6%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Nonprofit	10	6	0	0.0%

Music at 66.7% sits in the upper tier of the distribution, in the same band as Entertainment and Healthcare. It falls below Gaming (88.9%), News (82.4%), Food (70%), and Tech (69.2%), but well above the corpus average of 46.6%.
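The corpus-wide figures can be re-derived from the category table. The sketch below copies the table's three count columns and sums them; the variable names are illustrative.

```python
from fractions import Fraction

# (category, sites checked, with parseable robots.txt, blocking >= 1 AI crawler)
CATEGORIES = [
    ("Gaming", 9, 9, 8), ("News", 20, 17, 14), ("Food", 10, 10, 7),
    ("Tech", 15, 13, 9), ("Entertainment", 9, 9, 6), ("Healthcare", 10, 9, 6),
    ("Music", 10, 9, 6), ("Reference", 14, 11, 6), ("Science", 10, 10, 5),
    ("Automotive", 10, 9, 4), ("HomeGarden", 10, 9, 4), ("Fashion", 9, 7, 3),
    ("Social", 10, 10, 4), ("Sports", 10, 10, 4), ("Jobs", 10, 8, 3),
    ("Travel", 9, 9, 3), ("Weather", 10, 6, 2), ("Legal", 10, 7, 2),
    ("RealEstate", 10, 7, 2), ("Finance", 12, 11, 2), ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1), ("Government", 9, 8, 1), ("Nonprofit", 10, 6, 0),
]

checked = sum(row[1] for row in CATEGORIES)
with_file = sum(row[2] for row in CATEGORIES)
blocking = sum(row[3] for row in CATEGORIES)
corpus_rate = round(100 * blocking / with_file, 1)

# Exact fractions avoid float comparisons when looking for ties with Music.
music_rate = Fraction(6, 9)
tied = [name for name, _, w, b in CATEGORIES if Fraction(b, w) == music_rate]

print(checked, with_file, blocking, corpus_rate)  # 260 223 104 46.6
print(tied)  # ['Entertainment', 'Healthcare', 'Music']
```

The totals match the 260 sites, 223 parseable files, and 46.6% corpus rate quoted in the text.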

The categories where blocking is rarest — Finance (18.2%), Retail (16.7%), Education (14.3%), Government (12.5%), Nonprofit (0%) — are structurally different from Music in that they rely less on proprietary text content as a core commercial asset. Compare Music to Science (50%) where the split follows a clear open-access vs subscription fault line, or to Gaming (88.9%) where the blocking is near-unanimous.

Music's 66.7% block rate matches Entertainment and Healthcare — the category lands well above the corpus-wide 46.6% across all 223 sites with a parseable robots.txt.

Corpus-Wide Bot and Operator Counts

The following tables cover all 223 sites with a parseable robots.txt in the June 2026 snapshot — not just Music sites.

Bots blocked most often (across all 223 sites):

Bot	Sites Blocking It	Share of Corpus
CCBot	85	38.1%
ClaudeBot	74	33.2%
Bytespider	69	30.9%
GPTBot	64	28.7%
Meta-ExternalAgent	63	28.3%
PerplexityBot	60	26.9%
Applebot-Extended	60	26.9%
Google-Extended	57	25.6%
Amazonbot	50	22.4%

Operators blocked most often (across all 223 sites):

Operator	Sites Blocking Them
Common Crawl	85
Anthropic	80
Meta	73
ByteDance	69
OpenAI	66
Perplexity	60
Apple	60
Google	57
Cohere	56
Diffbot	55
Amazon	50
Mistral	21

Common Crawl and Anthropic occupy the top two operator positions at 85 and 80 sites respectively. The 12-operator span reflects how quickly the field of AI crawlers has expanded: from Common Crawl and OpenAI, which were among the earliest active AI-training crawlers, to Mistral at 21 sites, a count that may reflect more recent deployment.

For music-specific contexts, the blocking sites in this category are most likely targeting the large training-pipeline operators: Common Crawl, Anthropic, OpenAI, and Meta. Those four represent the pipelines most likely to turn lyric and catalog data into generated music content or music-aware AI responses.

Methodology

US Tech Automations fetched robots.txt files from 260 prominent web domains across 24 categories on June 14, 2026. Each file was parsed against a fixed list of 9 AI crawler user-agent strings drawn from publicly documented bot identities. The snapshot is content-addressed with sha 834f1e2f07af24fd — immutable after the sealing date. Nothing is estimated, modeled, or extrapolated. A site is classified as blocking when it disallows at least one of the 9 tracked bots. Sites returning no file are counted as no-robots and excluded from the block-rate denominator.
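The classification rule can be sketched with Python's standard-library parser. This is an illustration under stated assumptions, not the pipeline's actual code: the bot names come from the table above, but the source does not say how partial-path or wildcard disallows are treated, so this sketch assumes a bot counts as blocked when it may not fetch the site root, which also means a wildcard "Disallow: /" blocks every tracked bot. The function names and the probe URL are hypothetical.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

TRACKED_BOTS = [
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Meta-ExternalAgent",
    "PerplexityBot", "Applebot-Extended", "Google-Extended", "Amazonbot",
]


def blocked_bots(robots_txt: str, probe_url: str = "https://example.com/") -> list[str]:
    """Tracked bots that may not fetch the site root (assumed blocking test)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in TRACKED_BOTS if not parser.can_fetch(bot, probe_url)]


def classify(robots_txt: Optional[str]) -> str:
    """Return 'no-robots', 'block', or 'allow' for one site."""
    if robots_txt is None:  # fetch failed or no file was returned
        return "no-robots"
    return "block" if blocked_bots(robots_txt) else "allow"


SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_bots(SAMPLE))  # ['CCBot', 'GPTBot']
print(classify(SAMPLE))      # block
print(classify("User-agent: *\nDisallow: /admin/\n"))  # allow
print(classify(None))        # no-robots
```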

The collection steps:

Fetch. Each domain root was queried for its robots.txt. Failures and missing files were recorded and excluded from the blocking-rate calculation.
Parse. Each file was decomposed into user-agent blocks and evaluated for Disallow directives against the 9 tracked AI bots.
Seal. The full dataset was hashed on June 14, 2026, producing sha 834f1e2f07af24fd — content-addressed and verifiable.
Aggregate. All counts were computed from the sealed data with no estimation or interpolation; percentages are rounded to one decimal place.
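The Seal step can be sketched as a content hash over a canonical serialization. The serialization format is not documented in this report, so the JSON layout, the function names, and the 16-character short form are assumptions made for illustration; SHA-256 is taken from the "sha256" label in the citation block.

```python
import hashlib
import json
from typing import Optional


def seal(records: dict[str, Optional[str]]) -> str:
    """Hash a {domain: robots.txt body or None} mapping deterministically."""
    # Sorted keys and fixed separators make the bytes independent of dict order.
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def short_id(digest: str, length: int = 16) -> str:
    """Shortened display form of a digest (assumed convention)."""
    return digest[:length]


# Hypothetical records; real snapshot contents are not reproduced here.
a = {"one.example": "User-agent: GPTBot\nDisallow: /\n", "two.example": None}
b = {"two.example": None, "one.example": "User-agent: GPTBot\nDisallow: /\n"}
c = {"one.example": "User-agent: GPTBot\nAllow: /\n", "two.example": None}

same_content_same_hash = seal(a) == seal(b)
edited_content_new_hash = seal(a) != seal(c)
print(same_content_same_hash, edited_content_new_hash, len(short_id(seal(a))))
```

Anyone holding the same records can recompute the digest and compare it with the published identifier, which is what makes the snapshot verifiable.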

For categories with different blocking patterns, see our Jobs report, and the Nonprofit report for the open-access end of the spectrum.

Frequently Asked Questions

Q: Why does genius.com block AI crawlers specifically?

A: Genius holds licensing agreements for lyric display with labels and publishers. Training an AI on Genius lyric data could produce a system capable of reproducing lyrics at scale, which would compete with or undermine those agreements. robots.txt blocking is the first and lowest-cost layer of policy enforcement for this concern. It is not a guarantee, since robots.txt is honor-system only, but it communicates the site operator's intent to AI companies that have committed to honoring it.

Q: What does it mean that discogs.com has no robots.txt?

A: In the absence of a robots.txt, the default under the protocol is that all paths are open to all crawlers. Well-behaved AI crawlers should treat the absence of a robots.txt the same as an open one. However, the absence communicates no explicit intent — discogs.com has not made a public statement either for or against AI crawling through its robots.txt. This is different from a site that publishes a robots.txt with no AI disallows, which is an explicit open signal.

Q: Is the 66.7% Music block rate higher or lower than the web average?

A: Higher. The corpus-wide block rate across 223 sites with a parseable robots.txt is 46.6%. Music at 66.7% sits about 20 percentage points above that average. Music is in the same band as Entertainment and Healthcare, and below News and Gaming, which are the highest-blocking categories.

Q: Why do last.fm, stereogum.com, and ultimate-guitar.com allow all AI crawlers?

A: Each has a different structural reason. last.fm is built on publicly shared scrobble data — its core asset is aggregated user listening behavior that users themselves have made public. stereogum.com is a blog with standard editorial content and no dense proprietary catalog. ultimate-guitar.com hosts user-contributed tabs with a different copyright profile than lyrics. None of them carry the same combination of licensing agreements and proprietary catalog that makes blocking the natural choice for genius.com or bandcamp.com.

Q: Does robots.txt blocking prevent AI companies from using music content legally?

A: No. robots.txt is an honor-system protocol, not a legal instrument. It communicates preferences to crawlers; it does not create enforceable rights. Reputable AI operators like Anthropic and OpenAI have publicly committed to respecting robots.txt disallows, but legal rights — including copyright in lyrics, album artwork, or editorial text — are governed by copyright law and terms of service, not robots.txt.

Put AI-Access Data to Work

Music at 66.7% is a category where the blocking majority is clear but the allowers include meaningful platforms. That distribution creates three distinct recurring workflows.

An SEO or content strategy lead at a music media company should monitor whether competitors shift posture. If stereogum.com is currently allowing all AI crawlers, it gains AI-surface visibility that a blocked peer like pitchfork.com does not. A weekly re-crawl of the 9 Music sites with a robots.txt — alerting the moment any site adds or removes an AI-crawler disallow — gives your strategy team early warning of posture shifts across the competitive set.
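A posture-change alert of this kind reduces to a set difference between two crawls. The sketch below assumes each crawl has already been reduced to a mapping from site to the set of blocked bots; the domains and bot assignments are hypothetical placeholders, since this report does not publish per-site bot lists.

```python
def posture_changes(
    previous: dict[str, set[str]], current: dict[str, set[str]]
) -> list[str]:
    """Describe every AI-crawler disallow added or removed between two crawls."""
    alerts = []
    for site in sorted(previous.keys() | current.keys()):
        before = previous.get(site, set())
        after = current.get(site, set())
        for bot in sorted(after - before):
            alerts.append(f"{site}: now blocks {bot}")
        for bot in sorted(before - after):
            alerts.append(f"{site}: no longer blocks {bot}")
    return alerts


# Hypothetical data: placeholder domains, invented postures.
last_week = {
    "blog.example": set(),
    "lyrics.example": {"CCBot", "GPTBot"},
}
this_week = {
    "blog.example": {"GPTBot"},
    "lyrics.example": {"CCBot"},
}

alerts = posture_changes(last_week, this_week)
for line in alerts:
    print(line)
# blog.example: now blocks GPTBot
# lyrics.example: no longer blocks GPTBot
```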

A publisher RevOps or licensing lead at a music platform (particularly one holding lyric licenses, like genius.com or bandcamp.com) should track when new AI operators appear in the broader corpus. Mistral is currently at only 21 sites in the 12-operator leaderboard — well below Common Crawl (85) and Anthropic (80). As Mistral and similar emerging operators scale their crawl activity, blocking policies may need updating. An alert when a new operator token appears corpus-wide gives advance notice to update disallow lists.
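Spotting a new operator token amounts to collecting every User-agent value in the corpus and subtracting the ones already tracked. The sketch below uses a simple regular expression rather than a full robots.txt parser, and the domains and the token "ExampleNewBot" are hypothetical.

```python
import re

# Captures the token after "User-agent:", ignoring case and trailing comments.
UA_LINE = re.compile(
    r"^[ \t]*user-agent[ \t]*:[ \t]*([^#\s]+)", re.IGNORECASE | re.MULTILINE
)


def user_agent_tokens(robots_txt: str) -> set[str]:
    """Lowercased user-agent tokens named in one file, excluding the wildcard."""
    return {token.lower() for token in UA_LINE.findall(robots_txt)} - {"*"}


def new_tokens(corpus: dict[str, str], known: set[str]) -> dict[str, list[str]]:
    """Map each untracked token to the sorted list of sites that name it."""
    seen: dict[str, list[str]] = {}
    for site, text in corpus.items():
        for token in user_agent_tokens(text):
            if token not in known:
                seen.setdefault(token, []).append(site)
    return {token: sorted(sites) for token, sites in sorted(seen.items())}


KNOWN = {"ccbot", "claudebot", "bytespider", "gptbot", "meta-externalagent",
         "perplexitybot", "applebot-extended", "google-extended", "amazonbot"}

# Hypothetical corpus with placeholder domains.
CORPUS = {
    "a.example": "User-agent: GPTBot\nDisallow: /\n",
    "b.example": "User-agent: ExampleNewBot # added this week\nDisallow: /\n\n"
                 "User-agent: *\nAllow: /\n",
    "c.example": "user-agent: examplenewbot\nDisallow: /private/\n",
}

found = new_tokens(CORPUS, KNOWN)
print(found)  # {'examplenewbot': ['b.example', 'c.example']}
```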

A retrieval engineer building a music-knowledge AI layer needs to know that of the 9 Music sites with a parseable robots.txt, only last.fm, stereogum.com, and ultimate-guitar.com allow every tracked AI crawler under an honor-system reading. The 6 blockers, including the two most catalog-rich sites, genius.com and bandcamp.com, each disallow at least one tracked crawler, so whether a given pipeline may fetch them depends on which user agents the file names. Monitoring the three allowers for policy changes is a production-relevant signal for any retrieval pipeline.
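Because a site counts as a blocker when it disallows at least one tracked bot, a retrieval pipeline would still need to test its own user agent against each file. The sketch below filters a source list for one crawler; the domains, the rules, and the user agent "ExampleRetrievalBot" are hypothetical, and a missing file is treated as open, following the protocol default described earlier.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Honor-system check for one crawler against one site's rules."""
    if robots_txt is None:  # no file: protocol default is open
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def allowed_sources(robots_by_site: dict[str, Optional[str]], user_agent: str) -> list[str]:
    """Sites whose root this crawler may fetch, in sorted order."""
    return sorted(
        site for site, text in robots_by_site.items()
        if may_fetch(text, user_agent, f"https://{site}/")
    )


# Hypothetical rules for placeholder domains.
ROBOTS = {
    "open.example": "User-agent: *\nAllow: /\n",
    "blocks-gptbot.example": "User-agent: GPTBot\nDisallow: /\n",
    "blocks-all.example": "User-agent: *\nDisallow: /\n",
    "no-file.example": None,
}

for_retrieval_bot = allowed_sources(ROBOTS, "ExampleRetrievalBot")
for_gptbot = allowed_sources(ROBOTS, "GPTBot")
print(for_retrieval_bot)  # ['blocks-gptbot.example', 'no-file.example', 'open.example']
print(for_gptbot)         # ['no-file.example', 'open.example']
```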

US Tech Automations automates scheduled robots.txt and llms.txt crawls across your music domain set, sends change-diff alerts when any site updates its AI-access policy, and maintains a real-time AI-access dashboard against the sealed baseline — so your team knows the moment the access landscape shifts.

Automate AI-access monitoring with agentic workflows

Curious how Music sites compare across every vertical? Our flagship study tracks how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 834f1e2f07af24fd).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Music Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-music-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 834f1e2f07af24fd

Machine-readable data: CSV · JSON · All research & methodology