
Do Book Sites Block AI Crawlers? 1 of 8 Do

Jun 14, 2026

Among the 9 Book sites we checked for the June 2026 Closing Web edition, 8 returned a parseable robots.txt file. Of those 8, just 1 blocks at least one AI crawler — a 12.5% block rate. The sole blocker is bookriot.com. The seven sites that have a parseable robots.txt and allow all AI crawlers are librarything.com, barnesandnoble.com, kirkusreviews.com, publishersweekly.com, literaryhub.com, gutenberg.org, and penguinrandomhouse.com. This is sealed-snapshot data, point-in-time as of June 14, 2026 (sha 27ca61d890a647db).

The most distinctive feature of the Books category is not the low block rate itself — it is who is doing the blocking. bookriot.com is an editorial publication: a book review, recommendation, and criticism outlet whose core product is original written analysis. The seven sites that are open include a major commercial retailer (barnesandnoble.com), two trade publications (kirkusreviews.com, publishersweekly.com), a major commercial publisher (penguinrandomhouse.com), a reader community platform (librarything.com), a literary journal (literaryhub.com), and a public-domain text archive (gutenberg.org). The pattern mirrors what we observe in categories like Beauty, where editorial outlets block while retail and institutional sites remain open.

1 of 8 Book sites block at least one AI crawler.

Book sites post a 12.5% AI-crawler block rate.

Corpus-wide, 139 of the 354 sites with a parseable robots.txt block at least one AI crawler.

Key Takeaways

1 of 8 Book sites with a parseable robots.txt blocks at least one AI crawler.

bookriot.com is the only Books category blocker in this sealed snapshot.

The Books block rate of 12.5% is well below the 39.3% corpus-wide average.

7 of 8 Book sites with parseable robots.txt files allow every AI crawler we checked.

The Only Blocker in a Category of Open Platforms

bookriot.com's decision to add AI-crawler restrictions to its robots.txt is consistent with a broader pattern across editorial content sites. Book Riot publishes staff-written reviews, recommendation lists, and commentary — exactly the kind of original content that editorial outlets worry about AI systems summarizing or training on without licensing arrangements. The seven open sites take a different posture, which tracks with their different relationships to content ownership.

barnesandnoble.com's public-facing content is primarily product catalog and marketing copy — content that benefits from AI indexability for product discovery. gutenberg.org publishes public-domain texts that by definition carry no current copyright claim. penguinrandomhouse.com's public site is largely promotional content for its catalog — again, content that AI visibility may help rather than harm. publishersweekly.com and kirkusreviews.com are trade publications, but their decisions on AI-crawler access in this snapshot landed on the open side.

bookbub.com returned no parseable robots.txt at all in this snapshot. Like a site with a no-restrict robots.txt, bookbub.com presents no explicit AI-access barrier through this mechanism.

Of 9 Book sites checked, 8 returned a parseable robots.txt; 1 of those 8 blocks at least one AI crawler — a 12.5% block rate as of June 14, 2026.

Compare the editorial-blocker pattern in Books to Insurance sites, where a single large financial institution is the lone blocker in an otherwise open category — a parallel structure, different industry logic.

Who Gets Blocked and Who Stays Open

Site	robots.txt Present	Blocks Any AI Crawler
bookriot.com	Yes	Yes
librarything.com	Yes	No
barnesandnoble.com	Yes	No
kirkusreviews.com	Yes	No
publishersweekly.com	Yes	No
literaryhub.com	Yes	No
gutenberg.org	Yes	No
penguinrandomhouse.com	Yes	No
bookbub.com	No	—

Seven of the eight sites with parseable robots.txt files have chosen not to restrict AI crawlers. That consistency across a range of site types (retailer, community, trade press, literary journal, archive, major publisher) suggests that the Books category's default posture is openness, with bookriot.com as the intentional exception.


Where Books Sits Among All 40 Categories

Books lands at 12.5%, placing it in the bottom third of the 40-category block-rate spectrum and well below the corpus-wide 39.3% average. The full cross-category ranked table follows.

Category	Sites Checked	With robots.txt	Any Blocker	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Parenting	10	8	5	62.5%
Outdoors	10	5	3	60%
Reference	14	11	6	54.5%
Science	10	10	5	50%
Wedding	10	8	4	50%
Automotive	10	9	4	44.4%
HomeGarden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40%
Sports	10	10	4	40%
Fitness	10	10	4	40%
Photography	10	10	4	40%
Genealogy	10	10	4	40%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Beauty	10	6	2	33.3%
Legal	10	7	2	28.6%
RealEstate	10	7	2	28.6%
Pets	10	7	2	28.6%
Crafts	10	8	2	25%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Crypto	9	8	1	12.5%
Books	9	8	1	12.5%
Religion	10	9	1	11.1%
Insurance	10	9	1	11.1%
Productivity	10	10	1	10%
Nonprofit	10	6	0	0%
Streaming	10	10	0	0%
Dating	10	5	0	0%

Books at 12.5% ties with Government and Crypto — all three categories have exactly 1 blocker out of 8 parseable files. The corpus-wide 39.3% block rate is more than three times Books' figure. Only Religion, Insurance, Productivity, and the three zero-block categories are more permissive.
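The comparison above can be reproduced from the table. The sketch below assumes the corpus-wide figure is a pooled rate (total blockers divided by total sites with a parseable robots.txt), which matches the 139 of 354 count reported in this edition. The names `CATEGORIES`, `rate`, and `pooled_rate` are illustrative and are not part of the published dataset.

```python
# Each entry: category -> (sites checked, with robots.txt, any blocker),
# transcribed from the cross-category table above.
CATEGORIES = {
    "Gaming": (9, 9, 8), "News": (20, 17, 14), "Food": (10, 10, 7),
    "Tech": (15, 13, 9), "Entertainment": (9, 9, 6), "Healthcare": (10, 9, 6),
    "Music": (10, 9, 6), "Parenting": (10, 8, 5), "Outdoors": (10, 5, 3),
    "Reference": (14, 11, 6), "Science": (10, 10, 5), "Wedding": (10, 8, 4),
    "Automotive": (10, 9, 4), "HomeGarden": (10, 9, 4), "Fashion": (9, 7, 3),
    "Social": (10, 10, 4), "Sports": (10, 10, 4), "Fitness": (10, 10, 4),
    "Photography": (10, 10, 4), "Genealogy": (10, 10, 4), "Jobs": (10, 8, 3),
    "Travel": (9, 9, 3), "Weather": (10, 6, 2), "Beauty": (10, 6, 2),
    "Legal": (10, 7, 2), "RealEstate": (10, 7, 2), "Pets": (10, 7, 2),
    "Crafts": (10, 8, 2), "Finance": (12, 11, 2), "Retail": (15, 12, 2),
    "Education": (9, 7, 1), "Government": (9, 8, 1), "Crypto": (9, 8, 1),
    "Books": (9, 8, 1), "Religion": (10, 9, 1), "Insurance": (10, 9, 1),
    "Productivity": (10, 10, 1), "Nonprofit": (10, 6, 0),
    "Streaming": (10, 10, 0), "Dating": (10, 5, 0),
}


def rate(name: str) -> float:
    """Block rate (%) for one category, using the with-robots denominator."""
    _, with_robots, blockers = CATEGORIES[name]
    return 100 * blockers / with_robots


def pooled_rate() -> float:
    """Corpus-wide block rate (%): all blockers over all parseable files."""
    total_with_robots = sum(w for _, w, _ in CATEGORIES.values())
    total_blockers = sum(b for _, _, b in CATEGORIES.values())
    return 100 * total_blockers / total_with_robots


# Categories with a strictly lower block rate than Books.
more_permissive = sorted(c for c in CATEGORIES if rate(c) < rate("Books"))

print(round(pooled_rate(), 1))                  # 39.3
print(round(pooled_rate() / rate("Books"), 2))  # 3.14
print(more_permissive)
```

A pooled rate weights each category by how many of its sites returned a parseable file, so it is not the same quantity as a simple mean of the 40 category percentages.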

Books' 12.5% block rate is well below the 39.3% corpus-wide rate across the 354 sites with a parseable robots.txt.

How Operators Fare Across the Whole Corpus

Because Books contributes only a single blocking site, the per-operator breakdown within the category is not meaningful in isolation. The broader picture — how AI operators are blocked across all 354 sites with parseable robots.txt in this edition — provides the necessary context.

Operator	Sites Blocking (all 354)
Common Crawl	109
Anthropic	104
Meta	89
OpenAI	87
ByteDance	83
Google	76
Perplexity	74
Apple	74
Cohere	68
Amazon	64
Diffbot	64
Mistral	24

Common Crawl faces blocks at 109 sites across all 354 with parseable robots.txt — the most-blocked operator in the corpus. Anthropic is second at 104, followed by Meta at 89 and OpenAI at 87. In the Books category, bookriot.com has added at least one of these operator-level disallows; the specific bots it targets are captured in the full snapshot but not summarized at the category level here.

For a category like Books — where a public-domain archive (gutenberg.org) and a major publisher (penguinrandomhouse.com) both allow all crawlers — the contrast with high-block categories like Gaming (88.9%) and News (82.4%) is sharp. Books lands near the permissive end because the dominant site types in the category have different content economics than gaming publishers or news organizations.

Snapshot Methodology

We crawled the robots.txt path at each of the 418 sites in this edition on June 14, 2026, parsed the response for User-agent directives matching known AI crawler tokens, and sealed the full dataset as sha 27ca61d890a647db. Every figure in this report is a verbatim count from that sealed snapshot — nothing is estimated, modeled, or extrapolated.

Collect. Each site was requested at its canonical /robots.txt URL. Only public, unauthenticated responses were captured. Response content was stored verbatim.
Parse. Each robots.txt was parsed for User-agent blocks matching the 9 AI bot tokens tracked in this edition. A site counts as a blocker if at least one recognized token is paired with a broad Disallow directive.
Seal. The complete parsed result set was content-addressed at sha 27ca61d890a647db. That hash is the immutable anchor for every number in this report.
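The Parse and Seal steps can be sketched as follows. This is a minimal illustration, not the pipeline used for this edition. It assumes that a "broad Disallow" means exactly `Disallow: /`, that a token must be named explicitly (a wildcard `User-agent: *` group does not count), and that the seal is a SHA-256 digest of a canonical JSON serialization truncated to 16 hex characters. The token set `AI_TOKENS` is a hypothetical example; the edition's tracked tokens are not listed in this report.

```python
import hashlib
import json

# Hypothetical token list, for illustration only.
AI_TOKENS = {"gptbot", "ccbot", "claudebot"}


def blocked_tokens(robots_txt: str) -> set[str]:
    """Return tracked tokens whose group contains a 'Disallow: /' rule."""
    blocked: set[str] = set()
    group_agents: list[str] = []  # user-agents of the group being read
    in_rules = False              # True once the group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(a for a in group_agents if a in AI_TOKENS)
    return blocked


def seal(results: dict[str, list[str]]) -> str:
    """Content-address a result set: same input always gives the same hash."""
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


sample = (
    "User-agent: GPTBot\n"
    "User-agent: CCBot\n"
    "Disallow: /\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /private/\n"
)
found = blocked_tokens(sample)
print(sorted(found))                                # ['ccbot', 'gptbot']
print(len(seal({"example.com": sorted(found)})))    # 16
```

Sorting keys before hashing matters here: without a canonical serialization, two identical result sets could produce different digests and the seal would not be reproducible.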

Sites without a parseable robots.txt (bookbub.com in this case) are counted in the sites total (9) but not in the withRobots denominator (8). The 12.5% block rate is computed from the 8 sites with parseable files, not the full 9.
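The denominator rule can be made concrete with the Books rows from the site table. The sketch below is illustrative: the `SiteResult` type and `block_rate` function are hypothetical names, and only the nine site values come from this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteResult:
    site: str
    has_robots: bool            # a parseable robots.txt was returned
    blocks_ai: Optional[bool]   # None when there is no parseable file


BOOKS = [
    SiteResult("bookriot.com", True, True),
    SiteResult("librarything.com", True, False),
    SiteResult("barnesandnoble.com", True, False),
    SiteResult("kirkusreviews.com", True, False),
    SiteResult("publishersweekly.com", True, False),
    SiteResult("literaryhub.com", True, False),
    SiteResult("gutenberg.org", True, False),
    SiteResult("penguinrandomhouse.com", True, False),
    SiteResult("bookbub.com", False, None),
]


def block_rate(results: list[SiteResult]):
    """Return (sites, with_robots, blockers, rate %) for one category.

    Sites with no parseable robots.txt count toward the sites total
    but are left out of the denominator used for the rate.
    """
    with_robots = [r for r in results if r.has_robots]
    blockers = [r for r in with_robots if r.blocks_ai]
    rate = 100 * len(blockers) / len(with_robots) if with_robots else None
    return len(results), len(with_robots), len(blockers), rate


print(block_rate(BOOKS))  # (9, 8, 1, 12.5)
```

Dividing by all 9 sites instead would give about 11.1%, which is why the choice of denominator is stated explicitly.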

Frequently Asked Questions

Q: Why does bookriot.com block AI crawlers when major publishers like Penguin Random House do not?

A: The sealed snapshot records the existence of the block, not the rationale. bookriot.com is an editorial content site — its product is original writing about books. Major commercial publishers like penguinrandomhouse.com publish primarily promotional content on their public sites; their actual book content is behind retail paywalls. The robots.txt blocking behavior tracks with that difference in what is publicly indexable and what each site has to lose from AI summarization.

Q: What does it mean that gutenberg.org — an archive of public-domain texts — has no AI-crawler restrictions?

A: It means exactly what it looks like: gutenberg.org has not added any disallow directives for AI crawlers in this snapshot. Public-domain texts by definition carry no current copyright claim, so an archive like Project Gutenberg has little motivation to restrict AI training access to its collection. The absence of a block is consistent with the site's open-access mission.

Q: Is bookbub.com blocking AI crawlers?

A: In this snapshot, bookbub.com returned no parseable robots.txt. That means it has not published a public AI-access policy through this mechanism. By convention, the absence of a robots.txt means crawlers may proceed. bookbub.com is excluded from the block-rate calculation because the withRobots denominator counts only sites that returned a parseable robots.txt.

Q: How should someone building a book-recommendations AI interpret the openness of this category?

A: With precision. The 7 open sites cover distinct content types: a catalog retailer, trade publications, a literary journal, a reader-community platform, a major publisher, and a public-domain archive. Each has different content, different ownership claims, and potentially different licensing stances. Robots.txt openness means no technical access barrier through this mechanism — it does not mean the content is licensed for AI training or commercial use. Terms of service and specific licensing agreements govern that.

Put AI-Access Data to Work

Three specific professional audiences derive recurring, automatable value from monitoring the Books category AI-access landscape.

AI product teams building book recommendation engines, literary AI assistants, or reading-list tools that draw on public web content need to track whether bookriot.com expands its block or whether the seven currently-open sites add restrictions. A weekly automated re-crawl of these 9 Book sites — with alerts triggered by any change to a Disallow directive — gives engineering and content teams early warning before a source disappears from a training or retrieval pipeline. US Tech Automations builds and operates exactly this kind of automated monitoring: scheduled robots.txt crawls, change-diff alerting, and API-integrated notifications.
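The change-diff step of such a monitor can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of US Tech Automations' implementation: it assumes the baseline and current robots.txt bodies have already been fetched and stored as strings, and the function names `disallow_pairs` and `diff_disallows` are hypothetical.

```python
def disallow_pairs(robots_txt: str) -> set[tuple[str, str]]:
    """Collect (user-agent, path) pairs for every non-empty Disallow rule."""
    pairs: set[tuple[str, str]] = set()
    agents: list[str] = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            # An empty Disallow value restricts nothing, so it is skipped.
            if field == "disallow" and value:
                pairs.update((agent, value) for agent in agents)
    return pairs


def diff_disallows(baseline: str, current: str) -> dict[str, list]:
    """Report Disallow rules added or removed since the baseline."""
    old, new = disallow_pairs(baseline), disallow_pairs(current)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


baseline = "User-agent: *\nDisallow: /cart/\n"
current = baseline + "\nUser-agent: GPTBot\nDisallow: /\n"

changes = diff_disallows(baseline, current)
if changes["added"] or changes["removed"]:
    print("robots.txt changed:", changes)
```

Comparing parsed rule sets rather than raw text means that reordered groups or edited comments do not trigger an alert, only a real change in what is disallowed.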

Publishers, literary agents, and rights teams tracking how AI operators access book-industry content benefit from a repeatable audit comparing the current state of these 9 sites against the sealed June 14, 2026 baseline. If penguinrandomhouse.com or barnesandnoble.com adds a blocking directive in a future snapshot, that is a material policy shift that affects AI-licensing conversations. A quarterly re-audit against sha 27ca61d890a647db creates a time-stamped record useful for rights negotiations or regulatory disclosure.

Competitive-intelligence teams at editorial book platforms want to know whether peers are restricting AI access and when that changes. bookriot.com is the only editorial outlet in this category that has done so in this snapshot. Monitoring whether kirkusreviews.com, literaryhub.com, or publishersweekly.com follow suit — or whether bookriot.com reverses its policy — is actionable intelligence for content strategy.

US Tech Automations automates this monitoring — turning a sealed snapshot into a live alert system for any team that needs to stay current on AI-access drift.

For comparison, see how Genealogy sites approach AI-access policy — a category with a similarly open default — and the Outdoors category report for a mid-range block rate with an interesting robots.txt coverage pattern.

Zoom out: Books is just one vertical in a much larger picture. Our cross-industry study measures how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 27ca61d890a647db).


Cite this report

US Tech Automations Research, 2026-06 edition. “Do Book Sites Block AI Crawlers? 1 of 8 Do.” https://ustechautomations.com/resources/blog/do-book-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 27ca61d890a647db
