Do Book Sites Block AI Crawlers? 1 of 8 Do

Jun 14, 2026

Among the 9 Book sites we checked for the June 2026 Closing Web edition, 8 returned a parseable robots.txt file. Of those 8, just 1 blocks at least one AI crawler — a 12.5% block rate. The sole blocker is bookriot.com. The seven sites that have a parseable robots.txt and allow all AI crawlers are librarything.com, barnesandnoble.com, kirkusreviews.com, publishersweekly.com, literaryhub.com, gutenberg.org, and penguinrandomhouse.com. This is sealed-snapshot data, point-in-time as of June 14, 2026 (sha 27ca61d890a647db).

The most distinctive feature of the Books category is not the low block rate itself — it is who is doing the blocking. bookriot.com is an editorial publication: a book review, recommendation, and criticism outlet whose core product is original written analysis. The seven sites that are open include a major commercial retailer (barnesandnoble.com), two trade publications (kirkusreviews.com, publishersweekly.com), a major commercial publisher (penguinrandomhouse.com), a reader community platform (librarything.com), a literary journal (literaryhub.com), and a public-domain text archive (gutenberg.org). The pattern mirrors what we observe in categories like Beauty, where editorial outlets block while retail and institutional sites remain open.

1 of 8 Book sites blocks at least one AI crawler.

Book sites post a 12.5% AI-crawler block rate.

Corpus-wide, 139 of 354 sites block at least one AI crawler.

Key Takeaways

1 of 8 Book sites with a parseable robots.txt blocks at least one AI crawler.

bookriot.com is the only Book category blocker in this sealed snapshot.

The Books block rate of 12.5% is well below the 39.3% corpus-wide average.

7 of 8 Book sites with parseable robots.txt files allow every AI crawler we checked.

The Only Blocker in a Category of Open Platforms

bookriot.com's decision to add AI-crawler restrictions to its robots.txt is consistent with a broader pattern across editorial content sites. Book Riot publishes staff-written reviews, recommendation lists, and commentary — exactly the kind of original content that editorial outlets worry about AI systems summarizing or training on without licensing arrangements. The seven open sites take a different posture, which tracks with their different relationships to content ownership.

barnesandnoble.com's public-facing content is primarily product catalog and marketing copy — content that benefits from AI indexability for product discovery. gutenberg.org publishes public-domain texts that by definition carry no current copyright claim. penguinrandomhouse.com's public site is largely promotional content for its catalog — again, content that AI visibility may help rather than harm. publishersweekly.com and kirkusreviews.com are trade publications, but their decisions on AI-crawler access in this snapshot landed on the open side.

bookbub.com returned no parseable robots.txt at all in this snapshot. Like a site whose robots.txt imposes no restrictions, bookbub.com presents no explicit AI-access barrier through this mechanism.

Of 9 Book sites checked, 8 returned a parseable robots.txt; 1 of those 8 blocks at least one AI crawler — a 12.5% block rate as of June 14, 2026.

Compare the editorial-blocker pattern in Books to Insurance sites, where a single large financial institution is the lone blocker in an otherwise open category — a parallel structure, different industry logic.

Who Gets Blocked and Who Stays Open

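| Site | robots.txt Present | Blocks Any AI Crawler |
| --- | --- | --- |
| bookriot.com | Yes | Yes |
| librarything.com | Yes | No |
| barnesandnoble.com | Yes | No |
| kirkusreviews.com | Yes | No |
| publishersweekly.com | Yes | No |
| literaryhub.com | Yes | No |
| gutenberg.org | Yes | No |
| penguinrandomhouse.com | Yes | No |
| bookbub.com | No | n/a |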

Seven of the eight sites with parseable robots.txt files have chosen not to restrict AI crawlers. That consistency across a range of site types — retailer, community, trade press, literary journal, archive, major publisher — suggests that the Book category's default posture is openness, with bookriot.com as the intentional exception.

Where Books Sits Among All 40 Categories

Books lands at 12.5%, placing it in the bottom third of the 40-category block-rate spectrum and well below the corpus-wide 39.3% average. The full cross-category ranked table follows.

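| Category | Sites Checked | With robots.txt | Any Blocker | Block Rate |
| --- | --- | --- | --- | --- |
| Gaming | 9 | 9 | 8 | 88.9% |
| News | 20 | 17 | 14 | 82.4% |
| Food | 10 | 10 | 7 | 70% |
| Tech | 15 | 13 | 9 | 69.2% |
| Entertainment | 9 | 9 | 6 | 66.7% |
| Healthcare | 10 | 9 | 6 | 66.7% |
| Music | 10 | 9 | 6 | 66.7% |
| Parenting | 10 | 8 | 5 | 62.5% |
| Outdoors | 10 | 5 | 3 | 60% |
| Reference | 14 | 11 | 6 | 54.5% |
| Science | 10 | 10 | 5 | 50% |
| Wedding | 10 | 8 | 4 | 50% |
| Automotive | 10 | 9 | 4 | 44.4% |
| HomeGarden | 10 | 9 | 4 | 44.4% |
| Fashion | 9 | 7 | 3 | 42.9% |
| Social | 10 | 10 | 4 | 40% |
| Sports | 10 | 10 | 4 | 40% |
| Fitness | 10 | 10 | 4 | 40% |
| Photography | 10 | 10 | 4 | 40% |
| Genealogy | 10 | 10 | 4 | 40% |
| Jobs | 10 | 8 | 3 | 37.5% |
| Travel | 9 | 9 | 3 | 33.3% |
| Weather | 10 | 6 | 2 | 33.3% |
| Beauty | 10 | 6 | 2 | 33.3% |
| Legal | 10 | 7 | 2 | 28.6% |
| RealEstate | 10 | 7 | 2 | 28.6% |
| Pets | 10 | 7 | 2 | 28.6% |
| Crafts | 10 | 8 | 2 | 25% |
| Finance | 12 | 11 | 2 | 18.2% |
| Retail | 15 | 12 | 2 | 16.7% |
| Education | 9 | 7 | 1 | 14.3% |
| Government | 9 | 8 | 1 | 12.5% |
| Crypto | 9 | 8 | 1 | 12.5% |
| Books | 9 | 8 | 1 | 12.5% |
| Religion | 10 | 9 | 1 | 11.1% |
| Insurance | 10 | 9 | 1 | 11.1% |
| Productivity | 10 | 10 | 1 | 10% |
| Nonprofit | 10 | 6 | 0 | 0% |
| Streaming | 10 | 10 | 0 | 0% |
| Dating | 10 | 5 | 0 | 0% |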

Books at 12.5% ties with Government and Crypto — all three categories have exactly 1 blocker out of 8 parseable files. The corpus-wide 39.3% block rate is more than three times Books' figure. Only Religion, Insurance, Productivity, and the three zero-block categories are more permissive.
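
The tie and the "more permissive" list can be re-derived from the table counts. The snippet below is a minimal sketch, not the report's own tooling: the `rows` dictionary is a hand-copied subset of the table (parseable files and blockers per category), and the variable names are illustrative.

```python
# (with_robots, blockers) pairs copied by hand from the cross-category table.
rows = {
    "Government": (8, 1), "Crypto": (8, 1), "Books": (8, 1),
    "Religion": (9, 1), "Insurance": (9, 1), "Productivity": (10, 1),
    "Nonprofit": (6, 0), "Streaming": (10, 0), "Dating": (5, 0),
}

# Block rate = blockers / sites with a parseable robots.txt.
rates = {name: blockers / with_robots
         for name, (with_robots, blockers) in rows.items()}

books_rate = rates["Books"]
tied = sorted(n for n, r in rates.items() if r == books_rate and n != "Books")
more_permissive = sorted(n for n, r in rates.items() if r < books_rate)

# Corpus-wide rate (139 of 354) relative to the Books rate.
corpus_ratio = (139 / 354) / books_rate

print("Tied with Books:", tied)
print("More permissive:", more_permissive)
print(f"Corpus rate is {corpus_ratio:.2f}x the Books rate")
```

The exact-equality test for ties is safe here only because the three tied categories share identical counts (1 of 8); comparing rounded percentages would be the more robust choice for arbitrary data.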

Books' 12.5% block rate is well below the 39.3% corpus-wide average across 354 sites.

How Operators Fare Across the Whole Corpus

Because Books contributes only a single blocking site, the per-operator breakdown within the category is not meaningful in isolation. The broader picture — how AI operators are blocked across all 354 sites with parseable robots.txt in this edition — provides the necessary context.

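| Operator | Sites Blocking (all 354) |
| --- | --- |
| Common Crawl | 109 |
| Anthropic | 104 |
| Meta | 89 |
| OpenAI | 87 |
| ByteDance | 83 |
| Google | 76 |
| Perplexity | 74 |
| Apple | 74 |
| Cohere | 68 |
| Amazon | 64 |
| Diffbot | 64 |
| Mistral | 24 |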

Common Crawl faces blocks at 109 sites across all 354 with parseable robots.txt — the most-blocked operator in the corpus. Anthropic is second at 104, followed by Meta at 89 and OpenAI at 87. In the Books category, bookriot.com has added at least one of these operator-level disallows; the specific bots it targets are captured in the full snapshot but not summarized at the category level here.

For a category like Books — where a public-domain archive (gutenberg.org) and a major publisher (penguinrandomhouse.com) both allow all crawlers — the contrast with high-block categories like Gaming (88.9%) and News (82.4%) is sharp. Books lands near the permissive end because the dominant site types in the category have different content economics than gaming publishers or news organizations.

Snapshot Methodology

We crawled the robots.txt path at each of the 418 sites in this edition on June 14, 2026, parsed the response for User-agent directives matching known AI crawler tokens, and sealed the full dataset as sha 27ca61d890a647db. Every figure in this report is a verbatim count from that sealed snapshot — nothing is estimated, modeled, or extrapolated.

  1. Collect. Each site was requested at its canonical /robots.txt URL. Only public, unauthenticated responses were captured. Response content was stored verbatim.

  2. Parse. Each robots.txt was parsed for User-agent groups matching the 9 AI bot tokens tracked in this edition. A site counts as a blocker when at least one recognized token is paired with a broad Disallow directive.

  3. Seal. The complete parsed result set was content-addressed at sha 27ca61d890a647db. That hash is the immutable anchor for every number in this report.
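
The Parse step can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser used for the snapshot: the report does not list its 9 tokens, so `AI_TOKENS` holds three example tokens only, and "broad Disallow" is assumed here to mean exactly `Disallow: /`. Wildcard (`User-agent: *`) groups are deliberately not counted as AI-specific blocks in this sketch.

```python
from typing import List, Set, Tuple

# Assumption: example tokens only; the report does not enumerate its 9 tokens.
AI_TOKENS = {"gptbot", "ccbot", "claudebot"}

Group = Tuple[List[str], List[Tuple[str, str]]]


def parse_groups(robots_txt: str) -> List[Group]:
    """Split robots.txt text into (user_agents, rules) groups."""
    groups: List[Group] = []
    agents: List[str] = []
    rules: List[Tuple[str, str]] = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
            seen_rule = True
    if agents:
        groups.append((agents, rules))
    return groups


def blocked_ai_tokens(robots_txt: str) -> Set[str]:
    """Return tracked tokens whose group contains a site-wide Disallow."""
    blocked: Set[str] = set()
    for agents, rules in parse_groups(robots_txt):
        if ("disallow", "/") in rules:  # assumed meaning of "broad"
            blocked.update(a for a in agents if a in AI_TOKENS)
    return blocked


sample = """User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""
print(blocked_ai_tokens(sample))
```

Under this sketch a site would be a "blocker" whenever `blocked_ai_tokens` returns a non-empty set. A production parser would also need to handle partial-path disallows, Allow overrides, and malformed files, none of which are specified in the methodology above.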

Sites without a robots.txt — bookbub.com in this case — are counted in the sites total (9) but not in the withRobots denominator (8). The 12.5% block rate is computed from the 8 sites with parseable files, not the full 9.
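
The denominator rule is easy to get wrong, so here is a small worked sketch. The `CategoryCounts` class and its field names are illustrative (only `withRobots` is a term from the report); the counts are the ones stated in this report.

```python
from dataclasses import dataclass


@dataclass
class CategoryCounts:
    sites: int        # every site checked
    with_robots: int  # sites that returned a parseable robots.txt
    blockers: int     # sites blocking at least one AI crawler

    def block_rate(self) -> float:
        """Percent of sites with a parseable robots.txt that block."""
        if self.with_robots == 0:
            raise ValueError("no parseable robots.txt files; rate undefined")
        return 100 * self.blockers / self.with_robots


books = CategoryCounts(sites=9, with_robots=8, blockers=1)
corpus = CategoryCounts(sites=418, with_robots=354, blockers=139)

print(f"Books:  {books.block_rate():.1f}%")
print(f"Corpus: {corpus.block_rate():.1f}%")
```

Dividing by `sites` instead would give 1 of 9 (about 11.1%) for Books, which is why the report states the denominator explicitly.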

Frequently Asked Questions

Q: Why does bookriot.com block AI crawlers when major publishers like Penguin Random House do not?

A: The sealed snapshot records the existence of the block, not the rationale. bookriot.com is an editorial content site — its product is original writing about books. Major commercial publishers like penguinrandomhouse.com publish primarily promotional content on their public sites; their actual book content is behind retail paywalls. The robots.txt blocking behavior tracks with that difference in what is publicly indexable and what each site has to lose from AI summarization.

Q: What does it mean that gutenberg.org — an archive of public-domain texts — has no AI-crawler restrictions?

A: It means exactly what it looks like: gutenberg.org has not added any disallow directives for AI crawlers in this snapshot. Public-domain texts by definition carry no current copyright claim, so an archive like Project Gutenberg has little motivation to restrict AI training access to its collection. The absence of a block is consistent with the site's open-access mission.

Q: Is bookbub.com blocking AI crawlers?

A: In this snapshot, bookbub.com returned no parseable robots.txt. That means it has not published a public AI-access policy through this mechanism. By convention, the absence of a robots.txt means crawlers may proceed. bookbub.com is excluded from the block-rate calculation because a site counts toward the withRobots denominator only when a parseable robots.txt file is present.

Q: How should someone building a book-recommendations AI interpret the openness of this category?

A: With precision. The 7 open sites cover distinct content types: a catalog retailer, trade publications, a literary journal, a reader-community platform, a major publisher, and a public-domain archive. Each has different content, different ownership claims, and potentially different licensing stances. Robots.txt openness means no technical access barrier through this mechanism — it does not mean the content is licensed for AI training or commercial use. Terms of service and specific licensing agreements govern that.

Put AI-Access Data to Work

Three specific professional audiences derive recurring, automatable value from monitoring the Books category AI-access landscape.

AI product teams building book recommendation engines, literary AI assistants, or reading-list tools that draw on public web content need to track whether bookriot.com expands its block or whether the seven currently open sites add restrictions. A weekly automated re-crawl of these 9 Book sites, with alerts triggered by any change to a Disallow directive, gives engineering and content teams early warning before a source disappears from a training or retrieval pipeline. US Tech Automations builds and operates exactly this kind of automated monitoring: scheduled robots.txt crawls, change-diff alerting, and API-integrated notifications.
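
A change-diff check of the kind described above might look like the following. This is a hedged sketch using Python's standard `difflib`, not a description of any vendor's implementation; the function names are hypothetical, and `GPTBot` appears only as an example token.

```python
import difflib
from typing import List


def directive_lines(robots_txt: str) -> List[str]:
    """Keep only User-agent and Disallow lines, with comments stripped."""
    kept = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith(("user-agent:", "disallow:")):
            kept.append(line)
    return kept


def directive_changes(old: str, new: str) -> List[str]:
    """Return added (+) and removed (-) directive lines between two crawls."""
    diff = difflib.unified_diff(
        directive_lines(old), directive_lines(new), lineterm="", n=0
    )
    return [
        d for d in diff
        if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))
    ]


last_week = "User-agent: *\nDisallow:\n"
this_week = "User-agent: *\nDisallow:\n\nUser-agent: GPTBot\nDisallow: /\n"

changes = directive_changes(last_week, this_week)
if changes:
    print("ALERT: directives changed:", changes)
```

An empty result means no User-agent or Disallow line changed between the two crawls; anything else is a candidate alert. Scheduling, fetching, and notification delivery are left out of this sketch.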

Publishers, literary agents, and rights teams tracking how AI operators access book-industry content benefit from a repeatable audit comparing the current state of these 9 sites against the sealed June 14, 2026 baseline. If penguinrandomhouse.com or barnesandnoble.com adds a blocking directive in a future snapshot, that is a material policy shift that affects AI-licensing conversations. A quarterly re-audit against sha 27ca61d890a647db creates a time-stamped record useful for rights negotiations or regulatory disclosure.
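
A re-audit against a baseline could be sketched as below. Several things here are assumptions: the report does not say how its snapshot hash is computed, so the canonical-JSON serialization and the 16-character truncation in `seal` are illustrative and will not reproduce the published value. The `current` record contains a purely hypothetical future change, used only to show the comparison.

```python
import hashlib
import json
from typing import Dict, List, Optional

# True = blocks an AI crawler, False = open, None = no parseable robots.txt.
Status = Dict[str, Optional[bool]]


def seal(results: Status) -> str:
    """Content-address a result set (assumed scheme: SHA-256 of sorted JSON)."""
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def changed_sites(baseline: Status, current: Status) -> List[str]:
    """List sites whose status differs between two audits."""
    all_sites = baseline.keys() | current.keys()
    return sorted(s for s in all_sites if baseline.get(s) != current.get(s))


# Baseline statuses as reported for the Books category in this snapshot.
baseline: Status = {
    "bookriot.com": True,
    "librarything.com": False,
    "barnesandnoble.com": False,
    "kirkusreviews.com": False,
    "publishersweekly.com": False,
    "literaryhub.com": False,
    "gutenberg.org": False,
    "penguinrandomhouse.com": False,
    "bookbub.com": None,
}

# Hypothetical future audit in which one site has added a block.
current: Status = dict(baseline, **{"penguinrandomhouse.com": True})

if seal(current) != seal(baseline):
    print("Drift since baseline:", changed_sites(baseline, current))
```

Comparing hashes first gives a cheap yes/no answer on drift; the per-site comparison then identifies what moved, which is the part a rights team would want time-stamped.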

Competitive-intelligence teams at editorial book platforms want to know whether peers are restricting AI access and when that changes. bookriot.com is the only editorial outlet in this category that has done so in this snapshot. Monitoring whether kirkusreviews.com, literaryhub.com, or publishersweekly.com follow suit — or whether bookriot.com reverses its policy — is actionable intelligence for content strategy.

US Tech Automations automates this monitoring — turning a sealed snapshot into a live alert system for any team that needs to stay current on AI-access drift.

For comparison, see how Genealogy sites approach AI-access policy — a category with a similarly open default — and the Outdoors category report for a mid-range block rate with an interesting robots.txt coverage pattern.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 27ca61d890a647db).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Book Sites Block AI Crawlers? 1 of 8 Do.” https://ustechautomations.com/resources/blog/do-book-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 27ca61d890a647db

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.