Do Museum Sites Block AI Crawlers? 1 of 9 Do

Jun 18, 2026

Museums sit in a strange spot in the AI-access debate. They are the institutions that exist to make culture public, yet they also steward digitized collections worth protecting — and that tension shows up in exactly one robots.txt file.

1 of 9 Museum sites blocks at least one AI crawler.

Of the museum domains we checked, 9 returned a parseable robots.txt — the root-level file that tells automated agents which paths they may fetch — and a single one of those disallows an AI crawler. That works out to an 11.1% block rate. Every figure here is read straight from the sealed snapshot; nothing is estimated, modeled, or extrapolated.

The lone blocker is si.edu, the Smithsonian Institution. The other eight museums with a policy leave the door open. Across the corpus, 305 of the 1123 sites with a robots.txt block at least one crawler, a 27.2% rate, so museums sit at about 41% of the average and rank among the more open culture categories in this edition.
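The arithmetic behind those two rates is easy to check. A minimal sketch, using only the counts quoted above; the function name `block_rate` is ours for illustration and is not part of the snapshot tooling:

```python
def block_rate(blockers: int, policied: int) -> float:
    """Percentage of policied sites that block at least one AI crawler."""
    if policied == 0:
        raise ValueError("no policied sites, so the rate is undefined")
    return 100.0 * blockers / policied


museums = block_rate(1, 9)       # 1 blocker among 9 parseable files
corpus = block_rate(305, 1123)   # corpus-wide counts quoted in the text

print(f"museums: {museums:.1f}%")
print(f"corpus:  {corpus:.1f}%")
print(f"museums / corpus: {museums / corpus:.2f}")
```

The ratio comes out near 0.41, which is the sense in which museums sit under half the corpus average.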

The One Museum That Gates, and the Eight That Do Not

What makes museums distinctive is not how many block, but which one does — and how comprehensively. si.edu is the only gate in the set, and it is not a half-measure. The Smithsonian's robots.txt carries a Disallow: / group for GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, Meta-ExternalAgent, Amazonbot, and Applebot-Extended — that is OpenAI, Anthropic, Google, Common Crawl, ByteDance, Meta, Amazon, and Apple, all named explicitly. When the Smithsonian closes, it closes the whole leaderboard at once.
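To see how a grouped rule like that reads to a parser, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text is a hypothetical reconstruction from the description above, not a copy of the file si.edu serves, and the test URL and the `SomeOtherBot` agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group naming eight AI tokens, closed at the root.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Amazonbot
User-agent: Applebot-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.org/collections/object/1"  # illustrative URL only
for agent in ("GPTBot", "ClaudeBot", "CCBot", "SomeOtherBot"):
    # Named agents are refused everywhere; an unnamed agent is not covered.
    print(agent, parser.can_fetch(agent, url))
```

Because the sketch has no `User-agent: *` group, an agent the file does not name falls through to the default of being allowed.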

The open museums are a who's-who of the art world: moma.org, britishmuseum.org, getty.edu, guggenheim.org, tate.org.uk, rijksmuseum.nl, nga.gov, and louvre.fr. None of them disallows an AI agent. A national gallery or a flagship modern-art museum runs on public reach — the collection database, the exhibition pages, the scholarship are meant to be found, cited, and surfaced, including by an AI assistant fielding a question about a painting.

The only museum blocker in the set is si.edu, the Smithsonian Institution.

One museum domain, metmuseum.org, returned no parseable robots.txt at the seal (it answered with a rate-limited "too many requests" response). It is therefore silent: neither an allow nor a block, and excluded from the rate entirely. That single rate-limited response is why the denominator is 9 rather than the 10 sites we checked. It would be wrong to read the Met's silence as a stance; it is an artifact of a busy host at one moment in time.

What This 11.1% Block Rate Actually Means

A robots.txt directive is a public request addressed to crawlers, and eight of the nine museums with a policy make no such request of AI agents. The honest interpretation is that, as a category, museums behave far more like open publishers than like data fortresses. The collections they have spent years digitizing are, for most of them, an outreach asset rather than a competitive moat, so keeping them readable by retrieval agents extends the institution's reach rather than threatening it.

The Smithsonian is the instructive exception. As a federally chartered complex of museums and research centers, it has both an unusually large digitized holding and a clear institutional reason to control bulk automated harvesting of it. That single decision is the entire museum block rate. In a nine-file sample, one comprehensive blocker is enough to put a number on the board, and it lands the category at 11.1%.

The small sample sharpens this rather than weakening it. With nine policied files out of ten domains checked, the read is really a story about ten named institutions and one decision at si.edu. That concentration is itself the finding: in museums, AI-access posture is not set by a broad wave of gating but by whether the largest stewards of digitized collections choose to wall them off. Track those stewards and you have tracked most of what moves the category's number.

Museum sites post an 11.1% AI-crawler block rate.

This is a different shape of story than the most-gated categories in the edition. Where news sites overwhelmingly block AI crawlers because their archives are the product, museums treat their archives as a reason to be visited. The contrast is the point: a 27.2% corpus average hides categories that range from culture-as-outreach to data-as-asset, and museums sit firmly on the outreach side.

Where Museums Sit Among Similar Categories

An 11.1% block rate places Museums in the lower-middle of the ranking — open, but not at the zero-block floor. The focused window below shows Museums beside its nearest neighbors, verbatim from the sealed snapshot, name first and no rank column.

Category	Sites	With robots.txt	Block at least 1 crawler	Block rate
Numismatics	10	8	1	12.5%
Museums	10	9	1	11.1%
Religion	10	9	1	11.1%
Insurance	10	9	1	11.1%
Cybersecurity	10	9	1	11.1%
Coffee	10	9	1	11.1%
Productivity	10	10	1	10%

Museums share their 11.1% reading with a broad, unglamorous band — Religion, Insurance, Cybersecurity, and Coffee all land on the same single-blocker mark. It is a crowded part of the ranking, which is itself a sign that one-in-nine-or-ten is a common posture: most sites in these categories want to be readable. The extremes show what the ends look like:

Category	Sites	With robots.txt	Block at least 1 crawler	Block rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Ticketing	10	9	0	0%
Hotels	10	3	0	0%

Museums sit far below Gaming and News, and a notch above the zero-block floor that Ticketing and Hotels share. The category is open by disposition, gated by exception.

The Bots the Smithsonian Reaches For

The single museum blocker is comprehensive, so the more useful corpus context is which bots get gated most broadly — the tokens an institution names first when it decides to close. The cut below shows the most-disallowed bots across all 1123 sites with a robots.txt, bot name first, count next.

Bot	Sites disallowing (of 1123)	Rate
CCBot	228	20.3%
GPTBot	204	18.2%
ClaudeBot	202	18%
Bytespider	195	17.4%
Meta-ExternalAgent	174	15.5%

CCBot, Common Crawl's agent, tops the corpus blocklist at 228 sites, with GPTBot and ClaudeBot close behind. si.edu names all five of these — and more — in its disallow group, so the Smithsonian is not improvising; it is gating the highest-volume training crawlers the whole corpus gates first, just all at once.

Corpus-wide, 305 of 1123 sites block at least one AI crawler.

How the Museum Snapshot Was Sealed

These figures come from one point-in-time crawl of public robots.txt files, sealed June 18, 2026 under snapshot sha 74d390d8f5175d21. For each museum domain we fetched robots.txt at the root, parsed its user-agent and disallow directives, and recorded whether any AI crawler token was disallowed. We report verbatim counts; nothing is estimated, modeled, or extrapolated. The one domain with no parseable file — metmuseum.org, which returned a rate-limited response — is logged as silent, neither allow nor block.

The counting rule is deliberately narrow. A block is an explicit Disallow aimed at a named AI agent — GPTBot, ClaudeBot, CCBot, and the other leaderboard tokens. A museum can disallow administrative, search, or print-view paths without naming an AI agent, and that does not count as an AI block here. Only a directive that names one moves a site into the blocker column, which is why the museum count is a clean 1: si.edu names them, the rest do not.
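One way to express that rule in code is sketched below. It rests on assumptions the text does not settle: the token list is just the eight names quoted in this report (the full leaderboard list is not given here), any non-empty Disallow in a group counts as a block, and tokens are matched case-insensitively as whole names. `named_ai_blocks` is a hypothetical helper, not the snapshot's own parser.

```python
# Assumed token list: the eight names quoted in this report, lowercased.
AI_TOKENS = {"gptbot", "claudebot", "google-extended", "ccbot", "bytespider",
             "meta-externalagent", "amazonbot", "applebot-extended"}


def named_ai_blocks(robots_txt: str) -> set[str]:
    """Return AI tokens that sit in a group with a non-empty Disallow."""
    blocked: set[str] = set()
    agents: list[str] = []   # user-agents of the group being read
    in_rules = False         # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:
                # Wildcard groups never match: "*" is not an AI token.
                blocked.update(a for a in agents if a in AI_TOKENS)
    return blocked


generic = "User-agent: *\nDisallow: /admin/\n"
named = generic + "\nUser-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n"
print(named_ai_blocks(generic))          # set()
print(sorted(named_ai_blocks(named)))    # ['ccbot', 'gptbot']
```

The first file disallows an administrative path for everyone and names no AI agent, so it stays out of the blocker column; the second names two tokens and moves in.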

A note on what the snapshot deliberately does not do. It does not retry a slow host until a file appears, does not follow a redirect into a different domain's policy, and does not infer a block from a site that merely looks unfriendly to bots.
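Those exclusions can be read as a classification over a single fetch result. The sketch below is one possible reading, not the snapshot's actual code: the `Fetch` record, the `classify` function, the "at least one directive line" test for parseability, and the crude same-site check that ignores a leading `www.` are all assumptions, as is the use of HTTP 429 for a rate-limited answer.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

DIRECTIVES = ("user-agent:", "disallow:", "allow:")


@dataclass(frozen=True)
class Fetch:
    requested_url: str   # the robots.txt URL that was asked for
    final_url: str       # the URL that actually answered, after redirects
    status: int          # HTTP status code of the single read
    body: str            # response text, possibly empty


def site(url: str) -> str:
    """Hostname without a leading 'www.', a crude same-site test."""
    return (urlsplit(url).hostname or "").lower().removeprefix("www.")


def classify(fetch: Fetch) -> str:
    """Label one single-read result as 'policied' or 'silent'. No retries."""
    same_site = site(fetch.requested_url) == site(fetch.final_url)
    has_directive = any(
        line.strip().lower().startswith(DIRECTIVES)
        for line in fetch.body.splitlines()
    )
    if fetch.status == 200 and same_site and has_directive:
        return "policied"
    return "silent"   # rate-limited, errored, cross-domain, or unparseable


robots = "https://museum.example/robots.txt"   # hypothetical domain
print(classify(Fetch(robots, robots, 429, "Too Many Requests")))
print(classify(Fetch(robots, robots, 200, "User-agent: *\nDisallow: /admin/")))
```

A silent result is dropped from the denominator instead of being counted as an allow, which is how ten checked domains become nine policied files.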

Each museum domain is read once, at seal time, exactly as it answered. That single-read rule is what makes the result content-addressable: anyone holding sha 74d390d8f5175d21 can re-derive the same nine policied files and the same one blocker. The cost is that the Met, briefly rate-limiting at seal, lands in the silent bucket rather than the allow column — the method favors reproducibility over a generous reading.
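The report does not say how its snapshot sha is derived, so the following is only a sketch of the general idea of content addressing: hash a canonical serialization of the records so that the same data always yields the same identifier. The record fields, the canonical JSON form, and the 16-character truncation are assumptions, and this code is not expected to reproduce 74d390d8f5175d21.

```python
import hashlib
import json


def snapshot_digest(records: list[dict], length: int = 16) -> str:
    """Hash per-domain records into a short hex identifier."""
    # Sort by domain and serialize with sorted keys so that neither record
    # order nor key order can change the resulting hash.
    canonical = json.dumps(
        sorted(records, key=lambda r: r["domain"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Hypothetical record shape; the field names are illustrative.
records = [
    {"domain": "si.edu", "state": "policied", "blocks_ai": True},
    {"domain": "moma.org", "state": "policied", "blocks_ai": False},
    {"domain": "metmuseum.org", "state": "silent", "blocks_ai": None},
]

digest = snapshot_digest(records)
print(digest)
print(snapshot_digest(list(reversed(records))) == digest)   # True
```

Any later re-read that changes a record, for example a silent domain that answers on a second attempt, would produce a different digest, which is why a single-read rule and a stable identifier go together.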

Frequently Asked Questions

Q: Which museum site blocks AI crawlers?

A: si.edu, the Smithsonian Institution. It is the only one of the 9 museums with a parseable robots.txt that disallows an AI crawler, and it does so comprehensively — naming GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, Meta-ExternalAgent, Amazonbot, and Applebot-Extended. That single gate is the entire 11.1% block rate.

Q: Why do the big art museums leave AI crawlers in?

A: Reach. moma.org, britishmuseum.org, getty.edu, the Louvre, the Rijksmuseum, and the others run on public discovery — their digitized collections and scholarship are meant to be found and cited, including by AI assistants. For an institution whose mission is public access, being readable extends that mission rather than threatening it.

Q: Does the 11.1% rate cover all the museum sites you found?

A: No. It covers the 9 sites that returned a parseable robots.txt. One more, metmuseum.org, produced no parseable file at the seal — it answered with a rate-limited response — so it is excluded from the rate rather than counted as an allow or a block.

Q: Does a Disallow in robots.txt actually stop an AI crawler?

A: Not by force. robots.txt is an honor-system standard: a cooperative crawler reads it and complies, but the file enforces nothing technically. si.edu signals that AI agents should stay out of its paths; each crawler decides whether to honor that request.

Put AI-Access Data to Work

For a museum digital director or collections-and-rights lead — the person who owns how a digitized collection appears online — this snapshot is a baseline worth watching. Most peers stay open while the Smithsonian gates comprehensively, and that mix can shift when a new rights policy or a board decision lands. The open-by-disposition posture museums share with the theme park sites that gate nothing is exactly the kind of norm a single flagship decision can move.

Set a recurring crawl that re-reads robots.txt for si.edu, getty.edu, britishmuseum.org, and your own domain weekly, and alert the moment any peer institution adds an AI crawler token to its disallow list — a single change at a flagship museum can reset the norm the rest of the field measures itself against. US Tech Automations runs exactly that kind of scheduled robots.txt crawl with change alerts, so a policy shift surfaces the week it lands rather than at the next annual audit.
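The comparison at the heart of such an alert is small. A minimal sketch, assuming each weekly read has already been reduced to a mapping from domain to the set of AI tokens it disallows; the function name `newly_blocked`, the placeholder domains, and the sample data are invented for illustration and do not describe any real institution or product API.

```python
def newly_blocked(previous: dict[str, set[str]],
                  current: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each domain to AI tokens blocked now but not in the prior read."""
    changes: dict[str, set[str]] = {}
    for domain, tokens in current.items():
        added = tokens - previous.get(domain, set())
        if added:
            changes[domain] = added
    return changes


# Invented example data on placeholder domains.
last_week = {"museum-a.example": {"gptbot", "ccbot"}, "museum-b.example": set()}
this_week = {"museum-a.example": {"gptbot", "ccbot"}, "museum-b.example": {"gptbot"}}

for domain, tokens in newly_blocked(last_week, this_week).items():
    print(f"ALERT {domain}: now disallows {', '.join(sorted(tokens))}")
```

Only additions trigger an alert here; a watchlist that also cares about institutions reopening would compare in the other direction as well.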

A second fit is an AI-search or GEO analyst tracking which cultural institutions remain eligible to surface in answer engines. Their job is to know, continuously, whether the collection pages they rely on are still readable, and whether a metmuseum.org-style silence is a transient rate limit or a hardening stance. US Tech Automations monitors that drift across a watchlist of domains and routes the alert when an institution flips, so the analyst is not re-checking files by hand. See how the agentic monitoring works, and you have a standing read on museum AI-access posture instead of a one-time count.

Corpus-wide, 298 of 1123 sites publish an llms.txt file.

Key Takeaways

Of the 9 Museum sites with a parseable robots.txt, 1 blocks at least one AI crawler — an 11.1% rate, well below the corpus average.
The only blocker is si.edu, the Smithsonian; it gates comprehensively, naming GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, Meta-ExternalAgent, Amazonbot, and Applebot-Extended.
The open museums (moma.org, britishmuseum.org, getty.edu, guggenheim.org, tate.org.uk, rijksmuseum.nl, nga.gov, and louvre.fr) do not disallow any named AI crawler.
metmuseum.org returned no parseable file at the seal (a rate-limited response) and is excluded from the rate.
Corpus-wide, 305 of 1123 sites (27.2%) gate at least one crawler, so museums sit well under half the average.

Curious how museum sites compare across every vertical? Our flagship study tracks how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 18, 2026 (snapshot sha 74d390d8f5175d21).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Museum Sites Block AI Crawlers? 1 of 9 Do.” https://ustechautomations.com/resources/blog/do-museum-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 74d390d8f5175d21
