
Do Photography Sites Block AI Crawlers? Sealed robots.txt Data

Jun 14, 2026

Photography platforms sit at an unusual intersection when it comes to AI access policy: they host some of the most commercially sensitive visual assets on the web, yet our sealed snapshot shows the category lands almost exactly at the corpus average.

Of 10 Photography sites checked, all 10 returned a parseable robots.txt, and 4 of those 10 block at least one AI crawler, a 40% block rate. That near-match with the 42% corpus average is the most revealing fact in this data: a sector defined by licensing disputes and image-scraping litigation is not, as of June 14, 2026, dramatically more restrictive than the web at large.

A robots.txt file is a publicly accessible text document that instructs crawlers — including AI training and retrieval bots — which paths or user agents to avoid. It operates on an honor system; no technical enforcement exists. What the sealed snapshot records is intent, not outcome.
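
To make the honor-system point concrete, here is a minimal sketch of how a compliant crawler might read such a file, using Python's standard urllib.robotparser module. The robots.txt body and the example.com URL are made up for illustration; they are not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: GPTBot is disallowed everywhere,
# every other crawler is allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/gallery/1"
    for bot in ("GPTBot", "CCBot"):
        print(bot, may_fetch(SAMPLE_ROBOTS_TXT, bot, url))
```

The check runs entirely on the crawler's side. A bot that never calls it can still request the page, which is why the snapshot records intent rather than outcome.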

Photography sites carry a 40% AI-crawler block rate.

Across 293 sites, 123 block at least one AI crawler — a 42% rate.

Key Takeaways

4 of 10 Photography sites block at least one AI crawler.

Shutterstock, dpreview.com, behance.net, and photographylife.com each carry at least one disallow directive targeting an AI bot.

The Photography block rate of 40% sits just under the 42% corpus average across all 293 sites.

Across all 293 sites in this edition, 123 block at least one AI crawler — the corpus-wide rate is 42%.

Only 48 of 293 sites publish an llms.txt file, a 16.4% adoption rate across the full corpus.

"4 of 10 Photography sites with a parseable robots.txt block at least one AI crawler — a 40% block rate as of June 14, 2026."

"Across 32 categories, 123 of 293 sites block at least one AI crawler, a 42% corpus-wide rate; Photography lands just below that line."

Which Photography Sites Are Blocking — and Which Are Not

The four blocking sites share a common thread: they are either commercial stock libraries or editorial communities with clear financial stakes in controlling downstream use of their imagery. Shutterstock.com is the most prominent example — a platform whose entire business model depends on licensing fees that AI training would circumvent.

Dpreview.com hosts extensive gear reviews, sample images, and forum content; its parent company has historically been protective of that archive. Behance.net is Adobe's portfolio network, where designers publish work they have not licensed for AI reproduction. Photographylife.com is an editorial publication funded by advertising and e-commerce; blocking bots may reflect either content-protection instincts or advertiser-driven policy.

The six sites that allow all AI crawlers span a different part of the market. Flickr.com, 500px.com, and smugmug.com are community hosting and portfolio platforms whose value proposition is discoverability — restricting crawlers would cut against their core network-effects business. Gettyimages.com is notable: a licensing giant that might plausibly block, yet as of this snapshot its robots.txt does not disallow AI agents. Petapixel.com and fstoppers.com are editorial media properties; their revenue from pageviews may make crawler access more welcome than harmful.

This split illustrates why a single block rate cannot summarize a category. The blockers are protecting licensable assets; the allowers are chasing reach or have determined that other mechanisms — watermarks, legal terms, API gating — are more effective than robots.txt.

For context on how Photography compares across content verticals, see our report on fitness sites and AI crawlers, which covers a category with an identical 40% block rate from a very different set of motivations.

Where Photography Sits in the 32-Category Corpus

The table below shows every category in the June 2026 edition, ordered by block rate. Photography shares a cluster at 40% with Social, Sports, and Fitness — all categories where discoverability matters as much as protection.

Category	Sites Checked	With robots.txt	Blocking	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Parenting	10	8	5	62.5%
Reference	14	11	6	54.5%
Science	10	10	5	50%
Automotive	10	9	4	44.4%
HomeGarden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40%
Sports	10	10	4	40%
Fitness	10	10	4	40%
Photography	10	10	4	40%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Legal	10	7	2	28.6%
RealEstate	10	7	2	28.6%
Pets	10	7	2	28.6%
Crafts	10	8	2	25%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Crypto	9	8	1	12.5%
Religion	10	9	1	11.1%
Nonprofit	10	6	0	0%
Streaming	10	10	0	0%

Gaming (88.9%) and News (82.4%) top the chart; both are sectors with intense intellectual-property concerns and a history of scraping litigation. Photography at 40% sits exactly at the median of this 32-category set (the 16th and 17th ranked categories are both at 40%), which is itself an interesting signal: it is neither a fortress sector nor an open one.
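
The block-rate arithmetic can be reproduced from the table above. The sketch below copies the "With robots.txt" and "Blocking" columns, divides one by the other, and takes the median across categories. The variable and function names are ours for illustration; only the counts come from the table.

```python
from statistics import median

# (category, sites with a parseable robots.txt, sites blocking),
# copied from the category table.
ROWS = [
    ("Gaming", 9, 8), ("News", 17, 14), ("Food", 10, 7), ("Tech", 13, 9),
    ("Entertainment", 9, 6), ("Healthcare", 9, 6), ("Music", 9, 6),
    ("Parenting", 8, 5), ("Reference", 11, 6), ("Science", 10, 5),
    ("Automotive", 9, 4), ("HomeGarden", 9, 4), ("Fashion", 7, 3),
    ("Social", 10, 4), ("Sports", 10, 4), ("Fitness", 10, 4),
    ("Photography", 10, 4), ("Jobs", 8, 3), ("Travel", 9, 3),
    ("Weather", 6, 2), ("Legal", 7, 2), ("RealEstate", 7, 2),
    ("Pets", 7, 2), ("Crafts", 8, 2), ("Finance", 11, 2),
    ("Retail", 12, 2), ("Education", 7, 1), ("Government", 8, 1),
    ("Crypto", 8, 1), ("Religion", 9, 1), ("Nonprofit", 6, 0),
    ("Streaming", 10, 0),
]


def block_rate(with_robots: int, blocking: int) -> float:
    """Block rate in percent; the denominator is sites with a parseable file."""
    return 100 * blocking / with_robots


rates = {name: block_rate(w, b) for name, w, b in ROWS}
corpus_with_robots = sum(w for _, w, _ in ROWS)
corpus_blocking = sum(b for _, _, b in ROWS)

if __name__ == "__main__":
    print("Photography:", round(rates["Photography"], 1))
    print("Category median:", round(median(rates.values()), 1))
    print("Corpus:", corpus_blocking, "of", corpus_with_robots,
          "=", round(block_rate(corpus_with_robots, corpus_blocking), 1))
```

Note that the median is taken over category rates, so a 6-site category counts as much as a 17-site one. The corpus-wide 42% weights every site equally instead.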

The parenting sites report is worth reading alongside this one — Parenting sits at 62.5%, substantially above Photography despite having fewer obvious commercial IP stakes, which suggests that community-platform governance (not just licensing revenue) drives blocking decisions.

The Operator and Bot Picture Across All 293 Sites

The category-level block rate answers "how many sites block something." The leaderboard tables answer "who is being blocked, and how often." These counts span the full corpus of 293 sites, not just Photography.

Operator	Sites Blocking (of 293)
Common Crawl	97
Anthropic	93
Meta	80
OpenAI	77
ByteDance	75
Perplexity	69
Apple	67
Google	66
Cohere	63
Diffbot	60
Amazon	56
Mistral	23

Bot Token	Sites Blocking (of 293)	Block Rate
CCBot	97	33.1%
ClaudeBot	87	29.7%
Bytespider	75	25.6%
GPTBot	74	25.3%
Meta-ExternalAgent	70	23.9%
PerplexityBot	68	23.2%
Applebot-Extended	67	22.9%
Google-Extended	66	22.5%
Amazonbot	56	19.1%

Common Crawl's CCBot leads at 97 sites blocked, a function of its age and ubiquity in AI training datasets. Anthropic's ClaudeBot is second at 87 sites, reflecting that many publishers have added explicit disallows after the wave of public discourse about LLM training. Operator totals can exceed the matching token totals (Anthropic at 93 versus ClaudeBot at 87), presumably because an operator can run more than one crawler token. Mistral's bot appears in only 23 disallow lists, most likely because it is newer and less familiar to webmasters maintaining robots.txt files manually.

How the Snapshot Was Sealed

This report draws from the Closing Web snapshot sealed June 14, 2026 (sha a5ca246fbdc79954). The methodology is point-in-time collection: we fetched the robots.txt file at the root of each domain, parsed every User-agent block, and flagged any site that carried a disallow directive for at least one of the 9 tracked AI crawler tokens.

A site is marked "blocking" if any one of those tokens is disallowed, even if others are permitted. A site with no parseable robots.txt is recorded separately and is not counted in the block-rate denominator. Nothing is estimated, modeled, or extrapolated; every count is a verbatim read from the sealed file set.
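
The classification rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code that produced the snapshot: it treats a missing or empty file as "not parseable", counts only Disallow lines with a non-empty path, and looks only at groups that name a tracked token explicitly (a wildcard User-agent: * group is ignored). The nine tokens are the ones in the bot table; the function names are hypothetical.

```python
from typing import Optional

TRACKED_TOKENS = (
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Meta-ExternalAgent",
    "PerplexityBot", "Applebot-Extended", "Google-Extended", "Amazonbot",
)


def blocked_tokens(robots_text: Optional[str]) -> Optional[set]:
    """Tracked tokens with a non-empty Disallow, or None if no usable file."""
    if robots_text is None or not robots_text.strip():
        return None  # assumption: missing or empty means "not parseable"
    tracked = {token.lower(): token for token in TRACKED_TOKENS}
    blocked = set()
    group = []         # user agents named by the current group
    seen_rule = False  # True once the group has an Allow or Disallow line
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules opens a new group
                group, seen_rule = [], False
            group.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value:
                blocked.update(tracked[a] for a in group if a in tracked)
    return blocked


def summarize(files: dict) -> tuple:
    """Return (blocking sites, sites with a parseable file, rate in percent)."""
    results = [blocked_tokens(text) for text in files.values()]
    parseable = [r for r in results if r is not None]
    blocking = sum(1 for r in parseable if r)
    rate = 100 * blocking / len(parseable) if parseable else 0.0
    return blocking, len(parseable), rate


if __name__ == "__main__":
    # Hypothetical domains and file bodies, for illustration only.
    sample = {
        "a.example": "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n",
        "b.example": "User-agent: *\nDisallow: /private\n",
        "c.example": None,
    }
    print(summarize(sample))
```

In the sample, c.example has no file and drops out of the denominator, so one blocking site out of two parseable ones gives 50%.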

The 339 domains in this edition span 32 categories. Of those, 293 returned a parseable robots.txt. The 123 blocking sites represent 42% of those 293.

A note on interpretation: robots.txt signals declared intent. A site that disallows CCBot but not GPTBot has made a deliberate or inadvertent distinction — both are plausible. A site that allows every bot may have a different enforcement posture via API terms or legal action. This data does not measure compliance or enforcement; it measures the public signal each domain has chosen to publish.

Frequently Asked Questions

Q: Does a 40% Photography block rate mean most image content is off-limits to AI crawlers?

A: Not necessarily. A block in robots.txt is a declared preference, not a technical barrier. Crawlers that honor the standard will skip the disallowing domains; crawlers that do not may index them anyway. The 40% figure reflects how many Photography sites we checked have expressed that preference as of June 14, 2026 — it does not measure how much image content is actually indexed.

Q: Why would a platform like Flickr or Getty allow crawlers when their images are commercially licensed?

A: Platform strategy varies. Some rights-holders believe that AI-indexed content drives discovery and inbound licensing demand. Others may enforce access through separate terms of service, DMCA processes, or API controls rather than robots.txt. The sealed snapshot only records the robots.txt signal, not the full legal or technical posture.

Q: Is the Photography block rate likely to change?

A: This is a point-in-time snapshot — cross-sectional only. No trend claims are possible from a single observation. To detect drift, the snapshot would need to be re-run at a later date and the two reads compared. That is exactly what a recurring monitoring workflow does.
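
As a rough sketch of that comparison, suppose each read is stored as a mapping from domain to the set of AI crawler tokens it disallows. That storage shape, the function name, and the example domains are our assumptions for illustration; the published snapshot may be structured differently.

```python
def diff_snapshots(earlier: dict, later: dict) -> dict:
    """Report token-level changes for domains present in both reads."""
    changes = {}
    for domain in sorted(earlier.keys() & later.keys()):
        added = later[domain] - earlier[domain]
        removed = earlier[domain] - later[domain]
        if added or removed:
            changes[domain] = {
                "added": sorted(added),
                "removed": sorted(removed),
            }
    return changes


if __name__ == "__main__":
    # Hypothetical reads of two hypothetical domains.
    june = {"a.example": {"CCBot"}, "b.example": set()}
    july = {"a.example": {"CCBot"}, "b.example": {"GPTBot", "ClaudeBot"}}
    print(diff_snapshots(june, july))
```

Domains that appear in only one read are skipped here, since a change in the sample itself is a different signal from a change in a site's policy.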

Q: What does it mean that all 10 Photography sites returned a parseable robots.txt?

A: It means every site in the Photography sample has published a publicly accessible robots.txt file. That is the prerequisite for any AI-access declaration — without a parseable file, no bot-level policy exists. Photography is one of several categories where the coverage rate is complete in this sample.

Q: How does the Photography block rate compare to Crafts or Religion?

A: Photography (40%) sits well above Crafts (25%) and substantially above Religion (11.1%). Those comparisons are within this same sealed dataset. Our crafts sites report and religion sites report cover those categories in detail.

Put AI-Access Data to Work

Three roles find the Photography category data immediately actionable, and each has a workflow that recurs on a schedule rather than running once.

An SEO or content-strategy lead at a photography media brand needs to know which competitor and peer domains have declared AI-access restrictions — because that shapes whose content will appear in AI-generated answers and whose will not. A practical workflow: re-run the Photography category crawl weekly; flag any new domain that adds a disallow for CCBot or ClaudeBot; update the content brief to emphasize topics where AI-visible competitors are absent. The trigger is a change in the competitor robots.txt set; the cadence is weekly.

A publisher RevOps lead at a licensing platform (think stock photography or portfolio hosting) needs to monitor whether the platform's declared policy has drifted from the legal team's stance on AI training. A concrete workflow: audit the platform's own robots.txt monthly against the current operator leaderboard (Common Crawl at 97, Anthropic at 93, OpenAI at 77) and confirm that every operator it intends to restrict is actually listed. A missed entry is an unintended gap in declared policy. The trigger is any internal policy update or a new operator appearing in the corpus.
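
One way to sketch that audit is with Python's standard urllib.robotparser: list the tokens the legal team intends to restrict, then report any that the file still lets fetch the site root. The function name and sample file are hypothetical, and the check is deliberately narrow: it tests the root path only, so a policy that restricts only subpaths would show up here as a gap. An operator may also run more than one token, so the intended list needs one entry per token rather than per operator.

```python
from urllib.robotparser import RobotFileParser


def policy_gaps(robots_text: str, intended_tokens: list) -> list:
    """Return intended tokens that the file still allows to fetch '/'."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [t for t in intended_tokens if parser.can_fetch(t, "/")]


if __name__ == "__main__":
    # Hypothetical robots.txt that forgot to list GPTBot.
    own_robots = (
        "User-agent: CCBot\n"
        "Disallow: /\n"
        "\n"
        "User-agent: ClaudeBot\n"
        "Disallow: /\n"
    )
    print(policy_gaps(own_robots, ["CCBot", "ClaudeBot", "GPTBot"]))
```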

A retrieval or data-pipeline engineer building a visual-content RAG system needs to know which Photography domains are crawl-permitted sources. A usable workflow: maintain a live allowlist derived from this snapshot, re-verify each domain weekly for robots.txt changes, and remove any newly-blocking domain from the fetch queue before the next training or indexing run.
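
A minimal sketch of the pruning step, assuming the weekly re-verification yields a mapping from domain to the set of crawler tokens it currently disallows. The mapping shape, the function name, and the ExampleRAGBot token are all hypothetical. Domains missing from the mapping are treated as unverified and dropped, which is a conservative choice rather than a requirement.

```python
from urllib.parse import urlsplit


def prune_queue(fetch_queue: list, blocked_by_domain: dict,
                crawler_token: str) -> list:
    """Keep URLs whose domain was re-verified and does not block the token."""
    kept = []
    for url in fetch_queue:
        host = urlsplit(url).hostname
        blocked = blocked_by_domain.get(host)
        if blocked is not None and crawler_token not in blocked:
            kept.append(url)
    return kept


if __name__ == "__main__":
    queue = [
        "https://a.example/p/1",
        "https://b.example/p/2",
        "https://c.example/p/3",  # not re-verified this week
    ]
    latest = {"a.example": set(), "b.example": {"ExampleRAGBot"}}
    print(prune_queue(queue, latest, "ExampleRAGBot"))
```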

US Tech Automations automates all three workflows — scheduled robots.txt fetches, change-alert routing, and AI-access policy dashboards — so teams get a notification the moment a domain shifts its stance rather than discovering the change weeks later. Start with the agentic workflows platform.

Curious how Photography sites compare across every vertical? Our flagship study tracks how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha a5ca246fbdc79954).


Cite this report

US Tech Automations Research, 2026-06 edition. “Do Photography Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-photography-sites-block-ai-crawlers-2026

Sealed snapshot sha256: a5ca246fbdc79954
