Do Photography Sites Block AI Crawlers? Sealed robots.txt Data
Photography platforms sit at an unusual intersection when it comes to AI access policy: they host some of the most commercially sensitive visual assets on the web, yet our sealed snapshot shows the category lands almost exactly at the corpus average.
Of 10 Photography sites checked, all 10 returned a parseable robots.txt, and 4 of those 10 block at least one AI crawler, a 40% block rate. That near-alignment with the 42% corpus average is the most revealing fact in this data: a sector defined by licensing disputes and image-scraping litigation is not, as of June 14, 2026, dramatically more restrictive than the web at large.
A robots.txt file is a publicly accessible text document that instructs crawlers — including AI training and retrieval bots — which paths or user agents to avoid. It operates on an honor system; no technical enforcement exists. What the sealed snapshot records is intent, not outcome.
Photography sites carry a 40% AI-crawler block rate.
Across 293 sites, 123 block at least one AI crawler — a 42% rate.
Key Takeaways
4 of 10 Photography sites block at least one AI crawler.
Shutterstock, dpreview.com, behance.net, and photographylife.com each carry at least one disallow directive targeting an AI bot.
The Photography block rate of 40% sits just under the 42% corpus average across all 293 sites.
Across the 293 sites in this edition that returned a parseable robots.txt, 123 block at least one AI crawler, a corpus-wide rate of 42%.
Only 48 of 293 sites publish an llms.txt file, a 16.4% adoption rate across the full corpus.
"4 of 10 Photography sites with a parseable robots.txt block at least one AI crawler, a 40% block rate as of June 14, 2026."
"123 of the 293 sites with a parseable robots.txt, spread across 32 categories, block at least one AI crawler, a 42% corpus-wide rate; Photography lands just below that line."
Which Photography Sites Are Blocking — and Which Are Not
The four blocking sites share a common thread: they are either commercial stock libraries or editorial communities with clear financial stakes in controlling downstream use of their imagery. Shutterstock.com is the most prominent example — a platform whose entire business model depends on licensing fees that AI training would circumvent.
Dpreview.com hosts extensive gear reviews, sample images, and forum content; its parent company has historically been protective of that archive. Behance.net is Adobe's portfolio network, where designers publish work they have not licensed for AI reproduction. Photographylife.com is an editorial publication funded by advertising and e-commerce; blocking bots may reflect either content-protection instincts or advertiser-driven policy.
The six sites that allow all AI crawlers span a different part of the market. Flickr.com, 500px.com, and smugmug.com are community hosting and portfolio platforms whose value proposition is discoverability — restricting crawlers would cut against their core network-effects business. Gettyimages.com is notable: a licensing giant that might plausibly block, yet as of this snapshot its robots.txt does not disallow AI agents. Petapixel.com and fstoppers.com are editorial media properties; their revenue from pageviews may make crawler access more welcome than harmful.
This split illustrates why a single block rate cannot summarize a category. The blockers are protecting licensable assets; the allowers are chasing reach or have determined that other mechanisms — watermarks, legal terms, API gating — are more effective than robots.txt.
For context on how Photography compares across content verticals, see our report on fitness sites and AI crawlers, which covers a category with an identical 40% block rate from a very different set of motivations.
Where Photography Sits in the 32-Category Corpus
The table below shows every category in the June 2026 edition, ordered by block rate. Photography shares a cluster at 40% with Social, Sports, and Fitness — all categories where discoverability matters as much as protection.
| Category | Sites Checked | With robots.txt | Blocking | Block Rate |
|---|---|---|---|---|
| Gaming | 9 | 9 | 8 | 88.9% |
| News | 20 | 17 | 14 | 82.4% |
| Food | 10 | 10 | 7 | 70% |
| Tech | 15 | 13 | 9 | 69.2% |
| Entertainment | 9 | 9 | 6 | 66.7% |
| Healthcare | 10 | 9 | 6 | 66.7% |
| Music | 10 | 9 | 6 | 66.7% |
| Parenting | 10 | 8 | 5 | 62.5% |
| Reference | 14 | 11 | 6 | 54.5% |
| Science | 10 | 10 | 5 | 50% |
| Automotive | 10 | 9 | 4 | 44.4% |
| HomeGarden | 10 | 9 | 4 | 44.4% |
| Fashion | 9 | 7 | 3 | 42.9% |
| Social | 10 | 10 | 4 | 40% |
| Sports | 10 | 10 | 4 | 40% |
| Fitness | 10 | 10 | 4 | 40% |
| Photography | 10 | 10 | 4 | 40% |
| Jobs | 10 | 8 | 3 | 37.5% |
| Travel | 9 | 9 | 3 | 33.3% |
| Weather | 10 | 6 | 2 | 33.3% |
| Legal | 10 | 7 | 2 | 28.6% |
| RealEstate | 10 | 7 | 2 | 28.6% |
| Pets | 10 | 7 | 2 | 28.6% |
| Crafts | 10 | 8 | 2 | 25% |
| Finance | 12 | 11 | 2 | 18.2% |
| Retail | 15 | 12 | 2 | 16.7% |
| Education | 9 | 7 | 1 | 14.3% |
| Government | 9 | 8 | 1 | 12.5% |
| Crypto | 9 | 8 | 1 | 12.5% |
| Religion | 10 | 9 | 1 | 11.1% |
| Nonprofit | 10 | 6 | 0 | 0% |
| Streaming | 10 | 10 | 0 | 0% |
Gaming (88.9%) and News (82.4%) top the chart; both are sectors with intense intellectual-property concerns and proven scraping litigation histories. Photography at 40% sits exactly at the median of this 32-category set (the 16th and 17th ranked categories are both at 40%), which is itself an interesting signal: it is neither a fortress sector nor an open one.
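The position of Photography in the ranking can be checked directly from the table. The sketch below copies the 32 values from the Block Rate column and takes their median with Python's standard `statistics` module; the variable names are illustrative and not part of the published dataset.

```python
from statistics import median

# Block rates (%) copied from the category table, top to bottom.
BLOCK_RATES = [
    88.9, 82.4, 70, 69.2, 66.7, 66.7, 66.7, 62.5,
    54.5, 50, 44.4, 44.4, 42.9, 40, 40, 40,
    40, 37.5, 33.3, 33.3, 28.6, 28.6, 28.6, 25,
    18.2, 16.7, 14.3, 12.5, 12.5, 11.1, 0, 0,
]

PHOTOGRAPHY_RATE = 40

category_median = median(BLOCK_RATES)
# Number of categories with a strictly higher block rate than Photography.
categories_above = sum(rate > PHOTOGRAPHY_RATE for rate in BLOCK_RATES)

print(len(BLOCK_RATES), category_median, categories_above)
```

With an even number of categories, the median is the mean of the two middle values, so the four-way tie at 40% determines the result.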
The parenting sites report is worth reading alongside this one — Parenting sits at 62.5%, substantially above Photography despite having fewer obvious commercial IP stakes, which suggests that community-platform governance (not just licensing revenue) drives blocking decisions.
The Operator and Bot Picture Across All 293 Sites
The category-level block rate answers "how many sites block something." The leaderboard tables answer "who is being blocked, and how often." These counts span the full corpus of 293 sites, not just Photography.
| Operator | Sites Blocking (of 293) |
|---|---|
| Common Crawl | 97 |
| Anthropic | 93 |
| Meta | 80 |
| OpenAI | 77 |
| ByteDance | 75 |
| Perplexity | 69 |
| Apple | 67 |
| Google | 66 |
| Cohere | 63 |
| Diffbot | 60 |
| Amazon | 56 |
| Mistral | 23 |
| Bot Token | Sites Blocking (of 293) | Block Rate |
|---|---|---|
| CCBot | 97 | 33.1% |
| ClaudeBot | 87 | 29.7% |
| Bytespider | 75 | 25.6% |
| GPTBot | 74 | 25.3% |
| Meta-ExternalAgent | 70 | 23.9% |
| PerplexityBot | 68 | 23.2% |
| Applebot-Extended | 67 | 22.9% |
| Google-Extended | 66 | 22.5% |
| Amazonbot | 56 | 19.1% |
Common Crawl's CCBot leads at 97 sites blocked — a function of its age and ubiquity in AI training datasets. Anthropic's ClaudeBot is second at 87 sites, reflecting that many publishers have added explicit disallows after the wave of public discourse about LLM training. Mistral's bot appears in only 23 disallow lists, most likely because it is newer and less familiar to webmasters maintaining robots.txt files manually.
How the Snapshot Was Sealed
This report draws from the Closing Web snapshot sealed June 14, 2026 (sha a5ca246fbdc79954). The methodology is point-in-time collection: we fetched the robots.txt file at the root of each domain, parsed every User-agent block, and flagged any site that carried a disallow directive for at least one of the 9 tracked AI crawler tokens.
A site is marked "blocking" if any one of those tokens is disallowed, even if others are permitted. A site with no parseable robots.txt is recorded separately and is not counted in the block-rate denominator. Nothing is estimated, modeled, or extrapolated; every count is a verbatim read from the sealed file set.
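The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used to produce the snapshot. It assumes a site counts as blocking only when a tracked token is named explicitly in a User-agent group that carries a non-empty Disallow rule; how the snapshot treats wildcard (`User-agent: *`) groups is not stated in this report, so the sketch ignores them. The function names and the sample file are hypothetical.

```python
# The nine bot tokens listed in the bot leaderboard table.
TRACKED_TOKENS = [
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Meta-ExternalAgent",
    "PerplexityBot", "Applebot-Extended", "Google-Extended", "Amazonbot",
]


def blocked_tokens(robots_txt: str) -> set[str]:
    """Return the tracked tokens that have at least one non-empty Disallow rule."""
    tracked = {token.lower(): token for token in TRACKED_TOKENS}
    blocked = set()
    current_agents = []  # user agents named by the group being read
    in_rules = False     # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # an empty Disallow permits everything
                for agent in current_agents:
                    if agent in tracked:
                        blocked.add(tracked[agent])
    return blocked


def is_blocking(robots_txt: str) -> bool:
    """A site is 'blocking' if any one tracked token is disallowed."""
    return bool(blocked_tokens(robots_txt))


# Hypothetical robots.txt, not taken from any site in the corpus.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow:

User-agent: *
Disallow: /private/
"""

print(sorted(blocked_tokens(SAMPLE)), is_blocking(SAMPLE))
```

In the sample, the empty `Disallow:` under ClaudeBot is not counted as a block, and the wildcard group is skipped under the stated assumption.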
The 339 domains in this edition span 32 categories. Of those, 293 returned a parseable robots.txt. The 123 blocking sites represent 42% of those 293.
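The denominator matters when reading the rates in this report. As a short worked example, the sketch below recomputes three of the published figures from the counts in the table; `block_rate` is an illustrative helper, not a function from the published pipeline.

```python
def block_rate(blocking: int, with_robots: int) -> float:
    """Percentage of sites with a parseable robots.txt that block an AI crawler."""
    return round(100 * blocking / with_robots, 1)


SITES_CHECKED = 339
WITH_ROBOTS = 293
BLOCKING = 123

excluded = SITES_CHECKED - WITH_ROBOTS  # domains with no parseable robots.txt
corpus_rate = block_rate(BLOCKING, WITH_ROBOTS)
photography_rate = block_rate(4, 10)
news_rate = block_rate(14, 17)  # 20 News sites checked, but only 17 count in the denominator

print(excluded, corpus_rate, photography_rate, news_rate)
```

The News row shows why the distinction is worth keeping in mind: dividing by sites checked (20) instead of sites with a parseable file (17) would give 70% rather than the published 82.4%.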
A note on interpretation: robots.txt signals declared intent. A site that disallows CCBot but not GPTBot has made a deliberate or inadvertent distinction — both are plausible. A site that allows every bot may have a different enforcement posture via API terms or legal action. This data does not measure compliance or enforcement; it measures the public signal each domain has chosen to publish.
Frequently Asked Questions
Q: Does a 40% Photography block rate mean most image content is off-limits to AI crawlers?
A: Not necessarily. A block in robots.txt is a declared preference, not a technical barrier. Crawlers that honor the standard will skip the disallowing domains; crawlers that do not may index them anyway. The 40% figure reflects how many Photography sites we checked have expressed that preference as of June 14, 2026 — it does not measure how much image content is actually indexed.
Q: Why would a platform like Flickr or Getty allow crawlers when their images are commercially licensed?
A: Platform strategy varies. Some rights-holders believe that AI-indexed content drives discovery and inbound licensing demand. Others may enforce access through separate terms of service, DMCA processes, or API controls rather than robots.txt. The sealed snapshot only records the robots.txt signal, not the full legal or technical posture.
Q: Is the Photography block rate likely to change?
A: This is a point-in-time snapshot — cross-sectional only. No trend claims are possible from a single observation. To detect drift, the snapshot would need to be re-run at a later date and the two reads compared. That is exactly what a recurring monitoring workflow does.
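Comparing two reads can be as simple as a set difference per domain. The sketch below assumes each snapshot has already been reduced to a mapping from domain to the set of AI crawler tokens it disallows; that data shape, the function name, and the placeholder domains are all assumptions made for illustration.

```python
def policy_drift(earlier, later):
    """Return per-domain token changes between two snapshots.

    Each snapshot maps a domain to the set of AI crawler tokens it disallows.
    Domains present in only one snapshot are ignored here.
    """
    drift = {}
    for domain in sorted(earlier.keys() & later.keys()):
        added = later[domain] - earlier[domain]
        removed = earlier[domain] - later[domain]
        if added or removed:
            drift[domain] = {"added": sorted(added), "removed": sorted(removed)}
    return drift


# Hypothetical snapshots; the domains are placeholders, not sites from the corpus.
first_read = {
    "example-a.test": set(),
    "example-b.test": {"CCBot"},
    "example-c.test": {"GPTBot"},
}
second_read = {
    "example-a.test": {"CCBot", "ClaudeBot"},
    "example-b.test": {"CCBot"},
    "example-c.test": set(),
}

changes = policy_drift(first_read, second_read)
print(changes)
```

Only domains whose declared policy changed appear in the result, so an unchanged site such as the second placeholder produces no entry.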
Q: What does it mean that all 10 Photography sites returned a parseable robots.txt?
A: It means every site in the Photography sample has published a publicly accessible robots.txt file. That is the prerequisite for any AI-access declaration — without a parseable file, no bot-level policy exists. Photography is one of several categories where the coverage rate is complete in this sample.
Q: How does the Photography block rate compare to Crafts or Religion?
A: Photography (40%) sits well above Crafts (25%) and substantially above Religion (11.1%). Those comparisons are within this same sealed dataset. Our crafts sites report and religion sites report cover those categories in detail.
Put AI-Access Data to Work
Three roles find the Photography category data immediately actionable, and each has a workflow that recurs on a schedule rather than running once.
An SEO or content-strategy lead at a photography media brand needs to know which competitor and peer domains have declared AI-access restrictions — because that shapes whose content will appear in AI-generated answers and whose will not. A practical workflow: re-run the Photography category crawl weekly; flag any new domain that adds a disallow for CCBot or ClaudeBot; update the content brief to emphasize topics where AI-visible competitors are absent. The trigger is a change in the competitor robots.txt set; the cadence is weekly.
A publisher RevOps lead at a licensing platform — think stock photography or portfolio hosting — needs to monitor whether their own policy drift is aligned with their legal team's stance on AI training. A concrete workflow: audit their own robots.txt monthly against the current operator leaderboard (Common Crawl at 97, Anthropic at 93, OpenAI at 77) and confirm that every operator they intend to restrict is actually listed. A missed entry is an unintended gap in declared policy. The trigger is any internal policy update or a new operator appearing in the corpus.
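One way to run that audit is to compare the tokens a team intends to restrict against the User-agent names that actually appear in its own file. The sketch below is a rough illustration under simple assumptions: it only checks whether a token is named in a User-agent line, not whether the rules under it are restrictive, and the intended-token set and sample file are hypothetical.

```python
def listed_agents(robots_txt: str) -> set[str]:
    """Collect every user-agent name declared in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agents.add(line.split(":", 1)[1].strip().lower())
    return agents


def missing_tokens(robots_txt: str, intended: set[str]) -> list[str]:
    """Return intended tokens that are not named anywhere in the file."""
    listed = listed_agents(robots_txt)
    return sorted(token for token in intended if token.lower() not in listed)


# Hypothetical policy: the tokens a legal team has asked to restrict.
INTENDED = {"CCBot", "ClaudeBot", "GPTBot"}

# Hypothetical robots.txt for the team's own domain.
OWN_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""

gaps = missing_tokens(OWN_ROBOTS, INTENDED)
print(gaps)
```

Any token returned here is a gap between the intended policy and the published one.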
A retrieval or data-pipeline engineer building a visual-content RAG system needs to know which Photography domains are crawl-permitted sources. A usable workflow: maintain a live allowlist derived from this snapshot, re-verify each domain weekly for robots.txt changes, and remove any newly-blocking domain from the fetch queue before the next training or indexing run.
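The last step of that workflow, dropping newly blocking domains from the queue, might look like the sketch below. It assumes the set of blocking domains has already been produced by a fresh robots.txt check, and it treats a leading `www.` as equivalent to the bare domain; the function name, URLs, and domains are placeholders.

```python
from urllib.parse import urlsplit


def prune_fetch_queue(fetch_queue, blocking_domains):
    """Keep only URLs whose host is not in the set of blocking domains."""
    kept = []
    for url in fetch_queue:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.test and example.test alike
        if host not in blocking_domains:
            kept.append(url)
    return kept


# Hypothetical queue and re-verification result; not domains from the corpus.
queue = [
    "https://www.example-a.test/photos/1",
    "https://example-b.test/gallery",
    "https://example-a.test/photos/2",
]
newly_blocking = {"example-a.test"}

pruned = prune_fetch_queue(queue, newly_blocking)
print(pruned)
```

Running the prune immediately before each training or indexing run keeps the queue consistent with the most recent declared policy.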
US Tech Automations automates all three workflows — scheduled robots.txt fetches, change-alert routing, and AI-access policy dashboards — so teams get a notification the moment a domain shifts its stance rather than discovering the change weeks later. Start with the agentic workflows platform.
Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha a5ca246fbdc79954).
Cite this report
US Tech Automations Research, 2026-06 edition. “Do Photography Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-photography-sites-block-ai-crawlers-2026
Sealed snapshot sha256: a5ca246fbdc79954