Do Stock Media Sites Block AI Crawlers? 4 of 8 Do

Jun 18, 2026

Stock media is the rare category where the product and the threat are the same thing. These sites sell licensed images, video, and vectors — and those same libraries are exactly what generative-AI image models train on. So when half the readable stock-media sites post a robots.txt line shutting AI crawlers out, it reads less like routine bot management and more like creators fencing off the raw material of their own competition.

4 of 8 Stock Media sites block at least one AI crawler.

Of the stock-media sites we checked, 8 returned a parseable robots.txt — the root-level file that tells automated agents which paths they may fetch — and 4 of those disallow an AI crawler. That is a 50% block rate, nearly double the corpus figure of 27.2% and one of the highest readings in this batch. Every number here is read straight from the sealed file; nothing is estimated, modeled, or extrapolated.

A robots.txt block is a posted, honor-system request — a line naming a crawler and asking it to stay out, not a technical wall. What makes stock media stand apart is that the request is so common here: four of the eight readable libraries chose to write one, and they did not all reach for the same tokens.

Who Is Gating the Crawlers Here

The four blockers span the marketplace giants and the free-image platforms, and the breadth of each block tells you how seriously that library treats the training-data question.

stock.adobe.com (Adobe Stock) disallows GPTBot, ClaudeBot, Google-Extended, CCBot, and Applebot-Extended, covering the OpenAI, Anthropic, Google, Common Crawl, and Apple operators. For a paid licensing marketplace, keeping that catalog out of training crawlers protects both the contributors who upload work and the licensing model that pays them.

pixabay.com runs the widest block in the category. Its file disallows GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, and Applebot-Extended — a near-sweep of the major training crawlers across the OpenAI, Anthropic, Google, Common Crawl, ByteDance, and Apple operators. That a free-image site gates this broadly is the surprise: it suggests the concern is about the imagery itself feeding models, not just about protecting a paywall.

123rf.com takes a middle path, disallowing GPTBot, ClaudeBot, Google-Extended, CCBot, and Amazonbot (the OpenAI, Anthropic, Google, Common Crawl, and Amazon operators) while leaving others unlisted.

dreamstime.com is the narrow outlier. It disallows exactly one token: Meta-ExternalAgent, the Meta operator's crawler. A single-bot block like this usually reflects a reaction to one specific crawler rather than a blanket anti-training stance.

Two of the four stock-media blockers — stock.adobe.com and pixabay.com — disallow the OpenAI, Anthropic, Google, Common Crawl, and Apple crawlers; pixabay.com adds ByteDance on top.

The four that allow everything are istockphoto.com, depositphotos.com, vecteezy.com, and freepik.com — a mix of a Getty-owned marketplace, two stock libraries, and a graphics-and-vector platform. None disallows an AI agent. Two more domains say nothing parseable at all: pexels.com refused our request and alamy.com returned a rate-limited response. With no readable file at seal time, both are logged as silent — excluded from the rate, never counted as an allow or a block.

What a 50% Block Rate Actually Means

A robots.txt block is a request, not an enforced barrier — the same way a posted "no trespassing" sign relies on people honoring it. The 50% figure measures how many readable stock-media files carry that sign for an AI crawler, not how many crawlers obeyed it. We divide by the 8 sites that published a parseable robots.txt, not by every stock-media site we looked at, so the rate stays clean: pexels.com and alamy.com, which returned no readable file, are neither allows nor blocks.
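The denominator rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline that produced the report: the function name `block_rate` and the status labels "block", "allow", and "silent" are assumptions chosen for the example, while the per-domain statuses are the ones reported in this article.

```python
from typing import Optional

def block_rate(statuses: dict[str, str]) -> Optional[float]:
    """Share of readable sites that block at least one AI crawler.

    Statuses are "block", "allow", or "silent" (hypothetical labels).
    Silent sites are dropped from the denominator entirely.
    """
    readable = [s for s in statuses.values() if s != "silent"]
    if not readable:
        return None  # no readable files, so no rate to report
    return readable.count("block") / len(readable)

# Statuses as reported in this article for the ten stock-media domains.
stock_media = {
    "stock.adobe.com": "block",
    "pixabay.com": "block",
    "123rf.com": "block",
    "dreamstime.com": "block",
    "istockphoto.com": "allow",
    "depositphotos.com": "allow",
    "vecteezy.com": "allow",
    "freepik.com": "allow",
    "pexels.com": "silent",
    "alamy.com": "silent",
}

rate = block_rate(stock_media)
print(f"{rate:.1%}")  # 50.0%
```

Dividing by all ten domains instead of the eight readable ones would give 40%, which would quietly treat the two silent sites as allowers.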

At 50%, stock media gates at nearly twice the corpus rate of 27.2%, and the reason is structural rather than incidental. Most categories in this edition are content-or-commerce sites deciding whether an AI assistant should read them for discovery. Stock-media libraries face a sharper calculus: their core asset — licensed visual work — is the literal training fuel for the image models that increasingly compete with them. When the thing you sell is also the thing being ingested, the incentive to post a disallow line is unusually direct, and four of eight libraries acted on it.

Stock Media sites post a 50% AI-crawler block rate.

The honest read is a category genuinely divided down the middle, and that split is itself the finding. It is not a sector in consensus. The paid marketplace Adobe Stock and the free platform Pixabay both gate broadly, while the Getty-owned istockphoto.com and the popular freepik.com stay fully open. Business model does not cleanly predict posture here — both a paywalled giant and a free library landed in the blocker column, and both a paid and a free service landed among the allowers.

How Stock Media Ranks Against Other Categories

A 50% reading places stock media near the upper third of the 138-category ranking — well above the corpus average, in a band shared with a handful of unrelated verticals. The focused window below shows stock media beside its nearest neighbors, verbatim from the sealed snapshot — category name first, no rank column.

Category	Sites	With robots.txt	Block ≥1 crawler	Block rate
Climbing	10	9	5	55.6%
StockMedia	10	8	4	50%
Reference	14	12	6	50%
Science	10	10	5	50%
Accounting	10	8	4	50%
Woodworking	10	10	5	50%
Automotive	10	9	4	44.4%

Stock media shares its 50% mark with Reference, Science, Accounting, and Woodworking — a band where roughly half of readable sites gate. It sits just below Climbing and above the 44.4% group, firmly in the upper portion of the ranking rather than the crowded middle where most categories cluster. The extremes table shows the full spread:

Category	Sites	With robots.txt	Block ≥1 crawler	Block rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Grocery	10	7	1	14.3%
Casinos	10	8	0	0%

Stock media sits below the Gaming and News peaks, where more than four in five sites with a readable robots.txt gate, but far above the bottom of the table, where grocery sites block at 14.3% and casino sites at 0%. Among consumer-facing commerce categories, a 50% rate is high, a reflection of how directly the training-data question lands on a business that licenses visual work.

Which Bots Stock-Media Libraries Gate Most

The four stock-media blockers are part of a much larger corpus pattern, and the corpus-wide tally shows which tokens sites name most often. The table below lists the most-disallowed bots across all 1123 sites with a parseable robots.txt: bot name first, then the number of sites disallowing it, then the rate.

Bot	Sites disallowing	Rate
CCBot	228	20.3%
GPTBot	204	18.2%
ClaudeBot	202	18%
Bytespider	195	17.4%
Meta-ExternalAgent	174	15.5%

CCBot, Common Crawl's agent, tops the corpus blocklist, with GPTBot and ClaudeBot close behind. Three of the four stock-media blockers (stock.adobe.com, pixabay.com, and 123rf.com) disallow both CCBot and GPTBot, the two most-disallowed bots in the corpus, in line with that broader pattern. dreamstime.com is the exception, naming only Meta-ExternalAgent, a token that 174 of the 1123 readable sites disallow.

Corpus-wide, 305 of 1123 sites block at least one AI crawler.
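The percentages in the bot table and the corpus rate can be re-derived from the published counts. The sketch below does only that arithmetic; the helper name `rate_pct` is an assumption for illustration, and the counts are the ones quoted in this article.

```python
READABLE_SITES = 1123  # sites with a parseable robots.txt, per this report

# Sites disallowing each bot, as listed in the table above.
disallow_counts = {
    "CCBot": 228,
    "GPTBot": 204,
    "ClaudeBot": 202,
    "Bytespider": 195,
    "Meta-ExternalAgent": 174,
}

def rate_pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place, matching the tables."""
    return round(100 * count / total, 1)

rates = {bot: rate_pct(n, READABLE_SITES) for bot, n in disallow_counts.items()}
corpus_rate = rate_pct(305, READABLE_SITES)   # sites blocking at least one AI crawler
ratio = round(50.0 / corpus_rate, 2)          # stock media versus the corpus

for bot, pct in rates.items():
    print(f"{bot}: {pct}%")
print(f"corpus: {corpus_rate}%, stock media is {ratio}x the corpus rate")
```

The ratio comes out at roughly 1.84, which is the basis for describing stock media as gating at nearly double the corpus rate.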

How the Stock-Media Snapshot Was Sealed

These figures come from one point-in-time crawl of public robots.txt files, sealed June 18, 2026 under snapshot sha 74d390d8f5175d21. For each stock-media domain we fetched robots.txt at the root, parsed its user-agent and disallow directives, and recorded whether any AI crawler token carried a Disallow. We report exactly what the files declared; nothing is estimated, modeled, or extrapolated. Domains with no parseable file — pexels.com and alamy.com — are logged as silent, neither allow nor block.

The counting rule is strict. A block is an explicit Disallow aimed at a named AI agent: GPTBot, ClaudeBot, CCBot, Bytespider, or another tracked token. A library can disallow administrative, search, or image-hotlink paths without naming an AI agent, and that does not count as an AI block here. Only a directive that names one moves a site into the blocker column, which is why the stock-media count is a clean 4. The sealed snapshot is content-addressed: anyone holding the snapshot can check it against sha 74d390d8f5175d21 and re-derive the same eight readable files and the same four blockers.
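One way to express the counting rule is a small group-aware parser. The sketch below is an assumption-laden illustration, not the report's actual parser: the function name `blocked_ai_tokens` is hypothetical, the token set contains only the tokens named in this article (the full tracked list is not given), and it assumes any non-empty Disallow under a named AI agent counts as a block.

```python
# Tokens named in this article; the report's full tracked list is not shown.
AI_TOKENS = {
    "gptbot", "claudebot", "ccbot", "bytespider", "google-extended",
    "applebot-extended", "amazonbot", "meta-externalagent",
}

def blocked_ai_tokens(robots_txt: str) -> set[str]:
    """Return the AI tokens that carry a non-empty Disallow by name."""
    blocked: set[str] = set()
    group_agents: list[str] = []  # user-agents of the group being read
    in_rules = False              # True once the group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:          # a user-agent after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            # An empty Disallow permits everything, so it is not a block.
            # A wildcard group ("*") never counts: the agent must be named.
            if field == "disallow" and value:
                blocked.update(a for a in group_agents if a in AI_TOKENS)
    return blocked

sample = """User-agent: *
Disallow: /admin/

User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow:
"""
print(sorted(blocked_ai_tokens(sample)))  # ['ccbot', 'gptbot']
```

The standard library's `urllib.robotparser` is deliberately not used here: its `can_fetch` falls back to the wildcard group, which would blur the distinction between a site that names an AI agent and one that simply restricts all bots.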

US Tech Automations runs this read across 1374 sites checked, 1123 with a parseable robots.txt, spanning 138 categories. Stock media contributes 8 of those readable files. The method deliberately does not retry a slow host until a file appears, does not follow a redirect into a different domain's policy, and does not infer a block from a site that merely looks unfriendly to bots — which is why pexels.com, refusing our request, and alamy.com, returning a rate-limited response, land in the silent bucket rather than the allow column.
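The three-way bucketing described above might look like the following. Everything here is a hedged sketch: `classify` and `same_site` are hypothetical names, treating only HTTP 200 as readable is an assumption, the 403 and 429 codes stand in for "refused" and "rate-limited" (the report does not state the exact status codes), and accepting a `www.` prefix as the same site is also an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit

def same_site(domain: str, final_url: str) -> bool:
    """True if the final URL is still on the checked domain (or its www host)."""
    host = (urlsplit(final_url).hostname or "").lower()
    return host == domain or host == "www." + domain

def classify(domain: str, status: int, final_url: str,
             blocked_tokens: Optional[set]) -> str:
    """Bucket one fetch result as "silent", "allow", or "block".

    blocked_tokens is None when no parseable file came back; otherwise it
    is the set of AI tokens that the file disallows by name.
    """
    if status != 200 or blocked_tokens is None:
        return "silent"   # refused, rate-limited, or unparseable: no inference
    if not same_site(domain, final_url):
        return "silent"   # a redirect led to another domain's policy
    return "block" if blocked_tokens else "allow"

# Illustrative inputs; the status codes are assumed, not taken from the report.
print(classify("pexels.com", 403, "https://pexels.com/robots.txt", None))
print(classify("dreamstime.com", 200, "https://www.dreamstime.com/robots.txt",
               {"meta-externalagent"}))
```

Keeping "silent" as its own bucket, rather than folding it into "allow", is what prevents an unreachable host from lowering the block rate.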

Frequently Asked Questions

Q: Which four stock-media sites block AI crawlers?

A: stock.adobe.com, pixabay.com, dreamstime.com, and 123rf.com. They are the four domains among the 8 with a parseable robots.txt that disallow an AI crawler, together making the 50% block rate. The other readable libraries — istockphoto.com, depositphotos.com, vecteezy.com, and freepik.com — allow every crawler.

Q: Why do stock-media sites gate so much more than the corpus average?

A: Their core asset is the training fuel. Licensed images, video, and vectors are exactly what generative image models ingest, so a stock library has an unusually direct reason to keep its catalog out of AI training crawlers. At 50%, stock media gates at nearly double the 27.2% corpus rate.

Q: Why is pixabay.com, a free-image site, one of the broadest blockers?

A: pixabay.com disallows GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, and Applebot-Extended — the widest block in this category — even though it offers free images. That suggests the concern is the imagery itself feeding models, not protecting a paywall, since a free library still has a stake in whether AI generators are trained on its content.

Q: Why are Pexels and Alamy not counted in the rate?

A: Neither returned a parseable robots.txt at the seal. pexels.com refused our request and alamy.com answered with a rate-limited response. With no readable file, both are logged as silent and excluded from the block-rate math rather than counted as allows or blocks.

Put AI-Access Data to Work

For a stock-media or visual-content platform's data-licensing or product lead — the person who owns whether the library is readable by AI agents and how its work is used in training — this snapshot is the competitive baseline. The field is split exactly in half: stock.adobe.com and pixabay.com gate broadly while istockphoto.com and freepik.com stay open.

Set a recurring crawl that re-reads robots.txt for your own domain plus the full peer set, and alert the moment a competitor adds or drops an AI crawler token — because in a category this evenly divided, a single library changing its disallow list shifts the balance and signals where the industry's licensing stance is heading.

A brand- or competitive-intelligence analyst covering the creative-tools and media space is the second fit: they can monitor the same eight domains to catch when a silent site like pexels.com finally publishes a readable policy, or when a narrow blocker like dreamstime.com widens beyond its single Meta token, since either move reshapes what AI image models can legitimately train on.
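The change alert described above reduces to a set difference between two snapshots. The sketch below is a minimal illustration with a hypothetical function name, `diff_tokens`; the "current" snapshot is invented to show what a widening block would look like and is not an observed change.

```python
def diff_tokens(previous: dict[str, set[str]],
                current: dict[str, set[str]]) -> list[str]:
    """List AI tokens added or dropped per domain between two snapshots."""
    alerts = []
    for domain in sorted(set(previous) | set(current)):
        before = previous.get(domain, set())
        after = current.get(domain, set())
        for token in sorted(after - before):
            alerts.append(f"{domain}: added {token}")
        for token in sorted(before - after):
            alerts.append(f"{domain}: dropped {token}")
    return alerts

# Previous state as reported in this article for two domains.
previous = {
    "dreamstime.com": {"Meta-ExternalAgent"},
    "freepik.com": set(),
}
# Hypothetical later crawl in which dreamstime.com widens its block.
current = {
    "dreamstime.com": {"Meta-ExternalAgent", "GPTBot"},
    "freepik.com": set(),
}

for alert in diff_tokens(previous, current):
    print(alert)  # dreamstime.com: added GPTBot
```

Comparing sets of tokens rather than raw file text keeps the alert quiet when a site only edits comments or unrelated paths.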

The catalog-defense logic looks very different one vertical over — auction houses gate barely above the corpus line — which is the contrast worth watching as listing-data businesses decide how far to fence their content. US Tech Automations runs these scheduled robots.txt crawls with change alerts so a policy shift surfaces the week it lands rather than at the next manual audit. See how the agentic monitoring works.

Corpus-wide, 298 of 1123 sites publish an llms.txt file.

Key Takeaways

Of the 8 Stock Media sites with a parseable robots.txt, 4 block at least one AI crawler — a 50% rate, nearly double the 27.2% corpus figure and one of the highest in this batch.
The blockers span paid and free libraries: stock.adobe.com, pixabay.com, dreamstime.com, and 123rf.com. pixabay.com runs the widest block; dreamstime.com names only Meta-ExternalAgent.
The allowers — istockphoto.com, depositphotos.com, vecteezy.com, and freepik.com — disallow no crawler, so business model does not cleanly predict posture.
pexels.com and alamy.com returned no parseable file and are excluded from the rate as silent.
CCBot is the most-disallowed bot across all 1123 readable sites; three of the four stock-media blockers gate both CCBot and GPTBot.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 18, 2026 (snapshot sha 74d390d8f5175d21).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Stock Media Sites Block AI Crawlers? 4 of 8 Do.” https://ustechautomations.com/resources/blog/do-stock-media-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 74d390d8f5175d21
