Do Retail Sites Block AI Crawlers? Sealed robots.txt Data

Jun 13, 2026

Who This Is For

This report is for e-commerce SEO leads, retail technology teams, and digital-commerce strategists who need to understand how large retail platforms position themselves toward AI crawlers. If your organization operates a retail website or monitors competitor access policy, the sealed figures below provide the sector baseline for June 2026.

TL;DR

2 of 12 Retail sites with a parseable robots.txt block at least one AI crawler, a 16.7% rate that is well below the corpus-wide average of 44.9%. Retail ranks eighth of the 10 content categories in this corpus. The sector is broadly permissive: 10 of the 12 sites do not restrict any tracked AI crawler, and 3 of those also publish an llms.txt structured-guidance file.

The Retail Category Finding

Of the 15 Retail sites checked, 12 returned a parseable robots.txt file. Of those 12, only 2 block at least one AI crawler — a category block rate of 16.7%.

Ten sites — walmart.com, target.com, bestbuy.com, etsy.com, homedepot.com, wayfair.com, ikea.com, nordstrom.com, nike.com, and shopify.com — have robots.txt files that do not restrict any of the 21 tracked AI crawlers. Three sites — lowes.com, costco.com, and macys.com — returned no parseable robots.txt at all.

The 2 blocking sites are: amazon.com and ebay.com.

Three Retail sites — walmart.com, target.com, and shopify.com — publish an llms.txt structured-guidance file. This is one of the higher llms.txt adoption rates within any single category in the corpus. These 3 sites are not simply permissive; they have taken an active step to provide structured AI-access guidance.

Methodology

US Tech Automations Research fetched the public robots.txt file for each of the 122 sites in this corpus on June 13, 2026, using standard HTTP GET requests. Each response was parsed for User-agent directives targeting any of 21 tracked AI crawler identifiers across 12 operators. A site is counted as "blocking" if at least one Disallow rule for at least one AI crawler token applies to a non-empty path.

Only successful responses with parseable content are counted. Sites that returned any other status code, or content that could not be parsed, are categorized as "no robots.txt." The snapshot is sealed under sha 741353c4304216ee. Nothing is estimated, modeled, or extrapolated: every count in this report is a verbatim figure from the sealed snapshot.
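As an illustration of what sealing a snapshot can look like, the sketch below hashes a canonical JSON serialization of per-site records and keeps a 16-character hex prefix. This is an assumption made for illustration only: the report does not describe how its sha was produced, and the function name seal_snapshot, the record fields, and the canonicalization choice are all hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def seal_snapshot(records: List[Dict[str, object]], prefix_len: int = 16) -> str:
    """Return a short hex fingerprint of the snapshot records.

    Hypothetical helper: sorts keys and uses compact separators so that the
    same data always serializes to the same bytes before hashing.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


# Illustrative records only; the field names are assumptions, not the report's schema.
example_records = [
    {"site": "amazon.com", "parseable": True, "blocking": True},
    {"site": "walmart.com", "parseable": True, "blocking": False},
]
example_seal = seal_snapshot(example_records)
```

Any later change to a record, however small, yields a different fingerprint, which is what lets a published figure be checked against the data it was drawn from.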

Parsing used a strict literal token-match: each of the 21 tracked AI crawler user-agent strings was compared against the User-agent directives in the robots.txt file. No inference was applied to wildcards beyond standard robots.txt precedence rules. No data was filled in, imputed, or guessed where a response was missing. A site either returned a parseable robots.txt with a matching Disallow directive, or it did not count as blocking.
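The classification rule described above can be sketched in a few lines of Python. This is a minimal sketch and not the pipeline's actual code: the four tokens in TRACKED_TOKENS are an illustrative subset (the report tracks 21 identifiers but does not list them here), the function names are hypothetical, and the sketch simply ignores wildcard groups rather than applying precedence rules.

```python
from typing import List, Set

# Illustrative subset only; assumed for this example, not the report's full list.
TRACKED_TOKENS = {"GPTBot", "ClaudeBot", "CCBot", "Bytespider"}


def blocked_tokens(robots_txt: str, tracked: Set[str] = TRACKED_TOKENS) -> Set[str]:
    """Return tracked tokens that have a Disallow rule with a non-empty path."""
    lookup = {token.lower(): token for token in tracked}
    blocked: Set[str] = set()
    group_agents: List[str] = []  # user agents named by the current group
    in_rules = False              # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                for agent in group_agents:
                    if agent in lookup:  # literal match; "*" never matches
                        blocked.add(lookup[agent])
    return blocked


def is_blocking(robots_txt: str) -> bool:
    """A site counts as blocking if at least one tracked token is blocked."""
    return bool(blocked_tokens(robots_txt))
```

Under this rule a file that only restricts the wildcard user agent, for example a Disallow on a cart path for all crawlers, does not count as blocking an AI crawler.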

Retail Category Summary Table

Metric	Value
Sites checked	15
Sites with parseable robots.txt	12
Sites blocking at least one AI crawler	2
Category block rate	16.7%
Sites with llms.txt	3
Sites with no robots.txt	3

Cross-Category Ranking

The Retail category ranks eighth among all 10 content categories measured. The full cross-category table from the sealed snapshot is below.

Rank	Category	Sites	With robots.txt	Any block	Block rate
1	News	20	17	14	82.4%
2	Tech	15	13	9	69.2%
3	Entertainment	9	9	6	66.7%
4	Reference	14	11	6	54.5%
5	Social	10	10	4	40.0%
6	Travel	9	9	3	33.3%
7	Finance	12	11	2	18.2%
8	Retail	15	12	2	16.7%
9	Education	9	7	1	14.3%
10	Government	9	8	1	12.5%

Retail at 16.7% is well below the corpus-wide rate of 44.9% (48 of 107 sites). The category sits near the bottom of the ranking, just above Education (14.3%) and Government (12.5%). This places Retail in a clearly permissive cluster at the low end of the spectrum.
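The rates and the corpus-wide baseline can be recomputed directly from the counts in the cross-category table. The short Python sketch below does that arithmetic; the counts are copied from the table, while the variable and function names are chosen for this example.

```python
# (category, sites checked, sites with parseable robots.txt, sites with any block)
ROWS = [
    ("News", 20, 17, 14),
    ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6),
    ("Reference", 14, 11, 6),
    ("Social", 10, 10, 4),
    ("Travel", 9, 9, 3),
    ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1),
    ("Government", 9, 8, 1),
]


def block_rate(any_block: int, with_robots: int) -> float:
    """Block rate as a percentage of sites with a parseable robots.txt."""
    return round(100 * any_block / with_robots, 1)


rates = {name: block_rate(blocked, parsed) for name, _, parsed, blocked in ROWS}

total_sites = sum(row[1] for row in ROWS)    # 122 sites checked
total_parsed = sum(row[2] for row in ROWS)   # 107 with a parseable robots.txt
total_blocked = sum(row[3] for row in ROWS)  # 48 blocking at least one AI crawler
corpus_rate = block_rate(total_blocked, total_parsed)

# Rank categories from most to least restrictive.
ranking = sorted(rates, key=rates.get, reverse=True)
retail_rank = ranking.index("Retail") + 1
```

Note that the denominator is the number of sites with a parseable robots.txt (12 for Retail), not the number of sites checked (15).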

For sector context, the News category report represents the most restrictive end of the spectrum at 82.4%. The Reference category report covers a mid-tier sector at 54.5%, which shows how differently content-dependent industries approach AI access.

Most-Blocked Operators and Bots (Corpus-Wide, All 107 Sites)

These figures are corpus-wide counts — not Retail-specific. They show which AI operators face the most restrictions globally across all 107 sites.

Most-Blocked Operators (all 107 sites)

Operator	Sites blocking their crawlers
Common Crawl	40
Anthropic	39
ByteDance	37
OpenAI	35
Meta	35
Apple	31
Diffbot	30
Perplexity	29
Cohere	27
Google	25
Amazon	22
Mistral	12

Common Crawl is restricted by 40 of 107 sites. Anthropic (39) and ByteDance (37) follow. OpenAI and Meta each face restrictions from 35 sites. Mistral at 12 is at the tail of the leaderboard. These figures span all 10 categories and are not Retail-specific.

Site-Level Analysis

The 2 blocking Retail sites — amazon.com and ebay.com — are marketplace platforms where third-party sellers list products. Both operate large-scale proprietary inventory and pricing data systems. Their blocking posture may reflect concerns about AI systems ingesting competitive product and pricing information rather than concerns about editorial content rights.

The 10 sites that restrict no tracked AI crawler span a wide range of retail models. walmart.com, target.com, bestbuy.com, and homedepot.com are large-format physical and digital retailers. etsy.com and shopify.com are marketplace platforms. wayfair.com, ikea.com, and nordstrom.com represent home-goods and apparel retail. nike.com covers branded direct-to-consumer. All have robots.txt files but none restricts any of the 21 tracked AI crawlers.

Notably, walmart.com, target.com, and shopify.com go further by publishing llms.txt files — a structured-guidance approach to AI access that is newer than robots.txt and reflects proactive engagement with AI-access policy rather than restriction. These 3 sites are effectively telling AI systems what they may and may not index, on the site's own terms.

Three sites — lowes.com, costco.com, and macys.com — returned no parseable robots.txt. For costco.com, this is notable given its scale; the absence of a robots.txt does not indicate permission, only the absence of a machine-readable signal.

Automation Bridge

For e-commerce SEO leads and digital-commerce teams, the Retail sector's permissive posture does not eliminate the need for monitoring. If a competitor shifts from allowing to blocking — or publishes an llms.txt that changes what AI systems can index — that change can affect how AI tools surface product and pricing information to consumers.

US Tech Automations builds automated workflows that schedule, fetch, and parse robots.txt and llms.txt files across defined site lists, detect changes over time, and route change-detection alerts to the appropriate team. Monitoring AI-access policy at scale across a competitive retail landscape is exactly the kind of ongoing automation problem US Tech Automations solves.
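The change-detection step in such a workflow can be sketched as a comparison of content fingerprints between two runs. The Python below is a minimal illustration under stated assumptions, not a description of any production system: the function names are hypothetical, the file text is assumed to have been fetched already, and normalization is limited to line endings and trailing whitespace.

```python
import hashlib
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Hash file content after normalizing line endings and trailing whitespace."""
    lines = text.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: Dict[str, str], current_texts: Dict[str, str]) -> List[str]:
    """Return sites whose file differs from the stored fingerprint.

    `previous` maps site -> fingerprint from the last run; `current_texts`
    maps site -> freshly fetched file text. Sites not seen before are
    reported as changed.
    """
    changed = []
    for site, text in sorted(current_texts.items()):
        if previous.get(site) != fingerprint(text):
            changed.append(site)
    return changed
```

A scheduler would call detect_changes after each fetch, route the returned sites to an alert, and then store the new fingerprints for the next run.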

For comparison, the Social media category report covers a sector whose block rate (40.0%) sits just below the corpus-wide 44.9% and that shows its own llms.txt adoption pattern. It is a useful adjacent data point for teams thinking about platform-level AI-access strategy.

Key Takeaways

Only 2 of 12 Retail sites with a parseable robots.txt block at least one AI crawler — 16.7%.
Retail ranks eighth across all 10 categories in the June 2026 sealed snapshot.
The corpus-wide baseline is 48 of 107 sites (44.9%); Retail is far below that baseline.
10 Retail sites allow all tracked AI crawlers; 3 of those also publish llms.txt files.
3 Retail sites — lowes.com, costco.com, macys.com — have no parseable robots.txt.
The only 2 blockers are amazon.com and ebay.com.
Corpus-wide, Common Crawl is blocked by 40 of 107 sites; the tail operator Mistral by 12.
Nothing in this report is estimated, modeled, or extrapolated — all counts are from the sealed June 13, 2026 snapshot.

FAQ

Q: Does blocking a crawler in robots.txt actually stop it?

A: No. robots.txt is an honor-system standard. A Disallow directive carries no technical enforcement mechanism. A non-compliant crawler can fetch the page regardless of what the file states. The file expresses the site operator's preference; whether that preference is respected depends on the crawler operator.
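To make the honor-system point concrete, the sketch below shows the check a compliant crawler performs before fetching a page, using Python's standard-library urllib.robotparser. The helper name may_fetch, the example rules, and the example.com URL are assumptions for illustration. Nothing forces a crawler to run this check, which is exactly why a Disallow rule is a stated preference and not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text already in hand, no network call
    return parser.can_fetch(user_agent, url)


# Illustrative rules: one AI crawler is disallowed, everyone else is allowed.
EXAMPLE_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

gptbot_allowed = may_fetch(EXAMPLE_RULES, "GPTBot", "https://example.com/product/1")
other_allowed = may_fetch(EXAMPLE_RULES, "OtherBot", "https://example.com/product/1")
```

A compliant crawler skips the request when the check returns False; a non-compliant one never calls it.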

Q: Why would large retail sites like walmart.com and target.com allow AI crawlers?

A: The sealed data shows that walmart.com and target.com have robots.txt files that do not restrict any of the 21 tracked AI crawlers. Both also publish llms.txt structured-guidance files. The reasons behind any site's access decision are outside the scope of this sealed snapshot; the data records only what the robots.txt file states. An open posture does not mean unrestricted use — contractual or legal governance may apply independently of the robots.txt file.

Q: What is an llms.txt file, and why do 3 Retail sites publish one?

A: llms.txt is an emerging voluntary standard that gives AI systems structured guidance about a site — what to index, what to avoid, how to attribute content. walmart.com, target.com, and shopify.com have each published one. This is distinct from blocking: these sites are actively engaging with AI-access policy by providing machine-readable guidance rather than issuing a crawl restriction. Across the full corpus, 20 of 107 sites with a robots.txt also publish an llms.txt — an 18.7% corpus-wide rate.

Q: Why are amazon.com and ebay.com the only 2 Retail blockers?

A: The sealed data shows that amazon.com and ebay.com each return a robots.txt file with at least one AI-crawler Disallow directive. Why these 2 sites block and the other 10 do not is not determinable from the robots.txt file alone. The data records the observed posture; it does not record the business rationale.

Q: What does the absence of a robots.txt mean for lowes.com, costco.com, and macys.com?

A: These 3 sites returned no parseable robots.txt response at the time of the snapshot. In this pipeline, nothing is estimated, modeled, or extrapolated — only confirmed parseable responses count. The absence of a robots.txt is not a positive permission for AI crawlers; it simply means no machine-readable crawl-governance signal was found at the standard location.

Curious how Retail sites compare across every vertical? Our flagship study tracks how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Retail Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-retail-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 741353c4304216ee
