How Many Top Sites Block Amazonbot? Sealed robots.txt Data

Jun 13, 2026

Key Takeaways

22 of 107 top sites block Amazonbot.

Amazonbot is blocked at a 20.6% rate across 107 sites.

48 of 107 sites block at least one AI crawler.

Amazonbot is the Amazon crawler. It is used to index web content for Amazon's AI and search products. This report documents how 107 prominent sites have responded to Amazonbot via their robots.txt files as of June 13, 2026, placing it in the full context of the 9-bot and 12-operator landscape measured in this edition. Amazonbot is the least-blocked crawler in this snapshot, so its numbers are best read against the broader AI-access-policy picture.

22 of 107 sites with a parseable robots.txt file block Amazonbot as of June 13, 2026.

At 20.6%, Amazonbot has the lowest block rate among the 9 AI crawlers measured in this snapshot.

What Is Amazonbot and Why Does It Matter?

Amazonbot is the Amazon crawler — the robot that Amazon operates to index publicly accessible web content for its AI systems and products. Publishers who want to restrict their content from Amazon AI use declare that preference using the User-agent: Amazonbot token in their robots.txt files.
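To make the opt-out mechanism concrete, here is a minimal sketch of a robots.txt file that disallows Amazonbot at the root while leaving other crawlers unrestricted, checked with Python's standard-library urllib.robotparser. The file contents and the example.com URL are hypothetical illustrations, not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for illustration only.
ROBOTS_TXT = """\
User-agent: Amazonbot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The Amazonbot group disallows the root; every other agent falls
# through to the wildcard group, which allows it.
amazonbot_ok = parser.can_fetch("Amazonbot", "https://example.com/")
other_ok = parser.can_fetch("SomeOtherBot", "https://example.com/")
print(amazonbot_ok, other_ok)
```

The check only reports what the file declares; whether a crawler honors the declaration is a separate question.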

The Amazon AI product context includes services like Alexa and Amazon's broader AI initiatives. Unlike some other AI crawlers in this snapshot, Amazonbot operates under the Amazon brand name, which also encompasses an enormous e-commerce platform and cloud services business. That context may influence how publishers perceive the crawler and decide whether to block it.

This report is grounded entirely in a point-in-time read of public robots.txt files. The data covers 122 prominent sites measured on June 13, 2026. Of those, 107 returned a parseable robots.txt file, forming the denominator for all rate calculations. Nothing is estimated, modeled, or extrapolated: every figure is a verbatim count from the sealed snapshot (sha 741353c4304216ee).

Snapshot Summary: Sites, Coverage, and Block Rate

The snapshot covered 122 prominent sites across news, e-commerce, professional platforms, social media, finance, travel, education, government, and entertainment verticals.

Metric	Count
Sites in scope	122
Sites with a parseable robots.txt	107
Sites blocking Amazonbot	22
Block rate for Amazonbot	20.6%

At 20.6%, Amazonbot has the lowest block rate among all 9 crawlers measured in this edition. The corpus-wide any-AI-block benchmark is 44.9%: 48 of 107 sites block at least one crawler. Amazonbot at 20.6% is less than half that benchmark, meaning a slim majority of the sites that block some AI crawler (26 of 48) do not block Amazonbot.

The 22 blocks still represent a real signal: more than one in five sites with a readable robots.txt file has chosen to disallow Amazonbot specifically. That is not zero. But in the context of this snapshot, Amazonbot sits at the low end of a spectrum that runs from 22 to 40.
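The rates in this section follow directly from the published counts. The short sketch below reproduces them; the variable and function names are illustrative, and the only inputs are the counts stated in this report.

```python
# Counts as stated in this report.
SITES_WITH_ROBOTS = 107
AMAZONBOT_BLOCKERS = 22
ANY_AI_BLOCKERS = 48


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


amazonbot_rate = pct(AMAZONBOT_BLOCKERS, SITES_WITH_ROBOTS)
any_ai_rate = pct(ANY_AI_BLOCKERS, SITES_WITH_ROBOTS)
# Share of any-AI blockers that also block Amazonbot. Every Amazonbot
# blocker is by definition an any-AI blocker, so 22 is a subset of 48.
share_of_blockers = pct(AMAZONBOT_BLOCKERS, ANY_AI_BLOCKERS)

print(amazonbot_rate, any_ai_rate, share_of_blockers)
```

The third figure is the one that matters for the benchmark comparison: it is below 50, so fewer than half of the sites blocking any AI crawler also block Amazonbot.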

Who Blocks Amazonbot? The Named Sites

The following 22 sites explicitly block Amazonbot in their robots.txt files as of June 13, 2026. This list is complete as recorded in the snapshot.

News and media: washingtonpost.com, theguardian.com, bbc.com, cnn.com, apnews.com, bloomberg.com, forbes.com, theatlantic.com, usatoday.com, newsweek.com, wired.com, arstechnica.com, cnet.com, zdnet.com, mashable.com, venturebeat.com

Health: healthline.com

E-commerce: ebay.com

Publishing and community: tumblr.com, medium.com

Reviews and local: tripadvisor.com

Government: congress.gov

The news and media vertical accounts for 16 of the 22 sites on this list. Major publishers including washingtonpost.com, theguardian.com, bbc.com, bloomberg.com, and cnn.com all appear. The pattern mirrors what is seen across other bots in this edition: premium editorial and journalism properties are the most consistent AI-crawler blockers across the board.

The allower group for Amazonbot is expansive. Prominent sites that do not list Amazonbot in their disallow rules include nytimes.com, reuters.com, wsj.com, businessinsider.com, latimes.com, vox.com, techcrunch.com, theverge.com, engadget.com, gizmodo.com, wikipedia.org, reddit.com, linkedin.com, shopify.com, walmart.com, target.com, bestbuy.com, etsy.com, homedepot.com, amazon.com itself, netflix.com, spotify.com, youtube.com, rollingstone.com, variety.com, hollywoodreporter.com, billboard.com, espn.com, and a broad set of government, finance, travel, and education sites.

Notably, amazon.com, which is owned by the same operator as the crawler, does not block Amazonbot. That is expected, since a company would not typically restrict its own bot from its own domain. Several entertainment and sports properties that do appear in other bots' blocker lists (rollingstone.com, variety.com, billboard.com, espn.com) do not block Amazonbot in this snapshot. The selective nature of blocking decisions is visible across this list.

For comparison with a crawler that has a substantially higher block rate, the CCBot report covers the top-of-leaderboard position in this edition.

Cross-Bot Leaderboard: Where Amazonbot Ranks

The table below shows blocking counts for all 9 crawlers measured in this snapshot, across all 107 sites with a parseable robots.txt file.

Bot	Sites Blocking	Block Rate
CCBot	40	37.4%
ClaudeBot	38	35.5%
Bytespider	37	34.6%
GPTBot	33	30.8%
Applebot-Extended	31	29.0%
Meta-ExternalAgent	30	28.0%
PerplexityBot	29	27.1%
Google-Extended	25	23.4%
Amazonbot	22	20.6%

Amazonbot ranks ninth, the lowest in the nine-bot dataset. The gap between Amazonbot at 22 and Google-Extended at 25, immediately above it, is three sites. The gap from Amazonbot to the top of the table (CCBot at 40) is eighteen sites, which is the full range of the leaderboard.
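The leaderboard figures can be recomputed from the blocking counts alone. The sketch below derives each rate, Amazonbot's rank, and the two gaps discussed above; the dictionary simply restates the table, and the variable names are illustrative.

```python
# Blocking counts as listed in the cross-bot table.
BLOCK_COUNTS = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}
DENOMINATOR = 107  # sites with a parseable robots.txt

# Sort from most-blocked to least-blocked.
leaderboard = sorted(BLOCK_COUNTS.items(), key=lambda kv: kv[1], reverse=True)
rates = {bot: round(100 * n / DENOMINATOR, 1) for bot, n in BLOCK_COUNTS.items()}

amazonbot_rank = [bot for bot, _ in leaderboard].index("Amazonbot") + 1
gap_to_next = BLOCK_COUNTS["Google-Extended"] - BLOCK_COUNTS["Amazonbot"]
count_range = max(BLOCK_COUNTS.values()) - min(BLOCK_COUNTS.values())

for bot, n in leaderboard:
    print(f"{bot}\t{n}\t{rates[bot]:.1f}%")
print(amazonbot_rank, gap_to_next, count_range)
```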

The top three — CCBot, ClaudeBot, and Bytespider — are all associated with AI training data collection at scale. The bottom three — PerplexityBot, Google-Extended, and Amazonbot — are associated with AI product and inference use cases or publisher-controlled opt-out mechanisms. That rough clustering suggests publishers may perceive training-dataset crawlers as presenting greater risk to their content interests than product-use-case crawlers.

Operator Leaderboard: Amazon in Context

The operator leaderboard aggregates blocks at the company level — counting how many of the 107 sites block at least one crawler associated with each operator.

Rank	Operator	Sites Blocking (any crawler)
1	Common Crawl	40
2	Anthropic	39
3	ByteDance	37
4	OpenAI	35
4	Meta	35
6	Apple	31
7	Diffbot	30
8	Perplexity	29
9	Cohere	27
10	Google	25
11	Amazon	22
12	Mistral	12

Amazon ranks eleventh among the 12 operators, just above Mistral, which is blocked by 12 sites. The operator count of 22 matches the Amazonbot bot count exactly, consistent with Amazonbot being the only Amazon AI crawler in the nine-bot measurement set.
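The operator-level aggregation counts a site once per operator if it blocks at least one of that operator's crawlers. A minimal sketch of that counting rule follows. The site names and per-site data are toy values, and the operator-to-token mapping is an illustrative assumption, since this report does not list which tokens were grouped under each operator.

```python
# Assumed mapping for illustration; the report does not publish its own.
OPERATOR_TOKENS = {
    "OpenAI": {"GPTBot", "ChatGPT-User"},
    "Amazon": {"Amazonbot"},
}

# Toy data: which crawler tokens each (hypothetical) site disallows.
BLOCKED_TOKENS_BY_SITE = {
    "site-a.example": {"GPTBot", "Amazonbot"},
    "site-b.example": {"ChatGPT-User"},
    "site-c.example": set(),
}


def operator_block_counts(blocked_by_site, operator_tokens):
    """Count sites blocking at least one crawler for each operator."""
    return {
        operator: sum(1 for blocked in blocked_by_site.values() if blocked & tokens)
        for operator, tokens in operator_tokens.items()
    }


counts = operator_block_counts(BLOCKED_TOKENS_BY_SITE, OPERATOR_TOKENS)
print(counts)
```

In the toy data only one site blocks GPTBot but two sites block some OpenAI crawler, which is how an operator count can exceed a single bot's count. For an operator with one token, the two counts are necessarily equal.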

Common Crawl leads at 40 and Anthropic follows at 39. Both have substantially higher blocking counts than Amazon. The spread from the top (40) to Amazon's position (22) reinforces the picture of a market where different AI operators face meaningfully different levels of stated publisher resistance.

Mistral, at the bottom of the operator table, is blocked by 12 sites, the lowest count among all 12 operators. For a detailed view of the operators just above Amazon in the leaderboard, the Google-Extended report covers the adjacent rank at tenth position.

Methodology and Data Integrity

This report is part of the US Tech Automations Closing Web research edition. The methodology is consistent across all reports in this edition:

A list of 122 prominent sites was assembled, covering major verticals with meaningful web traffic and publisher diversity.
Each site's robots.txt was fetched and parsed on June 13, 2026. Sites that returned a parseable robots.txt file formed the analysis denominator of 107.
For each named AI crawler, the parser checked for a Disallow rule targeting the root path or a broad pattern. A site is counted as "blocking" if its robots.txt instructs the named crawler to avoid at least the root.
All counts are verbatim. Nothing is estimated, modeled, or extrapolated. The snapshot is sealed under sha 741353c4304216ee.
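The steps above can be sketched in a few lines of Python. This is not the parser used for the report. It is a simplified illustration that assumes a site "blocks" a crawler only when a group naming that token contains exactly `Disallow: /`, ignores wildcard groups and Allow overrides, and treats a missing or unparseable file as `None`. All function names and the sample data are hypothetical.

```python
def named_groups(robots_txt):
    """Yield (agents, rules) for each user-agent group in a robots.txt body."""
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule line closed the previous group
                yield agents, rules
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        yield agents, rules


def blocks_root(robots_txt, token):
    """True if a group naming `token` disallows the root path."""
    token = token.lower()
    return any(
        token in agents and ("disallow", "/") in rules
        for agents, rules in named_groups(robots_txt)
    )


def block_rate(robots_by_site, token):
    """Return (blockers, denominator, rate_percent) for one crawler token."""
    # Sites without a parseable file are excluded from the denominator.
    parseable = {s: t for s, t in robots_by_site.items() if t is not None}
    blockers = sorted(s for s, t in parseable.items() if blocks_root(t, token))
    rate = round(100 * len(blockers) / len(parseable), 1) if parseable else 0.0
    return blockers, len(parseable), rate


SAMPLE = {
    "site-a.example": "User-agent: Amazonbot\nDisallow: /\n",
    "site-b.example": "User-agent: *\nDisallow: /private/\n",
    "site-c.example": None,  # fetch failed or file was unparseable
}
blockers, denominator, rate = block_rate(SAMPLE, "Amazonbot")
print(blockers, denominator, rate)
```

Whether a wildcard `User-agent: *` block should count toward a named crawler is a design choice; this sketch leaves it out because the methodology describes rules targeting the named crawler.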

The robots.txt standard is advisory. Crawler operators can honor or ignore Disallow directives — the standard relies on voluntary compliance rather than technical enforcement. This report records stated AI-access policy, not guaranteed behavior. Publishers who require enforceable restrictions must use supplementary technical or legal mechanisms.

All block-rate calculations use 107 as the denominator, covering only sites that returned a parseable robots.txt file. The 15 sites that did not return a parseable file are excluded from the denominator and from blocking counts, ensuring the rates reflect real policy declarations rather than absences.

Frequently Asked Questions

Q: Why does Amazonbot have the lowest block rate when Amazon is a major technology company?

A: Block rates in this dataset reflect stated publisher policy in robots.txt files — they do not reflect company size or technology influence. Several factors may contribute to Amazonbot being at the low end. Publishers may prioritize blocking crawlers they associate more directly with AI training data at scale (CCBot, ClaudeBot, Bytespider). They may also be less focused on Amazonbot because Amazon's AI products are perceived as less directly competitive with publisher content than answer-engine or training-dataset use cases. The data records what is declared, not the reasoning.

Q: How does 20.6% compare to the corpus-wide benchmark?

A: The corpus-wide benchmark is 44.9%: 48 of 107 sites block at least one AI crawler. Amazonbot at 20.6% is well below that line. Fewer than half of the sites that block any AI crawler (22 of 48, about 45.8%) have added Amazonbot specifically to their disallow rules. That gap between the any-block rate and the Amazonbot-specific rate is the largest such gap among all 9 bots in this snapshot.

Q: Does amazon.com itself block Amazonbot?

A: According to this snapshot, amazon.com does not block Amazonbot. This is expected — a company does not typically restrict its own crawler from its own domain. amazon.com appears in the allower group for this report. The blocking decisions recorded here are from other independent publisher domains, not from the crawler operator itself.

Q: Could a site block Amazonbot as part of a blanket AI-crawler block without targeting Amazon specifically?

A: Yes. Many sites apply blanket blocks by listing multiple AI crawler tokens together. A site might disallow Amazonbot as part of a broad policy of blocking all AI crawlers — not because of a specific concern about Amazon's products. The robots.txt file records what tokens are disallowed, not the reasoning. This snapshot cannot distinguish between targeted decisions and blanket policies.
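A small illustration of why the two cases are indistinguishable from the file alone: the sketch below parses one hypothetical robots.txt that targets only Amazonbot and another that lists it in a group with other crawler tokens, using Python's standard-library urllib.robotparser. Both file bodies and the site.example URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allows_amazonbot(robots_txt):
    """Return True if the given robots.txt body lets Amazonbot fetch the root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Amazonbot", "https://site.example/")


# Hypothetical file that names only Amazonbot.
TARGETED = "User-agent: Amazonbot\nDisallow: /\n"

# Hypothetical file that groups several AI crawler tokens under one rule.
BLANKET = (
    "User-agent: GPTBot\n"
    "User-agent: CCBot\n"
    "User-agent: Amazonbot\n"
    "Disallow: /\n"
)

targeted_result = allows_amazonbot(TARGETED)
blanket_result = allows_amazonbot(BLANKET)
print(targeted_result, blanket_result)
```

Both files produce the same verdict for Amazonbot, so a count of blockers cannot separate a decision aimed at Amazon from a general AI-crawler policy.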

Q: What would cause the Amazonbot block rate to rise in future snapshots?

A: This report does not forecast policy changes. Publisher AI-access policies are actively evolving, and a future snapshot could show a different distribution. Factors that could drive changes include Amazon releasing new AI products that attract publisher attention, broader industry shifts toward AI-access licensing, or publishers updating their robots.txt files in response to new guidance from industry bodies. For context on how higher-blocked crawlers are currently tracked, see the Bytespider report.

Put AI-Access Data to Work

An SEO team lead at a content publisher needs to understand not just their own robots.txt policy but how competitors and peer publishers across the landscape are configuring theirs. That kind of competitive and market-level intelligence cannot be gathered manually at any meaningful scale — checking dozens of robots.txt files individually is slow, error-prone, and gives only a snapshot that goes stale immediately.

US Tech Automations builds agentic workflows that automate this monitoring: scheduled robots.txt fetches across a defined site list, detection of changes when disallow rules are added or removed for named crawlers, and structured reporting that surfaces policy shifts across the tracked portfolio. Instead of a quarterly manual scan, the workflow runs continuously and delivers change alerts to the right stakeholder when they happen.
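As a rough illustration of the change-detection step, and not a description of the US Tech Automations implementation, the sketch below compares two snapshots of per-site blocked tokens and reports which disallow entries were added or removed. The function name, snapshot shape, and site names are all hypothetical.

```python
def diff_blocks(previous, current):
    """Compare two {site: set_of_blocked_tokens} snapshots.

    Returns a sorted list of (site, token, "added" | "removed") tuples.
    """
    changes = []
    for site in sorted(previous.keys() | current.keys()):
        before = previous.get(site, set())
        after = current.get(site, set())
        for token in sorted(after - before):
            changes.append((site, token, "added"))
        for token in sorted(before - after):
            changes.append((site, token, "removed"))
    return changes


# Toy snapshots for illustration.
last_week = {
    "site-a.example": {"GPTBot"},
    "site-b.example": {"Amazonbot"},
}
this_week = {
    "site-a.example": {"GPTBot", "Amazonbot"},
    "site-b.example": set(),
}

changes = diff_blocks(last_week, this_week)
for change in changes:
    print(change)
```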

Explore agentic monitoring workflows on the platform to see how AI-access policy tracking can become a continuous operational process rather than a research one-off.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Top Sites Block Amazonbot? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-amazonbot-2026

Sealed snapshot sha256: 741353c4304216ee
