
How Many Top Sites Block GPTBot? Sealed robots.txt Data

Jun 13, 2026

Key Takeaways

33 of 107 top sites block GPTBot, a 30.8% block rate.

48 of 107 sites block at least one AI crawler.

Of the 122 prominent sites in our starting universe, 107 returned a parseable robots.txt file. Of those 107, exactly 33 block GPTBot — a block rate of 30.8%. GPTBot ranks fourth among the 9 crawlers measured in this edition, behind CCBot (40 sites), ClaudeBot (38 sites), and Bytespider (37 sites), and ahead of Applebot-Extended (31 sites).

GPTBot is the OpenAI training crawler. OpenAI uses GPTBot to gather web content for training large language models. OpenAI as an operator is blocked by 35 sites — two more than the GPTBot-specific count of 33, suggesting that some sites apply a broader OpenAI-level block that covers additional crawler tokens beyond GPTBot alone.

Of 107 prominent sites with a parseable robots.txt, 33 block GPTBot — a 30.8% block rate, placing it 4th of 9 crawlers measured in June 2026.

48 of 107 sites (44.9%) block at least one AI crawler; GPTBot's block rate of 30.8% is well below that corpus-wide figure.

What Is GPTBot and Why Do Publishers Block It

GPTBot is the OpenAI training crawler. It identifies itself with the user-agent string GPTBot. OpenAI publishes documentation on how to block GPTBot in robots.txt, making it one of the more explicitly acknowledged AI crawlers from a publisher communication standpoint.

Despite that transparency, 33 of 107 sites in this corpus block it. The 30.8% block rate places GPTBot in the middle of the leaderboard: four sites behind third-ranked Bytespider (37) and seven behind the leader CCBot (40), but ahead of the remaining five crawlers.

GPTBot belongs to OpenAI, whose crawlers collectively are blocked by 35 sites. The two-site gap between the per-bot count (33) and the operator count (35) indicates that some sites apply OpenAI-level blocks that catch additional crawler tokens beyond GPTBot. A publisher who wants to block all OpenAI crawl activity needs to ensure every OpenAI-attributed user-agent string is covered.

For the top-ranked CCBot comparison, see how many sites block CCBot. For the third-ranked Bytespider, see how many sites block Bytespider.

Methodology

US Tech Automations Research collected publicly accessible robots.txt files from 122 prominent sites on June 13, 2026. For each site, we fetched the file, parsed its directives, and checked whether GPTBot would be covered by a broad disallow instruction — either via a User-agent: GPTBot section or a catch-all User-agent: * section with a broad disallow.
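The check described above can be sketched with Python's standard-library urllib.robotparser. This is a minimal illustration, not the pipeline used for this report: the function name blocks_bot and the sample robots.txt bodies are hypothetical, and "broad disallow" is assumed here to mean that the site root / is off-limits for the token.

```python
from urllib.robotparser import RobotFileParser


def blocks_bot(robots_text: str, token: str = "GPTBot") -> bool:
    """Return True if robots_text denies `token` access to the site root.

    Assumption: a "broad disallow" is read as "/" being disallowed,
    either in a section naming the token or in the catch-all
    `User-agent: *` section when no token-specific section applies.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(token, "/")


# Hypothetical robots.txt bodies, for illustration only.
named_block = "User-agent: GPTBot\nDisallow: /\n"
catch_all_block = "User-agent: *\nDisallow: /\n"
partial_block = "User-agent: GPTBot\nDisallow: /private/\n"
override = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /\n"

for label, body in [
    ("named", named_block),
    ("catch-all", catch_all_block),
    ("partial", partial_block),
    ("override", override),
]:
    print(label, blocks_bot(body))
```

Under this reading, a rule that disallows only a subdirectory does not count as a block, and a GPTBot-specific Allow section takes precedence over a catch-all disallow.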

Of the 122 sites, 107 returned a parseable robots.txt file. The remaining 15 are excluded from all percentages and counts. Nothing is estimated, modeled, or extrapolated. Every number in this report is a verbatim count from raw public file text, sealed June 13, 2026 with snapshot sha 741353c4304216ee.

Metric	Value
Sites in starting universe	122
Sites with a parseable robots.txt	107
Sites blocking GPTBot	33
Block rate	30.8%

Sites That Block GPTBot

The 33 sites that explicitly block GPTBot span news, health, entertainment, e-commerce, travel, and government. The presence of government content in the blocker list is a notable signal — congress.gov blocking GPTBot indicates that public-sector sites are also engaged with AI-crawler policy.

Major news outlets blocking GPTBot include nytimes.com, bbc.com, cnn.com, apnews.com, bloomberg.com, forbes.com, usatoday.com, latimes.com, and newsweek.com. Entertainment, culture, and sports titles rollingstone.com, variety.com, hollywoodreporter.com, billboard.com, and espn.com are all in the blocker list.

Technology publications on the GPTBot blocker list include techcrunch.com, cnet.com, zdnet.com, and mashable.com. The reference segment is represented by dictionary.com. Health sites webmd.com and healthline.com both block GPTBot. In the media and literature space, goodreads.com blocks it.

E-commerce and professional platforms blocking GPTBot include amazon.com, ebay.com, linkedin.com, medium.com, vimeo.com, tripadvisor.com, and yelp.com. Travel reference site lonelyplanet.com blocks GPTBot. Entertainment platform hulu.com and government portal congress.gov round out the roster.

Notable allowers — sites that do not block GPTBot — include a broad set of media outlets: washingtonpost.com, theguardian.com, reuters.com, wsj.com, businessinsider.com, theatlantic.com, time.com, and vox.com. Technology publications theverge.com, wired.com, arstechnica.com, gizmodo.com, venturebeat.com, and engadget.com permit GPTBot. Community platforms reddit.com, github.com, hackernews.com, slashdot.org, pinterest.com, tumblr.com, substack.com, wordpress.com, and blogger.com all allow it.

Financial sites chase.com, bankofamerica.com, wellsfargo.com, fidelity.com, paypal.com, nerdwallet.com, bankrate.com, morningstar.com, marketwatch.com, fool.com, and coinbase.com permit GPTBot. Government portals cdc.gov, medlineplus.gov, usa.gov, irs.gov, sec.gov, whitehouse.gov, census.gov, nasa.gov, and uspto.gov do not block GPTBot. Educational institutions mit.edu, harvard.edu, stanford.edu, coursera.org, edx.org, and duolingo.com allow it. Reference sites wikipedia.org, britannica.com, merriam-webster.com, and investopedia.com permit GPTBot as well. E-commerce allowers include walmart.com, target.com, bestbuy.com, etsy.com, homedepot.com, wayfair.com, ikea.com, nordstrom.com, nike.com, and shopify.com. Entertainment platforms netflix.com, spotify.com, and youtube.com allow GPTBot.

Cross-Bot Leaderboard (all 107 sites)

The table below provides the full ranked list of the 9 measured crawlers across the same 107-site corpus, enabling direct comparison of GPTBot with all other measured bots.

Bot	Sites Blocking	Block Rate
CCBot	40	37.4%
ClaudeBot	38	35.5%
Bytespider	37	34.6%
GPTBot	33	30.8%
Applebot-Extended	31	29.0%
Meta-ExternalAgent	30	28.0%
PerplexityBot	29	27.1%
Google-Extended	25	23.4%
Amazonbot	22	20.6%
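
The block rates in this table follow directly from the counts. A minimal sketch of the arithmetic, using the counts from the table and the 107-site denominator; the names sites_blocking and block_rate are illustrative, not part of the published dataset.

```python
CORPUS_SIZE = 107  # sites that returned a parseable robots.txt

# Sites blocking each crawler, copied from the leaderboard table.
sites_blocking = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}


def block_rate(count: int, total: int = CORPUS_SIZE) -> float:
    """Percentage of the corpus blocking a crawler, to one decimal place."""
    return round(100 * count / total, 1)


# Rank crawlers by number of blocking sites, highest first.
ranked = sorted(sites_blocking.items(), key=lambda item: item[1], reverse=True)
for rank, (bot, count) in enumerate(ranked, start=1):
    print(f"{rank}\t{bot}\t{count}\t{block_rate(count):.1f}%")
```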

GPTBot sits in fourth place at 33. Seven sites separate it from the leader CCBot. Despite being the OpenAI training crawler — from one of the most prominent AI companies by public awareness — GPTBot does not lead the block count. That may reflect the timing of when different robots.txt rules were added, or it may reflect publishers distinguishing between training data gathered for an open archive versus a proprietary product.

For a profile of the fifth-ranked Applebot-Extended, see how many sites block Applebot-Extended. Links to sibling reports in the Closing Web series appear throughout this article.

Operator Leaderboard (all 107 sites)

This table counts, by operator, the number of sites that block at least one crawler from that organization. A high operator count means publishers are addressing that organization as a whole, not just targeting a single bot token.

Rank	Operator	Sites Blocking
1	Common Crawl	40
2	Anthropic	39
3	ByteDance	37
4	OpenAI	35
4	Meta	35
6	Apple	31
7	Diffbot	30
8	Perplexity	29
9	Cohere	27
10	Google	25
11	Amazon	22
12	Mistral	12

OpenAI and Meta are tied at 35, sharing fourth place on the operator leaderboard. The OpenAI operator count is two higher than the GPTBot per-bot count (33), meaning two sites block an OpenAI-attributed token other than GPTBot without blocking GPTBot itself. That gap matters for a publisher who assumes a GPTBot rule is sufficient for full OpenAI blocking.
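
The difference between a per-bot count and an operator count can be sketched as follows. Everything in this example is hypothetical: the report does not list which tokens it attributes to each operator, so the OpenAI token set below (GPTBot, ChatGPT-User, OAI-SearchBot) is an assumption, and the four sites are placeholders rather than corpus data.

```python
# Assumed token set for one operator; the report does not publish its mapping.
OPERATOR_TOKENS = {
    "OpenAI": {"GPTBot", "ChatGPT-User", "OAI-SearchBot"},
}

# Placeholder results: site -> set of crawler tokens its robots.txt blocks.
blocked_tokens_by_site = {
    "site-a.example": {"GPTBot", "ChatGPT-User"},
    "site-b.example": {"GPTBot"},
    "site-c.example": {"OAI-SearchBot"},  # blocks OpenAI, but not GPTBot
    "site-d.example": set(),
}


def per_bot_count(results: dict, token: str) -> int:
    """Number of sites that block one specific crawler token."""
    return sum(1 for blocked in results.values() if token in blocked)


def operator_count(results: dict, tokens: set) -> int:
    """Number of sites that block at least one of an operator's tokens."""
    return sum(1 for blocked in results.values() if blocked & tokens)


gptbot_sites = per_bot_count(blocked_tokens_by_site, "GPTBot")
openai_sites = operator_count(blocked_tokens_by_site, OPERATOR_TOKENS["OpenAI"])
print(gptbot_sites, openai_sites, openai_sites - gptbot_sites)
```

In this toy data the operator count exceeds the per-bot count by one, because site-c.example blocks an OpenAI token without blocking GPTBot. The two-site gap in the report arises the same way.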

At the bottom, Mistral appears in disallow lists at only 12 sites — roughly a third of the OpenAI operator count. The variation across operators underscores that publisher policy is granular, not a blanket AI-block.

Frequently Asked Questions

Q: Why does GPTBot rank 4th even though OpenAI is a prominent AI company?

A: Block rates in robots.txt reflect publisher policy decisions, not company prominence. The data shows the state of robots.txt files on June 13, 2026, and nothing is estimated, modeled, or extrapolated from that snapshot. The ranking may reflect when different publishers added rules, the order in which crawlers attracted attention, or distinctions publishers draw between training-for-a-public-archive versus training-for-a-proprietary-product.

Q: Does blocking GPTBot prevent content from appearing in OpenAI products?

A: robots.txt is an honor-system protocol that expresses crawl preferences; it does not enforce them at the network level. A compliant crawler respects disallow rules; a non-compliant one does not. Blocking GPTBot in robots.txt therefore signals a preference rather than technically preventing access. Whether that preference is respected can be verified only through server log analysis.
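
A rough sketch of what that log analysis could look like, assuming access logs in the common "combined" format and a site that disallows GPTBot everywhere. The log lines, the regular expression, and the function name crawler_hits are all hypothetical. User-agent strings can be spoofed, so a match is a lead to investigate, not proof of who made the request.

```python
import re

# Combined log format:
# ip ident user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def crawler_hits(log_lines, token="GPTBot", disallowed_prefix="/"):
    """Return paths requested by `token` under a disallowed prefix.

    Requests for /robots.txt itself are skipped, since a compliant
    crawler has to fetch that file to learn the rules.
    """
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        path = match["path"]
        if (
            token.lower() in match["agent"].lower()
            and path.startswith(disallowed_prefix)
            and path != "/robots.txt"
        ):
            hits.append(path)
    return hits


# Hypothetical log lines using documentation-reserved IP addresses.
sample_log = [
    '203.0.113.7 - - [13/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [13/Jun/2026:10:00:05 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [13/Jun/2026:10:01:00 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(crawler_hits(sample_log))
```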

Q: What is the gap between the GPTBot per-bot count (33) and the OpenAI operator count (35)?

A: Two sites block at least one OpenAI-attributed crawler token without blocking GPTBot itself. Conversely, a publisher who adds only a GPTBot rule leaves any other OpenAI-attributed tokens unblocked. The two-site gap is small, but it illustrates why operator-level coverage is more comprehensive than per-bot coverage.

Q: What does the corpus-wide 44.9% figure mean in context of GPTBot?

A: 48 of 107 sites (44.9%) block at least one AI crawler. GPTBot's block rate of 30.8% is well below that line. A site can be counted in the 44.9% figure by blocking any single crawler in the 9-bot set. Because GPTBot is one of those nine, all 33 GPTBot blockers are among the 48, which leaves 15 sites (48 minus 33) that block some other AI crawler but not GPTBot.

How to Read This Number

A block count is a measure of stated policy, not of enforcement. When a site lists GPTBot in a disallow rule, it has published an intent: it does not want OpenAI to fetch its pages with that user-agent. Whether OpenAI honors that intent is a separate question that robots.txt alone cannot answer. The value of a cross-site count is that it turns thousands of individual policy decisions into one comparable signal — a way to see whether avoiding GPTBot is a fringe practice or a mainstream one among prominent publishers.

Reading the figure well also means respecting what it excludes. The count reflects only sites that returned a parseable robots.txt file, and only the exact user-agent token GPTBot. A site that blocks OpenAI under a different token, or with a server-side rule invisible to a robots.txt read, is not counted here. The number is therefore a floor on stated objection, not a ceiling — a deliberately conservative reading that matches the sealed-data discipline of this series, where nothing is estimated, modeled, or extrapolated.

Put AI-Access Data to Work

An SEO lead managing content for a media organization needs to understand not just whether their site has robots.txt rules in place, but whether those rules are consistent with their editorial policy and whether they cover all the crawlers the organization wants to address. A rule added for one bot years ago may be outdated relative to the current crawler landscape.

US Tech Automations makes it straightforward to monitor this continuously: agentic workflows check robots.txt files across a defined portfolio, parse per-bot disallow rules, and surface policy changes in near-real time. Explore the agentic workflow platform to bring structured AI-access monitoring into your editorial operations.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).


Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Top Sites Block GPTBot? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-gptbot-2026

Sealed snapshot sha256: 741353c4304216ee
