
Who Blocks OpenAI's GPTBot? 35 of 107 Top Sites Do

Jun 13, 2026

OpenAI operates three distinct web-crawling user-agents: GPTBot (training data collection), OAI-SearchBot (search index), and ChatGPT-User (real-time browsing). Across 107 prominent sites that returned a parseable robots.txt file in our June 13, 2026 snapshot, 35 of them have blocked at least one of those crawlers.

35 of 107 top sites block at least one OpenAI crawler in June 2026.

That figure places OpenAI squarely in the middle of the AI-access landscape. Of the 48 sites that block any AI crawler at all, 35 (72.9%, more than two-thirds) have drawn a line specifically against OpenAI. The signal is clear: content owners are not just worried about AI crawling in the abstract. They are naming OpenAI's agents by user-agent string and refusing entry.

Snapshot Methodology

US Tech Automations fetched robots.txt files from 122 prominent sites on June 13, 2026. Of those 122, 107 returned a parseable file; all percentages in this report are computed over that 107-site base. The snapshot is point-in-time and sealed: nothing is estimated, modeled, or extrapolated. Every count in this report is taken verbatim from public robots.txt directives as they existed on that date, and every percentage is computed directly from those counts.
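To make the idea of a "parseable file" that "blocks" a crawler concrete, here is a minimal sketch using Python's standard-library robots.txt parser. It is not the pipeline used for this report. It assumes that "blocked" means the user-agent may not fetch the site root, and the helper name `blocked_agents`, the probe URL, and the sample file are all illustrative.

```python
from urllib.robotparser import RobotFileParser

# The three OpenAI user-agents named in this report.
OPENAI_AGENTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User")


def blocked_agents(robots_txt, agents=OPENAI_AGENTS,
                   probe_url="https://example.com/"):
    """Return the agents that may not fetch probe_url under this robots.txt.

    Assumption: a site "blocks" an agent when the agent is disallowed
    from the site root. The report does not spell out its exact rule.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in agents if not parser.can_fetch(ua, probe_url)]


# Hypothetical robots.txt, not taken from any site in the panel.
sample = """\
User-agent: GPTBot
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocked_agents(sample))  # ['GPTBot', 'ChatGPT-User']
```

Note that the standard-library parser matches user-agent names by case-insensitive substring, so a production classifier may want stricter matching.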

The snapshot sha is 741353c4304216ee, which pins the exact state of the dataset used here. robots.txt is an honor-system standard — it measures a site operator's stated intent, not a technical firewall. These numbers will not change as sites later edit their files; they describe a specific moment.

The 122-site panel spans 10 content categories and 21 tracked bot user-agents across 12 AI operators. The broader finding: 48 of 107 sites (44.9%) block some AI crawler, and 20 of 107 (18.7%) have adopted llms.txt, a newer consent-signaling format. A further 9 sites (8.4%) earned what the research team calls "star" status, indicating the most comprehensive AI-access restrictions observed. The OpenAI-specific findings reported here sit within that wider corpus context and should be read alongside sibling reports for the other operators in this study.

How Often OpenAI Is Refused

The headline figure of 35 blocked sites obscures an important detail: each of OpenAI's 3 crawlers attracts a different level of resistance. GPTBot — the training-data agent — is the most widely blocked, named by 33 of the 107 sites.

GPTBot is blocked by 33 of 107 sites — the most-blocked OpenAI user-agent.

ChatGPT-User (the browsing-mode agent) is blocked by 23 sites. OAI-SearchBot, the search indexer, is the least-blocked at 16 sites — possibly because some publishers are still willing to be surfaced in ChatGPT search results even while refusing training ingestion.

OpenAI User-Agent	Sites Blocking (of 107)
GPTBot	33
ChatGPT-User	23
OAI-SearchBot	16

The gap between GPTBot (33) and OAI-SearchBot (16) suggests a nuanced strategy by some publishers: allow indexing for potential referral traffic, while preventing bulk training ingestion. That distinction may disappear as AI products blur the line between "search" and "synthesis."

Sealed finding: 35 of 107 top sites blocked at least one OpenAI crawler as of June 13, 2026. That is 32.7%, just under 1 in 3 of all sites with a parseable robots.txt file.

The operator-level count of 35 is higher than the per-bot GPTBot count of 33 because two sites (35 minus 33) block ChatGPT-User or OAI-SearchBot without blocking GPTBot itself. Those sites have made function-specific choices rather than blanket operator-level blocks.
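The relationship between the operator-level count and the per-bot counts is a set union: a site counts once at the operator level if it appears in any per-crawler set. The sketch below illustrates this with made-up data. The site names are placeholders, not sites from the panel.

```python
# Hypothetical per-crawler blocker sets; site names are placeholders,
# not data from this report.
blockers = {
    "GPTBot": {"site-a.example", "site-b.example", "site-c.example"},
    "ChatGPT-User": {"site-a.example", "site-d.example"},
    "OAI-SearchBot": {"site-a.example", "site-b.example"},
}

# Operator-level blockers: any site that blocks at least one crawler.
operator_level = set().union(*blockers.values())

# Sites that block an OpenAI crawler without blocking GPTBot itself.
without_gptbot = operator_level - blockers["GPTBot"]

print(len(operator_level))     # 4
print(sorted(without_gptbot))  # ['site-d.example']
```

In this toy data the operator-level count (4) exceeds the GPTBot count (3) by exactly the number of sites that skip GPTBot, which is the same arithmetic that separates 35 from 33 in the snapshot.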

Sealed finding: Of 48 sites that block any AI crawler, more than 2 in 3 have also blocked at least one OpenAI user-agent.
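Both sealed findings reduce to simple ratios over the counts stated in this report. As a quick worked check (the helper name `share` is ours, and rounding to one decimal place is an assumption that matches the percentages quoted elsewhere in the report):

```python
def share(count, base):
    """Percentage of base, rounded to one decimal place."""
    return round(100 * count / base, 1)


PARSEABLE_SITES = 107   # sites with a parseable robots.txt
ANY_AI_BLOCKERS = 48    # sites blocking at least one AI crawler
OPENAI_BLOCKERS = 35    # sites blocking at least one OpenAI crawler

print(share(OPENAI_BLOCKERS, PARSEABLE_SITES))   # 32.7
print(share(ANY_AI_BLOCKERS, PARSEABLE_SITES))   # 44.9
print(share(OPENAI_BLOCKERS, ANY_AI_BLOCKERS))   # 72.9
```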

Which Industries Block OpenAI

News dominates the blocking landscape. With 10 sites in that category refusing at least one OpenAI crawler, News publishers account for the largest single source of friction. Entertainment (6 sites), Tech (5), and Reference (5) follow. Travel and Social each contribute 3 blockers, while Retail adds 2 and Government adds 1.

Category	Sites Blocking OpenAI
News	10
Entertainment	6
Tech	5
Reference	5
Social	3
Travel	3
Retail	2
Government	1
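As a consistency check, the category counts in the table should add up to the 35 operator-level blockers. The short sketch below confirms that and derives each category's share of the 35. The shares are computed here from the table and are not separately reported figures.

```python
# Category counts copied from the table above.
by_category = {
    "News": 10, "Entertainment": 6, "Tech": 5, "Reference": 5,
    "Social": 3, "Travel": 3, "Retail": 2, "Government": 1,
}

total = sum(by_category.values())
print(total)  # 35

# Share of all OpenAI blockers contributed by each category.
shares = {name: round(100 * n / total, 1) for name, n in by_category.items()}
print(shares["News"])           # 28.6
print(shares["Entertainment"])  # 17.1
```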

The News category's dominance reflects a sector where content is literally the product. Major journalism outlets have watched AI companies summarize their reporting without compensation, and many have responded by locking their robots.txt. The Entertainment vertical's 6 blockers tell a similar story — properties like Rolling Stone, Variety, and Billboard have deep archives of original writing they are unwilling to hand over as training material.

Tech publishers present a more complicated picture. Several major tech media brands block OpenAI even while covering AI extensively — a sign that editorial coverage does not imply data-licensing consent. Reference sites like Healthline face a different calculus: AI-generated summaries that displace their traffic represent a direct threat to ad revenue.

News leads all categories with 10 sites blocking OpenAI — the highest single-category count.

Travel and Social each register 3 blockers, suggesting these categories have not yet adopted blocking as a default posture, though the sites that have — TripAdvisor and LinkedIn — are influential enough to matter. For context on how Anthropic fares across these same categories, see who blocks Anthropic ClaudeBot across industries.

The Named Sites That Block OpenAI

The following table shows 12 representative sites from the 35 that block at least one OpenAI crawler. The "headline crawlers blocked" figure counts how many of the 9 highest-volume bots tracked in this corpus each site refuses — it is a proxy for how aggressively a site restricts AI access overall.

Site	Category	Headline Crawlers Blocked (of 9)
bbc.com	News	9
bloomberg.com	News	9
usatoday.com	News	9
nytimes.com	News	8
cnn.com	News	8
cnet.com	Tech	8
ebay.com	Retail	8
congress.gov	Government	8
rollingstone.com	Entertainment	8
variety.com	Entertainment	8
linkedin.com	Social	7
tripadvisor.com	Travel	7

BBC, Bloomberg, and USA Today all score 9 out of 9 tracked headline bots blocked — they have effectively closed the door to the entire surveyed AI crawling ecosystem. The New York Times and CNN follow at 8. At the other end of the 35-site list, Lonely Planet (1 headline bot blocked), Goodreads (2), and Hulu (2) block far fewer crawlers overall, suggesting more selective policies rather than blanket AI restrictions.
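The "headline crawlers blocked" score can be thought of as the size of an intersection: the bots a site blocks, restricted to the 9 headline bots. Here is a minimal sketch under stated assumptions. The report does not name its 9 headline bots, so the list below is illustrative only, and the function name `headline_score` is ours.

```python
# Illustrative headline list (an assumption): the report counts 9 headline
# bots but does not name them, so these user-agent tokens are placeholders.
HEADLINE_BOTS = frozenset({
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "CCBot",
    "Google-Extended", "PerplexityBot", "Bytespider", "Amazonbot",
})


def headline_score(blocked_agents, headline=HEADLINE_BOTS):
    """Count how many headline bots appear in a site's blocked set."""
    return len(set(blocked_agents) & headline)


# Hypothetical site: blocks three headline bots plus one bot outside the list.
blocked = {"GPTBot", "CCBot", "ClaudeBot", "SomeOtherBot"}
print(headline_score(blocked))  # 3
```

Blocking a bot outside the headline list does not raise the score, which is why the score works as a proxy for breadth rather than a full count of every directive in a file.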

The presence of congress.gov — a government site with publicly funded content — at 8 headline crawlers blocked is notable. It implies that even institutions whose data is nominally public have made deliberate choices to restrict AI training access. Other notable full-list members include Forbes (8 headline bots), Mashable (8), ZDNet (8), and Vox (7), as well as APNews (6), TechCrunch (6), Medium (6), and Quora (5).

For a wider view of how these same sites treat Common Crawl, see who blocks Common Crawl CCBot across the full 40-site list.

Per-Industry Analysis: Reading the Patterns

The 8-category distribution reveals two structural patterns. First, the most content-dependent industries — News and Entertainment — lead in blocking. Both depend on original writing as their core product, making AI training ingestion a direct revenue threat. Second, categories with platform-mediated content (Social, Travel review sites) show moderate blocking at 3 sites each, while Retail and Government trail.

Reference's 5 blockers include Healthline, WebMD, Dictionary.com, Quora, and Goodreads. Each of these serves queries that AI-generated answers can now satisfy without a click — making AI training ingestion an existential traffic question, not just a philosophical one.

Tech's 5 blockers (CNET, ZDNet, Mashable, The Verge, TechCrunch) are all primarily editorial rather than tool-based. A purely developer-tool or SaaS site does not appear in this blocker list — the resisters are the sites that write about technology, not the ones that sell it.

Social's 3 blockers — LinkedIn, Medium, Vimeo — each host user-generated or creator content. These platforms face a secondary consent question: their own users never agreed to AI training. See how Google-Extended fares in the Social category for a comparative view on a different operator.

Put This Data to Work

If you are a RevOps lead, content strategist, or retrieval-pipeline engineer who depends on GPTBot or OAI-SearchBot access to third-party content, this data is operationally relevant. A publisher can add or lift a GPTBot block at any time, changing what reaches your AI knowledge base without your team knowing, unless you have an automated tracking workflow.

US Tech Automations builds exactly this kind of monitoring pipeline. A scheduled robots.txt fetch — run nightly or weekly against a list of domains your product depends on — can diff the current state against a baseline and fire a Slack or email alert the moment a publisher adds or removes an AI user-agent block.
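The core of such a workflow is the diff step. The sketch below is a simplified illustration, not US Tech Automations' production pipeline: it compares two robots.txt texts and reports which AI user-agents were newly blocked or unblocked. Fetching, scheduling, and the Slack or email alert are left out, and the function names, probe URL, and sample files are assumptions.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User")


def blocked_set(robots_txt, agents=AI_AGENTS,
                probe_url="https://example.com/"):
    """Agents disallowed from probe_url (assumed definition of 'blocked')."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua for ua in agents if not parser.can_fetch(ua, probe_url)}


def diff_blocks(baseline_txt, current_txt):
    """Compare two robots.txt snapshots and list changed AI blocks."""
    before = blocked_set(baseline_txt)
    after = blocked_set(current_txt)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Hypothetical snapshots of one site's robots.txt on two dates.
baseline = "User-agent: GPTBot\nDisallow: /\n"
current = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: ChatGPT-User\nDisallow: /\n"
)

change = diff_blocks(baseline, current)
print(change)  # {'added': ['ChatGPT-User'], 'removed': []}
```

A scheduler would call `diff_blocks` per domain and send an alert only when either list is non-empty, so unchanged files generate no noise.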

The 35 sites blocking OpenAI as of June 13, 2026 will not stay at 35. Some will add restrictions; others may negotiate licensing deals and lift them. Without automated tracking, your team is flying blind on an operational dependency that can shift at any time.

Frequently Asked Questions

Q: Does blocking GPTBot actually stop OpenAI from crawling?

A: Not necessarily. robots.txt is an honor-system protocol with no technical enforcement. OpenAI and most major AI companies have stated they respect Disallow directives, but a determined crawler could ignore the file. The data here measures stated intent, not confirmed compliance.

Q: Does blocking GPTBot affect my Google Search ranking?

A: No. GPTBot is a separate user-agent from Googlebot. Blocking GPTBot has no effect on your Google Search indexing or ranking. For how Google handles its own AI crawler, see who blocks Google-Extended and why the count differs.

Q: Why do some sites block GPTBot but not OAI-SearchBot?

A: The distinction likely reflects a business calculation: allowing OAI-SearchBot may drive referral traffic from ChatGPT search results, while blocking GPTBot prevents the site content from being used in training datasets. Publishers are separating distribution from training — 33 sites block GPTBot but only 16 block OAI-SearchBot, a gap of 17 sites.
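For publishers weighing that split, here is what a training-only block might look like, with a quick check using Python's standard-library parser. The robots.txt content and the article URL are hypothetical and are not copied from any site in the panel.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: refuse the training crawler, admit the search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"
results = {ua: parser.can_fetch(ua, url) for ua in ("GPTBot", "OAI-SearchBot")}
print(results)  # {'GPTBot': False, 'OAI-SearchBot': True}
```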

Q: Are the 35 blocking sites uniformly opposed to AI, or is it more selective?

A: It is selective. The headline-crawlers-blocked scores range from 1 (Lonely Planet) to 9 (BBC, Bloomberg, USA Today). Many sites block only certain user-agents. The data shows a spectrum of positions rather than a binary block-or-allow stance.

Q: Will this data become outdated?

A: Yes, for any reader arriving after June 13, 2026. This snapshot is sealed at that date and will not update. Individual site policies change frequently. For current policy state, an automated monitoring workflow is required — US Tech Automations can help you build one.

Key Takeaways

35 of 107 top sites block at least one OpenAI crawler — placing OpenAI in the middle tier of a 12-operator blocking spectrum.
GPTBot is the most widely refused of the 3 OpenAI crawlers, blocked by 33 sites; OAI-SearchBot is the least blocked at 16, suggesting publishers distinguish training from search indexing.
News (10 sites) is the highest-resistance category, followed by Entertainment (6), Tech (5), and Reference (5).
Sites blocking OpenAI tend to block broadly: BBC, Bloomberg, and USA Today each block 9 of 9 tracked headline bots.
48 of 107 sites (44.9%) block some AI crawler; the 35 OpenAI blockers represent the majority of that group.
The sealed snapshot sha 741353c4304216ee pins the exact dataset; nothing is derived or estimated from secondary sources.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Who Blocks OpenAI's GPTBot? 35 of 107 Top Sites Do.” https://ustechautomations.com/resources/blog/who-blocks-openai-gptbot-2026

Sealed snapshot sha256: 741353c4304216ee
