How Many Sites Block PerplexityBot? robots.txt Data

Jun 13, 2026

Key Takeaways

29 of 107 top sites block PerplexityBot.

PerplexityBot is blocked at a 27.1% rate across 107 sites.

48 of 107 sites block at least one AI crawler.

PerplexityBot is the Perplexity answer-engine crawler. Unlike a general web search bot, PerplexityBot powers an AI answer engine that synthesizes content from publisher sources and returns direct answers to users — often without sending traffic back to the originating site. That dynamic has made it a particular focus for publishers weighing AI-access policy. This report gives a complete, data-grounded view of how 107 prominent sites have responded to PerplexityBot via their robots.txt files as of June 13, 2026.

29 of 107 sites with a parseable robots.txt file block PerplexityBot as of June 13, 2026.

The 27.1% block rate for PerplexityBot falls below the 44.9% corpus-wide any-AI-crawler benchmark.


What Is PerplexityBot and Why Does It Matter?

PerplexityBot is the robot that fetches web content to power Perplexity's AI-driven search and answer product. Publishers who want to keep their content out of Perplexity-generated answers target the PerplexityBot user-agent token in their robots.txt files, typically with a User-agent: PerplexityBot line followed by a Disallow rule.
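As a rough illustration of how such a rule reads to a crawler, the sketch below parses a small robots.txt body with Python's standard-library `urllib.robotparser`. The file contents and the `may_fetch` helper are made up for this example and are not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: block PerplexityBot site-wide, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

def may_fetch(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` is permitted to fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

perplexity_allowed = may_fetch(ROBOTS_TXT, "PerplexityBot")
googlebot_allowed = may_fetch(ROBOTS_TXT, "Googlebot")
print(perplexity_allowed, googlebot_allowed)
```

Under these assumed rules, the PerplexityBot token is refused at the root while a search crawler such as Googlebot falls through to the wildcard group and is allowed.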

The answer-engine model differs from traditional search in a way that makes the blocking question especially salient for content publishers. A traditional search result sends users to the publisher's site. An AI answer engine may synthesize the publisher's content into a direct response, potentially reducing click-through traffic. Publishers who depend on that traffic — particularly news organizations and subscription-based media — have been among the most active in adding AI crawler blocks.

This report is grounded entirely in a point-in-time read of public robots.txt files. The data covers 122 prominent sites measured on June 13, 2026. Of those, 107 returned a parseable robots.txt file, forming the denominator for all rates. Every count is read verbatim from the sealed snapshot (sha 741353c4304216ee); nothing is estimated, modeled, or extrapolated.


Snapshot Summary: Sites, Coverage, and Block Rate

The snapshot covered 122 prominent sites across verticals including news, e-commerce, professional platforms, social media, finance, travel, education, government, and entertainment. Of those 122, 107 provided a parseable robots.txt file. Sites without a parseable file are excluded from rate calculations.

| Metric | Value |
| --- | --- |
| Sites in scope | 122 |
| Sites with a parseable robots.txt | 107 |
| Sites blocking PerplexityBot | 29 |
| Block rate for PerplexityBot | 27.1% |

The 27.1% block rate sits below the corpus-wide any-block benchmark of 44.9%, as any single-crawler rate must, because the benchmark counts a site that blocks even one AI crawler. The size of the gap is what matters: 19 of the 48 sites that block at least one AI crawler have not added PerplexityBot to their disallow list. That said, 29 blocks among 107 measured sites is a meaningful signal of publisher concern about the answer-engine model.
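The rates above follow directly from the snapshot counts. A minimal sketch of the arithmetic, using only figures stated in this report (the variable and function names are illustrative):

```python
PARSEABLE_SITES = 107      # sites with a parseable robots.txt
PERPLEXITY_BLOCKS = 29     # sites blocking PerplexityBot
ANY_AI_BLOCKS = 48         # sites blocking at least one AI crawler

def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

bot_rate = pct(PERPLEXITY_BLOCKS, PARSEABLE_SITES)         # 27.1
any_rate = pct(ANY_AI_BLOCKS, PARSEABLE_SITES)             # 44.9
gap_points = round(any_rate - bot_rate, 1)                 # 17.8 percentage points
not_listing = ANY_AI_BLOCKS - PERPLEXITY_BLOCKS            # 19 sites
share_of_blockers = pct(PERPLEXITY_BLOCKS, ANY_AI_BLOCKS)  # 60.4

print(bot_rate, any_rate, gap_points, not_listing, share_of_blockers)
```

The last two values rely on the fact that every site blocking PerplexityBot is, by definition, also a site blocking at least one AI crawler.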

PerplexityBot ranks seventh among the 9 crawlers in this snapshot. The cross-bot leaderboard in the following section provides the full comparative picture.


Who Blocks PerplexityBot? The Named Sites

The following 29 sites explicitly block PerplexityBot in their robots.txt files as of June 13, 2026. This list is complete as recorded in the snapshot — it reflects stated robots.txt policy only.

News and media: nytimes.com, washingtonpost.com, theguardian.com, bbc.com, cnn.com, apnews.com, bloomberg.com, forbes.com, theatlantic.com, usatoday.com, newsweek.com, vox.com, theverge.com, wired.com, arstechnica.com, cnet.com, zdnet.com, mashable.com, rollingstone.com, variety.com, hollywoodreporter.com, billboard.com

Finance: investopedia.com

Community and social: quora.com, linkedin.com, yelp.com

E-commerce: amazon.com, ebay.com

Government: congress.gov

The news and media vertical accounts for 22 of the 29 sites on this list. It includes nytimes.com, washingtonpost.com, theguardian.com, bloomberg.com, and cnn.com, a cluster of publications that have been publicly vocal about AI companies using their journalism without licensing agreements. Entertainment titles such as rollingstone.com, variety.com, hollywoodreporter.com, and billboard.com extend the pattern into entertainment journalism.

Notably, several prominent news sites do not block PerplexityBot according to this snapshot. The allower group includes reuters.com, wsj.com, businessinsider.com, latimes.com, time.com, and engadget.com. The tech press is split: theverge.com and arstechnica.com block PerplexityBot, while techcrunch.com, engadget.com, gizmodo.com, and venturebeat.com do not.

Sites like reddit.com, wikipedia.org, github.com, and webmd.com, along with retail properties such as walmart.com, target.com, bestbuy.com, and etsy.com, do not list PerplexityBot in their disallow rules. That suggests a tiered response to AI crawlers in which the most blocking-active sites tend to be those whose core product is original content that competes directly with AI-generated summaries.

For comparison, the Meta-ExternalAgent report shows a closely related blocking pattern where the news and media vertical also dominates the blocker list.


Cross-Bot Leaderboard: Where PerplexityBot Ranks

The table below shows blocking counts for all 9 crawlers measured in this snapshot, across all 107 sites with a parseable robots.txt file.

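| Bot | Sites Blocking | Block Rate |
| --- | --- | --- |
| CCBot | 40 | 37.4% |
| ClaudeBot | 38 | 35.5% |
| Bytespider | 37 | 34.6% |
| GPTBot | 33 | 30.8% |
| Applebot-Extended | 31 | 29.0% |
| Meta-ExternalAgent | 30 | 28.0% |
| PerplexityBot | 29 | 27.1% |
| Google-Extended | 25 | 23.4% |
| Amazonbot | 22 | 20.6% |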

PerplexityBot at 29 blocks sits at rank seven, one position below Meta-ExternalAgent at 30. The gap between Meta-ExternalAgent and PerplexityBot is narrow — a single site. The gap between PerplexityBot and Google-Extended below it is larger at four sites. That makes PerplexityBot part of a tight cluster of crawlers rather than an outlier.
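For readers who want to reproduce the ranking and the gaps, the sketch below rebuilds them from the block counts in the table above. The counts come from this report; the variable names are illustrative, and the simple sort assumes no tied counts, which holds for these nine crawlers.

```python
DENOMINATOR = 107  # sites with a parseable robots.txt

BLOCK_COUNTS = {
    "CCBot": 40, "ClaudeBot": 38, "Bytespider": 37, "GPTBot": 33,
    "Applebot-Extended": 31, "Meta-ExternalAgent": 30,
    "PerplexityBot": 29, "Google-Extended": 25, "Amazonbot": 22,
}

# Sort by block count, highest first, then assign 1-based ranks.
ranked = sorted(BLOCK_COUNTS.items(), key=lambda item: item[1], reverse=True)
rank = {bot: position for position, (bot, _) in enumerate(ranked, start=1)}
rates = {bot: f"{100 * count / DENOMINATOR:.1f}%" for bot, count in ranked}

for bot, count in ranked:
    print(f"{rank[bot]:>2}  {bot:<20}{count:>3}  {rates[bot]:>6}")

idx = rank["PerplexityBot"] - 1  # zero-based index into `ranked`
gap_above = ranked[idx - 1][1] - BLOCK_COUNTS["PerplexityBot"]  # vs. next-higher bot
gap_below = BLOCK_COUNTS["PerplexityBot"] - ranked[idx + 1][1]  # vs. next-lower bot
print(rank["PerplexityBot"], gap_above, gap_below)
```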

CCBot leads the table at 40, followed closely by ClaudeBot at 38 and Bytespider at 37. Those three top crawlers are associated with training-dataset use cases, which publishers have shown particular sensitivity toward.


Operator Leaderboard: Perplexity in Context

The operator leaderboard aggregates blocks at the company level — counting how many of the 107 sites block at least one crawler associated with each operator.

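| Rank | Operator | Sites Blocking (any crawler) |
| --- | --- | --- |
| 1 | Common Crawl | 40 |
| 2 | Anthropic | 39 |
| 3 | ByteDance | 37 |
| 4 | OpenAI | 35 |
| 4 | Meta | 35 |
| 6 | Apple | 31 |
| 7 | Diffbot | 30 |
| 8 | Perplexity | 29 |
| 9 | Cohere | 27 |
| 10 | Google | 25 |
| 11 | Amazon | 22 |
| 12 | Mistral | 12 |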

Perplexity sits at rank eight with 29 — the same number as the PerplexityBot bot-level count. That is expected: when an operator runs a single measured crawler, the bot count and operator count align. It confirms that the 29 sites blocking PerplexityBot account for the full Perplexity operator footprint in this snapshot.
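One way to picture the operator-level aggregation is as a set union over each operator's crawler tokens. The sketch below uses made-up domains and an assumed token mapping; which tokens the snapshot attributes to each operator is not stated in this report, so the second OpenAI token is purely illustrative.

```python
# Assumed operator-to-token mapping (illustrative, not the snapshot's own).
OPERATOR_TOKENS = {
    "Perplexity": ["PerplexityBot"],
    "OpenAI": ["GPTBot", "ChatGPT-User"],
}

# Hypothetical sets of sites blocking each token (not snapshot data).
BLOCKERS = {
    "PerplexityBot": {"a.example", "b.example"},
    "GPTBot": {"a.example", "c.example"},
    "ChatGPT-User": {"c.example", "d.example"},
}

def operator_count(operator: str) -> int:
    """Number of sites blocking at least one of the operator's tokens."""
    sites = set()
    for token in OPERATOR_TOKENS[operator]:
        sites |= BLOCKERS.get(token, set())
    return len(sites)

perplexity_total = operator_count("Perplexity")  # equals the single bot's count
openai_total = operator_count("OpenAI")          # union can exceed either bot's count
print(perplexity_total, openai_total)
```

With one token, the union is just that token's blocker set, so the operator and bot counts match. With two tokens, a site that blocks only one of them still counts toward the operator.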

Mistral at the bottom of the operator table at 12 is the lowest-blocked operator measured. Google at 25 also sits on the lower end despite being one of the most recognized names in AI — a dynamic explored in the Google-Extended report.


Methodology and Data Integrity

This report is part of the US Tech Automations Closing Web research edition. The data collection methodology is consistent across all reports in this edition:

  1. A list of 122 prominent sites was assembled, covering major verticals with meaningful web traffic and publisher diversity.

  2. Each site's robots.txt was fetched and parsed on June 13, 2026. Sites that returned a parseable robots.txt file formed the analysis denominator of 107.

  3. For each named AI crawler, the parser checked for a Disallow rule targeting the root path or a broad pattern. A site is counted as "blocking" if its robots.txt instructs the named crawler to avoid at least the root.

  4. All counts are verbatim. Nothing is estimated, modeled, or extrapolated. The snapshot is sealed under sha 741353c4304216ee.
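The steps above can be approximated in a few lines with Python's standard-library `urllib.robotparser`. This is a sketch under stated assumptions, not the parser used for the snapshot: the domains and robots.txt bodies are invented, only three of the nine tokens are checked, and "blocking" is simplified to "the root path is disallowed for this token". Note that under this simplification a wildcard `User-agent: *` group that disallows the root would count against every crawler; whether the snapshot treats wildcard rules that way is not stated in the report.

```python
from urllib.robotparser import RobotFileParser

CRAWLERS = ["CCBot", "GPTBot", "PerplexityBot"]  # subset of the nine tokens

# Hypothetical robots.txt bodies keyed by made-up domains (not snapshot data).
SAMPLE_FILES = {
    "news.example": "User-agent: GPTBot\nUser-agent: PerplexityBot\nDisallow: /\n",
    "shop.example": "User-agent: CCBot\nDisallow: /\n",
    "wiki.example": "User-agent: *\nDisallow: /private/\n",
}

def blocks_root(robots_txt: str, agent: str) -> bool:
    """True if `agent` is told not to fetch the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")

def tally(files: dict) -> tuple:
    """Count blocking sites per crawler, plus sites blocking any crawler."""
    per_bot = {bot: 0 for bot in CRAWLERS}
    any_block = 0
    for text in files.values():
        blocked = [bot for bot in CRAWLERS if blocks_root(text, bot)]
        for bot in blocked:
            per_bot[bot] += 1
        any_block += bool(blocked)
    return per_bot, any_block

per_bot, any_block = tally(SAMPLE_FILES)
print(per_bot, any_block)
```

Dividing each per-crawler count by the number of parseable files gives the block rate, and the any-crawler count gives the benchmark numerator.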

The robots.txt standard is advisory, not technically enforceable. A crawler operator can choose to ignore a Disallow directive and still fetch the content. This report documents stated AI-access policy — what sites have chosen to declare — rather than guaranteed behavior. Publishers who need binding restrictions must use other mechanisms.

The 107-site denominator is fixed for all bot calculations in this snapshot, ensuring that cross-bot comparisons in the leaderboard table are apples-to-apples.


Frequently Asked Questions

Q: Why would a site block PerplexityBot but allow traditional search crawlers?

A: The distinction comes down to how the content is used. A traditional search crawler indexes content to generate links that drive traffic back to the publisher. An answer engine like Perplexity synthesizes content into direct answers, which may reduce or eliminate the traffic referral the publisher would otherwise receive. Publishers making a deliberate AI-access decision often differentiate between these use cases, blocking answer-engine crawlers while permitting search-index crawlers.

Q: Does the 29 block count mean 29 sites oppose Perplexity specifically?

A: Not necessarily. Many sites apply blanket AI crawler blocks by listing multiple user-agent tokens together. A site might have adopted a broad policy of blocking all AI crawlers rather than making a targeted judgment about PerplexityBot. The robots.txt file records what is declared, not the reasoning behind it.

Q: How does the PerplexityBot block rate compare to the corpus benchmark?

A: The corpus-wide benchmark is 44.9%: 48 of 107 sites block at least one AI crawler. PerplexityBot at 27.1% falls 17.8 percentage points below that line. Put differently, 29 of those 48 sites, about 60%, block PerplexityBot specifically. Only Google-Extended and Amazonbot sit further below the benchmark in the leaderboard.

Q: What does it mean that Perplexity the operator has 29 blocks — matching the bot count?

A: Perplexity runs a single crawler token that this snapshot measures — PerplexityBot. When a site blocks that token, it appears in both the bot count and the operator count. Because there is no additional Perplexity crawler in the nine-bot measurement set, the two numbers align exactly at 29. Operators that run multiple crawler tokens (such as Meta or OpenAI) can have operator counts that exceed any individual bot count.

Q: Can a site change its robots.txt policy after this snapshot?

A: Yes. The robots.txt file is editable at any time, and publisher policies are not static. This snapshot reflects the state on June 13, 2026. A site that allows PerplexityBot today might add a block tomorrow, or a site that currently blocks it might negotiate a licensing arrangement and remove the restriction. For tracking those changes over time, see the CCBot report, which covers the crawler with the highest observed blocking rate in this edition.


Put AI-Access Data to Work

A publisher RevOps lead managing a portfolio of content properties needs to know whether their competitors or adjacent sites are changing their AI-access stance — not as a one-time research project, but as an ongoing operational signal. Manual robots.txt checks across dozens of domains are impractical at any meaningful scale.

US Tech Automations builds agentic workflows that automate exactly this monitoring: scheduled robots.txt fetches across a defined site list, diff detection when crawler tokens are added or removed, and alerting routed to the right team member. The workflow replaces a quarterly manual review with a continuous operational signal.
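To make the diff-detection step concrete, here is a minimal sketch of comparing two robots.txt versions for a watchlist of crawler tokens. It is an illustration only, not the US Tech Automations workflow: the watchlist, the sample file bodies, and the function names are assumptions, and fetching, scheduling, and alerting are left out.

```python
from urllib.robotparser import RobotFileParser

WATCHLIST = ["CCBot", "ClaudeBot", "GPTBot", "PerplexityBot"]  # illustrative subset

def blocked_tokens(robots_txt: str) -> set:
    """Watchlist tokens that are disallowed from the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot for bot in WATCHLIST if not parser.can_fetch(bot, "/")}

def diff_policy(old_txt: str, new_txt: str) -> dict:
    """Tokens newly blocked or newly unblocked between two file versions."""
    old, new = blocked_tokens(old_txt), blocked_tokens(new_txt)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

# Hypothetical before/after bodies for a single site.
yesterday = "User-agent: GPTBot\nDisallow: /\n"
today = "User-agent: GPTBot\nUser-agent: PerplexityBot\nDisallow: /\n"

change = diff_policy(yesterday, today)
print(change)
```

A non-empty "added" or "removed" list is the kind of event a monitoring job would route to an alert.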

Explore agentic monitoring workflows on the platform to see how AI-access policy tracking can be built into your content operations stack.


Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “How Many Sites Block PerplexityBot? robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-perplexitybot-2026

Sealed snapshot sha256: 741353c4304216ee

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.