How Many Top Sites Block Bytespider? Sealed robots.txt Data
Key Takeaways
37 of 107 top sites block Bytespider.
Bytespider is blocked at a 34.6% rate across 107 sites.
48 of 107 sites block at least one AI crawler.
Of the 122 prominent sites in our starting universe, 107 returned a parseable robots.txt file. Of those 107, exactly 37 block Bytespider — a block rate of 34.6%. Among the 9 crawlers measured in this edition, Bytespider ranks third, trailing CCBot (40 sites) and ClaudeBot (38 sites) by a narrow margin.
Bytespider is the ByteDance crawler. ByteDance operates Bytespider to gather web content for products and AI systems. ByteDance as an operator is blocked by 37 sites — the same as the Bytespider per-bot count, indicating that Bytespider is the primary crawler token publishers target when they want to block ByteDance activity.
Of 107 prominent sites with a parseable robots.txt, 37 block Bytespider — a 34.6% block rate, third highest among 9 crawlers measured in June 2026.
48 of 107 sites (44.9%) block at least one AI crawler; Bytespider's 34.6% block rate falls below that corpus-wide figure.
What Is Bytespider and Why Do Publishers Block It
Bytespider is the ByteDance crawler. It identifies itself with the user-agent string Bytespider. ByteDance operates a range of consumer and professional products, and Bytespider collects web content to support those products and their underlying AI systems.
Publishers who block Bytespider often do so alongside other AI crawlers, but the specific selection varies by site. Some sites block all three top crawlers (CCBot, ClaudeBot, Bytespider) while others draw distinctions. The robots.txt data cannot reveal why any individual site made a specific choice; it can only reveal what rules exist at a point in time.
The close clustering of CCBot, ClaudeBot, and Bytespider at the top of the leaderboard — just 3 sites separating first from third — suggests that many publishers who address AI-crawler policy at all address multiple crawlers simultaneously. Sites that take a selective approach to blocking are reflected in the differences between the three counts.
For the full picture of the top-ranked CCBot, see how many sites block CCBot. For a comparison with the second-ranked ClaudeBot, see how many sites block ClaudeBot.
Methodology
US Tech Automations Research collected publicly accessible robots.txt files from 122 prominent sites on June 13, 2026. For each site, we fetched the file, parsed its directives, and checked whether a broad disallow instruction covers Bytespider, either in a User-agent: Bytespider section or in a catch-all User-agent: * section.
Of the 122 sites, 107 returned a parseable robots.txt file. The remaining 15 are excluded from all percentages and counts. Every figure is a verbatim read from the raw text of public files; nothing is estimated, modeled, or extrapolated. The snapshot is sealed with sha256 hash 741353c4304216ee, and the data window is a single point in time: June 13, 2026.
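The per-site check described in this section can be sketched with Python's standard-library robots.txt parser. This is a minimal illustration, not the pipeline used for the report: the function name `blocks_bot`, the sample files, and the reading of "broad disallow" as "the root path cannot be fetched" are all assumptions.

```python
from urllib.robotparser import RobotFileParser


def blocks_bot(robots_txt: str, user_agent: str = "Bytespider") -> bool:
    """Return True if the site root is disallowed for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumption: a "broad disallow" means the root path "/" cannot be fetched.
    return not parser.can_fetch(user_agent, "/")


# Hypothetical robots.txt files, not taken from any site in the corpus.
NAMED_BLOCK = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

CATCH_ALL_BLOCK = """\
User-agent: *
Disallow: /
"""

NO_BLOCK = """\
User-agent: *
Disallow: /private/
"""

for label, text in [
    ("named section", NAMED_BLOCK),
    ("catch-all section", CATCH_ALL_BLOCK),
    ("no broad block", NO_BLOCK),
]:
    print(f"{label}: blocks Bytespider = {blocks_bot(text)}")
```

The first two samples count as blocks under this reading, through the named section and the catch-all section respectively. The third does not, because it restricts only one directory.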
| Metric | Value |
|---|---|
| Sites in starting universe | 122 |
| Sites with a parseable robots.txt | 107 |
| Sites blocking Bytespider | 37 |
| Block rate | 34.6% |
Sites That Block Bytespider
The 37 sites that block Bytespider represent a cross-section of news, technology media, entertainment, health, e-commerce, financial, and government content. The category spread indicates that publisher concern about ByteDance crawl activity is not confined to a single vertical.
Major news and media outlets blocking Bytespider include nytimes.com, washingtonpost.com, theguardian.com, bbc.com, cnn.com, bloomberg.com, forbes.com, businessinsider.com, theatlantic.com, usatoday.com, latimes.com, newsweek.com, and vox.com. The entertainment publications rollingstone.com, variety.com, hollywoodreporter.com, and billboard.com are in the blocker list. Sports content is represented by espn.com.
Technology media on the Bytespider blocker list includes techcrunch.com, theverge.com, wired.com, arstechnica.com, cnet.com, zdnet.com, mashable.com, gizmodo.com, and venturebeat.com. The health segment is represented by healthline.com. E-commerce destinations amazon.com and ebay.com both block Bytespider, as does the professional network linkedin.com.
User-generated content platform medium.com blocks Bytespider. Video platform vimeo.com and the financial reference site fool.com appear in the blocker list. Review platform tripadvisor.com and government content portal congress.gov round out the roster.
Notable allowers in this snapshot, meaning sites that do not block Bytespider, include reuters.com, apnews.com, wsj.com, time.com, and wikipedia.org. Health information sites webmd.com, medlineplus.gov, and cdc.gov do not block Bytespider. Reference sites investopedia.com, merriam-webster.com, britannica.com, and dictionary.com permit it. In contrast to amazon.com and ebay.com, a broad set of other e-commerce sites (walmart.com, target.com, bestbuy.com, etsy.com, homedepot.com, wayfair.com, ikea.com, nordstrom.com, nike.com, and shopify.com) does not block Bytespider.
Financial services sites chase.com, bankofamerica.com, wellsfargo.com, fidelity.com, paypal.com, nerdwallet.com, bankrate.com, morningstar.com, marketwatch.com, and coinbase.com all permit Bytespider. Government portals usa.gov, irs.gov, sec.gov, whitehouse.gov, census.gov, nasa.gov, and uspto.gov do not block it. Educational sites mit.edu, harvard.edu, stanford.edu, coursera.org, and edx.org allow Bytespider. Entertainment platforms netflix.com, spotify.com, youtube.com, and hulu.com do too. Travel and hospitality sites expedia.com, booking.com, airbnb.com, kayak.com, marriott.com, and hilton.com also allow it.
Cross-Bot Leaderboard (all 107 sites)
The table below shows block counts and rates for all 9 measured crawlers, ranked from most-blocked to least-blocked across the same 107-site corpus.
| Bot | Sites Blocking | Block Rate |
|---|---|---|
| CCBot | 40 | 37.4% |
| ClaudeBot | 38 | 35.5% |
| Bytespider | 37 | 34.6% |
| GPTBot | 33 | 30.8% |
| Applebot-Extended | 31 | 29.0% |
| Meta-ExternalAgent | 30 | 28.0% |
| PerplexityBot | 29 | 27.1% |
| Google-Extended | 25 | 23.4% |
| Amazonbot | 22 | 20.6% |
Bytespider sits in third place, one site behind ClaudeBot and three behind CCBot. The top three crawlers are tightly grouped. The middle tier of GPTBot (33), Applebot-Extended (31), Meta-ExternalAgent (30), and PerplexityBot (29) is a step down from the top cluster. Google-Extended (25) and Amazonbot (22) trail the field.
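The rates in the leaderboard follow directly from the counts and the 107-site denominator. The short sketch below reproduces them; the counts are copied from the table, while the names `CORPUS_SIZE`, `BLOCK_COUNTS`, and `block_rate` are illustrative.

```python
CORPUS_SIZE = 107  # sites that returned a parseable robots.txt

# Counts copied from the cross-bot leaderboard.
BLOCK_COUNTS = {
    "CCBot": 40,
    "ClaudeBot": 38,
    "Bytespider": 37,
    "GPTBot": 33,
    "Applebot-Extended": 31,
    "Meta-ExternalAgent": 30,
    "PerplexityBot": 29,
    "Google-Extended": 25,
    "Amazonbot": 22,
}


def block_rate(blocking_sites: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Share of the corpus that blocks a crawler, as a percentage to one decimal."""
    return round(100 * blocking_sites / corpus_size, 1)


for bot, count in BLOCK_COUNTS.items():
    print(f"{bot:<20} {count:>3} {block_rate(count):>5.1f}%")
```

The same function gives 44.9% for the 48 sites that block at least one AI crawler.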
For a profile of the fourth-ranked GPTBot, see how many sites block GPTBot. For the fifth-ranked Applebot-Extended, see how many sites block Applebot-Extended.
Operator Leaderboard (all 107 sites)
This table counts, by operator, how many sites block at least one crawler associated with that organization. An operator that runs multiple crawlers may have a count higher than any individual bot.
| Rank | Operator | Sites Blocking |
|---|---|---|
| 1 | Common Crawl | 40 |
| 2 | Anthropic | 39 |
| 3 | ByteDance | 37 |
| 4 | OpenAI | 35 |
| 4 | Meta | 35 |
| 6 | Apple | 31 |
| 7 | Diffbot | 30 |
| 8 | Perplexity | 29 |
| 9 | Cohere | 27 |
| 10 | Google | 25 |
| 11 | Amazon | 22 |
| 12 | Mistral | 12 |
ByteDance ranks third at 37 — identical to the Bytespider per-bot count, meaning Bytespider is the crawler token that captures the full ByteDance blocking signal. No additional ByteDance-attributed crawler token pushes the operator count above 37 in this snapshot.
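An operator count of this kind is a set union: a site counts once if it blocks any token attributed to that operator. The sketch below shows why an operator total can match a bot total or exceed it. The site names, the per-site data, and the token-to-operator mapping are all hypothetical; the report does not list which tokens it attributes to each operator.

```python
# Hypothetical token-to-operator mapping (an assumption, not from the report).
OPERATOR_TOKENS = {
    "ByteDance": {"Bytespider"},
    "Anthropic": {"ClaudeBot", "anthropic-ai"},
}

# Hypothetical per-site results: site -> set of crawler tokens it blocks.
BLOCKED_BY_SITE = {
    "site-a.example": {"Bytespider", "ClaudeBot"},
    "site-b.example": {"anthropic-ai"},
    "site-c.example": {"Bytespider"},
    "site-d.example": set(),
}


def sites_blocking_token(blocked_by_site, token):
    """Sites whose rules block one specific crawler token."""
    return {site for site, tokens in blocked_by_site.items() if token in tokens}


def sites_blocking_operator(blocked_by_site, operator_tokens):
    """Sites that block at least one token attributed to an operator."""
    return {
        site
        for site, tokens in blocked_by_site.items()
        if tokens & operator_tokens  # non-empty intersection
    }


for operator, tokens in OPERATOR_TOKENS.items():
    total = len(sites_blocking_operator(BLOCKED_BY_SITE, tokens))
    print(f"{operator}: {total} sites block at least one token")
```

In this toy data the ByteDance total equals the Bytespider total because only one token is attributed to the operator. The Anthropic total exceeds the ClaudeBot total because one site blocks a second token without blocking ClaudeBot.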
At the bottom of the table, Mistral is blocked by only 12 sites, a dramatically lower figure than the top operators. That spread shows that publisher policy is highly operator-specific, not a blanket AI block across all organizations.
Frequently Asked Questions
Q: Is Bytespider blocked by the same sites that block CCBot and ClaudeBot?
A: Not necessarily. The counts shown in the leaderboard reflect per-bot totals across all 107 sites. A site counted in both the CCBot and Bytespider totals blocked both crawlers; a site counted only in CCBot blocked that one but allowed Bytespider. This report does not break out overlap counts — those would be derived numbers outside the sealed fact sheet.
Q: Does blocking Bytespider prevent ByteDance from accessing the site?
A: robots.txt is an honor-system standard. A compliant crawler reads the file and respects disallow directives. Blocking Bytespider does not technically prevent access — it signals a preference. Whether the crawler respects that preference is verifiable only through server log analysis, not through the robots.txt file itself.
Q: What does the operator-to-bot parity (both at 37) tell us?
A: When the ByteDance operator count equals the Bytespider bot count, it suggests that Bytespider is the only ByteDance-attributed crawler string found in any disallow rule across the 107 sites. Sites that want to block ByteDance activity can do so by targeting Bytespider. If ByteDance were to add a second crawler token, publishers might need to add a second rule.
Q: How should a data engineer interpret the allower list when building a training pipeline?
A: The allower list identifies sites that do not have a Bytespider-specific or catch-all block in their robots.txt. That is not the same as permission to train on their content. robots.txt governs crawl access at the protocol level; terms of service govern use of the content itself. A responsible training pipeline reviews both before ingesting content from any source.
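The server-log check mentioned in the answer on enforcement can be sketched as follows. This is an illustration under stated assumptions: the log lines are invented, the format is assumed to be the common "combined" access-log layout, the user-agent string is a simplified stand-in, and the function name `disallowed_fetches` is hypothetical.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumption: access logs use the "combined" layout, ending with
# "request" status bytes "referer" "user-agent".
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def disallowed_fetches(log_lines, robots_txt, token="Bytespider"):
    """List paths requested by `token` that robots.txt disallows for it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        if token.lower() not in match["agent"].lower():
            continue  # request came from a different client
        path = match["path"]
        if path == "/robots.txt":
            continue  # crawlers are expected to fetch the policy file itself
        if not parser.can_fetch(token, path):
            hits.append(path)
    return hits


# Hypothetical policy and log lines for illustration only.
SAMPLE_ROBOTS = "User-agent: Bytespider\nDisallow: /\n"
SAMPLE_LOG = [
    '203.0.113.7 - - [13/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.7 - - [13/Jun/2026:10:00:05 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '198.51.100.9 - - [13/Jun/2026:10:01:00 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(disallowed_fetches(SAMPLE_LOG, SAMPLE_ROBOTS))
```

A match here is only a lead. User-agent strings can be spoofed, so a careful check would also confirm the source addresses before attributing traffic to any operator.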
How to Read This Number
A block count is a measure of stated policy, not of enforcement. When a site lists Bytespider in a disallow rule, it has published an intent: it does not want ByteDance to fetch its pages with that user-agent. Whether ByteDance honors that intent is a separate question that robots.txt alone cannot answer. The value of a cross-site count is that it turns 107 individual policy decisions into one comparable signal: a way to see whether avoiding Bytespider is a fringe practice or a mainstream one among prominent publishers.
Reading the figure well also means respecting what it excludes. The count reflects only sites that returned a parseable robots.txt file, and only the exact user-agent token Bytespider. A site that blocks ByteDance under a different token, or with a server-side rule invisible to a robots.txt read, is not counted here. The number is therefore a floor on stated objection, not a ceiling — a deliberately conservative reading that matches the sealed-data discipline of this series, where nothing is estimated, modeled, or extrapolated.
Treated this way, the count becomes a baseline a publisher can track over time. If a later edition shows the same token blocked by more sites, that movement is itself the story. A single point-in-time read like this one is the anchor against which future drift across the corpus is measured.
Put AI-Access Data to Work
A data engineer building a training or retrieval pipeline needs to know, before fetching content, which high-value sources have placed crawl restrictions. Checking robots.txt once is a start, but policies change. A site that allows Bytespider today might add a disallow rule next month.
US Tech Automations automates continuous robots.txt monitoring: agentic workflows check a defined universe of sites on a schedule, parse per-bot rules, and surface any changes as structured alerts routed to the right team. No manual checks, no policy drift going unnoticed. Explore the agentic workflow platform to build this monitoring into your data pipeline.
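Detecting that kind of policy drift comes down to comparing two snapshots of per-site block status. The sketch below is a generic illustration and does not describe the platform's implementation; the function name `diff_block_status`, the snapshot shape (site mapped to a boolean), and the site names are assumptions.

```python
def diff_block_status(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, str]:
    """Describe how each site's block status changed between two snapshots."""
    changes = {}
    for site in sorted(previous.keys() | current.keys()):
        before = previous.get(site)  # None means the site was not in that snapshot
        after = current.get(site)
        if before == after:
            continue
        if before is None:
            changes[site] = "new in corpus"
        elif after is None:
            changes[site] = "dropped from corpus"
        elif after:
            changes[site] = "now blocks"
        else:
            changes[site] = "no longer blocks"
    return changes


# Hypothetical snapshots: True means the site blocks Bytespider.
JUNE = {"site-a.example": False, "site-b.example": True, "site-c.example": True}
JULY = {"site-a.example": True, "site-b.example": True, "site-d.example": False}

for site, change in diff_block_status(JUNE, JULY).items():
    print(f"{site}: {change}")
```

Keeping the comparison separate from fetching and parsing means the same function can run on any schedule and for any crawler token.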
Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).
Cite this report
US Tech Automations Research, 2026-06 edition. “How Many Top Sites Block Bytespider? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/how-many-sites-block-bytespider-2026
Sealed snapshot sha256: 741353c4304216ee