Who Blocks ByteDance's Bytespider? 37 of 107 Top Sites Do

Jun 13, 2026

ByteDance Bytespider is blocked by 37 of 107 prominent sites in our June 2026 robots.txt snapshot — a 34.6% refusal rate that makes it one of the more widely restricted AI crawlers in the study. That figure puts ByteDance roughly on par with other major operators, yet reveals a distinct pattern: Bytespider draws opposition from media, tech, and entertainment publishers who view it as an AI training or data-extraction tool rather than a conventional search spider.

"Blocking" here means a site's robots.txt explicitly names the Bytespider user-agent with a Disallow directive, typically Disallow: /, that covers the whole site or a meaningful portion of it. It is a signal of deliberate intent, not an accidental omission. ByteDance operates Bytespider as its primary AI crawler; the data below reflects exactly how many properties chose to exclude it.
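The definition above can be checked mechanically. The following is a minimal sketch using Python's standard-library robots.txt parser, not the study's actual tooling. The helper name `blocks_agent` and the sample file are illustrative assumptions, and the check tests only the site root, so it does not capture partial-site blocks.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, agent: str = "Bytespider") -> bool:
    """Return True if this robots.txt text disallows `agent` at the site root.

    Hypothetical helper: it only tests "/", so a file that blocks a
    sub-path but not the root is reported as not blocking.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# Illustrative robots.txt content, not taken from any site in the study.
sample = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

print(blocks_agent(sample))               # Bytespider group applies
print(blocks_agent(sample, "Googlebot"))  # falls through to the wildcard group
```

Because the check reads only the file, it reports stated intent, in line with the honor-system caveat in this report.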

Bytespider is blocked at 37 of 107 sites — a 34.6% refusal rate.

All figures are verbatim counts from public robots.txt files fetched and sealed point-in-time on June 13, 2026, across a curated set of 122 prominent sites; 107 returned a parseable robots.txt and percentages are over those 107. robots.txt is an honor-system standard — it measures the site operator's stated intent, not a technical firewall. These numbers will not change as sites later edit their files.

How Often ByteDance Is Refused

Bytespider is ByteDance's single tracked crawler in this study. It collected 37 blocks — the full operator block count — meaning every site that restricts ByteDance does so via this one user-agent string. No secondary crawler was observed in the studied robots.txt files.

User-Agent	Sites Blocking
Bytespider	37

For context: across the entire corpus of 21 tracked bots and 12 operators, 48 of 107 sites (44.9%) block at least one AI crawler, and 20 sites (18.7%) have an llms.txt file. Bytespider's 34.6% rate sits meaningfully below the "any AI block" ceiling, but it ranks ahead of several smaller operators in absolute count.
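The percentages quoted so far all share the 107-site denominator. A short sketch reproduces them from the counts stated in this report; the dictionary labels are illustrative names, not fields from the published dataset.

```python
TOTAL_PARSEABLE = 107  # sites that returned a parseable robots.txt

# Counts as stated in the report; the labels are illustrative.
counts = {
    "blocks Bytespider": 37,
    "blocks at least one AI crawler": 48,
    "has llms.txt": 20,
}


def pct(count: int, total: int = TOTAL_PARSEABLE) -> float:
    """Share of the parseable corpus, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in counts.items():
    print(f"{label}: {count}/{TOTAL_PARSEABLE} = {pct(count)}%")
```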

"37 of 107 prominent sites — 34.6% — have explicitly told Bytespider to stay out, as recorded in our point-in-time June 13, 2026 snapshot."

The single-user-agent profile is worth noting. Some operators use multiple crawler strings, which can inflate their apparent exposure to blocking on a per-bot basis. ByteDance's consolidated posture under Bytespider means the 37 figure is a clean, unduplicated count. Compare this profile with the OpenAI GPTBot block data to see how the two operators compare at the same snapshot date.

Bytespider blockers span 9 of 10 content categories in the study.

"9 of the study's 10 content categories contain at least one Bytespider blocker, confirming that opposition is not confined to a single publisher type."

Which Industries Block ByteDance

News publishers are the dominant force, with 13 sites in that category listing Bytespider in their robots.txt exclusions. Tech media comes in second at 9, followed by Entertainment at 5. The remaining categories contribute smaller but consistent signals.

Category	Sites Blocking Bytespider
News	13
Tech	9
Entertainment	5
Social	3
Reference	2
Retail	2
Finance	1
Travel	1
Government	1
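As a quick consistency check, the category counts in the table can be summed and compared against the 37-site total. This sketch simply transcribes the table; nothing in it comes from outside the report.

```python
# Category counts transcribed from the table above.
category_blocks = {
    "News": 13,
    "Tech": 9,
    "Entertainment": 5,
    "Social": 3,
    "Reference": 2,
    "Retail": 2,
    "Finance": 1,
    "Travel": 1,
    "Government": 1,
}

total_blocks = sum(category_blocks.values())
news_and_tech = category_blocks["News"] + category_blocks["Tech"]
news_tech_share = round(100 * news_and_tech / total_blocks, 1)

print(f"categories with a blocker: {len(category_blocks)}")
print(f"total blocks: {total_blocks}")
print(f"News + Tech: {news_and_tech} ({news_tech_share}% of all blocks)")
```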

The News dominance is not surprising. Publications like the New York Times, BBC, and Washington Post treat their editorial archives as core commercial assets. When an AI company's crawler appears in their logs, the legal and reputational calculus shifts quickly toward explicit exclusion.

Tech media at 9 sites tells a secondary story. Publications such as Wired, Ars Technica, and The Verge have been among the most vocal critics of scraping-based AI training. Their robots.txt policies reflect editorial positions, not just IT defaults.

Entertainment — Rolling Stone, Variety, Hollywood Reporter, Billboard — adds 5 more, likely driven by concerns about lyric and review content being fed into generative systems. The Social category's 3 sites, and the single Government block from congress.gov, suggest that resistance to Bytespider reaches beyond content-as-product businesses into community platforms and civic institutions.

For a complementary view of how other major operators fare in this snapshot, see the Anthropic ClaudeBot report and the Meta AI crawler data from this same sealed dataset.

The Named Sites That Block ByteDance

The table below lists 12 representative sites from the 37 total blockers, selected for the highest headline crawler counts — a proxy for how broadly each site restricts AI access across all 21 tracked bots.

Site	Category	Headline Crawlers Blocked (of 9)
bbc.com	News	9
bloomberg.com	News	9
usatoday.com	News	9
nytimes.com	News	8
cnn.com	News	8
wired.com	Tech	8
arstechnica.com	Tech	8
ebay.com	Retail	8
rollingstone.com	Entertainment	8
variety.com	Entertainment	8
congress.gov	Government	8
linkedin.com	Social	7

BBC, Bloomberg, and USA Today each block 9 of the 9 headline crawlers tracked — a maximum-restriction posture. Their inclusion of Bytespider is part of a broader, deliberate policy against AI scrapers rather than a targeted ByteDance decision. Sites lower in the count — such as congress.gov (8) and linkedin.com (7) — still block Bytespider specifically, though their overall AI policy is slightly less sweeping.

Notable names from the full 37-site list that did not make this abbreviated table include The Atlantic, Forbes, Mashable, CNET, ZDNet, The Washington Post, The Guardian, Newsweek, Vox, The Verge, Healthline, Amazon, TripAdvisor, TechCrunch, Medium, Quora, Vimeo, VentureBeat, ESPN, Business Insider, the Los Angeles Times, Gizmodo, and The Motley Fool. That breadth, spanning 9 distinct content categories, is what distinguishes Bytespider's footprint from narrower-impact operators.

The presence of fool.com (Finance, 1 headline block) and tripadvisor.com (Travel, 7 headline blocks) underscores that opposition to Bytespider extends into verticals beyond media and tech. Finance sites have strong intellectual-property incentives to restrict AI scrapers that could absorb proprietary investment commentary; travel platforms protect curated destination content and user-generated reviews with similar logic. Even at 1 Finance block and 1 Travel block, these categories show that blocking is not limited to publishers, although one of the study's 10 categories contains no Bytespider blocker at all.

News (13) and Tech (9) produce 22 of 37 total Bytespider blocks.

Methodology and Data Integrity

The Closing Web snapshot was assembled by fetching the public /robots.txt file from each of the 122 curated domains on June 13, 2026. Each file was parsed into a structured user-agent-to-disallow map and sealed under snapshot sha 741353c4304216ee. Only sites that returned an HTTP 200 response with a parseable robots.txt were included in the denominator; 107 of 122 qualified.
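The parse-and-seal step described above might look roughly like the sketch below. It is an assumption-laden illustration, not the study's pipeline: `parse_robots` and `seal` are hypothetical names, the parser ignores Allow rules and wildcards, and hashing canonical JSON with SHA-256 is one plausible way to produce a reproducible digest rather than the confirmed method behind the published snapshot sha.

```python
import hashlib
import json
from collections import defaultdict


def parse_robots(text: str) -> dict[str, list[str]]:
    """Map each lowercased user-agent to the Disallow paths in its group."""
    rules: dict[str, list[str]] = defaultdict(list)
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a new User-agent after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field in ("disallow", "allow"):
            in_rules = True
            if field == "disallow" and value:  # empty Disallow allows all
                for agent in current_agents:
                    rules[agent].append(value)
    return dict(rules)


def seal(snapshot: dict) -> str:
    """Hash a canonical JSON form so the same data always yields the same digest."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical domain and file content, for illustration only.
snapshot = {"example.com": parse_robots("User-agent: Bytespider\nDisallow: /\n")}
print(snapshot)
print(seal(snapshot))
```

Sorting keys before hashing means the digest depends only on the parsed content, not on the order in which sites were fetched.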

Every figure in this report is a verbatim count drawn from that sealed snapshot. Nothing is estimated, modeled, or extrapolated. Operator block counts are deduplicated at the domain level so that a site blocking multiple user-agents from the same operator still counts as 1 toward the operator total. Headline-crawler scores represent how many of the 9 headline AI bots a given site restricts, which is a secondary signal of policy breadth, not a separate block count.
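The two counting rules above, domain-level deduplication and the headline-crawler score, can be sketched as follows. All site names and bot names here are hypothetical placeholders (the report does not list the 9 headline bots), and treating only a root-level Disallow as a block is a simplifying assumption.

```python
def blocks_site(rules: dict[str, list[str]], agent: str) -> bool:
    """True if the parsed rules disallow the site root for this agent."""
    return "/" in rules.get(agent.lower(), [])


def operator_block_count(sites: dict, operator_agents: list[str]) -> int:
    """Count each domain once, however many of the operator's agents it blocks."""
    return sum(
        any(blocks_site(rules, agent) for agent in operator_agents)
        for rules in sites.values()
    )


def headline_score(rules: dict[str, list[str]], headline_agents: list[str]) -> int:
    """Number of headline bots a single site blocks."""
    return sum(blocks_site(rules, agent) for agent in headline_agents)


# Hypothetical parsed snapshot: domain -> {user-agent: [disallowed paths]}.
sites = {
    "site-a.example": {"alphabot": ["/"], "alphabot-extended": ["/"], "betabot": ["/"]},
    "site-b.example": {"betabot": ["/"]},
    "site-c.example": {"alphabot": ["/archive"]},
}
operator_agents = ["AlphaBot", "AlphaBot-Extended"]   # one hypothetical operator
headline_agents = ["AlphaBot", "BetaBot", "GammaBot"]  # placeholder headline set

print(operator_block_count(sites, operator_agents))
print(headline_score(sites["site-a.example"], headline_agents))
```

In this toy data site-a blocks both of the operator's agents but contributes a single block, which is the behavior the deduplication rule describes.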

The 9 starred sites (8.4% of the 107-site corpus) are the most widely known properties in the corpus and serve as anchor points for cross-operator comparison. All 9 appear in Bytespider's named-blocker list, confirming that the highest-profile sites have acted on their AI-access policies.

Put This Data to Work

If you manage content strategy, SEO, or data-pipeline infrastructure, Bytespider's 34.6% block rate is a concrete input for decisions about AI visibility. Whether you are a RevOps lead wondering how this crawler affects your organic data moat, or a retrieval-pipeline engineer deciding whether to model competitor access policies, the underlying signal is the same: robots.txt posture toward specific operators is drifting, and it needs to be tracked systematically.

The practical automation is a scheduled robots.txt fetch-and-diff pipeline. US Tech Automations builds this kind of workflow: a nightly job that pulls /robots.txt from a defined site list, diffs the user-agent directives against the prior run, and fires a Slack or email alert when Bytespider's access status changes. That gives you a live signal rather than a point-in-time snapshot you have to rediscover manually.
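A fetch-and-diff job of the kind described above could be sketched as below. This is a minimal illustration using only the standard library, not the vendor's workflow: the function names are hypothetical, scheduling and Slack or email delivery are left out, and the demo runs on inline text so that nothing is fetched over the network.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots(domain: str, timeout: float = 10.0) -> str:
    """Download /robots.txt for a domain (defined here, not called in the demo)."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_blocked(robots_txt: str, agent: str = "Bytespider") -> bool:
    """True if the file disallows `agent` at the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


def diff_status(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, tuple[bool, bool]]:
    """Return {domain: (before, now)} for domains whose status changed."""
    changes = {}
    for domain, now in current.items():
        before = previous.get(domain)
        if before is not None and before != now:
            changes[domain] = (before, now)
    return changes


# Demo with hypothetical domains and inline robots.txt text.
last_night = {"site-a.example": False, "site-b.example": True}
tonight = {
    "site-a.example": is_blocked("User-agent: Bytespider\nDisallow: /\n"),
    "site-b.example": is_blocked("User-agent: Bytespider\nDisallow: /\n"),
}

for domain, (before, now) in diff_status(last_night, tonight).items():
    print(f"ALERT {domain}: Bytespider blocked {before} -> {now}")
```

Domains seen for the first time are skipped by `diff_status` on purpose, so a newly added site does not trigger a change alert on its first run.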

US Tech Automations also builds the intake layer — parsing the raw robots.txt text into structured JSON, deduplicating overlapping rules, and routing changes into your data warehouse or CRM. If your team needs to report on AI crawler access for compliance, competitive, or product reasons, that structured layer is what makes the data actionable.

For companies trying to understand the full landscape of AI crawler restrictions, see the Common Crawl CCBot overview for the corpus-level numbers.

Frequently Asked Questions

Q: Does blocking Bytespider actually stop it from crawling?

A: robots.txt is an honor-system standard. A site that lists User-agent: Bytespider with Disallow: / is stating a preference; compliant crawlers will respect it, but the file is not a technical firewall. This dataset measures stated intent, not enforcement.

Q: Is Bytespider a search crawler or an AI training crawler?

A: ByteDance operates Bytespider primarily for AI data collection. It is distinct from standard search crawlers that power organic rankings. Sites can block Bytespider without affecting their visibility on conventional search engines.

Q: Does blocking Bytespider hurt my site's SEO?

A: No. Standard search engines use separate user-agent strings for their indexing crawlers. Blocking Bytespider has no known effect on Google Search or any major SERP ranking.

Q: Why do News sites lead the block count?

A: News publishers have a large surface area of licensed or proprietary editorial content and are commercially motivated to prevent AI companies from training on their archives without licensing agreements. 13 of the 37 Bytespider blocks fall in the News category.

Q: How fresh is this data?

A: All figures are from a single point-in-time fetch on June 13, 2026, sealed under snapshot sha 741353c4304216ee. robots.txt files can change at any time; this report reflects only the state captured on that date.

Key Takeaways

37 of 107 sites with parseable robots.txt (34.6%) explicitly block Bytespider as of June 13, 2026.
ByteDance operates a single tracked crawler — Bytespider — so the operator and per-bot block counts are identical at 37.
News (13 sites) and Tech (9 sites) drive more than half of all Bytespider blocks; Entertainment adds 5 more.
BBC, Bloomberg, and USA Today each block 9 of 9 headline crawlers, making Bytespider one of many AI user-agents they exclude.
48 of the 107-site corpus block at least one AI crawler (44.9%), placing Bytespider's 34.6% rate below the "any AI" ceiling but above several smaller operators.
9 of the study's 10 content categories contain at least one Bytespider blocker, spanning Finance, Government, Travel, and beyond.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Who Blocks ByteDance's Bytespider? 37 of 107 Top Sites Do.” https://ustechautomations.com/resources/blog/who-blocks-bytedance-bytespider-2026

Sealed snapshot sha256: 741353c4304216ee

Machine-readable data: CSV · JSON · All research & methodology