Do Sports Sites Block AI Crawlers? Sealed robots.txt Data

Jun 13, 2026

4 of 10 Sports sites block at least one AI crawler.

Sports sites block AI crawlers at a 40% rate.

72 of 157 sites block at least one AI crawler across the corpus.

Key Takeaways

4 of 10 Sports sites block at least one AI crawler — every site had a parseable robots.txt.

Every Sports site checked in the June 2026 edition returned a parseable robots.txt file, a full-coverage result Sports shares with Food, Entertainment, Social, and Travel. All 10 sites had a file; 4 of those 10 block at least one named AI crawler.

40% of Sports sites with robots.txt block at least one AI crawler.

That 40% block rate places Sports just below the corpus-wide average of 45.9%. The category is neither a clear leader in restriction nor a clear open-access bloc. It splits 6 open to 4 blocking: four of the five official league and tour sites allow AI crawlers, while three of the five media properties, including the one subscription publisher, block.

Corpus-wide, 72 of 157 sites block any AI crawler — a 45.9% rate.

Sports sits modestly below that line. Given that sports content is highly time-sensitive and often tied to licensed data and media rights, the split posture is notable. The 6 sites that allow AI crawlers are not open for lack of a file: all 10 sites publish a parseable robots.txt, so each open configuration sits in a file the site maintains. Whether leaving AI crawlers unblocked was a deliberate choice is not something the snapshot records.

What the Data Covers

This report is one installment of the US Tech Automations Closing Web series, which examines how publishers across 16 content categories configure their robots.txt files with respect to AI crawlers.

The June 2026 EXPANDED edition checked 182 sites across 16 categories. Of those, 157 returned a parseable robots.txt file. The snapshot was sealed on June 13, 2026 with sha 9ceca3bdf0dfeaca — and nothing is estimated, modeled, or extrapolated. Every figure in this report is a verbatim count from a public robots.txt file read at that moment.

The methodology is deliberately narrow: we fetch the root-level robots.txt, parse it for named AI-crawler User-agent strings, and record whether any Disallow rule applies to at least one of the 9 crawlers tracked in this edition. We do not estimate traffic impact, infer licensing motivations, or speculate on broadcast-rights strategy. The data is what was found.
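As a rough illustration of that parse step, the Python sketch below scans a robots.txt body for groups that name one of the 9 tracked crawlers and records a block when such a group carries a non-empty Disallow. It is a minimal sketch, not the pipeline behind the sealed snapshot: the function name blocked_bots, the decision to ignore wildcard (User-agent: *) groups, and the sample file are assumptions made for illustration.

```python
# The 9 crawler names listed in this edition's bot table.
TRACKED_BOTS = [
    "CCBot", "ClaudeBot", "GPTBot", "Bytespider", "PerplexityBot",
    "Meta-ExternalAgent", "Applebot-Extended", "Google-Extended", "Amazonbot",
]

def blocked_bots(robots_txt, tracked=TRACKED_BOTS):
    """Return the tracked bots named in a group with a non-empty Disallow.

    Assumption: only groups that name a bot explicitly count; a wildcard
    group (User-agent: *) is ignored, and an empty Disallow is not a block.
    """
    by_lower = {name.lower(): name for name in tracked}
    blocked = set()
    agents = []        # user-agents of the group being read
    in_rules = False   # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:
                blocked.update(by_lower[a] for a in agents if a in by_lower)
    return blocked

# Hypothetical robots.txt body, not taken from any site in the report.
sample = """
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow:

User-agent: *
Disallow: /private/
"""
print(sorted(blocked_bots(sample)))  # ['CCBot', 'GPTBot']
```

Under this reading, a site is "blocking" when the returned set is non-empty. A stricter or looser rule (for example, counting wildcard groups) would change the counts, which is why the rule is stated as an assumption here.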

For Sports, the 10 sites span professional league official sites, independent sports media, subscription sports journalism, and fantasy sports operations. Critically, every one of these 10 sites returned a parseable robots.txt. Only 5 of the 16 categories in this edition reach full coverage (Food, Entertainment, Social, Sports, and Travel), and for Sports it means the 40% block figure is not skewed by missing data.

Site-by-Site Breakdown

The table below shows the Sports row as recorded in the sealed snapshot: sites checked, sites with a parseable robots.txt, sites blocking any AI crawler, and the resulting block rate.

Category	Sites Checked	With robots.txt	Blocking Any AI Crawler	Block Rate
Sports	10	10	4	40%

The 4 Blockers

bleacherreport.com is a major sports media property with a large library of editorial content spanning analysis, opinion, and news. Its decision to block AI crawlers is consistent with the commercial-media blocking pattern seen across the corpus — proprietary editorial inventory is treated as an asset to protect from AI training use.

cbssports.com is a high-traffic sports news and fantasy sports platform operated by a major broadcast and media company. The block configuration reflects both editorial content protection and, potentially, concern about how sports data and statistical content might be consumed at scale by AI systems.

nba.com is the official website of the National Basketball Association. Its presence in the blocking group is distinctive: this is a professional league restricting AI crawlers on its own official digital property. As a rights-holder for broadcast, data, and content, the NBA has a complex relationship with how its property appears in AI-generated outputs. The block may reflect proactive rights management.

theathletic.com is a subscription sports journalism platform with a large archive of long-form reporting. Subscription publishers have a strong economic rationale for restricting AI training use — their content is explicitly paywalled for human readers, and unrestricted AI crawling of that content works against that model directly.

The Open Sites

nfl.com is the official site of the National Football League. Like nba.com, it is a professional league official property — yet it returned a parseable robots.txt without blocking any of the 9 tracked AI crawlers. The contrast with nba.com on this single dimension is a notable finding of the sealed data.

mlb.com is the official site of Major League Baseball. It similarly returns a parseable file without disallowing any tracked crawlers, placing it in the open group alongside nfl.com.

si.com (Sports Illustrated) is a legacy sports media brand with an extensive editorial archive. Its open configuration may reflect a different editorial strategy than theathletic.com, despite both being editorial sports publishers.

foxsports.com is a major broadcast-affiliated sports news and streaming platform. Its open robots.txt configuration for AI crawlers is consistent with a broad-access approach.

nhl.com is the official site of the National Hockey League, returning a parseable file without AI-crawler disallows — placing it in the same open group as nfl.com and mlb.com, and distinct from nba.com.

pgatour.com is the official site of the PGA Tour golf circuit. Its open configuration rounds out the 6 sites that do not disallow any of the 9 tracked AI crawlers in robots.txt.

How Sports Compares Across All 16 Categories

The table below shows all 16 categories from the sealed snapshot, sorted by block rate. Sports is tied with Social at 40%.

Category	Sites Checked	With robots.txt	Blocking Any	Block Rate
News	20	15	13	86.7%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Reference	14	11	6	54.5%
Automotive	10	9	4	44.4%
Social	10	10	4	40%
Sports	10	10	4	40%
Travel	9	9	3	33.3%
Legal	10	7	2	28.6%
RealEstate	10	7	2	28.6%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%

Sports sits in the middle of the 16-category distribution, just below the corpus-wide 45.9% average. Seven categories rank above it: News at 86.7%, Food at 70%, Tech at 69.2%, Entertainment and Healthcare at 66.7% each, Reference at 54.5%, and Automotive at 44.4%.

The categories at the bottom of the table (Finance at 18.2%, Retail at 16.7%, Education at 14.3%, and Government at 12.5%) are substantially more open. Sports is clearly not in that tier. But it is also not in the high-restriction tier where most sites are actively blocking.

The comparison with Healthcare is informative: Healthcare produces similarly sized libraries of consumer content and lands at 66.7%, while Sports lands at 40%. The differences within Sports between official league sites (mostly open) and independent media and subscription publishers (mostly blocking) map onto the broader corpus pattern where commercial editorial publishers block at higher rates than official or mission-driven publishers. See the Healthcare report for the contrast in detail.

For a category with a similar 10-site check and full robots.txt coverage, the Food category report shows a 70% block rate — substantially higher than Sports despite a comparable commercial media mix.

Which AI Crawlers Are Most Commonly Blocked — Across All 157 Sites

The bot and operator leaderboards below reflect counts across the full 157-site corpus, not Sports alone. They show which crawlers publishers are most likely to name in AI-specific Disallow directives.

Bot Name	Sites Blocking (of 157)	Block Rate
CCBot	58	36.9%
ClaudeBot	53	33.8%
GPTBot	45	28.7%
Bytespider	44	28%
PerplexityBot	42	26.8%
Meta-ExternalAgent	39	24.8%
Applebot-Extended	39	24.8%
Google-Extended	37	23.6%
Amazonbot	31	19.7%

CCBot leads with 58 sites, followed by ClaudeBot at 53 and GPTBot at 45. These three are the most-named crawlers across the corpus. Amazonbot, at 31, is the least-named of the 9 tracked crawlers — still a significant share of the corpus, but lagging the leaders.

The operator-level view of the same data:

Operator	Sites Blocking (of 157)
Common Crawl	58
Anthropic	55
OpenAI	47
Meta	45
ByteDance	44
Perplexity	42
Apple	39
Google	37
Cohere	36
Diffbot	36
Amazon	31
Mistral	15

Mistral is the least-blocked operator at 15 sites. Cohere and Diffbot — two operators with lower public profiles than the top five — are each blocked at 36 sites, suggesting that publishers adopting broad AI-crawler blocks tend to include a wide range of named bots rather than targeting only the most prominent operators.
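The two leaderboards can be thought of as tallies over per-site sets of blocked bots. The sketch below shows one way such a tally might be computed; it is an illustration under stated assumptions, not the report's own code. The operator mapping covers only the 9 bots in the bot table (the operator table also lists Cohere, Diffbot, and Mistral, whose user agents are not among those nine), and the input sites are hypothetical placeholders.

```python
from collections import Counter

# Operator for each of the 9 bots in the bot table above.
BOT_OPERATOR = {
    "CCBot": "Common Crawl",
    "ClaudeBot": "Anthropic",
    "GPTBot": "OpenAI",
    "Bytespider": "ByteDance",
    "PerplexityBot": "Perplexity",
    "Meta-ExternalAgent": "Meta",
    "Applebot-Extended": "Apple",
    "Google-Extended": "Google",
    "Amazonbot": "Amazon",
}

def leaderboards(blocked_by_site):
    """Count sites blocking each bot, and sites blocking each operator."""
    bot_counts, operator_counts = Counter(), Counter()
    for bots in blocked_by_site.values():
        bots = set(bots)
        bot_counts.update(bots)
        # A site counts once per operator, however many of its bots it blocks.
        operator_counts.update(
            {BOT_OPERATOR[b] for b in bots if b in BOT_OPERATOR}
        )
    return bot_counts, operator_counts

# Hypothetical input, not data from the sealed snapshot.
example = {
    "site-a.example": {"GPTBot", "CCBot", "ClaudeBot"},
    "site-b.example": {"CCBot"},
    "site-c.example": set(),
}
bot_counts, operator_counts = leaderboards(example)
print(bot_counts.most_common(1))  # [('CCBot', 2)]
```

Counting a site once per operator would also explain why an operator's total can exceed the count for any single bot it runs, though the report does not state how its operator figures were derived.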

Across all 157 sites in the corpus, CCBot is the most frequently blocked crawler — appearing in Disallow configurations at 58 sites.

Sports sites block AI crawlers at 40%, placing the category just below the corpus-wide rate of 45.9% and roughly in the middle of all 16 categories.

Frequently Asked Questions

Q: Why do 3 of the 4 major professional league sites allow AI crawlers while the NBA blocks?

A: The sealed data records only the configuration fact as of June 13, 2026. It does not record why nfl.com, mlb.com, and nhl.com are open while nba.com is blocking. All 4 are official league properties with complex rights portfolios. The divergence in robots.txt configuration is a real observed fact in the snapshot; the reasoning behind each choice is outside the scope of this report.

Q: Why does Sports have complete robots.txt coverage when many other categories do not?

A: Every one of the 10 Sports sites checked returned a parseable robots.txt file, something only 5 of the 16 categories in this edition achieved (Food, Entertainment, Social, and Travel are the others). The sealed data does not explain this. It may reflect the maturity of web operations teams at large sports media organizations, or the prevalence of platform-level robots.txt defaults in the CMS tools used in this vertical. What is certain is that the 40% block figure is not affected by missing data: all 10 sites contributed to the denominator.

Q: Does theathletic.com blocking AI crawlers mean its content cannot be used for AI training?

A: Robots.txt is an honor-system standard. A Disallow directive signals the site operator's preference, but does not technically prevent a crawler from fetching the page. The sealed data records that theathletic.com has a Disallow directive for at least one of the 9 tracked AI crawlers as of June 13, 2026. Legal enforceability of that signal is a separate question outside the scope of this report.
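What honoring the signal looks like on the crawler side can be sketched with Python's standard urllib.robotparser module. The check below runs inside the crawler, which is exactly why the standard is an honor system: nothing on the server enforces the answer. The helper name may_fetch, the example.com URLs, and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt, user_agent, url):
    """Ask whether robots.txt permits user_agent to fetch url.

    The answer is advisory: a crawler that skips this check can still
    request the page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules, not copied from any site in the report.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /account/
"""

print(may_fetch(rules, "GPTBot", "https://example.com/news/story"))        # False
print(may_fetch(rules, "SomeOtherBot", "https://example.com/news/story"))  # True
```

A well-behaved crawler would run a check like this before each request and skip any URL for which it returns False.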

Q: How is the 40% block rate calculated?

A: It is 4 blocking sites divided by 10 sites with a parseable robots.txt. Since all 10 Sports sites had parseable robots.txt files, the denominator equals the total sites checked. Nothing is estimated, modeled, or extrapolated — the 4 and the 10 are verbatim counts from the sealed snapshot.
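The same arithmetic can be checked against the 16-category table. The sketch below copies the three count columns from that table and recomputes the Sports rate and the corpus-wide rate; the helper name block_rate and the one-decimal rounding are assumptions chosen to match how the report prints its percentages.

```python
# (sites checked, with robots.txt, blocking any AI crawler), from the table.
CATEGORIES = {
    "News": (20, 15, 13), "Food": (10, 10, 7), "Tech": (15, 13, 9),
    "Entertainment": (9, 9, 6), "Healthcare": (10, 9, 6),
    "Reference": (14, 11, 6), "Automotive": (10, 9, 4),
    "Social": (10, 10, 4), "Sports": (10, 10, 4), "Travel": (9, 9, 3),
    "Legal": (10, 7, 2), "RealEstate": (10, 7, 2), "Finance": (12, 11, 2),
    "Retail": (15, 12, 2), "Education": (9, 7, 1), "Government": (9, 8, 1),
}

def block_rate(blocking, with_robots):
    """Percent of sites with a parseable robots.txt that block any AI crawler."""
    if with_robots == 0:
        raise ValueError("no parseable robots.txt files in this group")
    return round(100 * blocking / with_robots, 1)

# Column totals across all 16 categories.
checked = sum(row[0] for row in CATEGORIES.values())
with_robots = sum(row[1] for row in CATEGORIES.values())
blocking = sum(row[2] for row in CATEGORIES.values())

sports_rate = block_rate(*CATEGORIES["Sports"][1:][::-1])
corpus_rate = block_rate(blocking, with_robots)

print(checked, with_robots, blocking)  # 182 157 72
print(sports_rate, corpus_rate)        # 40.0 45.9
```

The denominator is always the number of sites with a parseable file, not the number checked; for Sports the two happen to be equal.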

Q: Could the Sports block rate change significantly in the near future?

A: The sealed snapshot is a point-in-time observation made on June 13, 2026. Robots.txt files can be updated at any time. The fact that nba.com is blocking while nfl.com and mlb.com are not suggests the configuration landscape is not settled — individual leagues and publishers are making independent decisions. A future edition could easily show divergent movement. Monitoring this category over time is the only way to track that drift.

Methodology Note

US Tech Automations Research fetched the robots.txt file at the canonical root path for each of the 10 Sports sites on June 13, 2026. Each file was parsed for User-agent strings matching the 9 AI crawlers in this edition. A site is recorded as "blocking" if any Disallow rule applies to at least one of those crawlers. The full corpus covered 182 sites across 16 categories; 157 returned a parseable file. Nothing is estimated, modeled, or extrapolated — counts are verbatim from the sealed snapshot (sha 9ceca3bdf0dfeaca).

For comparison with other consumer media categories, see Do Healthcare Sites Block AI Crawlers? at 66.7% and Do Legal Sites Block AI Crawlers? at 28.6%.

Put AI-Access Data to Work

The Sports category is unsettled: professional league sites are split, subscription media is blocking, and broadcast-affiliated properties are split as well (cbssports.com blocks, foxsports.com is open). That mixture means the configuration landscape can shift quickly. Three profiles get direct operational value from monitoring Sports robots.txt configurations on a recurring cadence:

An SEO or content strategist working in sports media needs to know when a competitor changes its AI-crawler stance. If nfl.com adds a disallow for GPTBot, or if si.com moves to block ClaudeBot, those are signals of a shifting competitive dynamic — one that affects how AI systems summarize and surface sports content. A weekly automated re-crawl of all 10 Sports sites, with an alert on any configuration change, gives the strategist visibility before the change becomes industry news.

A publisher RevOps or licensing lead at a sports data or media company tracks who is open vs. closed as a proxy for AI licensing posture. A league site moving from open to blocking can be an early sign that it intends to assert data licensing rights with AI companies. Catching that shift at the robots.txt level gives the business development team an early trigger, ahead of any formal announcement.

A retrieval or data engineer building a sports-domain knowledge base or RAG pipeline needs to maintain an accurate map of which sites are accessible under the robots.txt honor system. With 6 sites currently open and 4 blocking, the compliant surface covers the majority — but it is not guaranteed. An automated weekly check with a diff on each site removes the need for manual re-auditing and prevents a pipeline from operating on stale access assumptions.
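The per-site diff described in these three profiles could be as simple as comparing two snapshots of blocked-bot sets. The sketch below is one possible shape for that comparison, not a description of any existing product; the function name diff_blocked, the snapshot structure, and the placeholder domains are all hypothetical.

```python
def diff_blocked(previous, current):
    """Compare two snapshots mapping site -> set of blocked bot names.

    Returns only the sites whose blocked set changed, with the bots that
    were newly blocked and the bots that were unblocked.
    """
    changes = {}
    for site in sorted(previous.keys() | current.keys()):
        before = set(previous.get(site, ()))
        after = set(current.get(site, ()))
        if before != after:
            changes[site] = {
                "newly_blocked": after - before,
                "unblocked": before - after,
            }
    return changes

# Hypothetical snapshots, not data from the sealed report.
last_week = {
    "league.example": set(),
    "media.example": {"GPTBot", "CCBot"},
}
this_week = {
    "league.example": {"GPTBot"},
    "media.example": {"GPTBot", "CCBot"},
}

for site, change in diff_blocked(last_week, this_week).items():
    print(site, sorted(change["newly_blocked"]), sorted(change["unblocked"]))
# league.example ['GPTBot'] []
```

An alert would fire only for sites present in the returned mapping, so an unchanged week produces no output.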

US Tech Automations automates this class of monitoring — scheduled robots.txt fetches, per-site configuration diffs, and routed alerts — without requiring manual oversight. See how agentic workflows handle recurring access monitoring.

This snapshot of Sports sites is one slice of a wider dataset; read how many top websites block AI crawlers for the cross-industry view.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 9ceca3bdf0dfeaca).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Sports Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-sports-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 9ceca3bdf0dfeaca

Machine-readable data: CSV · JSON · All research & methodology