Do Running Sites Block AI Crawlers? 3 of 7 Do

Jun 14, 2026

A runner searching for shoe reviews and a large language model reading the same pages now arrive through different doors. One reads what it wants; the other, if its crawler is compliant, reads only what a site's robots.txt allows. We pulled that file from every running site we track and counted, by name, which ones tell AI crawlers to stay out.

A robots.txt is the plain-text file at a domain's root where a site lists which automated agents may fetch its pages. For running, the answer is split almost down the middle. Of the 9 Running sites we checked, 7 returned a parseable robots.txt, and 3 of those block at least one AI crawler. That is a 42.9% block rate (3 of 7), about ten percentage points above the 32.8% corpus rate but far from the gated-shut posture you see in News or Gaming.

3 of 7 Running sites block at least one AI crawler.

This report is built entirely on a sha256-sealed snapshot of public robots.txt files, frozen on 14 June 2026 (snapshot sha c60e706824d5d127). Every count below is a direct read of those files — nothing is estimated, modeled, or extrapolated.

Which Running Sites Gate the Crawlers

Three named sites carry the entire running block rate: letsrun.com, runningwarehouse.com, and podiumrunner.com each publish a robots.txt that disallows at least one AI user-agent. The mix is telling — an editorial forum, a gear retailer, and a race-and-training publication. There is no single "type" of running site that gates; the decision tracks the people running each one, not the niche.

On the open side, four sites returned a robots.txt and wave every crawler through: runnersworld.com, runrepeat.com, irunfar.com, and runkeeper.com. That group spans a legacy magazine, a review aggregator, an ultrarunning outlet, and a tracking app — all currently leaving their pages fully readable to AI systems.

Of the 7 Running sites with a published policy, 3 block at least one AI crawler and the rest allow every one.

Two further sites — marathonhandbook.com and athleticsweekly.com — returned no parseable robots.txt at all. A missing file is not a block: by the honor-system default, an absent robots.txt leaves a site open to any crawler that asks. For comparable open-default categories, see how tattoo sites land at the permissive end of the same snapshot.

| Running Site | AI Crawler Posture |
| --- | --- |
| letsrun.com | Blocks at least one AI crawler |
| runningwarehouse.com | Blocks at least one AI crawler |
| podiumrunner.com | Blocks at least one AI crawler |
| runnersworld.com | Allows all AI crawlers |
| runrepeat.com | Allows all AI crawlers |
| irunfar.com | Allows all AI crawlers |
| runkeeper.com | Allows all AI crawlers |

Where Running Sits Among Its Neighbors

Running's 42.9% rate is not an outlier; it sits in a tight cluster with Fashion and Surfing, all three reading 42.9%. Just above is a band of categories at 44.4% (Birding, Watches, HomeGarden, and Automotive), and just below sit Social, Sports, Fitness, Photography, and Genealogy at 40%. Running lands squarely among consumer-interest verticals, well above the corpus average and well below the gated press categories.

Running sites post a 42.9% AI-crawler block rate.

The focused window below shows running flanked by its nearest neighbors in the ranking. The story is continuity, not contrast: a runner's media diet sits in the same protective range as a surfer's or a fashion reader's.

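| Category | Sites With robots.txt | Block at Least One Crawler | Block Rate |
| --- | --- | --- | --- |
| Birding | 9 | 4 | 44.4% |
| Watches | 9 | 4 | 44.4% |
| HomeGarden | 9 | 4 | 44.4% |
| Automotive | 9 | 4 | 44.4% |
| Fashion | 7 | 3 | 42.9% |
| Running | 7 | 3 | 42.9% |
| Surfing | 7 | 3 | 42.9% |
| Social | 10 | 4 | 40% |
| Sports | 10 | 4 | 40% |
| Fitness | 10 | 4 | 40% |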

For the high end of the same ranking, Gaming tops every category at 88.9% and News follows at 82.4%; at the floor, Vinyl Record reads 0%. You can read the permissive extreme in our companion report on vinyl record sites.

What This Block Rate Actually Means

A 42.9% rate means a runner-facing AI assistant can read most of these sites but not all — and the gaps are not random. The blockers skew toward sites with proprietary value: a retailer's catalog, a forum's community archive, a publication's training and race content. The open sites tend to be brand-forward publishers and apps that benefit from being quoted back to users.

Corpus context frames it cleanly. Across the whole snapshot, 220 of 670 sites block at least one AI crawler — a 32.8% corpus rate. Running runs hotter than that baseline, which fits a vertical where original gear data and editorial reviews are the product.
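The arithmetic behind those two rates is simple enough to check by hand. Here is a minimal Python sketch, using only the counts quoted in this report. The function name `block_rate` is ours for illustration, and the denominator is the number of sites with a parseable robots.txt (7 for Running), not the number of sites checked (9).

```python
def block_rate(blockers: int, sites_with_policy: int) -> float:
    """Percent of sites with a parseable robots.txt that block at least one AI crawler."""
    if sites_with_policy <= 0:
        raise ValueError("need at least one site with a parseable robots.txt")
    return 100 * blockers / sites_with_policy


running = block_rate(3, 7)     # Running counts from this report
corpus = block_rate(220, 670)  # whole-snapshot counts from this report

print(f"Running: {running:.1f}%")
print(f"Corpus:  {corpus:.1f}%")
print(f"Gap:     {running - corpus:.1f} percentage points")
```

Keeping the two no-policy sites out of the denominator is what makes the Running figure 3 of 7 rather than 3 of 9.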

Corpus-wide, 220 of 670 sites block at least one AI crawler.

The composition of the running set matters as much as the headline. A retailer like runningwarehouse.com gates to keep its product catalog and pricing out of bulk training sets, where it could be repackaged without a click ever reaching the store. A forum like letsrun.com guards years of community discussion that is genuinely unique. And a publication like podiumrunner.com protects training and race editorial that an answer engine could summarize away.

Three different business models, one shared instinct: when the content is the asset, gate the crawler. Climbing carries that instinct further, gating a clear majority of its sites — compare our climbing sites report.

Across all 670 sites in the snapshot, 152 publish an llms.txt file — a 22.7% adoption rate for the newer AI-policy standard.

Who Gets Disallowed Most Across the Corpus

Blocking is rarely all-or-nothing. Sites that gate usually name specific crawlers, and a clear hierarchy emerges corpus-wide. Common Crawl's CCBot is the single most-disallowed agent, named by 162 sites — followed closely by Anthropic's and OpenAI's crawlers. A running site that decides to gate will most likely add these same tokens first.

The focused operator cut below counts disallows across all 670 sites, not just running.

| Operator | Sites That Disallow It (all 670 sites) |
| --- | --- |
| Common Crawl | 162 |
| Anthropic | 154 |
| OpenAI | 144 |
| Meta | 137 |
| ByteDance | 133 |

The running blockers — letsrun.com, runningwarehouse.com, podiumrunner.com — fit this pattern: where a running site draws a line, it tends to draw it against the same handful of operators leading the corpus.
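An operator tally like the one above can be sketched in a few lines of Python. Everything in this example is illustrative: the token-to-operator mapping is our assumption about how agents might be grouped (the report does not publish its list), and the three sites are made-up placeholders, not rows from the snapshot.

```python
from collections import Counter

# Assumed mapping from user-agent tokens to operators (illustrative only).
OPERATOR_BY_AGENT = {
    "ccbot": "Common Crawl",
    "claudebot": "Anthropic",
    "gptbot": "OpenAI",
    "meta-externalagent": "Meta",
    "bytespider": "ByteDance",
}


def operator_counts(blocked_by_site: dict[str, set[str]]) -> Counter:
    """Count, per operator, how many sites disallow at least one of its agents."""
    counts: Counter = Counter()
    for agents in blocked_by_site.values():
        # Build a set so a site counts once per operator, however many tokens it names.
        operators = {OPERATOR_BY_AGENT[a] for a in agents if a in OPERATOR_BY_AGENT}
        counts.update(operators)
    return counts


# Hypothetical toy data, not taken from the sealed snapshot.
toy = {
    "site-a.example": {"ccbot", "gptbot"},
    "site-b.example": {"ccbot"},
    "site-c.example": set(),
}
tally = operator_counts(toy)
for operator, sites in tally.most_common():
    print(operator, sites)
```

Counting sites rather than directives keeps a site that names the same operator under two tokens from being counted twice.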

A second, quieter standard is worth tracking alongside robots.txt. Across all 670 sites, 152 publish an llms.txt file — a 22.7% adoption rate for the newer convention that lets sites describe their content and terms specifically for large language models. Where robots.txt is a blunt allow-or-disallow instruction, llms.txt is closer to a statement of intent.

A running publisher that wants to be cited accurately, rather than simply blocked or scraped, may reach for llms.txt before it ever touches a disallow line. For now adoption is the minority posture, and most running sites express their stance — open or gated — through robots.txt alone.

How the Snapshot Was Sealed

We fetched the robots.txt file from each site's root, parsed every User-agent and Disallow directive, and matched the agents against a fixed list of known AI crawlers. A site counts as a blocker if it disallows even one. The full set was hashed into a single sha256 fingerprint, c60e706824d5d127, on 14 June 2026, so any figure here can be re-verified against the frozen file. For this category, nothing is estimated, modeled, or extrapolated — every count is a literal read.
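The parse-and-match step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production parser: the `AI_AGENTS` set is a short example list rather than the fixed list used for the report, a site is treated as disallowing an agent when a group naming that agent carries a non-empty Disallow path, and wildcard `User-agent: *` groups are not counted as naming an AI crawler.

```python
# Example tokens only; the report's fixed list is not reproduced here.
AI_AGENTS = {"gptbot", "ccbot", "claudebot", "bytespider", "meta-externalagent"}


def blocked_ai_agents(robots_txt: str) -> set[str]:
    """Return the AI agent tokens that appear in a group with a non-empty Disallow."""
    blocked: set[str] = set()
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                blocked.update(a for a in current_agents if a in AI_AGENTS)
    return blocked


sample = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin
"""
blocked = blocked_ai_agents(sample)
print(sorted(blocked), "blocker" if blocked else "open")
```

Under this sketch a site is a blocker as soon as the returned set is non-empty, which mirrors the even-one rule described above.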

Coverage caveats matter. Of the 9 Running sites, only 7 returned a parseable robots.txt; marathonhandbook.com and athleticsweekly.com returned none, and we report them as no-policy rather than folding them into either column. We do not infer intent from a missing file. US Tech Automations runs this collection the same way for every category in the snapshot.

A word on the definition that drives every count: a running site qualifies as a blocker the moment its robots.txt disallows even one recognized AI agent, no matter how many others it lets through. runningwarehouse.com gating a single operator and letsrun.com gating several both register identically as blockers, because the measure captures the binary decision to draw any line at all.

That is the question most readers actually have — has this site decided to gate AI access, yes or no. The finer breakdown of which named agents each site disallows is preserved verbatim inside the sealed file and can be rebuilt from the same frozen snapshot, so a more granular analysis never requires touching the live web again.
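Re-verifying a frozen snapshot before rebuilding anything from it might look like the Python sketch below. The serialization is an assumption on our part for illustration (a sorted JSON dump of domain-to-file-text pairs), as is treating the 16-character identifier quoted in this report as a prefix of the full sha256 digest. The domains and file contents are placeholders.

```python
import hashlib
import json


def seal(snapshot: dict[str, str]) -> str:
    """Hash a {domain: robots.txt text} mapping into one sha256 hex digest."""
    # Sorting keys makes the digest independent of insertion order (assumed canonical form).
    canonical = json.dumps(snapshot, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(snapshot: dict[str, str], expected_prefix: str) -> bool:
    """Check that the snapshot's digest starts with the published identifier."""
    return seal(snapshot).startswith(expected_prefix.lower())


# Placeholder data, not the sealed snapshot.
snapshot = {
    "forum.example": "User-agent: GPTBot\nDisallow: /\n",
    "shop.example": "User-agent: *\nAllow: /\n",
}
fingerprint = seal(snapshot)
print(fingerprint[:16])
print(verify(snapshot, fingerprint[:16]))
```

Any change to a single stored file, even one character, produces a different digest, which is what lets a reader confirm that a later analysis used the same frozen files.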

Key Takeaways

  • Of 7 Running sites with a parseable robots.txt, 3 block at least one AI crawler — a 42.9% block rate.

  • The named blockers are letsrun.com, runningwarehouse.com, and podiumrunner.com; the open sites include runnersworld.com and runkeeper.com.

  • Running runs above the 32.8% corpus rate, clustered with Fashion and Surfing at 42.9%.

  • Corpus-wide, 220 of 670 sites block at least one AI crawler, and CCBot is the most-disallowed agent at 162 sites.

  • Two Running sites returned no parseable robots.txt and are reported as no-policy, not as blockers.

Frequently Asked Questions

Q: Does blocking a crawler in a running site's robots.txt actually stop it?

A: Not by force. robots.txt is an honor-system standard: compliant AI crawlers read it and obey, but the file cannot technically prevent a fetch. When letsrun.com or runningwarehouse.com disallows an agent, it is a request that well-behaved operators honor — not a firewall.
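To make the honor system concrete, here is a small sketch of what a compliant crawler does before fetching, using Python's standard-library `urllib.robotparser`. The rules and the URL are invented for illustration and do not come from any site in this report.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; not copied from any real site.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
url = "https://example.com/reviews"

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler asks first and skips the fetch when the answer is False.
print("GPTBot allowed:", parser.can_fetch("GPTBot", url))
print("Other agent allowed:", parser.can_fetch("ExampleBrowser", url))

# With no rules at all, the default is open to every agent.
empty = RobotFileParser()
empty.parse([])
print("GPTBot allowed with no rules:", empty.can_fetch("GPTBot", url))
```

Nothing in this check stops a crawler that never runs it, which is why a disallow line is a request rather than a firewall.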

Q: Why do 3 of 7 Running sites block when so many running sites stay open?

A: The blockers — letsrun.com, runningwarehouse.com, podiumrunner.com — own assets worth gating: a community forum archive, a gear catalog, and race-and-training editorial. The open sites such as runnersworld.com and irunfar.com lean on being cited back to readers, so leaving crawlers in serves their reach.

Q: What does it mean that marathonhandbook.com and athleticsweekly.com had no robots.txt?

A: A missing file is not a block. By default, an absent robots.txt leaves a site fully readable to any crawler that asks. We report those two as no-policy because there is nothing sealed to read — we never infer a block from silence.

Q: Is running's 42.9% block rate high compared to other categories?

A: It is above the 32.8% corpus rate but middle-of-the-pack among consumer verticals. Running ties Fashion and Surfing at 42.9%, sits just under Birding and Automotive at 44.4%, and well below gated press categories like News at 82.4%.

Put AI-Access Data to Work

A running-shoe DTC growth lead should treat this as a weekly monitoring job: re-crawl runningwarehouse.com and the rest of the running set every week and alert the moment a competitor adds GPTBot or CCBot to its disallow list — a signal that a rival is pulling its catalog out of AI answers, opening a window to be the cited source instead.
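The alerting half of that job reduces to a set difference between two crawls. A minimal Python sketch follows, assuming each crawl has already been reduced to a mapping of site to blocked agent tokens. The function name, the watch list, and the sample data are all hypothetical.

```python
WATCHED = {"gptbot", "ccbot"}  # tokens worth an alert (illustrative choice)


def newly_blocked(
    previous: dict[str, set[str]], current: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Return, per site, the watched agents blocked now that were not blocked before."""
    changes: dict[str, set[str]] = {}
    for site, agents_now in current.items():
        added = (agents_now - previous.get(site, set())) & WATCHED
        if added:
            changes[site] = added
    return changes


# Hypothetical week-over-week data, not taken from the sealed snapshot.
last_week = {"rival-shop.example": set(), "rival-mag.example": {"ccbot"}}
this_week = {"rival-shop.example": {"gptbot"}, "rival-mag.example": {"ccbot"}}

for site, agents in newly_blocked(last_week, this_week).items():
    print(f"ALERT: {site} now disallows {', '.join(sorted(agents))}")
```

A site that appears for the first time in the current crawl is compared against an empty set, so any watched agent it blocks is reported as new.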

A running-media editorial ops manager can watch whether peers like letsrun.com tighten or loosen policy and decide whether their own archive stays open for citation reach. A retrieval-AI product manager building a fitness assistant needs the same feed to know which running sources it is permitted to index and which are disallowed today.

US Tech Automations automates exactly this monitoring — scheduled robots.txt and llms.txt crawls, change alerts, and an AI-access policy dashboard that flags drift the day it happens. See how that runs inside our agentic workflows platform.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha c60e706824d5d127).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Running Sites Block AI Crawlers? 3 of 7 Do.” https://ustechautomations.com/resources/blog/do-running-sites-block-ai-crawlers-2026

Sealed snapshot sha256: c60e706824d5d127

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.