
Do Space Sites Block AI Crawlers? 2 of 8 Do

Jun 14, 2026

Most astronomy and spaceflight publishers leave their doors open to AI crawlers. Of the 9 Space sites we checked, 8 returned a parseable robots.txt, and only 2 of those disallow at least one AI user-agent, a 25% block rate. That puts the vertical below the corpus-wide 32.7% rate, and it makes Space one of the more open knowledge categories in this edition.

A robots.txt file is the plain-text rulebook a site publishes at its root to tell automated crawlers which paths they may fetch. We read each one literally, on June 14, 2026, and recorded only what the file says about AI user-agents. The headline is simple: in Space, the permissive posture is the norm, and the two sites that gate crawlers are the exceptions, not the rule.
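
As a rough illustration of what reading a file literally can look like, the sketch below uses Python's standard `urllib.robotparser` to ask, for a few AI user-agent tokens, whether the site root may be fetched. The sample file, the `AI_AGENTS` list, and the `blocked_ai_agents` helper are assumptions made for this example, not the parser behind the report, and treating "root path disallowed" as "blocked" is a simplification.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# A few AI user-agent tokens named in this report (not an exhaustive list).
AI_AGENTS = ["CCBot", "ClaudeBot", "GPTBot"]


def blocked_ai_agents(robots_text, agents=AI_AGENTS):
    """Return the agents that may not fetch the site root under this file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


blocked = blocked_ai_agents(SAMPLE_ROBOTS_TXT)
print(blocked)            # ['GPTBot']
print(len(blocked) >= 1)  # True: the file blocks at least one AI crawler
```

Here the sample file gates one crawler and leaves the others open, which is the "blocks at least one" case rather than a blanket ban.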

2 of 8 Space sites block at least one AI crawler.

Where Space Lands Among Knowledge Verticals

Space sits at a 25% block rate, below the corpus-wide figure. Across the whole snapshot, 177 of the 542 sites with a parseable robots.txt block at least one AI crawler, a 32.7% rate. Space comes in under that average, alongside other reference-leaning categories where free dissemination of information is part of the editorial mission.
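
The arithmetic behind both percentages is simple enough to sketch. The `block_rate` helper below is a hypothetical name used only for this example; the counts are the ones quoted in this report. Note that the denominator is the number of sites with a parseable robots.txt (8 for Space), not the number of sites checked (9).

```python
def block_rate(blockers, sites_with_policy):
    """Percentage of sites with a parseable robots.txt that block at least one AI crawler."""
    if sites_with_policy == 0:
        raise ValueError("no sites with a parseable robots.txt")
    return round(100 * blockers / sites_with_policy, 1)


space = block_rate(2, 8)       # 2 blockers among 8 Space sites with a policy
corpus = block_rate(177, 542)  # corpus-wide counts from this snapshot

print(space, corpus)             # 25.0 32.7
print(round(corpus - space, 1))  # 7.7 percentage points between Space and the corpus
```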

That is a meaningful read. Categories built on selling subscriptions or protecting proprietary feeds tend to gate crawlers aggressively; categories built on public outreach and education tend not to. Space, where much of the source material is publicly funded science, behaves like the latter.

Space posts a 25% AI-crawler block rate, below the corpus-wide 32.7%.

The highest-blocking categories tell the other side of the story. Gaming leads at 88.9%, News follows at 82.4%, and Food sits at 70%. Those are categories with strong commercial-content or paywall incentives. Space is nowhere near them.

Category	Sites checked	With parseable robots.txt	Block at least one AI crawler	Block rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%

Reading the Space Block Rate Against Its Neighbors

To see where Space really sits, the focused window below centers on Space and the categories nearest it in the block-rate ranking. Crafts and Interior Design tie with it at 25%; HR and Finance fall just below. This is the company Space keeps: verticals where most sites publish a policy and most of those policies allow crawlers.

Category	Sites checked	With parseable robots.txt	Block at least one AI crawler	Block rate
Beauty	10	6	2	33.3%
Agriculture	10	9	3	33.3%
Legal	10	7	2	28.6%
Pets	10	7	2	28.6%
Crafts	10	8	2	25%
Interior Design	4	4	1	25%
Space	9	8	2	25%
HR	10	9	2	22.2%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%

Notice how flat this band is: most of these categories carry the same small handful of blockers. Space is squarely in line with the categories around it, neither an outlier nor a holdout. If you work in astronomy media, the practical takeaway is that being permissive here is normal industry behavior, not an oversight.

The flatness is worth pressing on. When a band of unrelated categories all converge on roughly the same low block count, it usually means the gating decision is being driven by a site-level posture rather than an industry-wide one. A single cautious property here, a single subscription play there — but no coordinated category-wide retreat from AI crawlers.

That is exactly what we see in Space: the two blockers behave like individual editorial choices, not a sector-wide stance. For comparison with a similarly open hobby-and-knowledge vertical, our look at whether cycling sites gate AI crawlers shows the same pattern of a handful of blockers among mostly open publishers.

It also matters that the allowers in Space include some of the category's most-cited destinations. When a national space agency and several legacy magazines keep their doors open, the bulk of the category's authoritative content remains reachable to AI systems regardless of what the two blockers do.

In practice, an AI assistant answering a question about an upcoming meteor shower or a mission timeline still has access to the deepest, most trusted sources in the vertical. The 25% block rate, in other words, understates how open the high-value Space content actually is — the blocking is concentrated, not distributed across the sources that matter most.

Space sits below the corpus-wide 32.7% AI-crawler block rate.

Which Space Sites Gate the Crawlers, and Which Do Not

The two blockers in Space are space.com and earthsky.org. Both publish a robots.txt that disallows at least one AI user-agent. Everything else we checked with a policy leaves the door open.

The permissive group is the larger one: esa.int, skyandtelescope.com, astronomy.com, planetary.org, universetoday.com, and heavens-above.com all return a robots.txt that allows every AI crawler we track. One site, spaceweather.com, returned no parseable robots.txt at all — which means there is no published rule for a crawler to read, not that it has chosen to block anything.

Two Space sites — space.com and earthsky.org — disallow at least one AI user-agent.

The split is worth dwelling on. The allowers include a national space agency, several legacy astronomy magazines, and a nonprofit advocacy organization. Their incentives lean toward reach: being cited, summarized, and surfaced in AI answers extends their mission. The two blockers are commercial-news-style properties, which fits the broader pattern that ad- and subscription-driven publishers are the most likely to gate.

There is a second nuance hiding in the no-policy case. spaceweather.com returning no parseable robots.txt is not the same as an allow decision and not the same as a block. It is simply the absence of a published rule.

For a compliant AI crawler, the practical effect often resembles open access, because there is no directive to obey — but the site has made no affirmative statement either way. We record it exactly as what it is: no policy. Treating a missing file as a deliberate stance would be the kind of inference this sealed-data method is built to avoid, so we don't make it.
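
The three-way distinction can be made explicit in code. The sketch below encodes the nine Space sites with the statuses stated in this article and tallies them; the `Policy` enum and the dictionary layout are illustrative assumptions, not the report's actual data format. Keeping "no policy" as its own value means a missing file is never silently counted as an allow or a block.

```python
from collections import Counter
from enum import Enum


class Policy(Enum):
    BLOCKS_SOME = "blocks at least one AI user-agent"
    ALLOWS_ALL = "allows every tracked AI crawler"
    NO_POLICY = "no parseable robots.txt"


# Statuses as reported in this article; the encoding itself is illustrative.
SPACE_SITES = {
    "space.com": Policy.BLOCKS_SOME,
    "earthsky.org": Policy.BLOCKS_SOME,
    "esa.int": Policy.ALLOWS_ALL,
    "skyandtelescope.com": Policy.ALLOWS_ALL,
    "astronomy.com": Policy.ALLOWS_ALL,
    "planetary.org": Policy.ALLOWS_ALL,
    "universetoday.com": Policy.ALLOWS_ALL,
    "heavens-above.com": Policy.ALLOWS_ALL,
    "spaceweather.com": Policy.NO_POLICY,
}

counts = Counter(SPACE_SITES.values())
# Only sites with a published policy enter the block-rate denominator.
with_policy = counts[Policy.BLOCKS_SOME] + counts[Policy.ALLOWS_ALL]

print(counts[Policy.BLOCKS_SOME], with_policy, counts[Policy.NO_POLICY])  # 2 8 1
print(100 * counts[Policy.BLOCKS_SOME] / with_policy)                     # 25.0
```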

Read together, the Space breakdown is a small, legible map of the category's priorities. Two commercial properties disallow at least one AI user-agent; six mission- and hobby-oriented sites stay fully open; one publishes nothing at all. None of that requires interpretation beyond the file contents, and none of it depends on a number we computed rather than read. That is the point of the exercise: the posture of an entire vertical, reconstructed from eight plain-text files and a single missing one.

How the Snapshot Was Sealed

We collect each site's robots.txt directly, parse it for AI user-agent directives, and seal the result to a content hash so the figures cannot drift after the fact. For this edition, the snapshot covers 645 sites overall, of which 542 returned a parseable robots.txt, across 64 content categories. Every number in this report is a verbatim count from that sealed file — nothing is estimated, modeled, or extrapolated.
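
For readers unfamiliar with sealing data to a content hash, here is a minimal sketch of the general idea. It assumes SHA-256 over a canonical JSON encoding; the `seal` helper and the dictionary layout are hypothetical, and the report does not state how its own digest is derived.

```python
import hashlib
import json


def seal(snapshot):
    """Return a SHA-256 hex digest over a canonical JSON encoding of the snapshot."""
    # Sorted keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Counts taken from this article; the dictionary layout is invented for the example.
snapshot = {"Space": {"sites": 9, "with_robots_txt": 8, "block_at_least_one": 2}}
sealed = seal(snapshot)

# Any later change to a count produces a different digest.
tampered = {"Space": {"sites": 9, "with_robots_txt": 8, "block_at_least_one": 3}}

print(seal(snapshot) == sealed)  # True
print(seal(tampered) == sealed)  # False
```

Because the digest depends on every byte of the canonical encoding, republishing the digest alongside the figures lets a reader confirm the underlying file has not changed since it was sealed.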

A few definitions keep the reading honest. "Blocks at least one AI crawler" means the file disallows one or more AI user-agents; it does not mean the site blocks all of them. A site with no robots.txt is recorded as having no policy, not as a blocker. And robots.txt is an honor-system standard: a directive is a request, not an enforced firewall.

Across all 542 sites, the most-disallowed crawler is CCBot at 133 mentions (24.5%), followed by ClaudeBot at 114 (21%) and GPTBot at 108 (19.9%). Separately, 117 of 542 sites (21.6%) publish an llms.txt file. The table below shows the corpus-wide crawler picture.

Crawler	Sites disallowing	Share of 542
CCBot	133	24.5%
ClaudeBot	114	21%
GPTBot	108	19.9%
Bytespider	106	19.6%
Meta-ExternalAgent	94	17.3%
PerplexityBot	78	14.4%

These corpus-wide totals matter for Space because they show what the blockers in this category are most likely reaching for: the same broad-coverage crawlers — Common Crawl, Anthropic, OpenAI — that dominate disallow lists everywhere.
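
A leaderboard like the one above can be produced by counting, per crawler, how many sites disallow it. The sketch below shows one way to do that; the site names and per-site lists are invented placeholders, and `crawler_leaderboard` is a hypothetical helper. Only the crawler tokens come from the report.

```python
from collections import Counter

# Hypothetical per-site results: site name -> AI user-agents its robots.txt disallows.
DISALLOWED_BY_SITE = {
    "site-a.example": ["CCBot", "GPTBot"],
    "site-b.example": ["CCBot"],
    "site-c.example": [],
    "site-d.example": ["CCBot", "ClaudeBot", "GPTBot"],
}


def crawler_leaderboard(disallowed_by_site):
    """Return (crawler, sites disallowing, share of all sites in %) rows, most-blocked first."""
    total = len(disallowed_by_site)
    tally = Counter()
    for agents in disallowed_by_site.values():
        tally.update(set(agents))  # count each crawler at most once per site
    return [(name, n, round(100 * n / total, 1)) for name, n in tally.most_common()]


for row in crawler_leaderboard(DISALLOWED_BY_SITE):
    print(row)
# ('CCBot', 3, 75.0)
# ('GPTBot', 2, 50.0)
# ('ClaudeBot', 1, 25.0)
```

The share is computed against all sites in the input, matching the "Share of 542" column, so a site that disallows nothing still counts in the denominator.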

Frequently Asked Questions

Q: Does blocking a crawler in robots.txt actually stop it?

A: No. robots.txt is an honor-system standard. A compliant crawler reads the file and obeys it, but the directive is a request, not an enforced control. We report what each Space site declares, not what any crawler ultimately does.

Q: How many Space sites block AI crawlers in this snapshot?

A: Of 9 Space sites checked, 8 returned a parseable robots.txt and 2 of those — space.com and earthsky.org — disallow at least one AI user-agent. That is a 25% block rate within the category.

Q: Why is the Space block rate lower than the corpus average?

A: Corpus-wide, 177 of 542 sites block at least one crawler, a 32.7% rate. Space comes in at 25%. Many Space publishers are mission-driven — agencies, magazines, and nonprofits whose goal is wide dissemination — so gating crawlers runs against their incentives.

Q: What does it mean that spaceweather.com had no robots.txt?

A: It means there was no parseable policy file to read. We count it as having no published rule, not as a blocker. A missing robots.txt is silence, not a disallow directive.

Q: Which crawler is disallowed most across the whole snapshot?

A: CCBot, the Common Crawl agent, appears on the most disallow lists — 133 of 542 sites, or 24.5%. ClaudeBot and GPTBot follow at 114 and 108 sites respectively.

Key Takeaways

Space is an open vertical. Of 8 sites with a published policy, 2 block at least one AI crawler, and 6 allow every crawler we track. The 25% block rate sits below the corpus-wide 32.7% line, and the category clusters with other knowledge- and service-leaning verticals rather than with the heavily gated commercial categories.

Corpus-wide, 177 of 542 sites block at least one AI crawler.

For anyone tracking AI access in astronomy and spaceflight media, the signal is that permissive is normal — and that the interesting question is not who allows, but which of the two blockers changes its policy next. To see how Space compares with adjacent open verticals, read our companion reports on whether cannabis publishers gate AI crawlers and where comics sites land on AI access.

Put AI-Access Data to Work

This report is a point-in-time snapshot; the value is detecting drift from it. Three buyers can act on these sealed figures.

A space-media editorial operations lead, someone running the CMS at an astronomy publisher like astronomy.com or planetary.org, should monitor whether category peers space.com or earthsky.org tighten their policies. Re-crawling weekly, with an alert the moment a new AI user-agent token lands in a competitor's disallow list, catches that change early, because it signals a shifting industry norm worth a board conversation.

An AI-retrieval product manager building a science-answer feature should track which of the 8 Space domains remain crawlable, re-checking on a fixed weekly cadence so a newly added disallow on a key source like esa.int triggers a sourcing review before answer quality degrades. A competitive-intelligence analyst should watch the corpus-wide leaderboard (CCBot at 133, ClaudeBot at 114) to see when the dominant blocked crawlers shift.
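
The drift check these readers need boils down to a set difference between two crawls. The sketch below is one minimal way to express it; the `new_disallows` helper, the placeholder domains, and the sample lists are all hypothetical and do not describe any real site's policy.

```python
def new_disallows(previous, current):
    """Map each domain to the AI user-agents disallowed now but not in the previous crawl."""
    changes = {}
    for domain, agents_now in current.items():
        added = set(agents_now) - set(previous.get(domain, ()))
        if added:
            changes[domain] = sorted(added)
    return changes


# Hypothetical weekly crawls: domain -> disallowed AI user-agent tokens.
last_week = {"publisher-a.example": ["GPTBot"], "publisher-b.example": []}
this_week = {"publisher-a.example": ["GPTBot", "ClaudeBot"], "publisher-b.example": []}

for domain, added in new_disallows(last_week, this_week).items():
    print(f"ALERT {domain}: newly disallowed {', '.join(added)}")
# ALERT publisher-a.example: newly disallowed ClaudeBot
```

A domain that appears for the first time is compared against an empty set, so every token it disallows is reported as new.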

US Tech Automations runs that monitoring as scheduled robots.txt and llms.txt crawls with change alerts and an AI-access policy dashboard, so a token added to a disallow list becomes a routed notification instead of a manual audit. Automate AI-access monitoring with agentic workflows.

Zoom out: Space is just one vertical in a much larger picture — our cross-industry study measures how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha eb8a3956a17595bc).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Space Sites Block AI Crawlers? 2 of 8 Do.” https://ustechautomations.com/resources/blog/do-space-sites-block-ai-crawlers-2026

Sealed snapshot sha256: eb8a3956a17595bc
