Do Weather Sites Block AI Crawlers? Sealed robots.txt Data

Jun 14, 2026

Weather is an inherently public service. Meteorological data — forecasts, current conditions, radar loops — is routinely shared across platforms, aggregated into apps, and licensed to commercial operators.

Of the 10 Weather sites we checked, 6 returned a parseable robots.txt, and 2 of those explicitly block at least one AI crawler in the June 2026 Closing Web snapshot. That is a 33.3% rate, below the 46.6% corpus-wide rate but still a one-in-three showing in a sector you might expect to be uniformly open.

2 of 6 Weather sites block at least one AI crawler — a 33.3% block rate.

The more striking structural fact in this category is how few sites returned a robots.txt file at all. Of 10 Weather sites checked, only 6 had a parseable file — a notably thin coverage rate compared with categories where eight, nine, or ten of ten sites publish one. That means the effective base for the blocking count is small, and the behavior of the six is not necessarily representative of the category as a whole.
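The coverage rate and the block rate use different denominators, which is easy to lose track of. A minimal Python sketch of the arithmetic, using only the counts quoted in this report (the function and variable names are illustrative and not part of the study's tooling):

```python
def pct(numerator: int, denominator: int) -> float:
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

sites_checked = 10     # Weather sites in the panel
with_robots_txt = 6    # sites that returned a parseable robots.txt
blocking_any_ai = 2    # sites that block at least one tracked AI crawler

coverage_rate = pct(with_robots_txt, sites_checked)    # share of the panel with a file
block_rate = pct(blocking_any_ai, with_robots_txt)     # the report's headline rate
block_rate_all = pct(blocking_any_ai, sites_checked)   # same count over the full panel

print(coverage_rate, block_rate, block_rate_all)
```

The headline 33.3% is conditional on a parseable file being present; measured against all 10 sites checked, the same two blockers are 20% of the panel.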

This report uses only data verbatim from the sealed snapshot. To be explicit, nothing is estimated, modeled, or extrapolated. Every figure comes directly from public robots.txt files sealed June 14, 2026 under snapshot sha 834f1e2f07af24fd.

The Two Blockers — and What They Signal

The 2 Weather sites that block at least one AI crawler in this snapshot are weather.com and timeanddate.com. These are meaningfully different businesses.

weather.com is the flagship consumer weather destination in the United States and a property with substantial editorial production: original video, severe-weather coverage, lifestyle content linked to weather patterns, and extensive branded digital advertising. Its blocking posture aligns with the same pattern seen across large editorial web properties — original content production, a brand valuable enough to protect, and a commercial model where AI training on its content without compensation or attribution creates a tension with revenue.

timeanddate.com is a reference utility with a different profile — time zone data, sun and moon timing, historical weather records, calendar tools. Its appeal to AI systems specifically is high: structured reference data (sunrise times, UTC offsets, date calculators) is exactly the kind of factual content that feeds well into retrieval-augmented systems. A site that produces dense structured reference data has a rational basis for limiting AI access to it, since structured data extraction is often the primary value an AI system derives from a crawl.

"weather.com and timeanddate.com — 2 of 6 Weather sites with parseable robots.txt files — block at least one AI crawler as of June 14, 2026."

The 4 sites with parseable robots.txt files that allow all tracked AI crawlers are wunderground.com, weatherbug.com, foreca.com, and meteoblue.com. These properties range from the citizen-weather-station community (Weather Underground) to commercial forecast APIs (Meteoblue, Foreca) to a consumer ad-supported app (WeatherBug). For platforms built around distributing forecast data as widely as possible — including through API licensing and data partnerships — appearing in AI-generated results is an extension of the distribution model, not a threat to it.

The Significance of Four Sites With No robots.txt

The 4 sites that returned no parseable robots.txt file (accuweather.com, windy.com, ventusky.com, and noaa.gov) form a notable group. AccuWeather is one of the largest independent weather services globally, with a commercial media operation and enterprise forecast licensing as its main revenue streams. Its absence from the robots.txt pool does not mean it has no AI-access preferences; it means those preferences were not published in a parseable robots.txt as of June 14, 2026. A sophisticated commercial operator like AccuWeather may rely on other access control mechanisms, or may simply not yet have addressed AI crawlers in its public crawling policy.

noaa.gov is particularly interesting. NOAA (the National Oceanic and Atmospheric Administration) is a federal government agency whose data is, by statute, in the public domain. A U.S. government agency has limited ability under law to restrict reuse of its outputs, which may explain why noaa.gov has not published a robots.txt with AI-crawler restrictions. The absence of a file from noaa.gov is arguably consistent with its public-domain mandate.

The Weather category had only 6 of 10 sites return a parseable robots.txt — a thin base for the blocking count.

How All 24 Categories Stack Up

Category	Sites Checked	With robots.txt	Blocking Any AI	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Reference	14	11	6	54.5%
Science	10	10	5	50%
Automotive	10	9	4	44.4%
Home & Garden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40%
Sports	10	10	4	40%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Legal	10	7	2	28.6%
Real Estate	10	7	2	28.6%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Nonprofit	10	6	0	0%

Weather ties with Travel at 33.3%, sitting in the lower-middle band of the distribution. The categories directly above (Jobs at 37.5%, Social and Sports at 40%) are also sectors where distribution-first incentives compete with content-protection rationales. Weather belongs in that cohort. Categories far below Weather (Finance at 18.2%, Government at 12.5%, Nonprofit at 0%) either have commercial incentives toward openness or public-service missions that make blocking incongruous. Higher up the table, the Science sites report documents a 50% block rate among research publishers, a useful contrast to Weather's more open posture.

"Across all 223 corpus sites with parseable robots.txt files, 104 block at least one AI crawler — a 46.6% rate. Weather sits below that line at 33.3%."
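The corpus-wide figure can be reproduced by pooling the category table rather than averaging the category rates. The sketch below copies the "With robots.txt" and "Blocking Any AI" columns from the table above; the variable names are illustrative.

```python
# (category, sites with a parseable robots.txt, sites blocking any AI crawler)
ROWS = [
    ("Gaming", 9, 8), ("News", 17, 14), ("Food", 10, 7), ("Tech", 13, 9),
    ("Entertainment", 9, 6), ("Healthcare", 9, 6), ("Music", 9, 6),
    ("Reference", 11, 6), ("Science", 10, 5), ("Automotive", 9, 4),
    ("Home & Garden", 9, 4), ("Fashion", 7, 3), ("Social", 10, 4),
    ("Sports", 10, 4), ("Jobs", 8, 3), ("Travel", 9, 3), ("Weather", 6, 2),
    ("Legal", 7, 2), ("Real Estate", 7, 2), ("Finance", 11, 2),
    ("Retail", 12, 2), ("Education", 7, 1), ("Government", 8, 1),
    ("Nonprofit", 6, 0),
]

# Pool the counts across all categories, then take one ratio.
total_with_file = sum(n for _, n, _ in ROWS)
total_blocking = sum(b for _, _, b in ROWS)
corpus_rate = round(100 * total_blocking / total_with_file, 1)

# Per-category rates, and the categories that fall below the pooled rate.
rates = {name: 100 * b / n for name, n, b in ROWS}
below_corpus = [name for name, rate in rates.items() if rate < corpus_rate]

print(total_with_file, total_blocking, corpus_rate)
print(below_corpus)
```

Pooling weights each category by how many of its sites had a parseable file, so a large category such as News (17 sites) moves the corpus-wide rate more than a small one such as Weather (6 sites).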

Corpus-Wide Bot and Operator Blocking

The tables below show which individual bots and which operators face the most blocking, measured across all 223 corpus sites with parseable robots.txt files.

Bot	Sites Blocking It (of 223)	Block Rate
CCBot	85	38.1%
ClaudeBot	74	33.2%
Bytespider	69	30.9%
GPTBot	64	28.7%
Meta-ExternalAgent	63	28.3%
PerplexityBot	60	26.9%
Applebot-Extended	60	26.9%
Google-Extended	57	25.6%
Amazonbot	50	22.4%

CCBot tops the corpus at 85 blocks, followed by ClaudeBot at 74. The spread across bots (CCBot at 38.1%, Amazonbot at 22.4%) likely reflects both the age and name recognition of each crawler and how aggressively site operators have responded to them individually. For Weather sites specifically, the blocks at weather.com and timeanddate.com suggest a deliberate policy choice; most of the category did not make the same one.

Operator	Sites Blocking Them (of 223)
Common Crawl	85
Anthropic	80
Meta	73
ByteDance	69
OpenAI	66
Perplexity	60
Apple	60
Google	57
Cohere	56
Diffbot	55
Amazon	50
Mistral	21

Anthropic faces 80 blocks across 223 corpus sites — second only to Common Crawl at 85.
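Operator counts can exceed per-bot counts (Anthropic at 80 versus ClaudeBot at 74). The report does not spell out how bots roll up to operators. One plausible reading, offered here strictly as an assumption, is that a site counts against an operator if it blocks any one of that operator's user-agent tokens. The mapping and the site data below are hypothetical and chosen only to illustrate the counting rule.

```python
# Hypothetical bot-to-operator mapping; the report does not publish its own.
OPERATOR_BOTS = {
    "Anthropic": {"ClaudeBot", "anthropic-ai"},
    "Common Crawl": {"CCBot"},
}

# Hypothetical per-site results: user-agent tokens each site disallows.
site_blocks = {
    "site-a.example": {"ClaudeBot", "CCBot"},
    "site-b.example": {"anthropic-ai"},  # blocks the operator, but not ClaudeBot
    "site-c.example": {"CCBot"},
}

def sites_blocking_bot(bot: str) -> int:
    """Count sites that disallow one specific user-agent token."""
    return sum(1 for blocked in site_blocks.values() if bot in blocked)

def sites_blocking_operator(operator: str) -> int:
    """Count sites that disallow at least one of the operator's tokens."""
    bots = OPERATOR_BOTS[operator]
    return sum(1 for blocked in site_blocks.values() if blocked & bots)

print(sites_blocking_bot("ClaudeBot"), sites_blocking_operator("Anthropic"))
print(sites_blocking_bot("CCBot"), sites_blocking_operator("Common Crawl"))
```

Under this reading, an operator with a single token has matching bot and operator counts, as Common Crawl and CCBot do at 85 each, while an operator with several tokens can show a higher operator count.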

Key Takeaways

2 of 6 Weather sites block at least one AI crawler — a 33.3% block rate.
The 2 blockers are weather.com (editorial and commercial) and timeanddate.com (structured reference data).
wunderground.com, weatherbug.com, foreca.com, and meteoblue.com allow all tracked AI crawlers.
accuweather.com, windy.com, ventusky.com, and noaa.gov returned no parseable robots.txt in this snapshot.
Weather sits below the corpus-wide 46.6% block rate, consistent with a distribution-first sector.
The thin robots.txt coverage (6 of 10 sites) is itself a notable finding: Weather is tied with Nonprofit for the lowest coverage rate of any category.
39 of 223 sites across the corpus have deployed llms.txt — a 17.5% adoption rate for that newer standard.

Frequently Asked Questions

Q: Why would weather.com block AI crawlers when weather data seems inherently distributable?

A: weather.com is not only a data distributor — it is an editorial media brand with original video production, feature journalism, and a substantial digital advertising operation. The content AI systems crawl when they visit weather.com includes branded editorial, not just forecast numbers. From that perspective, the blocking logic resembles what you see at news publishers: original content production plus a commercial model that depends on traffic and ad exposure creates an incentive to control where that content goes. The structured weather data is a vehicle; the editorial and advertising product is the business.

Q: What does it mean that noaa.gov returned no robots.txt?

A: NOAA is a U.S. federal government agency, and data produced by federal agencies is generally in the public domain by federal statute. A government agency with a public-domain mandate has limited grounds to restrict reuse through robots.txt, and its absence from the robots.txt pool may reflect that constraint. It does not mean NOAA has no access policies; it means that as of June 14, 2026, those policies were not expressed in a robots.txt file.

Q: How is a sealed snapshot different from a real-time re-query of these sites?

A: A real-time re-query would return current state. A sealed snapshot captures a specific moment, hashes the result, and preserves it for reproducibility. The value of the sealed approach is verifiability: the sha 834f1e2f07af24fd uniquely identifies this exact captured state. Anyone who holds the same snapshot can independently verify every number in this report. A re-query tomorrow would produce different results if any site has changed its robots.txt. Tracking how the sealed snapshot from June 2026 differs from a future-edition snapshot is how the research program measures policy drift.
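The capture-hash-verify idea can be sketched in a few lines of Python. The report does not describe how its sha is computed, so the canonical-JSON serialization and the 16-character prefix (chosen to match the length of the published identifier) are assumptions, and `seal` is a hypothetical helper name.

```python
import hashlib
import json

def seal(snapshot: dict[str, str], prefix_len: int = 16) -> str:
    """Hash a {site: robots.txt text} mapping deterministically."""
    # Sorting keys makes the serialization independent of insertion order,
    # so identical content always produces an identical digest.
    canonical = json.dumps(snapshot, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:prefix_len]

# Illustrative data only, not taken from the sealed snapshot.
baseline = {"example.com": "User-agent: CCBot\nDisallow: /\n"}
same_copy = dict(baseline)
changed = {"example.com": "User-agent: CCBot\nAllow: /\n"}

print(seal(baseline) == seal(same_copy))  # identical content, identical seal
print(seal(baseline) == seal(changed))    # any edit changes the seal
```

Two parties holding the same captured files derive the same identifier, and a single changed directive yields a different one, which is what makes the sealed figures independently checkable.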

Q: Does the 33.3% block rate apply to every AI crawler?

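A: No. The 33.3% figure counts sites that block at least one tracked AI crawler, not sites that block every crawler. A site may block CCBot but allow GPTBot; it counts as one blocking site in both the category count and the CCBot bot-count, but does not inflate the GPTBot count.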

The per-bot leaderboard reflects how many sites in the full 223-site corpus block each individual crawler, which varies significantly across the 9 bots tracked, each with its own per-site blocking breakdown.
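The per-bot distinction can be checked with Python's standard-library robots.txt parser. This is a minimal sketch, not the study's method: it assumes "blocks" means the bot may not fetch the site root, and the sample file is invented for illustration. The bot names are the nine listed in the leaderboard above.

```python
from urllib.robotparser import RobotFileParser

TRACKED_BOTS = [
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Meta-ExternalAgent",
    "PerplexityBot", "Applebot-Extended", "Google-Extended", "Amazonbot",
]

def blocked_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the tracked bots that may not fetch `path` under this file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in TRACKED_BOTS if not parser.can_fetch(bot, path)]

# Invented example: one crawler is disallowed, everyone else is allowed.
sample = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = blocked_bots(sample)
site_counts_as_blocking = bool(blocked)  # "blocks at least one AI crawler"

print(blocked, site_counts_as_blocking)
```

A site like this sample would add one to the category's blocking count and one to the CCBot count, and nothing to the counts for the other eight bots.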

Q: What is the practical takeaway for AI product teams that use weather data?

A: The practical map from this snapshot is: weather.com and timeanddate.com carry honor-system signals against AI crawling; wunderground.com, weatherbug.com, foreca.com, and meteoblue.com do not. For retrieval pipelines ingesting weather content, compliant teams should respect the signals from the two blocking sites. The 4 sites with no robots.txt file (accuweather.com, windy.com, ventusky.com, noaa.gov) have not published explicit signals in either direction. Terms of service and licensing agreements apply regardless of what a robots.txt says. Compare the Jobs category report for a sector at 37.5% with similar distribution-vs-protection dynamics.

Put AI-Access Data to Work

Three workflows map directly to the Weather category's sealed results.

AI product leads and retrieval engineers building weather-dependent applications have a concrete access map from this snapshot: weather.com and timeanddate.com carry disallow signals; the other four sites in the parseable pool allow all tracked crawlers. A recurring workflow (re-crawl all 10 Weather sites weekly and diff against the June 14, 2026 baseline) flags within a week when a currently open site (like wunderground.com or meteoblue.com) adds an AI-crawler block, or when a site currently without a parseable robots.txt file (like accuweather.com) publishes one. The week you detect that change is the week your pipeline needs to route around it.
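The diff step of that workflow might look like the following Python sketch. The data shape (a set of blocked bots per site, with `None` for no parseable file) and the function name are assumptions; the baseline entries reflect what this report states for two sites, while the `current` values are a hypothetical future state, not observed data.

```python
from typing import Optional

Policy = dict[str, Optional[set[str]]]  # None means no parseable robots.txt

def diff_policies(baseline: Policy, current: Policy) -> list[str]:
    """Describe per-site changes between two crawls, in site order."""
    alerts = []
    for site in sorted(baseline.keys() | current.keys()):
        old, new = baseline.get(site), current.get(site)
        if old is None and new is not None:
            alerts.append(f"{site}: robots.txt published")
        elif old is not None and new is None:
            alerts.append(f"{site}: robots.txt no longer parseable")
        elif old is not None and new is not None:
            for bot in sorted(new - old):
                alerts.append(f"{site}: now blocks {bot}")
            for bot in sorted(old - new):
                alerts.append(f"{site}: no longer blocks {bot}")
    return alerts

# Baseline per this report: wunderground.com allows all tracked crawlers,
# accuweather.com returned no parseable file.
baseline = {"wunderground.com": set(), "accuweather.com": None}
# Hypothetical later crawl, for illustration only.
current = {"wunderground.com": {"GPTBot"}, "accuweather.com": set()}

alerts = diff_policies(baseline, current)
print(alerts)
```

Keeping "no file" distinct from "file that blocks nothing" matters here, because the report treats those two states as different signals.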

Publisher RevOps leads at commercial weather platforms can use the category-level 33.3% rate as a framing tool in licensing conversations. Of 6 sites with robots.txt files, 2 have staked out a blocking position. Tracking whether that share grows in Q3 and Q4 2026 — particularly after major AI-operator licensing deals or legal developments — gives a RevOps team a quantified read on sector momentum. The Nonprofit category report shows the extreme low end of the distribution (0%); the contrast illustrates how much variance exists across categories even within a single corpus.

SEO and content strategy leads at media companies that cover weather as a vertical should note that weather.com is blocking while many specialized forecast platforms are not. That split signals a divergence between editorial weather content (blocking) and data-utility weather content (allowing). For an SEO lead at a publisher with hybrid editorial-plus-data weather content, the right recurring job is a monthly audit of the full Weather panel: who has changed, in which direction, and what the trigger appears to be. That audit feeds directly into the publisher's own robots.txt review cycle.

US Tech Automations automates this monitoring: scheduled robots.txt crawls, per-site change-diff alerting, and a continuously-updated category policy dashboard across all 24 content categories — so policy shifts surface the same day they happen, without manual checking. Explore the agentic monitoring platform.

Curious how Weather sites compare across every vertical? Our flagship study tracks how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 834f1e2f07af24fd).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Weather Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-weather-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 834f1e2f07af24fd

Machine-readable data: CSV · JSON · All research & methodology