Do Home & Garden Sites Block AI Crawlers? Sealed robots.txt Data

Jun 14, 2026

The home and decorating web has a complicated relationship with AI crawlers. Home & Garden blocks at a 44.4% rate, just under the 46.6% corpus-wide rate.

The category sits just below that line, but the split is sharper at the site level: the blockers are large, media-rich editorial brands, while the allower side is dominated by community-driven platforms and specialized enthusiast properties.

4 of 9 Home & Garden sites block at least one AI crawler — a 44.4% block rate.

This report uses only data taken verbatim from the sealed snapshot. Nothing is estimated, modeled, or extrapolated. Every count below is a direct read from public robots.txt files captured on June 14, 2026 under snapshot sha 834f1e2f07af24fd, and every rate is a simple ratio of those counts.

Which Sites Are Blocking — and Which Are Not

The 4 Home & Garden blockers are thespruce.com, bhg.com, architecturaldigest.com, and gardeningknowhow.com. These are content-heavy editorial properties that publish original photography, step-by-step guides, and recipe-style how-to articles. From a purely structural standpoint, editorial content with licensed imagery and dense original prose sits in the same competitive position as news or food media, and those categories rank near the top of the corpus for blocking rates.

The 5 allower sites (houzz.com, apartmenttherapy.com, dwell.com, familyhandyman.com, and thisoldhouse.com) span a community platform (Houzz), lifestyle editorial (Apartment Therapy, Dwell), and trade-focused utility content (Family Handyman, This Old House). A platform like Houzz generates substantial commercial value from discovery and referral traffic; appearing in AI-surfaced results may complement rather than cannibalize that use case, which may explain why it appears in the allower column. hgtv.com returned no robots.txt file in this snapshot, so it falls into the no-robots group and is not counted in the blocking figures.

"4 of 9 Home & Garden sites with a parseable robots.txt blocked at least one AI crawler as of June 14, 2026: a 44.4% rate, just under the 46.6% corpus-wide rate."

The distinction between blockers and allowers within Home & Garden mirrors a pattern visible across the wider corpus: brands whose core business model is original editorial content tend to gate AI crawlers; brands whose model is aggregation, community, or utility content tend to allow them. The editorial properties (The Spruce, BHG, Architectural Digest) produce licensed photography and staff-written articles whose value lies in the combination of authority, design, and presentation — content that loses something if stripped to plain text.

Gardening Know How is more purely utilitarian in style, but it publishes a high volume of indexed how-to content and has apparently reached the same conclusion about gating AI crawlers.

CCBot — Common Crawl's training crawler — is blocked by 85 of 223 sites across the full corpus.

Methodology

The Closing Web research program, run by US Tech Automations, crawls the public robots.txt file of each site in a fixed panel of 260 properties. The snapshot is content-addressed: the files are captured verbatim, hashed, and stored under snapshot sha 834f1e2f07af24fd on June 14, 2026. No inference is applied to ambiguous directives; an AI crawler is counted as blocked only if the site's file contains a Disallow rule that applies to that crawler's user-agent, as spelled out in the Classify step below. Nothing is estimated, modeled, or extrapolated. The 260-site panel covers 24 content categories; this post covers the Home & Garden slice of 10 sites, 9 of which returned a robots.txt file.
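To make the idea of a content-addressed snapshot concrete, here is a minimal Python sketch of hashing a set of captured files into one short identifier. The function name `seal_snapshot`, the domain-sorted ordering, the length-prefixed encoding, and the truncation to 16 hex characters are all assumptions made for illustration; the report does not document the exact hashing procedure.

```python
import hashlib

def seal_snapshot(captures: dict[str, bytes]) -> str:
    """Return a short content hash over verbatim robots.txt captures.

    `captures` maps a domain to the raw bytes fetched for it (hypothetical shape).
    """
    digest = hashlib.sha256()
    # Sort by domain so the same set of captures always hashes identically.
    for domain in sorted(captures):
        body = captures[domain]
        # Length-prefix each field so adjacent fields cannot blur together.
        for field in (domain.encode("utf-8"), body):
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    # Assumption: the identifier is the first 16 hex characters of the digest.
    return digest.hexdigest()[:16]
```

Because the identifier depends only on the captured bytes, changing a single character in any file yields a different value, which is what lets a printed sha pin a report to one exact snapshot.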

How the data was produced:

Collect. Each of the 260 panel sites is fetched at https://{site}/robots.txt using a neutral user-agent that does not impersonate any crawler.
Parse. The file is parsed for user-agent blocks and Disallow/Allow directives. Wildcards and path-level rules are resolved per the robots exclusion protocol spec.
Classify. A fixed list of AI crawler user-agent strings is matched against each parsed rule set. A site is marked as blocking a given crawler if it carries a Disallow directive for that agent's user-agent string (or a wildcard that would catch it with no subsequent Allow override).
Seal. The full parsed output is content-hashed and recorded. The sha printed on this page (834f1e2f07af24fd) identifies this exact snapshot; the figures cannot change after sealing.
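The Parse and Classify steps above can be approximated with Python's standard-library parser. This is a sketch under stated assumptions, not the program's actual classifier: the crawler list is a subset taken from the tables in this report, `blocked_crawlers` is a hypothetical helper, and "blocked" is simplified to mean the crawler may not fetch the site root. Note also that `urllib.robotparser` applies the first matching rule in a group, which can differ from longest-match resolution on files that mix Allow and Disallow paths.

```python
from urllib.robotparser import RobotFileParser

# Subset of tracked crawler names, taken from the tables in this report.
AI_CRAWLERS = ["CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Google-Extended"]

def blocked_crawlers(robots_txt: str, crawlers=AI_CRAWLERS) -> list[str]:
    """Return the crawlers that may not fetch the site root under this file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumption: "blocked" means the root path "/" is disallowed for that agent.
    return [bot for bot in crawlers if not parser.can_fetch(bot, "/")]

# Invented sample file: two named crawlers are shut out, everyone else
# is only kept away from /admin/.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_crawlers(sample))  # ['CCBot', 'GPTBot']
```

A site would then count toward the "Blocking Any AI" column whenever the returned list is non-empty. A wildcard group with `Disallow: /` and no named groups would mark every tracked crawler as blocked, matching the wildcard case described in the Classify step.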

How All 24 Categories Compare

The table below covers all 24 categories in the June 2026 Closing Web edition. The Home & Garden row is at 44.4%, a position that sits in the middle third of the distribution — well below Gaming (88.9%) and News (82.4%) at the top, and above Fashion (42.9%), Social (40%), and several lower-blocking sectors at the bottom.

Category	Sites Checked	With robots.txt	Blocking Any AI	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Reference	14	11	6	54.5%
Science	10	10	5	50%
Automotive	10	9	4	44.4%
Home & Garden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40%
Sports	10	10	4	40%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Legal	10	7	2	28.6%
Real Estate	10	7	2	28.6%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Nonprofit	10	6	0	0%

Home & Garden ties with Automotive at the 44.4% mark, landing in the middle of the pack. The lowest-blocking categories (Finance, Retail, Education, Government, and Nonprofit) include sectors with either commercial incentives to stay visible in AI results (Retail, Finance) or public-service missions that make access restrictions philosophically awkward (Government, Nonprofit).
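The block rates in the table use the number of sites with a robots.txt file as the denominator, not the number of sites checked. The short Python sketch below reproduces a few rows under that reading; `block_rate` is a hypothetical helper name and the rounding to one decimal place is inferred from the table.

```python
def block_rate(blocking: int, with_robots: int) -> float:
    """Percentage of sites with a robots.txt that block at least one AI crawler."""
    if with_robots == 0:
        raise ValueError("no sites with a robots.txt file")
    return round(100 * blocking / with_robots, 1)

# (sites checked, with robots.txt, blocking any AI), copied from the table above
rows = {
    "Gaming": (9, 9, 8),
    "Automotive": (10, 9, 4),
    "Home & Garden": (10, 9, 4),
}
for category, (_checked, with_robots, blocking) in rows.items():
    print(category, block_rate(blocking, with_robots))

print(block_rate(104, 223))  # corpus-wide rate: 46.6
```

Using sites checked instead would put Home & Garden at 4 of 10, or 40%, which is why the report states the rate as 4 of 9.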

"Across all 223 sites with parseable robots.txt files in this edition, 104 block at least one AI crawler — a corpus-wide rate of 46.6%."

Which Bots Are Blocked Most Across the Full Corpus

The table below shows how often each tracked AI crawler appears in a disallow directive, measured across all 223 sites with parseable robots.txt files. These counts are corpus-wide, not category-specific.

Bot	Sites Blocking It (of 223)	Block Rate
CCBot	85	38.1%
ClaudeBot	74	33.2%
Bytespider	69	30.9%
GPTBot	64	28.7%
Meta-ExternalAgent	63	28.3%
PerplexityBot	60	26.9%
Applebot-Extended	60	26.9%
Google-Extended	57	25.6%
Amazonbot	50	22.4%

CCBot (Common Crawl's training crawler) leads with 85 blocks across 223 sites. ClaudeBot (74) and Bytespider (69) follow. These bot-level counts are not mutually exclusive: a site blocking both CCBot and GPTBot is counted once under each bot, so the per-bot counts sum to far more than the 104 sites that block at least one crawler. The variation between bots likely reflects a combination of how well-known the crawler is, how long it has existed, and how aggressively sites have chosen to respond to AI training pipelines versus retrieval-augmented systems.
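A small Python sketch shows how per-bot tallies relate to the count of sites blocking any crawler. The three sites and their blocked sets are invented for illustration and are not rows from the snapshot.

```python
from collections import Counter

# Hypothetical per-site results: domain -> set of blocked crawlers.
site_blocks = {
    "site-a.example": {"CCBot", "GPTBot"},
    "site-b.example": {"CCBot"},
    "site-c.example": set(),
}

# Each site contributes one count to every bot it blocks.
per_bot = Counter(bot for bots in site_blocks.values() for bot in bots)
# A site counts once here, no matter how many bots it blocks.
sites_blocking_any = sum(1 for bots in site_blocks.values() if bots)

print(per_bot["CCBot"], per_bot["GPTBot"])  # 2 1
print(sites_blocking_any)                   # 2
```

Here the per-bot counts sum to 3 while only 2 sites block anything, the same relationship that holds between the bot table and the 104-site figure.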

Operator-Level Blocks Across All 223 Sites

Different crawlers are sometimes grouped by the operator deploying them. The table below aggregates blocks by the company or project behind each tracked crawler, across the full corpus.

Operator	Sites Blocking Them (of 223)
Common Crawl	85
Anthropic	80
Meta	73
ByteDance	69
OpenAI	66
Perplexity	60
Apple	60
Google	57
Cohere	56
Diffbot	55
Amazon	50
Mistral	21

Common Crawl leads, plausibly because CCBot is one of the oldest and most widely known crawlers feeding AI training data; many webmaster guidance documents explicitly name it. Anthropic places second with 80 blocks: ClaudeBot (74) is among the most individually blocked bots, and some sites also carry additional Anthropic-specific directives. Mistral sits at 21, which may reflect the recency of its crawler or the smaller size of its current training effort.
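One way to read the operator table is as a per-site union over each operator's crawlers, which is why an operator total can exceed the count for its best-known bot. The Python sketch below assumes that reading. The mapping is illustrative only: `anthropic-ai` stands in for a second Anthropic agent name and is not listed in this report, and `operator_counts` is a hypothetical helper.

```python
# Assumed crawler-to-operator mapping, for illustration only.
OPERATOR_OF = {
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",  # illustrative second agent name
    "GPTBot": "OpenAI",
    "CCBot": "Common Crawl",
}

def operator_counts(site_blocks: dict[str, set[str]]) -> dict[str, int]:
    """Count sites blocking at least one crawler from each operator."""
    counts: dict[str, int] = {}
    for bots in site_blocks.values():
        # A set ensures a site counts once per operator,
        # however many of that operator's agents it blocks.
        operators = {OPERATOR_OF[b] for b in bots if b in OPERATOR_OF}
        for op in operators:
            counts[op] = counts.get(op, 0) + 1
    return counts

# Invented example data, not snapshot rows.
example = {
    "site-a.example": {"ClaudeBot", "anthropic-ai"},
    "site-b.example": {"anthropic-ai"},
    "site-c.example": {"GPTBot"},
}
print(operator_counts(example))
```

In this example Anthropic is counted for two sites even though only one of them names ClaudeBot, mirroring the gap between 80 operator-level blocks and 74 ClaudeBot blocks.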

Key Takeaways

4 of 9 Home & Garden sites block at least one AI crawler — a 44.4% block rate.
The blockers are editorial brands (thespruce.com, bhg.com, architecturaldigest.com, gardeningknowhow.com); the allowers are a community platform, lifestyle editorial sites, and trade-focused sites.
hgtv.com returned no robots.txt in this snapshot and is not counted in the blocking figures.
Home & Garden sits just below the corpus-wide rate of 46.6%, tying Automotive at 44.4%.
Across all 223 corpus sites, Common Crawl (85 blocks) is the most-targeted operator; CCBot is the most-targeted individual bot.
39 of 223 sites across the corpus have deployed an llms.txt file — a 17.5% adoption rate for that newer standard.
The 24-category landscape runs from Gaming at 88.9% down to Nonprofit at 0%, with most categories clustered toward the middle of that range.

Frequently Asked Questions

Q: Does blocking a crawler in robots.txt actually stop it?

A: Not by itself. robots.txt is an honor-system standard: it signals a site owner's preference, and compliant crawlers honor it voluntarily. A non-compliant crawler can ignore the file entirely. Blocking in robots.txt is an expression of intent and a legal-notice mechanism, not a technical barrier. Sites that want hard enforcement must use IP filtering, CAPTCHAs, or other server-side access controls.

Q: Why do some Home & Garden sites block AI crawlers while others with similar content do not?

A: The data does not reveal intent. What it shows is that the 4 blocking sites are all high-volume editorial publishers with original photography and staff-written content. A plausible reading is that those publishers see AI training as a use that does not compensate them for content production. The 5 allowing sites include a community platform (Houzz) and trade-utility publishers — they may have calculated that appearing in AI-generated answers is net-positive for discovery. Both positions are rational; the split reflects different business models, not different legal obligations.

Q: What does it mean that hgtv.com returned no robots.txt file?

A: A missing robots.txt means the site has not published explicit instructions for any crawler. Compliant crawlers interpret a missing file as permission to crawl everything. It does not mean the site has no crawling preferences — only that those preferences were not expressed in a robots.txt as of June 14, 2026. The snapshot records the absence as "no robots file" and excludes the site from blocking counts.

Q: What is llms.txt, and how is it different from robots.txt?

A: llms.txt is a proposed supplementary convention rather than an access-control standard: a Markdown file at the site root intended to give large language models a concise, structured guide to a site's content, whereas robots.txt covers crawler exclusion. As of this snapshot, 39 of 223 sites across the corpus have deployed an llms.txt file, a 17.5% adoption rate. It does not block anything; it is a signal. The two files can coexist, and a site may have both, either, or neither.

Q: How often does this data change?

A: robots.txt files can be updated at any time by site owners. This snapshot reflects the state on June 14, 2026. A site that currently allows AI crawlers may add a block tomorrow; a current blocker may loosen restrictions.

The sealed-data pipeline captures point-in-time state; ongoing tracking requires re-crawling the panel at a cadence matched to how often sites update their policies. For Home & Garden sites specifically, policy shifts may follow news events: a high-profile AI licensing deal or lawsuit can trigger a wave of updates across editorial publishers, though a single snapshot cannot confirm that pattern. See the Jobs category report for how a more fragmented category compares, and the Weather category report for the low end of the blocking spectrum.

Put AI-Access Data to Work

Three workflows apply directly to what this sealed snapshot shows.

SEO and content strategy leads at editorial home-decor publishers can use this data as a competitive benchmark. If the category's large editorial brands are blocking AI training crawlers, that is evidence the sector is treating AI training differently from retrieval (answer-engine indexing).

A concrete recurring workflow: re-crawl the 10 Home & Garden panel sites monthly and generate a change alert whenever a site flips from allow to block or vice versa. The alert triggers a policy review: does the shift reveal a new industry agreement, a licensing deal, or a unilateral decision that might affect your own competitive position?
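The change alert in that workflow reduces to comparing two snapshots of per-site state. Below is a minimal Python sketch, assuming each snapshot is a mapping from domain to whether the site blocks any AI crawler; `policy_flips` is a hypothetical helper, and the second snapshot is invented purely to show the output shape.

```python
def policy_flips(previous: dict[str, bool], current: dict[str, bool]) -> list[tuple[str, str]]:
    """Compare two snapshots (domain -> blocks any AI crawler) and list flips."""
    flips = []
    # Only compare domains present in both snapshots.
    for domain in sorted(previous.keys() & current.keys()):
        if previous[domain] != current[domain]:
            direction = "allow->block" if current[domain] else "block->allow"
            flips.append((domain, direction))
    return flips

# June state for two sites as reported here; the later snapshot is invented.
june = {"houzz.com": False, "thespruce.com": True}
hypothetical_next = {"houzz.com": True, "thespruce.com": True}

print(policy_flips(june, hypothetical_next))  # [('houzz.com', 'allow->block')]
```

Sites that appear in only one snapshot, for example because a robots.txt file disappeared, are skipped by this sketch and would need their own handling.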

Publishers can use the 44.4% block rate as a negotiating data point. Of 9 Home & Garden sites with robots.txt files, 4 have expressed an AI-access restriction — nearly half the category has staked out a position.

Tracking whether that share grows or shrinks over successive monthly snapshots gives a RevOps team a quantified read on whether the sector is moving toward or away from wholesale AI access restriction, which directly affects the terms on which a licensing program can be positioned. Check the Science category for comparison — a category with a 50% block rate and a different allower profile.

Retrieval and data pipeline engineers building Home & Garden content pipelines get a concrete access map from this snapshot: thespruce.com, bhg.com, architecturaldigest.com, and gardeningknowhow.com carry explicit disallow directives. A recurring job, scheduled weekly against the same 10 sites, can alert the moment a previously-allowed site (like houzz.com or dwell.com) adds an AI-crawler block, so the pipeline can route around or flag that source before it affects retrieval quality.

US Tech Automations automates exactly this monitoring loop: scheduled robots.txt crawls, change-diff alerts, and a dashboard that tracks per-site and per-category policy state over time — no manual checking required. See how the platform handles continuous AI-access monitoring.

Zoom out: Home & Garden is just one vertical in a much larger picture — our cross-industry study measures how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 834f1e2f07af24fd).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Home & Garden Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-home-garden-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 834f1e2f07af24fd
