Do Architecture Sites Block AI Crawlers? 3 of 8 Do

Jun 14, 2026

All 8 sites checked in the Architecture category returned a parseable robots.txt. Of those, 3 block at least one AI crawler, a 37.5% block rate that lands just above the 36% average across the 417 sites with a parseable robots.txt in the June 2026 snapshot. The 8-of-8 publication rate is the detail that stands out: Architecture is one of the few categories in the 493-site corpus where every checked site has a robots.txt on record.

A robots.txt file is a plain-text policy document placed at a site's root domain that tells automated clients — search crawlers, AI trainers, data scrapers — which paths are accessible. For the AI use case, operators may name specific crawlers like GPTBot, ClaudeBot, or CCBot in disallow rules. This report reads those rules verbatim from the sealed June 14, 2026 snapshot and reports counts exactly as found.
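
As a rough illustration of how a named-crawler rule behaves, the sketch below evaluates a made-up robots.txt with Python's standard-library `urllib.robotparser`. The policy text, the `is_allowed` helper, and the `example.com` URL are assumptions for demonstration only; they are not taken from any site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot and CCBot are shut out of the whole site,
# while every other client falls through to the permissive wildcard group.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if `token` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


for token in ("GPTBot", "CCBot", "ClaudeBot"):
    print(token, is_allowed(SAMPLE_ROBOTS_TXT, token, "https://example.com/projects/"))
```

Under this sample policy the two named tokens are refused and ClaudeBot is allowed, because it matches only the wildcard group.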

3 of 8 Architecture sites block at least one AI crawler.

Architecture sites post a 37.5% AI-crawler block rate.

Corpus-wide, 150 of 417 sites block at least one AI crawler.

Key Takeaways

  • 3 of 8 Architecture sites block at least one AI crawler — a 37.5% block rate.

  • All 8 Architecture sites checked returned a parseable robots.txt, an 8-of-8 publication rate.

  • designboom.com, archpaper.com, and core77.com are the 3 blocking sites.

  • Across the 417 sites with a parseable robots.txt, CCBot is blocked by 118 sites, the highest count of any single bot.

  • Architecture's 37.5% block rate sits just above the 36% average across those 417 sites.

Who Gates the Crawlers in Architecture

The 3 blocking sites are designboom.com, archpaper.com, and core77.com. Each is an editorial or media property — not a professional practice, a project database, or a software tool. That pattern is consistent with the broader corpus finding: content-production sites protect their editorial output more aggressively than service or community sites.

Designboom is a high-volume architecture and design magazine that publishes project coverage, interviews, and reviews. Its content is curated and original — the kind of material an AI training corpus would treat as high-signal text. Archpaper (The Architect's Newspaper) is a regional and national trade publication covering design, construction, and policy; it follows the media-vertical instinct toward blocking, consistent with News (82.4% block rate) being the second-highest category in the corpus. Core77 is a design media brand covering industrial design, architecture, and product culture; it sits in a similar editorial posture.

The 5 allowing sites (archdaily.com, dezeen.com, dribbble.com, design-milk.com, and arch2o.com) take a different approach. Archdaily is the largest architecture website by traffic, and its robots.txt disallows none of the 9 tracked AI crawlers; maximum discoverability, including in AI-summarized surfaces, aligns with its strategy as a reference platform. Dezeen is a global design magazine that has taken the opposite editorial stance from archpaper.com: it leaves the tracked AI crawlers unblocked despite being a premium editorial brand.

All 8 Architecture sites checked returned a parseable robots.txt file — no site in this category left AI crawlers without a stated policy.

The 3 Architecture blockers are all editorial or media properties — designboom.com, archpaper.com, and core77.com — while 5 platforms and reference sites allow all crawlers.

Dribbble is notable here: primarily a portfolio and design community platform, it is categorized under Architecture for this snapshot's purposes. It allows all AI crawlers — consistent with the platform logic that visibility drives user acquisition. Design-Milk and Arch2O round out the allowing set as curatorial platforms that depend on discoverability.

No Architecture site in this snapshot appears in the noRobotsSites array — every site published a policy, which itself is a signal of policy maturity in the vertical.

Where Architecture Fits in the 48-Category Block-Rate Ranking

Architecture's 37.5% rate places it in a three-way tie with Aviation and Jobs — a cluster of verticals just above the corpus average. The focused window below shows Architecture and its neighboring categories:

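| Category | Sites Checked | With robots.txt | Blocking Any AI Crawler | Block Rate |
| --- | --- | --- | --- | --- |
| Fashion | 9 | 7 | 3 | 42.9% |
| Jobs | 10 | 8 | 3 | 37.5% |
| Aviation | 10 | 8 | 3 | 37.5% |
| Architecture | 8 | 8 | 3 | 37.5% |
| Travel | 9 | 9 | 3 | 33.3% |
| Agriculture | 10 | 9 | 3 | 33.3% |
| Weather | 10 | 6 | 2 | 33.3% |
| Beauty | 10 | 6 | 2 | 33.3% |
| Legal | 10 | 7 | 2 | 28.6% |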

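Architecture has the smallest checked set in this window at 8 sites. The block rate, however, is computed over sites with a robots.txt, and on that basis Weather and Beauty (6 each) and Fashion and Legal (7 each) have smaller denominators than Architecture's 8, so a single blocker moves their percentages more. Just above Architecture, Fashion sits at 42.9%. Below it, Travel and Agriculture are at 33.3%.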
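
The rates in the window are consistent with dividing blockers by sites that published a robots.txt rather than by sites checked (Jobs is 3 of 8, not 3 of 10). A minimal sketch of that arithmetic, using rows copied from the table; the `block_rate` helper name is an assumption for illustration:

```python
# (category, sites checked, sites with robots.txt, sites blocking any AI crawler)
ROWS = [
    ("Fashion", 9, 7, 3),
    ("Jobs", 10, 8, 3),
    ("Architecture", 8, 8, 3),
    ("Weather", 10, 6, 2),
    ("Legal", 10, 7, 2),
]


def block_rate(blocking: int, with_robots: int) -> float:
    """Percentage of robots.txt-publishing sites that block at least one AI crawler."""
    return round(100 * blocking / with_robots, 1)


for name, checked, with_robots, blocking in ROWS:
    print(f"{name}: {blocking}/{with_robots} = {block_rate(blocking, with_robots)}%")
```

The same helper reproduces the corpus-wide figure: 150 of 417 rounds to 36.0%.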

For the full extremes of the distribution, a mini-table of the highest- and lowest-blocking categories:

| Category | Block Rate |
| --- | --- |
| Gaming | 88.9% |
| News | 82.4% |
| Food | 70% |
| Telecom | 0% |
| Banking | 0% |
| Energy | 0% |

Architecture lands in the middle of this range: not among the most defensive verticals, and nowhere near the open end. For contrast, Aviation shares Architecture's 37.5% block rate from a completely different mix of site types, while Energy shows what a 0% block rate looks like in a legacy-corporate sector.

The Operator Leaderboard Across All 417 Sites

The corpus-wide operator and bot counts below describe which AI operators and crawlers are most frequently disallowed. These figures cover all 417 sites with a parseable robots.txt in the snapshot — not Architecture alone:

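| AI Operator | Sites That Disallow It (of 417) |
| --- | --- |
| Common Crawl | 118 |
| Anthropic | 113 |
| OpenAI | 97 |
| Meta | 97 |
| ByteDance | 90 |
| Google | 81 |
| Apple | 81 |
| Perplexity | 76 |
| Cohere | 73 |
| Amazon | 70 |
| Diffbot | 68 |
| Mistral | 24 |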

Common Crawl (CCBot) leads by operator count at 118 sites. Anthropic is second at 113 — substantially ahead of OpenAI at 97. Mistral is blocked by just 24 sites, well below every other operator in this leaderboard.

The 9-bot leaderboard (named crawler tokens across all 417 sites):

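| Bot Token | Sites Disallowing It | Percentage of 417 |
| --- | --- | --- |
| CCBot | 118 | 28.3% |
| ClaudeBot | 104 | 24.9% |
| GPTBot | 93 | 22.3% |
| Bytespider | 90 | 21.6% |
| Meta-ExternalAgent | 84 | 20.1% |
| Applebot-Extended | 81 | 19.4% |
| Google-Extended | 81 | 19.4% |
| PerplexityBot | 75 | 18% |
| Amazonbot | 70 | 16.8% |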

ClaudeBot is disallowed by 104 sites — second only to CCBot. GPTBot sits at 93. The gap between the top bot (CCBot at 118) and the bottom (Amazonbot at 70) spans 48 sites, reflecting both awareness and adoption of bot-specific disallow syntax over time.
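
The operator and bot leaderboards count different things: Anthropic is disallowed by 113 sites, while its ClaudeBot token is disallowed by 104. One plausible reading, which the report does not confirm, is that a site counts against an operator when it disallows any token attributed to that operator. The sketch below shows how such a tally could diverge from per-token counts. The token-to-operator mapping and the three example sites are hypothetical.

```python
from collections import Counter

# Hypothetical mapping for illustration only; the report does not publish
# which tokens it attributes to each operator.
OPERATOR_TOKENS = {
    "Anthropic": {"ClaudeBot", "anthropic-ai"},
    "OpenAI": {"GPTBot", "ChatGPT-User"},
    "Common Crawl": {"CCBot"},
}

# Made-up sites, each mapped to the set of tokens its robots.txt disallows.
EXAMPLE_SITES = {
    "site-a.example": {"ClaudeBot", "GPTBot"},
    "site-b.example": {"anthropic-ai"},
    "site-c.example": {"CCBot", "GPTBot"},
}


def tally(sites: dict[str, set[str]]) -> tuple[Counter, Counter]:
    """Count sites per blocked token and sites per blocked operator."""
    token_counts: Counter = Counter()
    operator_counts: Counter = Counter()
    for blocked in sites.values():
        token_counts.update(blocked)
        for operator, tokens in OPERATOR_TOKENS.items():
            if tokens & blocked:  # any of the operator's tokens is disallowed
                operator_counts[operator] += 1
    return token_counts, operator_counts


token_counts, operator_counts = tally(EXAMPLE_SITES)
print("ClaudeBot sites:", token_counts["ClaudeBot"])
print("Anthropic sites:", operator_counts["Anthropic"])
```

In this toy data ClaudeBot is blocked by 1 site but Anthropic by 2, because one site names only a second Anthropic token.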

Reading the Sealed Numbers

This report contains only sealed, verbatim counts; nothing is estimated, modeled, or extrapolated. US Tech Automations Research fetched robots.txt files from public URLs on June 14, 2026, parsed disallow rules against a standardized 9-bot reference list, and sealed the results under snapshot sha c5960481aa465ad3. Every figure in this post is either a direct count of matching rules or a percentage derived from those counts.

The process in order:

  1. Fetch. Request robots.txt from each site's canonical root URL. Record HTTP status. Log sites with no file as noRobotsSites.

  2. Parse. Extract each user-agent block and its associated disallow paths. Match against the 9 named AI crawler tokens.

  3. Count. A site is classified as a blocker if at least one named AI crawler token appears in a disallow rule for any path, including the root.

  4. Seal. Hash the complete dataset with sha256 and publish the hash alongside the data. The record cannot be altered after the seal date.
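
The parse and count steps above can be sketched roughly as follows. This is a simplified reading of the stated rule, not the pipeline actually used for the snapshot. The function names are assumptions, and so are two choices: wildcard (`*`) groups are not counted as AI-crawler blocks, and an empty `Disallow:` is treated as blocking nothing.

```python
# The 9 named crawler tokens from the bot leaderboard.
AI_TOKENS = (
    "CCBot", "ClaudeBot", "GPTBot", "Bytespider", "Meta-ExternalAgent",
    "Applebot-Extended", "Google-Extended", "PerplexityBot", "Amazonbot",
)


def parse_groups(robots_txt: str) -> list[tuple[list[str], list[str]]]:
    """Split robots.txt text into (user-agents, non-empty disallow paths) groups."""
    groups = []
    agents: list[str] = []
    paths: list[str] = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                groups.append((agents, paths))
                agents, paths, seen_rule = [], [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value:  # empty Disallow blocks nothing
                paths.append(value)
    if agents:
        groups.append((agents, paths))
    return groups


def blocked_tokens(robots_txt: str) -> set[str]:
    """Named AI tokens that sit in a group with at least one Disallow path."""
    wanted = {token.lower(): token for token in AI_TOKENS}
    hits = set()
    for agents, paths in parse_groups(robots_txt):
        if not paths:
            continue
        for agent in agents:
            if agent.lower() in wanted:
                hits.add(wanted[agent.lower()])
    return hits


def is_blocker(robots_txt: str) -> bool:
    """A site is a blocker if at least one named AI token is disallowed."""
    return bool(blocked_tokens(robots_txt))
```

Returning the set of matched tokens rather than a bare boolean keeps the per-token detail available for later comparison between crawls.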

The Architecture category has 8 sites in the snapshot, all 8 with a parseable robots.txt. The noRobotsSites array is empty for this category.
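
For the seal step, one common approach is to serialize the dataset deterministically and hash the bytes with SHA-256. The sketch below assumes a canonical JSON serialization with sorted keys. The report does not specify how the dataset was serialized before hashing, so the record layout and the `seal` helper are illustrative assumptions.

```python
import hashlib
import json


def seal(records: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON form of `records`."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up records: site -> sorted list of disallowed AI tokens.
snapshot = {
    "site-a.example": ["ClaudeBot", "GPTBot"],
    "site-b.example": [],
}

digest = seal(snapshot)
print(digest)

# Any edit to the data after sealing produces a different digest.
tampered = {**snapshot, "site-b.example": ["CCBot"]}
print(seal(tampered) == digest)
```

Sorting keys before hashing means the digest depends only on the data, not on the order in which sites were inserted.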

Frequently Asked Questions

Q: Why do all 8 Architecture sites have a robots.txt when many categories have gaps?

A: Architecture sites in this snapshot skew toward large, professionally operated platforms and media brands that employ SEO staff — teams that maintain robots.txt files as a matter of standard practice. Categories with smaller, independently run sites tend to have more gaps. The 8-of-8 publication rate is a data point, not a claim about the entire Architecture web.

Q: Is designboom.com blocking all AI crawlers, or only some?

A: The sealed data tells us whether a site blocks at least one named AI crawler token — not whether every crawler is blocked. A site may have a nuanced policy: blocking GPTBot while allowing Google-Extended, for example. To know exactly which tokens a specific site blocks, read that site's robots.txt directly on the seal date, or use a monitoring tool that tracks per-token rules.

Q: How does a 37.5% block rate compare to what a typical media category looks like?

A: News, the most media-dominated category in the corpus, has an 82.4% block rate across 17 robots.txt-publishing sites. Architecture's 37.5% is much lower — suggesting that even among its editorial blockers, the category has not converged on blocking as a default. Open platforms like archdaily.com and dezeen.com have explicitly chosen permissive access, which pulls the average down.

Q: Does an architecture firm's website typically block AI crawlers?

A: No firm-level sites appear in this snapshot. The sites covered are media platforms, design magazines, and communities — not individual practice websites. The behavior of hundreds of individual architecture firm sites is outside this dataset's scope.

Q: What happens if a site adds a new AI crawler block tomorrow?

A: This snapshot captures the state on June 14, 2026 only. Any change after the seal date is invisible to this report. Continuous monitoring is required to detect future policy shifts. The value of this sealed dataset is establishing the baseline state — the reference point from which any drift becomes detectable.

Put AI-Access Data to Work

Three practitioner roles find immediate, recurring value in Architecture-category robots.txt data:

Content researchers and editorial leads at architecture, design, and media outlets need to know which authoritative sources in their field have closed or opened AI access since they last checked. If archpaper.com or designboom.com has added new bot-level disallow rules since a prior crawl, that changes which content surfaces in AI-generated design briefings and which sources get credited in model outputs.

The workflow: schedule a weekly re-crawl of a curated watchlist of Architecture and design sites, alert the editorial team when any previously open source adds an AI disallow rule, and adjust content strategy accordingly. You can see how a contrasting editorial-media vertical handles this question in marketing, which has a lower block rate than Architecture.

Data pipeline engineers building retrieval-augmented systems on architecture and design content must track whether their source list is still crawlable. A pipeline that ingested dezeen.com content six months ago may face a different policy environment today. The workflow: daily robots.txt re-fetch for every source in the pipeline, automated diff against the last known state, and a human-in-the-loop review triggered by any disallow addition. Cross-category context from cybersecurity shows how a vertical with a lower block rate handles the same monitoring question.
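
The diff step in that workflow can be sketched as a set comparison between two crawls. This is a minimal illustration, assuming each crawl has already been reduced to a mapping of site to disallowed AI tokens. The `new_disallows` name and the example domains are hypothetical.

```python
def new_disallows(
    previous: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Per site, the tokens disallowed now that were not disallowed before."""
    added = {}
    for site, tokens in current.items():
        fresh = tokens - previous.get(site, set())
        if fresh:
            added[site] = fresh
    return added


# Made-up crawl results for two days.
yesterday = {
    "source-one.example": set(),
    "source-two.example": {"CCBot"},
}
today = {
    "source-one.example": {"GPTBot"},
    "source-two.example": {"CCBot"},
    "source-three.example": set(),
}

for site, tokens in new_disallows(yesterday, today).items():
    print(f"review needed: {site} now disallows {sorted(tokens)}")
```

Only additions are reported here, since a new disallow rule is what triggers the human review. Removed rules could be detected by swapping the two arguments.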

AI product and strategy leads at design tools, generative architecture platforms, and construction-tech companies need to understand their training-data access landscape at the domain level. Which authoritative architecture sources currently allow training scrapes? Which have chosen the editorial-media path toward blocking? This sealed snapshot gives a point-in-time answer across 8 major Architecture domains; the recurring workflow is quarterly re-audit to detect any category-level trend. For comparison, agriculture sits at a slightly lower block rate and shows how a different professional vertical navigates the same tradeoff.

US Tech Automations automates the monitoring layer — scheduled robots.txt crawls, per-bot change detection, and an AI-access policy dashboard that surfaces shifts the moment they occur. Set up automated AI-access monitoring.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha c5960481aa465ad3).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Architecture Sites Block AI Crawlers? 3 of 8 Do.” https://ustechautomations.com/resources/blog/do-architecture-sites-block-ai-crawlers-2026

Sealed snapshot sha256: c5960481aa465ad3

Machine-readable data: CSV · JSON · All research & methodology

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.