Do Tech Sites Block AI Crawlers? Sealed robots.txt Data

Jun 13, 2026

Who This Is For

This report is for SEO directors, developer-relations leads, and content-strategy teams at technology publishers who need to understand how the Tech media sector governs AI crawler access. If your organization publishes technology journalism or operates a developer-community platform, the sealed figures below establish the peer baseline for June 2026.

TL;DR

9 of 13 Tech sites with a parseable robots.txt block at least one AI crawler — a 69.2% rate. Tech ranks second across all 10 content categories in this corpus, trailing only the News sector. The category is decisively above the corpus-wide baseline of 44.9%.

The Tech Category Finding

Of the 15 Tech sites checked, 13 returned a parseable robots.txt file. Of those 13, 9 block at least one AI crawler — a category block rate of 69.2%.

Four sites — github.com, engadget.com, hackernews.com, and slashdot.org — have robots.txt files that do not restrict any of the 21 tracked AI crawlers. Two sites — stackoverflow.com and producthunt.com — returned no parseable robots.txt at all.

The 9 blocking sites are: techcrunch.com, theverge.com, wired.com, arstechnica.com, cnet.com, zdnet.com, mashable.com, gizmodo.com, and venturebeat.com.

Three sites (github.com, engadget.com, and slashdot.org) also publish an llms.txt file. These are the only 3 Tech sites in this corpus taking that emerging structured-guidance approach. All three also appear in the allower list, illustrating a posture of access-with-structure rather than blanket restriction.

Methodology

US Tech Automations Research fetched the public robots.txt file for each of the 122 sites in this corpus on June 13, 2026, using standard HTTP GET requests. Each response was parsed for User-agent directives targeting any of 21 tracked AI crawler identifiers across 12 operators. A site is counted as "blocking" if at least one Disallow rule for at least one AI crawler token applies to a non-empty path.

Only responses with parseable content were counted. Sites whose response could not be parsed, whether because of the status code returned or the content itself, are categorized as "no robots.txt." The full snapshot is sealed under sha 741353c4304216ee. Nothing is estimated, modeled, or extrapolated: every count in this report is a verbatim figure from the sealed snapshot.

Parsing followed a straightforward token-match approach: each of the 21 crawler user-agent strings was checked against the User-agent directives in each robots.txt file. When a wildcard User-agent (*) rule is present alongside an AI-crawler-specific rule, the more specific rule takes precedence for that crawler. No interpretation of intent was applied — only explicit Disallow directives targeting a tracked crawler token were counted as blocks.
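The token-match rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for the sealed snapshot: the three tokens in `TRACKED_TOKENS` are an illustrative subset (the report tracks 21 but does not list them here), the function names are hypothetical, and group handling is simplified to User-agent, Allow, and Disallow lines.

```python
# Illustrative subset only; the report tracks 21 tokens that are not listed here.
TRACKED_TOKENS = {"gptbot", "ccbot", "claudebot"}


def parse_groups(robots_txt: str) -> dict[str, list[str]]:
    """Map each lowercased User-agent token to the Disallow paths in its group."""
    groups: dict[str, list[str]] = {}
    current: list[str] = []  # agents named by the group being read
    in_rules = False         # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow":
                for agent in current:
                    groups[agent].append(value)
    return groups


def blocked_tokens(robots_txt: str) -> set[str]:
    """Tracked tokens with at least one explicit Disallow on a non-empty path."""
    groups = parse_groups(robots_txt)
    return {
        token
        for token in TRACKED_TOKENS
        if any(path for path in groups.get(token, []))
    }


SAMPLE = """\
User-agent: *
Disallow: /private

User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow:
"""

print(sorted(blocked_tokens(SAMPLE)))  # ['ccbot', 'gptbot']
```

In this sketch a site would count as "blocking" when `blocked_tokens` returns a non-empty set. The wildcard group is never consulted, so a file that restricts only `*` is not counted, and the empty `Disallow:` under ClaudeBot is ignored because it names no path.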

Tech Category Summary Table

Metric	Value
Sites checked	15
Sites with parseable robots.txt	13
Sites blocking at least one AI crawler	9
Category block rate	69.2%
Sites with llms.txt	3
Sites with no robots.txt	2

Cross-Category Ranking

The Tech category ranks second among all 10 content categories measured. The table below shows the full ranking from the sealed June 2026 snapshot.

Rank	Category	Sites	With robots.txt	Any block	Block rate
1	News	20	17	14	82.4%
2	Tech	15	13	9	69.2%
3	Entertainment	9	9	6	66.7%
4	Reference	14	11	6	54.5%
5	Social	10	10	4	40.0%
6	Travel	9	9	3	33.3%
7	Finance	12	11	2	18.2%
8	Retail	15	12	2	16.7%
9	Education	9	7	1	14.3%
10	Government	9	8	1	12.5%

Tech at 69.2% sits well above the corpus-wide rate of 44.9% (48 of 107 sites). The gap between Tech and the third-ranked Entertainment category (66.7%) is small at about 2.5 percentage points, while the gap between Tech and the fourth-ranked Reference category (54.5%) is more substantial at about 14.7 points.
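The ranking and the corpus-wide rate can be recomputed directly from the table rows. The sketch below copies the counts from the table above; the variable and function names are illustrative, and the block rate is taken as blocking sites divided by sites with a parseable robots.txt, as the report defines it.

```python
# (category, sites checked, sites with parseable robots.txt, sites with any block)
ROWS = [
    ("News", 20, 17, 14),
    ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6),
    ("Reference", 14, 11, 6),
    ("Social", 10, 10, 4),
    ("Travel", 9, 9, 3),
    ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1),
    ("Government", 9, 8, 1),
]


def block_rate(blocking: int, parseable: int) -> float:
    """Share of parseable sites that block at least one AI crawler, in percent."""
    return round(100 * blocking / parseable, 1)


# Rank categories by block rate, highest first.
ranked = sorted(ROWS, key=lambda row: row[3] / row[2], reverse=True)

total_sites = sum(row[1] for row in ROWS)
total_parseable = sum(row[2] for row in ROWS)
total_blocking = sum(row[3] for row in ROWS)
corpus_rate = block_rate(total_blocking, total_parseable)

for rank, (name, _, parseable, blocking) in enumerate(ranked, start=1):
    print(rank, name, f"{block_rate(blocking, parseable)}%")
print(total_sites, total_parseable, total_blocking, f"{corpus_rate}%")
```

The totals come out to 122 sites checked, 107 with a parseable robots.txt, and 48 blocking, which reproduces the 44.9% corpus-wide rate and places Tech second.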

You can compare Tech against the News category report — the sector immediately above it in the ranking — and the Reference category report for a sector with a notably different split between blockers and allowers.

Most-Blocked Operators and Bots (Corpus-Wide, All 107 Sites)

The counts below are corpus-wide figures across all 107 sites — not Tech-specific. They identify which AI operators face the broadest restrictions across the full dataset.

Most-Blocked Operators (all 107 sites)

Operator	Sites blocking their crawlers
Common Crawl	40
Anthropic	39
ByteDance	37
OpenAI	35
Meta	35
Apple	31
Diffbot	30
Perplexity	29
Cohere	27
Google	25
Amazon	22
Mistral	12

Common Crawl leads with 40 of 107 sites restricting its crawler. Anthropic (39) and ByteDance (37) follow closely. OpenAI and Meta are tied at 35. Mistral at 12 sits at the tail. These are corpus-wide figures — a single operator may operate multiple crawler tokens, which are counted individually in the bot-level leaderboard.

Site-Level Analysis

The 9 blocking Tech sites are predominantly technology journalism outlets that depend on paywalls or advertising revenue from unique traffic. techcrunch.com, theverge.com, wired.com, arstechnica.com, and cnet.com represent established tech-media brands where editorial content is the primary business asset.

zdnet.com, gizmodo.com, and mashable.com represent a broader set of general-audience tech publications with advertising-driven models. venturebeat.com covers enterprise and AI news specifically — its blocking position creates a notable irony given its editorial focus.

The 4 Tech allowers present a contrasting profile. github.com is a platform for code and developer collaboration — content that is intentionally open and machine-readable. engadget.com, hackernews.com, and slashdot.org also have robots.txt files but have not restricted AI crawlers through that mechanism. github.com, engadget.com, and slashdot.org go further, publishing llms.txt files that provide structured guidance to AI systems.

stackoverflow.com and producthunt.com have no parseable robots.txt. stackoverflow.com hosts a large volume of community-contributed programming knowledge that has historically been treated as openly accessible, but the missing file is not a formal permission statement. It means only that no machine-readable crawl-governance signal was found at the standard location at the time of the snapshot; it does not establish an affirmative license for AI crawler access.

Automation Bridge

For developer-relations teams and tech-media strategy leads, the 69.2% blocking rate signals that roughly two-thirds of Tech sites with a parseable robots.txt (9 of 13) have taken a formal AI-access position. Tracking that position across the competitive landscape, and monitoring it for changes, requires automation.

US Tech Automations builds workflows that schedule, fetch, and parse robots.txt (and llms.txt) files at scale, then route change-detection alerts to SEO leads, legal, or operations teams. If your organization needs to monitor competitor or partner AI-access posture systematically, that is exactly the kind of pipeline US Tech Automations delivers.
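As a rough illustration of the change-detection step, the sketch below fingerprints each fetched robots.txt body and compares it with the fingerprint stored from a previous run. It is an assumption-laden sketch, not a description of any production pipeline: the function names, the whitespace normalization, and the example.com and example.org hosts are all hypothetical, and fetching and alert routing are left out.

```python
import hashlib


def fingerprint(robots_txt: str) -> str:
    """SHA-256 of the file with blank lines and edge whitespace removed."""
    normalized = "\n".join(
        line.strip() for line in robots_txt.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return sites whose current body does not match the stored fingerprint.

    previous maps site -> fingerprint from the last run;
    current maps site -> robots.txt body fetched in this run.
    Sites missing from previous are reported as changed.
    """
    return sorted(
        site
        for site, body in current.items()
        if previous.get(site) != fingerprint(body)
    )


# Hypothetical hosts and bodies, for illustration only.
OLD_BODY = "User-agent: *\nDisallow:\n"
NEW_BODY = OLD_BODY + "\nUser-agent: GPTBot\nDisallow: /\n"

previous_run = {"example.com": fingerprint(OLD_BODY)}
current_run = {"example.com": NEW_BODY, "example.org": OLD_BODY}

print(detect_changes(previous_run, current_run))
```

Normalizing before hashing is a design choice: it keeps whitespace-only edits from raising alerts, at the cost of missing changes that differ only in blank lines.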

For contrast, the Retail category report shows a sector where the majority of sites are open, with a 16.7% block rate — a fundamentally different orientation toward AI crawler access.

Key Takeaways

9 of 13 Tech sites with a parseable robots.txt block at least one AI crawler — 69.2%.
Tech ranks second across all 10 categories in the June 2026 sealed snapshot.
The corpus-wide baseline is 48 of 107 sites (44.9%); Tech exceeds it by about 24 percentage points.
4 Tech sites — github.com, engadget.com, hackernews.com, slashdot.org — have robots.txt files that allow all tracked AI crawlers.
3 Tech sites — github.com, engadget.com, slashdot.org — publish an llms.txt structured-guidance file.
2 Tech sites — stackoverflow.com, producthunt.com — have no parseable robots.txt.
Corpus-wide, Common Crawl is blocked by 40 of 107 sites; Anthropic by 39.
Nothing in this report is estimated, modeled, or extrapolated — all counts are from the sealed June 13, 2026 snapshot.

FAQ

Q: Does blocking a crawler in robots.txt actually stop it?

A: No. robots.txt is an honor-system protocol. A crawler that ignores a Disallow directive faces no technical barrier at the HTTP layer. The file expresses the site operator's stated preference, and enforcement depends on the crawler operator choosing to comply.
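To make the honor-system point concrete, the sketch below shows where the check actually happens: inside the crawler's own code, before it sends a request. It uses Python's standard `urllib.robotparser`; the robots.txt body, the `may_fetch` helper, the SomeOtherBot name, and the example.com URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body, not taken from any site in the corpus.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """What a compliant crawler asks itself before requesting a URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("GPTBot", "https://example.com/article"))        # False
print(may_fetch("SomeOtherBot", "https://example.com/article"))  # True
print(may_fetch("SomeOtherBot", "https://example.com/admin/x"))  # False
```

Nothing on the server side runs this logic. A crawler that never calls a check like `may_fetch` can still request the same URLs, which is why the file records a preference rather than enforcing one.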

Q: What is an llms.txt file and why do 3 Tech sites have one?

A: llms.txt is an emerging voluntary convention: a Markdown file published at the site root that gives AI systems a curated summary of the site and pointers to its key content, rather than a crawl restriction. github.com, engadget.com, and slashdot.org have published one. Its presence indicates these sites are engaging with AI access as a structured policy matter rather than opting for blanket restrictions. The convention is newer than robots.txt and adoption across the corpus is still limited: 20 of the 107 sites with a robots.txt also publish an llms.txt, a corpus-wide rate of 18.7%.

Q: Why do stackoverflow.com and producthunt.com have no robots.txt?

A: No parseable robots.txt response was found for these sites at the time of the snapshot. This may reflect a missing file, a different HTTP status code, or a server configuration issue. In this pipeline, nothing is estimated, modeled, or extrapolated — only confirmed parseable responses are counted. The absence of a robots.txt file is not a positive permission for AI crawlers; it simply means the site has not published a formal machine-readable signal through this channel.

Q: Why is venturebeat.com — a site that covers AI — on the blocking list?

A: The sealed data shows that venturebeat.com returns a robots.txt file that restricts at least one AI crawler. The editorial focus of a site does not determine its robots.txt posture; access-governance decisions reflect business and legal considerations, not just topic area.

Q: How does the Tech sector compare to the overall corpus?

A: The full corpus covers 122 sites and 10 categories. Of the 107 sites with a parseable robots.txt, 48 block at least one AI crawler — a 44.9% rate. Tech at 69.2% is well above that average, placing it in the top tier of AI-access restriction alongside News (82.4%) and Entertainment (66.7%).

For the whole-web baseline behind the Tech category, see our national study on how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Tech Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-tech-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 741353c4304216ee
