Do Social Media Sites Block AI Crawlers? Sealed Data
Who This Is For
This report is for content-strategy leads, SEO directors, and platform-policy teams at social media companies and publishing platforms who need to understand how the Social category positions itself in relation to AI crawlers. If your organization operates a social or publishing platform, or monitors how these platforms govern AI crawler access, the sealed figures below provide the sector baseline for June 2026.
TL;DR
4 of 10 Social media sites block at least one AI crawler, a 40% rate. Every site in this category has a parseable robots.txt file; in the cross-category table below, only Entertainment and Travel match that universal coverage. At 40%, the category sits just below the corpus-wide average of 44.9%. The category is also notable for its relatively high llms.txt adoption: 4 of the 10 sites publish a structured-guidance llms.txt file.
The Social Category Finding
Of the 10 Social sites checked, all 10 returned a parseable robots.txt file. In the cross-category table, only Entertainment and Travel (9 of 9 each) share this universal coverage. Of those 10 Social sites, 4 block at least one AI crawler, a category block rate of 40%.
Six sites (reddit.com, pinterest.com, substack.com, wordpress.com, blogger.com, and twitch.tv) have robots.txt files that do not restrict any of the 21 tracked AI crawlers. No Social site has a missing or unparseable robots.txt file.
The 4 blocking sites are: linkedin.com, tumblr.com, medium.com, and vimeo.com.
Four sites (reddit.com, pinterest.com, wordpress.com, and twitch.tv) publish an llms.txt structured-guidance file. This gives Social the highest absolute count of llms.txt publishers of any category in this corpus. All four are among the 6 allowers; only substack.com and blogger.com allow AI crawlers without also publishing llms.txt.
Methodology
US Tech Automations Research fetched the public robots.txt file for each of the 122 sites in this corpus on June 13, 2026, using standard HTTP GET requests; 107 of them returned a parseable file. Each response was parsed for User-agent directives targeting any of 21 tracked AI crawler identifiers across 12 operators. A site is counted as "blocking" if at least one Disallow rule for at least one AI crawler token applies to a non-empty path.
Only responses with parseable content are counted. Sites returning anything else, such as an error status code, are categorized as "no robots.txt." The snapshot is sealed under sha 741353c4304216ee. Nothing is estimated, modeled, or extrapolated: every count in this report is a verbatim figure from the sealed snapshot.
The parsing logic applied a strict literal match on each of the 21 tracked crawler user-agent strings. Wildcard User-agent rules were treated as a fallback; any crawler-specific directive took precedence. No modeling was applied to infer what a site "probably" blocks based on its category or size — a site either returned a parseable robots.txt with a matching Disallow directive, or it was recorded as not blocking. The Social category is notable here because no site required the "no robots.txt" fallback classification — all 10 returned parseable responses.
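The classification rule described above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the snapshot: the `TRACKED` tuple is a small illustrative subset of real crawler tokens (the report's full list of 21 identifiers is not given here), token comparison is assumed to be case-insensitive, and `wildcard_fallback` is a hypothetical switch because the report does not spell out how wildcard-only rules were weighed.

```python
# Illustrative subset of AI crawler tokens; NOT the report's full list of 21.
TRACKED = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")


def parse_groups(robots_txt):
    """Split robots.txt text into (user_agents, non_empty_disallow_paths) groups."""
    groups = []
    agents, paths, seen_rule = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                groups.append((agents, paths))
                agents, paths, seen_rule = [], [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value:  # empty Disallow blocks nothing
                paths.append(value)
    if agents:
        groups.append((agents, paths))
    return groups


def blocked_crawlers(robots_txt, tracked=TRACKED, wildcard_fallback=False):
    """Return the tracked tokens that face at least one non-empty Disallow."""
    groups = parse_groups(robots_txt)
    blocked = set()
    for token in tracked:
        specific = [paths for agents, paths in groups if token.lower() in agents]
        if specific:
            # Crawler-specific groups take precedence over the wildcard group.
            if any(specific):
                blocked.add(token)
        elif wildcard_fallback:
            if any(paths for agents, paths in groups if "*" in agents):
                blocked.add(token)
    return blocked


sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin\n"
site_is_blocking = bool(blocked_crawlers(sample))
```

Under these assumptions a site counts as "blocking" when `blocked_crawlers` returns a non-empty set.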
Social Category Summary Table
| Metric | Value |
|---|---|
| Sites checked | 10 |
| Sites with parseable robots.txt | 10 |
| Sites blocking at least one AI crawler | 4 |
| Category block rate | 40% |
| Sites with llms.txt | 4 |
| Sites with no robots.txt | 0 |
Cross-Category Ranking
The Social category ranks fifth among all 10 content categories measured. The full cross-category table from the sealed snapshot is below.
| Rank | Category | Sites | With robots.txt | Any block | Block rate |
|---|---|---|---|---|---|
| 1 | News | 20 | 17 | 14 | 82.4% |
| 2 | Tech | 15 | 13 | 9 | 69.2% |
| 3 | Entertainment | 9 | 9 | 6 | 66.7% |
| 4 | Reference | 14 | 11 | 6 | 54.5% |
| 5 | Social | 10 | 10 | 4 | 40.0% |
| 6 | Travel | 9 | 9 | 3 | 33.3% |
| 7 | Finance | 12 | 11 | 2 | 18.2% |
| 8 | Retail | 15 | 12 | 2 | 16.7% |
| 9 | Education | 9 | 7 | 1 | 14.3% |
| 10 | Government | 9 | 8 | 1 | 12.5% |
Social at 40% sits just under the corpus-wide rate of 44.9% (48 of the 107 sites with a parseable robots.txt). It ranks fifth of 10: above Travel, Finance, Retail, Education, and Government, but below News, Tech, Entertainment, and Reference.
Universal robots.txt coverage is a notable feature of this category. Every Social site in this corpus publishes a machine-readable crawl-governance signal; in the table above, only Entertainment and Travel (9 of 9 each) do the same.
For comparative context, the Reference category report covers the fourth-ranked sector at 54.5%, and the Retail category report covers the eighth-ranked sector at 16.7% — a useful bracket around where Social sits.
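As a quick check on the arithmetic, the block rates and the corpus-wide figure can be recomputed from the table's own columns. The sketch below assumes each rate is "Any block" divided by "With robots.txt", which matches every published percentage; the variable and function names are illustrative.

```python
# (category, sites, with_robots_txt, any_block), copied from the table above.
ROWS = [
    ("News", 20, 17, 14),
    ("Tech", 15, 13, 9),
    ("Entertainment", 9, 9, 6),
    ("Reference", 14, 11, 6),
    ("Social", 10, 10, 4),
    ("Travel", 9, 9, 3),
    ("Finance", 12, 11, 2),
    ("Retail", 15, 12, 2),
    ("Education", 9, 7, 1),
    ("Government", 9, 8, 1),
]


def block_rate(with_robots, any_block):
    """Percentage of sites with a parseable robots.txt that block any AI crawler."""
    return round(100 * any_block / with_robots, 1)


rates = {name: block_rate(w, b) for name, _, w, b in ROWS}

total_sites = sum(s for _, s, _, _ in ROWS)
total_with_robots = sum(w for _, _, w, _ in ROWS)
total_blocking = sum(b for _, _, _, b in ROWS)
corpus_rate = block_rate(total_with_robots, total_blocking)

# Rank categories from most to least restrictive.
ranking = sorted(rates, key=rates.get, reverse=True)
social_rank = ranking.index("Social") + 1

print(total_sites, total_with_robots, total_blocking, corpus_rate, social_rank)
```

The denominator is the 107 sites with a parseable file, not all 122 sites fetched, which is why the corpus-wide rate is 44.9% rather than 39.3%.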
Most-Blocked Operators and Bots (Corpus-Wide, All 107 Sites)
These figures are corpus-wide counts across all 107 sites — not Social-specific. They identify which operators face the broadest blocking across the full dataset.
Most-Blocked Operators (all 107 sites)
| Operator | Sites blocking their crawlers |
|---|---|
| Common Crawl | 40 |
| Anthropic | 39 |
| ByteDance | 37 |
| OpenAI | 35 |
| Meta | 35 |
| Apple | 31 |
| Diffbot | 30 |
| Perplexity | 29 |
| Cohere | 27 |
| Google | 25 |
| Amazon | 22 |
| Mistral | 12 |
Common Crawl leads at 40 sites across the corpus. Anthropic (39) and ByteDance (37) follow. OpenAI and Meta are tied at 35. Apple (31), Diffbot (30), Perplexity (29), and Cohere (27) form a mid-tier. Google (25) and Amazon (22) follow. Mistral at 12 is the tail. These counts are corpus-wide — not specific to the Social category.
Site-Level Analysis
The 4 blocking Social sites span a range of platform models. linkedin.com is a professional networking platform; its blocking posture may reflect concerns about AI systems ingesting profile and content data at scale. tumblr.com is a long-form blogging and multimedia platform. medium.com is a publishing platform for editorial essays and articles. vimeo.com is a video-hosting platform whose content is protected by creator rights agreements.
The 6 allowers represent a different set of platform models. reddit.com, one of the largest community-discussion platforms on the web, allows AI crawlers through its robots.txt while also publishing an llms.txt structured-guidance file. pinterest.com similarly allows AI crawlers and publishes llms.txt. wordpress.com and twitch.tv also allow AI crawlers and publish llms.txt.
substack.com and blogger.com allow AI crawlers but do not publish llms.txt. Both are publishing platforms for creator-driven written content.
The co-presence of allower status and llms.txt publication — seen in reddit.com, pinterest.com, wordpress.com, and twitch.tv — represents a nuanced posture: these platforms are not restricting AI crawlers but are providing structured guidance about how AI systems should engage with their content. This is a more sophisticated approach than a binary block-or-allow decision.
Automation Bridge
For platform-policy teams and content-strategy leads at social media and publishing companies, the Social category data shows a sector in the middle of the spectrum — with a mix of blockers, permissive allowers, and llms.txt adopters. Tracking how that mix shifts over time is an ongoing monitoring task.
Manual checks of robots.txt and llms.txt files across a defined set of platforms are feasible once, but become operationally burdensome when changes need to be detected on a recurring basis. US Tech Automations builds automated workflows that schedule, fetch, and parse these files, compare against prior snapshots, and route alerts to the appropriate team when a change is detected. That is precisely the kind of structured monitoring pipeline US Tech Automations delivers.
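The compare-against-prior-snapshot step might look something like the sketch below. It is a hypothetical illustration, not the vendor's pipeline: `fingerprint` and `detect_changes` are invented names, the snapshots are assumed to be plain `{site: file_text}` dictionaries, and fetching, scheduling, and alert routing are left out.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to compare two versions of a file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two {site: file_text} snapshots and label each difference."""
    changes = {}
    for site in sorted(previous.keys() | current.keys()):
        if site not in previous:
            changes[site] = "added"
        elif site not in current:
            changes[site] = "removed"
        elif fingerprint(previous[site]) != fingerprint(current[site]):
            changes[site] = "changed"
    return changes


# Placeholder hostnames and file contents, not data from the report.
last_week = {
    "a.example": "User-agent: *\nDisallow:\n",
    "b.example": "User-agent: *\nDisallow: /tmp\n",
}
this_week = {
    "a.example": "User-agent: GPTBot\nDisallow: /\n",
    "c.example": "User-agent: *\nDisallow:\n",
}
changes = detect_changes(last_week, this_week)
```

Sites whose files are byte-for-byte identical are omitted from the result, so an empty dictionary means nothing needs review.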
For a look at the most restrictive end of the spectrum, the News category report shows an 82.4% block rate — a useful reference point for understanding how different content models approach AI access governance.
Key Takeaways
4 of 10 Social sites block at least one AI crawler — a 40% rate, just below the corpus-wide 44.9%.
Social ranks fifth across all 10 categories in the June 2026 sealed snapshot.
Social is one of three categories in this corpus, with Entertainment and Travel, where every measured site has a parseable robots.txt: 10 of 10.
6 Social sites allow all tracked AI crawlers; 4 of those also publish llms.txt structured-guidance files.
The 4 blocking sites are linkedin.com, tumblr.com, medium.com, and vimeo.com.
No Social site in this corpus has a missing or unparseable robots.txt.
Corpus-wide, Common Crawl is blocked by 40 of 107 sites; Anthropic by 39.
Nothing in this report is estimated, modeled, or extrapolated — all counts are from the sealed June 13, 2026 snapshot.
FAQ
Q: Does blocking a crawler in robots.txt actually stop it?
A: No. robots.txt is an honor-system standard. A Disallow directive has no technical enforcement mechanism at the HTTP layer. Whether a crawler respects the directive depends entirely on the crawler operator's compliance. The file states the site operator's preference; it does not enforce it.
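To make the honor-system point concrete, the sketch below shows the check a compliant crawler would run on its own side before requesting a page, using Python's standard `urllib.robotparser`. The `may_fetch` helper, the rules, and the URL are illustrative assumptions; nothing on the server runs this check, so a crawler that skips it is not stopped.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Client-side check: does this robots.txt permit user_agent to fetch url?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules that disallow a single crawler token everywhere.
RULES = "User-agent: GPTBot\nDisallow: /\n"

gptbot_allowed = may_fetch(RULES, "GPTBot", "https://example.com/post/1")
other_allowed = may_fetch(RULES, "SomeOtherBot", "https://example.com/post/1")
```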
Q: What makes the Social category unique compared to other categories in this corpus?
A: Every measured Social site, all 10, has a parseable robots.txt file; in the cross-category table, only Entertainment and Travel (9 of 9 each) match that coverage. The Social category also has a relatively high rate of llms.txt adoption: 4 of the 10 sites (40%) publish an llms.txt, compared to the corpus-wide rate of 20 of 107 sites (18.7%).
Q: Why do some Social sites that allow AI crawlers also publish an llms.txt?
A: An llms.txt file and a permissive robots.txt are not in conflict. llms.txt provides structured guidance to AI systems about what to index and how to attribute content, without restricting access. A site can simultaneously allow AI crawlers through robots.txt and provide detailed guidance about how those crawlers should behave through llms.txt. reddit.com, pinterest.com, wordpress.com, and twitch.tv take exactly this approach in the sealed snapshot.
Q: Why is linkedin.com listed as a blocker while platforms like reddit.com allow AI crawlers?
A: The sealed data shows that linkedin.com returns a robots.txt file with at least one AI-crawler Disallow directive, while reddit.com returns a robots.txt with no such restrictions. The reasons behind each site's access decision are not determinable from the robots.txt file alone; the data records only the observed posture.
Q: How should I interpret the 40% block rate relative to the corpus average?
A: The corpus-wide block rate is 44.9%: 48 of the 107 sites with a parseable robots.txt. The Social category at 40% is slightly below that average. It is neither a strongly restrictive nor a strongly permissive sector; it occupies the middle of the 10-category ranking. Its complete robots.txt coverage is more distinctive than the block rate itself.
Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 741353c4304216ee).
Cite this report
US Tech Automations Research, 2026-06 edition. “Do Social Media Sites Block AI Crawlers? Sealed Data.” https://ustechautomations.com/resources/blog/do-social-media-sites-block-ai-crawlers-2026
Sealed snapshot sha256: 741353c4304216ee