Do Agriculture Sites Block AI Crawlers? 3 of 9 Do

Jun 14, 2026

Agriculture comes in just below the corpus average: 3 of 9 Agriculture sites with a parseable robots.txt block at least one AI crawler, a 33.3% block rate. The broader June 2026 snapshot covers 493 sites; among the 417 counted in the block-rate denominator, 150 block at least one crawler, a 36% corpus rate.

That places Agriculture slightly below average on protection: six of the nine sites remain open to AI crawlers. The distinction matters because agriculture is a data-rich industry where soil science, market prices, agronomic research, and crop inputs represent genuine commercial value, yet most operators have not deployed AI-access restrictions.

A robots.txt file is a publicly accessible plain-text document at a site's root that instructs automated clients — including AI training crawlers — which paths are accessible and which are off-limits. For AI specifically, operators may disallow named bots like CCBot, GPTBot, or ClaudeBot. This report reads those policies verbatim from the sealed June 14, 2026 snapshot and reports counts without estimation or inference.
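As a minimal sketch of how such a policy reads in practice, the snippet below feeds a made-up robots.txt to Python's standard-library urllib.robotparser and asks whether three of the bots named above may fetch the site root. The sample file, the example.com URL, and the helper name root_access are illustrative assumptions; none of them come from a site in the snapshot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI crawlers are named and disallowed,
# every other client is only kept out of /admin/.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def root_access(robots_txt, bots):
    """Return {bot: bool} saying whether each bot may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder; only the path "/" matters to the check.
    return {bot: parser.can_fetch(bot, "https://example.com/") for bot in bots}


if __name__ == "__main__":
    result = root_access(SAMPLE_ROBOTS_TXT, ["CCBot", "GPTBot", "ClaudeBot"])
    for bot, allowed in result.items():
        print(f"{bot}: {'allowed' if allowed else 'disallowed'}")
```

In this sample, CCBot and GPTBot are disallowed while ClaudeBot falls through to the wildcard group and stays allowed at the root, which is the kind of partial block this report still counts as blocking.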

3 of 9 Agriculture sites block at least one AI crawler.

Agriculture sites post a 33.3% AI-crawler block rate.

Corpus-wide, 150 of 417 sites block at least one AI crawler.

Key Takeaways

  • 3 of 9 Agriculture sites with a parseable robots.txt block at least one AI crawler — a 33.3% block rate.

  • agriculture.com, farmprogress.com, and modernfarmer.com are the 3 blocking sites in this category.

  • The corpus-wide block rate across all 417 sites is 36% — Agriculture at 33.3% falls just below that line.

  • Across all 417 sites, CCBot is disallowed by 118 sites — the most-blocked bot in the sealed snapshot.

  • successfulfarming.com returned no robots.txt and is excluded from the block-rate calculation.

The Blocking Divide: Which Agriculture Sites Restrict AI Crawlers

Three sites block at least one AI crawler in this category: agriculture.com, farmprogress.com, and modernfarmer.com. Six sites allow all crawlers.

Agriculture.com is a long-running farm media brand under Meredith/Dotdash, a large media company with a portfolio-level approach to AI access policy. Media companies that have engaged with AI licensing questions tend to apply consistent robots.txt policies across their brands; agriculture.com fits that pattern.

FarmProgress is a network of regional farming publications (Farm Progress, Prairie Farmer, Wallaces Farmer) under Farm Progress Companies. As a trade publisher, it follows the editorial-media instinct toward protecting original content.

ModernFarmer is a consumer-facing magazine covering sustainable agriculture, food culture, and farming life; it produces editorial content that, like other digital-first magazines, is a clear target for AI training corpora.

The 6 allowing sites are agweb.com, dtnpf.com, croplife.com, no-tillfarmer.com, farmanddairy.com, and agfundernews.com. Each operates in a distinct corner of the agriculture information space.

Of 10 Agriculture sites checked, 9 returned a parseable robots.txt; successfulfarming.com published no robots.txt file.

The 3 Agriculture blockers — agriculture.com, farmprogress.com, and modernfarmer.com — are all media or editorial brands; the 6 allowing sites span data services, trade press, and agtech coverage.

AgWeb is a digital-only farm news platform operated by Farm Journal that allows all AI crawlers, placing it in a different camp from farmprogress.com despite both being farm trade publications. DTN/Progressive Farmer (dtnpf.com) provides agricultural weather, commodity prices, and market news; its allowing posture is notable given how commercially sensitive commodity pricing data can be. CropLife covers crop protection and precision agriculture for agronomists and crop advisors. No-Till Farmer and Farm and Dairy are both regional trade publications with open policies. AgFunderNews covers agtech investment and startups — a publication likely to welcome AI discoverability as a distribution mechanism.

SuccessfulFarming.com returned no robots.txt, placing it in the noRobotsSites array. A missing robots.txt does not mean the site intends to allow all crawlers — it means no policy has been published.

Situating Agriculture in the 48-Category Ranking

3 of 9 Agriculture sites block at least one AI crawler, a 33.3% rate.

The focused window below centers Agriculture in the ranked distribution — the categories that sit just above and below it:

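Category | Sites Checked | With robots.txt | Blocking Any AI Crawler | Block Rate
Jobs | 10 | 8 | 3 | 37.5%
Aviation | 10 | 8 | 3 | 37.5%
Architecture | 8 | 8 | 3 | 37.5%
Travel | 9 | 9 | 3 | 33.3%
Weather | 10 | 6 | 2 | 33.3%
Beauty | 10 | 6 | 2 | 33.3%
Agriculture | 10 | 9 | 3 | 33.3%
Legal | 10 | 7 | 2 | 28.6%
RealEstate | 10 | 7 | 2 | 28.6%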

CCBot is blocked by 118 of 417 sites across the corpus — the most-blocked bot.

150 of 417 sites corpus-wide block at least one AI crawler — a 36% rate.

Agriculture shares the 33.3% rate with Travel, Weather, and Beauty, a cluster of verticals that sit just below the corpus-average 36% line. Above them, Jobs, Aviation, and Architecture are at 37.5%. Agriculture matches Travel exactly at 3 blockers out of 9 robots.txt-publishing sites; Weather and Beauty reach the same percentage from a smaller pool, with 2 blockers out of 6.
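The arithmetic behind that tie can be checked with a short sketch. The rows are copied from the focused window table above; the helper names (block_rate, as_percent, tier_mates) are hypothetical, and exact fractions are used so that 3 of 9 and 2 of 6 compare as equal rather than as rounded floats.

```python
from fractions import Fraction

# (category, sites with robots.txt, sites blocking any AI crawler),
# copied from the focused window table in this report.
WINDOW = [
    ("Jobs", 8, 3), ("Aviation", 8, 3), ("Architecture", 8, 3),
    ("Travel", 9, 3), ("Weather", 6, 2), ("Beauty", 6, 2),
    ("Agriculture", 9, 3), ("Legal", 7, 2), ("RealEstate", 7, 2),
]
CORPUS = Fraction(150, 417)  # corpus-wide blockers / sites in the denominator


def block_rate(blockers, with_robots):
    """Block rate = blockers divided by sites with a parseable robots.txt."""
    return Fraction(blockers, with_robots)


def as_percent(rate):
    """Format a rate with one decimal place, e.g. 1/3 -> '33.3%'."""
    return f"{float(rate) * 100:.1f}%"


def tier_mates(category):
    """Other categories in WINDOW whose exact block rate equals this one's."""
    rates = {name: block_rate(b, n) for name, n, b in WINDOW}
    return [name for name, r in rates.items()
            if r == rates[category] and name != category]


if __name__ == "__main__":
    for name, n, b in WINDOW:
        print(f"{name}: {b} of {n} = {as_percent(block_rate(b, n))}")
    print("Corpus:", as_percent(CORPUS))
    print("Same rate as Agriculture:", tier_mates("Agriculture"))
```

Note that sites without a robots.txt never enter the denominator, which is why Agriculture is 3 of 9 rather than 3 of 10.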

For orientation at the extremes:

Category | Block Rate
Gaming | 88.9%
News | 82.4%
Nonprofit | 0%
Telecom | 0%
Banking | 0%

Agriculture is meaningfully below the highest-blocking verticals — gaming (88.9%) and news (82.4%) — where content protection is far more prevalent. It sits comfortably above the zero-block categories at the bottom of the distribution. For a look at a category with no blockers at all, the energy vertical shares the corporate-utility logic that explains permissive robots.txt policies in legacy-industrial sectors.

Corpus-Wide Bot and Operator Coverage (All 417 Sites)

The leaderboards below are corpus-wide — they reflect disallow counts across all 417 sites in the snapshot, not Agriculture alone. They provide context for which AI operators and bots face the most resistance across the web:

AI Operator | Sites That Disallow It (of 417)
Common Crawl | 118
Anthropic | 113
OpenAI | 97
Meta | 97
ByteDance | 90
Google | 81
Apple | 81
Perplexity | 76
Cohere | 73
Amazon | 70
Diffbot | 68
Mistral | 24

Common Crawl leads at 118 sites blocked, followed closely by Anthropic at 113, with OpenAI and Meta tied at 97. Mistral is blocked by just 24 sites, which may reflect its more recent entry into widespread awareness. Across the corpus, 84 of 417 sites (20.1%) publish an llms.txt file; Agriculture's open-access majority could benefit from llms.txt adoption as a lightweight way to signal training permissions without a full robots.txt overhaul.

The bot-level breakdown (named crawler tokens, all 417 sites):

Bot Token | Sites Disallowing It | Percentage of 417
CCBot | 118 | 28.3%
ClaudeBot | 104 | 24.9%
GPTBot | 93 | 22.3%
Bytespider | 90 | 21.6%
Meta-ExternalAgent | 84 | 20.1%
Applebot-Extended | 81 | 19.4%
Google-Extended | 81 | 19.4%
PerplexityBot | 75 | 18.0%
Amazonbot | 70 | 16.8%

CCBot remains the most blocked individual crawler token at 28.3% of all sites. Amazonbot, the most recently named token in widespread robots.txt adoption, appears in 70 disallow blocks.

Methodology: How This Report Was Sealed

This is point-in-time data — nothing is estimated, modeled, or extrapolated. US Tech Automations Research collected each robots.txt file from public URLs on June 14, 2026, parsed disallow rules against a reference list of 9 known AI crawler tokens, and sealed the full record under snapshot sha c5960481aa465ad3. Every figure in this report is a verbatim count from that sealed record.

The process:

  1. Collect. Fetch robots.txt from each domain's root. Log the HTTP result. Sites that do not serve a parseable file are recorded in noRobotsSites.

  2. Parse. Extract all user-agent blocks and associated disallow rules. Match against the 9 named AI crawler tokens in the reference list.

  3. Classify. Any site with at least one named AI token in a disallow rule is classified as a blocker. Partial blocks — e.g., blocking only one of nine bots — count.

  4. Seal. Hash the complete dataset, append the sha, and lock the record. The seal date is June 14, 2026.
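The parse, classify, and seal steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the token tuple reuses the 9 tokens from the bot-level table as a stand-in for the reference list, any non-empty Disallow rule under a named token counts as a block, and the sealing scheme (SHA-256 over a canonical JSON dump) is a guess at one reasonable approach. The field names sites, withRobots, anyBlock, and noRobotsSites follow the ones this report mentions; blocked_tokens, classify, and seal are hypothetical helper names. The collect step is left out so the sketch runs without network access.

```python
import hashlib
import json
from typing import Dict, Optional, Set

# Assumption: the 9 tokens from the bot-level table stand in for the reference list.
AI_TOKENS = (
    "CCBot", "ClaudeBot", "GPTBot", "Bytespider", "Meta-ExternalAgent",
    "Applebot-Extended", "Google-Extended", "PerplexityBot", "Amazonbot",
)


def blocked_tokens(robots_txt: str) -> Set[str]:
    """Return the named AI tokens that have at least one non-empty Disallow rule."""
    lookup = {t.lower(): t for t in AI_TOKENS}
    blocked: Set[str] = set()
    agents = []        # user-agents of the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                blocked.update(lookup[a] for a in agents if a in lookup)
    return blocked


def classify(sites: Dict[str, Optional[str]]) -> dict:
    """sites maps domain -> robots.txt text, or None if no parseable file was served."""
    no_robots = sorted(d for d, text in sites.items() if text is None)
    per_site = {d: blocked_tokens(t) for d, t in sites.items() if t is not None}
    blockers = sorted(d for d, tokens in per_site.items() if tokens)
    return {
        "sites": len(sites),
        "withRobots": len(per_site),
        "anyBlock": len(blockers),
        "blockers": blockers,
        "noRobotsSites": no_robots,
    }


def seal(record: dict) -> str:
    """Hash a canonical JSON dump of the record (hypothetical sealing scheme)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

One design consequence worth noting: a site whose only rule is a wildcard `User-agent: *` disallow is not classified as a blocker here, because the classification step keys on named AI tokens.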

Frequently Asked Questions

Q: Why does Agriculture have a lower block rate than the corpus average when it is a data-rich industry?

A: Agriculture publishing is split between general media brands (which tend to block) and specialized trade or data-service sites (which tend to allow). The 6 allowing sites span commodity data, agtech news, and regional trade press — categories where discoverability in AI systems carries marketing value. The 3 blockers are the editorial-media end of the spectrum. Until commodity and agronomic data platforms begin engaging with AI licensing, the category will likely remain below average in block rate.

Q: What does it mean that successfulfarming.com has no robots.txt?

A: A missing robots.txt means the site has not published a policy document instructing crawlers. Most well-behaved AI crawlers treat this as permissive by convention, but the operator has made no explicit commitment either way. The site is counted in the sites total (10) but excluded from withRobots (9) and from the block-rate calculation.

Q: How does Agriculture compare to other professional-information verticals?

A: Legal sits at 28.6% block rate with 7 robots.txt-publishing sites and 2 blockers. RealEstate is also at 28.6%. Agriculture at 33.3% is slightly more protective than those two professional verticals. Finance, another data-heavy category, is at 18.2%. Agriculture, despite handling commercially sensitive market data, is more protective than Finance — suggesting that media-brand operations within the vertical are pulling the rate upward relative to pure data providers.

Q: Could a site block some AI crawlers and allow others?

A: Yes. The gate for "blocker" classification is disallowing at least one named AI crawler token. A site might explicitly block CCBot while allowing GPTBot and ClaudeBot. The anyBlock count reflects sites with at least one such rule; it does not require a site to block all 9 tracked crawlers. To understand per-site, per-bot posture, the robots.txt file must be read directly.

Q: Is a 33.3% block rate in Agriculture likely to change in the near term?

A: This sealed snapshot captures one point in time — June 14, 2026 — and makes no predictions. The snapshot cannot be compared to prior editions in this pipeline; this is the first Agriculture observation under the c5960481aa465ad3 seal. Any claim about trend direction would require multiple snapshot comparisons, which this data does not support.

Put AI-Access Data to Work

Three practitioner profiles turn this sealed Agriculture snapshot into a recurring operational asset:

Agtech product and research leads at precision agriculture companies, crop input manufacturers, or farm management software providers need to know which content sources in their vertical are and are not accessible to AI training pipelines. If dtnpf.com's commodity pricing data or agweb.com's market news are currently open, that affects what AI-powered tools can incorporate without licensing overhead.

The workflow: monitor the robots.txt of key Agriculture sources weekly, alert the product team the moment a currently open source like croplife.com or agfundernews.com adds an AI disallow rule, and update the data-access registry before the next training or retrieval run. You can benchmark the Agriculture block rate against a close peer in aviation, which shares a 3-blocker profile but in a different industry context.

Media and content intelligence professionals tracking farm market discourse need to know which primary sources are blocked from AI summarization pipelines. A research firm building an AI-powered agricultural intelligence product that draws from agweb.com, croplife.com, and farmanddairy.com should audit robots.txt status at every content update cycle. If farmprogress.com or modernfarmer.com extends its block list to additional bots, that changes which content assets require manual licensing. A parallel example of this monitoring dynamic is visible in architecture, another editorial-split vertical with the same 3-blocker count.

Data pipeline and retrieval engineers at organizations that use agricultural content in RAG systems or knowledge bases need site-level access status as a first-class operational concern. The workflow: daily robots.txt fetch for each domain in the pipeline, automated diff against the prior known state, and a Jira ticket auto-filed when any token shifts from allowed to disallowed. Successfulfarming.com's lack of a robots.txt is its own signal — it is worth checking whether that site publishes content useful to the pipeline and whether a robots.txt appears in a future crawl.
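The diff step in that workflow might look something like the sketch below. It assumes each day's crawl has already been reduced to a mapping from domain to the set of AI tokens that domain disallows; the Posture alias, the newly_blocked function, and the .example domains are hypothetical, and ticket filing is left to whatever alerting system the team uses.

```python
from typing import Dict, List, Set, Tuple

# domain -> set of AI crawler tokens the site currently disallows
Posture = Dict[str, Set[str]]


def newly_blocked(previous: Posture, current: Posture) -> List[Tuple[str, str]]:
    """Return (domain, token) pairs that were allowed before and are disallowed now.

    Assumption: a domain missing from the previous state is treated as having
    blocked nothing, so every block it has today is reported as new.
    """
    changes = []
    for domain, tokens_now in current.items():
        tokens_before = previous.get(domain, set())
        for token in tokens_now - tokens_before:
            changes.append((domain, token))
    return sorted(changes)


if __name__ == "__main__":
    # Made-up states for illustration only; not data from the snapshot.
    yesterday = {"open-source.example": set(), "media-brand.example": {"CCBot"}}
    today = {
        "open-source.example": {"GPTBot"},
        "media-brand.example": {"CCBot", "ClaudeBot"},
    }
    for domain, token in newly_blocked(yesterday, today):
        print(f"ALERT: {domain} now disallows {token}")
```

Only allowed-to-disallowed shifts are reported; a team that also cares about blocks being lifted could run the same function with the arguments swapped.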

US Tech Automations runs those scheduled re-crawls, change-detection diffs, and alert routing automatically — your team sees a change the moment it happens rather than discovering it after retraining. Set up automated AI-access monitoring.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha c5960481aa465ad3).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Agriculture Sites Block AI Crawlers? 3 of 9 Do.” https://ustechautomations.com/resources/blog/do-agriculture-sites-block-ai-crawlers-2026

Sealed snapshot sha256: c5960481aa465ad3

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.