Do Energy Sites Block AI Crawlers? None Do

Jun 14, 2026

Not a single Energy site in the June 2026 snapshot blocks an AI crawler. Of the 6 Energy sites that returned a parseable robots.txt, 0 block any AI crawler, a 0% block rate. Across the broader corpus of 493 sites, 417 returned a parseable robots.txt, and 150 of those (36%) block at least one AI crawler.

Energy sits at the far permissive end of the distribution, sharing the 0% tier with Banking, Telecom, Nonprofit, Streaming, and Dating. For a vertical dominated by some of the largest corporations on earth — exxonmobil.com, shell.com, nexteraenergy.com, dominionenergy.com, southerncompany.com, nationalgrid.com — the absence of any AI crawling restriction is a data point worth explaining.

A robots.txt file is a plain-text policy document placed at a site's root domain instructing automated clients — search crawlers, AI training systems, data scrapers — which paths they may access. For AI specifically, operators may name crawlers like GPTBot, ClaudeBot, or CCBot in disallow rules. This report reads those policies verbatim from the sealed June 14, 2026 snapshot and reports counts exactly as found.
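As an illustration of how a named-token rule works, the sketch below feeds a hypothetical robots.txt (not taken from any site in the snapshot) to Python's standard-library parser. The file contents, the example.com URLs, and the `can_fetch` helper name are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: GPTBot is barred site-wide, everyone else
# is only kept out of /admin/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the policy text permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_fetch(EXAMPLE_ROBOTS_TXT, "GPTBot", "https://example.com/reports/"))     # False
print(can_fetch(EXAMPLE_ROBOTS_TXT, "ClaudeBot", "https://example.com/reports/"))  # True
print(can_fetch(EXAMPLE_ROBOTS_TXT, "ClaudeBot", "https://example.com/admin/x"))   # False
```

A crawler that is not named falls through to the wildcard group, which is why a file with only search-era rules leaves AI crawlers unrestricted on public paths.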

Key Takeaways

  • 0 of 6 Energy sites with a parseable robots.txt block any AI crawler — a 0% block rate.

  • No Energy site that published a policy disallows any tracked AI crawler; all 6 leave AI crawlers unrestricted.

  • 4 of 10 Energy sites checked — chevron.com, bp.com, duke-energy.com, and aep.com — returned no robots.txt at all.

  • Across the 417 corpus sites with a parseable robots.txt, 150 block at least one AI crawler (36%); Energy sits at the permissive extreme, one of six categories at 0%.

  • The corpus-wide most-blocked bot is CCBot, disallowed by 118 sites — none of them in the Energy category.

Why Energy Sites Allow All AI Crawlers

The 6 Energy sites with a parseable robots.txt are exxonmobil.com, shell.com, nexteraenergy.com, dominionenergy.com, southerncompany.com, and nationalgrid.com. Every one of them allows all AI crawlers we tracked.

The explanation is structural. Legacy-corporate and utility-sector websites have historically used robots.txt for a narrow set of operational purposes: preventing search engines from indexing internal administrative paths, staging environments, or duplicate URL patterns. These robots.txt files were built for the search-engine era, not the AI-training era. Naming AI crawlers by token (GPTBot, ClaudeBot, Bytespider) in robots.txt became common practice only from 2023 onward. Large legacy-corporate websites update their robots.txt infrequently; many have policies that have not been substantively revised in years.

Equally important: the content that major energy companies publish on their public websites is inherently promotional. Investor relations pages, sustainability reports, press releases, product brochures, and corporate responsibility materials are designed for maximum dissemination — including through AI-powered search summaries and chatbot answers. A major utility like nexteraenergy.com or dominionenergy.com benefits when AI answers about renewable energy or grid infrastructure cite or reflect their publicly published content. Blocking AI crawlers would work against that visibility.

The 4 sites without a robots.txt (chevron.com, bp.com, duke-energy.com, and aep.com) are counted as neither blockers nor allowers. They are recorded in noRobotsSites and are excluded from the withRobots count of 6 and from any block-rate calculation. The snapshot cannot show why the file is missing; one plausible reading is that these are large, complex web presences where robots.txt maintenance is not a front-of-mind concern.
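The exclusion rule can be made concrete with a few lines of arithmetic. This is a minimal sketch, not the report's actual pipeline: the snake_case names are adaptations of the noRobotsSites and withRobots fields mentioned above, and the `block_rate` helper is hypothetical.

```python
from typing import Optional

# Energy domains as listed in this report.
WITH_ROBOTS = ["exxonmobil.com", "shell.com", "nexteraenergy.com",
               "dominionenergy.com", "southerncompany.com", "nationalgrid.com"]
NO_ROBOTS_SITES = ["chevron.com", "bp.com", "duke-energy.com", "aep.com"]
BLOCKERS: list[str] = []  # no Energy site disallows a tracked AI crawler


def block_rate(blockers: int, with_robots: int) -> Optional[float]:
    """Percent of policy-publishing sites that block; None if undefined."""
    if with_robots == 0:
        return None  # no published policies, so no rate can be stated
    return round(100 * blockers / with_robots, 1)


# Sites without a robots.txt never enter the denominator.
sites_checked = len(WITH_ROBOTS) + len(NO_ROBOTS_SITES)
energy_rate = block_rate(len(BLOCKERS), len(WITH_ROBOTS))
corpus_rate = block_rate(150, 417)
print(sites_checked, energy_rate, corpus_rate)  # 10 0.0 36.0
```

Returning None for a category with no published policies keeps "no data" distinct from a genuine 0% rate.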

What would make a future block significant? If a major energy operator adds AI-specific disallow rules, it would likely signal either a specific content-protection concern (proprietary technical documentation, internal research that inadvertently surfaced publicly) or a policy-driven decision by a legal or communications team following an industry event. Energy companies with large proprietary datasets — well log databases, exploration data, grid topology maps — do not publish those on public websites, which means their robots.txt policies today are largely irrelevant to their actual data-protection strategy.

How Energy Compares Across the 48-Category Ranking

Energy shares the 0% block rate with 5 other categories in the snapshot. The focused window below centers on the permissive end of the distribution, showing where Energy sits among the lowest-blocking categories and the transition zone above them:

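| Category | Sites Checked | With robots.txt | Blocking Any AI Crawler | Block Rate |
|---|---|---|---|---|
| Insurance | 10 | 9 | 1 | 11.1% |
| Cybersecurity | 10 | 9 | 1 | 11.1% |
| Nonprofit | 10 | 6 | 0 | 0% |
| Streaming | 10 | 10 | 0 | 0% |
| Dating | 10 | 5 | 0 | 0% |
| Banking | 7 | 7 | 0 | 0% |
| Telecom | 10 | 6 | 0 | 0% |
| Energy | 10 | 6 | 0 | 0% |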

Energy sits in a group with Banking and Telecom — all three are regulated, legacy-corporate verticals where the 0% block rate reflects similar structural reasons. Streaming and Dating share the 0% rate for different reasons (discoverability is their product; content restriction runs counter to their model).

For orientation at the top of the distribution:

| Category | Block Rate |
|---|---|
| Gaming | 88.9% |
| News | 82.4% |
| Food | 70% |
| Tech | 69.2% |
| Entertainment | 66.7% |

The contrast is sharp. Gaming (88.9%) and News (82.4%) are categories where original content drives commercial value and AI training scrapes directly threaten the revenue model. Energy operators face the opposite incentive structure: their commercial value comes from physical assets, long-term contracts, and regulated rate structures — not from website content.

Corpus-Wide Operator and Bot Leaderboard (All 417 Sites)

No Energy site contributes to any of the counts below. These reflect the full corpus of 417 sites with a parseable robots.txt, providing context for which AI operators are most frequently blocked across the broader web:

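| AI Operator | Sites That Block It (of 417) |
|---|---|
| Common Crawl | 118 |
| Anthropic | 113 |
| OpenAI | 97 |
| Meta | 97 |
| ByteDance | 90 |
| Google | 81 |
| Apple | 81 |
| Perplexity | 76 |
| Cohere | 73 |
| Amazon | 70 |
| Diffbot | 68 |
| Mistral | 24 |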

Common Crawl faces the most blocks across the corpus at 118 sites. Anthropic is second at 113, and OpenAI and Meta are tied at 97. Mistral trails every other operator listed, blocked by just 24 sites. Operator totals are not the same as single-token counts: Anthropic's 113 compares with 104 sites that disallow the ClaudeBot token specifically.

The 9-bot leaderboard by named crawler token:

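| Bot Token | Sites Disallowing It | Percentage of 417 |
|---|---|---|
| CCBot | 118 | 28.3% |
| ClaudeBot | 104 | 24.9% |
| GPTBot | 93 | 22.3% |
| Bytespider | 90 | 21.6% |
| Meta-ExternalAgent | 84 | 20.1% |
| Applebot-Extended | 81 | 19.4% |
| Google-Extended | 81 | 19.4% |
| PerplexityBot | 75 | 18% |
| Amazonbot | 70 | 16.8% |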
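The percentage column can be recomputed from the counts. The sketch below assumes the denominator is the 417 sites with a parseable robots.txt and that figures are rounded to one decimal place; the `share` helper is a hypothetical name.

```python
CORPUS_WITH_ROBOTS = 417

# Per-token disallow counts, copied from the table above.
DISALLOW_COUNTS = {
    "CCBot": 118, "ClaudeBot": 104, "GPTBot": 93, "Bytespider": 90,
    "Meta-ExternalAgent": 84, "Applebot-Extended": 81,
    "Google-Extended": 81, "PerplexityBot": 75, "Amazonbot": 70,
}


def share(count: int, total: int = CORPUS_WITH_ROBOTS) -> float:
    """Percentage of the robots.txt-publishing corpus, to one decimal."""
    return round(100 * count / total, 1)


for token, count in DISALLOW_COUNTS.items():
    print(f"{token:<20}{count:>5}{share(count):>7.1f}%")

# The counts overlap: one site can disallow several tokens, so the
# column total exceeds the number of sites.
total_disallows = sum(DISALLOW_COUNTS.values())
print(total_disallows)  # 796
```

Because the counts overlap, the shares should not be summed or read as a partition of the corpus.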

Across the 417 sites, 84 have published an llms.txt file, 20.1% of the robots.txt-publishing corpus. Energy's 6 allowing sites have not published llms.txt files; they have simply left their robots.txt without AI-specific disallow rules. The llms.txt proposal is designed to give language models a curated, structured overview of a site's content rather than to grant or deny crawl permissions, so adopting it would complement their existing robots.txt policies, not replace them.

What the Sealed Data Shows — and What It Cannot Show

This is point-in-time data — nothing is estimated, modeled, or extrapolated. US Tech Automations Research fetched each robots.txt from public URLs on June 14, 2026, parsed every user-agent block and disallow rule, matched against a standardized list of 9 AI crawler tokens, and sealed the record under snapshot sha c5960481aa465ad3.

The data is clear on what Energy sites have published. It cannot speak to:

  • Proprietary or access-controlled datasets these operators maintain behind authentication

  • Legal agreements those operators may have with specific AI companies

  • Whether any robots.txt rule changes occurred before or after the seal date

  • Why 4 of the 10 Energy sites publish no robots.txt

The methodology:

  1. Fetch. Request robots.txt from each domain root. Log HTTP response. Sites without a file become noRobotsSites.

  2. Parse. Extract user-agent blocks and disallow paths. Identify all named AI crawler tokens.

  3. Count. A site is a blocker only if it disallows at least one named AI crawler token. Zero Energy sites meet that threshold.

  4. Seal. Apply sha256 hash to the complete dataset. Publish hash. Freeze the record.
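The parse, count, and seal steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the report's actual code: it treats any non-empty Disallow in a group naming a tracked token as a block, ignores wildcard (`*`) groups, and seals a canonical JSON serialization. The function names are hypothetical.

```python
import hashlib
import json

# The 9 tracked tokens from the leaderboard, lowercased for matching.
AI_TOKENS = {
    "ccbot", "claudebot", "gptbot", "bytespider", "meta-externalagent",
    "applebot-extended", "google-extended", "perplexitybot", "amazonbot",
}


def disallowed_ai_tokens(robots_txt: str) -> set[str]:
    """Tracked tokens whose user-agent group has a non-empty Disallow."""
    blocked: set[str] = set()
    agents: list[str] = []   # user-agents of the group being read
    in_rules = False         # True once the group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:     # a user-agent after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:
                blocked.update(a for a in agents if a in AI_TOKENS)
    return blocked


def is_blocker(robots_txt: str) -> bool:
    """A site is a blocker if it disallows at least one tracked token."""
    return bool(disallowed_ai_tokens(robots_txt))


def seal(records: dict) -> str:
    """SHA-256 hex digest over a key-sorted JSON dump of the dataset."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Under this rule a file containing only `User-agent: *` groups never counts as a blocker, which matches the report's definition of blocking by named token.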

Frequently Asked Questions

Q: Does a 0% block rate mean Energy companies do not care about AI training on their content?

A: Not necessarily. It means their public websites have not deployed AI-specific disallow rules in their robots.txt. Large energy companies hold proprietary data (seismic surveys, well logs, grid models) that never appears on their public websites and therefore does not need robots.txt protection. Their public website content is promotional and corporate, optimized for maximum reach, which makes blocking AI crawlers counterproductive.

Q: Why do 4 of 10 Energy sites have no robots.txt at all?

A: The snapshot records only that chevron.com, bp.com, duke-energy.com, and aep.com returned no robots.txt, not why. A plausible explanation is that these are large, legacy-built web presences where crawl policy has not been a priority, and awareness of AI-specific crawling is relatively recent. These sites may add robots.txt files or AI-specific rules in future; the sealed data shows the state on June 14, 2026 only.

Q: Is the Energy vertical likely to move toward blocking in the future?

A: This snapshot makes no predictions. We have a single sealed observation; trend direction requires multiple observations over time. What the data shows is the baseline: as of June 14, 2026, no Energy operator in this snapshot had engaged with AI crawling restrictions. If a major operator adds a block in a future snapshot, it would stand out sharply against this 0% baseline.

Q: Does the honor-system nature of robots.txt matter more for Energy than for other verticals?

A: It matters equally across all verticals. robots.txt is advisory — compliant AI crawlers respect it, non-compliant scrapers ignore it. Energy companies that want to protect sensitive content must rely on authentication, access controls, and legal agreements — not robots.txt. For their publicly published content, the honor system is the only protection robots.txt provides, and they have chosen not to invoke it.

Q: How does Energy compare to other regulated-sector verticals in this snapshot?

A: Banking (0%, 7 sites, 0 blockers) and Telecom (0%, 6 sites, 0 blockers) match Energy's position exactly. All three are legacy-corporate, regulated-sector verticals with the same pattern: published robots.txt files focused on traditional crawl management, no AI-specific disallow rules. See banking and telecom for those parallel reports.

Put AI-Access Data to Work

Three practitioner profiles find specific, recurring value in Energy-category AI-access data:

AI product and strategy leads at energy intelligence platforms, grid analytics companies, and climate data providers need to know which major energy operators currently allow AI training scrapes of their public content — and would need to know immediately if any of them add a block.

The workflow: a weekly robots.txt re-crawl of the 10 Energy sites in this snapshot, with an automated alert the moment any of the 6 currently-allowing sites (or the 4 no-policy sites) publishes a new AI disallow rule. The Energy 0% baseline is the anchor; any departure from it is a significant policy signal. For contrast, cybersecurity shows a similarly professional vertical that is only slightly more restrictive in this snapshot (1 of 9 sites, 11.1%).
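The change-detection step of such a workflow might look like the sketch below. It assumes each crawl has already been reduced to a mapping from domain to the set of AI tokens that domain disallows; the `new_ai_blocks` name and the `.example` domains are hypothetical placeholders, not observed data.

```python
def new_ai_blocks(
    baseline: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Map each domain to tokens disallowed now but not in the baseline."""
    alerts: dict[str, set[str]] = {}
    for domain, tokens in current.items():
        # A domain absent from the baseline (for example, one that had
        # no robots.txt) is treated as having blocked nothing before.
        added = tokens - baseline.get(domain, set())
        if added:
            alerts[domain] = added
    return alerts


# Hypothetical example: a 0% baseline, then one site adds two rules.
baseline = {"operator-a.example": set(), "operator-b.example": set()}
current = {
    "operator-a.example": set(),
    "operator-b.example": {"gptbot", "ccbot"},
    "operator-c.example": set(),  # newly published file, no AI rules
}
print(new_ai_blocks(baseline, current))
```

Comparing token sets rather than raw file text avoids alerting on edits that do not touch AI rules, such as a changed sitemap line.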

Sustainability and ESG research analysts building AI-augmented research tools on energy sector content need a clean inventory of which sources are currently accessible to crawling pipelines. ExxonMobil, Shell, NextEra, Dominion, Southern Company, and National Grid publish sustainability reports, investor materials, and policy statements that are high-signal for ESG analysis. The workflow: monthly robots.txt audit across the full Energy watchlist, with a flag added to any source whose policy changes before the next quarterly report cycle. Cross-category comparisons — such as the more defensive posture in aviation — help contextualize why Energy is positioned where it is.

Data compliance and legal counsel at firms that use energy sector web content in AI systems need to document the access status of each source as part of due diligence. A sealed snapshot from June 14, 2026 (sha c5960481aa465ad3) provides a dated, hash-sealed record of each site's published robots.txt policy on that date.

The recurring workflow: quarterly snapshot audit, change-log maintained per domain, and a review triggered whenever any Energy operator makes a public announcement about AI data policy. US Tech Automations automates the monitoring layer: scheduled robots.txt crawls, per-operator change detection, and a timestamped audit log.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha c5960481aa465ad3).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Energy Sites Block AI Crawlers? None Do.” https://ustechautomations.com/resources/blog/do-energy-sites-block-ai-crawlers-2026

Sealed snapshot sha256: c5960481aa465ad3

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.