
Do Real Estate Sites Block AI Crawlers? Sealed robots.txt Data

Jun 13, 2026

2 of 7 Real Estate sites block at least one AI crawler.

Real Estate sites block AI crawlers at a 28.6% rate.

72 of 157 sites block at least one AI crawler across the corpus.

Key Takeaways

2 of 7 Real Estate sites with parseable robots.txt files block at least one AI crawler.

Real Estate lands at a 28.6% block rate — well below the corpus-wide rate of 45.9%.

Of 10 Real Estate sites checked, 3 returned no parseable robots.txt at all.

The real estate category sits on the more permissive end of the AI-access spectrum. Only zillow.com and realtor.com have placed any AI-crawler disallows in their robots.txt files. Five other sites (redfin.com, trulia.com, century21.com, compass.com, and coldwellbanker.com) returned parseable robots.txt files that name none of the 9 tracked AI bots in a disallow directive. Three sites (apartments.com, homes.com, loopnet.com) returned no parseable robots.txt at all, so they have declared no crawl policy under the honor system.

This report draws exclusively from a point-in-time snapshot sealed June 13, 2026 (sha 9ceca3bdf0dfeaca). Every figure is a verbatim count from that snapshot; nothing is estimated, modeled, or extrapolated.


What This Snapshot Measures

The US Tech Automations Research team collected and parsed public robots.txt files from 182 prominent websites across 16 content categories. Of those 182, a total of 157 returned a parseable robots.txt file. The Closing Web project checks for disallow directives targeting 9 named AI crawlers: CCBot, ClaudeBot, GPTBot, Bytespider, PerplexityBot, Meta-ExternalAgent, Applebot-Extended, Google-Extended, and Amazonbot.

Across the full corpus of 157 sites, 72 — or 45.9% — block at least one of those crawlers. That is the benchmark against which every category in this report is compared.

For the Real Estate category specifically, 10 sites were checked. Of those 10, 7 returned a parseable robots.txt. Of those 7, only 2 have issued any AI-crawler block — producing a block rate of 28.6%.



Category Snapshot: Real Estate

The table below summarizes the Real Estate category results from the June 2026 sealed snapshot.

| Category | Sites Checked | With robots.txt | Blocking Any AI | Block Rate |
| --- | --- | --- | --- | --- |
| Real Estate | 10 | 7 | 2 | 28.6% |

Of the 7 Real Estate sites with a parseable robots.txt, 2 have blocked at least one AI crawler. The remaining allowers — redfin.com, trulia.com, century21.com, compass.com, and coldwellbanker.com — have issued no AI-specific disallows and remain fully open to the 9 tracked bots.

The Blockers: zillow.com and realtor.com

zillow.com and realtor.com are the only two Real Estate sites in this snapshot that have taken an active stance against at least one AI crawler. They represent the minority within the category — but their blocking posture aligns with the broader trend among high-traffic listing platforms that have begun treating AI-training crawlers differently from traditional search-engine bots.

A robots.txt disallow is an honor-system signal, not technical enforcement. Compliant bots respect it; non-compliant actors do not. The snapshot captures declared policy, not enforced access control.

The Allowers: redfin.com, trulia.com, century21.com, compass.com, coldwellbanker.com

Five Real Estate sites returned parseable robots.txt files that do not block any of the 9 tracked AI crawlers. These sites — redfin.com, trulia.com, century21.com, compass.com, and coldwellbanker.com — have effectively declared open access to AI training and retrieval systems, at least under the honor-system framework that robots.txt provides.

Whether that openness reflects a deliberate strategy, an oversight, or an evolving policy that has not yet been updated is outside the scope of a robots.txt snapshot. What the snapshot confirms is their current declared posture.

The No-Robots Sites: apartments.com, homes.com, loopnet.com

Three Real Estate sites — apartments.com, homes.com, and loopnet.com — returned no parseable robots.txt file. This does not mean they have no access policy; it means no honor-system signal was detectable at the time of the snapshot. AI crawlers that respect robots.txt will find no instructions and will typically default to full access.
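The default-to-access behavior can be illustrated with Python's standard-library robots.txt parser. This is a minimal sketch, not the snapshot's own tooling: the `may_fetch` helper and the example.com URL are assumptions made for illustration, and real crawlers may parse robots.txt differently.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_lines, user_agent, url):
    """Return True if a compliant crawler would consider `url` fetchable."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules supplied as a list of text lines
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/listings"  # hypothetical URL for illustration

# No rules at all: a compliant crawler finds nothing to obey.
no_rules = may_fetch([], "GPTBot", URL)

# An explicit disallow that names one bot.
rules = ["User-agent: GPTBot", "Disallow: /"]
blocked = may_fetch(rules, "GPTBot", URL)
other_bot = may_fetch(rules, "CCBot", URL)

print(no_rules, blocked, other_bot)  # True False True
```

When this parser fetches the file itself through `read()`, CPython's implementation treats a 401 or 403 response as disallow-all and other 4xx responses as allow-all, so the effect of a missing file can depend on why it was unavailable.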

For competitive-intelligence and AI-strategy purposes, a missing robots.txt is a meaningful data point: it indicates a site has not yet formally declared its position on AI crawling.


All 16 Categories: Cross-Category Ranking

The table below shows all 16 categories from the June 2026 Closing Web snapshot, ordered by block rate. Real Estate ties with Legal at 28.6%, placing both categories toward the bottom of the ranking — well below the corpus average of 45.9%.

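| Category | Sites Checked | With robots.txt | Blocking Any AI | Block Rate |
| --- | --- | --- | --- | --- |
| News | 20 | 15 | 13 | 86.7% |
| Food | 10 | 10 | 7 | 70% |
| Tech | 15 | 13 | 9 | 69.2% |
| Entertainment | 9 | 9 | 6 | 66.7% |
| Healthcare | 10 | 9 | 6 | 66.7% |
| Reference | 14 | 11 | 6 | 54.5% |
| Automotive | 10 | 9 | 4 | 44.4% |
| Social | 10 | 10 | 4 | 40% |
| Sports | 10 | 10 | 4 | 40% |
| Travel | 9 | 9 | 3 | 33.3% |
| Legal | 10 | 7 | 2 | 28.6% |
| Real Estate | 10 | 7 | 2 | 28.6% |
| Finance | 12 | 11 | 2 | 18.2% |
| Retail | 15 | 12 | 2 | 16.7% |
| Education | 9 | 7 | 1 | 14.3% |
| Government | 9 | 8 | 1 | 12.5% |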

News leads all categories at 86.7%. Real Estate sits in the lower tier, clustered with Legal at 28.6%. Finance, Retail, Education, and Government are even more permissive. The cross-category spread — from 12.5% to 86.7% — demonstrates that AI-access posture varies dramatically by content type and business model.
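The rates in the cross-category table can be re-derived from its own counts. The sketch below is a minimal check, assuming the block rate is the number of blocking sites divided by the number of sites with a parseable robots.txt (not by sites checked), rounded to one decimal place. The variable and function names are illustrative.

```python
# (sites with parseable robots.txt, sites blocking any tracked AI crawler),
# copied from the cross-category table.
CATEGORIES = {
    "News": (15, 13),
    "Food": (10, 7),
    "Tech": (13, 9),
    "Entertainment": (9, 6),
    "Healthcare": (9, 6),
    "Reference": (11, 6),
    "Automotive": (9, 4),
    "Social": (10, 4),
    "Sports": (10, 4),
    "Travel": (9, 3),
    "Legal": (7, 2),
    "Real Estate": (7, 2),
    "Finance": (11, 2),
    "Retail": (12, 2),
    "Education": (7, 1),
    "Government": (8, 1),
}


def block_rate(blocking, parseable):
    """Percentage of parseable sites that block at least one tracked crawler."""
    return round(100 * blocking / parseable, 1)


rates = {name: block_rate(b, p) for name, (p, b) in CATEGORIES.items()}
ranked = sorted(rates, key=rates.get, reverse=True)  # highest rate first

# Summing the per-category counts reproduces the corpus-wide figures.
total_parseable = sum(p for p, _ in CATEGORIES.values())
total_blocking = sum(b for _, b in CATEGORIES.values())
corpus_rate = block_rate(total_blocking, total_parseable)

print(ranked[0], rates["Real Estate"], corpus_rate)  # News 28.6 45.9
```

Using the parseable count as the denominator matters for Real Estate: 2 of 7 gives 28.6%, whereas 2 of the 10 sites checked would give 20%.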

For readers who want to examine how other specific sectors compare, the Healthcare category report and the Sports category report offer detailed breakdowns of their respective site-level blocking decisions.


Corpus-Wide Bot and Operator Leaderboards

The following tables capture which AI bots and which operators are most frequently blocked across all 157 sites in the full corpus — not just Real Estate. These figures are corpus-wide.

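| Bot | Sites Blocking (of 157) | Block Rate |
| --- | --- | --- |
| CCBot | 58 | 36.9% |
| ClaudeBot | 53 | 33.8% |
| GPTBot | 45 | 28.7% |
| Bytespider | 44 | 28% |
| PerplexityBot | 42 | 26.8% |
| Meta-ExternalAgent | 39 | 24.8% |
| Applebot-Extended | 39 | 24.8% |
| Google-Extended | 37 | 23.6% |
| Amazonbot | 31 | 19.7% |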

CCBot, operated by Common Crawl, is the most widely blocked bot across all 157 sites. ClaudeBot follows closely. GPTBot and Bytespider are both blocked by more than a quarter of the corpus. Amazonbot, at 19.7%, is the least-blocked of the 9 tracked bots.

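| Operator | Sites Blocking (of 157) |
| --- | --- |
| Common Crawl | 58 |
| Anthropic | 55 |
| OpenAI | 47 |
| Meta | 45 |
| ByteDance | 44 |
| Perplexity | 42 |
| Apple | 39 |
| Google | 37 |
| Cohere | 36 |
| Diffbot | 36 |
| Amazon | 31 |
| Mistral | 15 |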

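Common Crawl and Anthropic are the most frequently blocked operators. Mistral, with only 15 sites blocking its crawlers, is the least-blocked of the 12 listed operators. The operator table covers more ground than the bot table: it includes Cohere, Diffbot, and Mistral, none of which operates any of the 9 tracked bots, and some operator counts exceed the count for that operator's tracked bot (Anthropic at 55 versus ClaudeBot at 53, OpenAI at 47 versus GPTBot at 45). That pattern suggests the operator tally counts additional user agents beyond the 9 tracked bots, though this report does not itemize them. The gap between Common Crawl (58) and Mistral (15) shows how unevenly blocking is distributed across operators.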

These leaderboard figures are corpus-wide. The Real Estate category's 2 blockers each named their own specific combination of bots, and the leaderboards do not break counts out per category.

Additionally, 27 sites across the full 157-site corpus (17.2%) have deployed an llms.txt file, a newer opt-in signal for AI-readable content. That figure is a corpus-wide count from the same sealed snapshot.

For a parallel perspective on a category that shares the same 28.6% block rate, see the Legal sites report.


Methodology

The US Tech Automations Research team collected the public robots.txt file for each of the 182 sites in the June 2026 Closing Web corpus. Each file was parsed for user-agent strings matching any of the 9 tracked AI crawlers. A site is counted as "blocking" if at least one of those 9 bots is named in a disallow directive, regardless of how many paths are disallowed or how many other bots are also named.

A site is counted as "allowing" if a parseable robots.txt was found and none of the 9 bots appears in a disallow directive. A site is counted as having "no robots.txt" if the file was absent or unparseable.
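The counting rules above can be sketched in a few lines of Python. This is an illustrative reading of the stated methodology, not the report's actual parser, which is not published here. The function names, the category labels, and the choice to treat `None` as "absent or unparseable" are assumptions, as is the choice to ignore an empty `Disallow:` line (which conventionally permits everything).

```python
from typing import Optional, Set

TRACKED_BOTS = (
    "CCBot", "ClaudeBot", "GPTBot", "Bytespider", "PerplexityBot",
    "Meta-ExternalAgent", "Applebot-Extended", "Google-Extended", "Amazonbot",
)


def blocked_bots(robots_txt: str) -> Set[str]:
    """Return the tracked bots named in a group with a non-empty Disallow."""
    tracked = {bot.lower(): bot for bot in TRACKED_BOTS}
    blocked: Set[str] = set()
    group_agents = []   # user-agents of the group currently being read
    in_rules = False    # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:
                # A wildcard group ("*") does not name any tracked bot.
                blocked.update(tracked[a] for a in group_agents if a in tracked)
    return blocked


def classify(robots_txt: Optional[str]) -> str:
    """Classify a site; None stands for an absent or unparseable file."""
    if robots_txt is None:
        return "no robots.txt"
    return "blocking" if blocked_bots(robots_txt) else "allowing"


sample = "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n\nUser-agent: *\nDisallow: /private\n"
print(sorted(blocked_bots(sample)), classify(sample))  # ['CCBot', 'GPTBot'] blocking
```

Under this reading, a site that only restricts `User-agent: *` counts as allowing, because none of the 9 bots is named in a disallow directive.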

Nothing is estimated, modeled, or extrapolated. Every count in this report is a direct, verbatim read from the snapshot sealed June 13, 2026 under sha 9ceca3bdf0dfeaca. No inferences are drawn about enforcement, compliance, or the legal standing of these directives. robots.txt is a public, honor-system protocol — it communicates declared intent, not guaranteed outcome.


FAQ

Q: Does a robots.txt block actually prevent AI crawlers from scraping a site?

A: No. robots.txt operates on the honor system. A compliant crawler will read the file and respect its directives. A non-compliant actor will ignore it entirely. The snapshot records what sites have declared — not what crawlers have actually done. Legal and technical enforcement options exist separately from robots.txt.

Q: Why do only 2 of the 7 Real Estate sites with robots.txt files block AI crawlers?

A: The snapshot captures a point-in-time posture. Some sites may not have updated their robots.txt to address AI crawlers yet. Others may have made deliberate decisions to remain open to AI indexing for discoverability or traffic reasons. The data reflects declared policy; motivations are outside the scope of the sealed snapshot.

Q: What does it mean that apartments.com, homes.com, and loopnet.com have no parseable robots.txt?

A: It means no honor-system signal was detectable at snapshot time. Crawlers that respect robots.txt will typically default to full access when no file is present. For competitive-intelligence purposes, it marks these sites as having an undeclared AI-crawling posture as of June 13, 2026.

Q: Are the corpus-wide bot leaderboard figures specific to Real Estate sites?

A: No. The bot and operator leaderboards in this report reflect counts across all 157 sites in the full corpus. They show which bots and operators face the most widespread blocking industry-wide — they are not filtered to Real Estate. For Real Estate-specific blocking, only 2 sites are relevant, and they named their own specific bot combinations.

Q: How does the Real Estate block rate compare to the broader corpus?

A: At 28.6%, Real Estate is below the corpus-wide rate of 45.9% across all 157 sites. Real Estate ties with Legal for the same block rate. Categories like News (86.7%), Food (70%), and Tech (69.2%) are far more aggressive in blocking AI crawlers. Finance, Retail, Education, and Government are all even more permissive than Real Estate.


Put AI-Access Data to Work

Real estate professionals, content strategists, and data engineers can all extract recurring operational value from this sealed snapshot — not as a one-time read, but as a trigger for ongoing monitoring.

An SEO or content lead at a real estate technology company has a concrete signal here: the two largest listing platforms in the category have issued AI-crawler blocks, while five competitors remain open. That split creates a content-discoverability gap that changes each time any of those sites updates its robots.txt. A useful automated job re-crawls these 10 real estate domains weekly and alerts the moment a new site adds an AI-crawler disallow — so the SEO team can update AI-visibility strategy before the change cascades through model training pipelines.

A publisher RevOps lead at a real estate media outlet can use this data to track whether content from open sites (redfin.com, trulia.com, century21.com, compass.com, coldwellbanker.com) is being indexed by AI platforms while content from closed sites (zillow.com, realtor.com) is not. A recurring job that compares this report to live robots.txt snapshots on a monthly cadence surfaces policy drift before it becomes a revenue surprise.

A retrieval or data engineer building a real estate knowledge base or RAG system needs to know which sources are explicitly off-limits under the honor system and which are open. Automating a monthly re-fetch of all 10 real estate robots.txt files and diffing against the sealed baseline (sha 9ceca3bdf0dfeaca) makes compliance monitoring a background process rather than a manual audit. The Automotive category report shows a comparable workflow for another vertical.
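A re-fetch-and-diff job of this kind could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the per-file SHA-256 fingerprint is a stand-in chosen for this example and is not the scheme behind the sealed snapshot sha, the function names are hypothetical, and the domains in the demo are placeholders rather than sites from the corpus.

```python
import hashlib
import urllib.request
from typing import Dict, List, Optional


def fetch_robots(domain: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch https://<domain>/robots.txt; return None if it cannot be retrieved."""
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return None


def fingerprint(robots_txt: Optional[str]) -> Optional[str]:
    """Hash the file content; None marks a missing file."""
    if robots_txt is None:
        return None
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()


def changed_domains(baseline: Dict[str, Optional[str]],
                    current: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted domains whose fingerprint differs between snapshots."""
    domains = set(baseline) | set(current)
    return sorted(d for d in domains if baseline.get(d) != current.get(d))


# Demo with placeholder domains and in-memory content (no network access).
baseline = {
    "a.example": fingerprint("User-agent: *\nDisallow:\n"),
    "b.example": fingerprint("User-agent: *\nDisallow: /admin\n"),
    "c.example": None,  # no robots.txt at baseline time
}
current = {
    "a.example": fingerprint("User-agent: GPTBot\nDisallow: /\n"),
    "b.example": fingerprint("User-agent: *\nDisallow: /admin\n"),
    "c.example": fingerprint("User-agent: *\nDisallow:\n"),
}
changed = changed_domains(baseline, current)
print(changed)  # ['a.example', 'c.example']
```

In a scheduled job, `current` would be built by calling `fetch_robots` for each monitored domain, and any domain in `changed` would be re-parsed to see whether an AI-crawler disallow was added or removed.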

US Tech Automations builds agentic workflows that automate exactly this kind of recurring monitoring — fetching, parsing, diffing, and alerting on robots.txt changes so your team acts on signals rather than static snapshots. See how agentic workflows handle AI-access monitoring at scale.


Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 13, 2026 (snapshot sha 9ceca3bdf0dfeaca).


Cite this report

US Tech Automations Research, 2026-06 edition. “Do Real Estate Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-real-estate-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 9ceca3bdf0dfeaca

Machine-readable data: CSV · JSON · All research & methodology

About the Author

Garrett Mullins
Workflow Specialist

Helping businesses leverage automation for operational efficiency.