Do Genealogy Sites Block AI Crawlers? 4 of 10 Do

Jun 14, 2026

When a category lands almost exactly on the corpus average, that is itself a signal worth examining. The Genealogy category in the June 2026 Closing Web edition does precisely that: all 10 sites checked returned a parseable robots.txt file, and 4 of those 10 (40%) block at least one AI crawler. The corpus-wide rate across all 354 sites with parseable robots.txt files is 39.3%. Genealogy is, in this edition, a category that sits right at the average.

That average masks an interesting split. Genealogy sites range from large commercial archives holding billions of records (subscription businesses whose core asset is proprietary historical data) to community-built platforms that operate as open, collaborative family trees. All four blockers are commercial archives, and the community platforms are all allowers. The data reflects a sector that has not reached consensus on AI crawler policy, with the divide running largely along the fault line between proprietary and open-access models.

4 of 10 Genealogy sites block at least one AI crawler.

Genealogy sites post a 40% AI-crawler block rate.

Corpus-wide, 139 of 354 sites block at least one AI crawler.

Key Takeaways

The Genealogy category sits within a percentage point of the corpus average (40% vs. 39.3%), but the split within the category is structurally meaningful.

4 of 10 Genealogy sites block at least one AI crawler — a 40% block rate.

All 10 Genealogy sites returned a parseable robots.txt on June 14, 2026.

CCBot is blocked by 109 of 354 corpus sites, making it the most-targeted AI bot in this edition.

Who Gates the Crawlers in Genealogy

The four Genealogy sites blocking at least one AI crawler are ancestry.com, myheritage.com, findagrave.com, and newspapers.com. These properties have one structural thing in common: they hold large, digitized, searchable collections that represent years of investment in data acquisition, OCR processing, and record linking. Ancestry and MyHeritage are subscription businesses whose revenue depends on exclusive access to their record holdings. Findagrave and Newspapers.com are archive properties, both part of the same corporate family as Ancestry, that similarly hold large collections of proprietary, curated historical content.

For these sites, blocking AI training crawlers is a straightforward content-protection decision. Their value proposition is searchable access to records not available elsewhere. Allowing an AI company to train on those records without license or fee would undermine the exclusivity that justifies the subscription price.

ancestry.com, myheritage.com, findagrave.com, and newspapers.com are the four Genealogy sites blocking at least one AI crawler.

4 of 10 Genealogy sites block at least one AI crawler, a 40% block rate close to the corpus-wide average of 39.3%.

The six Genealogy sites with parseable robots.txt files that allow every crawler are familysearch.org, geni.com, wikitree.com, fold3.com, billiongraves.com, and genealogybank.com. FamilySearch is operated by The Church of Jesus Christ of Latter-day Saints as a free, open-access service — its mission is maximum access, not monetization of data exclusivity.

WikiTree is a collaborative, open family tree platform whose entire value proposition depends on open access and community contribution. Geni similarly operates as an open collaborative tree. BillionGraves and GenealogyBank represent a mixed case — both are commercial services, but their robots.txt files at the time of this snapshot impose no AI-crawler restrictions.

The per-site breakdown for the full Genealogy panel is shown below.

Site	Has robots.txt	Blocks AI Crawler
ancestry.com	Yes	Yes
myheritage.com	Yes	Yes
findagrave.com	Yes	Yes
newspapers.com	Yes	Yes
familysearch.org	Yes	No
geni.com	Yes	No
wikitree.com	Yes	No
fold3.com	Yes	No
billiongraves.com	Yes	No
genealogybank.com	Yes	No

All 10 Genealogy sites returned a parseable robots.txt, making Genealogy one of five categories in the 40-category table (with Social, Sports, Fitness, and Photography) that combine complete coverage with a 40% block rate.

"4 of 10 Genealogy sites block at least one AI crawler, placing the category within a percentage point of the corpus-wide average of 39.3% among sites with parseable robots.txt files."

"The six Genealogy allowers — familysearch.org, geni.com, wikitree.com, fold3.com, billiongraves.com, and genealogybank.com — allow every AI crawler examined in the June 14, 2026 snapshot."

Where Genealogy Falls in the 40-Category Picture

Genealogy at 40% sits in the middle tier of the 40-category ranking, clustered with Social, Sports, Fitness, and Photography (all at 40%). This cluster is distinct from the high-restriction top tier (Gaming 88.9%, News 82.4%) and the low-restriction bottom tier (Productivity 10%; Nonprofit, Streaming, and Dating at 0%).

Category	Sites Checked	With robots.txt	Any Blocker	Block Rate
Gaming	9	9	8	88.9%
News	20	17	14	82.4%
Food	10	10	7	70%
Tech	15	13	9	69.2%
Entertainment	9	9	6	66.7%
Healthcare	10	9	6	66.7%
Music	10	9	6	66.7%
Parenting	10	8	5	62.5%
Outdoors	10	5	3	60%
Reference	14	11	6	54.5%
Science	10	10	5	50%
Wedding	10	8	4	50%
Automotive	10	9	4	44.4%
HomeGarden	10	9	4	44.4%
Fashion	9	7	3	42.9%
Social	10	10	4	40%
Sports	10	10	4	40%
Fitness	10	10	4	40%
Photography	10	10	4	40%
Genealogy	10	10	4	40%
Jobs	10	8	3	37.5%
Travel	9	9	3	33.3%
Weather	10	6	2	33.3%
Beauty	10	6	2	33.3%
Legal	10	7	2	28.6%
RealEstate	10	7	2	28.6%
Pets	10	7	2	28.6%
Crafts	10	8	2	25%
Finance	12	11	2	18.2%
Retail	15	12	2	16.7%
Education	9	7	1	14.3%
Government	9	8	1	12.5%
Crypto	9	8	1	12.5%
Books	9	8	1	12.5%
Religion	10	9	1	11.1%
Insurance	10	9	1	11.1%
Productivity	10	10	1	10%
Nonprofit	10	6	0	0%
Streaming	10	10	0	0%
Dating	10	5	0	0%
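
The Block Rate column in the table above is computed against sites with a parseable robots.txt, not against all sites checked (News, for example, is 14 of 17, not 14 of 20). A minimal Python sketch of that arithmetic follows; the function name is illustrative, and the sample rows are copied from the table.

```python
def block_rate(any_blocker, with_robots):
    """Percent of sites with a parseable robots.txt that block an AI crawler."""
    if with_robots == 0:
        return None  # no parseable files, so the rate is undefined
    return round(100 * any_blocker / with_robots, 1)

# (sites checked, with robots.txt, any blocker), copied from the table
rows = {
    "News": (20, 17, 14),
    "Reference": (14, 11, 6),
    "Genealogy": (10, 10, 4),
}

for name, (checked, with_robots, blockers) in rows.items():
    print(name, block_rate(blockers, with_robots))

# Corpus-wide figure: 139 blockers among 354 parseable files
print("Corpus", block_rate(139, 354))
```

Using the parseable-file count as the denominator keeps categories with many missing robots.txt files from looking more permissive than the evidence supports.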

Genealogy at 40% sits well below Reference at 54.5%, a category that includes encyclopedias, dictionaries, and curated knowledge bases, many of which also hold large proprietary databases. The gap (54.5% vs. 40%) may reflect the presence of open-mission platforms like FamilySearch and WikiTree, which pull the Genealogy average downward.

For how a related data-archive category handles AI access, see how Outdoors sites, which hold proprietary trail data, land at 60%. For contrast on the open-platform end of the spectrum, see Productivity sites at 10%.

The Operator-Level Picture Across the Full Corpus

The following bot and operator figures cover all 354 sites with parseable robots.txt files in the 418-site corpus — not Genealogy specifically.

Bot	Sites Blocking (of 354)	Block Rate
CCBot	109	30.8%
ClaudeBot	96	27.1%
GPTBot	83	23.4%
Bytespider	83	23.4%
Meta-ExternalAgent	78	22%
Google-Extended	76	21.5%
Applebot-Extended	74	20.9%
PerplexityBot	73	20.6%
Amazonbot	64	18.1%

CCBot (Common Crawl) faces blocks from 109 of 354 corpus sites — 30.8%. ClaudeBot (Anthropic) is blocked by 96 sites. GPTBot (OpenAI) and Bytespider (ByteDance) are each blocked by 83 sites. Meta-ExternalAgent is blocked by 78 sites. At the operator level corpus-wide: Common Crawl is blocked by 109 sites, Anthropic by 104, Meta by 89, OpenAI by 87, ByteDance by 83, Google by 76, Perplexity by 74, Apple by 74, Cohere by 68, Amazon by 64, Diffbot by 64, and Mistral by 24.
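
Some operator totals above exceed the total for that operator's listed bot (Anthropic at 104 versus ClaudeBot at 96, for instance). The report does not say how operator counts are derived. One plausible reading is that a site counts against an operator when it blocks any of that operator's user-agent tokens. The Python sketch below illustrates that roll-up under this assumption; the token-to-operator mapping and the sample sites are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical mapping; the report's actual token list is not published here.
BOT_OPERATOR = {
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "GPTBot": "OpenAI",
    "CCBot": "Common Crawl",
}

def operator_block_counts(site_blocks):
    """Count sites blocking at least one token per operator (assumed rule)."""
    sites_by_operator = defaultdict(set)
    for site, bots in site_blocks.items():
        for bot in bots:
            operator = BOT_OPERATOR.get(bot)
            if operator is not None:
                sites_by_operator[operator].add(site)
    return {op: len(sites) for op, sites in sites_by_operator.items()}

# Made-up sample: which tokens each site disallows
sample = {
    "site-a.example": {"ClaudeBot", "GPTBot"},
    "site-b.example": {"anthropic-ai"},
    "site-c.example": {"CCBot", "ClaudeBot", "anthropic-ai"},
}

counts = operator_block_counts(sample)
print(counts)
```

In this sample, ClaudeBot alone is blocked by two sites, but Anthropic is counted on three, because one site blocks only the second token.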

These corpus-wide figures indicate that the most-targeted operators are those running general-purpose training crawls at scale. In the Genealogy context, ancestry.com and myheritage.com are most likely blocking the training-focused bots (CCBot, ClaudeBot, GPTBot) rather than product-specific bots.

CCBot is blocked by 109 of 354 corpus sites — the most-blocked AI bot in the June 2026 Closing Web edition.

How This Snapshot Was Built

This report reflects a point-in-time crawl conducted June 14, 2026 — nothing is estimated, modeled, or extrapolated. The full methodology:

Crawl. Every site in the 40-category, 418-site Closing Web panel had its robots.txt fetched on June 14, 2026.
Parse. Each file was parsed against a fixed list of AI crawler user-agent tokens to identify disallow directives.
Classify. Each site was assigned one of three states: no parseable robots.txt; parseable robots.txt blocking at least one AI crawler; parseable robots.txt blocking no AI crawlers.
Seal. The complete dataset was content-hashed (sha256 27ca61d890a647db) and stored in an append-only archive for auditability.
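
The Parse, Classify, and Seal steps could be sketched roughly as follows. This is a minimal illustration built on Python's standard-library urllib.robotparser, not the pipeline actually used for this report. The token list (the nine bots in the corpus table), the rule that a token counts as blocked when it may not fetch the site root, the state labels, and the function names are all assumptions.

```python
import hashlib
import json
from urllib.robotparser import RobotFileParser

# Assumed token list: the nine bots named in the corpus-wide table.
AI_TOKENS = [
    "CCBot", "ClaudeBot", "GPTBot", "Bytespider", "Meta-ExternalAgent",
    "Google-Extended", "Applebot-Extended", "PerplexityBot", "Amazonbot",
]

def blocked_tokens(robots_text):
    """Return the AI tokens that may not fetch "/" under this robots.txt.

    Assumption: a wildcard "User-agent: *" with "Disallow: /" counts every
    token as blocked. The report does not state how it treats that case.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [t for t in AI_TOKENS if not parser.can_fetch(t, "/")]

def classify(robots_text):
    """Assign one of the three states described in the Classify step."""
    if robots_text is None:
        return "no_parseable_robots"
    return "blocks_ai" if blocked_tokens(robots_text) else "allows_all_ai"

def seal(results):
    """Hash a canonical JSON encoding so any later change is detectable."""
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Placeholder sites and file contents, not real crawl data
sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
results = {
    "blocker.example": classify(sample),
    "open.example": classify("User-agent: *\nAllow: /\n"),
    "missing.example": classify(None),
}
print(results)
print(seal(results))
```

Sorting keys before hashing means the digest depends only on the data, not on the order in which sites were crawled.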

All 10 Genealogy sites returned a parseable robots.txt — a coverage rate that reflects the technical sophistication of these platforms, whether or not they chose to use that file to block crawlers.

Frequently Asked Questions

Q: Why do the commercial archive sites block AI crawlers while the open-access platforms do not?

A: The commercial archives — ancestry.com, myheritage.com, findagrave.com, newspapers.com — hold proprietary, digitized collections whose value rests on access exclusivity. Allowing AI training crawlers to consume those records without license or payment would undermine that model. Open-access platforms like familysearch.org and wikitree.com operate under a mission of maximum access and community contribution; restricting crawlers would work against their core purpose.

Q: Does a genealogy database blocking AI crawlers affect what AI models can say about family history?

A: A robots.txt block tells compliant AI training crawlers not to fetch publicly accessible content on the site. It does not affect what AI models learned during prior training runs, nor does it affect access to information in offline collections, databases the AI operator licensed separately, or publicly available information published elsewhere.

Q: Are the blockers protecting user data or corporate data?

A: Both, in different proportions. Ancestry and MyHeritage hold both proprietary records (corporate data) and user-contributed family trees, DNA data, and personal research notes. The robots.txt blocks apply to public-facing pages. Private user data is protected by authentication and access controls, not robots.txt. The robots.txt restriction is primarily a protection on the publicly indexable portions of the archive — the search landing pages, record previews, and editorial content.

Q: Is genealogy data more sensitive than other categories?

A: Genealogy data involves personal identity information — birth dates, death dates, family relationships, and in some cases DNA — for both living and deceased individuals. The sites in this category are well aware of the sensitivity. However, robots.txt governs automated crawling of public pages, not data handling inside the platform. This report does not assess how these sites handle personal data; it only measures their public robots.txt posture toward AI crawlers.

Q: What would change this category average in the next snapshot?

A: If any one of the six allowing sites adds a disallow directive for a major AI bot, the block rate rises from 40% to 50% (5 of 10). If any one of the four blocking sites removes its directives, the rate falls to 30% (3 of 10). The open-access platforms (FamilySearch, WikiTree, Geni) are unlikely to add restrictions given their missions. The commercial allowing sites (BillionGraves, GenealogyBank) are the most likely candidates for a posture change if industry norms shift. Monitoring for drift from this sealed baseline is the only way to catch changes when they happen.

Put AI-Access Data to Work

The Genealogy category finding — exactly at the corpus average, cleanly split between commercial and open-mission operators — gives three types of practitioners a concrete monitoring anchor.

Genealogy platform product and legal teams at properties like ancestry.com need to know when peers shift posture. If billiongraves.com or genealogybank.com — currently allowing crawlers — adds block directives, that signals a possible industry move toward broader restriction. The monitoring workflow: re-crawl all 10 Genealogy sites weekly against the sealed June 14 baseline, trigger an alert on any robots.txt change, and route the finding to the product and legal teams on a standing cadence. See also how Wedding sites track peer-level policy changes as a structural parallel.
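
The re-crawl-and-alert workflow described above can be sketched as a comparison between stored baseline text and freshly fetched text. The Python example below is an illustration only: it assumes both snapshots are already in memory as dictionaries keyed by site, leaves fetching and alert routing out, and uses placeholder sites and file contents rather than real crawl data.

```python
import difflib
import hashlib

def digest(text):
    """SHA-256 of the file text, used as a cheap change check."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_drift(baseline, current):
    """Return {site: diff lines} for sites whose robots.txt changed."""
    changes = {}
    for site, old_text in baseline.items():
        new_text = current.get(site)
        if new_text is None:
            changes[site] = ["robots.txt missing in re-crawl"]
        elif digest(new_text) != digest(old_text):
            changes[site] = list(difflib.unified_diff(
                old_text.splitlines(), new_text.splitlines(),
                fromfile=site + "@baseline", tofile=site + "@recrawl",
                lineterm="",
            ))
    return changes

# Placeholder snapshots
baseline = {
    "archive.example": "User-agent: *\nAllow: /\n",
    "tree.example": "User-agent: *\nAllow: /\n",
}
current = {
    "archive.example": "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n",
    "tree.example": "User-agent: *\nAllow: /\n",
}

changes = detect_drift(baseline, current)
for site, diff in changes.items():
    print(site)
    print("\n".join(diff))
```

Comparing digests first keeps the weekly run cheap; the line-level diff is only produced for files that actually changed, which is what a reviewer on the product or legal side would want to read.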

AI content licensing teams at operators whose bots are blocked by ancestry.com and myheritage.com need category-level context when approaching these companies for licensing agreements. A 40% block rate with a clear commercial-vs-open split tells you which companies are most likely to engage on licensing (the blockers) and which are already open (the allowers). The workflow: monitor for changes in disallow directives specific to your bot's user-agent string across the full Genealogy panel, flagging any loosening that might signal openness to direct outreach.
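
For a licensing team, the question is narrower than "did the file change": it is whether the posture toward one specific user-agent token loosened or tightened. A minimal Python sketch of that check follows, using the standard-library urllib.robotparser. The token ExampleBot, the function names, and the rule that "blocked" means the token may not fetch the site root are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(robots_text, token, path="/"):
    """True if this robots.txt disallows the token from fetching the path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(token, path)

def posture_change(old_text, new_text, token):
    """Compare one token's status between baseline and re-crawl."""
    before = is_blocked(old_text, token)
    after = is_blocked(new_text, token)
    if before and not after:
        return "loosened"
    if after and not before:
        return "tightened"
    return "unchanged"

# Placeholder file contents; ExampleBot is a hypothetical token
old_text = "User-agent: ExampleBot\nDisallow: /\n"
new_text = "User-agent: ExampleBot\nAllow: /\n"

print(posture_change(old_text, new_text, "ExampleBot"))
print(posture_change(old_text, new_text, "OtherBot"))
```

A "loosened" result for your own token is the signal the paragraph above describes as a possible opening for direct outreach; a change that only affects other tokens comes back as "unchanged".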

Data infrastructure engineers building retrieval pipelines that pull genealogy information need to know which sites are actively restricting automated access at the policy level. This data is not a substitute for reading each site's terms of service, but it provides a quick categorization of access posture across the competitive landscape. US Tech Automations automates this robots.txt monitoring with scheduled crawls, change-detection diffs, and structured alerts keyed to sealed baselines like this one. Start tracking AI-access drift in genealogy and adjacent categories at /platform/agentic-workflows.

See where Genealogy sites fit in the broader trend in our study of how many top websites block AI crawlers.

Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 27ca61d890a647db).

Cite this report

US Tech Automations Research, 2026-06 edition. “Do Genealogy Sites Block AI Crawlers? 4 of 10 Do.” https://ustechautomations.com/resources/blog/do-genealogy-sites-block-ai-crawlers-2026

Sealed snapshot sha256: 27ca61d890a647db

Machine-readable data: CSV · JSON · All research & methodology