Do Jobs Sites Block AI Crawlers? Sealed robots.txt Data
Job boards occupy an unusual position in the AI-access debate. Their core product, structured listings that match candidates to employers, is exactly the type of content that AI systems find easy to ingest and repurpose. Yet the Jobs category blocks AI crawlers at just 37.5%.
That is well below the 46.6% corpus-wide rate, and the reasons embedded in the site-level data are worth unpacking.
3 of 8 Jobs sites block at least one AI crawler — a 37.5% block rate.
This report uses only data read verbatim from the sealed snapshot. To be explicit, nothing is estimated, modeled, or extrapolated. Every figure comes directly from public robots.txt files sealed June 14, 2026 under snapshot sha 834f1e2f07af24fd.
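One way to picture a sealed snapshot is as a content hash over the fetched files, so that any later change to the inputs yields a different identifier. The sketch below is an illustration only: the report does not describe its sealing procedure, and `seal_snapshot`, the JSON serialization, and the `.test` hostnames are all assumptions made for this example.

```python
import hashlib
import json
from typing import Dict, Optional


def seal_snapshot(files: Dict[str, Optional[str]]) -> str:
    """Return a SHA-256 hex digest over a canonical serialization.

    `files` maps hostname -> robots.txt body, or None when no file
    was returned. (Hypothetical helper, not the report's own method.)
    """
    # sort_keys makes the digest independent of insertion order.
    canonical = json.dumps(files, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder hosts and bodies, not data from the report.
snapshot_a = {
    "example-a.test": "User-agent: GPTBot\nDisallow: /\n",
    "example-b.test": None,
}
snapshot_b = {
    "example-b.test": None,
    "example-a.test": "User-agent: GPTBot\nDisallow: /\n",
}
digest = seal_snapshot(snapshot_a)
same_digest = seal_snapshot(snapshot_b)  # same content, different order
```

Under this assumed scheme, a shortened identifier such as a 16-character prefix could be published alongside the figures, and anyone holding the same files could recompute it.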
The Structural Logic of a Low-Blocking Job Category
A robots.txt file is an honor-system protocol — it signals a site owner's crawling preferences to compliant user-agents but does not technically prevent access. For job boards, the calculus differs from, say, an editorial media brand. Many of the largest job-search platforms actively benefit from maximum discoverability: appearing in AI-surfaced answer results, being cited in AI-powered job-search assistants, or having listings ingested into structured recommendation engines can all drive applicant traffic. That business logic pushes toward allowing, not blocking.
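The honor-system check described above is something a compliant crawler performs on its own side before fetching. A minimal sketch using Python's standard-library `urllib.robotparser` follows; the robots.txt body, the `is_allowed` helper, and the `example.test` URL are made up for illustration and do not come from any site in the panel.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body: one AI crawler is disallowed everywhere,
# every other user-agent is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_body`."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.test/jobs/123")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.test/jobs/123")
```

Nothing in this check stops a non-compliant client from requesting the page anyway, which is the sense in which robots.txt signals a preference rather than enforcing one.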
The 3 blocking sites in this category are glassdoor.com, ziprecruiter.com, and simplyhired.com. Glassdoor is the most intuitive case: its core value is salary data and employer reviews contributed by users — proprietary community content that the site monetizes through subscriptions and employer branding products. AI training on that content without compensation or attribution undermines the monetization model directly.
ZipRecruiter and SimplyHired both operate as aggregators and matching platforms. Their blocking choices may reflect concern that AI systems will surface structured job-listing data without driving applicants back to the platform, bypassing the click-attribution loop that funds both sides of the marketplace. On this reading, the concern is less about training corpora than about retrieval: an AI assistant that answers "show me software engineer jobs in Austin" by pulling from ZipRecruiter's feed without a referral link removes the platform from the transaction.
"3 of 8 Jobs sites with a parseable robots.txt blocked at least one AI crawler as of June 14, 2026: a 37.5% rate, below the 46.6% corpus-wide figure."
Which Sites Allow All Crawlers
The 5 sites with parseable robots.txt files that allow all tracked AI crawlers are indeed.com, monster.com, dice.com, careerbuilder.com, and wellfound.com. That list spans the spectrum of job-search models: indeed.com is the dominant generalist aggregator; monster.com is a long-established broad-market board; dice.com focuses on tech roles; careerbuilder.com targets broad employer-side recruiting; wellfound.com (formerly AngelList Talent) focuses on startup roles. In each case, the commercial incentive to maximize reach and visibility in AI-generated results plausibly outweighs any concern about training-data extraction.
flexjobs.com and theladders.com returned no robots.txt file in this snapshot. Both are subscription-gated platforms — flexjobs.com for remote and flexible work, theladders.com for higher-compensation roles. A missing robots.txt does not mean permission to scrape aggressively; it means the site has not published explicit instructions. For subscription platforms in particular, the practical access barrier is authentication, not robots.txt, so the absence of a file may not reflect open-access intent.
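The three outcomes discussed so far (blocks at least one AI crawler, allows all tracked crawlers, no robots.txt returned) can be kept distinct in code rather than folding a missing file into "allows". The sketch below assumes that "blocking" means a tracked bot is disallowed from the root path `/`; the report does not spell out its exact rule, and `classify` is a hypothetical helper.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Bot names taken from the corpus-wide bot table in this report.
TRACKED_BOTS = [
    "CCBot", "ClaudeBot", "Bytespider", "GPTBot", "Meta-ExternalAgent",
    "PerplexityBot", "Applebot-Extended", "Google-Extended", "Amazonbot",
]


def classify(robots_body: Optional[str]) -> str:
    """Classify one site from its robots.txt body (None = no file)."""
    if robots_body is None:
        # Kept separate: no published instructions is not the same
        # as an explicit allow.
        return "no_robots_txt"
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    # Assumption: a bot counts as blocked if it may not fetch "/".
    blocked = [bot for bot in TRACKED_BOTS if not parser.can_fetch(bot, "/")]
    return "blocks_ai" if blocked else "allows_all"


missing = classify(None)
open_site = classify("User-agent: *\nDisallow: /private/\n")
blocking_site = classify("User-agent: GPTBot\nDisallow: /\n")
```

One caveat when reusing this: the standard-library parser matches agent names by case-insensitive substring, so a rule written for a shorter agent name can also apply to a longer name that contains it.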
The Jobs category block rate of 37.5% sits below the corpus-wide 46.6%.
How All 24 Categories Stack Up
The table below covers the full 24-category sweep. Jobs, at 37.5%, falls in the lower half of the distribution, below categories like Science (50%), Reference (54.5%), and the editorial-heavy leaders.
| Category | Sites Checked | With robots.txt | Blocking Any AI | Block Rate |
|---|---|---|---|---|
| Gaming | 9 | 9 | 8 | 88.9% |
| News | 20 | 17 | 14 | 82.4% |
| Food | 10 | 10 | 7 | 70% |
| Tech | 15 | 13 | 9 | 69.2% |
| Entertainment | 9 | 9 | 6 | 66.7% |
| Healthcare | 10 | 9 | 6 | 66.7% |
| Music | 10 | 9 | 6 | 66.7% |
| Reference | 14 | 11 | 6 | 54.5% |
| Science | 10 | 10 | 5 | 50% |
| Automotive | 10 | 9 | 4 | 44.4% |
| Home & Garden | 10 | 9 | 4 | 44.4% |
| Fashion | 9 | 7 | 3 | 42.9% |
| Social | 10 | 10 | 4 | 40% |
| Sports | 10 | 10 | 4 | 40% |
| Jobs | 10 | 8 | 3 | 37.5% |
| Travel | 9 | 9 | 3 | 33.3% |
| Weather | 10 | 6 | 2 | 33.3% |
| Legal | 10 | 7 | 2 | 28.6% |
| Real Estate | 10 | 7 | 2 | 28.6% |
| Finance | 12 | 11 | 2 | 18.2% |
| Retail | 15 | 12 | 2 | 16.7% |
| Education | 9 | 7 | 1 | 14.3% |
| Government | 9 | 8 | 1 | 12.5% |
| Nonprofit | 10 | 6 | 0 | 0% |
Jobs places fifteenth out of 24 categories, below the 46.6% corpus-wide rate and in a cluster with Travel (33.3%) and Fashion (42.9%) that represents more commercially open, distribution-seeking sectors. The categories blocking most aggressively, Gaming at 88.9% and News at 82.4%, are defined by proprietary creative or editorial content where AI training represents a clearer substitution risk.
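The block rates and the Jobs ranking can be recomputed directly from the table's counts. The sketch below assumes the rate is "Blocking Any AI" divided by "With robots.txt" (which matches every percentage shown) and ranks a category by how many others have a strictly higher rate.

```python
# (category, sites with robots.txt, sites blocking any AI crawler),
# copied from the table above.
ROWS = [
    ("Gaming", 9, 8), ("News", 17, 14), ("Food", 10, 7), ("Tech", 13, 9),
    ("Entertainment", 9, 6), ("Healthcare", 9, 6), ("Music", 9, 6),
    ("Reference", 11, 6), ("Science", 10, 5), ("Automotive", 9, 4),
    ("Home & Garden", 9, 4), ("Fashion", 7, 3), ("Social", 10, 4),
    ("Sports", 10, 4), ("Jobs", 8, 3), ("Travel", 9, 3), ("Weather", 6, 2),
    ("Legal", 7, 2), ("Real Estate", 7, 2), ("Finance", 11, 2),
    ("Retail", 12, 2), ("Education", 7, 1), ("Government", 8, 1),
    ("Nonprofit", 6, 0),
]

# Per-category block rate as a fraction of sites with a parseable file.
rates = {name: blocking / with_file for name, with_file, blocking in ROWS}

total_with_file = sum(with_file for _, with_file, _ in ROWS)
total_blocking = sum(blocking for _, _, blocking in ROWS)
corpus_rate = total_blocking / total_with_file

# Rank = 1 + number of categories with a strictly higher rate.
jobs_rank = 1 + sum(1 for rate in rates.values() if rate > rates["Jobs"])

print(f"Jobs: {rates['Jobs']:.1%}, rank {jobs_rank} of {len(ROWS)}")
print(f"Corpus: {total_blocking}/{total_with_file} = {corpus_rate:.1%}")
```

The column totals reproduce the corpus-wide figures quoted elsewhere in this report: 223 sites with a parseable robots.txt, of which 104 block at least one AI crawler.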
Corpus-Wide Bot and Operator Blocking
The tables below describe which AI crawlers and operators face the most blocking, measured across all 223 sites with parseable robots.txt files — not limited to the Jobs category.
| Bot | Sites Blocking It (of 223) | Block Rate |
|---|---|---|
| CCBot | 85 | 38.1% |
| ClaudeBot | 74 | 33.2% |
| Bytespider | 69 | 30.9% |
| GPTBot | 64 | 28.7% |
| Meta-ExternalAgent | 63 | 28.3% |
| PerplexityBot | 60 | 26.9% |
| Applebot-Extended | 60 | 26.9% |
| Google-Extended | 57 | 25.6% |
| Amazonbot | 50 | 22.4% |
CCBot leads, plausibly because Common Crawl's crawler has been in operation longest and appears in the most webmaster guidance documents. ClaudeBot and Bytespider follow. The gap between the top and bottom of this list may reflect that newer or less well-publicized crawlers draw less pushback from webmaster communities that have not yet identified and discussed them.
| Operator | Sites Blocking Them (of 223) |
|---|---|
| Common Crawl | 85 |
| Anthropic | 80 |
| Meta | 73 |
| ByteDance | 69 |
| OpenAI | 66 |
| Perplexity | 60 |
| Apple | 60 |
| Google | 57 |
| Cohere | 56 |
| Diffbot | 55 |
| Amazon | 50 |
| Mistral | 21 |
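Some operator counts exceed the count for that operator's best-known bot (Anthropic at 80 against ClaudeBot at 74, for example). That pattern is what one would expect if a site counts against an operator when it blocks any of that operator's user-agents. The sketch below illustrates that union logic under that assumption; the bot-to-operator mapping is partial and illustrative, and the `.test` sites and their block sets are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Set

# Partial, illustrative mapping; the report's full mapping is not given.
BOT_OPERATOR = {
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "CCBot": "Common Crawl",
}

# Invented per-site block sets (placeholder hosts, not report data).
site_blocks: Dict[str, Set[str]] = {
    "site-a.test": {"ClaudeBot", "GPTBot"},
    "site-b.test": {"anthropic-ai"},
    "site-c.test": {"ClaudeBot", "anthropic-ai", "CCBot"},
    "site-d.test": set(),
}


def operator_counts(blocks: Dict[str, Set[str]],
                    bot_operator: Dict[str, str]) -> Dict[str, int]:
    """Count sites blocking at least one bot per operator."""
    sites_by_operator = defaultdict(set)
    for site, bots in blocks.items():
        for bot in bots:
            operator = bot_operator.get(bot)
            if operator is not None:
                # A set ensures each site counts once per operator.
                sites_by_operator[operator].add(site)
    return {op: len(sites) for op, sites in sites_by_operator.items()}


counts = operator_counts(site_blocks, BOT_OPERATOR)
claudebot_only = sum(1 for bots in site_blocks.values() if "ClaudeBot" in bots)
```

In this toy data, three sites count against Anthropic while only two block ClaudeBot by name, mirroring the direction of the gap in the tables.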
"Across all 223 sites with parseable robots.txt files, CCBot is blocked by 85 sites — the highest count of any individual AI crawler in the June 2026 corpus."
104 of 223 sites block at least one AI crawler.
Key Takeaways
3 of 8 Jobs sites block at least one AI crawler — a 37.5% block rate.
The blockers are glassdoor.com (salary and review data), ziprecruiter.com, and simplyhired.com.
indeed.com, monster.com, dice.com, careerbuilder.com, and wellfound.com allow all tracked AI crawlers.
flexjobs.com and theladders.com returned no robots.txt in this snapshot.
Jobs sits below the corpus-wide 46.6% block rate, consistent with a category where broad discoverability is commercially valuable.
Across all 223 corpus sites, 39 have deployed an llms.txt file — a 17.5% adoption rate for that newer signaling standard.
The 24-category range runs from Gaming at 88.9% down to Nonprofit at 0%; Jobs at 37.5% falls in the lower-middle band.
Frequently Asked Questions
Q: Why would a job board allow AI crawlers to train on its listings?
A: A job board's listings are, by design, public and discovery-focused — employers post them to reach the widest possible audience. AI systems that surface listings in answer results or conversational search can function as a distribution channel. For major generalist boards like indeed.com or monster.com, the calculus appears to favor maximum reach. The platforms that block tend to be those whose value comes from proprietary non-listing data (Glassdoor's salary and review content) or those concerned specifically about referral attribution.
Q: Does the absence of a robots.txt file from flexjobs.com and theladders.com mean they allow all crawlers?
A: Technically, compliant crawlers interpret a missing robots.txt as permission to crawl all paths. But both platforms are subscription-gated: a crawler reaching a login wall does not get listing content regardless of what the robots.txt says. The access barrier is authentication, not the robots protocol. Their absence from the robots.txt blocking count should not be read as blanket access permission in practice.
Q: Could a job board change its blocking position in response to AI licensing deals?
A: Yes, and this is precisely why a point-in-time snapshot must be read as a snapshot, not a forecast. The Jobs category has commercial incentives in both directions — maximum discoverability argues for allowing; protection of proprietary data (especially review and salary data) argues for blocking. If a major job board enters a licensing arrangement with an AI operator, it might relax its block for that specific bot while tightening restrictions for others. The sealed-snapshot methodology captures the state on June 14, 2026; re-crawling the panel tracks policy drift over time.
Q: What does a low block rate in Jobs imply for AI product builders using job-listing data?
A: A 37.5% block rate means that five of the eight job boards with a parseable robots.txt in this panel (indeed.com, monster.com, dice.com, careerbuilder.com, and wellfound.com) carry no AI-crawler disallow, so retrieval pipelines face no honor-system barrier on their listings. A smaller group (glassdoor.com, ziprecruiter.com, simplyhired.com) does object, and compliant pipelines should respect those signals.
The same principle applies to AI training pipelines: a site in the allowance column has not explicitly blocked training, but the absence of a block is not affirmative permission. For more context on how another data-rich category handles this, see the Home & Garden report.
Put AI-Access Data to Work
Three specific workflows apply to what the Jobs-category sealed data shows.
Talent-tech product managers and AI recruiting tool builders face a concrete policy map from this snapshot: glassdoor.com, ziprecruiter.com, and simplyhired.com all carry disallow directives for at least one tracked AI crawler. A recurring monitoring workflow — re-crawl these 10 sites weekly and alert when any currently-allowing site (indeed.com, monster.com, dice.com, careerbuilder.com, wellfound.com) adds an AI-crawler block — gives your team early warning before a pipeline dependency breaks. The snapshot is the anchor; the value is detecting drift.
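The drift check in that monitoring workflow reduces to comparing two snapshots of per-site block sets. The sketch below is one possible shape for it; `new_blocks` is a hypothetical helper, and the hosts and block sets are placeholders rather than observations about any real job board.

```python
from typing import Dict, List, Set


def new_blocks(previous: Dict[str, Set[str]],
               current: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return {site: bots blocked now that were not blocked before}."""
    alerts: Dict[str, List[str]] = {}
    for site, bots_now in current.items():
        # Sites absent from the previous snapshot are compared against
        # an empty set, so any block on a new site is reported.
        added = bots_now - previous.get(site, set())
        if added:
            alerts[site] = sorted(added)
    return alerts


# Placeholder snapshots from two hypothetical crawl dates.
last_week = {
    "board-a.test": set(),
    "board-b.test": {"GPTBot"},
}
this_week = {
    "board-a.test": {"CCBot", "GPTBot"},
    "board-b.test": {"GPTBot"},
    "board-c.test": {"ClaudeBot"},
}

alerts = new_blocks(last_week, this_week)
for site, bots in alerts.items():
    print(f"ALERT {site}: newly blocks {', '.join(bots)}")
```

This version reports only added blocks, since those are what break a pipeline dependency; removed blocks could be tracked the same way with the set difference reversed.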
RevOps and licensing leads at job boards can use the 37.5% block rate as a calibration point. The majority of the category is currently in the allow column; that consensus creates a negotiating environment where licensing deals for AI training or retrieval access are easier to structure. Tracking the share of the category that flips to blocking over successive quarterly snapshots quantifies whether the window is closing. Compare with the Nonprofit category for a sector where zero sites block — a different point on the distribution that illustrates how wide the category variance is.
Data pipeline engineers ingesting job postings into retrieval-augmented or agent-driven systems can use this panel as an access audit starting point. The 3 blocking sites should be flagged in any ingestion pipeline for human review against terms of service; the 5 allowing sites carry no robots.txt barrier as of June 14, 2026. A scheduled re-crawl converts that audit into an ongoing access-compliance signal. Also see the Weather category report for a look at how another fragmented, utility-oriented category handles access policy.
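As a starting point for that audit, the panel's status from this snapshot can be encoded as a routing table in front of the ingestion step. The statuses below are the ones reported here as of June 14, 2026; the `route` helper and the queue names are hypothetical, and sending sites with no robots.txt to human review is an assumption of this sketch rather than a recommendation stated in the data.

```python
from typing import Dict, List

# Status per site as reported in this snapshot (June 14, 2026).
PANEL_STATUS = {
    "glassdoor.com": "blocks_ai",
    "ziprecruiter.com": "blocks_ai",
    "simplyhired.com": "blocks_ai",
    "indeed.com": "allows_all",
    "monster.com": "allows_all",
    "dice.com": "allows_all",
    "careerbuilder.com": "allows_all",
    "wellfound.com": "allows_all",
    "flexjobs.com": "no_robots_txt",
    "theladders.com": "no_robots_txt",
}


def route(status: str) -> str:
    """Map a robots.txt status to a pipeline queue (hypothetical names)."""
    if status == "allows_all":
        return "ingest"
    # Assumption: both explicit blocks and missing files go to a person,
    # since neither amounts to affirmative permission.
    return "human_review"


queues: Dict[str, List[str]] = {}
for site, status in PANEL_STATUS.items():
    queues.setdefault(route(status), []).append(site)
```

Because the table reflects a single point in time, it would need to be regenerated from each scheduled re-crawl rather than hard-coded in a long-lived pipeline.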
US Tech Automations automates this exact monitoring job: scheduled robots.txt crawls across any panel of sites, change-diff alerting, and a per-category policy dashboard that updates continuously — so your team sees a new disallow directive the same day it goes live. Explore the agentic workflow platform.
Source: US Tech Automations Research — Closing Web edition; figures are verbatim counts from public robots.txt files sealed June 14, 2026 (snapshot sha 834f1e2f07af24fd).
Cite this report
US Tech Automations Research, 2026-06 edition. “Do Jobs Sites Block AI Crawlers? Sealed robots.txt Data.” https://ustechautomations.com/resources/blog/do-jobs-sites-block-ai-crawlers-2026
Sealed snapshot sha256: 834f1e2f07af24fd