How to set up AI crawler access
Most AI visibility problems start with a robots.txt file written before AI crawlers existed. Here's the canonical list of agents to allow, plus the trade-offs for each.
The crawler landscape in 2026
There are now more than a dozen AI-related user agents you might encounter. They fall into three groups:
- Training crawlers — fetch content to train future models (GPTBot, ClaudeBot)
- Search/retrieval crawlers — fetch content for live AI search answers (OAI-SearchBot, PerplexityBot)
- AI-usage control signals — not crawlers, but flags governing how already-fetched content can be used (Google-Extended, Applebot-Extended)
The most common mistake is treating these as interchangeable. Blocking GPTBot does not block OAI-SearchBot. Allowing Googlebot does not allow Google-Extended.
The canonical allow list
For most commercial sites, this is the recommended baseline:
```
# OpenAI
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: claude-user
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Google (AI surfaces)
User-agent: Google-Extended
Allow: /

# Apple (Apple Intelligence)
User-agent: Applebot-Extended
Allow: /

# Microsoft Bing AI
User-agent: bingbot
Allow: /

# Common Crawl (used in training datasets)
User-agent: CCBot
Allow: /

# DuckDuckGo AI
User-agent: DuckAssistBot
Allow: /
```
When to block
You may want to block specific crawlers if:
- You are a major publisher negotiating direct licensing deals with model providers
- You have proprietary or paywalled content that must not be used for training
- You have legal or regulatory constraints in a specific jurisdiction
- You are running an A/B test on the visibility impact of blocking (rare)
In every other case, the marketing and visibility cost of blocking outweighs the protection benefit.
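If you do fall into one of those categories, blocking can be selective rather than all-or-nothing. The sketch below blocks the main training crawlers while leaving the live-answer retrieval crawlers from the canonical list allowed; adjust the agent groups to match whichever providers your licensing or legal constraints actually cover.

```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Keep live-answer retrieval crawlers allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```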
Step-by-step
1. Locate your current robots.txt
Visit https://yourdomain.com/robots.txt and read what is there. Most sites have either an empty file, a default CMS-generated file, or a years-old configuration. Save a copy as backup before editing.
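If you would rather script that backup, here is a minimal Python sketch (the domain is a placeholder for your own) that downloads the live file and writes a dated copy:

```python
from datetime import date
from urllib.request import urlopen

DOMAIN = "https://yourdomain.com"  # placeholder: your own domain

# Fetch the live robots.txt and keep a dated local backup before editing
with urlopen(f"{DOMAIN}/robots.txt", timeout=10) as resp:
    current = resp.read().decode("utf-8", errors="replace")

backup_name = f"robots-backup-{date.today().isoformat()}.txt"
with open(backup_name, "w", encoding="utf-8") as f:
    f.write(current)

print(f"Saved {len(current)} bytes to {backup_name}")
```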
2. Apply the canonical allow list
Replace or extend your robots.txt with the directives in the section above, keeping any site-specific Disallow rules for admin panels, search result pages, or sensitive paths. Note that a crawler follows only the most specific user-agent group that matches it, so rules under User-agent: * do not combine with the named groups you add: repeat any Disallow lines you still need inside each AI crawler's group, as in the sketch below.
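A minimal sketch of that pattern, using placeholder paths (/admin/ and /internal-search/) in place of your real sensitive paths:

```
User-agent: *
Disallow: /admin/
Disallow: /internal-search/

# Named groups override the * group, so sensitive paths are repeated here
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /internal-search/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /internal-search/
```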
3. Test each user agent with curl
For each crawler, run a curl request with the user agent string and verify your important pages return 200 OK. Some CDN configurations block unknown user agents at the edge before robots.txt is consulted. Fix CDN rules if needed.
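To run that check across the whole allow list in one pass, here is a minimal Python sketch; the URLs are placeholders, and sending the bare agent token is only a rough stand-in for the longer User-Agent strings real crawlers send.

```python
import urllib.error
import urllib.request

# Pages to spot-check and the agent tokens to impersonate.
# Google-Extended and Applebot-Extended are control tokens, not request
# user agents, so they are not tested here.
URLS = ["https://yourdomain.com/", "https://yourdomain.com/pricing"]
AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Perplexity-User", "CCBot", "DuckAssistBot", "bingbot",
]

for url in URLS:
    for agent in AGENTS:
        req = urllib.request.Request(url, headers={"User-Agent": agent})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                status = resp.status
        except urllib.error.HTTPError as err:
            status = err.code  # e.g. 403 from an edge rule blocking this agent
        except urllib.error.URLError as err:
            status = f"error: {err.reason}"
        print(f"{status}  {agent:<16} {url}")
```

Anything other than 200 OK for an important page points to a CDN or WAF rule rather than robots.txt, since robots.txt only advises crawlers and never returns errors itself.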
4. Add llms.txt as a companion file
While editing access controls, also publish /llms.txt with your top 20-50 canonical pages. The robots.txt allowance lets crawlers in; the llms.txt curation tells them what to prioritize.
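llms.txt is a plain Markdown file served from your site root. A minimal sketch with placeholder pages and descriptions (swap in your own canonical URLs and one-line summaries):

```markdown
# Your Company
> One-sentence summary of what the company does and who it serves.

## Product
- [Pricing](https://yourdomain.com/pricing): plans, limits, and per-seat cost
- [Integrations](https://yourdomain.com/integrations): supported tools and setup notes

## Docs
- [Getting started](https://yourdomain.com/docs/getting-started): first-run setup guide
```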
5. Monitor crawler hits in server logs
After deploying, watch your server logs or analytics for user agents matching the allowed list. Within 7-14 days you should see GPTBot, PerplexityBot, ClaudeBot, and others fetching your pages. If you do not, recheck your CDN and firewall rules.
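A minimal Python sketch for that check, assuming a plain-text access log in which the user agent appears somewhere on each line (the log path is a placeholder for your own):

```python
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder: your server's access log
TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Perplexity-User", "CCBot", "DuckAssistBot", "bingbot",
]

# Count log lines whose user-agent field mentions each crawler token
hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        lowered = line.lower()
        for token in TOKENS:
            if token.lower() in lowered:
                hits[token] += 1

for token in TOKENS:
    print(f"{token:<16} {hits[token]:>6}")
```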
6. Document the policy
Record the rationale for each allow or block decision in an internal doc. Crawler policies tend to be re-litigated when new stakeholders join — having a written decision log prevents accidental tightening that costs visibility.
Frequently asked questions
Will allowing AI crawlers slow down my site?
Marginally, if at all. Reputable AI crawlers keep request rates modest, and for sites under a few million pages the additional load is negligible. Larger publishers can add a Crawl-delay directive in robots.txt, but support for it varies by crawler (it is not part of the robots.txt standard), so rate caps at the CDN or server level are the more reliable control.
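For crawlers that do honor the directive, it is a single line inside the crawler's group, interpreted as seconds between requests (the agent and value here are only an illustration):

```
# Ask a crawler that supports Crawl-delay to wait 10 seconds between fetches
User-agent: bingbot
Crawl-delay: 10
Allow: /
```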
Do I lose control of my content by allowing AI crawlers?
You retain copyright and trademark rights. Allowing crawlers gives consent for indexing and grounding but does not transfer ownership. Some publishers prefer explicit licensing deals over open access — those negotiations happen outside robots.txt. For most brands, open allowance is the right default.
How can I tell which crawlers are most valuable to allow?
Prioritize by audience overlap: OAI-SearchBot and GPTBot (largest user base), PerplexityBot (highest referral traffic per visit), Google-Extended (largest search surface), ClaudeBot (technical and analyst audiences). Allow all of them unless you have a specific reason not to.
What about scrapers pretending to be AI crawlers?
User agent strings can be spoofed. Major AI crawlers also publish IP ranges or DNS verification methods so you can confirm authenticity. If you suspect scraping, verify the IP against the published list and consider WAF rules for unverified traffic. Real crawlers from major providers are well-documented.
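One common verification method is forward-confirmed reverse DNS: resolve the requesting IP to a hostname, check that it belongs to a domain the provider documents for its crawler, then resolve that hostname back and confirm it returns the same IP. Not every AI crawler supports this (some only publish IP range lists you would check instead), and the hostname suffixes in the sketch below are placeholders to replace with each provider's documented values.

```python
import socket

# Placeholder suffixes: substitute the hostnames each provider documents
EXPECTED_SUFFIXES = {
    "GPTBot": (".openai.com",),
    "PerplexityBot": (".perplexity.ai",),
}

def verify_crawler(ip: str, claimed_agent: str) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    suffixes = EXPECTED_SUFFIXES.get(claimed_agent)
    if not suffixes:
        return False  # no documented hostnames to check against
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(suffixes):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirm
    except socket.gaierror:
        return False
    return ip in forward_ips

print(verify_crawler("203.0.113.10", "GPTBot"))  # documentation-range IP, for illustration
```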
Should I publish ai.txt or other emerging formats?
Several proposed standards (ai.txt, ai-content-license, etc.) exist but adoption is fragmented. Stick with the de-facto standards: robots.txt for access, llms.txt for curation, and standard schema markup for content labeling. Revisit emerging formats once they have meaningful crawler support.
Track your AI visibility automatically
Geosaur runs your prompt set across ChatGPT, Perplexity, Claude, Gemini, and Google AI Overviews on a recurring schedule — and alerts you the moment something changes.
