6 min read·Updated 2026-05-11

How to set up AI crawler access

Most AI visibility problems start with a robots.txt file written before AI crawlers existed. Here's the canonical list of agents to allow, plus the trade-offs for each.

The crawler landscape in 2026

There are now more than a dozen AI-related user agents you might encounter. They fall into three groups:

  1. Training crawlers — fetch content to train future models (GPTBot, ClaudeBot)
  2. Search/retrieval crawlers — fetch content for live AI search answers (OAI-SearchBot, PerplexityBot)
  3. AI-usage control signals — not crawlers, but flags governing how already-fetched content can be used (Google-Extended, Applebot-Extended)

The most common mistake is treating these as interchangeable. Blocking GPTBot does not block OAI-SearchBot. Allowing Googlebot does not allow Google-Extended.
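Because each robots.txt record binds to a specific user agent, you can sanity-check this behavior programmatically. A minimal sketch using Python's standard `urllib.robotparser`, with a hypothetical robots.txt that blocks GPTBot but says nothing about OAI-SearchBot:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only GPTBot is blocked.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(ROBOTS)

print(rp.can_fetch("GPTBot", "https://example.com/page"))         # False
print(rp.can_fetch("OAI-SearchBot", "https://example.com/page"))  # True
```

The second call returns True even though GPTBot is fully blocked: each agent is evaluated only against its own record.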

The canonical allow list

For most commercial sites, this is the recommended baseline:

# OpenAI
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: claude-user
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Google (AI surfaces)
User-agent: Google-Extended
Allow: /

# Apple (Apple Intelligence)
User-agent: Applebot-Extended
Allow: /

# Microsoft Bing AI
User-agent: bingbot
Allow: /

# Common Crawl (used in training datasets)
User-agent: CCBot
Allow: /

# DuckDuckGo AI
User-agent: DuckAssistBot
Allow: /

When to block

You may want to block specific crawlers if:

  • You are a major publisher negotiating direct licensing deals with model providers
  • You have proprietary or paywalled content that must not be used for training
  • You have legal or regulatory constraints in a specific jurisdiction
  • You are running an A/B test on the visibility impact of blocking (rare)

In every other case, the marketing and visibility cost of blocking outweighs the protection benefit.
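If you do fall into one of those cases, block narrowly rather than broadly. For example, a publisher can refuse training use while staying visible in AI search. A hypothetical fragment in the same directive format as the allow list above:

```
# Block training crawlers only
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Keep retrieval crawlers for AI search visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```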

Step-by-step

  1. Locate your current robots.txt

    Visit https://yourdomain.com/robots.txt and read what is there. Most sites have either an empty file, a default CMS-generated file, or a years-old configuration. Save a copy as backup before editing.

  2. Apply the canonical allow list

    Replace or extend your robots.txt with the directives in the section above. Keep any site-specific Disallow rules for admin panels, search result pages, or sensitive paths. Make sure your existing Disallow rules do not unintentionally apply to AI crawlers.

  3. Test each user agent with curl

    For each crawler, run a curl request with the user agent string and verify your important pages return 200 OK. Some CDN configurations block unknown user agents at the edge before robots.txt is consulted. Fix CDN rules if needed.
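The same spot-check can be scripted. A sketch in Python using only the standard library, with a throwaway local server standing in for your site (replace BASE with your real domain when running this for real; the agent list is a subset of the allow list above):

```python
import http.server
import threading
import urllib.request

# Throwaway local server standing in for your site; replace BASE with
# your real domain (e.g. "https://yourdomain.com/") for a real check.
server = http.server.HTTPServer(("127.0.0.1", 0),
                                http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
BASE = f"http://127.0.0.1:{server.server_address[1]}/"

# User agents from the allow list above -- extend as needed.
AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]

def status_for(url: str, user_agent: str) -> int:
    """Fetch url with the given User-Agent header, return the status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.status

for ua in AGENTS:
    assert status_for(BASE, ua) == 200, f"{ua} blocked -- check CDN/WAF rules"

server.shutdown()
```

The equivalent one-off curl check is `curl -A "GPTBot" -s -o /dev/null -w "%{http_code}\n" https://yourdomain.com/`, repeated per user agent.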

  4. Add llms.txt as a companion file

    While editing access controls, also publish /llms.txt with your top 20-50 canonical pages. The robots.txt allowance lets crawlers in; the llms.txt curation tells them what to prioritize.

  5. Monitor crawler hits in server logs

    After deploying, watch your server logs or analytics for user agents matching the allowed list. Within 7-14 days you should see GPTBot, PerplexityBot, ClaudeBot, and others fetching your pages. If you do not, recheck your CDN and firewall rules.
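A quick way to count those hits is plain substring matching over your access log. A sketch in Python; the log lines below are made-up samples in combined log format, so point this at your real log file instead:

```python
from collections import Counter

# Made-up sample lines; replace with e.g. open("/var/log/nginx/access.log").
LOG_LINES = [
    '1.2.3.4 - - [11/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.1"',
    '5.6.7.8 - - [11/May/2026:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 814 "-" "PerplexityBot/1.0"',
    '9.9.9.9 - - [11/May/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (regular browser)"',
]

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
             "Google-Extended", "CCBot", "DuckAssistBot"]

# Case-insensitive substring match per line, one count per agent hit.
hits = Counter()
for line in LOG_LINES:
    for agent in AI_AGENTS:
        if agent.lower() in line.lower():
            hits[agent] += 1

print(dict(hits))  # → {'GPTBot': 1, 'PerplexityBot': 1}
```

Zero hits for an agent after two weeks is the signal to recheck CDN and firewall rules.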

  6. Document the policy

    Record the rationale for each allow or block decision in an internal doc. Crawler policies tend to be re-litigated when new stakeholders join — having a written decision log prevents accidental tightening that costs visibility.

Frequently asked questions

Will allowing AI crawlers slow down my site?

Marginally, if at all. AI crawlers respect crawl-delay and rate limits. For sites under a few million pages, the additional load is negligible. Larger publishers may want to configure crawl-delay or rate caps in robots.txt — most major AI crawlers honor them.
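A per-crawler rate cap looks like this (a hypothetical fragment; Crawl-delay is expressed in seconds, is non-standard, and support varies by crawler, so treat it as advisory):

```
User-agent: GPTBot
Crawl-delay: 2
Allow: /
```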

Do I lose control of my content by allowing AI crawlers?

You retain copyright and trademark rights. Allowing crawlers gives consent for indexing and grounding but does not transfer ownership. Some publishers prefer explicit licensing deals over open access — those negotiations happen outside robots.txt. For most brands, open allowance is the right default.

How can I tell which crawlers are most valuable to allow?

Prioritize by audience overlap: OAI-SearchBot and GPTBot (largest user base), PerplexityBot (highest referral traffic per visit), Google-Extended (largest search surface), ClaudeBot (technical and analyst audiences). Allow all of them unless you have a specific reason not to.

What about scrapers pretending to be AI crawlers?

User agent strings can be spoofed. Major AI crawlers also publish IP ranges or DNS verification methods so you can confirm authenticity. If you suspect scraping, verify the IP against the published list and consider WAF rules for unverified traffic. Real crawlers from major providers are well-documented.

Should I publish ai.txt or other emerging formats?

Several proposed standards (ai.txt, ai-content-license, etc.) exist but adoption is fragmented. Stick with the de-facto standards: robots.txt for access, llms.txt for curation, and standard schema markup for content labeling. Revisit emerging formats once they have meaningful crawler support.

Track your AI visibility automatically

Geosaur runs your prompt set across ChatGPT, Perplexity, Claude, Gemini, and Google AI Overviews on a recurring schedule — and alerts you the moment something changes.
