
Summary

AI crawlers are bots from OpenAI, Anthropic, Google, and Perplexity that fetch web content to train models or to retrieve pages for live answers, and your robots.txt policy toward them largely decides whether your brand shows up in AI search.

AI crawlers are automated bots run by AI companies (GPTBot from OpenAI, ClaudeBot from Anthropic, PerplexityBot, and a fast-growing list of others) that fetch web content either to train large language models or to retrieve pages in real time when an AI engine answers a question. Where Googlebot indexes URLs so pages can rank as links, AI crawlers largely determine whether your brand appears in answers from ChatGPT, Gemini, and Perplexity at all; Google AI Overviews are the exception, since they draw on ordinary Googlebot crawling. How you handle these crawlers in robots.txt has become one of the most consequential technical decisions in Generative Engine Optimization (GEO).

Key takeaways

AI crawlers come in three types: training bots that feed future models (GPTBot, ClaudeBot), search-index bots that power AI search (OAI-SearchBot, PerplexityBot), and user-triggered fetchers that grab pages live (ChatGPT-User, Perplexity-User).
Blocking a training bot asks that vendor to keep your content out of its future model versions. Blocking a search bot removes you from that engine's AI answers today. Brands competing for AI visibility should almost always allow the search and retrieval bots.
Google-Extended controls Gemini training and grounding, but it does not remove you from AI Overviews — those run on ordinary Googlebot crawling.
Robots.txt is a policy signal, not a wall: user-triggered fetchers often ignore it, and some crawlers have a poor compliance record.
Audit before you assume: Cloudflare has blocked known AI crawlers by default for new domains since mid-2025, so some brands may be invisible to AI engines without ever having chosen to be.
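As a rough illustration of the takeaways above, the sketch below pairs a sample robots.txt with a small audit using Python's standard-library `urllib.robotparser`. The policy shown (training bots blocked, search and user-triggered bots allowed) is only one possible choice, not a recommendation from any vendor. The `audit` helper and the `example.com` URL are hypothetical names introduced for illustration. The check only reports what the file says: it cannot tell you whether a given crawler actually honors the rules or whether a CDN blocks the bot before it ever reads robots.txt, and Python's matching logic may differ in edge cases from each crawler's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training crawls, stay visible to
# search-index bots and user-triggered fetchers.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Perplexity-User
Allow: /

User-agent: *
Allow: /
"""

AGENTS = [
    "GPTBot", "ClaudeBot",              # training bots
    "OAI-SearchBot", "PerplexityBot",   # search-index bots
    "ChatGPT-User", "Perplexity-User",  # user-triggered fetchers
]


def audit(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether robots_txt lets agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    for agent, allowed in audit(ROBOTS_TXT, AGENTS).items():
        print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

Under this sample policy, the two training bots come back blocked and the other four come back allowed. To audit a live site instead, the same parser can load a real file with `set_url(...)` followed by `read()`. A clean result there still says nothing about blocks applied at the CDN or firewall level.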

Three jobs, three kinds of crawler

Googlebot has one job: map the web so pages can rank. AI crawlers do three different jobs, and what you get in return for allowing each one differs.

Training bots collect large corpora for future model versions. GPTBot, ClaudeBot, and Meta-ExternalAgent sit here, along with CCBot from the nonprofit Common Crawl, whose dataset many AI labs reuse. Content crawled today may not influence answers for months, until the next model ships.
Search-index bots build the retrieval indexes behind AI search. OAI-SearchBot powers ChatGPT search, Claude-SearchBot feeds Claude's web search, and PerplexityBot builds Perplexity's index. These bots decide whether you can be cited in an answer this week.