AI Crawlers Explained: GPTBot, ClaudeBot, PerplexityBot

The bots that visit your site

Every website is visited constantly by automated programs called crawlers, bots, or spiders. Most of the well-known ones belong to search engines — Googlebot, Bingbot, and so on. Their job is to discover and index pages so search engines can rank them.

In the last few years, a new category of crawler has joined them. These are AI crawlers — bots run by AI companies that visit websites either to gather training data for their models, or to fetch specific pages in real time when an AI system needs to answer a question.

The names you’ll hear most often:

GPTBot — OpenAI’s training crawler, used to gather content for future ChatGPT models
OAI-SearchBot — OpenAI’s retrieval crawler, used when ChatGPT searches the live web to answer questions
ChatGPT-User — OpenAI’s user-agent for direct browsing when a user (or an agentic version of ChatGPT) visits a specific page
ClaudeBot — Anthropic’s training crawler
Claude-SearchBot — Anthropic’s retrieval crawler
PerplexityBot — Perplexity’s primary crawler
Perplexity-User — Perplexity’s user-agent for direct page fetches
Google-Extended — Google’s AI training control (a robots.txt token rather than a separate crawler; the fetching itself is done by Google’s existing crawlers)
CCBot — Common Crawl, a non-profit crawler whose data is used by many AI companies to train models
Bytespider — ByteDance’s crawler (TikTok’s parent company)
Meta-ExternalAgent — Meta’s AI crawler

There are others, and new ones appear regularly. But these are the names most site owners will run into today, whether in server logs or in robots.txt files.

The crucial distinction: training versus retrieval

Not all AI crawlers do the same thing — and this is the single most important distinction in the module.

Training crawlers gather content to teach AI models. The content they collect is absorbed into a future version of the model at training time. The crawler may come back to refresh its copy, but the trained model doesn’t consult your site when it answers questions. If you block a training crawler, anything it would have collected from that point on is left out of future training.

Retrieval crawlers fetch content in real time when an AI is trying to answer a specific question. The content isn’t absorbed — it’s read, used, and forgotten. If you block a retrieval crawler, your content can’t be cited in the answers the AI gives users.

This distinction matters because the decision to allow or block each type has very different consequences.

Blocking training crawlers means your content isn’t used to train future AI models. It doesn’t affect whether current AI systems can find or cite you today. It’s a long-term decision about whether you want to contribute to model training.

Blocking retrieval crawlers means your content can’t be quoted in real-time AI answers. The effect is immediate and direct: AI tools answering questions about your topic will use other sources instead of you.

Most websites that want to be cited by AI should allow retrieval crawlers. The decision about training crawlers is genuinely separate — and we’ll look at it properly in the next lesson.

How to know which is which

The names aren’t always self-explanatory. A short reference:

Crawler	Type
GPTBot	Training
OAI-SearchBot	Retrieval
ChatGPT-User	Direct page fetch (user-initiated)
ClaudeBot	Training
Claude-SearchBot	Retrieval
PerplexityBot	Retrieval (primary)
Perplexity-User	Direct page fetch (user-initiated)
Google-Extended	Training control (robots.txt token for Google’s AI products, not a separate crawler)
CCBot	Training (Common Crawl, used by many AI companies)
Bytespider	Mixed — primarily training
Meta-ExternalAgent	Training

The user-initiated bots (ChatGPT-User, Perplexity-User) are a third category worth understanding. They fetch a single page on behalf of a specific user who asked the AI to look at it. Blocking these is roughly equivalent to telling those users they can’t share your site with the AI tool they’re using.
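The reference table above can be turned into a small lookup. The sketch below is illustrative only: the names CRAWLER_TYPES and classify_user_agent are hypothetical, the sample user-agent strings are made up for the example, and simple substring matching is an assumption that can produce false positives. Google-Extended is left out because it is a robots.txt token and does not appear as a user-agent in requests.

```python
from typing import Optional, Tuple

# Crawler token -> category, copied from the reference table in this lesson.
# Google-Extended is omitted: it is a robots.txt token, not a request user-agent.
CRAWLER_TYPES = {
    "GPTBot": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "user-initiated",
    "ClaudeBot": "training",
    "Claude-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "Perplexity-User": "user-initiated",
    "CCBot": "training",
    "Bytespider": "training",
    "Meta-ExternalAgent": "training",
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (crawler token, category) if a known token appears in the string."""
    lowered = user_agent.lower()
    for token, category in CRAWLER_TYPES.items():
        # Case-insensitive substring match; real user-agent strings wrap the
        # token in extra text such as a version number and an info URL.
        if token.lower() in lowered:
            return token, category
    return None


if __name__ == "__main__":
    # Illustrative strings only, not guaranteed to match what each bot sends.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
        "Mozilla/5.0 (compatible; Perplexity-User/1.0)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    for sample in samples:
        print(classify_user_agent(sample), "<-", sample)
```

A user-agent string is self-reported, so a match here says what a visitor claims to be, not what it has been verified to be.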

Why this matters more than it might seem

A surprising number of websites have inherited robots.txt files that block crawlers indiscriminately, sometimes because of a one-line copy-paste from a template, sometimes because someone made an aggressive decision years ago. The result is often a site that has locked itself out of AI citations without anyone realising.
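One way to check whether a robots.txt file has this problem is to run it through a parser and ask, crawler by crawler, whether the home page is fetchable. The sketch below uses Python’s standard urllib.robotparser module. The function name blocked_crawlers and both sample files are hypothetical, and Python’s parser is only an approximation: each real crawler applies its own reading of the rules.

```python
from urllib.robotparser import RobotFileParser

# Tokens from the crawler list in this lesson.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent",
]

# Hypothetical inherited file: one blanket rule that blocks everything.
BLANKET = """\
User-agent: *
Disallow: /
"""

# Hypothetical selective file: blocks some training tokens, allows the rest.
SELECTIVE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_crawlers(robots_txt, crawlers=AI_CRAWLERS, path="/"):
    """Return the crawlers that this robots.txt text disallows from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, path)]


if __name__ == "__main__":
    print("Blanket file blocks:  ", blocked_crawlers(BLANKET))
    print("Selective file blocks:", blocked_crawlers(SELECTIVE))
```

With the blanket file, every crawler in the list is reported as blocked, retrieval and user-initiated bots included. With the selective file, only the four named training tokens are.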

Equally, some websites have allowed everything by default and would prefer to make a more deliberate choice — particularly about training crawlers, where there are real reasons (intellectual property, content protection, commercial sensitivity) to want a say.

In both cases, knowing which crawler is which is the prerequisite for making the right call. The next lesson walks through that decision in detail.

A small note on detection

If you want to know which AI crawlers are actually visiting your site, there are two routes.

Server logs. Every visit to your site is recorded in your server’s access logs, including the user-agent string of whatever made the request. A service such as Cloudflare, your hosting dashboard, or a developer with access to the logs can pull these out and show you which crawlers have visited recently.

Analytics filters. Most analytics tools (including Google Analytics) filter bots out by default, so AI crawlers won’t show up there. But some hosting platforms, including managed WordPress hosts, surface AI crawler activity in their own dashboards.
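For the server-log route, a rough count takes only a few lines of Python. This is a minimal sketch under stated assumptions: the log is in the common "combined" format, where the user-agent is the last double-quoted field on each line; the function name count_ai_crawler_hits is hypothetical; and the sample lines are invented for illustration. Google-Extended is not in the token list because it does not appear as a request user-agent.

```python
import re
from collections import Counter

# Tokens from the crawler list in this lesson that appear in request user-agents.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Bytespider", "Meta-ExternalAgent",
]

# Assumption: combined log format, user-agent is the final quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler token across an iterable of log lines."""
    hits = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break  # count each request once
    return hits


if __name__ == "__main__":
    # Invented sample lines; the IPs are from documentation-only ranges.
    sample = [
        '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
        '203.0.113.5 - - [10/Jan/2025:10:00:02 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
        '198.51.100.7 - - [10/Jan/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
        '192.0.2.9 - - [10/Jan/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    ]
    for token, count in count_ai_crawler_hits(sample).most_common():
        print(token, count)
```

To run it against a real log, pass an open file object instead of the sample list. The path to the log depends on your server and host.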

You don’t need to know exactly which crawlers visit you to make sensible decisions. The list of common ones above is enough to work with. But if you’re curious, the data exists, and it’s worth a look at least once.

A useful mindset

Not all AI crawlers do the same thing. The decision to allow or block one isn’t the decision to allow or block all of them — and treating them as a single category is how most websites accidentally do the wrong thing.

If you remember one thing from this lesson, remember the training-versus-retrieval distinction. The next lesson builds the whole decision framework on top of it.

Coming up in the next lesson: Robots.txt and AI — the decision you need to make. Now that you know what each crawler does, we’ll work through the practical decision about which to allow and which to block — without picking your position for you.