The New Technical Signals Lesson 19 of 27

Robots.txt and AI: the decision you need to make

What you'll learn
  • How robots.txt actually works (and where it stops working)
  • A clear framework for deciding which AI crawlers to allow or block
  • Why there isn't a single right answer — and why that's the honest position

What robots.txt is, in one paragraph

robots.txt is a small text file at the root of your website that tells crawlers which parts of your site they’re allowed to access. It has been part of the web since 1994, first as an informal convention and, since 2022, as a published standard (RFC 9309). Major search engines respect it. Most reputable AI crawlers do too. The format is plain text, the rules are simple, and the file lives at yoursite.com/robots.txt.

A robots.txt rule has two parts: which crawler the rule applies to (the user-agent), and what that crawler is allowed or disallowed from accessing. A User-agent: GPTBot line followed by Disallow: / tells OpenAI’s training crawler that it isn’t allowed anywhere on your site. A User-agent: * line followed by Allow: / tells every crawler that everything is permitted. When both appear in the same file, a crawler follows the group that names it and falls back to the * group only if no group does, so the GPTBot block wins for GPTBot.

That’s the whole mechanic. Once you understand the format, you can read and edit robots.txt confidently.
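The two rules described above can be checked mechanically. As a minimal sketch, Python’s standard-library urllib.robotparser module can parse a robots.txt body and report what a given user-agent may fetch. The example.com URL and the variable names are placeholders for illustration, not part of the lesson.

```python
from urllib.robotparser import RobotFileParser

# Two groups: everything allowed by default, GPTBot blocked everywhere.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder domain.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/pricing")
other_allowed = parser.can_fetch("Googlebot", "https://example.com/pricing")

print("GPTBot allowed:", gptbot_allowed)
print("Googlebot allowed:", other_allowed)
```

GPTBot matches its own group and is refused; Googlebot has no group of its own, so it falls back to the * group and is allowed.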

The limit of what robots.txt does

Before going further, an honest caveat. robots.txt is a request, not an enforcement mechanism.

Reputable crawlers — Googlebot, GPTBot, ClaudeBot, PerplexityBot, and so on — honour robots.txt rules. They check the file, follow the directives, and stay out of areas they’ve been told to stay out of.

Bad-faith crawlers don’t. A scraper built specifically to bypass robots.txt will simply ignore it. Some less reputable AI companies have reportedly trained on content that was supposed to be blocked. The protection robots.txt gives you is real but not absolute. Good-faith actors follow it; bad-faith ones largely ignore it.
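The reason robots.txt cannot enforce anything is that the check happens inside the crawler, not on your server. The sketch below illustrates this under stated assumptions: fake_fetch is a hypothetical stand-in for a real HTTP request, and PoliteBot is an invented user-agent name.

```python
from urllib.robotparser import RobotFileParser

# A file that asks every crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for an HTTP GET; a real server would respond
    # to the request whether or not the crawler had read robots.txt.
    return f"<html>content of {url}</html>"

def polite_crawl(user_agent: str, url: str, rules: RobotFileParser):
    # A good-faith crawler consults the rules first and stops if disallowed.
    if not rules.can_fetch(user_agent, url):
        return None
    return fake_fetch(url)

def rude_crawl(url: str) -> str:
    # A bad-faith crawler never looks at robots.txt, so nothing stops it.
    return fake_fetch(url)

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

polite_result = polite_crawl("PoliteBot", "https://example.com/page", rules)
rude_result = rude_crawl("https://example.com/page")

print("polite crawler got:", polite_result)
print("rude crawler got:", rude_result)
```

The only difference between the two crawlers is a voluntary if statement. Actually refusing requests takes server-side measures, which are outside what robots.txt does.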

This matters because some readers will arrive at this lesson hoping robots.txt will protect their content from being used in AI training entirely. It won’t. What it does is state your preferences to the reputable AI companies; you are then relying on them to respect those preferences. That’s worth doing, but it isn’t a copyright shield.

The decision in three questions

Most of the angst around AI crawlers can be resolved by asking three questions in order.

Question 1: Do you want to be cited by AI tools when users ask questions about your topic?

If yes, you should allow retrieval crawlers. That means leaving OAI-SearchBot, Claude-SearchBot, PerplexityBot, and the equivalents from other AI search products unblocked. If retrieval crawlers can’t reach your site, your content can’t be quoted in AI answers — which is the opposite of what GEO is trying to achieve.

If no — for example, if your site is private, if you genuinely don’t want AI visibility, or if you’ve decided as a matter of principle that AI tools shouldn’t quote you — you can block retrieval crawlers. The consequence is that AI answers about your topic will use other sources.

For almost every business reading this course, the answer to question 1 is yes.

Question 2: Do you want your content used to train future AI models?

This is a separate question with a separate answer. Allowing training crawlers means your content may be absorbed into the next generation of ChatGPT, Claude, or other models. The model doesn’t store your pages, but it learns from them. You don’t get attribution and you don’t get paid.

Some businesses are fine with this — they see it as part of how the open web works, and they benefit indirectly when AI systems become more knowledgeable about their topic. Other businesses object — they don’t want commercial AI companies profiting from content they paid to produce, particularly without compensation.

There’s no objectively right answer. The decision depends on your view of intellectual property, your commercial position, and whether you think training contribution leads to citation visibility (it sometimes does, but not reliably).

Common positions:

  • Allow everything. You believe the benefits of broad AI inclusion outweigh the lack of attribution. Most websites still default to this position.
  • Block training, allow retrieval. You want to be cited by AI tools answering today’s questions, but you don’t want your content folded into future models without consent. This is an increasingly common middle position.
  • Block everything. You don’t want AI involvement of any kind. Defensible, but it forfeits AI visibility entirely.

Question 2 is where most of the disagreement in this field lives. There’s no consensus, and there shouldn’t be one — it’s a values question disguised as a technical one.

Question 3: Do you want AI tools to fetch specific pages when users ask them to?

The third category — user-initiated crawlers like ChatGPT-User and Perplexity-User — is the one most people don’t realise they’re making a decision about.

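These crawlers visit your site only when a specific user has explicitly asked an AI tool to look at a specific page. Blocking them is meant to stop users of those tools from asking the AI to read your content. For most websites, this is a counterproductive thing to block: the user is already engaged with you, and you’re effectively telling them you’d rather they used a different source. Note too that some vendors document that their user-initiated fetchers may not follow robots.txt, on the grounds that a person rather than an automated crawl triggered the request, so check each vendor’s crawler documentation before relying on a block here.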

Almost every website should allow user-initiated crawlers. The only reason to block them is if the content on your site is genuinely private or sensitive enough that no AI tool should ever read it under any circumstances — which is a small category.
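The three questions map directly onto a robots.txt file, which the following sketch makes explicit. It assumes the crawler names mentioned in this lesson, grouped by purpose; the lists are illustrative rather than exhaustive, and build_robots_txt is a hypothetical helper, not a standard tool.

```python
# Crawler names as mentioned in this lesson, grouped by purpose.
# These lists are illustrative and will go out of date as vendors add crawlers.
TRAINING = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
            "Bytespider", "Meta-ExternalAgent"]
RETRIEVAL = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
USER_INITIATED = ["ChatGPT-User", "Perplexity-User"]

def build_robots_txt(allow_retrieval: bool,
                     allow_training: bool,
                     allow_user_initiated: bool) -> str:
    """Turn the answers to questions 1, 2 and 3 into a robots.txt body."""
    blocked = []
    if not allow_retrieval:
        blocked += RETRIEVAL
    if not allow_training:
        blocked += TRAINING
    if not allow_user_initiated:
        blocked += USER_INITIATED

    # Default group first, then one Disallow group per blocked crawler.
    groups = ["User-agent: *\nAllow: /"]
    groups += [f"User-agent: {name}\nDisallow: /" for name in blocked]
    return "\n\n".join(groups) + "\n"

# The middle position: yes to question 1, no to question 2, yes to question 3.
middle_position = build_robots_txt(
    allow_retrieval=True, allow_training=False, allow_user_initiated=True
)
print(middle_position)
```

Answering yes to all three questions produces only the default group; answering no to all three produces a Disallow group for every crawler in the three lists.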

What a sensible robots.txt looks like

For most businesses doing GEO work, this is the configuration I’d suggest as a starting point:

# Default: any crawler without its own group below may access everything.
User-agent: *
Allow: /

# Training crawlers: blocked site-wide.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

That configuration allows everything by default, including all retrieval and user-initiated crawlers, and then specifically blocks the major training crawlers. This is the middle position: AI tools can find and cite your content today, while the training crawlers listed are asked to stay out. As with everything in robots.txt, that holds only for crawlers that honour the file, and the list covers the major training crawlers rather than every one that exists.
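One way to sanity-check a file like this before publishing it is to parse it and ask, crawler by crawler, whether access is allowed. The sketch below does that with Python’s standard-library urllib.robotparser. The example.com URL is a placeholder, and Python’s matching of user-agent names is an approximation of what each real crawler does, so treat the output as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The suggested starting configuration from this lesson.
SUGGESTED = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
"""

TRAINING = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
            "Bytespider", "Meta-ExternalAgent"]
RETRIEVAL_AND_USER = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
                      "ChatGPT-User", "Perplexity-User"]

parser = RobotFileParser()
parser.parse(SUGGESTED.splitlines())

# example.com is a placeholder; only the path matters for the check.
results = {
    name: parser.can_fetch(name, "https://example.com/")
    for name in TRAINING + RETRIEVAL_AND_USER
}

for name, allowed in results.items():
    print(f"{name:<20} {'allowed' if allowed else 'blocked'}")
```

If a crawler name is misspelled in the file, its row shows as allowed, which makes typos easy to spot.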

If you want a different position, the configuration changes accordingly:

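  • Allow everything: delete the six crawler-specific groups (each User-agent line together with its Disallow line), leaving only the User-agent: * group. AI companies can train on and retrieve from your content freely.
  • Block everything: add a User-agent and Disallow: / group for each retrieval and user-initiated crawler as well. AI tools that honour the file can neither train on nor cite your content.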

You don’t need to memorise the exact syntax. Most CMSes have a robots.txt editor in the SEO settings, and a developer can implement any of these patterns in five minutes.

A note on changing your mind

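robots.txt decisions aren’t permanent. You can change them at any time, and reputable crawlers will follow the new rules once they re-fetch the file. Crawlers cache robots.txt, so expect a delay of up to a day or so rather than an instant change.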

What you can’t do is undo the training that’s already happened. Content that was crawled and used in a model’s training before you blocked it stays in that model. The decision is forward-looking. Whatever’s already absorbed is absorbed.

This is worth being honest about — partly because it shapes the urgency of the decision (it’s not “fix this immediately or you’re doomed”), and partly because it sets realistic expectations for what blocking can and can’t achieve.

A useful mindset

robots.txt is the polite version of “no thank you.” Reputable AI companies will honour it. Bad-faith actors will ignore it. The right framing is that it tells the players who care about your wishes what those wishes are, not that it makes you invisible to the rest.

The decision is yours. Most businesses settle on a position within an afternoon once they understand the three questions clearly — and the honest answer to most of those questions is less complicated than the discourse around them suggests.


Coming up in the next module: Authority signals AI trusts. We leave the technical layer behind and look at the trust-and-authority signals AI systems use to decide who to cite. Author identity, outbound links, the role of your About page — three lessons covering the E-E-A-T layer translated for the AI era.