How LLMs Read Your Website: Crawling, Training, and Retrieval

Why this is the first technical lesson

Almost everything in GEO comes down to a single question: how does an AI system actually know anything about your website?

The honest answer is that there isn’t one mechanism — there are three. They work differently, they happen at different times, and they have different implications for how you publish. Once you understand the three, most GEO advice you’ll read starts to make sense in context. Without that foundation, even good advice can feel arbitrary.

So this lesson is the load-bearing wall of the whole course. It’s worth taking your time with.

The three ways AI gathers information

AI systems gather information about your site through training, retrieval, and direct browsing. They sound similar. They’re not.

1. Training

When an AI model is built, it’s trained on a vast amount of text — much of it scraped from the public web at a specific point in time. Once that training is finished, the model has a kind of frozen snapshot of what the web looked like up to that date.

If your website was online and crawlable during the training window, some of its content may have been absorbed into the model itself. The AI doesn’t store your pages as files — it learns patterns from them, in the same way you might learn from reading a book without being able to quote it word for word.

A few things follow from this:

Training is historical. Anything you publish today won’t be in the current generation of models. It might be in the next one — if there is a next one, and if your content is still online when it’s built.
Training is slow. Major model releases are typically months apart, and sometimes a year or more. You can’t influence training in any direct way except by publishing good content and hoping it’s included next time.
Training is lossy. The model doesn’t remember your exact words. It remembers patterns, associations, and the broad sense of what was written about a topic.

You can’t tune for training. You can only show up consistently over time and hope to be part of the next snapshot.
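One part of “crawlable” can be checked directly: whether your robots.txt lets a given crawler fetch a page. The sketch below uses Python’s standard urllib.robotparser against a made-up robots.txt. The file contents, the example.com URL, and the function name crawl_permissions are all assumptions for illustration. The crawler names are ones their operators have published, but they change over time, so verify them before relying on this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, for illustration only. Replace with your own file's text.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

# User-agent names published by crawler operators (assumed current; verify them).
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"]


def crawl_permissions(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether robots_txt allows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    page = "https://example.com/guides/pricing"  # hypothetical URL
    for agent, allowed in crawl_permissions(SAMPLE_ROBOTS_TXT, page, AI_CRAWLERS).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Being allowed in robots.txt only means a crawler may fetch the page. It doesn’t mean the page was fetched, or that it ended up in any training set.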

2. Retrieval

When you ask ChatGPT, Claude, or Perplexity a question today, the tool often doesn’t rely on training alone. It can search the live web, read what it finds, and use that material to generate an answer in real time.

This is retrieval — and it’s where most current GEO opportunity lives.

If your content is well-structured, clearly written, and answers the question the user just asked, it can be retrieved and cited soon after it’s published, often the same day, once the search index the AI relies on has picked it up. You don’t have to wait for the next training cycle. You don’t need to be famous. You just need to be the clearest, most relevant answer when the AI looks.

Retrieval has very different properties from training:

It’s fast. New content can be retrieved the same day it goes live.
It’s specific. The AI is looking for an answer to a particular question, not building general knowledge.
It’s transparent. Most retrieval-based answers cite their sources, which means you can see who’s being chosen and learn from it.

If you take only one thing from this lesson: most of the practical work in GEO is preparing your content to be retrieved well. Training is largely outside your control. Retrieval isn’t.
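To make “the clearest, most relevant answer” concrete, here is a toy sketch of the ranking step in retrieval. It scores each passage by how many of the question’s meaningful words it contains. This is a deliberate simplification, not how any particular product works: real systems use search indexes and more sophisticated relevance models. The question, URLs, passages, and stopword list are all invented for the example.

```python
import re

# Small, hand-picked stopword list (an assumption; real systems use richer methods).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "for",
             "how", "what", "do", "does", "i", "my", "and", "it"}

QUESTION = "How long does a sourdough starter take to mature?"

# Hypothetical pages and text, for illustration only.
PASSAGES = {
    "https://example.com/starter-timeline":
        "A sourdough starter takes about seven days to mature at room temperature.",
    "https://example.com/blog-welcome":
        "Welcome to our bakery blog, where we share stories about bread and life.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct non-stopword tokens."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def rank_passages(question: str, passages: dict[str, str]) -> list[tuple[str, float]]:
    """Score each passage by the fraction of question tokens it contains, best first."""
    q_tokens = tokenize(question)
    scored = []
    for url, text in passages.items():
        overlap = q_tokens & tokenize(text)
        score = len(overlap) / len(q_tokens) if q_tokens else 0.0
        scored.append((url, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for url, score in rank_passages(QUESTION, PASSAGES):
        print(f"{score:.2f}  {url}")
```

Even in this crude version, the passage that states the answer in the question’s own terms outranks the one that only sets a mood. That’s the practical case for direct, self-contained writing.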

3. Direct browsing

The newest of the three. Some AI tools — agentic ones especially — now visit websites directly while completing a task. They might be researching a purchase, planning a trip, or compiling a report, and they’ll open and read individual pages along the way.

This is closer to how a human visitor uses your site, except the visitor is an AI acting on someone else’s behalf. The content it reads informs the response that goes back to the user.

Direct browsing is still emerging. It works best on sites that are easy to navigate, fast to load, and structured so a non-human reader can find what it’s looking for quickly. Most of the advice that applies to retrieval applies here too — clarity, structure, and self-contained content win.
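One rough way to see what “structured so a non-human reader can find what it’s looking for” means is to pull out a page’s outline the way a simple program would. The sketch below uses Python’s standard html.parser to collect the title and the h1 to h3 headings. The sample HTML and the names OutlineParser and extract_outline are assumptions made for illustration. Real agents work in many different ways, and this doesn’t model any specific one.

```python
from html.parser import HTMLParser

# Hypothetical page, for illustration only.
SAMPLE_HTML = """
<html><head><title>Sourdough Starter Guide</title></head>
<body>
<h1>How to Make a Sourdough Starter</h1>
<p>Mix equal parts flour and water, then feed daily.</p>
<h2>How long does it take?</h2>
<p>About seven days at room temperature.</p>
</body></html>
"""


class OutlineParser(HTMLParser):
    """Collects (tag, text) pairs for the title and h1-h3 headings."""

    TRACKED = {"title", "h1", "h2", "h3"}

    def __init__(self) -> None:
        super().__init__()
        self.outline: list[tuple[str, str]] = []
        self._current: str | None = None  # tracked tag we are currently inside
        self._buffer: list[str] = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so the outline reads cleanly.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.outline.append((tag, text))
            self._current = None


def extract_outline(html: str) -> list[tuple[str, str]]:
    """Return the page's title and headings in document order."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    for tag, text in extract_outline(SAMPLE_HTML):
        print(f"{tag}: {text}")
```

If the outline that comes back reads like a sensible table of contents for the page, a machine reader has something to navigate by. If it comes back empty or misleading, that’s a hint the structure lives only in the visual design.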

Why the three matter together

The three mechanisms aren’t competing. They’re layered.

A well-built site shows up in training over time, gets retrieved quickly when something new is published, and is easy to navigate when an AI agent visits directly. The same underlying work — clear writing, good structure, accurate metadata, trustworthy signals — serves all three.

This is why GEO tends to feel less like a new discipline and more like an extension of good web craft. The mechanisms are new. The work is familiar.

A useful mindset

Stop asking “how do I get into ChatGPT?” Start asking “how do I make my content easy to gather, by any system, in any of the three ways it might be gathered?”

The second question has practical answers. The first doesn’t.

Coming up in the next lesson: The difference between ranking and being cited. SEO is about being found in a list. GEO is about being chosen as the answer. We’ll look at why that difference changes everything about how you structure content.