Training data
The large corpus of text and information used to train AI language models, which shapes their knowledge and the brands they reference.
Training data refers to the massive datasets of text used to train large language models (LLMs). The composition of training data directly influences which brands, facts, and perspectives an AI model can reference in its outputs.
What training data includes
LLM training data typically comes from a mix of sources (an illustrative sketch of such a mixture follows this list):
- Web pages crawled from the internet (Common Crawl, etc.)
- Books and academic papers
- Wikipedia and other reference sources
- Code repositories
- News articles and press coverage
- Social media (in some cases)
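
Publicly documented pretraining corpora are usually described as a weighted mixture of source types like these, with each training document sampled from one of them. The sketch below shows what such a mixture specification might look like; the source names and weights are purely illustrative, not the recipe of any real model.

```python
import random

# Illustrative pretraining mixture: source -> sampling weight.
# Hypothetical values for demonstration, not any real model's recipe.
MIXTURE = {
    "common_crawl_web": 0.60,
    "books": 0.12,
    "academic_papers": 0.08,
    "wikipedia": 0.05,
    "code_repositories": 0.10,
    "news_articles": 0.05,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document, proportional to its weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_source(rng) for _ in range(10_000)]
    for source in MIXTURE:
        print(f"{source:>20}: {draws.count(source) / len(draws):.1%}")
```

A source's weight in the mixture is one reason heavily cited web pages and Wikipedia entries can carry outsized influence on what a model learns about a brand.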
Training data and brand visibility
A brand's representation in training data affects several dimensions (a probing sketch follows this list):
- Knowledge: Whether the AI "knows" about your brand at all
- Accuracy: Whether information about your brand is current and correct
- Sentiment: Whether the training data skews positive or negative about your brand
- Context: What associations the AI makes with your brand
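
One way to audit these dimensions is to probe a model directly and review its answers. Below is a minimal sketch using the OpenAI Python client; the brand, prompts, and model name are placeholder assumptions, and a real audit would run many paraphrased prompts per dimension and score the responses.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BRAND = "ExampleCo"  # hypothetical brand

# One probe per dimension: knowledge, accuracy, sentiment, context.
PROBES = [
    f"What is {BRAND}?",
    f"When was {BRAND} founded, and what does it sell?",
    f"What do people generally think of {BRAND}?",
    f"Which companies or products is {BRAND} most associated with?",
]

for prompt in PROBES:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whichever model you audit
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Q: {prompt}\nA: {response.choices[0].message.content}\n")
```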
The training data gap
LLMs have a knowledge cutoff, a date after which no new text enters their training data. This means:
- New brands or products may be entirely absent from the model's knowledge
- Recent developments about established brands may be missing
- Real-time web search (used by Perplexity and ChatGPT Search) partially closes this gap, as sketched after this list
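
Retrieval works by fetching fresh documents at query time and injecting them into the prompt, so the model answers from text it was never trained on. A minimal sketch of that flow, where search_web is a hypothetical stub standing in for a real search API:

```python
def search_web(query: str) -> list[str]:
    """Hypothetical stub: a real system would call a search API here."""
    return ["ExampleCo launched its Atlas product line in 2025 ..."]  # fake snippet

def build_prompt(question: str) -> str:
    """Prepend retrieved snippets so the model can answer past its cutoff."""
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer using the sources below; they may postdate your training data.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("What did ExampleCo launch in 2025?"))
```

This is why a brand absent from training data can still surface in AI answers, provided its pages rank well in the underlying search index.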
Influencing training data
While you can't directly control what goes into training data, you can:
- Publish authoritative, factual content about your brand
- Earn coverage from major publications and trusted sources
- Maintain accurate information across Wikipedia, industry databases, and review sites
- Ensure your content is accessible to AI crawlers (see the robots.txt check below)
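
That last point is checkable mechanically: AI crawlers identify themselves with user-agent tokens (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl, PerplexityBot for Perplexity, and Google-Extended as Google's AI-training control token), and your robots.txt determines whether they may fetch your pages. A small check using Python's standard library:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # replace with your own domain
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Google-Extended"]

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for agent in AI_CRAWLERS:
    verdict = "allowed" if parser.can_fetch(agent, f"{SITE}/") else "blocked"
    print(f"{agent:>16}: {verdict}")
```

Blocking these crawlers can keep your pages out of future training corpora and out of search-backed AI answers, so the decision is worth making deliberately.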
