Training data
The large corpus of text and information used to train AI language models, which shapes their knowledge and the brands they reference.
Training data refers to the massive datasets of text used to train large language models (LLMs). The composition of training data directly influences which brands, facts, and perspectives an AI model can reference in its outputs.
What training data includes
LLM training data typically comes from a mix of sources (an illustrative sketch of such a mixture follows this list):
- Web pages crawled from the internet (Common Crawl, etc.)
- Books and academic papers
- Wikipedia and other reference sources
- Code repositories
- News articles and press coverage
- Social media (in some cases)
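
Publicly documented pretraining corpora are usually described as a weighted mixture of source types like these, with each training document sampled from one of them. The sketch below shows what such a mixture specification might look like; the source names and weights are purely illustrative, not the recipe of any real model.

```python
import random

# Illustrative pretraining mixture: source -> sampling weight.
# Hypothetical values for demonstration, not any real model's recipe.
MIXTURE = {
    "common_crawl_web": 0.60,
    "books": 0.12,
    "academic_papers": 0.08,
    "wikipedia": 0.05,
    "code_repositories": 0.10,
    "news_articles": 0.05,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document, proportional to its weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_source(rng) for _ in range(10_000)]
    for source in MIXTURE:
        print(f"{source:>20}: {draws.count(source) / len(draws):.1%}")
```

A source's weight in the mixture is one reason heavily cited web pages and Wikipedia entries can carry outsized influence on what a model learns about a brand.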
Training data and brand visibility
A brand's representation in training data affects several dimensions (a probing sketch follows this list):
- Knowledge: Whether the AI "knows" about your brand at all
- Accuracy: Whether information about your brand is current and correct
- Sentiment: Whether the training data skews positive or negative about your brand
- Context: What associations the AI makes with your brand
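
One way to audit these dimensions is to probe a model directly and review its answers. Below is a minimal sketch using the OpenAI Python client; the brand, prompts, and model name are placeholder assumptions, and a real audit would run many paraphrased prompts per dimension and score the responses.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BRAND = "ExampleCo"  # hypothetical brand

# One probe per dimension: knowledge, accuracy, sentiment, context.
PROBES = [
    f"What is {BRAND}?",
    f"When was {BRAND} founded, and what does it sell?",
    f"What do people generally think of {BRAND}?",
    f"Which companies or products is {BRAND} most associated with?",
]

for prompt in PROBES:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whichever model you audit
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Q: {prompt}\nA: {response.choices[0].message.content}\n")
```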
The training data gap
LLMs have a knowledge cutoff, a date after which no new text enters their training data. This means:
- New brands or products may be entirely absent from the model's knowledge
- Recent developments about established brands may be missing
- Real-time web search (used by Perplexity and ChatGPT Search) partially closes this gap, as sketched after this list
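
Retrieval works by fetching fresh documents at query time and injecting them into the prompt, so the model answers from text it was never trained on. A minimal sketch of that flow, where search_web is a hypothetical stub standing in for a real search API:

```python
def search_web(query: str) -> list[str]:
    """Hypothetical stub: a real system would call a search API here."""
    return ["ExampleCo launched its Atlas product line in 2025 ..."]  # fake snippet

def build_prompt(question: str) -> str:
    """Prepend retrieved snippets so the model can answer past its cutoff."""
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer using the sources below; they may postdate your training data.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("What did ExampleCo launch in 2025?"))
```

This is why a brand absent from training data can still surface in AI answers, provided its pages rank well in the underlying search index.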
Influencing training data
While you can't directly control what goes into training data, you can:
- Publish authoritative, factual content about your brand
- Earn coverage from major publications and trusted sources
- Maintain accurate information across Wikipedia, industry databases, and review sites
- Ensure your content is accessible to AI crawlers (see the robots.txt check below)
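
That last point is checkable mechanically: AI crawlers identify themselves with user-agent tokens (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl, PerplexityBot for Perplexity, and Google-Extended as Google's AI-training control token), and your robots.txt determines whether they may fetch your pages. A small check using Python's standard library:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # replace with your own domain
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Google-Extended"]

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for agent in AI_CRAWLERS:
    verdict = "allowed" if parser.can_fetch(agent, f"{SITE}/") else "blocked"
    print(f"{agent:>16}: {verdict}")
```

Blocking these crawlers can keep your pages out of future training corpora and out of search-backed AI answers, so the decision is worth making deliberately.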
