Training data

The large corpus of text and information used to train AI language models, which shapes their knowledge and the brands they reference.

Training data refers to the massive datasets of text used to train large language models (LLMs). The composition of training data directly influences which brands, facts, and perspectives an AI model can reference in its outputs.

What training data includes

LLM training data typically comes from:

  • Web pages crawled from the internet (Common Crawl, etc.)
  • Books and academic papers
  • Wikipedia and other reference sources
  • Code repositories
  • News articles and press coverage
  • Social media (in some cases)

Training data and brand visibility

A brand's representation in training data affects:

  1. Knowledge: Whether the AI "knows" about your brand at all
  2. Accuracy: Whether information about your brand is current and correct
  3. Sentiment: Whether the training data skews positive or negative about your brand
  4. Context: What associations the AI makes with your brand

The training data gap

LLMs have a knowledge cutoff: the date after which no new text was included in their training data. This means:

  • New brands or products may not exist in the AI's knowledge
  • Recent developments about established brands may be missing
  • Real-time web search (used by Perplexity, ChatGPT Search) partially addresses this

Influencing training data

While you can't directly control what goes into training data, you can:

  • Publish authoritative, factual content about your brand
  • Earn coverage from major publications and trusted sources
  • Maintain accurate information across Wikipedia, industry databases, and review sites
  • Ensure your content is accessible to AI crawlers
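The last point can be made concrete in a site's robots.txt. Below is a minimal sketch that allows several publicly documented AI crawlers; the user-agent tokens shown (GPTBot, ClaudeBot, CCBot, PerplexityBot) are the ones the respective vendors have published, but tokens change, so verify them against each vendor's current documentation before deploying:

```
# robots.txt sketch: allow documented AI training and answer-engine crawlers.
# User-agent tokens per vendor documentation (OpenAI, Anthropic,
# Common Crawl, Perplexity); confirm current tokens before use.

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: PerplexityBot
Allow: /

# All other crawlers
User-agent: *
Allow: /
```

Note that `Allow: /` is the default behavior when no rule matches; listing the agents explicitly mainly documents intent and guards against a blanket `Disallow` elsewhere in the file accidentally blocking them.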