LLM Fundamentals

Large Language Models are next-token predictors trained on massive text corpora. Everything else (reasoning, coding, conversation) is behaviour that emerges from learning to predict text well enough.


How LLMs Work (The One-Paragraph Version)

Training:
  Billions of text documents → Transformer neural network
  Task: predict the next token given all previous tokens
  Trained on enough data → model learns grammar, facts, reasoning, code

Inference:
  You send a prompt (tokens in)
  Model predicts next token → appends it → predicts next → repeat
  Until it predicts a stop token or hits max_tokens
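
The inference loop above can be sketched with a toy lookup table standing in for a real Transformer. `TOY_MODEL`, `generate`, and the `<start>`/`<stop>` tokens are invented for illustration; a real model conditions on all previous tokens, not just the last one.

```python
import random

# Toy "model": previous token -> possible next tokens with probabilities.
# Entries are made up; a real LLM computes this distribution with a neural network.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<stop>": 1.0},
}

def generate(prompt_tokens, max_tokens=10, seed=0):
    """Append one predicted token at a time until <stop> or max_tokens."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = TOY_MODEL[tokens[-1]]      # next-token distribution
        next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "<stop>":        # stop token ends generation
            break
        tokens.append(next_token)         # append, then predict again
    return tokens
```

Calling `generate(["<start>"])` walks the table until it reaches `<stop>`; lowering `max_tokens` cuts the output short, which is what a hard output cap does.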

LLMs don't "know" things the way a database does. They learn statistical associations. This is why they hallucinate — they predict a plausible-sounding next token, not necessarily a true one.


Tokens

Everything is tokens. Not words, not characters — tokens. Understanding tokens is critical for understanding costs, context limits, and model behaviour.

"Hello, world!"  →  ["Hello", ",", " world", "!"]  →  4 tokens

"ChatGPT"        →  ["Chat", "G", "PT"]             →  3 tokens

"The quick brown fox" → ["The", " quick", " brown", " fox"]  →  4 tokens

Rule of thumb:
  ~4 characters ≈ 1 token (English)
  ~0.75 words   ≈ 1 token (English)
  Non-English languages are often less efficient (more tokens per word)
  Code varies by tokenizer: symbols and indentation often cost extra tokens
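
The rule of thumb can be turned into a rough estimator. `estimate_tokens` is a hypothetical helper, not a tokenizer: it only applies the ~4 characters per token heuristic, and exact counts come only from the model's own tokenizer.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate for English text; real counts depend on the tokenizer."""
    return math.ceil(len(text) / chars_per_token)
```

This is good enough for budgeting and sanity checks. For billing-accurate numbers, count with the tokenizer that matches the model.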

Why Tokens Matter

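┌────────────────┬───────────────────────────────────────────┐
│ Impact         │ Details                                   │
├────────────────┼───────────────────────────────────────────┤
│ Cost           │ APIs charge per token (input + output)    │
│ Context window │ Max tokens the model can see at once      │
│ Latency        │ More output tokens = slower response      │
│ Truncation     │ Over-limit inputs are rejected or cut off │
└────────────────┴───────────────────────────────────────────┘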
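
Because billing is per token, a cost estimate is simple arithmetic. The sketch below assumes prices quoted per 1M tokens with separate input and output rates; `estimate_cost` is a hypothetical helper and the prices are placeholders, not real rates for any provider.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_1m, output_price_per_1m):
    """Cost in the currency of the prices, which are quoted per 1M tokens."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000

# Placeholder prices, NOT real rates: 1.00 per 1M input tokens, 4.00 per 1M output tokens.
cost = estimate_cost(input_tokens=40_500, output_tokens=2_000,
                     input_price_per_1m=1.00, output_price_per_1m=4.00)
```

Check the provider's current price list before relying on any figure like this.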

Context Window

The context window is the maximum number of tokens the model can "see" at once — your prompt, any conversation history, retrieved documents, and the response all count against it.

┌──────────────────────────────────────────────────────────┐
│                    Context Window (128k tokens)           │
│  ┌──────────┬──────────────┬──────────────┬───────────┐  │
│  │  System  │  Conversation│  Retrieved   │  Output   │  │
│  │  Prompt  │  History     │  Documents   │  (grows)  │  │
│  │  ~500    │  ~10,000     │  ~30,000     │  ~2,000   │  │
│  └──────────┴──────────────┴──────────────┴───────────┘  │
└──────────────────────────────────────────────────────────┘
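
The budget in the diagram can be checked in code. This is a sketch under the assumption that token counts for each part are already known; `remaining_tokens` and `trim_history` are hypothetical helpers, and dropping the oldest messages first is only one possible trimming policy.

```python
def remaining_tokens(context_window, parts):
    """Tokens still free after every part of the request is counted."""
    return context_window - sum(parts.values())

def trim_history(message_tokens, budget):
    """Drop the oldest messages until the rest fit within `budget` tokens."""
    kept = list(message_tokens)
    while kept and sum(kept) > budget:
        kept.pop(0)  # the oldest message is first in the list
    return kept

# Numbers taken from the diagram above.
parts = {"system_prompt": 500, "history": 10_000,
         "retrieved_docs": 30_000, "output": 2_000}
free = remaining_tokens(128_000, parts)
```

Reserving room for the output up front matters: the response counts against the same window as the prompt.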

Context Window by Model (approximate)

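┌───────────────────┬────────────────┐
│ Model             │ Context Window │
├───────────────────┼────────────────┤
│ GPT-4o            │ 128k tokens    │
│ Claude 3.5 Sonnet │ 200k tokens    │
│ Gemini 1.5 Pro    │ 1M tokens      │
│ Llama 3.1 70B     │ 128k tokens    │
│ GPT-3.5 Turbo     │ 16k tokens     │
└───────────────────┴────────────────┘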

A larger context is not always better. Models tend to lose track of information in the middle of very long contexts (the "lost in the middle" problem), so put the most important information at the start or end of the prompt.
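
One way to act on this, sketched under the assumption that retrieved documents arrive sorted from most to least relevant: alternate them between the front and the back of the prompt so the weakest ones land in the middle. `reorder_for_long_context` is a hypothetical helper, not a library function.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant docs at the start and end, the least relevant in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# "A" is the most relevant document, "E" the least.
ordered = reorder_for_long_context(["A", "B", "C", "D", "E"])
```

Here `ordered` starts with "A" and ends with "B", with "E" in the middle. Whether this helps depends on the model, so measure it on your own task.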


Inference Parameters

mindmap
  root((Inference Parameters))
    Temperature
      0.0 = deterministic greedy
      0.0-0.3 = factual precise tasks
      0.7-1.0 = creative open-ended
      >1.0 = chaotic rarely useful
    Top-p Nucleus Sampling
      Cumulative probability threshold
      top_p=0.9 consider tokens making up 90% of probability mass
      Works with temperature
    Top-k
      Only consider top k tokens
      top_k=40 common default
    Max Tokens
      Hard cap on output length
      Cost and latency control
    Stop Sequences
      Tokens that stop generation
      Useful for structured output
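
The top-k and top-p entries above can be sketched as filters over a next-token distribution. The function names and the example probabilities are assumptions for illustration; real inference stacks apply the same idea to the model's full vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Made-up distribution for the next token.
probs = {"Paris": 0.5, "Lyon": 0.3, "Nice": 0.15, "Rome": 0.05}
```

With these numbers, `top_k_filter(probs, 2)` keeps "Paris" and "Lyon", while `top_p_filter(probs, 0.9)` also keeps "Nice" because the first two tokens cover only 80% of the probability mass.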

Temperature Intuition

Temperature = 0.0 (deterministic):
  Prompt: "The capital of France is"
  Output: "Paris" (effectively every time; hosted APIs may still vary slightly)

Temperature = 0.7 (balanced):
  Prompt: "Write me a product tagline for coffee"
  Output: varies each call, creative but coherent

Temperature = 1.5 (high):
  Output: creative, unexpected, sometimes incoherent

Use low temperature for:  factual Q&A, code generation, extraction
Use high temperature for: brainstorming, creative writing, variation
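
Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch, assuming made-up tokens and logits; `softmax_with_temperature` and `sample_token` are illustrative names, and treating temperature 0 as greedy argmax is a common convention rather than a universal rule.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng=random):
    """Temperature 0 picks the top-scoring token; otherwise sample."""
    if temperature == 0:
        return tokens[max(range(len(logits)), key=lambda i: logits[i])]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs)[0]

# Made-up scores for three candidate next tokens.
tokens = ["Paris", "Lyon", "Nice"]
logits = [3.0, 1.0, 0.5]
```

At temperature 0.2 almost all probability lands on "Paris"; at 1.5 the alternatives get a real share, which is where the variety (and the occasional incoherence) comes from.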

Embeddings

An embedding is a numerical vector that captures semantic meaning. Similar meanings → similar vectors (close in vector space).

"dog"    → [0.23, -0.87, 0.14, ...]  (1536 dimensions for ada-002)
"puppy"  → [0.25, -0.85, 0.16, ...]  (very close → similar meaning)
"cat"    → [0.18, -0.71, 0.22, ...]  (moderately close → related animal)
"SQL"    → [0.91,  0.12, -0.54, ...] (far away → different domain)

Cosine Similarity

How close are two vectors?

cosine_similarity = (A · B) / (|A| × |B|)

Result: -1 (opposite direction) to 1 (same direction); 0 means unrelated (orthogonal)

"dog" vs "puppy"    → ~0.93  (very similar)
"dog" vs "cat"      → ~0.78  (related)
"dog" vs "database" → ~0.21  (unrelated)

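The formula above in plain Python, as a minimal sketch with no dependencies. It assumes both vectors have the same length and neither is all zeros; in practice a vectorized library would do this faster.

```python
import math

def cosine_similarity(a, b):
    """(A · B) / (|A| × |B|) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Only direction matters, not length: `[1, 2]` and `[2, 4]` score 1.0 even though they are different vectors.
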
What Embeddings Enable

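┌─────────────────┬────────────────────────────────────────────────────────┐
│ Application     │ How                                                    │
├─────────────────┼────────────────────────────────────────────────────────┤
│ Semantic search │ Embed query, find nearest document embeddings          │
│ RAG             │ Store doc chunks as embeddings, retrieve relevant ones │
│ Clustering      │ Group similar items by vector proximity                │
│ Recommendation  │ "Similar items" = nearby vectors                       │
│ Classification  │ Train classifier on top of embeddings                  │
└─────────────────┴────────────────────────────────────────────────────────┘

┌─────────────────────────────────┬────────────┬───────────────────────────┐
│ Model                           │ Dimensions │ Notes                     │
├─────────────────────────────────┼────────────┼───────────────────────────┤
│ text-embedding-3-small (OpenAI) │ 1536       │ Cheap, fast, very good    │
│ text-embedding-3-large (OpenAI) │ 3072       │ Better quality, more cost │
│ embed-english-v3.0 (Cohere)     │ 1024       │ Strong for retrieval      │
│ all-MiniLM-L6-v2 (HuggingFace)  │ 384        │ Open-source, runs locally │
└─────────────────────────────────┴────────────┴───────────────────────────┘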
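
Semantic search reduces to "embed the query, rank documents by similarity". The sketch below uses hand-written 3-dimensional vectors in place of real embeddings; `DOCS`, `search`, and the vector values are invented for illustration, and a real system would get vectors from one of the models above and usually store them in a vector index.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, doc_vecs, top_n=2):
    """Return the ids of the top_n documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)                 # highest similarity first
    return [doc_id for _, doc_id in scored[:top_n]]

# Toy "embeddings": made-up values, not output from any real model.
DOCS = {
    "dog-care": [0.9, 0.1, 0.0],
    "puppy-training": [0.8, 0.2, 0.1],
    "sql-joins": [0.0, 0.1, 0.9],
}
hits = search([1.0, 0.1, 0.0], DOCS)
```

RAG uses the same retrieval step and then pastes the retrieved chunks into the prompt.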

Hallucination

Hallucination = the model confidently generates false information. It's not lying — it's pattern-matching gone wrong. The model predicts plausible-sounding tokens, not necessarily true ones.

Types of hallucination:
  Factual:    "The Eiffel Tower was built in 1902" (wrong: 1889)
  Citation:   Made-up paper titles, non-existent URLs
  Entity:     A quote attributed to a real person who never said it
  Reasoning:  Confidently wrong math or logic

Why It Happens

Training data contains:
  - Errors and contradictions
  - Fictional content presented as fact
  - Out-of-date information

Inference:
  Model predicts next token based on patterns
  "Eiffel Tower built in ___" → "1889" (correct pattern)
  "The author of [obscure book] is ___" → plausible but wrong name

Mitigation Strategies

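┌────────────────────────┬───────────────────────────────────────────────────────┐
│ Strategy               │ How it helps                                          │
├────────────────────────┼───────────────────────────────────────────────────────┤
│ RAG                    │ Ground answers in retrieved real documents            │
│ Grounding instructions │ "Only use information from the provided context"      │
│ Low temperature        │ Less random → more likely to reproduce training facts │
│ Self-consistency       │ Sample multiple times, take majority vote             │
│ Citation requirements  │ Prompt model to cite sources, easier to verify        │
│ Structured output      │ Constrain format → harder to invent                   │
└────────────────────────┴───────────────────────────────────────────────────────┘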
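
Self-consistency is the easiest of these to sketch: collect several sampled answers and keep the most common one. `majority_vote` is a hypothetical helper that assumes the answers were already gathered from repeated model calls and are short enough to compare after light normalization.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the share of samples that agree.

    Assumes `answers` is non-empty.
    """
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Four sampled answers to "When was the Eiffel Tower built?" (made-up samples).
answer, agreement = majority_vote(["1889", "1889", "1902", " 1889 "])
```

A low agreement share is a useful warning sign in itself: if the samples disagree, the model is probably guessing.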

Major Model Families

┌─────────────────────┬──────────────┬────────────┬────────────────────────────┐
│ Family              │ Maker        │ Access     │ Strengths                  │
├─────────────────────┼──────────────┼────────────┼────────────────────────────┤
│ GPT-4o / o3         │ OpenAI       │ API        │ General purpose, tool use  │
│ Claude 3.5 / 4      │ Anthropic    │ API        │ Long context, coding, safe │
│ Gemini 1.5 Pro      │ Google       │ API        │ Multimodal, huge context   │
│ Llama 3.x           │ Meta         │ Open       │ Runs locally, no API cost  │
│ Mistral             │ Mistral AI   │ Open + API │ Efficient, multilingual    │
│ Qwen 2.5            │ Alibaba      │ Open       │ Coding, multilingual       │
└─────────────────────┴──────────────┴────────────┴────────────────────────────┘

Quick Reference

Tokens:
  ~4 chars ≈ 1 token (English)
  Cost and context limits are measured in tokens

Context Window:
  Everything the model can see: system prompt + history + docs + response
  "Lost in the middle" — critical info at start or end

Temperature:
  0 = deterministic (facts, code)
  0.7 = balanced (most tasks)
  1.0+ = creative (brainstorming)

Embeddings:
  Vectors that capture semantic meaning
  Similar meaning = similar vector = small cosine distance
  Power semantic search and RAG

Hallucination:
  Model predicts plausible, not necessarily true
  Mitigate with RAG, grounding, low temperature