LLM Terminology Cheat Sheet: Comprehensive Reference for AI Practitioners

This reference distils the essential terminology of large language model (LLM) research and engineering into a single, accessible guide. From architectures and core components to training strategies and evaluation benchmarks, it provides the precise definitions needed to navigate technical papers, model documentation, and benchmark results.

This article was contributed by Bin Liang, Algorithm Lead at NetMind.AI.


If you work with large language models, you know the jargon can get dense fast. This guide puts the most important terms, from architectures and core components to training strategies and evaluation benchmarks, in one place, with precise definitions you can rely on when reading research papers, model documentation, or benchmark results.


Model Architectures & Types

Transformer: Neural architecture based entirely on attention mechanisms, discarding recurrent (RNN) and convolutional (CNN) networks. Delivers strong performance in tasks such as machine translation, with higher parallelism and shorter training times.

Encoder–Decoder Architecture: Standard sequence transduction structure. The encoder processes the input sequence and the decoder generates the output; in the highest-performing models, the two are connected via an attention mechanism.

Decoder-Only Architecture: Transformer design using only the decoder stack, with causal self-attention restricting tokens to attend only to previous positions. Optimised for autoregressive generation (e.g., text completion, code, dialogue). Simpler and more efficient than encoder–decoder for generative tasks; used in most modern LLMs, such as GPT, LLaMA, and Qwen.

BERT (Bidirectional Encoder Representations from Transformers): Bidirectional Transformer encoder pre-trained with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). Jointly attends to left and right context. Common sizes: BERT-BASE (110M parameters) and BERT-LARGE (340M).

Masked Language Model (MLM): Pre-training objective masking ~15% of WordPiece tokens: 80% replaced with [MASK], 10% with random tokens, 10% unchanged. The model predicts the originals from both left and right context.

Next Sentence Prediction (NSP): Pre-training objective predicting whether the second sentence follows the first (50% true, 50% random).

OpenAI GPT: Left-to-right Transformer using causal self-attention, where tokens attend only to preceding context.

ELMo (Embeddings from Language Models): Concatenates features from independently trained left-to-right and right-to-left LSTMs.

DistilBERT: Compressed BERT retaining ~97% of its performance, with reduced size, cost, and latency.

ERNIE (Enhanced Language Representation with Informative Entities): Incorporates entity information from knowledge graphs into language representations.

GLaM (Generalist Language Model): MoE-based architecture outperforming GPT-3 on NLU/NLG tasks, with lower FLOPs per token and reduced training energy.

Mixture-of-Experts (MoE): Uses a gating module to dynamically select a small subset of experts (e.g., 2 of 64) per token; outputs are combined before the next layer.
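
To make the routing concrete, here is a minimal NumPy sketch of top-k gating for a single token. The `experts` list of callables and the `gate_w` matrix are hypothetical stand-ins for the expert networks and the learned gating weights; real MoE layers operate on batches and add load-balancing losses.

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Route one token through the top-k of len(experts) experts."""
    logits = x @ gate_w                        # (n_experts,) gating scores
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                       # softmax over the selected experts
    # Only the chosen experts run; their outputs are gate-weighted and summed.
    return sum(p * experts[i](x) for p, i in zip(probs, top))
```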

PaLM (Pathways Language Model): Model series with sizes from 8B to 540B parameters, trained on 780B high-quality tokens.

phi-1 / phi-1-base / phi-1-small: Family of small code-generation models trained on high-quality, textbook-like data; the variants differ in scale and fine-tuning stage (phi-1-base is the model before fine-tuning on code exercises, phi-1-small a smaller variant).

Toolformer: Learns to use external tools such as WikiSearch or machine translation APIs.

WebGPT: Browser-augmented question answering model using human feedback.

InstructGPT: Fine-tuned with human feedback to better follow instructions.


Model Components & Mechanisms

Attention Mechanism: Maps a query (Q) and key–value pairs (K, V) to an output as the weighted sum of values, with weights determined by a compatibility function.

Query (Q), Key (K), Value (V): Vector components in the attention function.

Scaled Dot-Product Attention: Computed as softmax(QKᵀ / √dₖ)V; the √dₖ scaling keeps dot products from growing with dimension and pushing the softmax into regions with vanishing gradients.
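
As a concrete reference, a minimal NumPy implementation of the formula above (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (seq_q, d_v)
```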

Additive Attention: Uses a feed-forward network to compute compatibility scores; less efficient than dot-product attention in practice.

Self-Attention: Each token attends to all others in the sequence, capturing long-range dependencies.

Attention Heads: Parallel attention functions in multi-head attention, each able to capture a different kind of dependency.

Transformer Blocks: The repeated layer units in Transformer-based models such as BERT.

Layers (L), Hidden Size (H), Self-Attention Heads (A): Core parameters defining a Transformer architecture.

WordPiece Embeddings: Subword tokenisation used in BERT, with a 30K-token vocabulary.

[CLS] Token: Special classification token placed at sequence start; its final hidden state represents the sequence for classification.

[SEP] Token: Separator token distinguishing sentences or segments.

Token, Segment, and Position Embeddings: Elements combined to form token input representations.

Rotary Position Embedding (RoPE): Encodes positional information via rotation, preserving vector norms.

Rotary Matrix: Predefined matrix in RoPE for applying rotations.
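
The two entries above can be made concrete with a small sketch of RoPE applied to one query or key vector. This assumes the adjacent-pair convention for choosing which dimensions rotate together; implementations also exist that pair the two halves of the vector.

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]                                   # must be even
    theta = pos * base ** (-np.arange(0, d, 2) / d)   # (d/2,) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]                         # pair up dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                   # plain 2-D rotation per pair,
    out[1::2] = x1 * sin + x2 * cos                   # so the vector norm is preserved
    return out
```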

Linear Attention: Self-attention variant with linear complexity, compatible with RoPE.

Content-Based Key Vectors: Key vectors computed from token content via a dedicated weight matrix, as in relative-attention schemes such as Transformer-XL.

Location-Based Key Vectors: Key vectors computed from positional encodings via a separate weight matrix in the same schemes.

Bitwise Determinism: Ensures exact reproducibility at bit level from any checkpoint.

Outliers: Extreme weight or activation values that reduce quantisation precision; mitigated by using a separate quantisation constant per block.

Vector-Wise Quantisation: Improves quantisation by applying scaling per vector.

NF4 (NormalFloat 4-bit): Symmetric 4-bit datatype optimised for normally distributed weights.

FP8: 8-bit floating-point datatype for scaling constants.

Blocksize: Number of elements sharing one quantisation constant; trades precision against memory.
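
The quantisation entries above fit together as in this toy block-wise absmax quantiser. It uses uniform integer levels rather than NF4's normal-quantile levels, and the block size of 64 is illustrative; this is a sketch of the idea, not of any particular library.

```python
import numpy as np

def quantise_blockwise(w, blocksize=64, bits=4):
    """Absmax-quantise a flat float array in independent blocks."""
    levels = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit
    blocks = w.reshape(-1, blocksize)                   # assumes size % blocksize == 0
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one constant per block, so an
    scales[scales == 0] = 1.0                           # outlier only hurts its own block
    return np.round(blocks / scales * levels).astype(np.int8), scales

def dequantise_blockwise(q, scales, bits=4):
    levels = 2 ** (bits - 1) - 1
    return q.astype(np.float32) / levels * scales
```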

Trainable Matrices: Low-rank matrices updated in LoRA fine-tuning.

Frozen Weights: Pre-trained weights kept fixed during tuning.

d_model: Transformer layer input/output dimension.

LoRA (Low-Rank Adaptation): A fine-tuning method that injects trainable low-rank matrices into pre-trained models while keeping original weights frozen.

Rank (r): Rank of LoRA matrices.

d_ffn: Dimension of the Transformer MLP, usually 4 × d_model.

Subspace Similarity: Measures overlap between ∆W and W subspaces.
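
Putting the LoRA entries together, a minimal NumPy sketch of a LoRA-adapted linear layer; the r and alpha values are illustrative hyperparameters:

```python
import numpy as np

class LoRALinear:
    """y = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trained."""
    def __init__(self, W0, r=8, alpha=16):
        d_out, d_in = W0.shape
        self.W0 = W0                               # frozen pre-trained weight
        self.A = 0.01 * np.random.randn(r, d_in)   # trainable; random Gaussian init
        self.B = np.zeros((d_out, r))              # trainable; zero init, so the
        self.scale = alpha / r                     # update B @ A starts at zero

    def __call__(self, x):
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))
```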

Segment-Level Recurrence Mechanism: Transformer-XL technique reusing hidden states to extend context.

Relative Positional Encodings: Encodes relative distances into attention scores.

Bi-Encoder Architecture: Uses separate encoders for queries and documents.

Maximum Inner Product Search (MIPS): Retrieves top-k documents with highest similarity scores.
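
A brute-force MIPS sketch over a matrix of document embeddings; production systems replace the exhaustive scan with approximate-nearest-neighbour indexes:

```python
import numpy as np

def mips_top_k(query, doc_embeddings, k=5):
    """Indices of the k documents with the highest inner product, best first."""
    scores = doc_embeddings @ query            # (n_docs,) similarity scores
    top = np.argpartition(-scores, k)[:k]      # unordered top-k (requires k < n_docs)
    return top[np.argsort(-scores[top])]       # sort the k winners by score
```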

Non-Parametric Memory: Document index in retrieval-augmented models.


Training Methods & Strategies

Pre-Training: Initial training on unlabelled data using objectives such as MLM and NSP.

Fine-Tuning: Adapting a pre-trained model to a specific task with labelled data.

Semi-Supervised Approach: Combines unsupervised pre-training and supervised fine-tuning for transferable representations.

Two-Stage Training Procedure: Unsupervised pre-training followed by supervised adaptation.

Unsupervised Objective: Pre-training target that does not require labels.

Denoising Objective: Reconstructs corrupted inputs.

Span-Corruption Objective: Masks contiguous token spans, predicting the originals.

MASS-Style Objective: Masks 15% of tokens, replacing them with mask tokens, then reconstructs the original sequence (from MASS, Masked Sequence-to-Sequence pre-training).

BERT-Style Denoising Objective: Similar to MLM but used in encoder–decoder models to reconstruct full sequences.

Deshuffling Objective: Predicts the original order from shuffled tokens.

Multi-Task Training: Trains on multiple tasks concurrently.

Multi-Task Pre-Training: Pre-trains across multiple tasks.

Instruction Tuning: Fine-tunes on datasets reformatted as natural language instructions.

Prompt Engineering: Optimising prompts for desired outputs.

LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning with low-rank matrices merged into frozen weights.

Parameter-Efficient Approach: Fine-tuning with fewer trainable parameters.

Random Gaussian Initialisation: LoRA matrix A initialisation method.

Zero Initialisation: LoRA matrix B initialisation producing zero ∆W.

Bias-Only / BitFit: Tunes only bias terms.

Prefix-Embedding Tuning: Inserts special tokens with trainable embeddings.

Prefix-Layer Tuning: Extends prefix tuning to layer activations.

Adapter Tuning: Inserts adapter layers between attention/MLP modules and residuals.

Prefix-Tuning: Optimises a continuous task-specific prefix vector without changing model weights.

Continuous Task-Specific Vectors: Learnable prefix parameters not tied to real tokens.

Virtual Tokens: Prefix vectors treated as pseudo-tokens.

Random Initialisation: Randomly initialising prefixes; less effective than real-token initialisation.

QLoRA (Quantised LoRA): LoRA fine-tuning on top of a base model quantised to NF4, with the quantisation constants themselves quantised (double quantisation, using FP8 constants) and double dequantisation in the forward pass.

Double Dequantisation: Two-step dequantisation in QLoRA's forward pass: the quantised quantisation constants are dequantised first, then used to restore the 4-bit weights to the compute datatype (e.g., BF16).

RLHF (Reinforcement Learning from Human Feedback): Aligns models using human preference data.

RLAIF (Reinforcement Learning from AI Feedback): Uses AI-generated preference labels for alignment.

AI-Generated Preference Labels: AI-produced quality judgements for candidate outputs.

Reward Model (RM): Predicts reward signals for RLHF/RLAIF.

Self-Consistency: Samples multiple reasoning chains (CoT) and averages preferences.
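
The term originally comes from chain-of-thought prompting, where the final answers of several sampled chains are majority-voted; here is a minimal sketch of that variant, with `sample_fn` as a hypothetical stochastic model call returning a (reasoning, answer) pair:

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=10):
    """Sample n chain-of-thought completions and majority-vote the answer."""
    answers = [sample_fn(prompt)[1] for _ in range(n)]  # keep final answers only
    return Counter(answers).most_common(1)[0][0]        # most frequent answer wins
```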

PPO (Proximal Policy Optimisation): RL algorithm used in RLHF.

SFT (Supervised Fine-Tuning): Fine-tunes on curated supervised data.

Process Supervision: Labels intermediate reasoning steps.

Outcome Supervision: Labels only final results.

MathMix: Math-focused token dataset for pre-training.

Decontamination Checks: Ensures no benchmark leakage in training data.

Weak-to-Strong Generalisation: Trains a strong model under weak supervision.

Weak Supervisor: Model producing weak labels.

Weak Labels: Soft labels from weak supervision.

Strong Student: Model trained under weak supervision that surpasses the supervisor.

Imitation: Failure mode where the student copies supervisor errors.

Human Simulator Failure Mode: Risk that a strong model learns to imitate what its supervisor would say, errors included, rather than producing the best answer it is capable of.

Linear Probing: Adds a linear classifier to frozen models.

Covariate Shift Problem: Training/test distribution mismatch.

Concept Shift: Change in label meaning or distribution.

Noisy Labels: Incorrect labels in data.

FLOPs per Token: Inference compute cost measure.

Greedy Decoder: Decoding strategy selecting highest-probability token each step.
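
A sketch of the greedy decoding loop; `step_fn` is a hypothetical model call mapping a token-id sequence to next-token logits, and `eos_id` an assumed end-of-sequence id:

```python
import numpy as np

def greedy_decode(step_fn, prompt_ids, max_new_tokens=32, eos_id=2):
    """Deterministically extend prompt_ids by the argmax token at each step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(step_fn(ids)))  # highest-probability token
        ids.append(next_id)
        if next_id == eos_id:                   # stop at end-of-sequence
            break
    return ids
```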

AdamW Optimiser: Adam variant with decoupled weight decay; the optimiser used to train models such as LLaMA and in LoRA fine-tuning.

Cosine Learning Rate Schedule: Cosine-shaped learning rate decay.

Weight Decay: Regularisation reducing overfitting.

Gradient Clipping: Caps gradient magnitude.

Warmup Steps: Gradually increase learning rate at training start.
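
Tying the cosine-schedule and warmup entries together, a common implementation; the exact shape varies between training recipes:

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)                        # clamp past total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```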

Batch Size: Number of samples per training step.

Epoch: One full pass over the training dataset.

Learning Rate: Step size for weight updates.

Reward Function (π_rf): Source of training rewards in alignment.

Gradient Coefficient (GC): Scales penalty/reward magnitude.

RFT (Rejection Sampling Fine-Tuning): Fine-tunes on correct answers sampled from an initial model, without an explicit reinforcement-learning loop.

DPO (Direct Preference Optimisation): Preference-based alignment without reinforcement learning.
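
For reference, the DPO loss for a single (chosen, rejected) pair, written out in plain Python; the log-probabilities are summed over response tokens under the trained policy and a frozen reference model, and beta is the usual temperature hyperparameter:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin) instead of the naive formula.
    return max(0.0, -margin) + math.log1p(math.exp(-abs(margin)))
```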

Online RFT: RFT variant in which training samples are drawn from the current policy in real time rather than from a fixed initial model.

GRPO (Group Relative Policy Optimisation): PPO variant that drops the separate value (critic) model, estimating the baseline from the relative rewards of a group of sampled outputs.

Continual Training: Continues training for domain adaptation.


Evaluation Metrics & Datasets

BLEU: Translation quality metric, also used in code generation.

Pass@k: Fraction of generated code passing tests within k attempts.
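
A common unbiased estimator (popularised with HumanEval) computes pass@k from n samples per problem, c of which pass, as 1 − C(n−c, k)/C(n, k); a numerically stable sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n generated samples, c of which pass the tests."""
    if n - c < k:                # every size-k subset contains a passing sample
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a product to avoid huge factorials
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```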

GLUE Benchmark: NLU benchmark with tasks such as CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, STS-B, WNLI.

SQuAD: Reading comprehension dataset for QA.

SICK Dataset: NLI dataset with entailment, contradiction, and neutral labels.

Open Entity: Entity classification benchmark.

FewRel, TACRED: Relation classification datasets.

HumanEval: Code generation benchmark.

SuperGLUE: More challenging NLU benchmark.

WMT Language Pairs: Translation datasets for BLEU scoring.

MMLU (Massive Multitask Language Understanding): Knowledge benchmark spanning 57 subjects across STEM, the humanities, and the social sciences.

GSM8K: Grade-school maths QA benchmark.

MATH Dataset: Advanced mathematics problem set.

PRM800K: Large dataset with step-level labels for math problem solving.

RealToxicityPrompts: Toxicity evaluation dataset.

CrowS-Pairs: Measures social biases in models using minimally different sentence pairs.

TruthfulQA: Tests factual accuracy and informativeness.

CAIL2019-SCM: Chinese long-text semantic matching dataset.

HotpotQA, FEVER: Multi-hop question answering and fact-verification datasets, respectively.

MS-MARCO, Jeopardy Question Generation: Used for retrieval-augmented generation.

CMATH, AGIEval: Chinese-language benchmarks: elementary-school maths word problems and standardised human-exam questions, respectively.

ROUGE-1 / ROUGE-2 / ROUGE-L: Summarisation metrics based on unigram overlap, bigram overlap, and longest common subsequence, respectively.

EM (Exact Match): QA accuracy measure.

F1 Score: Common classification/NER metric.

Accuracy: Overall correctness measure.

Precision, Recall, Micro-F1: Entity and relation extraction metrics.
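
For concreteness, micro-averaged F1 pools true/false positives and false negatives across all classes before applying the usual formulas:

```python
def micro_f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts pooled over all entity/relation types."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct / predicted
    recall = tp / (tp + fn) if tp + fn else 0.0      # correct / gold
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```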


Other Key Concepts

Universal Representation: Features transferable to multiple tasks.

Low-Data Regime: Training with very limited data.

Length Generalisation: Performance on sequences longer than training examples.

Co-Occurrence Prompts: Analyses token co-occurrence patterns in generated text.

Temperature: Sampling parameter controlling randomness.

Top-k Sampling: Chooses from top-k tokens at each step.
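
The temperature and top-k entries combine into a single sampling step; a minimal sketch over a raw logits vector:

```python
import numpy as np

def sample_top_k(logits, k=50, temperature=0.8, seed=None):
    """Temperature-scale the logits, keep the k most likely tokens, sample one."""
    rng = np.random.default_rng(seed)
    scaled = logits / temperature              # <1 sharpens, >1 flattens
    top = np.argsort(scaled)[-k:]              # candidate token ids
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()                       # renormalise over the candidates
    return int(rng.choice(top, p=probs))
```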

POS Tagger: Identifies part of speech for tokens.

Toxicity Probability of the Prompt (TPP): Measures input prompt toxicity.

Toxicity Probability of the Continuation (TPC): Measures toxicity in model outputs.

Perspective API: Tool for assigning toxicity probabilities to text.

Toxicity Degeneration: Unwanted toxic text generation.

In-Context Learning: Learning from examples in the prompt without weight updates.

Data Contamination: Evaluation data appearing in training sets.

Calibration Curve: Plots predicted confidence vs. actual accuracy.

Substring Match: Detects overlap between evaluation and training data.

LLM Prompt: Instruction or example text for LLMs.

Self-Reflection Iterations: Iteration count for reflection-based entity detection.

Chunk Size: Size of the text chunks fed to the extraction step; smaller chunks generally surface more detected entities.

Entity Extraction: Identifies named entities and attributes.

Relationship Extraction: Identifies relationships between entities.

Leiden Algorithm: Detects communities in graph data.

Entity Nodes: Graph nodes representing entities.

Graph Communities: Groups of related entities in a graph.

Hierarchical Clustering: Reveals internal community structure.

RLAIF vs RLHF: Compares AI-feedback and human-feedback reinforcement learning.

Position Bias: Preference for specific positions in pairwise comparisons.

Pairwise Accuracy: Reward model accuracy on held-out human preferences.

ULMFiT: Universal fine-tuning method for text classification.

Model Architectures: Fundamental model design types (e.g., Transformer, MoE).

Reasoning Capability: Ability to perform logical reasoning and problem solving.

External Knowledge: Use of information outside training data.

High-Quality Data: Critical for improving performance and alignment.

Human/AI Feedback: Mechanism for improving performance and alignment.