Transformer & Attention Mechanism

Exam Importance

MUST | Every exam has a Transformer question (2025 Q5, 2024 Q5, Practice Q6)

Feynman Draft

Imagine you're reading a long book and someone asks: "What did the main character feel about the letter?"

You don't re-read every word. You skim for relevant parts — you pay more attention to sentences about the character and the letter, and less attention to descriptions of the weather. That's Attention（注意力机制）.

Now imagine you have 8 friends, and each one reads the book looking for something different: one tracks emotions, one tracks characters, one tracks locations, one tracks time. Then they share notes. That's Multi-Head Attention（多头注意力）— multiple "perspectives" on the same input.

The Transformer's Big Idea:

RNNs read words one by one (like reading a book left to right, can't skip ahead). This is slow. The Transformer reads ALL words at once (like seeing the whole page), then uses Attention to figure out which words relate to which. Much faster.

But wait — if you see all words at once, you lose the order! "Dog bites man" ≠ "Man bites dog". Solution: add Positional Encoding（位置编码）— a signal that tells the model "this word is in position 1, this one is position 2..."

Toy Example: "The cat sat on the mat"

With attention, when processing "sat", the model assigns weights:

"The"  → 0.05  (not very relevant)
"cat"  → 0.60  (WHO sat? very relevant!)
"sat"  → 0.10  (itself)
"on"   → 0.05  (grammar word)
"the"  → 0.05  (not very relevant)
"mat"  → 0.15  (WHERE sat? somewhat relevant)

Common Misconception: "Transformers are just faster RNNs." No — they work fundamentally differently. RNNs process sequentially (maintaining hidden state). Transformers process all positions in parallel (using attention to find relationships).

Core Intuition: Attention = learned "relevance weighting" between all pairs of inputs, processed in parallel.

Formal Definition: Scaled Dot-Product Attention

Scaled Dot-Product Attention（缩放点积注意力）:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Where:

Q (Query): "What am I looking for?"
K (Key): "What do I contain?"
V (Value): "What information do I provide?"
$d_k$: dimension of keys (scaling factor to prevent huge dot products)

The softmax creates attention weights (sum to 1) → multiply by V to get weighted combination.

Multi-Head Attention (考试高频)

Instead of one attention function, run h attention heads in parallel, each with its own learned Q, K, V weight matrices:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O$$

Why multiple heads?

Each head learns to focus on different aspects (syntax, semantics, position)
Single head would have an averaging effect — blurs different types of relationships
Multiple heads capture richer, more diverse patterns

Masked Attention in Decoder (2025 Q5a, 2024 Q5a)

What: In the decoder, when predicting token at position $t$, the attention is masked to prevent looking at positions $t+1, t+2, ...$

Why: During training, all tokens are available (teacher forcing（教师强迫）), but the model must learn to predict WITHOUT seeing the future. The mask sets future positions to $-\infty$ before softmax → attention weights become 0 for future tokens.

In plain English: It's like covering the right half of the answer sheet during an exam — you can only see what you've already written, not what comes next. This preserves the autoregressive（自回归） property: each prediction depends only on previous predictions.

Without mask:        With mask:
"I love cats"       "I love cats"
 ↕  ↕  ↕             →  →  →    (can only look LEFT)
All attend to all   Each token only attends to previous tokens

Vision Transformer (ViT) — Full Pipeline (2025 Q5b)

The Core Idea

CNNs use sliding filters to process images. ViT asks: what if we just cut the image into patches（图像块） and feed them into a standard Transformer? It turns out this works — and for large datasets, ViT matches or beats CNNs.

ViT Pipeline

The ViT Pipeline (Step by Step)

Concrete Example: 224 × 224 image, patch size = 16 × 16

Step 1: Split into patches
   224 / 16 = 14 patches per side → 14 × 14 = 196 patches total
   Each patch is 16 × 16 × 3 (RGB) = 768 values

Step 2: Linear projection (patch embedding)
   Each patch (768 values) → linearly projected to a D-dimensional vector
   This is NOT just flattening — it's a learned linear layer
   Output: 196 vectors of dimension D

Step 3: Prepend [CLS] token
   Add one learnable vector at position 0
   Sequence is now: [CLS], patch_1, patch_2, ..., patch_196
   Total: 197 tokens

Step 4: Add positional embeddings
   Each of the 197 positions gets a learnable positional embedding (added, not concatenated)
   Without this: the model can't distinguish top-left patch from bottom-right

Step 5: Pass through Transformer encoder
   Standard encoder: Multi-Head Self-Attention → Add & Norm → FFN → Add & Norm
   Repeated N times (ViT-Base uses N=12)

   "Add & Norm" explained:
   - ADD = residual/skip connection: output = sublayer(x) + x
     Why: gradient flows directly through the '+' → prevents vanishing gradients in deep models
   - NORM = Layer Normalisation: normalise across features for each token
     Why: keeps activations stable → faster, more stable training

Step 6: Classification
   Take ONLY the [CLS] token's output → pass through MLP head → class prediction

Why Patches Instead of Pixels?

Self-attention complexity is O(n²) where n = number of tokens.

Approach	n (tokens)	Attention operations
Pixel-level (224×224)	50,176	~2.5 billion — impossible
Patch-level (16×16 patches)	196	~38,000 — feasible

Patches reduce the sequence length by a factor of 256, making attention computationally tractable.

The [CLS] Token — What and Why

What: A special learnable embedding prepended to the patch sequence. It has no image content initially — it starts as random values and is learned during training.

How it works: Through self-attention across all encoder layers, the [CLS] token gradually aggregates information from ALL patches into a single global representation — like a "summary" token.

Why not just use all patch outputs? You could (some variants use Global Average Pooling over all patch embeddings instead). But [CLS] is more efficient: the MLP classification head only needs to read one vector instead of processing 196 vectors.

ViT vs CNN — Key Differences (Likely Exam Comparison)

Aspect	CNN	ViT
Basic operation	Sliding filters (convolution)	Self-attention over patches
Inductive bias（归纳偏置）	Strong: locality + translation invariance built in	Weak: no assumptions about spatial structure
Small datasets	Better — inductive bias compensates for limited data	Worse — needs pre-training on large data
Large datasets	Good	Better — fewer assumptions → more flexible
Computation pattern	Local (each filter sees a small region)	Global (each patch attends to ALL other patches)
Long-range dependencies	Only in deep layers (receptive field grows with depth)	From layer 1 (full attention is global)

Common Misconception: "ViT is always better than CNN." Wrong — ViT only beats CNN when trained on large datasets (e.g., ImageNet-21k, JFT-300M). On small datasets, CNN's inductive bias gives it a significant advantage. This is why ViT models are typically pre-trained on large data then fine-tuned on smaller target datasets.

Core Intuition: ViT trades CNN's built-in assumptions (locality, translation invariance) for the Transformer's flexibility — this pays off only when you have enough data to learn those patterns from scratch.

RNN vs Transformer (2024 Q5)

Aspect	RNN	Transformer
Processing	Sequential (one token at a time)	Parallel (all tokens at once)
Order info	Implicit (from sequential processing)	Explicit (positional encoding needed)
Speed	Slow for long sequences	Fast (parallelisable)
Long-range deps	Struggles (vanishing gradients)	Good (direct attention connections)
Advantage	Natural order capture	Parallelisation + long-range attention
Drawback	Can't parallelise	Needs positional encoding, O(n²) attention

Exam answer structure for 2024 Q5:

Advantage of sequential: RNNs naturally capture sequence order through their step-by-step processing — no extra mechanism needed.
Drawback of sequential: Can't parallelise → slow for long sequences. Each step must wait for the previous one.
How Transformer fixes it: Uses embeddings to represent all positions at once (parallel), then adds positional encoding to restore order information that would otherwise be lost.

Past Exam Questions Summary

Exam	Question	What They Asked
2025 Q5a	Masked attention in decoder	Why mask? (autoregressive property)
2025 Q5b	ViT [CLS] token	What is it? Why useful? (aggregation + efficiency)
2024 Q5	RNN advantage/drawback + how Transformer fixes	Sequential processing trade-off
Practice Q6a	What is multi-head attention?	Multiple attention heads with separate Q/K/V
Practice Q6b	Why is multi-head attention useful?	Different aspects, avoids averaging

English Expression Templates

Explaining attention:

"The attention mechanism allows the model to focus on the most relevant parts of the input sequence when making predictions."
"Attention computes a weighted sum of values, where weights reflect the relevance of each input position."

Explaining masked attention:

"Masking prevents each position from attending to future tokens, ensuring predictions depend only on known outputs."
"This preserves the autoregressive property during training."

Explaining multi-head:

"Multi-head attention runs several attention functions in parallel, each focusing on different aspects of the input."
"This is beneficial because a single head would have an averaging effect over all types of relationships."

Architecture Diagrams

Transformer Encoder Block:

Transformer Encoder Block

Scaled Dot-Product Attention:

Scaled Dot-Product Attention

中文思维 → 英文输出

中文思路	考试英文表达
注意力就是给每个位置加权	"The attention mechanism computes a weighted sum of values, where the weights reflect the relevance of each input position to the current query."
多头是为了关注不同方面	"Multi-head attention runs several attention functions in parallel, each with its own learned projections, allowing the model to focus on different aspects simultaneously."
遮蔽是为了不看未来的token	"Masking prevents each position from attending to future tokens, preserving the autoregressive property during training."
CLS token聚合所有patch的信息	"The [CLS] token aggregates information from all image patches through self-attention, providing a global representation for classification."
ViT需要大数据才比CNN好	"ViT outperforms CNN only when trained on large-scale datasets; on small datasets, CNN's stronger inductive bias is advantageous."

本章 Chinglish 纠正

Chinglish (避免)	正确表达
"Attention can let model focus on important part"	"The attention mechanism enables the model to dynamically focus on the most relevant parts of the input"
"Mask is for preventing cheat"	"Masking prevents information leakage from future tokens during training"
"ViT is cut picture to small pieces"	"ViT splits the image into non-overlapping patches and processes them as a sequence of tokens"

Whiteboard Self-Test

Can you draw the Transformer encoder block (self-attention → add&norm → FFN → add&norm)?
Can you explain Q, K, V in one sentence each?
Can you explain masked attention and WHY it's needed?
Can you explain the [CLS] token in ViT?
Can you explain why multi-head attention is better than single-head?
Can you compare RNN vs Transformer in 3 bullet points?