COMPSCI 714 β A4 Cheatsheet (Print Both Sides)
SIDE 1: FORMULAS + DIAGNOSIS + CONCEPTS
1. CNN Dimension Formulas (MUST β every exam)
Conv output: floor((n + 2p - f) / s) + 1 Γ num_filters
Pool output: floor((n - f) / s) + 1 Γ same_depth
Flatten: H Γ W Γ C
- Valid padding: p = 0 (output shrinks)
- Same padding: output H,W = input H,W (p chosen automatically)
- MaxPool vs AvgPool: same output dimensions, only values differ
- Depth after Conv = number of filters; depth after Pool = unchanged
Worked example (2025 Q6):
[35,35,3] βConv(valid,k=7,s=2)β floor((35-7)/2)+1=15 β [15,15,10]
βPool(k=2,s=2)β floor((15-2)/2)+1=7 β [7,7,10]
βConv(same,k=3,s=1)β same H,W β [7,7,20]
βPool(k=2,s=2)β floor((7-2)/2)+1=3 β [3,3,20]
βFlatten: 3Γ3Γ20 = 180
2. Bias-Variance Diagnosis (MUST β ~20% of marks)
| Symptom | Diagnosis | Name |
|---|---|---|
| Train HIGH, Val HIGH | High bias | Underfitting |
| Train LOW, Val HIGH (gap) | High variance | Overfitting |
| Train LOW, Val LOW | Good fit | β |
Fixes for OVERFITTING (high variance):
- More data / data augmentation β
- L2 regularisation (penalises large weights) β
- Dropout (randomly deactivates neurons) β
- Batch normalisation (regularising effect) β
- Reduce model size β
- Early stopping β
- More epochs β (worsens it!)
Fixes for UNDERFITTING (high bias):
- Increase model size (more layers/neurons) β
- More/better features β
- Train longer β
- Reduce regularisation β
- Dropout β (constrains already limited model!)
- Zero initialisation β (symmetry problem β all neurons learn same thing)
3. Evaluation Metrics (HIGH)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) "of predicted +, how many correct?"
Recall = TP / (TP + FN) "of actual +, how many found?"
F1 = 2 Γ (P Γ R) / (P + R)
Trap: High accuracy + low recall β class imbalance. Model predicts majority class. Trap: 100% recall + low precision β model predicts everything as positive.
4. Learning Rate Curves (HIGH)
| Curve shape | LR | Reason |
|---|---|---|
| Diverges (loss goes up) | Too high (e.g. 0.5) | Big updates overshoot |
| Fast converge β high loss | Slightly high (e.g. 0.1) | Overshoots optimum |
| Fast converge β low loss | Good (e.g. 0.01) | Just right |
| Very slow descent | Too small (e.g. 0.001) | Tiny updates |
Momentum: exponentially decaying average of past gradients β smoother updates, faster convergence.
LR Schedule (e.g. exponential decay): start high for fast progress, reduce to fine-tune near optimum.
Adam: momentum + adaptive per-parameter learning rates.
5. Activation Functions (MED)
| Function | Output range | Use case |
|---|---|---|
| ReLU | [0, β) | Hidden layers (default) |
| LeakyReLU | (-β, β) | Hidden (fixes dying ReLU) |
| Sigmoid | (0, 1) | Output: binary / multi-label |
| Softmax | (0,1) sums to 1 | Output: multi-class (one label) |
| Tanh | (-1, 1) | Hidden (zero-centred) |
Dying ReLU: negative input β output=0 β no gradient β neuron "dies". LeakyReLU fix: small slope (e.g. 0.01x) for negative inputs β neuron still gets gradient.
KEY: Multi-label (multiple outputs ON) β sigmoid. Multi-class (exactly one) β softmax.
6. Data Preprocessing (MUST)
| Step | Implies about raw data |
|---|---|
| Median imputer | Numerical, has missing values, possibly skewed/outliers |
| Most-frequent imputer | Categorical, has missing values |
| Standardisation | Attributes on different scales |
| Log transform | Heavy-tailed distribution |
| One-hot encoding | Categorical, no ordinal relationship, not too many categories |
| Remove attribute | >99% missing values β imputation creates misleading info |
When to remove vs impute:
- Remove: vast majority missing (e.g. 9995/10000)
- Impute: reasonable number missing (e.g. 15/10000)
Outlier detection: extreme min/max relative to mean+std β likely outliers
SIDE 2: ARCHITECTURES + ANSWER TEMPLATES
7. Transformer / Attention (MUST)
Self-attention: weighted sum of Values, where weights = relevance between Query and Key.
Multi-head attention: multiple attention heads with separate Q/K/V β each focuses on different aspects. Outputs concatenated.
Masked attention (decoder): prevents attending to future tokens β preserves autoregressive property during training (predict next token based only on previous).
Positional encoding: needed because Transformer processes all tokens in parallel β loses order information. Added to embeddings.
ViT [CLS] token: learnable token prepended to patch sequence β aggregates info from all patches via attention β fed to MLP for classification. Advantage: efficient, no need for global pooling.
8. RNN / LSTM (MED)
RNN: h_t = f(WΒ·h_{t-1} + UΒ·x_t + b). Sequential processing.
- Advantage: naturally captures order
- Drawback: can't parallelise β slow for long sequences
- Problem: vanishing gradients (long-range dependencies lost)
LSTM: 3 gates (forget, input, output) control information flow β solves vanishing gradient.
How Transformer fixes RNN drawback: processes all positions in parallel via embeddings + adds positional encoding for order info.
9. DNN Training Challenges (MED)
Why deep nets are hard to train:
- Vanishing/exploding gradients
- More prone to overfitting
- Longer training time
Strategies to help:
- Skip connections / ResNet (y = F(x) + x, gradients flow through shortcut)
- Batch normalisation (normalises activations, keeps gradients healthy)
- Better optimisers (Adam, RMSProp)
- LSTM/GRU for sequences
Batch Norm effects: speeds up training, reduces vanishing gradients, regularisation effect, reduces sensitivity to weight initialisation.
10. Answer Templates
"Will this improve validation accuracy?"
[YES/NO], [suggestion] is [likely/unlikely] to improve validation accuracy.
The model is currently [overfitting/underfitting], as evidenced by
[train acc X% vs val acc Y%].
[Suggestion] [helps/does not help] because [mechanism].
"Explain a concept"
[Concept] is [one-sentence definition].
It works by [mechanism].
This is [beneficial/important] because [why].
"Interpret loss curves / metrics"
The [curve/metric] shows [observation].
This indicates [diagnosis].
This is because [cause].
CNN calculation
Layer: Input [H,W,C]
formula: floor((H+2p-f)/s)+1 = ...
Output: [H',W',C']
β next layer...
β Flatten: HΓWΓC = answer
11. Key English Phrases (copy-paste ready)
| Situation | Write this |
|---|---|
| Overfitting | "The model is overfitting, as training acc (X%) is much higher than validation acc (Y%)." |
| L2 helps | "L2 regularisation penalises large weights, encouraging a simpler, more generalisable model." |
| Dropout hurts underfitting | "Dropout will NOT help because the model is underfitting β it further constrains an already limited model." |
| More epochs hurts | "More epochs will worsen overfitting as the model continues to memorise training noise." |
| Class imbalance trap | "Despite high accuracy, the model is ineffective due to class imbalance β it achieves accuracy by predicting the majority class." |
| RNN advantage | "RNNs naturally capture sequential order during training." |
| RNN drawback | "Sequential processing prevents parallelisation, making training slow for long sequences." |
| Transformer fix | "The Transformer processes all positions in parallel via self-attention, using positional encoding to retain order information." |
12. Common Traps to Avoid
- Multi-label β multi-class β sigmoid, NOT softmax
- High recall + low precision = predicting everything positive (not a good model)
- Regularisation fights overfitting, NEVER helps underfitting
- Zero initialisation β symmetry problem β all neurons learn the same
- More epochs β more overfitting, not less
- MaxPool vs AvgPool β same dimensions, different values
- Accuracy alone is misleading with imbalanced classes
13. Marks-per-minute priority
1. Bias/Variance diagnosis + fixes ~20% β ALWAYS on exam
2. CNN dimension calculation ~15% β ALWAYS on exam
3. Transformer/Attention concepts ~15% β ALWAYS on exam
4. Data preprocessing reasoning ~15% β ALWAYS on exam
5. Learning rate curve matching ~10% β usually on exam
6. Confusion matrix metrics ~10% β usually on exam
7. Activation functions ~5%
8. RNN vs Transformer ~5%
9. Batch Norm / DNN training ~5%