COMPSCI 714 — A4 Cheatsheet (Print Both Sides)


SIDE 1: FORMULAS + DIAGNOSIS + CONCEPTS


1. CNN Dimension Formulas (MUST — every exam)

Conv output:  floor((n + 2p - f) / s) + 1   × num_filters
Pool output:  floor((n - f) / s) + 1        × same_depth
Flatten:      H × W × C
  • Valid padding: p = 0 (output shrinks)
  • Same padding: output H,W = input H,W (p chosen automatically)
  • MaxPool vs AvgPool: same output dimensions, only values differ
  • Depth after Conv = number of filters; depth after Pool = unchanged

Worked example (2025 Q6):

[35,35,3] →Conv(valid,k=7,s=2)→ floor((35-7)/2)+1=15 → [15,15,10]
         →Pool(k=2,s=2)→ floor((15-2)/2)+1=7 → [7,7,10]
         →Conv(same,k=3,s=1)→ same H,W → [7,7,20]
         →Pool(k=2,s=2)→ floor((7-2)/2)+1=3 → [3,3,20]
         →Flatten: 3×3×20 = 180

2. Bias-Variance Diagnosis (MUST — ~20% of marks)

Symptom                   | Diagnosis     | Name
Train HIGH, Val HIGH      | High bias     | Underfitting
Train LOW, Val HIGH (gap) | High variance | Overfitting
Train LOW, Val LOW        | Good fit      | —

Fixes for OVERFITTING (high variance):

  • More data / data augmentation ✓
  • L2 regularisation (penalises large weights) ✓
  • Dropout (randomly deactivates neurons) ✓
  • Batch normalisation (regularising effect) ✓
  • Reduce model size ✓
  • Early stopping ✓
  • More epochs ✗ (worsens it!)

Fixes for UNDERFITTING (high bias):

  • Increase model size (more layers/neurons) ✓
  • More/better features ✓
  • Train longer ✓
  • Reduce regularisation ✓
  • Dropout ✗ (constrains an already limited model!)
  • Zero initialisation ✗ (symmetry problem — all neurons learn the same thing)
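The diagnosis table can be condensed into a rough helper; the thresholds `gap` and `good` below are illustrative choices of mine, not exam-given values:

```python
def diagnose(train_acc, val_acc, gap=0.10, good=0.90):
    """Map (train, val) accuracy to the bias/variance diagnosis.
    A big train-val gap means high variance; low train accuracy
    (with no big gap) means high bias."""
    if train_acc - val_acc > gap:
        return "high variance (overfitting)"
    if train_acc < good:
        return "high bias (underfitting)"
    return "good fit"

print(diagnose(0.99, 0.70))   # high variance (overfitting)
print(diagnose(0.60, 0.58))   # high bias (underfitting)
print(diagnose(0.95, 0.93))   # good fit
```

Check the gap first: a model can have high train accuracy and still be the overfitting case.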

3. Evaluation Metrics (HIGH)

Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)    "of predicted +, how many correct?"
Recall    = TP / (TP + FN)    "of actual +, how many found?"
F1        = 2 × (P × R) / (P + R)

Trap: High accuracy + low recall → class imbalance; the model predicts the majority class.
Trap: 100% recall + low precision → the model predicts everything as positive.
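A minimal sketch of the four formulas, reproducing the class-imbalance trap with made-up counts (990 negatives, 10 positives):

```python
def metrics(tp, tn, fp, fn):
    """Confusion-matrix metrics, exactly the formulas above."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Model that almost always predicts the majority (negative) class:
acc, p, r, f1 = metrics(tp=1, tn=989, fp=1, fn=9)
print(round(acc, 2), round(r, 2))   # 0.99 0.1
```

99% accuracy but 10% recall: exactly the "high accuracy + low recall" trap.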


4. Learning Rate Curves (HIGH)

Curve shape               | LR                       | Reason
Diverges (loss goes up)   | Too high (e.g. 0.5)      | Big updates overshoot
Fast converge → high loss | Slightly high (e.g. 0.1) | Overshoots optimum
Fast converge → low loss  | Good (e.g. 0.01)         | Just right
Very slow descent         | Too small (e.g. 0.001)   | Tiny updates

Momentum: exponentially decaying average of past gradients → smoother updates, faster convergence.

LR Schedule (e.g. exponential decay): start high for fast progress, reduce to fine-tune near optimum.

Adam: momentum + adaptive per-parameter learning rates.
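One momentum update and an exponential decay schedule, as a sketch; this is the common `v = beta*v + grad` variant, and all constants are illustrative:

```python
def momentum_step(v, grad, lr, beta=0.9):
    """One SGD-with-momentum update: v is an exponentially decaying
    average of past gradients, used in place of the raw gradient."""
    v = beta * v + grad
    return v, -lr * v              # new velocity, parameter update

def exp_decay(lr0, k, step):
    """Exponential LR schedule: start high, shrink every step."""
    return lr0 * (k ** step)

v, update = momentum_step(v=0.0, grad=1.0, lr=0.1)
print(update)                      # -0.1 on the first step (no history yet)
print(exp_decay(0.1, 0.9, 10))     # much smaller LR after 10 decay steps
```

Adam combines both ideas: a momentum-style average of gradients plus a per-parameter scaling.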


5. Activation Functions (MED)

Function  | Output range      | Use case
ReLU      | [0, ∞)            | Hidden layers (default)
LeakyReLU | (-∞, ∞)           | Hidden (fixes dying ReLU)
Sigmoid   | (0, 1)            | Output: binary / multi-label
Softmax   | (0, 1), sums to 1 | Output: multi-class (one label)
Tanh      | (-1, 1)           | Hidden (zero-centred)

Dying ReLU: negative input → output = 0 → no gradient → neuron "dies". LeakyReLU fix: small slope (e.g. 0.01x) for negative inputs → the neuron still receives a gradient.

KEY: Multi-label (multiple outputs ON) → sigmoid. Multi-class (exactly one) → softmax.
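The sigmoid/softmax distinction in a few lines (pure Python, function names mine):

```python
from math import exp

def sigmoid(z):
    """Independent per-output probability -> multi-label."""
    return 1 / (1 + exp(-z))

def softmax(zs):
    """Probabilities that compete and sum to 1 -> multi-class."""
    m = max(zs)                        # subtract max for numerical stability
    es = [exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))            # 1.0 -- the classes compete
print(round(sigmoid(0.0), 2))          # 0.5 -- each label judged separately
```

With sigmoid outputs, several labels can be "on" at once; softmax forces one distribution over all classes.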


6. Data Preprocessing (MUST)

Step                  | Implies about raw data
Median imputer        | Numerical, has missing values, possibly skewed/outliers
Most-frequent imputer | Categorical, has missing values
Standardisation       | Attributes on different scales
Log transform         | Heavy-tailed distribution
One-hot encoding      | Categorical, no ordinal relationship, not too many categories
Remove attribute      | >99% missing values → imputation creates misleading info

When to remove vs impute:

  • Remove: vast majority missing (e.g. 9995/10000)
  • Impute: reasonable number missing (e.g. 15/10000)

Outlier detection: min/max values far outside mean ± a few standard deviations → likely outliers
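A median imputer is short enough to reproduce from scratch; note how the outlier barely moves the fill value (example data invented):

```python
from statistics import median

def median_impute(values):
    """Fill missing (None) entries with the median of the observed ones.
    The median is robust to the skew/outliers mentioned above."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]

ages = [22, 25, None, 30, 95, None]   # 95 is an outlier; the median ignores it
print(median_impute(ages))            # [22, 25, 27.5, 30, 95, 27.5]
```

A mean imputer on the same data would fill with 43, dragged up by the single outlier, which is why "median" implies possible skew/outliers.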


SIDE 2: ARCHITECTURES + ANSWER TEMPLATES


7. Transformer / Attention (MUST)

Self-attention: weighted sum of Values, where weights = relevance between Query and Key.

Multi-head attention: multiple attention heads with separate Q/K/V → each focuses on different aspects. Outputs are concatenated.

Masked attention (decoder): prevents attending to future tokens → preserves the autoregressive property during training (predict the next token based only on previous ones).

Positional encoding: needed because the Transformer processes all tokens in parallel → loses order information. Added to the embeddings.

ViT [CLS] token: learnable token prepended to the patch sequence → aggregates info from all patches via attention → fed to an MLP for classification. Advantage: efficient, no need for global pooling.
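Self-attention from the definition above, as a minimal single-head pure-Python sketch (no learned projections, tiny invented inputs):

```python
from math import exp, sqrt

def attention(Q, K, V):
    """Scaled dot-product attention, one head:
    output_i = sum_j softmax_j(Q_i . K_j / sqrt(d)) * V_j"""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d) for k in K]
        m = max(scores)
        ws = [exp(s - m) for s in scores]
        total = sum(ws)
        ws = [w / total for w in ws]          # attention weights, sum to 1
        out.append([sum(w * v[i] for w, v in zip(ws, V))
                    for i in range(len(V[0]))])
    return out

# Two tokens, 2-d embeddings: each output row is a weighted sum of the Values.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```

Each token attends most strongly to the key it matches best, here itself, and the weights for each query always sum to 1.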


8. RNN / LSTM (MED)

RNN: h_t = f(W·h_{t-1} + U·x_t + b). Sequential processing.

  • Advantage: naturally captures order
  • Drawback: can't parallelise → slow for long sequences
  • Problem: vanishing gradients (long-range dependencies lost)

LSTM: 3 gates (forget, input, output) control information flow → mitigates the vanishing gradient problem.

How the Transformer fixes the RNN drawback: processes all positions in parallel via self-attention, adding positional encoding for order info.
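The sequential dependence is visible in code: each step needs the previous hidden state. A scalar sketch with illustrative, unlearned weights:

```python
from math import tanh

def rnn_step(h_prev, x, w=0.5, u=0.5, b=0.0):
    """One scalar RNN step, h_t = tanh(w*h_{t-1} + u*x_t + b)."""
    return tanh(w * h_prev + u * x + b)

# Each iteration depends on the previous one, so this loop cannot be
# parallelised across time -- the RNN drawback in miniature.
h = 0.0
for x in [1.0, 0.5, -1.0]:
    h = rnn_step(h, x)
print(round(h, 4))
```

A Transformer, by contrast, computes all positions at once and recovers order from the positional encodings.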


9. DNN Training Challenges (MED)

Why deep nets are hard to train:

  1. Vanishing/exploding gradients
  2. More prone to overfitting
  3. Longer training time

Strategies to help:

  • Skip connections / ResNet (y = F(x) + x, gradients flow through shortcut)
  • Batch normalisation (normalises activations, keeps gradients healthy)
  • Better optimisers (Adam, RMSProp)
  • LSTM/GRU for sequences

Batch Norm effects: speeds up training, reduces vanishing gradients, regularisation effect, reduces sensitivity to weight initialisation.
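Batch norm's core operation is just normalise-then-rescale; `gamma` and `beta` below are fixed constants for illustration rather than the learned parameters they would be in a real layer:

```python
from statistics import mean, pstdev

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise a batch of activations to zero mean / unit variance,
    then rescale by gamma and shift by beta."""
    mu, sigma = mean(xs), pstdev(xs)
    return [gamma * (x - mu) / (sigma + eps) + beta for x in xs]

out = batch_norm([10.0, 20.0, 30.0])
print([round(v, 3) for v in out])     # centred on 0, roughly unit spread
```

Keeping activations in this range is what keeps gradients healthy in deep stacks.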


10. Answer Templates

"Will this improve validation accuracy?"

[YES/NO], [suggestion] is [likely/unlikely] to improve validation accuracy.
The model is currently [overfitting/underfitting], as evidenced by
[train acc X% vs val acc Y%].
[Suggestion] [helps/does not help] because [mechanism].

"Explain a concept"

[Concept] is [one-sentence definition].
It works by [mechanism].
This is [beneficial/important] because [why].

"Interpret loss curves / metrics"

The [curve/metric] shows [observation].
This indicates [diagnosis].
This is because [cause].

CNN calculation

Layer: Input [H,W,C]
  formula: floor((H+2p-f)/s)+1 = ...
  Output: [H',W',C']
→ next layer...
→ Flatten: H×W×C = answer

11. Key English Phrases (copy-paste ready)

Situation                  | Write this
Overfitting                | "The model is overfitting, as training acc (X%) is much higher than validation acc (Y%)."
L2 helps                   | "L2 regularisation penalises large weights, encouraging a simpler, more generalisable model."
Dropout hurts underfitting | "Dropout will NOT help because the model is underfitting — it further constrains an already limited model."
More epochs hurts          | "More epochs will worsen overfitting as the model continues to memorise training noise."
Class imbalance trap       | "Despite high accuracy, the model is ineffective due to class imbalance — it achieves accuracy by predicting the majority class."
RNN advantage              | "RNNs naturally capture sequential order during training."
RNN drawback               | "Sequential processing prevents parallelisation, making training slow for long sequences."
Transformer fix            | "The Transformer processes all positions in parallel via self-attention, using positional encoding to retain order information."

12. Common Traps to Avoid

  • Multi-label ≠ multi-class → sigmoid, NOT softmax
  • High recall + low precision = predicting everything positive (not a good model)
  • Regularisation fights overfitting, NEVER helps underfitting
  • Zero initialisation → symmetry problem → all neurons learn the same
  • More epochs → more overfitting, not less
  • MaxPool vs AvgPool → same dimensions, different values
  • Accuracy alone is misleading with imbalanced classes

13. Marks-per-minute priority

1. Bias/Variance diagnosis + fixes    ~20%  ← ALWAYS on exam
2. CNN dimension calculation           ~15%  ← ALWAYS on exam
3. Transformer/Attention concepts      ~15%  ← ALWAYS on exam
4. Data preprocessing reasoning        ~15%  ← ALWAYS on exam
5. Learning rate curve matching        ~10%  ← usually on exam
6. Confusion matrix metrics            ~10%  ← usually on exam
7. Activation functions                ~5%
8. RNN vs Transformer                  ~5%
9. Batch Norm / DNN training           ~5%