Exam Question-by-Question Analysis

Source: 2025 S1 Test + 2024 S1 Test + Practice Test (with official answers)


2025 S1 Mid-Semester Test (20 marks, 6 questions)

Q1: Dataset Cleaning [2 marks]

Type: Data table analysis → justify 4 cleaning steps
Module: A (Data Preprocessing)
Difficulty: ★☆☆
Keywords: missing values, median imputation, attribute removal, outlier detection
Intent: Can you read dataset statistics and make cleaning decisions?

The Trick: Attribute 4 has 9995/10000 missing values → remove entirely. Attribute 2 has extreme min/max relative to mean → outliers exist. No missing values in categorical/binary → no need for most-frequent imputation.
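
The drop-or-impute decision can be sketched in a few lines of Python. This is only an illustration: the 0.5 drop threshold, the helper names, and the toy columns are assumptions, not values from the test.

```python
from statistics import median


def missing_fraction(column):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def clean_column(column, drop_threshold=0.5):
    """Return None if the column should be dropped, else a median-imputed copy.

    drop_threshold is a hypothetical cut-off, not one given in the test.
    """
    if missing_fraction(column) > drop_threshold:
        return None
    fill = median(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


# Toy data: one column like Attribute 4 (9995/10000 missing), one with a single gap.
mostly_missing = [None] * 9995 + [1.0, 2.0, 3.0, 4.0, 5.0]
few_missing = [1.0, None, 3.0, 100.0, 2.0]

print(clean_column(mostly_missing))  # dropped
print(clean_column(few_missing))     # gap filled with the median of the rest
```

The median is used rather than the mean because an outlier such as the 100.0 above would pull a mean-based fill value upwards.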


Q2: Evaluation and Design Choices [3 marks]

Type: (a) Interpret loss curves → diagnose bias/variance (b) Suggest 2 improvements
Module: A (Bias-Variance)
Difficulty: ★★☆
Keywords: overfitting, high variance, high bias, regularisation, data augmentation
Intent: Can you read training curves and prescribe fixes?

The Trick: A large gap between training and validation loss = high variance (overfitting). Training loss that is still relatively high = possible high bias too. Each suggestion must target a DIFFERENT aspect.


Q3: Activation Functions [3 marks]

Type: (a) Explain dying ReLU + LeakyReLU fix (b) Choose output activation for multi-label
Module: B (MLP / Activation Functions)
Difficulty: ★★☆
Keywords: ReLU, LeakyReLU, dying neurons, sigmoid, multi-label vs multi-class
Intent: Do you understand activation function failure modes and design choices?

The Trick: Multi-label (multiple anomalies per image) = sigmoid (independent per output). NOT softmax (which forces probabilities to sum to 1).
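
A small sketch of why sigmoid suits multi-label outputs while softmax does not. The three logits are made-up scores for three anomaly types, and the 0.5 decision threshold is an assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, -3.0]  # hypothetical scores for three anomaly types

sigmoid_probs = [sigmoid(z) for z in logits]   # each output decided independently
softmax_probs = softmax(logits)                # outputs compete, sum to 1

predicted_labels = [p > 0.5 for p in sigmoid_probs]
print(predicted_labels)
print(sum(softmax_probs))
```

With sigmoid, the first two anomalies can both be flagged. With softmax, the two strong logits share a total of 1, so at most one class can exceed 0.5.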


Q4: Learning Rate [4 marks]

Type: Match 4 loss curves to 4 learning rates
Module: A (Optimization)
Difficulty: ★★☆
Keywords: divergence, convergence, overshooting, learning rate
Intent: Can you visually identify learning rate effects?

The Trick: Diverging (loss goes up) = 0.5. Slow descent = 0.001. Fast convergence to HIGH loss = 0.1 (overshoots optimum). Best convergence to LOW loss = 0.01.
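
The divergence and slow-descent cases can be reproduced on a toy problem. The sketch below assumes a 1-D quadratic f(x) = a·x² with a = 2.5, chosen only so that a learning rate of 0.5 diverges; it does not reproduce the "converges to a HIGH loss" behaviour of 0.1, which depends on the real, noisy loss surface in the test.

```python
def final_loss(lr, steps=50, x0=1.0, a=2.5):
    """Run plain gradient descent on f(x) = a * x**2 and return the final loss.

    The curvature a and starting point x0 are illustrative assumptions.
    """
    x = x0
    for _ in range(steps):
        grad = 2 * a * x
        x -= lr * grad
    return a * x * x


initial_loss = 2.5  # f(1.0) with a = 2.5
losses = {lr: final_loss(lr) for lr in (0.5, 0.1, 0.01, 0.001)}

for lr, loss in losses.items():
    print(f"lr={lr}: final loss {loss:.3g}")
```

On this toy problem the loss explodes for 0.5 and barely moves for 0.001, matching the two extremes in the curves.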


Q5: Transformers [4 marks]

Type: (a) Explain masked attention in decoder (b) Explain ViT class token
Module: B (Transformer)
Difficulty: ★★★
Keywords: masked attention, autoregressive, ViT, [CLS] token, classification
Intent: Deep understanding of Transformer variants

The Trick: (a) Mask prevents looking at future tokens → preserves autoregressive property during training. (b) [CLS] token aggregates info from all patches → efficient classification without processing all embeddings separately.
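
Part (a) can be illustrated with a minimal causal mask. The sketch assumes the attention scores are already computed and uses uniform made-up scores, so only the effect of the mask is visible.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Setting a score to -inf makes its softmax weight exactly 0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # hypothetical uniform scores for 4 tokens
weights = causal_attention_weights(scores)

for row in weights:
    print(row)
```

Token i only ever receives weight from tokens 0..i, which is what keeps training consistent with left-to-right generation.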


Q6: CNNs [4 marks]

Type: (a) Multiple choice: FC layer inputs (b) Show calculation
Module: B (CNN)
Difficulty: ★★☆
Keywords: valid padding, same padding, convolution, max pooling, flatten
Intent: Can you compute dimensions through a CNN pipeline?

Answer: 180. Pipeline: [35,35,3] → Conv1(valid,k=7,s=2) → [15,15,10] → Pool1(k=2,s=2) → [7,7,10] → Conv2(same,k=3,s=1) → [7,7,20] → Pool2(k=2,s=2) → [3,3,20] → Flatten = 180.
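
The pipeline can be checked mechanically. This sketch assumes the common conventions that "valid" gives floor((n − k)/s) + 1 and "same" gives ceil(n/s), and that pooling uses no padding; the helper name is made up for illustration.

```python
def spatial_size(n, kernel, stride, padding="valid"):
    """Output width/height of a conv or pooling layer on a square input."""
    if padding == "same":
        return -(-n // stride)  # ceil(n / stride)
    return (n - kernel) // stride + 1


size, channels = 35, 3
layers = [
    # (name, kernel, stride, padding, output channels or None for pooling)
    ("conv1", 7, 2, "valid", 10),
    ("pool1", 2, 2, "valid", None),
    ("conv2", 3, 1, "same", 20),
    ("pool2", 2, 2, "valid", None),
]

for name, kernel, stride, padding, out_channels in layers:
    size = spatial_size(size, kernel, stride, padding)
    if out_channels is not None:  # pooling keeps the channel count
        channels = out_channels
    print(name, [size, size, channels])

flattened = size * size * channels
print("flatten:", flattened)
```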


2024 S1 Mid-Semester Test (30 marks, 7 questions)

Q1: Data Preprocessing [4 marks]

Type: Infer data characteristics from preprocessing pipeline
Module: A (Data Preprocessing)
Difficulty: ★★☆
Intent: Can you reverse-engineer what raw data looks like from the pipeline?

Pipeline 1 (median imputer → standardisation → log transform): Numerical data, missing values, different scales, heavy-tailed distribution.

Pipeline 2 (most-frequent imputer → one-hot encoding): Categorical data, missing values, no ordinal relationship, not too many categories.


Q2: Design Choices [6 marks] — HIGHEST VALUE QUESTION

Type: Overfitting scenario (train=95%, val=60%), evaluate 3 fixes
Module: A (Bias-Variance)
Difficulty: ★★☆

  • More epochs: NO — worsens overfitting
  • Larger dataset: YES — more diverse data helps generalise
  • L2 regularisation: YES — penalises large weights, promotes simpler model

Q3: Evaluation [4 marks]

Type: Confusion matrix → calculate metrics → interpret
Module: E (Metrics)

Results: Accuracy=60%, Recall=100%, Precision=56%. The model predicts almost everything as positive, so the perfect recall is misleading: it catches every positive only because it labels nearly every sample positive, which is why precision is poor.
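
The three metrics follow directly from the confusion-matrix counts. The counts below are hypothetical, chosen only so that they reproduce the quoted percentages; the actual matrix from the test is not given here.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts: 9 of 10 samples are predicted positive.
result = metrics(tp=5, fp=4, fn=0, tn=1)

for name, value in result.items():
    print(f"{name}: {value:.0%}")
```

With zero false negatives, recall is perfect by construction, while the four false positives drag precision down.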


Q4: Learning Rate and Optimisers [4 marks]

Type: (1) LR schedule example + benefit (2) Explain momentum

Key answers: (1) Exponential decay — fast at start, fine-tune near optimum. (2) Momentum = exponentially decaying average of past gradients → smoother updates, speeds up convergence.
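
Both ideas fit in a few lines. The sketch below assumes a per-step decay factor for the schedule and the exponential-moving-average form of momentum with beta = 0.9; all constants and the toy loss w² are illustrative choices.

```python
def exponential_decay(initial_lr, decay_rate, step):
    """Learning rate after `step` updates: large early, small near the optimum."""
    return initial_lr * decay_rate ** step


def momentum_step(weight, velocity, grad, lr, beta=0.9):
    """One update where velocity is an exponentially decaying average of gradients."""
    velocity = beta * velocity + (1 - beta) * grad
    weight = weight - lr * velocity
    return weight, velocity


# Toy run on the loss w**2 (gradient 2*w), starting from w = 1.0.
w, v = 1.0, 0.0
for step in range(100):
    lr = exponential_decay(0.1, 0.99, step)
    grad = 2 * w
    w, v = momentum_step(w, v, grad, lr)

print(w)
```

Because the velocity averages recent gradients, a single noisy gradient changes the update direction only slightly.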


Q5: RNN and Transformer [4 marks]

Type: (1) Sequential processing: advantage AND drawback (2) How Transformer fixes it

Key: (1) Advantage: naturally captures order. Drawback: can't parallelise across time steps → slow for long sequences. (2) Transformer: processes all tokens in parallel via self-attention over their embeddings + adds positional encoding to retain order.


Q6: CNN Feature Map [4 marks]

Type: Calculate dimensions after conv and pooling layers

Answers: Conv: ((50+0-5)/3)+1 = 16 → [16,16,10]. AvgPool: ((50-5)/5)+1 = 10 → [10,10,5]. MaxPool: same dimensions (only values differ).
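
Both answers use the same formula, floor((n + 2p − k)/s) + 1, where the "+0" in the conv calculation is the 2p term with no padding. A minimal check, with a made-up helper name:

```python
def output_size(n, kernel, stride, padding=0):
    """floor((n + 2*padding - kernel) / stride) + 1 for a square input of width n."""
    return (n + 2 * padding - kernel) // stride + 1


conv = output_size(50, kernel=5, stride=3)  # conv layer, no padding
pool = output_size(50, kernel=5, stride=5)  # avg or max pooling, same arithmetic

print(conv, pool)
```

Average and max pooling share the output size because the formula depends only on the window geometry, not on how the values inside the window are combined.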


Q7: DNN Training [4 marks]

Type: (1) Why deep nets are hard to train (2) Two strategies to help

Key: (1) Vanishing/exploding gradients + overfitting + longer training. (2) Batch norm, skip connections (ResNet), better optimisers (Adam), LSTM/GRU.
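
The vanishing-gradient part can be made concrete with a deliberately simplified chain: one sigmoid unit per layer, unit weights, and every pre-activation at 0 (where the sigmoid derivative reaches its maximum of 0.25). These simplifications are assumptions for illustration only.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(depth, z=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) across `depth` layers."""
    return (weight * sigmoid_derivative(z)) ** depth


shallow = gradient_scale(2)
deep = gradient_scale(20)

print(shallow, deep)
```

Even in this best case for sigmoid, the gradient reaching the first layer shrinks by a factor of 4 per layer. Skip connections help by adding a path along which the gradient is not multiplied by these factors.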


Practice Test (~32 marks, 7 questions)

Q1: Data Pre-processing [5 marks]

Two approaches to missing data + when to use each: remove the attribute (when it is mostly missing) or impute values (when only a modest fraction is missing).

Q2: DNN and Generalisation [5 marks]

High bias → underfitting → increase model capacity, add data, transfer learning. High variance → overfitting → regularisation, more data, reduce model capacity.

Q3: Design Choices [6 marks]

Underfitting scenario (train=val=50%). Increase model size = YES. Zero init = NO (symmetry problem: all units in a layer learn the same thing). Dropout = NO (regularisation doesn't help underfitting).
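
The symmetry problem with zero initialisation can be seen on a tiny hand-written network. The architecture (2 inputs, 2 sigmoid hidden units, 1 linear output, squared error), the single training example, and the learning rate are all assumptions chosen to keep the sketch short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_zero_init(x, target, steps=20, lr=0.5):
    """Gradient descent on one example, with every weight starting at zero."""
    w = [[0.0, 0.0], [0.0, 0.0]]  # w[j][i]: input i -> hidden unit j
    v = [0.0, 0.0]                # v[j]: hidden unit j -> output
    for _ in range(steps):
        h = [sigmoid(sum(w[j][i] * x[i] for i in range(2))) for j in range(2)]
        y = sum(v[j] * h[j] for j in range(2))
        err = y - target  # derivative of 0.5 * (y - target)**2 w.r.t. y
        grad_v = [err * h[j] for j in range(2)]
        grad_w = [[err * v[j] * h[j] * (1 - h[j]) * x[i] for i in range(2)]
                  for j in range(2)]
        v = [v[j] - lr * grad_v[j] for j in range(2)]
        w = [[w[j][i] - lr * grad_w[j][i] for i in range(2)] for j in range(2)]
    return w, v


w, v = train_zero_init(x=[1.0, -2.0], target=1.0)
print(w)
print(v)
```

The weights do change during training, but both hidden units receive identical gradients at every step, so they stay copies of each other and the layer behaves like a single unit.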

Q4: Evaluation [3 marks]

Accuracy=70%, Recall=33%. Accuracy is misleading due to class imbalance: the model is bad at finding positives.

Q5: Batch Normalisation [5 marks]

Effects: speeds up training, reduces vanishing gradients, regularisation effect, reduces weight init sensitivity.
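
For reference, the normalisation step itself is short. This sketch covers a single feature over one batch at training time, with the learnable scale gamma and shift beta left at their usual starting values; the activations are made up.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a batch, then apply scale gamma and shift beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [10.0, 200.0, -50.0, 40.0]  # hypothetical pre-activations
normalised = batch_norm(activations)

print(normalised)
```

Whatever the scale of the incoming activations, the next layer sees values with roughly zero mean and unit variance, which is why training becomes less sensitive to weight initialisation.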

Q6: Attention and Transformers [4 marks]

Multi-head attention = runs multiple attention heads in parallel, each with its own Q/K/V projections, and combines their outputs. Benefit: the heads can focus on different aspects of the input simultaneously.

Q7: CNNs [5 marks]

Reverse-engineer hyperparameters from diagram. Early layers = edge detectors (low-level features), deeper layers = complex features.