MISSING TOPICS — NOT IN MAIN STUDY GUIDE

ELEN 521 · Gap-Fill Study Sheet

8 topics found in raw PDFs but absent from the main guide · Lec 3–9

Jump to: 1. Bayesian Regularization 2. ELBO Jensen Proof 3. Tables/Dict/Gates Mental Model 4. Data Types Table 5. ResNet ↔ ODEs 6. Double Descent 7. U-Net Architecture 8. ELMo Mechanism
01
Bayesian View of Regularization
Lec 4 HIGH PRIORITY
⚠ WHY THIS MATTERS FOR THE EXAM This is the conceptual link your prof loves. He showed it explicitly in slides and it's exactly the kind of "you had an idea" bridge question (Q4 style) — connecting regularization to prior distributions. If he asks "WHY does L2 regularization work from a probabilistic standpoint?" this is the answer.

The Big Idea

When we train a neural network, we usually just minimise the loss L(y, p). But a Bayesian perspective says: we have a PRIOR BELIEF about what good weights look like. The training data gives us a likelihood. We want the most likely weights given BOTH.

MAP = Maximum A Posteriori. Not just "what fits the data best" but "what fits the data best AND matches my prior belief about weights."

The Key Insight

Different priors on weights give you DIFFERENT regularisation methods automatically. This isn't a coincidence — regularisation IS a prior, encoded as math. The prof's slide: "many regularization techniques correspond to imposing certain prior distributions on model parameters."

A prior is just you saying "I think the weights are probably small" before seeing any data. The training data then updates that belief.
MAP derivation — where L2 comes from
Want to find:       argmax_w P(w | data)
By Bayes' theorem:  P(w | data) = P(data | w) · P(w) / P(data)

Take log (monotonic, so the argmax is unchanged) and drop constants:

    log P(w | data) = log P(data | w) + log P(w) + const

With a Gaussian likelihood, log P(data | w) becomes the negative MSE loss, −L(y, p). Now assume a Gaussian prior on the weights:

    P(w) = N(0, 1/λ) ∝ exp(−(λ/2)·||w||²)
    log P(w) = −(λ/2)·||w||² + const

Substitute back:

    argmax_w log P(w | data) = argmin_w [ L(y, p) + (λ/2)·||w||² ]
L2 regularisation IS a Gaussian prior on weights. ✓
L1 regularisation — from a Laplace prior
Laplace prior:  P(w) = (λ/2) · exp(−λ·|w|)
                log P(w) = −λ·|w| + const

→  argmin_w [ L(y, p) + λ·||w||₁ ]
L1 regularisation IS a Laplace prior on weights. ✓
Laplace distribution has a SHARP PEAK at 0 → encourages exact zeros → sparsity. Gaussian is smooth at 0 → weights shrink but rarely hit exactly zero.

L2 / Ridge / Tikhonov

Prior: P(w) = N(0, 1/λ)

Penalty: λ/2 · ||w||²

Effect: weights shrink toward 0 (never exactly 0)

Also called: weight decay

L1 / Lasso

Prior: P(w) = Laplace(0, 1/λ)

Penalty: λ · ||w||₁

Effect: many weights go exactly to 0 (sparse!)

Acts as: automatic feature selection

Bias is NOT regularised

Why? Bias shifts the output — regularising it would constrain the model to pass near the origin. Weights control shape/curvature. Only weights get the prior.

Different layers can have different λ

Regularised weight update — full derivation
Ω(w) = (λ/2)·wᵀw           [L2 penalty term]
∂Ω/∂w = λ·w                 [gradient of penalty]

Total gradient = ∂L_train/∂w + λ·w

Update:  w ← w − η·(∂L_train/∂w + λ·w)
           = w − η·∂L_train/∂w − η·λ·w
           = (1 − η·λ)·w − η·∂L_train/∂w
The (1−ηλ) factor SHRINKS weights before the gradient step. This is why it's called "weight decay" — weights decay toward zero at every step regardless of gradient.
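The two update forms above are the same algebra. A quick NumPy check with made-up numbers (toy gradient, arbitrary η and λ) showing that both orderings produce identical weights:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # current weights
grad_L = rng.normal(size=5)     # stand-in for the training-loss gradient dL/dw
eta, lam = 0.1, 0.01            # learning rate, regularisation strength

# Form 1: gradient step on the regularised loss L + (lam/2)||w||^2
w1 = w - eta * (grad_L + lam * w)

# Form 2: weight decay form: shrink the weights first, then take the plain gradient step
w2 = (1 - eta * lam) * w - eta * grad_L

print(np.allclose(w1, w2))   # the two updates are algebraically identical
```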

Cheat Sheet — Bayesian Regularization

MAP: argmax_w P(w|data) = argmin_w [ L − log P(w) ] (Bayes' theorem + take log)
L2 ↔ Gaussian prior N(0, 1/λ) L1 ↔ Laplace prior Laplace(0, 1/λ)
w ← (1-ηλ)·w - η·∂L/∂w "weight decay" = Gaussian prior
Laplace sharp at 0 → exact zeros (L1) Gaussian smooth at 0 → shrink (L2)
02
ELBO Derivation — Jensen Inequality Proof (Full)
Lec 9 HIGH PRIORITY — "Exercise: Prove" on slides
⚠ THE PROF LITERALLY WROTE "EXERCISE: PROVE THAT..." ON THE SLIDE This is coming. The cheat sheet has the result. This page has the full step-by-step proof you need to reproduce from memory. The exercise asks you to show the proportionality of the ELBO to the terms involving KL. Know every line below.

Jensen's Inequality

For a concave function f:

f(E[X]) ≥ E[f(X)]

log is concave (curves downward), so:

log(E[X]) ≥ E[log(X)]

Intuition: the average of the logs is ≤ the log of the average. Jensen says "you can bring log inside the expectation but you pay a price — it becomes a lower bound."
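The inequality is easy to check numerically. Any positive sample with uniform weights works as the expectation:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 8.0])   # any positive sample; uniform weights = E[.]

log_of_mean = np.log(x.mean())        # log(E[X])
mean_of_log = np.log(x).mean()        # E[log X]

print(log_of_mean >= mean_of_log)     # Jensen: log is concave, so this holds
```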

KL Divergence Definition

KL(Q||P) = E_Q[log Q(z|x) / P(z)]
= E_Q[log Q(z|x) − log P(z)]

KL ≥ 0 always. KL = 0 iff Q = P exactly.

KL measures how "surprised" you'd be if you assumed distribution P but the real distribution is Q. Always non-negative by Gibbs' inequality.
Step-by-step ELBO derivation — write this in exam
Start from: we want to maximise log P(x), the log-likelihood of our data x:

    log P(x) = log ∫ P(x|z)·P(z) dz

Step 1 — Multiply and divide by Q(z|x) inside the integral (a clever trick — multiplying by 1):

    = log ∫ Q(z|x) · [ P(x|z)·P(z) / Q(z|x) ] dz

Step 2 — Recognise this as an expectation under Q(z|x):

    = log E_Q[ P(x|z)·P(z) / Q(z|x) ]

Step 3 — Apply Jensen's inequality (log is concave → log E[·] ≥ E[log(·)]):

    ≥ E_Q[ log( P(x|z)·P(z) / Q(z|x) ) ]

Step 4 — Expand the log of the product/quotient:

    = E_Q[ log P(x|z) + log P(z) − log Q(z|x) ]

Step 5 — Split the expectation and rearrange:

    = E_Q[log P(x|z)] + E_Q[log P(z) − log Q(z|x)]
    = E_Q[log P(x|z)] − E_Q[log Q(z|x) − log P(z)]
    = E_Q[log P(x|z)] − E_Q[log Q(z|x)/P(z)]

Step 6 — Recognise the KL divergence definition:
log P(x) ≥ E_Q[log P(x|z)] − KL( Q(z|x) || P(z) )
↑ RECONSTRUCTION LOSS ↑ KL DIVERGENCE LOSS
The gap: the ≥ becomes equality exactly when Q(z|x) = P(z|x) — when the encoder matches the true posterior. The gap between log P(x) and the ELBO is KL(Q(z|x) || P(z|x)), so maximising the ELBO = minimising that KL gap = pushing the encoder toward the true posterior.
The "Exercise: Prove proportionality" — what the prof wants
The slide says: prove that the ELBO is proportional to CrossEntropy + KL. Here's how.

Maximise:   E_Q[log P(x|z)] − KL(Q||P)
Equivalent to minimising the negation:
Minimise:   −E_Q[log P(x|z)] + KL(Q||P)

For binary images, −E_Q[log P(x|z)] IS the reconstruction cross-entropy:

    −E_Q[log P(x|z)] = −Σ [ x·log(x̂) + (1−x)·log(1−x̂) ]

And the KL between Gaussians has a closed form (Q = N(μ, σ²), P = N(0, 1)):

    KL = −½ · Σⱼ (1 + log σⱼ² − μⱼ² − σⱼ²)
VAE Loss = CrossEntropy(x, x̂) − ½·Σ(1 + log σ² − μ² − σ²)
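The two terms of that loss can be written out in a few lines of NumPy. A minimal sketch (the function names `gaussian_kl` and `bce` and all the toy numbers are mine, not from the slides):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def bce(x, x_hat, eps=1e-7):
    """Reconstruction cross-entropy for binary pixels."""
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

# KL is zero exactly when the encoder outputs the prior N(0, 1)...
print(gaussian_kl(np.zeros(4), np.zeros(4)))
# ...and positive otherwise (KL >= 0 always)
print(gaussian_kl(np.array([1.0, -0.5]), np.array([0.3, -0.2])) > 0)

# Toy VAE loss = reconstruction cross-entropy + KL
x      = np.array([1.0, 0.0, 1.0])
x_hat  = np.array([0.9, 0.1, 0.8])
mu     = np.array([0.2, -0.1])
logvar = np.array([0.1, -0.3])
loss = bce(x, x_hat) + gaussian_kl(mu, logvar)
print(loss > 0)
```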

Cheat Sheet — ELBO Proof in 6 Lines

1. log P(x) = log ∫ P(x|z)·P(z) dz
2. Multiply-divide by Q(z|x):  = log E_Q[ P(x|z)·P(z) / Q(z|x) ]
3. Jensen (log concave):       ≥ E_Q[ log( P(x|z)·P(z) / Q(z|x) ) ]
4. Expand the log:             = E_Q[log P(x|z)] + E_Q[log P(z)] − E_Q[log Q(z|x)]
5. Rearrange:                  = E_Q[log P(x|z)] − E_Q[log Q(z|x)/P(z)]
6. Recognise KL:               = E_Q[log P(x|z)] − KL(Q||P)  ← ELBO ✓
03
The "Tables → Embedding, Dictionaries → Attention, if-then-else → Gates" Mental Model
Lec 8 HIGH PRIORITY — Direct slide quote
⚠ EXACT QUOTE FROM PROF'S SLIDE: "Tables: Embedding · Dictionaries: Attention · if-then-else: gates in LSTM, Mixture-of-Experts" — This is a one-liner the prof put on a dedicated slide. It's a conceptual unification. Could appear as a short-answer "what does X correspond to in classical computing?" question.
classical computing
Table lookup
Embedding Layer
A table where each word/token ID maps to a row (its vector). Integer index → dense vector. Exactly like looking up a row in a database table.
vocab_size × embedding_dim matrix. Input: integer. Output: the corresponding row vector.
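The "table lookup" really is just row indexing. A minimal NumPy sketch (sizes are made up):

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
table = np.random.default_rng(0).normal(size=(vocab_size, embedding_dim))

token_ids = np.array([3, 7, 3])        # integer token IDs index into the table
vectors = table[token_ids]             # embedding lookup = row selection

print(vectors.shape)                   # (3, 4): one dense vector per token
print(np.array_equal(vectors[0], vectors[2]))  # same ID, same row
```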
classical computing
Dictionary / Hash Map
Attention Mechanism
A SOFT dictionary. Instead of an exact key match, you compute similarity between your query and ALL keys, then return a weighted sum of values. No exact match needed — closest keys get higher weight.
Query = "what am I looking for?". Keys = "what do I have?". Values = "what do I return?" Attention(Q,K,V) = softmax(QKᵀ/√d) · V
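The soft-dictionary view in code: a minimal NumPy scaled dot-product attention, with one-hot keys chosen so the "closest key gets the most weight" behaviour is easy to see (keys, values, and the query are made-up toy data):

```python
import numpy as np

def attention(Q, K, V):
    """Soft dictionary lookup: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # query-key similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over all keys
    return w @ V, w                                  # weighted sum of values

K = np.eye(5)                         # 5 one-hot "keys" (what do I have?)
V = np.arange(25.0).reshape(5, 5)     # 5 "values" (what do I return?)
Q = np.array([[0, 0, 10.0, 0, 0]])    # a query pointing strongly at key 2

out, w = attention(Q, K, V)
print(w.argmax())        # key 2 gets by far the largest weight
print(round(w.sum()))    # weights sum to 1: soft retrieval, no exact match needed
```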
classical computing
if-then-else / conditional
LSTM Gates + Mixture of Experts
Gates are differentiable if-then-else. Instead of a hard 0/1 decision, sigmoid gives a soft value between 0 and 1. "Should I remember this? → forget_gate ≈ 0.9" is like "if important: keep 90% of this." Mixture of Experts extends this: route input to different experts based on gating.
Forget gate ≈ if (relevant) keep else discard. Input gate ≈ if (new info matters) write to memory.

Why Attention = Soft Dictionary

Hard dictionary: you either find the key or you don't. Binary. Not differentiable.

Soft dictionary (Attention): every key matches a little. The similarity score = how much that value matters. Weighted sum = soft retrieval. Fully differentiable → trainable end-to-end.

Google Search = hard lookup. Attention = "find me documents similar to this query" — soft, ranked, weighted.

Why Gates = Differentiable if-then-else

Normal if-then-else: gradient is zero everywhere except at the switching point → can't backprop.

Sigmoid gate: smooth everywhere → gradient flows through. The network LEARNS what conditions should open/close each gate.

Hard if: light switch (on/off). Sigmoid gate: dimmer switch (0 to 1). The backprop can "turn the dimmer" gradually during training.

Mixture of Experts (MoE) — the gating extension

Instead of one big network, have N "expert" sub-networks and a gating network that decides which experts handle which input. Gate output = probabilities over experts. The input is routed to the top-k experts (differentiable gating via softmax).

output = Σᵢ gate(x)ᵢ · expertᵢ(x)

Like a hospital: different specialist doctors for different conditions. A triage nurse (gate) decides who you see.
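The MoE equation above can be sketched with three made-up linear "experts" (everything here is hypothetical, purely to show the gating math):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three toy experts, each a simple scaling of the input
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, -1.0)]

def moe(x, gate_logits):
    g = softmax(gate_logits)                          # gate: probabilities over experts
    return sum(gi * e(x) for gi, e in zip(g, experts)), g

x = np.array([1.0, 2.0])
out, g = moe(x, gate_logits=np.array([5.0, 0.0, 0.0]))  # gate strongly prefers expert 0

print(g.argmax())                 # expert 0 dominates the mixture
print(np.isclose(g.sum(), 1.0))   # gate outputs are a probability distribution
```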

Cheat Sheet — The Three Analogies

Table → Embedding: integer index → dense vector row
Dictionary → Attention: soft lookup, Q·Kᵀ similarity → weighted sum of V
if-then-else → LSTM gate: sigmoid(·) = differentiable 0/1 switch
MoE: output = Σᵢ softmax(gate)ᵢ · expertᵢ(x)
04
Data Types in Deep Learning (1D / 2D / 3D / 4D)
Lec 4 MEDIUM PRIORITY
CONTEXT The prof showed this as a slide table to show CNNs work on more than images. Could appear as a quick concept question: "What type of data would you use a 3D convolution on?"
Dimension | Single channel             | Multi-channel           | Notes
1D        | Audio waveform             | Skeleton animation data | Conv1D, time series
2D        | Fourier transform of audio | Color image data (RGB)  | Conv2D, most common
3D        | Volumetric data (CT scans) | Color video data        | Conv3D, medical imaging
4D        | Heart scans (3D + time)    | Multi-modal MRI         | Rare, very expensive

What "channel" means in each

1D audio: 1 channel = mono, 2 = stereo. 2D image: 1 = grayscale, 3 = RGB. 3D CT scan: 1 = Hounsfield units. Color video: 3 channels × time × height × width.

What changes in the Conv layer

For 1D: kernel slides along time axis (1 direction). For 2D: kernel slides along height × width (2 directions). For 3D: kernel slides along depth × height × width (3 directions). The math is the same — just more axes.
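A minimal NumPy sketch of the kernel-sliding idea in 1D, plus the per-axis output-size arithmetic that carries over unchanged to 2D/3D (valid padding and stride 1 assumed; the example signal and kernel are made up):

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation, as deep learning frameworks do it)."""
    n, k = len(signal), len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(n - k + 1)])

x = np.arange(8.0)               # a "time series" of length 8
w = np.array([1.0, 0.0, -1.0])   # kernel sliding along the single time axis

y = conv1d(x, w)
print(len(y))                    # 8 - 3 + 1 = 6 (valid-conv output length)
print(y)                         # each entry is x[i] - x[i+2] = -2

# The same per-axis arithmetic in 2D and 3D: out_dim = in_dim - kernel_dim + 1
for in_shape, k_shape in [((32, 32), (3, 3)), ((16, 32, 32), (3, 3, 3))]:
    out = tuple(i - k + 1 for i, k in zip(in_shape, k_shape))
    print(in_shape, "->", out)
```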

Cheat Sheet — Data Types

1D: audio waveform (1ch), skeleton (multi-ch) → Conv1D
2D: Fourier of audio (1ch), color image (3ch) → Conv2D
3D: CT scan (1ch), color video (3ch) → Conv3D
4D: heart scan, multi-modal MRI → 4D Conv
05
ResNet ↔ Differential Equations (ODE Solver View)
Lec 4 MEDIUM PRIORITY — 4 slides devoted to this
CONTEXT The prof spent 4 slides on this connection in Lec 4. It's a beautiful theoretical bridge. Most likely appears as a conceptual Q4 extension: "ResNet can be seen as solving a differential equation — explain the connection." The 2026 mHC extension is also from this.

ResNet skip connection (what you know)

y = F(x) + x

Output = residual function + original input. The network learns the CHANGE (residual) rather than the full mapping. Deep networks → no vanishing gradients.

ODE view (the connection)

dy/dt = F(y, t)

An ordinary differential equation says: the rate of change of y at time t is some function F. The solution y(t) is found by stepping forward in time.

Euler's method → ResNet skip connection
Euler's method approximates an ODE numerically. With step size Δt:

    y(t + Δt) = y(t) + Δt · F(y(t), t)

Set Δt = 1 and rename y(t) = xₗ (layer l), F = residual function:

    x_{l+1} = x_l + F(x_l)    ← ResNet skip connection!
Each ResNet block = one Euler step solving dy/dt = F(y). The depth of ResNet = number of time steps in the ODE solver.
Neural ODE (Chen et al. 2018): take this to the limit. Infinite layers → continuous dynamics. The "layer index" becomes continuous time t ∈ [0, 1].
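A quick numerical sketch of the correspondence: stacking blocks x_{l+1} = x_l + F(x_l) integrates dy/dt = a·y (a, y0, T are made-up numbers), and deeper "networks" track the exact solution better:

```python
import numpy as np

# Toy dynamics dy/dt = a*y with exact solution y(T) = y0 * exp(a*T)
a, y0, T = -0.8, 1.0, 1.0

def euler_resnet(num_blocks):
    """Each 'residual block' adds F(x) = dt * a * x, i.e. x_{l+1} = x_l + F(x_l)."""
    dt = T / num_blocks
    x = y0
    for _ in range(num_blocks):
        x = x + dt * a * x        # one Euler step = one skip-connected block
    return x

exact = y0 * np.exp(a * T)
for depth in (4, 16, 64):
    print(depth, abs(euler_resnet(depth) - exact))   # error shrinks with depth
```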
2026 Extension: mHC — Doubly Stochastic ResNet
The prof mentioned in the 2026 slides: keep the weight matrix W doubly stochastic (each row and each column sums to 1), via Sinkhorn-Knopp normalisation.

    Doubly stochastic:  Σⱼ Wᵢⱼ = 1 (rows)  AND  Σᵢ Wᵢⱼ = 1 (cols)

Why? Two reasons the prof gave:
1. Convex combination of inputs → stable activations
2. Convex combination of outputs → bounded representations
Sinkhorn-Knopp: alternately normalise rows then columns until convergence. May not converge fast → mHC-lite is a simpler approximation.
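The alternate-normalisation loop is short enough to write out. A minimal NumPy sketch (the iteration count is an arbitrary choice, not from the slides):

```python
import numpy as np

def sinkhorn_knopp(W, iters=200):
    """Alternately normalise rows then columns until (approximately) doubly stochastic."""
    W = np.abs(W) + 1e-9                           # requires positive entries
    for _ in range(iters):
        W = W / W.sum(axis=1, keepdims=True)       # make rows sum to 1
        W = W / W.sum(axis=0, keepdims=True)       # make cols sum to 1
    return W

W = sinkhorn_knopp(np.random.default_rng(0).random((4, 4)))
print(np.allclose(W.sum(axis=0), 1.0))   # columns sum to 1
print(np.allclose(W.sum(axis=1), 1.0))   # rows sum to 1 (after convergence)
```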

Why this matters conceptually

The ODE view explains WHY ResNet works better than plain deep networks — it's solving a well-posed mathematical problem (integrating a differential equation) rather than stacking arbitrary transformations. It also gives theoretical tools: stability analysis from ODE theory applies to ResNet depth and training.

Cheat Sheet — ResNet ↔ ODE

ODE: dy/dt = F(y,t) → Euler step: y(t+1) = y(t) + F(y(t))
ResNet: x_{l+1} = x_l + F(x_l) ← SAME FORMULA
Each block = 1 Euler step. Depth = number of time steps.
mHC: W doubly stochastic (rows + cols sum to 1), via Sinkhorn-Knopp.
06
Double Descent — "Is Overfitting Always Bad?"
Lec 6 MEDIUM PRIORITY
CONTEXT The prof showed a slide titled "Is Overfitting Bad? (beating the variance hill)". The classical U-shaped bias-variance curve is WRONG for modern deep networks. This is a key modern finding. Could appear as a short conceptual question.

Classical view (what you learned in Lec 4)

Bias-variance tradeoff: as model capacity increases, training error decreases monotonically, but test error follows a U-shape. Too simple = high bias (underfitting). Too complex = high variance (overfitting). Sweet spot in the middle.

This is the standard ML101 picture. It's true for classical models (polynomial regression, SVMs).

Modern finding: Double Descent

For large neural networks: after the classical "overfitting peak," if you keep increasing model size PAST the data-fitting threshold, test error starts DECREASING again.

Small model → underfit → test error high (bias)
Medium model → overfit → test error peaks (variance)
Large model → over-parameterised → test error DROPS again!

The very large model finds a smooth interpolating solution through the training points — like fitting a smooth curve that happens to pass through all noisy points without wiggling wildly.

Why does this happen?

When the model has WAY more parameters than data points, there are infinitely many solutions that perfectly fit the training data. Gradient descent finds the one with the MINIMUM NORM (smallest weights). This solution turns out to be smooth and generalises well — it avoids the extreme oscillations of just-right-sized overfitting.
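The minimum-norm story can be checked directly in NumPy: `np.linalg.pinv` returns the minimum-norm interpolating solution, and any other perfect interpolator has larger norm (toy random data; gradient descent from zero would converge to the same `pinv` solution):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                          # 5 data points, 20 parameters: over-parameterised
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# The minimum-norm solution among the infinitely many perfect fits
w_min = np.linalg.pinv(X) @ y

# Another perfect interpolator: min-norm solution plus a null-space component
null_dir = np.eye(d)[0] - np.linalg.pinv(X) @ (X @ np.eye(d)[0])
w_other = w_min + 3.0 * null_dir

print(np.allclose(X @ w_min, y))      # both fit the training data exactly...
print(np.allclose(X @ w_other, y))
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))  # ...but min-norm is smaller
```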

What the prof's slide showed

At the interpolation threshold (model capacity = data size), test error peaks. Beyond this point with even larger models, it seems like the network uses a "smoothing function" again — behaving more like regularisation than overfitting.

Implication: bigger is often better in deep learning — don't stop at the classical sweet spot!

Cheat Sheet — Double Descent

Classical: test error = U-shape (bias → sweet spot → variance)
Modern: test error descends a SECOND time past the interpolation threshold
Why: gradient descent finds the min-norm solution → smooth → generalises
Implication: very large models are often BETTER than medium-sized ones
07
U-Net — Encoder-Decoder with Skip Connections (Full Architecture)
Lec 6 MEDIUM PRIORITY — "SOTA for segmentation"
CONTEXT The prof stated "U-Nets are SOTA for Image Segmentation" and showed the architecture. The study guide mentions it but misses the key architectural detail: skip connections between MATCHING encoder and decoder levels. This is what makes it different from a plain autoencoder.
U-NET ARCHITECTURE (encoder left, decoder right, skip connections across)

Input image (H×W×C)

Conv Block 1 → 64 filters ─────────────────────────────────┐
↓ MaxPool (÷2)
Conv Block 2 → 128 filters ─────────────────────────────┐ │
↓ MaxPool (÷2)
Conv Block 3 → 256 filters ─────────────────────────┐ │ │
↓ MaxPool (÷2)
Bottleneck → 512 filters (smallest feature map)
↓ UpSample (Conv2DTranspose ×2)
UpConv Block 3 → 256 ◄─ CONCAT encoder Block 3 ──────┘
↓ UpSample (×2)
UpConv Block 2 → 128 ◄─ CONCAT encoder Block 2 ─────────┘
↓ UpSample (×2)
UpConv Block 1 → 64 ◄─ CONCAT encoder Block 1 ────────────┘

Output → 1×1 Conv → num_classes (pixel-wise softmax)

What makes it a U-Net

1. The "U" shape: downsample (left side) then upsample (right side)

2. Skip connections from EACH encoder level to the MATCHING decoder level (same spatial resolution). These skip connections carry fine spatial detail that gets lost in the bottleneck.

Encoder = zooming out to understand context. Decoder = zooming back in to place exact pixel labels. Skip connections = passing the original detail forward to help with exact placement.

How it's different from a plain Autoencoder

Plain Autoencoder: bottleneck → decoder reconstructs from only the bottleneck code. No direct path from encoder to decoder at fine scales.

U-Net: skip connections concatenate encoder features at every scale. The decoder has BOTH the upsampled signal from below AND the fine details from the matching encoder layer.

Key Operations
Downsampling: MaxPool2D(2,2) or Conv2D(stride=2)
Upsampling:   Conv2DTranspose(filters, (2,2), strides=2)
Skip merge:   Concatenate()([upconv_output, encoder_feature])
Output:       Conv2D(num_classes, 1, activation='softmax')
Each output pixel gets its own class prediction. Input and output have the SAME spatial dimensions (H×W). This is pixel-wise (semantic) segmentation.
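A shape-only NumPy sketch of one encoder/decoder level. Real U-Nets use learned Conv2DTranspose upsampling; nearest-neighbour repetition here is just to show the shape bookkeeping (the 64×64×64 feature map is a made-up size):

```python
import numpy as np

def maxpool2(x):
    """2x2 max pooling on an (H, W, C) array; halves H and W."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling; doubles H and W."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

enc1 = np.random.default_rng(0).random((64, 64, 64))   # encoder level-1 features
down = maxpool2(enc1)                                  # toward the bottleneck
up   = upsample2(down)                                 # back up to level 1
skip = np.concatenate([up, enc1], axis=-1)             # U-Net skip merge on channels

print(down.shape)   # (32, 32, 64)
print(skip.shape)   # (64, 64, 128): channels doubled by the concat
```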

DeconvNet vs U-Net

DeconvNet (Zeiler 2014): also encoder-decoder, also uses unpooling + deconvolution to upsample. Difference from U-Net: DeconvNet does NOT have skip connections — it relies entirely on the bottleneck code. U-Net's skip connections give it much better fine-grained localisation, which is why it became the standard for medical imaging segmentation.

Cheat Sheet — U-Net

Encoder: Conv + MaxPool at each level (H↓ W↓ C↑)
Decoder: Conv2DTranspose at each level (H↑ W↑ C↓)
Skip: Concatenate encoder_i with decoder_i at the SAME resolution
Output: 1×1 Conv → num_classes (pixel-wise, same H×W as input)
vs Autoencoder: AE has no skip connections; U-Net does → better localisation
08
ELMo — Contextual Embeddings (How It Actually Works)
Lec 8 MEDIUM PRIORITY
CONTEXT The study guide says ELMo gives "different vector per context" but doesn't explain the mechanism. The prof listed it on the slide "Beyond RNNs and seq2seq" as the first step toward BERT/GPT. Understanding how it works helps with the "Tables/Dict/Gates" mental model too.

Problem with Word2Vec (what ELMo fixes)

Word2Vec gives ONE fixed vector per word regardless of context.

"I went to the fair last Saturday."
"I do not believe your attitude is fair."

"fair" gets the same vector in both sentences — but the meaning is completely different. Word2Vec can't handle this.

ELMo's solution

Train a deep bidirectional LSTM as a language model. For each word position, take the hidden states from MULTIPLE layers (not just the final output) and combine them. The result: a different vector for "fair" in each context, because the LSTM's hidden state encodes the surrounding words.

Word2Vec = dictionary (one meaning per word). ELMo = a person who reads the whole sentence before deciding what each word means.
ELMo Architecture — how it works
Step 1: Train a 2-layer bidirectional LSTM on a large text corpus as a language model:

    Forward LM:  P(tₖ | t₁, t₂, ..., tₖ₋₁)
    Backward LM: P(tₖ | tₖ₊₁, ..., tₙ)

Step 2: For each token k, extract hidden states from ALL layers:

    Layer 0: character-level CNN embedding (surface form)
    Layer 1: lower BiLSTM (syntactic info — POS, tense)
    Layer 2: upper BiLSTM (semantic info — word sense, meaning)

Step 3: Combine layers with learned task-specific weights γ:

    ELMo_k = γ · Σⱼ sⱼ · h_kⱼ

where sⱼ = softmax weights (learned per task) and h_kⱼ = the layer-j hidden state for token k.
Same model, different layer weights for different tasks. POS tagging = more weight on layer 1. Coreference = more weight on layer 2.
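The combination step in code: a minimal NumPy sketch of ELMo_k = γ · Σⱼ sⱼ · h_kⱼ (the logits, layer count, and hidden size are made-up toy values):

```python
import numpy as np

def elmo_combine(layer_hiddens, s_logits, gamma):
    """gamma * sum_j softmax(s)_j * h_kj over the layer representations of one token."""
    e = np.exp(s_logits - s_logits.max())
    s = e / e.sum()                              # learned per-task softmax weights s_j
    return gamma * np.tensordot(s, layer_hiddens, axes=1)

# 3 layers (char-CNN, BiLSTM-1, BiLSTM-2), hidden size 4, for a single token
h = np.random.default_rng(0).normal(size=(3, 4))

# A "syntax-heavy" task would put most weight on layer 1 (hypothetical logits)
out = elmo_combine(h, s_logits=np.array([0.0, 3.0, 0.0]), gamma=1.0)

print(out.shape)                                 # (4,): one contextual vector
```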

ELMo vs Word2Vec

Word2Vec: 1 vector per word. Context-free.

ELMo: different vector per usage. Context-sensitive via BiLSTM hidden states.

ELMo vs BERT

ELMo: BiLSTM-based. Sequential. Two separate LMs concatenated.

BERT: Transformer-based. Truly bidirectional (not two one-directional LMs). Attention over full context at once.

The progression

Word2Vec (static)

ELMo (contextual, LSTM)

BERT (contextual, Transformer)

GPT (generative, Transformer)

Cheat Sheet — ELMo

ELMo = deep BiLSTM trained as a language model. Uses ALL layer hidden states.
ELMo_k = γ · Σⱼ sⱼ · h_kⱼ (learned per-task weights over layers)
Layer 1 = syntax. Layer 2 = semantics. Different tasks → different layer weights.
Word2Vec: 1 vector per word. ELMo: 1 vector per word-in-context.
Progression: Word2Vec → ELMo → BERT → GPT