AlexNet / CaffeNet
Input: 3×227×227. Stack: Conv → ReLU → MaxPool → Norm (local response normalization) → ... → FC → Softmax. First deep CNN to win ImageNet (2012). Popularized GPU training, ReLU, and Dropout.
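The layer sizes follow from the standard conv output-size formula; as a sketch, AlexNet's first conv (96 filters of 11×11, stride 4, no padding, per the original paper) applied to the 227×227 input:

```python
# Conv output size: (n_in - k + 2*pad) / stride + 1
n_in, k, stride, pad = 227, 11, 4, 0
n_out = (n_in - k + 2 * pad) // stride + 1
print(n_out)  # → 55, so the first activation volume is 96×55×55
```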
VGG
Very deep network using only 3×3 conv filters. The insight: two 3×3 convs = one 5×5 conv's receptive field, but fewer parameters. Goes up to 19 layers.
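A quick parameter count makes the insight concrete (assuming a hypothetical C = 64 input and output channels, ignoring biases):

```python
C = 64  # channels in = channels out (illustrative)

# Two stacked 3x3 convs: same 5x5 receptive field
params_two_3x3 = 2 * (3 * 3 * C * C)  # 73,728
# One 5x5 conv
params_one_5x5 = 5 * 5 * C * C        # 102,400

assert params_two_3x3 < params_one_5x5  # fewer params, plus an extra nonlinearity
```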
GoogLeNet / Inception Module
Instead of choosing 1×1, 3×3, or 5×5 filters — use ALL in parallel and concatenate. Uses 1×1 convolutions as "bottlenecks" to reduce computation before expensive 3×3 and 5×5 convs.
ResNet — Skip Connections
Problem: as networks get deeper, gradients vanish and training degrades. Solution: add the input directly to the output of a block (skip connection). The block outputs F(x) + x, so its layers only have to learn the residual F(x) = H(x) − x instead of the full mapping H(x).
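A minimal NumPy sketch of the idea, using a toy fully connected "block" with made-up weights: the block computes F(x) and adds x back, so with near-zero weights it defaults to (roughly) the identity:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    f = relu(x @ W1) @ W2      # F(x): two linear maps with a ReLU between
    return relu(f + x)         # skip connection: add the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8)) * 0.01  # tiny weights → F(x) ≈ 0
W2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, W1, W2)
# With F(x) ≈ 0 the block is approximately the identity (after ReLU):
assert np.allclose(y, relu(x), atol=1e-2)
```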
MobileNet — Depthwise Separable Conv
Split standard conv into: (1) Depthwise conv (filter each channel independently) + (2) Pointwise 1×1 conv (combine channels). Dramatically reduces computation.
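The savings can be estimated by counting multiply-accumulates over a feature map (the layer sizes below are illustrative, not from any specific MobileNet variant):

```python
H = W = 112; Cin = 32; Cout = 64; K = 3  # hypothetical layer sizes

standard  = H * W * K * K * Cin * Cout   # one standard conv
depthwise = H * W * K * K * Cin          # (1) per-channel filtering
pointwise = H * W * Cin * Cout           # (2) 1x1 channel mixing
separable = depthwise + pointwise

# Ratio ≈ 1/Cout + 1/K² — roughly 8x cheaper here
print(separable / standard)
```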
MobileNet v2 — Inverted Residuals
Bottleneck blocks that EXPAND first (using 1×1), then depthwise, then COMPRESS (using 1×1). Add skip connection on the compressed (narrow) ends. Opposite of regular bottleneck.
Weight Initialization
Zeros → all neurons learn the same thing (symmetry problem). Small random values → break symmetry, but the scale must be right or activations/gradients vanish or explode in deep nets. Xavier/Glorot → good for symmetric activations (tanh). He → good for ReLU and other asymmetric activations.
Vanishing / Exploding Gradients
In deep nets, gradients can shrink to ≈0 (vanishing) or blow up to ∞ (exploding) as they travel backwards. Cause: repeated multiplication of small/large numbers. Fix: proper init, BatchNorm, skip connections.
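The mechanism in two lines: backprop multiplies one local factor per layer, so factors below/above 1 compound exponentially (the 0.5 and 1.5 here are illustrative):

```python
L = 50  # network depth

g_vanish  = 0.5 ** L   # per-layer factor < 1 → gradient ≈ 8.9e-16
g_explode = 1.5 ** L   # per-layer factor > 1 → gradient ≈ 6.4e8

print(g_vanish, g_explode)
```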
L2 Regularization (Weight Decay)
Add λ·||w||² to the loss. Effect: weights are pulled towards zero (shrunk) at every update. Also called Tikhonov regularization or ridge regression.
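A single gradient step on the penalty alone (hypothetical λ and learning rate) shows the shrinking effect: the gradient of λ·||w||² is 2λw, so each update multiplies w by (1 − 2·lr·λ):

```python
import numpy as np

lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 0.5])
grad_data = np.zeros_like(w)  # pretend the data-loss gradient is zero

w_new = w - lr * (grad_data + 2 * lam * w)  # every weight shrinks toward 0
assert np.all(np.abs(w_new) < np.abs(w))
```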
L1 Regularization (Lasso)
Add λ·||w||₁ to the loss. Effect: many weights become exactly 0 (sparsity). Acts as feature selection.
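One way to see where the exact zeros come from is the soft-thresholding (proximal) step used by Lasso solvers; a minimal sketch:

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal step for the L1 penalty: shrink magnitudes by t, clip to exactly 0
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.02, -1.2])
print(soft_threshold(w, 0.1))  # small weights become exactly 0 → sparsity
```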
Bias-Variance Trade-off
High bias = underfitting (model too simple). High variance = overfitting (model memorizes training data). Goal: find the sweet spot.
Train / Validation / Test Sets
Train: learn weights. Validation: tune hyperparameters (architecture, lr, reg). Test: final evaluation, used ONCE. Rule of thumb: with >1M training samples, val and test can each be as small as 1% (≈10k samples).
Early Stopping
Stop training when validation loss stops improving. Prevents overfitting. Simple and very effective regularization.
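A minimal patience-based sketch (the per-epoch validation losses are made up):

```python
def early_stopping(val_losses, patience=3):
    best, best_epoch, wait = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0  # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs: stop
    return best_epoch, best  # restore weights from the best epoch

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
print(early_stopping(losses))  # → (2, 0.7)
```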
Bagging / Ensemble Methods
Train multiple models on different bootstrap samples of data. Average their predictions. Different models make different mistakes → averaging reduces errors.
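A toy simulation (Gaussian noise standing in for independent model errors) shows why averaging helps: the variance of the mean of k independent errors is 1/k of a single model's:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
# 10 "models", each a noisy estimator over 1000 test points
preds = true_value + rng.standard_normal((10, 1000))

single_err   = np.mean((preds[0] - true_value) ** 2)           # ≈ 1
ensemble_err = np.mean((preds.mean(axis=0) - true_value) ** 2) # ≈ 0.1
assert ensemble_err < single_err
```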
| Abbr | Full Name | Key Point |
|---|---|---|
| VGG | Visual Geometry Group | Deep net, only 3×3 filters (Oxford) |
| ResNet | Residual Network | Skip connections to fight vanishing gradients |
| mHC | mHC (doubly stochastic) | Sinkhorn-Knopp normalization for ResNet in 2026 |
| ReLU6 | ReLU capped at 6 | min(max(0,x), 6) — used in MobileNet |
| GELU | Gaussian Error Linear Unit | Smoother activation, used in Transformers |
| L1 | L1 norm penalty (Lasso) | Encourages sparsity (exact zeros) |
| L2 | L2 norm penalty (Ridge/Tikhonov) | Weight decay, drives weights toward 0 |
| i.i.d. | Independent and Identically Distributed | Assumption that train/test come from same distribution |
| κ | Condition Number | λ_max/λ_min — how badly conditioned the loss surface is |
| DW | Depthwise Conv | Filter each input channel separately |
| PW | Pointwise Conv (1×1 Conv) | Combine channels linearly |
Inception module (Keras) - parallel branches, then concatenate:

```python
from tensorflow.keras.layers import Conv2D, MaxPool2D, Concatenate

# Parallel branches, then concatenate along the channel axis
conv_1x1 = Conv2D(filters_1x1, (1, 1), padding='same', activation='relu')(x)

conv_3x3 = Conv2D(filters_3x3_reduce, (1, 1), padding='same', activation='relu')(x)  # 1x1 bottleneck
conv_3x3 = Conv2D(filters_3x3, (3, 3), padding='same', activation='relu')(conv_3x3)

conv_5x5 = Conv2D(filters_5x5_reduce, (1, 1), padding='same', activation='relu')(x)  # 1x1 bottleneck
conv_5x5 = Conv2D(filters_5x5, (5, 5), padding='same', activation='relu')(conv_5x5)

pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu')(pool_proj)

output = Concatenate(axis=3)([conv_1x1, conv_3x3, conv_5x5, pool_proj])
```
Inverted residual block (Keras). Note: the skip connection requires `squeeze` to equal the input's channel count, and the depthwise conv needs `padding='same'` so spatial dimensions match for the Add:

```python
from tensorflow.keras.layers import Conv2D, DepthwiseConv2D, Add

def inverted_residual_block(x, expand=64, squeeze=16):
    m = Conv2D(expand, (1, 1), activation='relu')(x)                   # EXPAND
    m = DepthwiseConv2D((3, 3), padding='same', activation='relu')(m)  # DEPTHWISE
    m = Conv2D(squeeze, (1, 1))(m)                                     # COMPRESS (linear bottleneck, no ReLU)
    return Add()([m, x])  # skip connection on the NARROW ends
```

(MobileNet v2 actually uses ReLU6 rather than plain ReLU.)
```python
import numpy as np

n_in, n_out = 256, 128  # example layer sizes

# BAD - zeros initialization (symmetry problem: all neurons get identical updates)
w_bad = np.zeros((n_in, n_out))

# Xavier/Glorot - for symmetric activations (tanh)
limit = np.sqrt(6.0 / (n_in + n_out))
w_xavier = np.random.uniform(-limit, limit, (n_in, n_out))

# He initialization - for ReLU
w_he = np.random.normal(0, np.sqrt(2.0 / n_in), (n_in, n_out))

# In Keras:
# Dense(64, activation='relu', kernel_initializer='he_normal')
# Dense(64, activation='tanh', kernel_initializer='glorot_uniform')
```