Deep Learning Formula Sheet

ELEN 521  ·  Complete Course Reference

ELEN 521 · W2026
Loss Functions
Mean Squared Error
\[ \mathcal{L}_{MSE}(y,\hat{p}) = \frac{1}{B}\sum_{i}(y^{(i)} - \hat{p}^{(i)})^2 \]
Binary Cross-Entropy
\[ \mathcal{L}_{BCE} = -\frac{1}{B}\sum_{i}\bigl[y^{(i)}\log\hat{p}^{(i)} + (1-y^{(i)})\log(1-\hat{p}^{(i)})\bigr] \]
Categorical Cross-Entropy
\[ \mathcal{L}_{CE} = -\frac{1}{B}\sum_{i}\sum_{k} y_k^{(i)}\log \hat{p}_k^{(i)} \]

B = batch size  |  k = class index
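A minimal NumPy sketch of the three losses (function names are illustrative; `y` and `p` are batch arrays, with `eps` guarding against log(0)):

```python
import numpy as np

def mse(y, p):
    # Mean squared error averaged over the batch
    return np.mean((y - p) ** 2)

def bce(y, p, eps=1e-12):
    # Binary cross-entropy; clip p away from 0 and 1 before taking logs
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cce(y, p, eps=1e-12):
    # Categorical cross-entropy; y is one-hot with shape (B, K)
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y * np.log(p), axis=1))
```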

Activation Functions
Sigmoid
\[ \sigma(z) = \frac{1}{1+e^{-z}} \]
Softmax
\[ \text{softmax}(z_k) = \frac{e^{z_k}}{\sum_j e^{z_j}} \]
ReLU
\[ \text{relu}(z) = \max(0, z) \]
Tanh
\[ \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \]

ReLU → He init  ·  tanh/sigmoid → Xavier init
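These activations can be sketched in NumPy (a minimal illustration; the softmax subtracts its max before exponentiating, which leaves the result unchanged but avoids overflow):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}), squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Shift by max(z) for numerical stability; result is identical
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)
```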

Backprop & Gradient Descent
Weight Update (SGD)
\[ w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w} \]
Chain Rule (Backprop)
\[ \frac{\partial g}{\partial x} = \frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x} \]
  • η = learning rate (hyperparameter)
  • Propagate gradients backward layer by layer
  • Gradient at layer ℓ depends on layer ℓ+1 (chain rule)
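The SGD update can be illustrated on a one-dimensional quadratic (a toy sketch; the loss and learning rate are chosen for the example):

```python
# Minimize L(w) = (w - 3)^2 with plain SGD: w <- w - eta * dL/dw
eta = 0.1   # learning rate (hyperparameter)
w = 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)   # analytic dL/dw
    w = w - eta * grad
# w converges toward the minimizer w* = 3
```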
Batch Normalization
Normalize
\[ \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \]
Scale & Shift (learnable γ, β)
\[ y = \gamma\,\hat{x} + \beta \]

μ_B, σ_B² = batch mean & variance  |  ε = numerical stability constant

Applied to intermediate layer outputs; stabilizes training.
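A forward-pass sketch in NumPy (training-mode statistics only; names are illustrative):

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # x: (B, D) activations; gamma, beta: (D,) learnable scale & shift
    mu = x.mean(axis=0)          # batch mean per feature
    var = x.var(axis=0)          # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale & shift
```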

Regularization
L2 Regularized Objective (Weight Decay)
\[ \tilde{\mathcal{L}} = \mathcal{L}(y,\hat{p}) + \frac{\lambda}{2}\|w\|^2 \]
L1 Regularization (Sparsity)
\[ \tilde{\mathcal{L}} = \mathcal{L}(y,\hat{p}) + \lambda\|w\|_1 \]
L2 Weight Update (shrink step)
\[ w \leftarrow w(1 - \eta\lambda) - \eta\frac{\partial\mathcal{L}}{\partial w} \]
  • L2 → drives weights toward 0 (ridge / weight decay)
  • L1 → encourages sparsity (feature selection)
  • Bias terms are not regularized
  • Dropout: randomly zero out neurons during training (implicit regularization)
  • Early stopping: regularization in time
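The two ways of writing the L2 update above are algebraically the same step, which a quick NumPy check confirms (values are arbitrary):

```python
import numpy as np

eta, lam = 0.1, 0.01
w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])        # dL/dw of the data loss only

# Gradient of the regularized objective adds lam * w ...
w_a = w - eta * (grad + lam * w)
# ... equivalently, a multiplicative shrink followed by the data step
w_b = w * (1 - eta * lam) - eta * grad
```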
Weight Initialization
Xavier / Glorot (symmetric activations: tanh)
\[ W \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\; \sqrt{\frac{6}{n_{in}+n_{out}}}\right) \]
He / Kaiming (ReLU)
\[ W \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{in}}\right) \]
  • Zero init → all neurons learn the same thing (symmetry problem)
  • Large random init → exploding/vanishing gradients
  • Bias → always initialize to 0
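Both schemes are a few lines in NumPy (a sketch; the seeded generator is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    # Glorot uniform: U(-limit, limit), limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out):
    # Kaiming normal: zero mean, std = sqrt(2 / n_in), suited to ReLU
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
```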
Convolutional Neural Networks
Output Spatial Size
\[ O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1 \]
Convolution Operation
\[ \text{out}[f_o, r, c] = \sum_{f_i}\sum_{i,j} w[f_o,f_i,i,j] \cdot x[f_i, Sr+i, Sc+j] \]

W = input size  |  K = kernel size  |  P = padding  |  S = stride

Parameter count per Conv layer = K × K × Cin × Cout + Cout (bias)

  • MaxPool: takes max in each window (no params)
  • 1×1 Conv: channel-wise linear projection
  • DepthwiseConv: one filter per input channel (MobileNet)
  • Condition number: κ = λ_max / λ_min of the Hessian
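The output-size and parameter-count formulas translate directly (helper names are illustrative):

```python
def conv_out(W, K, P, S):
    # O = floor((W - K + 2P) / S) + 1
    return (W - K + 2 * P) // S + 1

def conv_params(K, c_in, c_out):
    # One K x K kernel per (input, output) channel pair, plus a bias per output channel
    return K * K * c_in * c_out + c_out
```

For example, a 3×3 conv with stride 1 and "same" padding P = 1 preserves a 32×32 input, and a 7×7 stride-2 conv with P = 3 halves 224 to 112.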
Embeddings & NLP Metrics
TF-IDF
\[ \text{TF}(w,d) = \frac{\text{freq}(w)}{|d|} \quad \text{IDF}(w)=\log\frac{N}{df(w)} \] \[ \text{TF-IDF}(w,d) = \text{TF}(w,d)\cdot\text{IDF}(w) \]
Cosine Similarity
\[ \cos(\mathbf{v},\mathbf{w}) = \frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|} \in [-1,+1] \]

Word analogy: vec(king) − vec(man) + vec(woman) ≈ vec(queen)
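Both metrics are one-liners in NumPy (a sketch; `freq`, `doc_len`, `df`, `N` follow the definitions above):

```python
import numpy as np

def cosine(v, w):
    # Cosine of the angle between embedding vectors, in [-1, +1]
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

def tfidf(freq, doc_len, df, N):
    # TF(w,d) * IDF(w) = (freq / |d|) * log(N / df)
    return (freq / doc_len) * np.log(N / df)
```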

Word2Vec (Skip-Gram)
Probability of positive pair (t, c)
\[ P(+|t,c) = \sigma(\mathbf{t}\cdot\mathbf{c}) = \frac{1}{1+e^{-\mathbf{t}\cdot\mathbf{c}}} \]
Objective (all context words, independent)
\[ \log P(+|t,c) + \sum_{i=1}^{k}\log P(-|t,n_i) \]
  • Noise word sampling: draw from P_α(w) ∝ count(w)^α; α = ¾ works well
  • Learn separate W (target) and C (context) matrices
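The skip-gram negative-sampling objective for one positive pair is a few lines (a sketch; `t`, `c`, and the noise vectors are embedding rows, and P(−|t,n) = 1 − σ(t·n) = σ(−t·n)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_objective(t, c, negatives):
    # log P(+|t,c) plus the sum of log P(-|t,n_i) over k noise words
    pos = np.log(sigmoid(t @ c))
    neg = sum(np.log(sigmoid(-(t @ n))) for n in negatives)
    return pos + neg
```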
Simple RNN
Hidden State Update
\[ s_t = \tanh(U\,x_t + W\,s_{t-1}) \]
Output
\[ o_t = \text{softmax}(V\,s_t) \]
  • BPTT: backpropagation through time
  • Long sequences → vanishing/exploding gradients
  • Gradient clipping: cap ‖∇‖ to prevent explosion
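One recurrence step, written out in NumPy (a sketch; `U`, `W`, `V` match the formulas above and the softmax is max-shifted for stability):

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    # s_t = tanh(U x_t + W s_{t-1});  o_t = softmax(V s_t)
    s_t = np.tanh(U @ x_t + W @ s_prev)
    z = V @ s_t
    e = np.exp(z - z.max())
    o_t = e / e.sum()
    return s_t, o_t
```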
LSTM Gates
Forget Gate
\[ f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \]
Input Gate
\[ i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \]
Cell State Update
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_c[h_{t-1}, x_t] + b_c) \]
Output Gate & Hidden State
\[ o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \quad h_t = o_t \odot \tanh(C_t) \]

f = forget  |  i = input  |  o = output  |  C = cell state  |  ⊙ = element-wise multiply  |  GRU simplifies to 2 gates (reset + update)
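One LSTM cell step in NumPy (a sketch; `[h_{t-1}, x_t]` is concatenation, each `W*` maps the concatenated vector to the hidden size, and `*` on arrays is the element-wise ⊙):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    v = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(Wf @ v + bf)                    # forget gate
    i = sigmoid(Wi @ v + bi)                    # input gate
    C = f * C_prev + i * np.tanh(Wc @ v + bc)   # cell state update
    o = sigmoid(Wo @ v + bo)                    # output gate
    h = o * np.tanh(C)                          # hidden state
    return h, C
```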

Attention Mechanism (seq2seq)
Attention Score
\[ e_{ti} = f_{att}(a_i,\, h_{t-1}) \]
Attention Weight (softmax)
\[ \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L}\exp(e_{tk})} \]
Context Vector
\[ \hat{z}_t = \sum_{i=1}^{L} \alpha_{ti}\, a_i \]
Scaled Dot-Product (Transformer-style)
\[ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Keys = encoder hidden states  |  Query = decoder state  |  Values = encoder hidden states
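The scaled dot-product form is compact in NumPy (a single-head sketch; rows of `Q` are queries, rows of `K`/`V` are keys/values, softmax is over the key axis):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # attention weights alpha
    return A @ V                                    # context vectors
```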

Variational Autoencoder (VAE)
ELBO Loss (maximize)
\[ \mathcal{L}_{VAE} = \underbrace{\mathbb{E}[\log p(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q(z|x)\,\|\,p(z))}_{\text{regularization}} \]
KL Divergence (Gaussian prior)
\[ D_{KL} = -\frac{1}{2}\sum_j\bigl(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\bigr) \]

Prior: z ~ N(0, I)  |  Reparameterization: z = μ + σ⊙ε, ε ~ N(0,I)
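The KL term and the reparameterization trick in NumPy (a sketch; the encoder is assumed to output `mu` and `log_var` per latent dimension, and the seeded generator is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gaussian(mu, log_var):
    # D_KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def reparameterize(mu, log_var):
    # z = mu + sigma * eps keeps the sample differentiable w.r.t. mu, sigma
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

When the posterior matches the prior (mu = 0, log_var = 0), the KL term vanishes.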

Style Transfer & Object Detection
Gram Matrix (Style Capture)
\[ G^{[l]} = M^{[l]}\,(M^{[l]})^T \quad M^{[l]} \in \mathbb{R}^{C_o \times H_o W_o} \]
Neural Style Loss
\[ \mathcal{L} = \alpha\,\mathcal{L}_{content} + \beta\,\mathcal{L}_{style} \]

YOLO IoU (Intersection over Union)
\[ \text{IoU} = \frac{\text{Intersection Area}}{\text{Union Area}} \]

Non-max suppression: discard boxes with IoU > 0.5 against best box
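IoU for axis-aligned boxes is a short function (a sketch; boxes are `(x1, y1, x2, y2)` corner coordinates):

```python
def iou(a, b):
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)   # union = sum of areas - intersection
```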

Quick Reference — Architecture & Design Rules

CNN Output Size

O = ⌊(W−K+2P)/S⌋ + 1
  • Same padding: P = (K−1)/2
  • Valid padding: P = 0
  • Params = K² × C_in × C_out + C_out (bias)

Init Summary

  • Xavier/Glorot: tanh, sigmoid (symmetric/saturating)
  • He/Kaiming: ReLU
  • Bias: always → 0
  • Zero weights: symmetry problem
  • Too large: explode/vanish

Regularization

  • L2: weight decay (ridge)
  • L1: sparsity (lasso)
  • Dropout: random neuron zeroing
  • Early stopping: time regularization
  • Data augmentation: implicit

ResNet Skip Connection

y = F(x, W) + x
  • Solves vanishing gradients in deep nets
  • Output = residual + identity
  • Related to ODE solvers (Euler method)

Condition Number

κ = λ_max / λ_min
  • Large κ → poor conditioning
  • H = U·Diag(λ)·Uᵀ (eigendecomp)
  • Leads to zigzag gradient descent

GAN Objective

min_G max_D E_x[log D(x)] + E_z[log(1−D(G(z)))]
  • D = discriminator, G = generator
  • z ~ N(0,1) for generator input
  • Use label smoothing (real=0.9, fake=0.1)

Attention (image captioning)

ẑ_t = Σᵢ α_ti aᵢ
  • Features a = {a₁,…,a_L}, aᵢ ∈ ℝᴰ
  • α_ti = exp(e_ti) / Σ_k exp(e_tk)

MobileNet Bottleneck

x_out = squeeze(dw(expand(x))) + x
  • expand → depthwise 3×3 → squeeze
  • Inverted residual block
  • Fewer params than standard conv
  • ReLU6 = min(max(0,x), 6)