Deep Learning Formula Sheet

ELEN 521  ·  Complete Course Reference

ELEN 521 · W2026
Loss Functions
Mean Squared Error
\[ \mathcal{L}_{MSE}(y,\hat{p}) = \frac{1}{B}\sum_{i}(y^{(i)} - \hat{p}^{(i)})^2 \]
Binary Cross-Entropy
\[ \mathcal{L}_{BCE} = -\frac{1}{B}\sum_{i}\bigl[y^{(i)}\log\hat{p}^{(i)} + (1-y^{(i)})\log(1-\hat{p}^{(i)})\bigr] \]
Categorical Cross-Entropy
\[ \mathcal{L}_{CE} = -\frac{1}{B}\sum_{i}\sum_{k} y_k^{(i)}\log \hat{p}_k^{(i)} \]

B = batch size  |  k = class index
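A minimal NumPy sketch of the three losses (function names are illustrative; `y` and `p` are batch arrays, with `eps` guarding against log(0)):

```python
import numpy as np

def mse(y, p):
    # Mean squared error averaged over the batch
    return np.mean((y - p) ** 2)

def bce(y, p, eps=1e-12):
    # Binary cross-entropy; clip p away from 0 and 1 before taking logs
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cce(y, p, eps=1e-12):
    # Categorical cross-entropy; y is one-hot with shape (B, K)
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y * np.log(p), axis=1))
```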

Activation Functions
Sigmoid
\[ \sigma(z) = \frac{1}{1+e^{-z}} \]
Softmax
\[ \text{softmax}(z_k) = \frac{e^{z_k}}{\sum_j e^{z_j}} \]
ReLU
\[ \text{relu}(z) = \max(0, z) \]
Tanh
\[ \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \]

ReLU → He init  ·  tanh/sigmoid → Xavier init
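These activations can be sketched in NumPy (a minimal illustration; the softmax subtracts its max before exponentiating, which leaves the result unchanged but avoids overflow):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}), squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Shift by max(z) for numerical stability; result is identical
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)
```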

Backprop & Gradient Descent
Weight Update (SGD)
\[ w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w} \]
Chain Rule (Backprop)
\[ \frac{\partial g}{\partial x} = \frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x} \]
  • η = learning rate (hyperparameter)
  • Propagate gradients backward layer by layer
  • Gradient at layer ℓ depends on layer ℓ+1 (chain rule)
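The SGD update can be illustrated on a one-dimensional quadratic (a toy sketch; the loss and learning rate are chosen for the example):

```python
# Minimize L(w) = (w - 3)^2 with plain SGD: w <- w - eta * dL/dw
eta = 0.1   # learning rate (hyperparameter)
w = 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)   # analytic dL/dw
    w = w - eta * grad
# w converges toward the minimizer w* = 3
```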
Batch Normalization
Normalize
\[ \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \]
Scale & Shift (learnable γ, β)
\[ y = \gamma\,\hat{x} + \beta \]

μ_B, σ_B² = batch mean & variance  |  ε = numerical stability constant

Applied to intermediate layer outputs; stabilizes training.
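A forward-pass sketch in NumPy (training-mode statistics only; names are illustrative):

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # x: (B, D) activations; gamma, beta: (D,) learnable scale & shift
    mu = x.mean(axis=0)          # batch mean per feature
    var = x.var(axis=0)          # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale & shift
```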

Regularization
L2 Regularized Objective (Weight Decay)
\[ \tilde{\mathcal{L}} = \mathcal{L}(y,\hat{p}) + \frac{\lambda}{2}\|w\|^2 \]
L1 Regularization (Sparsity)
\[ \tilde{\mathcal{L}} = \mathcal{L}(y,\hat{p}) + \lambda\|w\|_1 \]
L2 Weight Update (shrink step)
\[ w \leftarrow w(1 - \eta\lambda) - \eta\frac{\partial\mathcal{L}}{\partial w} \]
  • L2 → drives weights toward 0 (ridge / weight decay)
  • L1 → encourages sparsity (feature selection)
  • Bias terms are not regularized
  • Dropout: randomly zero out neurons during training (implicit regularization)
  • Early stopping: regularization in time
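The two ways of writing the L2 update above are algebraically the same step, which a quick NumPy check confirms (values are arbitrary):

```python
import numpy as np

eta, lam = 0.1, 0.01
w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])        # dL/dw of the data loss only

# Gradient of the regularized objective adds lam * w ...
w_a = w - eta * (grad + lam * w)
# ... equivalently, a multiplicative shrink followed by the data step
w_b = w * (1 - eta * lam) - eta * grad
```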
Weight Initialization
Xavier / Glorot (symmetric activations: tanh)
\[ W \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\; \sqrt{\frac{6}{n_{in}+n_{out}}}\right) \]
He / Kaiming (ReLU)
\[ W \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{in}}\right) \]
  • Zero init → all neurons learn the same thing (symmetry problem)
  • Large random init → exploding/vanishing gradients
  • Bias → always initialize to 0
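Both schemes are a few lines in NumPy (a sketch; the seeded generator is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    # Glorot uniform: U(-limit, limit), limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out):
    # Kaiming normal: zero mean, std = sqrt(2 / n_in), suited to ReLU
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
```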
Convolutional Neural Networks
Output Spatial Size
\[ O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1 \]
Convolution Operation
\[ \text{out}[f_o, r, c] = \sum_{f_i}\sum_{i,j} w[f_o,f_i,i,j] \cdot x[f_i, Sr+i, Sc+j] \]

W = input size  |  K = kernel size  |  P = padding  |  S = stride

Parameter count per Conv layer = K × K × Cin × Cout + Cout (bias)

  • MaxPool: takes max in each window (no params)
  • 1×1 Conv: channel-wise linear projection
  • DepthwiseConv: one filter per input channel (MobileNet)
  • Condition number: κ = λ_max / λ_min of the Hessian
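The output-size and parameter-count formulas translate directly (helper names are illustrative):

```python
def conv_out(W, K, P, S):
    # O = floor((W - K + 2P) / S) + 1
    return (W - K + 2 * P) // S + 1

def conv_params(K, c_in, c_out):
    # One K x K kernel per (input, output) channel pair, plus a bias per output channel
    return K * K * c_in * c_out + c_out
```

For example, a 3×3 conv with stride 1 and "same" padding P = 1 preserves a 32×32 input, and a 7×7 stride-2 conv with P = 3 halves 224 to 112.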
Embeddings & NLP Metrics
TF-IDF
\[ \text{TF}(w,d) = \frac{\text{freq}(w)}{|d|} \quad \text{IDF}(w)=\log\frac{N}{df(w)} \] \[ \text{TF-IDF}(w,d) = \text{TF}(w,d)\cdot\text{IDF}(w) \]
Cosine Similarity
\[ \cos(\mathbf{v},\mathbf{w}) = \frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|} \in [-1,+1] \]

Word analogy: vec(king) − vec(man) + vec(woman) ≈ vec(queen)
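Both metrics are one-liners in NumPy (a sketch; `freq`, `doc_len`, `df`, `N` follow the definitions above):

```python
import numpy as np

def cosine(v, w):
    # Cosine of the angle between embedding vectors, in [-1, +1]
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

def tfidf(freq, doc_len, df, N):
    # TF(w,d) * IDF(w) = (freq / |d|) * log(N / df)
    return (freq / doc_len) * np.log(N / df)
```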

Word2Vec (Skip-Gram)
Probability of positive pair (t, c)
\[ P(+|t,c) = \sigma(\mathbf{t}\cdot\mathbf{c}) = \frac{1}{1+e^{-\mathbf{t}\cdot\mathbf{c}}} \]
Objective (all context words, independent)
\[ \log P(+|t,c) + \sum_{i=1}^{k}\log P(-|t,n_i) \]
  • Noise word sampling: draw from P_α(w) ∝ count(w)^α; α = ¾ works well
  • Learn separate W (target) and C (context) matrices
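The skip-gram negative-sampling objective for one positive pair is a few lines (a sketch; `t`, `c`, and the noise vectors are embedding rows, and P(−|t,n) = 1 − σ(t·n) = σ(−t·n)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_objective(t, c, negatives):
    # log P(+|t,c) plus the sum of log P(-|t,n_i) over k noise words
    pos = np.log(sigmoid(t @ c))
    neg = sum(np.log(sigmoid(-(t @ n))) for n in negatives)
    return pos + neg
```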
Simple RNN
Hidden State Update
\[ s_t = \tanh(U\,x_t + W\,s_{t-1}) \]
Output
\[ o_t = \text{softmax}(V\,s_t) \]
  • BPTT: backpropagation through time
  • Long sequences → vanishing/exploding gradients
  • Gradient clipping: cap ‖∇‖ to prevent explosion
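One recurrence step, written out in NumPy (a sketch; `U`, `W`, `V` match the formulas above and the softmax is max-shifted for stability):

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    # s_t = tanh(U x_t + W s_{t-1});  o_t = softmax(V s_t)
    s_t = np.tanh(U @ x_t + W @ s_prev)
    z = V @ s_t
    e = np.exp(z - z.max())
    o_t = e / e.sum()
    return s_t, o_t
```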
LSTM Gates
Forget Gate
\[ f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \]
Input Gate
\[ i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \]
Cell State Update
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_c[h_{t-1}, x_t] + b_c) \]
Output Gate & Hidden State
\[ o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \quad h_t = o_t \odot \tanh(C_t) \]

f = forget  |  i = input  |  o = output  |  C = cell state  |  ⊙ = element-wise multiply  |  GRU simplifies to 2 gates (reset + update)
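One LSTM cell step in NumPy (a sketch; `[h_{t-1}, x_t]` is concatenation, each `W*` maps the concatenated vector to the hidden size, and `*` on arrays is the element-wise ⊙):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    v = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(Wf @ v + bf)                    # forget gate
    i = sigmoid(Wi @ v + bi)                    # input gate
    C = f * C_prev + i * np.tanh(Wc @ v + bc)   # cell state update
    o = sigmoid(Wo @ v + bo)                    # output gate
    h = o * np.tanh(C)                          # hidden state
    return h, C
```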

Attention Mechanism (seq2seq)
Attention Score
\[ e_{ti} = f_{att}(a_i,\, h_{t-1}) \]
Attention Weight (softmax)
\[ \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L}\exp(e_{tk})} \]
Context Vector
\[ \hat{z}_t = \sum_{i=1}^{L} \alpha_{ti}\, a_i \]
Scaled Dot-Product (Transformer-style)
\[ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Keys = encoder hidden states  |  Query = decoder state  |  Values = encoder hidden states
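The scaled dot-product form is compact in NumPy (a single-head sketch; rows of `Q` are queries, rows of `K`/`V` are keys/values, softmax is over the key axis):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # attention weights alpha
    return A @ V                                    # context vectors
```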

Variational Autoencoder (VAE)
ELBO Loss (maximize)
\[ \mathcal{L}_{VAE} = \underbrace{\mathbb{E}[\log p(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q(z|x)\,\|\,p(z))}_{\text{regularization}} \]
KL Divergence (Gaussian prior)
\[ D_{KL} = -\frac{1}{2}\sum_j\bigl(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\bigr) \]

Prior: z ~ N(0, I)  |  Reparameterization: z = μ + σ⊙ε, ε ~ N(0,I)
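The KL term and the reparameterization trick in NumPy (a sketch; the encoder is assumed to output `mu` and `log_var` per latent dimension, and the seeded generator is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gaussian(mu, log_var):
    # D_KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def reparameterize(mu, log_var):
    # z = mu + sigma * eps keeps the sample differentiable w.r.t. mu, sigma
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

When the posterior matches the prior (mu = 0, log_var = 0), the KL term vanishes.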

Style Transfer & Object Detection
Gram Matrix (Style Capture)
\[ G^{[l]} = M^{[l]}\,(M^{[l]})^T \quad M^{[l]} \in \mathbb{R}^{C_o \times H_o W_o} \]
Neural Style Loss
\[ \mathcal{L} = \alpha\,\mathcal{L}_{content} + \beta\,\mathcal{L}_{style} \]

YOLO IoU (Intersection over Union)
\[ \text{IoU} = \frac{\text{Intersection Area}}{\text{Union Area}} \]

Non-max suppression: discard boxes with IoU > 0.5 against best box
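IoU for axis-aligned boxes is a short function (a sketch; boxes are `(x1, y1, x2, y2)` corner coordinates):

```python
def iou(a, b):
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)   # union = sum of areas - intersection
```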

Quick Reference — Architecture & Design Rules

CNN Output Size

O = ⌊(W−K+2P)/S⌋ + 1
  • Same padding: P = (K−1)/2
  • Valid padding: P = 0
  • Params = K² × C_in × C_out + C_out (bias)

Init Summary

  • Xavier/Glorot: tanh, sigmoid (symmetric/saturating)
  • He/Kaiming: ReLU
  • Bias: always → 0
  • Zero weights: symmetry problem
  • Too large: explode/vanish

Regularization

  • L2: weight decay (ridge)
  • L1: sparsity (lasso)
  • Dropout: random neuron zeroing
  • Early stopping: time regularization
  • Data augmentation: implicit

ResNet Skip Connection

y = F(x, W) + x
  • Solves vanishing gradients in deep nets
  • Output = residual + identity
  • Related to ODE solvers (Euler method)

Condition Number

κ = λ_max / λ_min
  • Large κ → poor conditioning
  • H = U·Diag(λ)·Uᵀ (eigendecomp)
  • Leads to zigzag gradient descent

GAN Objective

min_G max_D E_x[log D(x)] + E_z[log(1−D(G(z)))]
  • D = discriminator, G = generator
  • z ~ N(0,1) for generator input
  • Use label smoothing (real=0.9, fake=0.1)

Attention (image captioning)

ẑ_t = Σᵢ α_ti aᵢ
  • Features a = {a₁,…,a_L}, aᵢ ∈ ℝᴰ
  • α_ti = exp(e_ti) / Σ_k exp(e_tk)

MobileNet Bottleneck

x_out = squeeze(dw(expand(x))) + x
  • expand → depthwise 3×3 → squeeze
  • Inverted residual block
  • Fewer params than standard conv
  • ReLU6 = min(max(0,x), 6)