Simple MLP - Forward and Backward Pass (With Einsum) Derivation
Neural Network Forward and Backward Pass
Notation
- h = Sigmoid(Wx + b)
- y = Softmax(Vh + c)
- W = Nh × Ni
- V = No × Nh
- b = Nh × 1
- c = No × 1
- x = Ni × 1 (1 training sample)
Forward Pass
\[\begin{aligned} h_a &= W @ x + b \quad && (Nh \times Ni) @ (Ni \times 1) + (Nh \times 1) \\ h &= \text{Sigmoid}(h_a) \quad && (Nh \times 1) \\ y_a &= V @ h + c \quad && (No \times Nh) @ (Nh \times 1) + (No \times 1) \\ y &= \text{Softmax}(y_a) \quad && (No \times 1) \end{aligned}\]

Cross-entropy loss: \(L = -\sum_{i} t_i \log y_i \quad \text{(scalar)}\); the elementwise terms \(-t \odot \log y\) form an \((No \times 1)\) vector before summation.
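To sanity-check the shapes annotated above, here is a minimal NumPy sketch; the dimensions Ni = 4, Nh = 3, No = 2 are arbitrary picks for illustration, not values from the text:

```python
import numpy as np

Ni, Nh, No = 4, 3, 2                        # arbitrary small dimensions for illustration
W, b = np.random.randn(Nh, Ni), np.random.randn(Nh, 1)
V, c = np.random.randn(No, Nh), np.random.randn(No, 1)
x = np.random.randn(Ni, 1)                  # one training sample as a column vector

ha = W @ x + b                              # (Nh, 1)
h = 1 / (1 + np.exp(-ha))                   # (Nh, 1)
ya = V @ h + c                              # (No, 1)
y = np.exp(ya) / np.sum(np.exp(ya))         # (No, 1), softmax

print(ha.shape, h.shape, ya.shape, y.shape) # (3, 1) (3, 1) (2, 1) (2, 1)
```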
Backward Pass
Gradients, starting from the loss:

\[\frac{\partial L}{\partial L} = 1 ; \quad \frac{\partial L}{\partial y} = -\frac{t}{y}\]

\[\frac{\partial L}{\partial y_a} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial y_a} = y - t \quad (No \times 1)\]

Note: the shape of each gradient must match the shape of the variable with respect to which the derivative is taken.
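The step from \(\partial L/\partial y\) to \(\partial L/\partial y_a = y - t\) folds in the softmax Jacobian; for completeness, the standard intermediate step (not spelled out above) is:

\[\frac{\partial y_i}{\partial y_{a,j}} = y_i(\delta_{ij} - y_j), \qquad \frac{\partial L}{\partial y_{a,j}} = \sum_i \left(-\frac{t_i}{y_i}\right) y_i (\delta_{ij} - y_j) = -t_j + y_j \sum_i t_i = y_j - t_j,\]

using \(\sum_i t_i = 1\) for a one-hot target.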
Derivatives:
\[\frac{\partial L}{\partial V} = \frac{\partial L}{\partial y_a} \cdot \frac{\partial y_a}{\partial V} = (y - t) \cdot h^\top \quad \Rightarrow \quad (No \times 1) @ (1 \times Nh) = (No \times Nh)\]

\[\frac{\partial L}{\partial h} = \frac{\partial L}{\partial y_a} \cdot \frac{\partial y_a}{\partial h} = V^\top \cdot (y - t) \quad \Rightarrow \quad (Nh \times No) @ (No \times 1) = (Nh \times 1)\]

\[\frac{\partial L}{\partial c} = \frac{\partial L}{\partial y_a} \cdot \frac{\partial y_a}{\partial c} = (y - t) \cdot 1 \quad \Rightarrow \quad (No \times 1)\]

\[\frac{\partial L}{\partial h_a} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial h_a} = \frac{\partial L}{\partial h} \odot \left[ \sigma(h_a) \cdot (1 - \sigma(h_a)) \right] \quad \Rightarrow \quad (Nh \times 1) \odot (Nh \times 1) = (Nh \times 1) \quad \text{(elementwise)}\]

\[\frac{\partial L}{\partial W} = \frac{\partial L}{\partial h_a} \cdot \frac{\partial h_a}{\partial W} = \frac{\partial L}{\partial h_a} \cdot x^\top = (Nh \times 1) @ (1 \times Ni) = (Nh \times Ni)\]

\[\frac{\partial L}{\partial b} = \frac{\partial L}{\partial h_a} \cdot \frac{\partial h_a}{\partial b} = \frac{\partial L}{\partial h_a} \cdot 1 = (Nh \times 1)\]
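These matrix formulas can be checked numerically. Below is a self-contained sketch (the dimensions, the probed entry \((i, j) = (0, 1)\), and the step size 1e-5 are arbitrary choices) that compares the analytic \(\partial L/\partial V = (y - t)\,h^\top\) against a central finite difference:

```python
import numpy as np

def forward_loss(W, b, V, c, x, t):
    """Forward pass and cross-entropy loss for one sample (column vectors)."""
    h = 1 / (1 + np.exp(-(W @ x + b)))
    ya = V @ h + c
    y = np.exp(ya - ya.max()) / np.sum(np.exp(ya - ya.max()))
    return -np.sum(t * np.log(y)), h, y

Ni, Nh, No = 5, 4, 3                       # small, arbitrary dimensions
W, b = np.random.randn(Nh, Ni), np.random.randn(Nh, 1)
V, c = np.random.randn(No, Nh), np.random.randn(No, 1)
x = np.random.randn(Ni, 1)
t = np.zeros((No, 1)); t[1] = 1.0          # one-hot target, class 1

# Analytic gradient from the derivation: dL/dV = (y - t) h^T
_, h, y = forward_loss(W, b, V, c, x, t)
dL_dV = (y - t) @ h.T

# Central finite difference on one entry of V
i, j, eps = 0, 1, 1e-5
Vp, Vm = V.copy(), V.copy()
Vp[i, j] += eps
Vm[i, j] -= eps
numeric = (forward_loss(W, b, Vp, c, x, t)[0] - forward_loss(W, b, Vm, c, x, t)[0]) / (2 * eps)
print(dL_dV[i, j], numeric)                # the two values should agree closely
```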
Forward and Backward Pass Using einsum (Single-Sample MLP)
Model
- Hidden layer: \(h = \sigma(Wx + b)\)
- Output layer: \(y = \text{softmax}(Vh + c)\)
Index Convention
- \(i\): input dimension index, \(i = 1 \dots N_i\)
- \(h\): hidden dimension index, \(h = 1 \dots N_h\)
- \(o\): output dimension index, \(o = 1 \dots N_o\)
Forward Pass
Hidden pre-activation
\[h_{a,h} = \sum_i W_{h i} x_i + b_h\]

`ha = np.einsum("hi,i->h", W, x) + b`
Hidden activation (sigmoid)
\[h_h = \sigma(h_{a,h})\]

`h = sigmoid(ha)`
Output pre-activation
\[y_{a,o} = \sum_h V_{o h} h_h + c_o\]

`ya = np.einsum("oh,h->o", V, h) + c`
Output activation (softmax)
\[y_o = \frac{e^{y_{a,o}}}{\sum_{o'} e^{y_{a,o'}}}\]

`y = softmax(ya)`
Loss (Cross-Entropy)
\[L = -\sum_o t_o \log y_o\]

Backward Pass
Gradient w.r.t. output pre-activation
\[\frac{\partial L}{\partial y_{a,o}} = y_o - t_o\]

`dLdya = y - t`
Gradient w.r.t. output weights \(V\)
\[\frac{\partial L}{\partial V_{o h}} = (y_o - t_o) h_h\]

`dLdV = np.einsum("o,h->oh", dLdya, h)`
Gradient w.r.t. output bias \(c\)
\[\frac{\partial L}{\partial c_o} = \frac{\partial L}{\partial y_{a,o}} = y_o - t_o\]

`dLdc = dLdya`
Gradient w.r.t. hidden activations
\[\frac{\partial L}{\partial h_h} = \sum_o V_{o h} (y_o - t_o)\]

`dLdh = np.einsum("oh,o->h", V, dLdya)`
Gradient w.r.t. hidden pre-activations
\[\frac{\partial L}{\partial h_{a,h}} = \frac{\partial L}{\partial h_h} \cdot \sigma(h_{a,h})(1 - \sigma(h_{a,h}))\]

`dLdha = dLdh * sigmoidgrad(ha)`
Gradient w.r.t. input weights \(W\)
\[\frac{\partial L}{\partial W_{h i}} = \frac{\partial L}{\partial h_{a,h}} x_i\]

`dLdW = np.einsum("h,i->hi", dLdha, x)`
Gradient w.r.t. hidden bias \(b\)
\[\frac{\partial L}{\partial b_h} = \frac{\partial L}{\partial h_{a,h}}\]

`dLdb = dLdha`
Summary
- All operations are tensor contractions.
- `einsum` makes the index flow explicit.
- Backprop through linear layers reduces to transposes and outer products.
Code:

```python
import numpy as np
# Dimensions
Ni = 784 # Input dimension (e.g., MNIST)
Nh = 500 # Hidden dimension
No = 10 # Output dimension (e.g., 10 classes)
# Initialize parameters
W = np.random.randn(Nh, Ni) * 0.01
b = np.zeros(Nh)
V = np.random.randn(No, Nh) * 0.01
c = np.zeros(No)
# Single training example
x = np.random.randn(Ni) # Input vector
t = np.zeros(No) # One-hot target
t[3] = 1 # Example: class 3
# --- Forward Pass ---
ha = np.einsum("hi,i->h", W, x) + b # (Nh,)
h = 1 / (1 + np.exp(-ha)) # (Nh,)
ya = np.einsum("oh,h->o", V, h) + c # (No,)
# Softmax with numerical stability
ya_max = np.max(ya)
exp_ya = np.exp(ya - ya_max)
y = exp_ya / np.sum(exp_ya) # (No,)
# Loss
L = -np.sum(t * np.log(y + 1e-8))
# --- Backward Pass ---
dL_dya = y - t # (No,)
dL_dV = np.einsum("o,h->oh", dL_dya, h) # (No, Nh)
dL_dc = dL_dya # (No,)
dL_dh = np.einsum("oh,o->h", V, dL_dya) # (Nh,)
sigmoid_grad = h * (1 - h) # (Nh,)
dL_dha = dL_dh * sigmoid_grad # (Nh,)
dL_dW = np.einsum("h,i->hi", dL_dha, x) # (Nh, Ni)
dL_db = dL_dha # (Nh,)
# --- Update parameters (gradient descent) ---
learning_rate = 0.01
W -= learning_rate * dL_dW
b -= learning_rate * dL_db
V -= learning_rate * dL_dV
c -= learning_rate * dL_dc
```
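The script above processes a single sample. As a hedged extension (not part of the original derivation), the same einsum expressions carry over to a mini-batch by adding a batch index `n`; the sketch below is meant to be appended to the script above (it reuses `W`, `b`, `V`, `c`, `Ni`, and `No`), and `N = 32` is an arbitrary batch size:

```python
# Hypothetical batched variant: rows of X are samples, rows of T are one-hot targets.
N = 32
X = np.random.randn(N, Ni)                        # (N, Ni)
T = np.zeros((N, No))
T[np.arange(N), np.random.randint(0, No, N)] = 1  # random one-hot targets

Ha = np.einsum("hi,ni->nh", W, X) + b             # (N, Nh), b broadcasts over the batch
H = 1 / (1 + np.exp(-Ha))
Ya = np.einsum("oh,nh->no", V, H) + c             # (N, No)
Y = np.exp(Ya - Ya.max(axis=1, keepdims=True))
Y /= Y.sum(axis=1, keepdims=True)                 # row-wise softmax

dYa = (Y - T) / N                                 # mean gradient over the batch
dHa = np.einsum("oh,no->nh", V, dYa) * H * (1 - H)
dV = np.einsum("no,nh->oh", dYa, H)               # sums the per-sample outer products
dW = np.einsum("nh,ni->hi", dHa, X)
dc = dYa.sum(axis=0)
db = dHa.sum(axis=0)
```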
Question
When computing \(\frac{\partial L}{\partial h} = V^\top \frac{\partial L}{\partial y_a}\) with einsum, does einsum implicitly transpose tensors?
Should I change the index order (e.g. use "ho" instead of "oh") to represent the transpose?
**Answer**
No. `einsum` never performs implicit transposes.
All transposes are expressed through the index labels: write the subscripts to match how each array is actually stored, and the contraction pattern does the rest, so there is no need to swap `"oh"` for `"ho"`.
In this example, the weight matrix is defined as:
\[V \in \mathbb{R}^{N_o \times N_h}, \quad \text{i.e. } V_{o h}.\]

The correct derivative is:

\[\frac{\partial L}{\partial h_h} = \sum_o V_{o h} \frac{\partial L}{\partial y_{a,o}}.\]

This maps directly to the following einsum:
```python
dL_dh = np.einsum("oh,o->h", V, dL_dya)
```
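To make this concrete, here is a quick check (array sizes are arbitrary) that contracting over `o` is exactly multiplication by \(V^\top\), with the label order left as `"oh"`:

```python
import numpy as np

No, Nh = 3, 4                     # arbitrary small sizes
V = np.random.randn(No, Nh)       # stored as V[o, h]
g = np.random.randn(No)           # stands in for dL/dya

via_einsum = np.einsum("oh,o->h", V, g)
via_matmul = V.T @ g              # explicit transpose, same contraction

print(np.allclose(via_einsum, via_matmul))  # True
```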