πŸ”™

Backpropagation


Motivation

In the general case, neural architectures are not guaranteed to have simple expressions representing the derivatives that we can easily calculate and hard-code. We require a method for calculating these automatically. This is the domain of the backpropagation algorithm.
The premise of backpropagation is the use of the chain rule to calculate derivatives of the cost function with respect to the parameters. Doing this naively for each node in the network can require exponentially many computational steps. The insight of backpropagation is that this can be reduced substantially by computing each intermediate derivative once and retaining it for reuse, working back from the cost.
πŸ’‘
The chain rule can be accumulated from the left or from the right, the latter of which enables backpropagation. This is typically preferred as it builds up gradients starting from the (scalar) cost and working backwards, which is quicker in the standard case where the number of neurons reduces as we proceed (ending with a single output neuron). I think there are some cases, though, in which it's actually more efficient to accumulate forwards.
Here we study the most general form of the backpropagation algorithm, which can be applied to any model comprising differentiable operations.

Background

Tensor Notation

We assume a computation graph where each node represents an operation parameterised by a tensor $\mathbf{W}$.
Tensors are typically indexed with a coordinate per rank, which can make notation challenging. To simplify this, we index using a single coordinate representing the tuple of original coordinates. This allows us to essentially treat the tensor like a standard vector in our equations.
Based on this, we can write the chain rule as it applies to tensors. Letting $\mathbf{Y} = g(\mathbf{X})$ and $z = f(\mathbf{Y})$:

$$\nabla_{\mathbf{X}} z = \sum_j \left(\nabla_{\mathbf{X}} \mathbf{Y}_j\right) \frac{\partial z}{\partial \mathbf{Y}_j}$$
πŸ’‘
My instinct for this equation is to think that we can represent it as one inner product. But $\nabla_{\mathbf{X}} \mathbf{Y}_j$ is already a matrix (it has the same shape as $\mathbf{X}$), so unless we want to get into some fairly scary tensor notation (not necessary, potentially less useful!) it's better to leave it like this. Recall as well that our standard assumption is that the derivative of the cost with respect to a parameter tensor has the same shape as that tensor.
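To make this concrete, here's a quick numerical sanity check of the tensor chain rule (my own sketch, using torch.autograd rather than anything from this post; the shapes and functions are arbitrary):

```python
import torch

X = torch.randn(2, 3, requires_grad=True)
Y = (X * 2).T                                  # Y = g(X), shape (3, 2)
z = (Y ** 2).sum()                             # z = f(Y), a scalar cost

# Direct gradient of z with respect to X (same shape as X).
grad_X, = torch.autograd.grad(z, X, retain_graph=True)

# Chain rule: sum over the flattened index j of Y.
dz_dY, = torch.autograd.grad(z, Y, retain_graph=True)   # βˆ‚z/βˆ‚Y_j
manual = torch.zeros_like(X)
for j in range(Y.numel()):
    # βˆ‡_X Y_j: gradient of a single (flattened) component of Y with respect to X
    grad_X_Yj, = torch.autograd.grad(Y.flatten()[j], X, retain_graph=True)
    manual += grad_X_Yj * dz_dY.flatten()[j]

print(torch.allclose(grad_X, manual))          # True
```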

General Backpropagation

Our objective here is to create a second computational graph that flows backwards to compute gradients.
This enables us to do the forward and backward pass in exactly the same way. In the textbook it is suggested that these graphs are combined, although they could equally be managed separately.

Requirements

  1. A computation graph $\mathcal{G}$ where nodes represent variables and edges their dependencies. A function get_children(V), returning the nodes that consume $V$, is assumed to exist.
  1. A set of target variables $T$ whose gradients must be computed.
  1. The variable to be differentiated, $z$.
  1. A backpropagation operation for each variable $V$: $\texttt{bprop}(V, W, D)$.
For each fundamental operation used in the graph $\mathcal{G}$, the framework/programmer must define a corresponding backpropagation operation, which should satisfy the following equation:

$$\texttt{bprop}(V, W, D) = \sum_j \left(\nabla_V W_j\right) D_j$$

where $W$ is a child of $V$ and $D$ is expected to represent $\nabla_W z$. This is just a step of the chain rule we're familiar with, applied to the given operation.
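For instance (an illustrative sketch of my own, not from the text), the bprop operation for an elementwise squaring op $\mathbf{W} = \mathbf{V} \odot \mathbf{V}$ could look like this; because the op is elementwise, the sum over $j$ collapses to an elementwise product:

```python
def square_forward(V):
    # W = V βŠ™ V (elementwise square)
    return V * V

def square_bprop(V, W, D):
    # D is assumed to hold βˆ‡_W z. Each βˆ‡_V W_j has a single non-zero entry, 2Β·V_j,
    # so the chain-rule sum over j collapses to an elementwise product.
    return 2 * V * D
```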
The final thing to note is the existence of a pre-processing step which prunes all nodes that are not on any path from a variable in $T$ to $z$. When adding the new backwards nodes we do this to the full graph, but for all other purposes we consider the pruned graph.

Algorithm

```python
def general_back_propagation(T, z):
    grad_table = {z: 1}                     # mutable table of computed gradients
    for V in T:
        build_grad(V, grad_table)
    return {v: g for v, g in grad_table.items() if v in T}

def build_grad(V, grad_table):
    if V in grad_table:
        return grad_table[V]
    grad_v = 0
    for W in get_children(V):
        D = build_grad(W, grad_table)
        grad_v += bprop(V, W, D)
    grad_table[V] = grad_v
    # + function to insert grad_table[V] into the graph, along with the ops creating it
    return grad_v
```
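To see the algorithm in action, here's a toy usage sketch (the graph, get_children and bprop below are my own hard-coded scaffolding, not part of the algorithm), computing the gradients of $z = (x + y) \cdot y$ at $x = 2$, $y = 3$:

```python
# Node names: s = x + y, z = s * y
values = {'x': 2.0, 'y': 3.0}
values['s'] = values['x'] + values['y']          # 5.0
values['z'] = values['s'] * values['y']          # 15.0

children = {'x': ['s'], 'y': ['s', 'z'], 's': ['z'], 'z': []}

def get_children(V):
    return children[V]

def bprop(V, W, D):
    # (βˆ‡_V W) Β· D, hard-coded for each edge of this particular graph
    local = {('x', 's'): 1.0,
             ('y', 's'): 1.0,
             ('s', 'z'): values['y'],
             ('y', 'z'): values['s']}[(V, W)]
    return local * D

print(general_back_propagation(['x', 'y'], 'z'))  # {'x': 3.0, 'y': 8.0}
```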

Minimal Backpropagation Example

Thus far we've discussed automatic computation of backpropagation. This is an essential tool for modern deep learning frameworks. However, to get a feel for what's really going on here, it helps to look at a simple hard-coded implementation, which reflects what would be automatically computed were we to, say, code this up in PyTorch.
This is exactly what I've created below: the simplest possible neural network architecture and training problem. What's presented below was initially a Jupyter Notebook, so it contains all the necessary code to make this work and has been tested. It is written as a walk-through for users wishing to implement this themselves.

Setup

First things first, there's some setup to do. This isn't a tutorial on data loading, so I'm just going to paste some code for loading up our dataset and we can ignore the details. The only thing worth noting is that we'll be using the classic MNIST character-recognition dataset:
```python
import math
import torch
import torchvision
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torch.nn.functional import one_hot
from functools import reduce
import altair as alt
import pandas as pd

batch_sz = 64
train = DataLoader(
    MNIST('data/', train=True, download=True,
          transform=torchvision.transforms.Compose([
              torchvision.transforms.ToTensor(),
              lambda x: torch.flatten(x),
          ])),
    batch_size=batch_sz, shuffle=True)
```
This train dataloader can be iterated over and returns minibatches of shape (batch_sz, input_dim). In our case, these values are (64, 28*28 = 784).
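For example, peeking at a single batch should confirm these shapes:

```python
X, y = next(iter(train))
print(X.shape, y.shape)   # torch.Size([64, 784]) torch.Size([64])
```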

Layer Overview

We'll construct our neural network by making classes for each layer. We'll use a standard setup of: linear layer, ReLU, linear layer, softmax; plus a cross-entropy loss.
For each layer class we require two methods:
  • __call__(self, x): implements the forward pass. __call__ allows us to feed an input through the layer by treating the initialised layer object as a function. For example: relu_layer = ReLU(); output = relu_layer(input).
  • bp_input(self, grad): implements the backward pass, allowing us to backpropagate the gradient vector through the layer. The grad parameter is a matrix of partial derivatives of the loss, with respect to the data sent from the given layer to the next. As such, it is a (batch_sz, out) matrix. The job of bp_input is to return a (batch_sz, in) matrix to be sent to the next layer by multiplying grad by the derivative of the forward pass with respect to the input (or an equivalent operation).
There are two other methods we sometimes wish to implement for different layers:
  • __init__(self, ...): initialises the layer, e.g. weights.
  • bp_param(self, grad): the "final stop" of the backwards pass. Only applicable for layers with trainable weights. Similar to bp_input, but calculates the derivative with respect to the weights of the layer. Should return a matrix with the same shape as the weights (self.W) to be updated.
πŸ’‘
The key point to recall when visualising this is that when we have a batch dimension it is always the first dimension, for both the forward and backward pass. This makes everything much simpler!
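To make the interface concrete before writing real layers, here's a minimal skeleton (a do-nothing identity layer of my own, not used in the network below) showing the method signatures we'll follow:

```python
class IdentityLayer:
    def __call__(self, X):
        # forward pass: (batch_sz, in) -> (batch_sz, out); here out == in
        self.X = X
        return X

    def bp_input(self, grad):
        # backward pass: receives (batch_sz, out), returns (batch_sz, in);
        # the identity's Jacobian is the identity, so grad passes straight through
        return grad
```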

Linear Layer

Let's start with the linear layer. We do the following:
  1. We start by initialising the weights (in this case using the Xavier initialisation).
  1. We then implement the call method. Rather than adding an explicit bias, we append a vector of ones to the layer's input (this is equivalent, and makes backpropagation simpler).
  1. Backpropagation with respect to the input is just right multiplication by the transpose of the weight matrix (adjusted to remove the added 1s column).
  1. Backpropagation with respect to the parameters is left multiplication by the transpose of the input matrix.
```python
class LinearLayer:
    def __init__(self, in_sz, out_sz):
        # (in+1, out): the extra row of weights plays the role of the bias
        self.W = self._xavier_init(in_sz + 1, out_sz)

    def _xavier_init(self, i, o):
        return torch.Tensor(i, o).uniform_(-1, 1) * math.sqrt(6. / (i + o))

    def __call__(self, X):
        # X: (batch_sz, in); append a ones column -> (batch_sz, in+1)
        self.X = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)
        # (batch_sz, in+1) @ (in+1, out) = (batch_sz, out)
        return self.X @ self.W

    def bp_input(self, grad):
        # (batch_sz, out) @ (out, in+1) = (batch_sz, in+1); drop the ones column -> (batch_sz, in)
        return (grad @ self.W.T)[:, :-1]

    def bp_param(self, grad):
        # (in+1, batch_sz) @ (batch_sz, out) = (in+1, out)
        return self.X.T @ grad
```
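If you want to convince yourself the hand-coded gradients are right, a quick check against torch.autograd (my own sketch, using an arbitrary tiny layer and random data) looks like this:

```python
layer = LinearLayer(4, 3)
X = torch.randn(2, 4, requires_grad=True)
W = layer.W.clone().requires_grad_(True)

# Recompute the forward pass with autograd-tracked tensors.
X_aug = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)
out = X_aug @ W
upstream = torch.randn_like(out)      # stand-in for the gradient from the next layer
(out * upstream).sum().backward()

layer(X.detach())                     # run the layer's own forward pass to cache self.X
print(torch.allclose(layer.bp_param(upstream), W.grad))   # True
print(torch.allclose(layer.bp_input(upstream), X.grad))   # True
```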

ReLU Layer

Some non-linearity is a must! Bring on the ReLU function. The implementation is pretty obvious here. clamp() is doing all the work.
```python
class ReLU:
    def __call__(self, X):
        self.X = X
        return X.clamp(min=0)               # (batch_sz, in)

    def bp_input(self, grad):
        # gradient passes through only where the input was positive
        return grad * (self.X > 0).float()  # (batch_sz, in)
```

Softmax & Cross Entropy Loss

What? Both at once, why would you do this??
This is quite common, and I can justify it in two ways:
  1. This layer and loss often go together, so why not put them both in one class? This saves us from having to do two separate forward and backward propagation steps.
  1. I won't prove it here, but it turns out that the derivative of the loss with respect to the input to the softmax is much simpler than composing the two intermediate derivatives (see πŸ”Ÿ Logistic Regression for full details), and bypasses the numerical stability issues that arise when we take the exponential and then the logarithm.
The downside here is that if we're just doing inference, we only want the softmax output. But for the purposes of this tutorial we only really care about training, so this will do just fine!
There's a trick in the second line of the softmax implementation: it turns out that subtracting each row's maximum from the softmax input keeps the output the same, but makes the intermediate values more numerically stable. How neat!
Finally, we examine the backprop step. Our starting gradient for backprop (the initial grad value passed in is just a tensor of ones) is the difference between our predicted probability vector and the actual one-hot encoded label. This is very intuitive and simple!
```python
# (batch_sz, in=out) for all dims in this layer
class SoftmaxCrossEntropyLoss:
    def __call__(self, X, Y):
        self.Y = Y
        self.Y_prob = self._softmax(X)
        self.loss = self._cross_entropy_loss(Y, self.Y_prob)
        return self.Y_prob, self.loss

    def _softmax(self, X):
        self.X = X
        # subtract the row-wise max: same output, numerically stable intermediates
        X_adj = X - X.amax(dim=1, keepdim=True)
        exps = torch.exp(X_adj)
        return exps / exps.sum(axis=1, keepdim=True)

    def _cross_entropy_loss(self, Y, Y_prob):
        return (-Y * torch.log(Y_prob)).sum(axis=1).mean()

    def bp_input(self, grad):
        return (self.Y_prob - self.Y) * grad
```
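As a quick aside, here's a small illustration (mine, with made-up logits) of why the max-subtraction trick in _softmax matters:

```python
logits = torch.tensor([[1000.0, 1001.0, 1002.0]])

# Naive softmax overflows: exp(1000) is inf in float32, giving nan probabilities.
naive = torch.exp(logits) / torch.exp(logits).sum(axis=1, keepdim=True)
print(naive)                                        # tensor([[nan, nan, nan]])
print(SoftmaxCrossEntropyLoss()._softmax(logits))   # tensor([[0.0900, 0.2447, 0.6652]])
```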

Putting it all together

Let's bring these layers together in a class: our NeuralNet implementation.
The evaluate() function does two things. Firstly, it runs the forward pass by chaining the __call__() functions, to generate label probabilities. Secondly, it uses the labels passed to it to calculate the loss and percentage correctly predicted.
πŸ’‘
For this simplified example we don’t have a pure inference function, but we could add one with a small change to SoftmaxCrossEntropyLoss.
The gradient_descent() function then gets the matrix of updates for each weight matrix and applies the update. The key bit here is how backprop() works: going backwards through the computation graph, we chain the bp_input() methods; then, for each weighted layer we want to update, we apply its bp_param() method to the relevant gradient.
```python
class NeuralNet:
    def __init__(self, input_size=28*28, hidden_size=32, output_size=10, alpha=0.001):
        self.alpha = alpha
        self.z1 = LinearLayer(input_size, hidden_size)
        self.a1 = ReLU()
        self.z2 = LinearLayer(hidden_size, output_size)
        self.loss = SoftmaxCrossEntropyLoss()

    def evaluate(self, X, Y):
        # forward pass: linear -> ReLU -> linear -> softmax/cross-entropy
        out = self.z2(self.a1(self.z1(X)))
        correct = torch.eq(out.argmax(axis=1), Y).double().mean()
        Y_prob, loss = self.loss(out, one_hot(Y, 10))
        return Y_prob, correct, loss

    def gradient_descent(self):
        delta_W1, delta_W2 = self.backprop()
        self.z1.W -= self.alpha * delta_W1
        self.z2.W -= self.alpha * delta_W2

    def backprop(self):
        # chain bp_input backwards through the graph, then bp_param for each weighted layer
        d_out = torch.ones(*self.loss.Y.shape)
        d_z2 = self.loss.bp_input(d_out)
        d_a1 = self.z2.bp_input(d_z2)
        d_z1 = self.a1.bp_input(d_a1)
        d_w2 = self.z2.bp_param(d_z2)
        d_w1 = self.z1.bp_param(d_z1)
        return d_w1, d_w2
```
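As a quick smoke test (my own, using random stand-in data rather than MNIST), one full evaluate/update step looks like this:

```python
net = NeuralNet()
X = torch.rand(64, 28 * 28)             # fake batch with the same shape as MNIST
Y = torch.randint(0, 10, (64,))
Y_prob, correct, loss = net.evaluate(X, Y)
# expect roughly chance accuracy and loss β‰ˆ ln(10) β‰ˆ 2.3 at initialisation
print(Y_prob.shape, correct.item(), loss.item())
net.gradient_descent()                  # one SGD update of z1.W and z2.W
```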

Training the model

We're almost there! I won't go into this bit too much because this tutorial isn't about training loops, but it's all very standard. We break the training data into minibatches and train on them over 10 epochs. The evaluation metrics plotted are those recorded during regular training:
```python
model = NeuralNet()
stats = {'correct': [], 'loss': [], 'epoch': []}

for epoch in range(10):   # 10 epochs, as described above
    correct, loss = 0, 0
    for i, (X, y) in enumerate(train):
        y_prob, batch_correct, batch_loss = model.evaluate(X, y)
        model.gradient_descent()
        correct += batch_correct / len(train)
        loss += batch_loss / len(train)
    stats['correct'].append(correct.item())
    stats['loss'].append(loss.item())
    stats['epoch'].append(epoch)
    print(f'epoch: {epoch} | correct: {correct:.2f}, loss: {loss:.2f}')

base = alt.Chart(pd.DataFrame.from_dict(stats)).mark_line() \
    .encode(alt.X('epoch', axis=alt.Axis(title='epoch')))
line1 = base.mark_line(stroke='#5276A7', interpolate='monotone') \
    .encode(alt.Y('loss', axis=alt.Axis(title='Loss', titleColor='#5276A7'),
                  scale=alt.Scale(domain=[0.0, max(stats['loss'])])),
            tooltip='loss')
line2 = base.mark_line(stroke='#57A44C', interpolate='monotone') \
    .encode(alt.Y('correct', axis=alt.Axis(title='Correct', titleColor='#57A44C'),
                  scale=alt.Scale(domain=[min(stats['correct']), 1.0])),
            tooltip='correct')
alt.layer(line1, line2).resolve_scale(y='independent')
```
🚨
These results are only on the training set! In practice we should always plot performance on a test set, but I don’t want to clutter the tutorial with this extra detail.
When I run this the results are fantastic! After 10 epochs I'm getting 97% correct. Particularly impressive given how simple this implementation is: we've implemented a complete backpropagation algorithm using only four cells worth of code (plus a couple more to set up and train).
The backpropagation algorithm can seem a little complex and mysterious at first. Hopefully this post makes it seem a little simpler!
