After training deep neural networks for a while, I wanted to write about how backpropagation works and how gradients flow through the network during it.

Some of the commonly used loss functions in training models are as follows:

Let’s go through the forward pass and backpropagation for a fully connected network on a classification task. We shall use softmax activation with cross entropy loss for the final layer.

Softmax Activation

This activation takes in an N-dimensional vector and produces another N-dimensional vector after applying a non-linearity. The function \(f: \mathbb{R}^{N} \rightarrow \mathbb{R}^{N}\) is defined element-wise as follows:

\[ f(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{N} e^{x_k}}, \qquad i = 1, \dots, N \]

In code, the softmax activation can be computed as follows:

    import numpy as np

    def softmax_activation(x):
        """Compute the softmax activation of vector `x`."""
        exps = np.exp(x)  # element-wise exponentials
        return exps / np.sum(exps)


    softmax_activation([1.0, 2.0, 3.0])
    [ 0.09003057,  0.24472847,  0.66524096]

Now consider,

    softmax_activation([1000.0, 2000.0, 3000.0])
    [ nan,  nan,  nan]

This occurs because exp overflows: each exponential evaluates to inf, and inf / inf is NaN. To ensure that we don’t blow up the activation values, we shift the input vector by subtracting its max element from every element of \(x\). The softmax output is unchanged by this shift, because the common factor \(e^{-\max(x)}\) cancels between the numerator and the denominator. After the shift, all the elements of the vector are negative except for the max element, which is zero, so every exponential lies in \((0, 1]\). Even for a very large negative element the exponential simply underflows to a value close to zero, and we avoid NaN.

    def stable_softmax_activation(x):
        """Compute the numerically stable softmax activation of vector `x`."""
        exps = np.exp(x - np.max(x))  # largest exponent is 0, so exp cannot overflow
        return exps / np.sum(exps)

Now, with the stable version,

    stable_softmax_activation([1000.0, 2000.0, 3000.0])
    [ 0.,  0.,  1.]
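As a quick sanity check, the sketch below compares the shifted computation with the naive formula on a small input and confirms that the result stays finite on a large one. The function is redefined here so the snippet runs on its own; the variable names are illustrative.

```python
import numpy as np

def stable_softmax_activation(x):
    """Softmax computed after shifting `x` by its max element."""
    shifted = np.asarray(x, dtype=float) - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

small = np.array([1.0, 2.0, 3.0])
naive = np.exp(small) / np.sum(np.exp(small))

# On small inputs the stable version agrees with the naive formula.
print(np.allclose(stable_softmax_activation(small), naive))

# On large inputs it stays finite and still sums to one.
large = stable_softmax_activation([1000.0, 2000.0, 3000.0])
print(np.isfinite(large).all(), large.sum())
```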

Partial Derivatives of Softmax Activation

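The softmax layer takes in a vector \(p\) called the “logits” and computes the softmax activations \(f\), which are:

\[ f_i = \frac{e^{p_i}}{\sum_{k} e^{p_k}} \]

We now need to compute the Jacobian matrix, i.e. the partial derivative of the \(i\)th output w.r.t. the \(j\)th input:

\[ \frac{\partial f_i}{\partial p_j} = \frac{\partial}{\partial p_j} \left( \frac{e^{p_i}}{\sum_{k} e^{p_k}} \right) \]

For \(i = j\), the quotient rule gives:

\[ \frac{\partial f_i}{\partial p_i} = \frac{e^{p_i} \sum_{k} e^{p_k} - e^{p_i} e^{p_i}}{\left( \sum_{k} e^{p_k} \right)^2} = f_i \left( 1 - f_i \right) \]

And for the case \(i \neq j\), we have:

\[ \frac{\partial f_i}{\partial p_j} = \frac{0 - e^{p_i} e^{p_j}}{\left( \sum_{k} e^{p_k} \right)^2} = -f_i f_j \]

And thus,

\[ \frac{\partial f_i}{\partial p_j} = f_i \left( \delta_{ij} - f_j \right), \qquad \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \]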
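The standard closed form of the softmax Jacobian, \(\partial f_i / \partial p_j = f_i(\delta_{ij} - f_j)\), can be written in matrix form as \(\mathrm{diag}(f) - f f^{\top}\). The sketch below checks that form against central finite differences; the helper names and the step size are assumptions made for illustration.

```python
import numpy as np

def softmax(p):
    exps = np.exp(p - np.max(p))
    return exps / np.sum(exps)

def softmax_jacobian(p):
    """J[i, j] = d f_i / d p_j = f_i * (delta_ij - f_j)."""
    f = softmax(p)
    return np.diag(f) - np.outer(f, f)

def numerical_jacobian(p, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    n = p.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (softmax(p + step) - softmax(p - step)) / (2 * eps)
    return J

p = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax_jacobian(p), numerical_jacobian(p), atol=1e-8))
```

Each column of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to one.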

Network Architecture

Let there be \(l\) layers in the network. The softmax layer applies the softmax activation to the output of the last fully connected layer. Let the last fully connected layer be parameterized by \(\left(W, b\right)\), i.e., the weight matrix and bias vector. Further, let \(W \in \mathbb{R}^{T \times N}\), \(b \in \mathbb{R}^{T}\), and the input \(X \in \mathbb{R}^{N}\), so that the classification problem spans \(T\) classes.

Representing the “logits” as \(p\) and the softmax function as \(f\):

\[ p = WX + b, \qquad f_i = \frac{e^{p_i}}{\sum_{k=1}^{T} e^{p_k}} \]
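A minimal sketch of the forward pass through the last layer, assuming the logits are computed as \(p = WX + b\) with \(W\) of shape \(T \times N\). The shape convention, the sizes, and the random values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 3                      # input features, classes (illustrative sizes)

W = rng.normal(size=(T, N))      # weight matrix of the last layer
b = np.zeros(T)                  # bias vector
X = rng.normal(size=N)           # input to the last layer

p = W @ X + b                    # logits
f = np.exp(p - np.max(p))        # stable softmax numerator
f = f / np.sum(f)                # predicted class probabilities

print(f.shape, f.sum())
```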

We now have \(f_i\) for \(i = 1, \dots, T\), which indicate the probability of the input belonging to class \(i\). Comparing these values to the corresponding ground truth values \(y_i\), where \(y\) is a one-hot vector, we can define the cross entropy loss function as follows.

Cross Entropy Loss

Considering just one training example, the loss \(L\) is given by:

\[ L = -\sum_{i=1}^{T} y_i \log f_i \]

Here \(y\) is the one-hot vector representing the ground truth, and \(f\) is the vector of predictions obtained after applying the softmax activation.
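For a single example, the cross entropy \(L = -\sum_i y_i \log f_i\) can be sketched as below. The probability values are made up for illustration, and the function name is an assumption.

```python
import numpy as np

def cross_entropy_loss(f, y):
    """L = -sum_i y_i * log(f_i) for a single example."""
    return -np.sum(y * np.log(f))

f = np.array([0.1, 0.2, 0.7])    # softmax output (made-up values)
y = np.array([0.0, 0.0, 1.0])    # one-hot ground truth, correct class index 2

loss = cross_entropy_loss(f, y)
print(loss, -np.log(f[2]))       # the two values coincide
```

Because \(y\) is one-hot, only the log-probability of the correct class contributes. This direct form breaks down if any entry of \(f\) is exactly zero, so practical implementations usually compute the loss from the logits instead.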

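In order to update the weights during backpropagation, we need to compute the gradients

\[ \frac{\partial L}{\partial W} \quad \text{and} \quad \frac{\partial L}{\partial b} \]

Simple application of the chain rule gives us:

\[ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial f} \, \frac{\partial f}{\partial p} \, \frac{\partial p}{\partial W}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial f} \, \frac{\partial f}{\partial p} \, \frac{\partial p}{\partial b} \]

Computing the individual gradients gives us the following. Since the logits are \(p = WX + b\):

\[ \frac{\partial p_j}{\partial W_{jk}} = X_k, \qquad \frac{\partial p_j}{\partial b_j} = 1 \]

In the one-hot vector \(y\), let the index of the correct class be \(z\) (i.e., the index at which the value is \(1\)). The loss can now be simplified to,

\[ L = -\log f_z \]

From above we already have:

\[ \frac{\partial f_z}{\partial p_j} = f_z \left( \delta_{zj} - f_j \right) \]

where \(\delta_{zj}\) is \(1\) if \(j = z\) and \(0\) otherwise. We have:

\[ \frac{\partial L}{\partial f_z} = -\frac{1}{f_z} \]

And thus,

\[ \frac{\partial L}{\partial p_j} = \frac{\partial L}{\partial f_z} \, \frac{\partial f_z}{\partial p_j} = -\frac{1}{f_z} \, f_z \left( \delta_{zj} - f_j \right) = f_j - \delta_{zj} \]

Putting it all together, since only the \(z\)th element in \(y\) is non-zero, we have \(\delta_{zj} = y_j\), and therefore:

\[ \frac{\partial L}{\partial p} = f - y, \qquad \frac{\partial L}{\partial W} = \left( f - y \right) X^{\top}, \qquad \frac{\partial L}{\partial b} = f - y \]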
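For softmax followed by cross entropy, the gradient with respect to the logits reduces to \(f - y\), which gives \(\partial L / \partial W = (f - y) X^{\top}\) and \(\partial L / \partial b = f - y\) when the logits are \(p = WX + b\). The sketch below computes these gradients and checks one weight entry against a finite difference. The shape convention for \(W\), the sizes, the random values, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def softmax(p):
    exps = np.exp(p - np.max(p))
    return exps / np.sum(exps)

def loss_fn(W, b, X, y):
    """Cross entropy of softmax(W X + b) against one-hot y."""
    return -np.sum(y * np.log(softmax(W @ X + b)))

def gradients(W, b, X, y):
    """Analytic gradients of the loss w.r.t. W and b."""
    delta = softmax(W @ X + b) - y       # dL/dp = f - y
    return np.outer(delta, X), delta     # dL/dW, dL/db

rng = np.random.default_rng(0)
N, T = 4, 3
W, b, X = rng.normal(size=(T, N)), rng.normal(size=T), rng.normal(size=N)
y = np.eye(T)[1]                         # one-hot, correct class z = 1

dW, db = gradients(W, b, X, y)

# Finite-difference check on a single weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 0] += eps
W_minus[2, 0] -= eps
numeric = (loss_fn(W_plus, b, X, y) - loss_fn(W_minus, b, X, y)) / (2 * eps)
print(np.isclose(dW[2, 0], numeric, atol=1e-6))
```

A gradient descent step would then subtract a small multiple of `dW` and `db` from `W` and `b`.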