Up to 5 members in each group. Please include NetID and names on front page.

In this assignment, we will demonstrate that the back-propagation algorithm has a solid mathematical basis in gradient descent, while also being relatively simple to implement.

We begin with a simple single-layer, single-output network as demonstrated in the diagram below. This network isn't capable of much - in fact, it's identical to an ERM approach. However, it provides a useful toy example for us to get started with.

In this model, $\vec{x},\vec{W} \in \mathbb{R}^n$ and $a,z \in \mathbb{R}$. $\vec{x}$ is an input vector, $\vec{W}$ is a hidden weight vector, $a=\vec{W}^T \vec{x}$, and $z=f(a)$ for some transition function $f$. For this assignment, we'll use the logistic transition function and mean squared error loss, so: $$f(a) = z = \frac{1}{1+e^{-a}}$$ $${\cal L}(z) = \frac{1}{2}(y-z)^2$$ where ${\cal L}$ is the stochastic loss over a single input pair $(\vec{x},y)$.

(a) Using these values, prove that: $$\frac{\partial{\cal L}}{\partial\vec{W}} = -z(1-z)(y-z)\vec{x}$$

SOLUTION: By the chain rule, $$\frac{\partial{\cal L}}{\partial \vec{W}} = \frac{\partial{\cal L}}{\partial z}\frac{\partial z}{\partial a}\frac{\partial a}{\partial \vec{W}}$$ $$=-(y-z)(z(1-z))\vec{x}$$ Since $f$ and ${\cal L}$ are applied elementwise, and the derivative of the sigmoid function $f$ is $f(1-f)$.

(b) Similarly, prove that: $$\frac{\partial{\cal L}}{\partial\vec{x}} = -z(1-z)(y-z)\vec{W}$$ While we don't need this result to train this particular neural network, it will become important later on.

SOLUTION: analogous to above, except that $\frac{\partial a}{\partial \vec{W}} = \vec{x}$ is replaced with $\frac{\partial a}{\partial \vec{x}}=\vec{W}$.

We've successfully figured out how to train the simplest neural network known to man. We would obviously like to expand this model to more complicated networks by adding additional layers. First, however, we must adjust our model so that it can deal with multiple outputs, as shown in the diagram below.

In this model, the input $\vec{x}\in\mathbb{R}^n$, the outputs $\vec{a},\vec{z}\in\mathbb{R}^m$, and the weight matrix $\mathbf{W}\in\mathbb{R}^{m\times n}$. $\vec{z}=f(\vec{a})$, where $f$ is the sigmoid function applied elementwise. The loss is still a scalar: $${\cal L}=\sum_{i=1}^m \frac{1}{2}(\vec{y}_i-\vec{z}_i)^2$$

(a) Using your result from question 1 part (a), prove that: $$\frac{\partial{\cal L}}{\partial\mathbf{W}_{i,j}} = -\vec{z}_i(1-\vec{z}_i)(\vec{y}_i-\vec{z}_i)\vec{x}_j$$

(Hint: consider the two cases $\frac{\partial(\frac{1}{2}(y_k-z_k)^2)}{\partial\mathbf{W}_{i,j}}$ where $k=i$, and $k\neq i$. Each of these cases should reduce to a familiar value)

SOLUTION: since $\vec{a}=\mathbf{W}\vec{x}$, the value $\vec{a}_k$ is equal to $\mathbf{W}_{k}\vec{x}$ where $\mathbf{W}_k$ is the $k$th row of $\mathbf{W}$. We consider the two cases $\frac{\partial(\frac{1}{2}(y_k-z_k)^2)}{\partial\mathbf{W}_{i,j}}$ where $k=i$, and $k\neq i$. The first case reduces to our answer for 1a, $-\vec{z}_i(1-\vec{z}_i)(\vec{y}_i-\vec{z}_i)\vec{x}_j$. The second case is zero, since we established that $\vec{a}_i$, and therefore $\vec{z}_i$, is calculated using only values in the $k$th row of $\mathbf{W}$. The derivative of a sum is the sum of the derivatives, so the derivative of the loss is equal to our first case plus zero: $-\vec{z}_i(1-\vec{z}_i)(\vec{y}_i-\vec{z}_i)\vec{x}_j$.

(b) Using your result from part (a), show that the gradient of the loss $\frac{\partial{\cal L}}{\partial\mathbf{W}}$ can be written as the matrix product of two vectors.

SOLUTION: by the definition of matrix multiplication, $M=AB$ iff $M_{ik} = \sum_j A_{ij}B_{jk}$. We can set $j=1$ in which case for any two vectors $\vec{a}$ and $\vec{b}$, $M = \vec{a}\vec{b}^T$ iff $M_{ij} = \vec{a}_i \vec{b}_j$. Let $\vec{a} = -\vec{z}_i \times (1-\vec{z}_i) \times (\vec{y}_i-\vec{z}_i)$ and let $\vec{b} = \vec{x}$, where $\times$ represents elementwise multiplication. From part a we have $\frac{\partial {\cal L}}{\partial \mathbf{W}_{ij}} = \vec{a}_i \vec{b}_j$, so it follows that the full matrix $\frac{\partial {\cal L}}{\partial \mathbf{W}} = \vec{a} \vec{b}^T$.

(c) Using your result from question 1 part (b), prove that: $$\frac{\partial{\cal L}}{\partial\vec{x}} = -\mathbf{W}^T (\vec{z}\times(1-\vec{z})\times(\vec{y}-\vec{z}))$$ where $\times$ represents elementwise multiplication. Again, while we don't need this result in order to train this particular neural network, it will become important later on.

SOLUTION: Let ${\cal L}_i = \frac{1}{2}(\vec{y}_i-\vec{z}_i)$. From question 1b, we know that $$\frac{\partial {\cal L}_i}{\partial \vec{x}} = -\vec{z}_i(1-\vec{z}_i)(\vec{y}_i-\vec{z}_i)\mathbf{W}_i^T$$ Where $\mathbf{W}_i$ is the $i$th row of $\mathbf{W}$. We transpose it because this gradient is a column vector. Since ${\cal L} = \sum_i {\cal L}_i$ and the derivative of a sum is the sum of the derivatives, $$\frac{\partial {\cal L}}{\partial \vec{x}} = \sum_i -\vec{z}_i(1-\vec{z}_i)(\vec{y}_i-\vec{z}_i)\mathbf{W}_i^T$$ $$= \sum_i -\mathbf{W}_i^T (\vec{z}_i(1-\vec{z}_i)(\vec{y}_i-\vec{z}_i))$$ Since $\vec{z}_i(1-\vec{z}_i)(\vec{y}_i-\vec{z}_i)$ is a scalar. By the definition of matrix multiplication, this is equivalent to $-\mathbf{W}^T (\vec{z}\times(1-\vec{z})\times(\vec{y}-\vec{z}))$.

Now we're ready to add additional layers! Our neural network now looks like this:

To make predictions, we use the following algorithm. Feel free to ignore the bias vector $\vec{b}$ - we'll just assume that each vector $\vec{z}$ includes an appended bias term:

Intuitively, it's very difficult to figure out what values the intermediate weights and nodes should take on to minimize the loss, but the algorithm for figuring this out is actually very simple. Again, we can ignore the bias vector $\vec{b}$:

Now for the final part of our proof: we'll show that the back-propagation algorithm is just a fancy version of gradient descent.

(a) Using induction, and your result from question 2 part (c), prove that for all $1\le l \le L$, the algorithm's error signal is the derivative of ${\cal L}$ w.r.t. $\vec{a}_l$, i.e. $\vec{\delta}_l = \frac{\partial{\cal L}}{\partial\vec{a}_l}$

SOLUTION: for our base case, let $l=L$. $\vec{\delta}_L = \frac{\partial {\cal L}}{\partial \vec{z}_L} \frac{\partial \vec{z}_L}{\partial \vec{a}_L}$, from line 2 of the algorithm. Applying the chain rule proves our base case.

For the inductive case, let $\vec{\delta}_l = \frac{\partial{\cal L}}{\partial\vec{a}_l}$. From line 6 of the algorithm, we know that $$\vec{\delta}_{l-1}=\frac{\partial \vec{z}_{l-1}}{\partial \vec{a}_{l-1}} \times (\mathbf{W}_l^T \delta_l)$$ From our inductive hypothesis, we know that $$\mathbf{W}_l^T \delta_l = \mathbf{W}_l^T \frac{\partial{\cal L}}{\partial\vec{a}_l} $$ $$= \mathbf{W}_l^T \frac{\partial{\cal L}}{\partial\vec{z}_l} \frac{\partial \vec{z}_l}{\partial \vec{a}_l} $$ $$= \frac{\partial {\cal L}}{\partial \vec{z}_{l-1}}$$ Where the last equality comes from question 2c. By the chain rule, then, $\frac{\partial \vec{z}_{l-1}}{\partial \vec{a}_{l-1}} \times (\mathbf{W}_l^T \delta_l) = \frac{\partial {\cal L}}{\partial \vec{a}_{l-1}} = \delta_{l-1}$, and the inductive case is proven.

Thus for all levels below $L$, the claim holds.

(b) Using your results from part (a) and question 2 part (b), prove that for all $1\le l \le L$, the algorithm's update matrix $\Delta\mathbf{W}_l = \frac{\partial{\cal L}}{\partial\mathbf{W}_l}$

SOLUTION: In part a we established that $\vec{\delta}_l = \frac{\partial{\cal L}}{\partial\vec{a}_l} = \frac{\partial{\cal L}}{\partial \vec{z}_l} \frac{\partial \vec{z}_l}{\partial\vec{a}_l}$, where the last equality comes from the chain rule. From question 2b, we know that $(\frac{\partial{\cal L}}{\partial \vec{z}_l} \frac{\partial \vec{z}_l}{\partial\vec{a}_l})\vec{z}_{l-1}^T = \frac{\partial {\cal L}}{\partial \mathbf{W}_l}$. By line 4 of the algorithm, this is equivalent to $\nabla \mathbf{W}_l$.

This shows that no matter how large or complicated our network may be, the back-propagation algorithm can perform gradient descent on a chosen loss and transition function with simple matrix multiplications. And just like gradient descent, we can introduce adagrad and regularization for better performance. It can also be efficiently implemented in under 10 lines of code!