20: Neural Network
Neural networks are also known as multi-layer perceptrons and deep nets.
Original Problem:
How can we make linear classifiers non-linear?
$$w^\top x + b \;\longrightarrow\; w^\top \Phi(x) + b$$
where kernelization is a clever way to keep inner products with $\Phi(x)$ computationally tractable.
Neural network learns Φ:
$$\Phi(x) = \begin{bmatrix} h_1(x) \\ \vdots \\ h_m(x) \end{bmatrix}$$
Each $h_i(x)$ is itself a linear classifier; these learn low-level sub-problems that are "simpler".
E.g. in digit classification, they might detect vertical edges, round shapes, or horizontal edges. Their outputs then become the input to the main linear classifier.
$$a'_j = \sum_k w'_{jk} x_k + b', \qquad a_i = \sum_j w_{ij} z'_j + b$$
where $z'_j = f(a'_j)$ and $z_i = g(a_i)$.
Forward Propagation:
Quiz: Try to express $\vec{z}$ in terms of $W, W', \vec{b}, \vec{b}', f$, and $\vec{x}$ in matrix notation.
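One possible answer (a sketch, assuming the hidden transition function $f$ is applied element-wise; the output transition function $g$ from the backpropagation section below is included for completeness):
$$\vec{a}' = W'\vec{x} + \vec{b}', \quad \vec{z}' = f(\vec{a}'), \quad \vec{a} = W\vec{z}' + \vec{b}, \quad \vec{z} = g(\vec{a}) = g\!\left(W f(W'\vec{x} + \vec{b}') + \vec{b}\right)$$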
We need to learn $W, W', \vec{b}, \vec{b}'$. We can do so through gradient descent.
Backpropagation:
Loss function for a single example (for the entire training set, average over all training points):
$$\mathcal{L}(\vec{x}, \vec{y}) = \frac{1}{2}\left(H(\vec{x}) - \vec{y}\right)^2$$
where $H(\vec{x}) = \vec{z}$, so
$$\mathcal{L} = \frac{1}{2}\left(\vec{z} - \vec{y}\right)^2$$
We learn W with gradient descent.
Observation (chain rule):
$$\frac{\partial \mathcal{L}}{\partial w_{ij}} = \frac{\partial \mathcal{L}}{\partial a_i}\,\frac{\partial a_i}{\partial w_{ij}} = \frac{\partial \mathcal{L}}{\partial a_i}\, z'_j$$
$$\frac{\partial \mathcal{L}}{\partial w'_{jk}} = \frac{\partial \mathcal{L}}{\partial a'_j}\,\frac{\partial a'_j}{\partial w'_{jk}} = \frac{\partial \mathcal{L}}{\partial a'_j}\, x_k$$
(in deeper networks $x_k$ becomes $z''_k$, the activation of the layer below).
Let $\vec{\delta} = \frac{\partial \mathcal{L}}{\partial \vec{a}}$ and $\vec{\delta}' = \frac{\partial \mathcal{L}}{\partial \vec{a}'}$ (i.e. $\delta'_j = \frac{\partial \mathcal{L}}{\partial a'_j}$).
Gradients are easy to compute if we know $\vec{\delta}, \vec{\delta}', \vec{\delta}'', \vec{\delta}'''$ ($\vec{\delta}''$ and $\vec{\delta}'''$ appear in deeper neural nets).
So, what is $\vec{\delta}$?
$$\delta_i = \frac{\partial \mathcal{L}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial z_i}\,\frac{\partial z_i}{\partial a_i} = (z_i - y_i)\, g'(a_i), \qquad \text{i.e. } \vec{\delta} = g'(\vec{a}) \circ (\vec{z} - \vec{y})$$
Note that $\mathcal{L} = \frac{1}{2}\sum_i (z_i - y_i)^2$ and $z_i = g(a_i)$.
$$\delta'_j = \frac{\partial \mathcal{L}}{\partial a'_j} = \sum_i \frac{\partial \mathcal{L}}{\partial z_i}\,\frac{\partial z_i}{\partial a_i}\,\frac{\partial a_i}{\partial z'_j}\,\frac{\partial z'_j}{\partial a'_j} = \sum_i \delta_i\, \frac{\partial a_i}{\partial z'_j}\,\frac{\partial z'_j}{\partial a'_j}$$
Note that $\frac{\partial \mathcal{L}}{\partial z_i}\,\frac{\partial z_i}{\partial a_i} = \delta_i$, that $a_i = \sum_j w_{ij} z'_j + b$ (so $\frac{\partial a_i}{\partial z'_j} = w_{ij}$), and that $\frac{\partial z'_j}{\partial a'_j} = f'(a'_j)$. Therefore
$$\delta'_j = f'(a'_j)\sum_i \delta_i\, w_{ij}, \qquad \text{i.e. } \vec{\delta}' = f'(\vec{a}') \circ (W^\top \vec{\delta})$$
Typical transition functions:

In the "Old Days", sigmoid and tanh were most popular. Nowadays, Rectified Linear Unit (Relu) are pretty fashionable.
Algorithms:
Forward Pass:

Backward Pass:


with $\vec{z}_l = f_l(\vec{a}_l)$
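A minimal NumPy sketch of the two passes for a net with any number of layers (a single transition function `f` for every layer, ReLU here, and the squared loss from above are assumptions for illustration; all names are illustrative, not from the notes):

```python
import numpy as np

def f(a):
    """Transition function (ReLU assumed here for illustration)."""
    return np.maximum(0, a)

def f_prime(a):
    """Derivative of the transition function."""
    return (a > 0).astype(float)

def forward(x, Ws, bs):
    """Forward pass: a_l = W_l z_{l-1} + b_l,  z_l = f(a_l)."""
    zs, avs = [x], []                 # activations z_l and pre-activations a_l
    for W, b in zip(Ws, bs):
        a = W @ zs[-1] + b
        avs.append(a)
        zs.append(f(a))
    return zs, avs

def backward(zs, avs, y, Ws):
    """Backward pass: delta_L = f'(a_L) o (z_L - y),  delta_{l-1} = f'(a_{l-1}) o (W_l^T delta_l)."""
    delta = f_prime(avs[-1]) * (zs[-1] - y)        # delta for the output layer
    grads_W, grads_b = [], []
    for l in range(len(Ws) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, zs[l]))  # dL/dW_l = delta_l z_{l-1}^T
        grads_b.insert(0, delta)                   # dL/db_l = delta_l
        if l > 0:
            delta = f_prime(avs[l - 1]) * (Ws[l].T @ delta)
    return grads_W, grads_b
```

Here `zs[-1]` is the prediction $\vec{z}$, and the returned gradients can be plugged into a (momentum) gradient-descent update as described below.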
Famous Theorem:
- ANNs are universal approximators (like SVMs, GPs, ...).
- Theoretically, an ANN with one hidden layer is as expressive as one with many hidden layers, provided it uses enough hidden nodes.
Overfitting in ANN:
Neural networks learn lots of parameters and are therefore prone to overfitting. This is not necessarily a problem as long as you use regularization. Two popular regularizers are the following:
- Use $\ell_2$ regularization on all weights (including bias terms).
- For each input (or mini-batch), randomly remove each hidden node with probability $p$ (e.g. $p = 0.5$). These nodes stay removed during the backprop pass, but are included again for the next input (this is known as dropout); see the sketch after this list.
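A minimal sketch of both regularizers, assuming NumPy and the gradient lists from the backward pass above; `lam` and `p` are illustrative parameter names:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalize(grads_W, Ws, lam=1e-4):
    """Add the l2-regularization term lam * W to each weight gradient."""
    return [g + lam * W for g, W in zip(grads_W, Ws)]

def dropout_mask(shape, p=0.5):
    """Remove each hidden node with probability p (0 = removed, 1 = kept)."""
    return (rng.random(shape) >= p).astype(float)

# Per input (or mini-batch):
#   mask = dropout_mask(z_hidden.shape, p=0.5)
#   z_hidden *= mask    # dropped nodes stay removed during the backprop pass
#   ...                 # resample the mask for the next input
```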
Avoidance of local minima
1. Use momentum: define
$$\nabla w_t = \Delta w_t + \mu \nabla w_{t-1}, \qquad w \leftarrow w - \alpha \nabla w_t$$
where $\Delta w_t$ is the current gradient (i.e. still use some portion of the previous gradient to keep pushing you out of small local minima); see the sketch after this list.

2. Initialize weights cleverly (not all that important)
e.g. use Autoencoders for unsupervised pre-training
3. Use ReLU instead of sigmoid/tanh (the gradients don't saturate).
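A minimal sketch of the momentum update from item 1, assuming the gradient of the loss for the current step is already computed; `mu` and `alpha` are illustrative names for $\mu$ and $\alpha$:

```python
def momentum_step(w, grad_w, prev_step, alpha=0.01, mu=0.9):
    """One momentum update: step_t = grad_w + mu * step_{t-1};  w <- w - alpha * step_t."""
    step = grad_w + mu * prev_step
    return w - alpha * step, step

# Usage (grad_fn is a hypothetical function returning the current gradient):
#   w, prev = momentum_step(w, grad_fn(w), prev)
```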
Tricks and Tips
- Rescale your data so that all features are within [0,1]
- Lower learning rate
- Use mini-batches (i.e. stochastic gradient descent with maybe 100 inputs at a time; make sure you shuffle the inputs randomly first); see the sketch after this list.
- For images, use convolutional neural networks.
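A minimal sketch of the rescaling and mini-batch tips, assuming NumPy arrays `X` (inputs, one row per example) and `y` (targets); the names are illustrative:

```python
import numpy as np

def rescale_01(X):
    """Rescale every feature (column) of X into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def minibatches(X, y, batch_size=100, rng=None):
    """Shuffle the inputs randomly, then yield mini-batches of roughly batch_size points."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```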