Machine Learning

CS4780/CS5780

Overview
- Administrative stuff
- Supervised learning setup:
- Feature vectors, Labels
- 0/1 loss, squared loss, absolute loss
- Train / Test split
Reading:
- Lecture Notes
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9
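A minimal NumPy sketch of the three loss functions and a random train/test split from this lecture; the function names and the 80/20 split are illustrative choices, not from the course materials:

```python
import numpy as np

def zero_one_loss(y_pred, y):
    # fraction of examples where the predicted label is wrong
    return np.mean(y_pred != y)

def squared_loss(y_pred, y):
    return np.mean((y_pred - y) ** 2)

def absolute_loss(y_pred, y):
    return np.mean(np.abs(y_pred - y))

def train_test_split(X, y, test_fraction=0.2, seed=0):
    # shuffle, then hold out a test set that is never touched during training
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test = int(len(y) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```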
k-nearest neighbors
- Hypothesis classes
- Nearest Neighbor Classifier
- Sketch of the Cover and Hart proof (that the 1-NN error converges to at most 2x the Bayes error in the large-sample limit)
- Curse of Dimensionality
Reading:
- Lecture Notes
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9
- Start reading: 2.1 -- 2.5
- videos of different values of k
- Video describing nearest neighbors
- Probably the best explanation of nearest neighbors
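A brute-force k-NN classifier as a short sketch; the Euclidean distance and the toy data are assumptions, and the later data-structures lecture covers how to speed up the neighbor search:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=1):
    # distances from the test point to every training point (brute force)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # labels of the k closest training points, then majority vote
    nearest = y_train[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# toy example
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 1, 1])
print(knn_predict(X, y, np.array([1.0, 0.9]), k=3))  # -> 1
```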
Perceptron
- Linear Classifiers
- Absorbing the bias into a (d+1)-dimensional weight vector
- Perceptron convergence proof
Reading:
- Lecture Notes
- The Perceptron Wiki page
- MLaPP 8.5.4
- Article in the New Yorker on the Perceptron
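A minimal sketch of the perceptron with the bias absorbed into a (d+1)-dimensional weight vector; the epoch limit is an added assumption so the loop also terminates on non-separable data:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: n x d data, y: labels in {-1, +1}.
    Returns w in R^(d+1) with the bias absorbed as a constant-1 feature."""
    X = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:      # misclassified (or on the boundary)
                w += y_i * x_i                 # perceptron update
                mistakes += 1
        if mistakes == 0:                      # a full pass without mistakes: converged
            return w
    return w
```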
Estimating Probabilities from data
- MLE
- MAP
- Bayesian vs. Frequentist statistics
Reading:
- Lecture Notes
- Ben Taskar’s notes on MLE vs MAP
- Tom Mitchell’s book chapter on Estimating Probabilities
- Youtube videos on MLE, MAP, and the Dirichlet distribution
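A tiny worked example of MLE vs. MAP for the bias of a coin; the flip counts and the Beta(5, 5) prior are made up for illustration:

```python
heads, tails = 2, 1      # hypothetical observed coin flips
a, b = 5, 5              # Beta(a, b) prior pseudo-counts (an assumed prior)

theta_mle = heads / (heads + tails)                        # argmax of P(D | theta)
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # argmax of P(theta | D)

print(theta_mle)  # 0.667: trusts the three flips completely
print(theta_map)  # 0.545: pulled toward the prior mean of 0.5
```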
Naive Bayes
- Naive Bayes Assumption
- Why is estimating probabilities difficult in high dimensions?

Reading:
- Lecture Notes
- Ben Taskar’s notes on Naïve Bayes
- Tom Mitchell’s book chapter on Naive Bayes (chapters 1-3)
- Youtube videos on Naive Bayes
- Xiaojin Zhu’s notes on Multinomial Naive Bayes
- Manning’s description of Multinomial Naive Bayes
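A compact multinomial Naive Bayes sketch over word-count features; the alpha smoothing term is the MAP (Dirichlet-prior) estimate from the previous lecture, and estimating each word's probability separately is exactly the naive assumption:

```python
import numpy as np

def train_multinomial_nb(X, y, alpha=1.0):
    """X: n x d word-count matrix, y: class labels 0..C-1."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # estimate P(word j | class c) for each word independently -- the naive assumption
    counts = np.array([X[y == c].sum(axis=0) + alpha for c in classes])
    log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_lik

def predict_nb(log_prior, log_lik, x):
    # argmax_c  log P(c) + sum_j x_j * log P(word j | c)
    return np.argmax(log_prior + log_lik @ x)
```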
Logistic Regression
- Logistic Regression formulation
- Relationship of LR with Naive Bayes

Reading:
- Lecture Notes
- MLAPP 8, 8.1, 8.2, 8.3.1, 8.3.2, 8.3.4, 8.6
- Ben Taskar’s notes on Logistic Regression
- Tom Mitchell’s book chapter on Naive Bayes and Logistic Regression
- Youtube videos on Logistic Regression
- Logistic Regression Wiki
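A bare-bones logistic regression trained by gradient descent on the log-loss, with labels in {-1, +1}; the step size and iteration count are arbitrary choices for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lr=0.1, steps=1000):
    """Minimizes (1/n) sum_i log(1 + exp(-y_i w^T x_i)) with labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the mean log-loss; sigmoid(-margin) is the modeled mistake probability
        grad = -(X * (y * sigmoid(-margins))[:, None]).mean(axis=0)
        w -= lr * grad
    return w
```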
Gradient Descent:
- Gradient Descent (GD)
- Taylor expansion
- Proof that GD decreases the objective with every step if the step size is small enough
- Some tricks for setting the step size
- Newton’s Method
Reading:
- Lecture Notes
- MLAPP 13.3-13.3.1, 8.3.2, 8.3.6, 13.5.3
- Bayen’s slides
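Gradient descent and Newton's method side by side on a simple convex 1-d function; the objective and the step size are illustrative choices:

```python
import numpy as np

f       = lambda x: x**2 + np.exp(x)   # a convex toy objective
f_prime = lambda x: 2*x + np.exp(x)
f_2nd   = lambda x: 2 + np.exp(x)

# Gradient descent: by the Taylor expansion, f(x - a*f'(x)) ~ f(x) - a*f'(x)^2,
# so the objective decreases as long as the step size a is small enough.
x = 2.0
for _ in range(100):
    x -= 0.1 * f_prime(x)
print("gradient descent:", x)

# Newton's method: rescale the step by the curvature (second derivative).
x = 2.0
for _ in range(10):
    x -= f_prime(x) / f_2nd(x)
print("Newton's method: ", x)
```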

Linear Regression:
- Assumption of linear regression with Gaussian Noise
- Ordinary Least Squares (OLS) = MLE
- Ridge Regression = MAP
Reading:
- Lecture Notes
- MLAPP 7-7.5.1 + 7.5.4
- OLS wiki page
- Youtube video on OLS
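The two closed-form solutions from this lecture written out; lam is an arbitrary regularization constant for the sketch:

```python
import numpy as np

def ols(X, y):
    # MLE under the Gaussian-noise assumption: w = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam=1.0):
    # MAP with a Gaussian prior on w: w = (X^T X + lam*I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```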

Linear SVM:
- What is the margin of a hyperplane classifier
- How to derive a max margin classifier
- That SVMs are convex
- The final QP of SVMs
- Slack variables
- The unconstrained SVM formulation
Reading:
- Lecture Notes
- MLAPP: 14.5 - 14.5.2.2
- Ben Taskar’s Notes on SVMs
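A sketch of the unconstrained (soft-margin) SVM formulation solved with plain subgradient descent; the constants C, lr, and steps are arbitrary, and a real solver would tackle the QP instead:

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, steps=1000):
    """min_w ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i), labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        viol = margins < 1                      # points violating the margin (slack > 0)
        subgrad = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        w -= lr * subgrad
    return w
```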
Empirical Risk Minimization:
- Setup of loss function and regularizer
- classification loss functions: hinge loss, log loss, zero-one loss, exponential loss
- regression loss functions: absolute loss, squared loss, Huber loss, log-cosh loss
- Properties of the various loss functions
- Which ones are more susceptible to noise, and which ones less
- Special cases: OLS, Ridge regression, Lasso, Logistic Regression
Reading:
- Lecture Notes
- MLAPP 6.5-6.5.3.2
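The loss functions from this lecture written out as one-liners; classification losses are functions of the margin m = y*h(x), regression losses of the residual r = h(x) - y:

```python
import numpy as np

# classification losses (functions of the margin m = y * h(x))
zero_one    = lambda m: (m <= 0).astype(float)
hinge       = lambda m: np.maximum(0.0, 1.0 - m)       # SVM
log_loss    = lambda m: np.log(1.0 + np.exp(-m))       # logistic regression
exponential = lambda m: np.exp(-m)                     # AdaBoost; most sensitive to label noise

# regression losses (functions of the residual r = h(x) - y)
squared  = lambda r: r ** 2                            # OLS / ridge; punishes outliers heavily
absolute = lambda r: np.abs(r)                         # more robust to outliers
huber    = lambda r, d=1.0: np.where(np.abs(r) <= d, 0.5 * r**2, d * (np.abs(r) - 0.5 * d))
log_cosh = lambda r: np.log(np.cosh(r))                # smooth, Huber-like
```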

Reduction of Elastic Net to SVM
- Elastic Net and Lasso can be reduced to SVM (squared hinge-loss, no bias)
- This reduction is exact
- Practical advantage: SVM solvers can be used to solve the Elastic Net
Reading:
- Lecture Notes
- MLAPP 13.3 - 13.4
- Original Publication by Zhou et al. (AAAI 2015)
- Prior publication by Jaggi 2014 on Lasso and SVM

Bias / Variance Tradeoff:
- What is Bias?
- What is Variance?
- What is Noise?
- How error decomposes into Bias, Variance, Noise
Reading:
- Lecture Notes #12
- Ben Taskar’s Notes (recommended!!)
- Excellent Notes by Scott Fortmann-Roe
- EoSLR Chapter 2.9
- MLAPP: 6.2.2
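A small Monte Carlo sketch of the decomposition: repeatedly draw training sets, fit a (deliberately too simple) model, and measure bias^2, variance, and noise at a single test point. The true function, noise level, and model class are assumptions made up for the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)        # assumed "true" function
sigma = 0.3                    # label noise std; the noise term is sigma^2
x_test = 1.0

preds = []
for _ in range(2000):          # many independent training sets D
    x = rng.uniform(0, 3, size=10)
    y = f(x) + rng.normal(0, sigma, size=10)
    w = np.polyfit(x, y, deg=1)              # fit a simple linear model
    preds.append(np.polyval(w, x_test))
preds = np.array(preds)

bias_sq  = (preds.mean() - f(x_test)) ** 2   # (average prediction - true value)^2
variance = preds.var()                       # spread of predictions across training sets
noise    = sigma ** 2
print("expected squared test error ~", bias_sq + variance + noise)
```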
ML Debugging, Over- / Underfitting:
- k-fold cross validation
- regularization
- How to debug ML algorithms
- How to recognize high variance scenarios
- What to do about high variance
- How to recognize high bias
- What to do about high bias
Reading:
- Lecture Notes #13
- Ben Taskar’s description of under- and overfitting
- MLaPP: 1.4.7
- Andrew Ng’s lecture on ML debugging
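A generic k-fold cross validation loop as a sketch; train_fn and error_fn are placeholders for whatever learner and error measure are being debugged, and the averaged fold error stands in for test error when picking hyperparameters or regularization strength:

```python
import numpy as np

def k_fold_cv(X, y, train_fn, error_fn, k=5, seed=0):
    """train_fn(X, y) -> model, error_fn(model, X, y) -> validation error.
    Returns the average validation error over the k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        errors.append(error_fn(model, X[val], y[val]))
    return float(np.mean(errors))
```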

Kernel Machines (reducing Bias):
- How to kernelize an algorithm.
- Why to kernelize an algorithm.
- RBF Kernel, Polynomial Kernel, Linear Kernel
- What happens when you change the RBF kernel width.
- What is required for the kernel trick to apply
1. the weight vector must be a linear combination of the inputs
2. all inputs are only accessed through inner products
- The kernel trick allows you to perform classification implicitly (!) in very high dimensional spaces
Reading:
- Lecture Notes #14
- Lecture Notes #14b
- Ben Taskar’s Notes on SVMs
- MLAPP: 14-14.2.1, 14.4, 14.4.1, 14.5.2
- Derivation of kernel Ridge regression by Max Welling
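Kernelized ridge regression with an RBF kernel, as a sketch of the two requirements above: the solution is a linear combination of the inputs (the alphas), and the inputs only ever enter through inner products (the kernel matrix). The values of gamma and lam are arbitrary:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2); smaller gamma = wider kernel = smoother functions
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, gamma=1.0, lam=1e-2):
    # w = sum_i alpha_i * phi(x_i); only the n x n kernel matrix K is ever needed
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```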
Gaussian Processes / Bayesian Global Optimization:
- Properties of Gaussian Distributions
- Gaussian Processes (Assumptions)
- GPs are kernel machines
Reading:
- Lecture Notes #15
- MLAPP: 15
- Matlab code for Bayesian optimization
- Genetic Programming bike demo (the “other” GP)
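The GP posterior at test points via the standard Gaussian conditioning formulas; any kernel function can be plugged in, e.g. the rbf_kernel from the kernel-machines sketch above, and the noise level is an assumption:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, kernel, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP at X_test."""
    K    = kernel(X_train, X_train) + noise * np.eye(len(y_train))
    K_s  = kernel(X_test, X_train)
    K_ss = kernel(X_test, X_test)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov  = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov
```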
k-nearest neighbors data structures
- How to construct and use a KD-Tree
- How to construct and use a Ball-tree
- What are the advantages of KD-Trees over Ball-trees and vice versa?
Reading:
- Lecture Notes #16
- KD-Trees wikipedia page
- A great tutorial on KD-Trees
- Best description of Ball trees (Section 2.1)
- LMNN paper
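A compact KD-Tree with a pruned nearest-neighbor search; the list-of-points representation and the toy query are for illustration (a Ball-tree follows the same idea but bounds each subtree with a hypersphere instead of an axis-aligned split):

```python
import numpy as np

def build_kdtree(points, depth=0):
    """points: list of d-dimensional numpy arrays; split on one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left":  build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    if node is None:
        return best
    if best is None or np.linalg.norm(query - node["point"]) < np.linalg.norm(query - best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # only search the far side if the splitting plane is closer than the best point so far
    if abs(diff) < np.linalg.norm(query - best):
        best = nearest(far, query, best)
    return best

pts = [np.array(p, dtype=float) for p in [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]]
print(nearest(build_kdtree(pts), np.array([9.0, 2.0])))  # -> [8. 1.]
```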
Decision / Regression Trees:
- ID3 algorithm
- Gini Index
- Entropy splitting function (aka Information Gain)
- Regression Trees
Reading:
- Lecture Notes #17
- Decision Tree wiki page
- Ben Taskar’s old notes
- MLAPP: 16.2
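The two splitting impurity measures from this lecture and the resulting information gain, as used by ID3-style tree construction:

```python
import numpy as np

def entropy(y):
    # H(S) = -sum_c p_c * log2(p_c)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(y):
    # G(S) = 1 - sum_c p_c^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(y, y_left, y_right):
    # entropy of the parent minus the weighted entropy of the two children of a split
    n, nl, nr = len(y), len(y_left), len(y_right)
    return entropy(y) - (nl / n) * entropy(y_left) - (nr / n) * entropy(y_right)
```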
Bagging:
- What is Bagging, how does it work, why does it reduce variance
- Bootstrapping
- What the weak law of large numbers has to do with bagging
- Bootstrapping leads to correctly (yet not independently) distributed samples
- Random Forests:
  - What are the parameters
  - What is the algorithm, and what is so nice about it
Reading:
- Lecture Notes #18
- MLAPP: 16.2 - 16.2.6
- Paper on Random Forest and Gradient Boosting
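A generic bagging sketch: train m copies of some base learner on bootstrap samples and average their predictions. A random forest additionally restricts each split of each tree to a random subset of features; train_fn is a placeholder for any learner:

```python
import numpy as np

def bag(X, y, train_fn, m=100, seed=0):
    """train_fn(X, y) -> a predict function. Returns the bagged ensemble predictor."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)   # bootstrap sample: n draws with replacement
        models.append(train_fn(X[idx], y[idx]))
    # average the identically (though not independently) distributed predictions;
    # for classification, take a majority vote instead
    return lambda X_new: np.mean([h(X_new) for h in models], axis=0)
```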
Boosting:
- What is a weak learner (what is the assumption made here)
- Boosting reduces bias
- The Anyboost algorithm
- The Gradient boost algorithm
- The Adaboost algorithm
- How to derive the weight of a weak learner

Reading:
- Lecture Notes #19
- Paper on Random Forest and Gradient Boosting
- MLAPP: 16.4, 16.4.3, 16.4.8 (I believe there may be a mistake in line 6 of Algorithm 16.2: it should be 2*(I()-1) instead of just I())
Additional reading:
- Freund & Schapire’s tutorial on Boosting
- Very nice book chapter on adaboost and relationship with exponential loss
- Ben Taskar’s slides on Adaboost
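An AdaBoost sketch around a generic weak learner; the weight alpha of each weak learner comes from the derivation listed above, and the weak-learning assumption is that each h beats chance on the weighted training set:

```python
import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """y in {-1, +1}; weak_learner(X, y, weights) -> h with h(X) in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # uniform weights over training points
    learners, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)
        pred = h(X)
        err = max(w[pred != y].sum(), 1e-12)      # weighted error (clamped to avoid log(0))
        if err >= 0.5:                            # weak-learning assumption violated: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak learner
        w *= np.exp(-alpha * y * pred)            # up-weight mistakes, down-weight correct points
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, learners)))
```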
Artificial Neural Networks / Deep Learning:
- What artificial neural networks (ANNs) are
- Forward propagation
- Backward propagation
- Why you need a squashing function
Reading:
- Intro to backprop
- MLAPP 28.1-28.5
- Tricks & tips on how to make ANNs perform better
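A tiny one-hidden-layer network trained on XOR, showing forward propagation, backward propagation, and the sigmoid squashing function; the architecture, squared loss, and learning rate are all arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden -> output
lr = 0.5

for _ in range(10000):
    # forward propagation (the sigmoid is the squashing function / nonlinearity)
    a1  = sigmoid(X @ W1 + b1)
    out = sigmoid(a1 @ W2 + b2)

    # backward propagation of the squared-loss gradient, layer by layer (chain rule)
    d_out = (out - y) * out * (1 - out)
    d_a1  = (d_out @ W2.T) * a1 * (1 - a1)
    W2 -= lr * a1.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T  @ d_a1;   b1 -= lr * d_a1.sum(axis=0)

print(np.round(out.ravel(), 2))   # typically close to [0, 1, 1, 0]
```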