__Overview__

- Administrative stuff

- Supervised learning setup:

- Feature vectors, Labels

- 0/1 loss, squared loss, absolute loss

- Train / Test split

__k-nearest neighbors__

- Hypothesis classes

- Nearest Neighbor Classifier

- Sketch of the Cover and Hart proof

(that the 1-NN error converges to at most twice the Bayes error in the infinite-sample limit)

- Curse of Dimensionality
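
The topics above can be illustrated with a minimal 1-NN classifier in pure Python (a sketch with toy data; names are my own, not from the lecture notes):

```python
import math

def nearest_neighbor_predict(X_train, y_train, x):
    """Predict the label of x as the label of its closest training point."""
    dists = [math.dist(x, xi) for xi in X_train]
    return y_train[dists.index(min(dists))]

# Toy training set: feature vectors with 0/1 labels
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
y = [0, 1, 1]
```

The curse of dimensionality shows up here because `math.dist` becomes less and less informative as the dimension grows: all pairwise distances concentrate around the same value.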

__Reading:__

- Lecture Notes

- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9

- Start reading: 2.1 -- 2.5

- Videos showing different values of k

- Video describing nearest neighbors

- Probably the best explanation of nearest neighbors

__Perceptron__

- Linear Classifiers

- Absorbing the bias into a (d+1)-dimensional weight vector

- Perceptron convergence proof
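
A minimal sketch of the perceptron with the bias absorbed into a (d+1)-dimensional weight vector (toy data; my own naming, not from the notes):

```python
def perceptron_train(X, y, max_epochs=100):
    # Absorb the bias: append a constant-1 feature, so w is (d+1)-dimensional.
    X = [xi + (1.0,) for xi in X]
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # Misclassified (or on the boundary): update w <- w + y * x
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:  # converged (guaranteed if data is separable)
            break
    return w

def perceptron_predict(w, x):
    x = x + (1.0,)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1

# Linearly separable toy data with labels in {-1, +1}
X = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (2.0, 3.0)]
y = [-1, -1, 1, 1]
w = perceptron_train(X, y)
```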

__Reading:__

- Lecture Notes

- The Perceptron Wiki page

- MLaPP 8.5.4

- Article in the New Yorker on the Perceptron

__Estimating Probabilities from data__

- MLE

- MAP

- Bayesian vs. Frequentist statistics
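
The MLE vs. MAP distinction can be seen on a Bernoulli coin (a sketch with a Beta prior as hallucinated pseudo-counts; function names are my own):

```python
def mle_heads(heads, tails):
    # Maximum likelihood estimate of P(heads): just the empirical fraction.
    return heads / (heads + tails)

def map_heads(heads, tails, alpha=2, beta=2):
    # MAP estimate under a Beta(alpha, beta) prior; the prior acts like
    # (alpha - 1) extra heads and (beta - 1) extra tails.
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
```

After seeing only 2 heads, the MLE claims the coin always lands heads, while the MAP estimate is pulled toward the prior mean.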

__Reading:__

- Lecture Notes

- Ben Taskar’s notes on MLE vs MAP

- Tom Mitchell’s book chapter on Estimating Probabilities

- YouTube videos on MLE, MAP, and the Dirichlet distribution

__Naive Bayes__

- Naive Bayes Assumption

- Why is estimating probabilities difficult in high dimensions?
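
A quick way to see why the Naive Bayes assumption helps: count the parameters needed per class for d binary features (an illustrative sketch, not from the notes):

```python
def joint_param_count(d):
    # Estimating the full joint P(x1,...,xd | y) for d binary features
    # needs 2^d - 1 free parameters per class -- hopeless in high dimensions.
    return 2 ** d - 1

def naive_bayes_param_count(d):
    # Under the Naive Bayes assumption the features are conditionally
    # independent given y, so only d parameters per class are needed.
    return d
```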

__Reading:__

- Lecture Notes

- Ben Taskar’s notes on Naïve Bayes

- Tom Mitchell’s book chapter on Naive Bayes (chapters 1-3)

- YouTube videos on Naive Bayes

- Xiaojin Zhu’s notes on Multinomial Naive Bayes

- Manning’s description of Multinomial Naive Bayes

__Logistic Regression__

- Logistic Regression formulation

- Relationship of LR with Naive Bayes
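
The formulation bullet can be made concrete with the sigmoid and the log-likelihood gradient for labels in {0, 1} (a sketch; names are my own):

```python
import math

def sigmoid(z):
    # Squashes a linear score into a probability P(y = 1 | x).
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood_grad(w, X, y):
    # Gradient of the log-likelihood: sum over points of (y - sigmoid(w.x)) * x.
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        grad = [g + err * xj for g, xj in zip(grad, xi)]
    return grad
```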

__Reading:__

- Lecture Notes

- MLAPP 8, 8.1, 8.2, 8.3.1, 8.3.2, 8.3.4, 8.6

- Ben Taskar’s notes on Logistic Regression

- Tom Mitchell’s book chapter on Naive Bayes and Logistic Regression

- YouTube videos on Logistic Regression

- Logistic Regression Wiki

__Gradient Descent:__

- Gradient Descent (GD)

- Taylor’s Expansion

- Proof that GD decreases with every step if step size is small enough

- some tricks to set the step size

- Newton’s Method
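
A minimal GD loop showing the claim above in action: with a small enough step size the iterate approaches the minimizer (a sketch on a 1-D quadratic; my own example):

```python
def gradient_descent(grad, x0, step=0.1, iters=100):
    # Repeatedly step against the gradient; for a small enough step size,
    # the objective decreases at every iteration.
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2(x - 3) and minimum at x = 3.
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```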

__Linear Regression:__

- Assumption of linear regression with Gaussian noise

- Ordinary Least Squares (OLS) = MLE

- Ridge Regression = MAP
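
The OLS = MLE and Ridge = MAP correspondence is easy to see in the 1-D, no-intercept case (a sketch; function names are my own):

```python
def ols_slope(x, y):
    # 1-D least squares through the origin: the MLE under Gaussian noise.
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def ridge_slope(x, y, lam):
    # Ridge adds lambda to the denominator: the MAP estimate under a
    # zero-mean Gaussian prior on the weight, shrinking it toward 0.
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)
```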

__Linear SVM:__

- What is the margin of a hyperplane classifier

- How to derive a max margin classifier

- That SVMs are convex

- The final QP of SVMs

- Slack variables

- The unconstrained SVM formulation
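
The unconstrained formulation mentioned above can be written out directly: the slack variables become hinge losses (a sketch; my own naming):

```python
def svm_objective(w, b, X, y, C=1.0):
    # Unconstrained soft-margin SVM: regularizer 0.5 * ||w||^2 plus
    # C times the total hinge loss (the slack of each point).
    margin = 0.5 * sum(wj * wj for wj in w)
    hinge = sum(max(0.0, 1 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
                for xi, yi in zip(X, y))
    return margin + C * hinge
```

Both terms are convex in (w, b), which is why the SVM problem as a whole is convex.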

__Empirical Risk Minimization:__

- Setup of loss function and regularizer

- classification loss functions: hinge-loss, log-loss, zero-one loss, exponential loss

- regression loss functions: absolute loss, squared loss, Huber loss, log-cosh

- Properties of the various loss functions

- Which ones are more susceptible to noise, and which ones are less

- Special cases: OLS, Ridge regression, Lasso, Logistic Regression
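
The loss functions listed above, written out for a prediction h and label y (y in {-1, +1} for the classification losses; a reference sketch):

```python
import math

# Classification losses as a function of the margin h * y
def hinge_loss(h, y):       return max(0.0, 1 - h * y)
def log_loss(h, y):         return math.log(1 + math.exp(-h * y))
def zero_one_loss(h, y):    return 0.0 if h * y > 0 else 1.0
def exponential_loss(h, y): return math.exp(-h * y)

# Regression losses as a function of the residual h - y
def squared_loss(h, y):     return (h - y) ** 2
def absolute_loss(h, y):    return abs(h - y)
```

The exponential loss blows up fastest on badly misclassified points, which is one reason it is the most noise-sensitive of the classification losses.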

__Reduction of Elastic Net to SVM__

- Elastic Net and Lasso can be reduced to SVM (squared hinge-loss, no bias)

- This reduction is exact

- Practical advantage: SVM solvers can be used to solve the Elastic Net

__Reading:__

- Lecture Notes

- MLAPP 13.3 - 13.4

- Original Publication by Zhou et al. (AAAI 2015)

- Prior publication by Jaggi 2014 on Lasso and SVM

__Bias / Variance Tradeoff:__

- What is Bias?

- What is Variance?

- What is Noise?

- How error decomposes into Bias, Variance, Noise

__Reading:__

- Lecture Notes #12

- Ben Taskar’s Notes (recommended!!)

- Excellent Notes by Scott Fortmann-Roe

- EoSLR Chapter 2.9

- MLAPP: 6.2.2

__ML Debugging, Over- / Underfitting:__

- k-fold cross validation

- regularization

- How to debug ML algorithms

- How to recognize high variance scenarios

- What to do about high variance

- how to recognize high bias

- what to do about high bias
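
k-fold cross validation, the main diagnostic tool above, can be sketched as an index-splitting generator (my own minimal version, not the lecture's code):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) pairs for k-fold CV.

    Each point appears in exactly one validation fold; the model is
    trained k times on the remaining points.
    """
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        # Last fold absorbs the remainder when n is not divisible by k.
        val = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        val_set = set(val)
        train = [j for j in idx if j not in val_set]
        yield train, val
```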

__Reading:__

- Lecture Notes #13

- Ben Taskar’s description of under- and overfitting

- MLaPP: 1.4.7

- Andrew Ng’s lecture on ML debugging

__Kernel Machines (reducing Bias):__

- How to kernelize an algorithm.

- Why to kernelize an algorithm.

- RBF Kernel, Polynomial Kernel, Linear Kernel

- What happens when you change the RBF kernel width.

- What is required for the kernel trick to apply

1. the weight vector must be a linear combination of the inputs

2. all inputs are only accessed through inner products

- The kernel trick allows you to perform classification indirectly (!) in very high dimensional spaces
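
The two requirements above can be made concrete with the RBF kernel: prediction only touches inputs through (kernelized) inner products (a sketch; names are my own):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2): an inner product in an
    # infinite-dimensional feature space, computed without ever visiting it.
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def kernel_predict(alphas, X_train, y_train, x, gamma=1.0):
    # Requirement 1: w is a linear combination of the training inputs.
    # Requirement 2: inputs appear only inside inner products, so every
    # inner product can be replaced by the kernel.
    return sum(a * yi * rbf_kernel(xi, x, gamma)
               for a, xi, yi in zip(alphas, X_train, y_train))
```

Shrinking the kernel width (increasing gamma) makes each training point influence only its immediate neighborhood.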

__Reading:__

- Lecture Notes #14

- Lecture Notes #14b

- Ben Taskar’s Notes on SVMs

- MLAPP: 14-14.2.1, 14.4, 14.4.1, 14.5.2

- Derivation of kernel Ridge regression by Max Welling

__Gaussian Processes / Bayesian Global Optimization:__

- Properties of Gaussian Distributions

- Gaussian Processes (Assumptions)

- GPs are kernel machines

__Reading:__

- Lecture Notes #15

- MLAPP: 15

- MATLAB code for Bayesian optimization

- Genetic Programming bike demo (the “other” GP)

__k-nearest neighbors data structures__

- How to construct and use a KD-Tree

- How to construct and use a Ball-tree

- What are the advantages of KD-Trees over Ball-trees and vice versa?

__Reading:__

- Lecture Notes #16

- KD-Trees wikipedia page

- A great tutorial on KD-Trees

- Best description of Ball trees (Section 2.1)

- LMNN paper

__Decision / Regression Trees:__

- ID3 algorithm

- Gini Index

- Entropy splitting function (aka Information Gain)

- Regression Trees
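
The two splitting criteria above, written out for a list of labels at a node (a sketch; my own naming):

```python
import math

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2; zero for a pure node.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    # Shannon entropy: -sum_k p_k log2 p_k; information gain is the drop
    # in entropy from a parent node to the weighted average of its children.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))
```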

__Bagging:__

- What is Bagging, how does it work, why does it reduce variance

- Bootstrapping

- What the weak law of large numbers has to do with bagging

- Bootstrapping leads to correctly distributed (yet not independent) samples

- Random Forests:

  - what are the parameters

  - what is the algorithm, what is so nice about it
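
Bootstrapping itself is one line: sample n points with replacement from the n training points (a sketch; my own naming):

```python
import random

def bootstrap_sample(data, rng):
    # Draw len(data) points with replacement. Each draw has the correct
    # marginal distribution, but the draws are not independent of the
    # original sample -- which is exactly the caveat noted above.
    return [rng.choice(data) for _ in data]
```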

__Boosting:__

- What is a weak learner (what is the assumption made here)

- Boosting reduces bias

- The AnyBoost algorithm

- The gradient boosting algorithm

- The AdaBoost algorithm

- The derivation to compute the weight of a weak learner
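
The result of that derivation is a one-liner: a weak learner's weight depends only on its weighted training error (a sketch of the AdaBoost formula):

```python
import math

def adaboost_alpha(weighted_error):
    # AdaBoost weight of a weak learner: alpha = 0.5 * ln((1 - err) / err).
    # A coin-flip learner (err = 0.5) gets weight 0; better learners get
    # positive weight, growing as the error approaches 0.
    return 0.5 * math.log((1 - weighted_error) / weighted_error)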

__Reading:__

- Lecture Notes #19

- Paper on Random Forest and Gradient Boosting

- MLAPP: 16.4, 16.4.3, 16.4.8 (I believe there may be a mistake in line 6 of Algorithm 16.2: it should be 2*(I()-1) instead of just I())

__Additional reading:__

- Freund & Schapire’s tutorial on Boosting

- Very nice book chapter on adaboost and relationship with exponential loss

- Ben Taskar’s slides on Adaboost

__Artificial Neural Networks / Deep Learning:__

- What artificial neural networks (ANN) are

- Forward propagation

- Backward propagation

- Why you need a squashing function
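
Forward propagation and the role of the squashing function in one tiny network (a sketch with a single hidden layer; my own naming):

```python
import math

def sigmoid(z):
    # The squashing function. Without it, stacking linear layers would
    # collapse into a single linear map, so the network could only
    # represent linear functions.
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    # Forward propagation: linear layer -> squashing -> linear output.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * h for w, h in zip(W2, hidden))
```

Backward propagation then pushes the loss gradient through these same two layers via the chain rule, reusing the activations computed on the forward pass.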