Lectures | Machine Learning

Overview
- Administrative stuff
- Supervised learning setup:
- Feature vectors, Labels
- 0/1 loss, squared loss, absolute loss
- Train / Test split

Reading:
- 00 Introduction
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9

Lectures:
- #1: CS4780 Lecture #1: Introduction!
- #2: CS4780 Lecture 2: History of ML, Intro about Supervised Learning
- #3: CS4780 Lecture 3: Setup, Loss functions, Train/test split

k-nearest neighbors
- Hypothesis classes
- Nearest Neighbor Classifier
- Sketch of Covert and Hart proof
(that 1-NN converges to at most 2xBayes Error in the sample limit)
- Curse of Dimensionality

Reading:
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9
- Start reading: 2.1 -- 2.5
- videos of different values of k
- Video describing nearest neighbors
- Probably the best explanation of nearest neighbors

Lectures:
- #4 kNN I
- #5 kNN II (proof of convergence)
- #8 curse of dimensionality

Math background
- Discrete / Continuous probability distributions

Reading:
- Lecture Notes on Math Background (here)

Lectures:
- #6 probability review I
- #7 probability review II

Perceptron
- Linear Classifiers
- Absorbing bias into a d+1 dimensional vector
- Perceptron convergence proof

Reading:
- The Perceptron Wiki page
- MLaPP 8.5.4
- Article in the New Yorker on the Perceptron

Lectures:
- #9 Perceptron Algorithm
- #10 Perceptron convergence proof

Estimating Probabilities from data
- MLE
- MAP
- Bayesian vs. Frequentist statistics

Reading:
- Ben Taskar’s notes on MLE vs MAP
- Tom Mitchell’s book chapter on Estimating Probabilities
- Youtube videos on MLE and MAP and the Dirichlet distribution

Lectures:
- #11 MLE and MAP

Naive Bayes
- Naive Bayes Assumption
- Why is estimating probabilities difficult and high dimensions?

Reading:
- Ben Taskar’s notes on Naïve Bayes
- Tom Mitchell’s book chapter on Naive Bayes (chapters 1-3)
- Youtube videos on Naive Bayes
- Xiaojin Zhu’s notes on Multinomial Naive Bayes
- Mannings’ description of Multinomial Naive Bayes

Julia Code from MLE / MAP demo. You can try it out in Julia Box.

Lectures:
- #12 MLE and MAP Example / Naive Bayes
- #13 Naive Bayes - parameter estimation
- #14 Naive Bayes with continuous variables

Logistic Regression
- Logistic Regression formulation
- Relationship of LR with Naive Bayes

Reading:
- MLAPP 8, 8.1, 8.2, 8.3.1, 8.3.2, 8.3.4, 8.6
- Ben Taskar’s notes on Logistic Regression
- Tom Mitchell’s book chapter on Naive Bayes and Logistic Regression
- Youtube videos on Logistic Regression
- Logistic Regression Wiki

Lectures:
- #15 Logistic Regression

Gradient Descent:
- Gradient Descent (GD)
- Taylor’s Expansion
- Proof that GD decreases with every step if step size is small enough
- some tricks to set the step size
- Newton’s Method

Reading:
- MLAPP 13.3-13.3.1, 8.3.2, 8.3.6, 13.5.3

- Nice blogpost on Gradient Descent, Adagrad, Newton’s method

Lectures:
- #16: Gradient Descent

Linear Regression:
- Assumption of linear regression with Gaussian Noise
- Ordinary Least Squares (OLS) = MLE
- Ridge Regression = MAP

Reading:
- MLAPP 7-7.5.1 + 7.5.4
- OLS wiki page
- Youtube video on OLS

Lectures:
- #17: Linear Regression

Linear SVM:
- What is the margin of a hyperplane classifier
- How to derive a max margin classifier
- That SVMs are convex
- The final QP of SVMs
- Slack variables
- The unconstrained SVM formulation

Reading:
- MLAPP: 14.5 - 14.5.2.2
- Ben Taskar’s Notes on SVMs

Lectures:
- #18: Support Vector Machines

Empirical Risk Minimization:
- Setup of loss function and regularizer
- classification loss functions: hinge-loss, log-loss, zero-one loss, exponential
- regression loss functions: absolute loss, squared loss, huber loss, log-cosh
- Properties of the various loss functions
- Which ones are more susceptible to noise, which ones are loss
- Special cases: OLS, Ridge regression, Lasso, Logistic Regression

Reading:
- MLAPP 6.5-6.5.3.2

Lectures:
- #19: Linear SVM / ERM
- #20: Loss functions and regularizations
- #21: Regularization + Midterm Jeopardy
- #22: Midterm Review

Bias / Variance Tradeoff:
- What is Bias?
- What is Variance?
- What is Noise?
- How error decomposes into Bias, Variance, Noise

Reading:
- Ben Taskar’s Notes (recommended!!)
- Excellent Notes by Scott Foreman-Roe
- EoSLR Chapter 2.9
- MLAPP: 6.2.2

Lectures:
- #23: Bias-variance tradeoff
- #24: Bias-variance tradeoff (cont.), regularization, cross validation

ML Debugging, Over- / Underfitting:
- k-fold cross validation
- regularization
- How to debug ML algorithms
- How to recognize high variance scenarios
- What to do about high variance
- how to recognize high bias
- what to do about high bias

Reading:
- Ben Taskar’s description of under- and overfitting
- MLaPP: 1.4.7
- Andrew Ng’s lecture on ML debugging

Lectures:
- #25: ML Debugging, Kernels

Kernels (reducing Bias):
- How to kernelize an algorithm.
- Why to kernelize an algorithm.
- RBF Kernel, Polynomial Kernel, Linear Kernel
- What happens when you change the RBF kernel width.
- What is required for the kernel trick to apply
1. the weight vector must be a linear combination of the inputs
2. all inputs are only accessed through inner products
- The kernel trick allows you to perform classification indirectly (!) in very high dimensional spaces
Kernels (reducing Bias):
- Constructing new kernels
- Kernel SVM

Reading:
- Ben Taskar’s Notes on SVMs
- MLAPP: 14-14.2.1, 14.4, 14.4.1, 14.5.2, 14.4.1
- Derivation of kernel Ridge regression by Max Welling
- Kernel Cookbook by David Duvenaud

Lectures:
- #26: Kernels continued
- #27: Even more Kernels
- #28: Kernel regression, Kernel SVM

Gaussian Processes / Bayesian Global Optimization:
- Properties of Gaussian Distributions
- Gaussian Processes (Assumptions)
- GPs are kernel machines

Reading:
- MLAPP: 15
- Matlab code for bayesian optimization
- Genetic Programming bike demo (the “other” GP)

Lectures:
- #29: Comparison of kNN and kernalized SVM, Gaussian distribution
- #30: Gaussian Process Regression

k-nearest neighbors data structures
- How to construct and use a KD-Tree
- How to construct and use a Ball-tree
- What are the advantages of KD-T over B-T and vice versa?

Reading:
- KD-Trees wikipedia page
- A great tutorial on KD-Trees
- Best description of Ball trees (Section 2.1)
- LMNN paper

Lectures:
- #31: Bayesian optimization, KD trees
- #32: Approximate nearest neighbor search

Decision / Regression Trees:
- ID3 algorithm
- Gini Index
- Entropy splitting function (aka Information Gain)
- Regression Trees

Reading:
- Decision Tree wiki page
- Ben Taskar’s old notes
- MLAPP: 16.2

Lecturess:
- #33: Decision Trees, Gini-Index and Entropy

Bagging:
- What is Bagging, how does it work, why does it reduce variance
- Bootstrapping
- What the weak law of large numbers has to do with bagging
- Bootstrapping leads to correctly (yet not independent) distributed samples
- Random Forests:
o what are the parameters
o what is the algorithm, what is so nice about it

Reading:
- MLAPP: 16.2 - 16.2.6
- Paper on Random Forest and Gradient Boosting that contains easy to understand pseudo-code.

Lectures:
- #34: Decision tree regression, bagging and bootstrapping
- #35: Bagging, Random Forests, Boosting

Boosting:
- What is a weak learner (what is the assumption made here)
- Boosting reduces bias
- The Anyboost algorithm
- The Gradient boost algorithm
- The Adaboost algorithm
- The derivation to compute the weight of a weak learner

Reading:
- Paper on Random Forest and Gradient Boosting
- MLAPP: 16.4, 16.4.3, 16.4.8 (I believe there may be a
mistake in line 6 of Algorithm 16.2 (it should be 2*(I()-1) instead of just I() )
Additional reading:
- Freund & Schapire’s tutorial on Boosting
- Ben Taskar’s slides on Adaboost

Lectures:
- #36: Boosting (cont.)

Artificial Neural Networks / Deep Learning:
- What artificial neural networks (ANN) are
- Forward propagation
- Backward propagation
- Why you need a squashing function

Reading:
- Intro to backprop
- MLAPP 28.1-28.5
- Tricks & Tips how to make ANN perform better

- Neural Network Playground