Overview
- Administrative stuff
- Supervised learning setup:
- Feature vectors, Labels
- 0/1 loss, squared loss, absolute loss
- Train / Test split
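A minimal sketch of the three losses listed above, assuming predictions and labels arrive as plain Python lists and each loss is averaged over the data set. The function names are illustrative and not taken from the course materials.

```python
def zero_one_loss(preds, labels):
    """Fraction of predictions that differ from their label."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


def squared_loss(preds, labels):
    """Mean of the squared differences between prediction and label."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def absolute_loss(preds, labels):
    """Mean of the absolute differences between prediction and label."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)
```

The 0/1 loss is typically evaluated on classification labels, the other two on real-valued targets.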
Reading:
- 00 Introduction
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9
2017 Videolecture:
- #1: CS4780 Lecture #1: Introduction!
k-nearest neighbors
- Hypothesis classes
- Nearest Neighbor Classifier
- Sketch of Cover and Hart proof
(that 1-NN converges to at most 2x the Bayes error in the large-sample limit)
- Curse of Dimensionality
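As a rough illustration of the nearest neighbor classifier named above, here is a brute-force sketch. It assumes Euclidean distance and a majority vote; `knn_predict` and its parameters are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=1):
    """Majority vote among the k training points closest to `query`."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

With k=1 this is the 1-NN rule that the convergence result above refers to.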
Reading:
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9
- Start reading: 2.1 -- 2.5
- videos of different values of k
- Video describing nearest neighbors
- Probably the best explanation of nearest neighbors
2017 Videolectures:
- #4 kNN I
- #5 kNN II (proof of convergence)
- #8 curse of dimensionality
Perceptron
- Linear Classifiers
- Absorbing bias into a d+1 dimensional vector
- Perceptron convergence proof
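A small sketch of the Perceptron update, including the bias trick from the list above. It assumes labels in {-1, +1}; the `max_epochs` cap is an assumption added here so the loop also stops on data that is not linearly separable.

```python
def perceptron(xs, ys, max_epochs=100):
    """Return a weight vector of length d+1; the last entry plays the role of the bias."""
    data = [list(x) + [1.0] for x in xs]  # absorb the bias: append a constant 1
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(data, ys):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified (or exactly on the hyperplane)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: converged
            break
    return w


def perceptron_predict(w, x):
    """Classify x with the sign of the (bias-augmented) inner product."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score > 0 else -1
```

On linearly separable data the loop stops after finitely many updates, which is the statement of the convergence proof.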
Reading:
- The Perceptron Wiki page
- MLaPP 8.5.4
- Article in the New Yorker on the Perceptron
Lectures:
- #9 Perceptron Algorithm
- #10 Perceptron convergence proof
Reading:
- Ben Taskar’s notes on MLE vs MAP
- Tom Mitchell’s book chapter on Estimating Probabilities
- Youtube videos on MLE and MAP and the Dirichlet distribution
Lectures:
- #11 MLE and MAP
Reading:
- Ben Taskar’s notes on Naïve Bayes
- Tom Mitchell’s book chapter on Naive Bayes (chapters 1-3)
- Youtube videos on Naive Bayes
- Xiaojin Zhu’s notes on Multinomial Naive Bayes
- Manning’s description of Multinomial Naive Bayes
Julia Code from MLE / MAP demo. You can try it out in Julia Box.
Lectures:
- #12 MLE and MAP Example / Naive Bayes
- #13 Naive Bayes - parameter estimation
- #14 Naive Bayes with continuous variables
Reading:
- MLAPP 8, 8.1, 8.2, 8.3.1, 8.3.2, 8.3.4, 8.6
- Ben Taskar’s notes on Logistic Regression
- Tom Mitchell’s book chapter on Naive Bayes and Logistic Regression
- Youtube videos on Logistic Regression
- Logistic Regression Wiki
Lectures:
- #15 Logistic Regression
Gradient Descent:
- Gradient Descent (GD)
- Taylor’s Expansion
- Proof that GD decreases the objective with every step if the step size is small enough
- some tricks to set the step size
- Newton’s Method
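To make the gradient descent item concrete, here is a minimal sketch with a fixed step size. The quadratic `f` and its gradient `grad_f` are made-up examples, and the step size 0.1 is an assumption that happens to be small enough for this function.

```python
def gradient_descent(grad, x0, step=0.1, iters=100):
    """Repeatedly move against the gradient with a constant step size."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def f(x):
    """Example objective: a convex quadratic with minimum at (3, -1)."""
    return (x[0] - 3) ** 2 + 2 * (x[1] + 1) ** 2


def grad_f(x):
    """Gradient of the example objective f."""
    return [2 * (x[0] - 3), 4 * (x[1] + 1)]
```

Calling `gradient_descent(grad_f, [0.0, 0.0])` moves the iterate toward (3, -1). Newton's Method would replace the constant step by the inverse Hessian.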
Reading:
- MLAPP 13.3-13.3.1, 8.3.2, 8.3.6, 13.5.3
- Nice blogpost on Gradient Descent, Adagrad, Newton’s method
Lectures:
- #16: Gradient Descent
Linear Regression:
- Assumption of linear regression with Gaussian Noise
- Ordinary Least Squares (OLS) = MLE
- Ridge Regression = MAP
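The two equalities above can be sketched as follows. The notation (weights w, noise variance sigma^2, prior variance tau^2, n training points) is an assumption made for this sketch and may differ from the lecture notes.

```latex
% Model assumption: y_i = w^T x_i + eps_i with eps_i ~ N(0, sigma^2)
\begin{align*}
\hat{\mathbf{w}}_{\text{MLE}}
  &= \arg\max_{\mathbf{w}} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i, \mathbf{w})
   = \arg\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2
   && \text{(OLS)} \\
% Additional prior: w ~ N(0, tau^2 I)
\hat{\mathbf{w}}_{\text{MAP}}
  &= \arg\max_{\mathbf{w}} \Big[ \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) + \log p(\mathbf{w}) \Big]
   = \arg\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2
     + \lambda \lVert \mathbf{w} \rVert_2^2,
   \quad \lambda = \frac{\sigma^2}{n\tau^2}
   && \text{(Ridge)}
\end{align*}
```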
Reading:
- MLAPP 7-7.5.1 + 7.5.4
- OLS wiki page
- Youtube video on OLS
Lectures:
- #17: Linear Regression
Linear SVM:
- What is the margin of a hyperplane classifier
- How to derive a max margin classifier
- That the SVM optimization problem is convex
- The final QP of SVMs
- Slack variables
- The unconstrained SVM formulation
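For orientation, the margin, the soft-margin QP, and the unconstrained form can be written as below. This is a sketch in one common notation (trade-off constant C, slack variables xi_i, labels y_i in {-1, +1}), which is assumed here and may differ from the lecture.

```latex
\begin{align*}
% Margin of the hyperplane (w, b) with respect to the data set D
\gamma(\mathbf{w}, b) &= \min_{\mathbf{x}_i \in D}
  \frac{\lvert \mathbf{w}^\top \mathbf{x}_i + b \rvert}{\lVert \mathbf{w} \rVert_2} \\[4pt]
% Quadratic program with slack variables
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \;& \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{n} \xi_i
  \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \\[4pt]
% Unconstrained form: at the optimum xi_i = max(1 - y_i(w^T x_i + b), 0)
\min_{\mathbf{w}, b} \;& \mathbf{w}^\top \mathbf{w}
  + C \sum_{i=1}^{n} \max\!\left[ 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b),\; 0 \right]
\end{align*}
```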
Empirical Risk Minimization:
- Setup of loss function and regularizer
- classification loss functions: hinge-loss, log-loss, zero-one loss, exponential
- regression loss functions: absolute loss, squared loss, huber loss, log-cosh
- Properties of the various loss functions
- Which ones are more susceptible to noise, which ones are less
- Special cases: OLS, Ridge regression, Lasso, Logistic Regression
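The loss functions above can be sketched in a few lines. The classification losses are written as functions of the margin z = y * h(x) with y in {-1, +1}, the regression losses as functions of the residual r = h(x) - y. Function names, the Huber threshold `delta`, and the tie convention at z = 0 are assumptions of this sketch.

```python
import math


def hinge(z):
    return max(0.0, 1.0 - z)


def log_loss(z):
    return math.log1p(math.exp(-z))


def zero_one(z):
    return 1.0 if z <= 0 else 0.0  # z == 0 counted as an error (a convention)


def exponential(z):
    return math.exp(-z)


def absolute(r):
    return abs(r)


def squared(r):
    return r * r


def huber(r, delta=1.0):
    """Quadratic near zero, linear beyond delta."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def log_cosh(r):
    return math.log(math.cosh(r))
```

Comparing `squared(10.0)` with `huber(10.0)` shows how much more weight the squared loss puts on a single large residual.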
Reading:
- MLAPP 6.5-6.5.3.2
Lectures:
- #19: Linear SVM / ERM
- #20: Loss functions and regularizations
- #21: Regularization + Midterm Jeopardy
- #22: Midterm Review
ML Debugging, Over- / Underfitting:	 
- k-fold cross validation
- regularization
- How to debug ML algorithms
- How to recognize high variance scenarios
- What to do about high variance
- how to recognize high bias
- what to do about high bias
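One possible way to generate the index splits for k-fold cross validation is sketched below. It assumes the data has already been shuffled; `k_fold_splits` is a hypothetical helper, not a library function.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first n % k folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

Averaging the validation error over the k folds gives an estimate of the test error. A large gap between training and validation error points to high variance, while both being high points to high bias.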
Reading:
- Ben Taskar’s description of under- and overfitting
- MLaPP: 1.4.7
- Andrew Ng’s lecture on ML debugging
Lectures:
- #25: ML Debugging, Kernels
Bias / Variance Tradeoff:
- What is Bias?
- What is Variance?
- What is Noise?
- How error decomposes into Bias, Variance, Noise
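For squared loss, the decomposition can be written as below. The notation is assumed for this sketch: h_D is the classifier trained on data set D, \bar{h} is its expectation over D, and \bar{y}(x) is the expected label given x.

```latex
% \bar{h}(x) = E_D[h_D(x)],  \bar{y}(x) = E[y | x]
\begin{equation*}
\underbrace{\mathbb{E}_{\mathbf{x}, y, D}\!\left[ (h_D(\mathbf{x}) - y)^2 \right]}_{\text{expected test error}}
= \underbrace{\mathbb{E}_{\mathbf{x}, D}\!\left[ (h_D(\mathbf{x}) - \bar{h}(\mathbf{x}))^2 \right]}_{\text{variance}}
+ \underbrace{\mathbb{E}_{\mathbf{x}}\!\left[ (\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x}))^2 \right]}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_{\mathbf{x}, y}\!\left[ (\bar{y}(\mathbf{x}) - y)^2 \right]}_{\text{noise}}
\end{equation*}
```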
Reading:
- Ben Taskar’s Notes (recommended!!)
- Excellent Notes by Scott Fortmann-Roe
- EoSLR Chapter 2.9
- MLAPP: 6.2.2
Lectures:
- #23: Bias-variance tradeoff
- #24: Bias-variance tradeoff (cont.), regularization, cross validation
Kernels (reducing Bias):	 
- How to kernelize an algorithm.
- Why to kernelize an algorithm.
- RBF Kernel, Polynomial Kernel, Linear Kernel
- What happens when you change the RBF kernel width.
- What is required for the kernel trick to apply
1. the weight vector must be a linear combination of the inputs
2. all inputs are only accessed through inner products
- The kernel trick allows you to perform classification indirectly (!) in very high dimensional spaces
Kernels (Lecture Continued):
- Constructing new kernels
- Kernel SVM
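The three kernels named above can be sketched as follows, together with a check that the degree-2 polynomial kernel equals an inner product in an explicit feature space. The RBF parameterization exp(-||x - z||^2 / sigma^2) is one common convention and is an assumption here; `phi_degree2` is a hypothetical helper that only handles two-dimensional inputs.

```python
import math


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2):
    return (1.0 + linear_kernel(x, z)) ** degree


def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)


def phi_degree2(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]
```

The kernel evaluates the six-dimensional inner product without ever building the feature vectors, which is what makes the trick pay off in very high dimensional spaces. Shrinking `sigma` makes the RBF kernel more local.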
Reading:
- Ben Taskar’s Notes on SVMs
- MLAPP: 14-14.2.1, 14.4, 14.4.1, 14.5.2
- Derivation of kernel Ridge regression by Max Welling
- Kernel Cookbook by David Duvenaud
- Laurent El Ghaoui’s lectures on duality
Lectures:
- #26: Kernels continued
- #27: Even more Kernels
- #28: Kernel regression, Kernel SVM
Gaussian Processes / Bayesian Global Optimization:
- Properties of Gaussian Distributions
- Gaussian Processes (Assumptions)
- GPs are kernel machines
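As a sketch of why GPs are kernel machines, the standard posterior for GP regression is given below. It assumes a zero-mean prior with kernel k and i.i.d. Gaussian observation noise of variance sigma^2; the symbols are chosen for this sketch.

```latex
% Prior: f ~ GP(0, k);  observations: y_i = f(x_i) + eps_i,  eps_i ~ N(0, sigma^2)
% K_{ij} = k(x_i, x_j),  (k_*)_i = k(x_i, x_*)
\begin{equation*}
f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \;\sim\;
\mathcal{N}\!\Big(
  \mathbf{k}_*^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y},\;\;
  k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_*
\Big)
\end{equation*}
```

The inputs enter only through kernel evaluations, and the predictive mean has the same form as kernel ridge regression.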
Reading:
- MLAPP: 15
- Matlab code for bayesian optimization
- Genetic Programming bike demo (the “other” GP)
Lectures:
- #29: Comparison of kNN and kernelized SVM, Gaussian distribution
- #30: Gaussian Process Regression
k-nearest neighbors data structures (not covered in SP2018)
- How to construct and use a KD-Tree
- How to construct and use a Ball-tree
- What are the advantages of KD-T over B-T and vice versa?
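A compact sketch of building a KD-tree and using it for exact nearest neighbor search. It assumes median splits that cycle through the axes and stores the tree as nested dictionaries; both choices are illustrative rather than the lecture's exact construction.

```python
import math


def build_kdtree(points, depth=0):
    """Split on the median along axis depth % d; returns nested dicts or None."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return the stored point closest to `query`, pruning subtrees that cannot win."""
    if node is None:
        return best
    if best is None or math.dist(query, node["point"]) < math.dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(query, best):
        best = nearest(far, query, best)
    return best
```

The pruning test is what saves work in low dimensions. In high dimensions the splitting plane is almost always closer than the best match, so little gets pruned.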
Reading:
- KD-Trees wikipedia page
- A great tutorial on KD-Trees
- Best description of Ball trees (Section 2.1)
- LMNN paper
Lectures:
- #31: Bayesian optimization, KD trees
- #32: Approximate nearest neighbor search
Decision / Regression Trees:
- ID3 algorithm
- Gini Index
- Entropy splitting function (aka Information Gain)
- Regression Trees
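The two splitting criteria can be sketched as follows, assuming the labels in a node are given as a Python list. `information_gain` is computed for a binary split, and all names are illustrative.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_impurity(left, right, impurity):
    """Size-weighted impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left and right."""
    return entropy(parent) - split_impurity(left, right, entropy)
```

A tree-building algorithm such as ID3 greedily picks the split with the lowest weighted impurity, or equivalently the highest gain.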
Reading:
- Decision Tree wiki page
- Ben Taskar’s old notes
- MLAPP: 16.2
Lectures:
- #33: Decision Trees, Gini-Index and Entropy
Bagging:
- What is Bagging, how does it work, why does it reduce variance
- Bootstrapping
- What the weak law of large numbers has to do with bagging
- Bootstrapping leads to correctly distributed (yet not independent) samples
- Random Forests:
  - what are the parameters
  - what is the algorithm, what is so nice about it
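Bagging for regression might be sketched as below. It assumes `fit` is any user-supplied function that maps a data set to a predictor; `fit`, `m`, and `seed` are hypothetical parameters of this example.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def bag(data, fit, m, seed=0):
    """Train m models, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(m)]


def bagged_predict(models, x):
    """Average the predictions of the individual models."""
    return sum(model(x) for model in models) / len(models)
```

A Random Forest follows the same recipe with decision trees as the models, and additionally restricts each split to a random subset of the features.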
Reading:
- MLAPP: 16.2 - 16.2.6
- Paper on Random Forest and Gradient Boosting that contains easy to understand pseudo-code.
Lectures:
- #34: Decision tree regression, bagging and bootstrapping
- #35: Bagging, Random Forests, Boosting
Boosting:
- What is a weak learner (what is the assumption made here)
- Boosting reduces bias
- The Anyboost algorithm
- The Gradient boost algorithm
- The Adaboost algorithm
- The derivation to compute the weight of a weak learner
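A minimal AdaBoost sketch, assuming labels in {-1, +1} and a fixed pool of candidate weak learners to choose from. The learner weight alpha = 0.5 * ln((1 - eps) / eps) is the standard closed form; `make_stump` and the other names are illustrative.

```python
import math


def make_stump(threshold, sign):
    """Weak learner on scalar inputs: predicts sign below the threshold, -sign above."""
    return lambda x: sign if x < threshold else -sign


def adaboost(xs, ys, learners, rounds=10):
    """Return a list of (alpha, learner) pairs."""
    n = len(xs)
    w = [1.0 / n] * n  # start with uniform weights over the training points
    ensemble = []
    for _ in range(rounds):
        # Weighted error of every candidate; keep the best one.
        errs = [sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y) for h in learners]
        eps, h = min(zip(errs, learners), key=lambda t: t[0])
        if eps >= 0.5:  # weak learner assumption violated: stop
            break
        if eps == 0:  # a perfect learner needs no boosting
            return [(1.0, h)]
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Up-weight mistakes, down-weight correct points, then renormalize.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def boosted_predict(ensemble, x):
    """Sign of the alpha-weighted vote of the ensemble."""
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) > 0 else -1
```

On one-dimensional data labeled + + - - + +, no single stump is correct everywhere, but a weighted vote of three stumps is.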
Reading:
- Paper on Random Forest and Gradient Boosting
- MLAPP: 16.4, 16.4.3, 16.4.8 (I believe there may be a mistake in line 6 of Algorithm 16.2: it should be 2*(I()-1) instead of just I())
Additional reading:
- Freund & Schapire’s tutorial on Boosting
- Ben Taskar’s slides on Adaboost
Lectures:
- #36: Boosting (cont.)
Artificial Neural Networks / Deep Learning:
- What artificial neural networks (ANN) are
- Forward propagation
- Backward propagation
- Why you need a squashing function
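Forward and backward propagation for a network with one hidden layer might look like the sketch below. It assumes tanh as the squashing function, a single linear output, and the loss 0.5 * (y_hat - y)^2; the parameter names are chosen for this example.

```python
import math


def forward(x, W1, b1, w2, b2):
    """Return (hidden activations, output) for input list x."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sum(w * hi for w, hi in zip(w2, h)) + b2
    return h, y_hat


def backward(x, y, W1, b1, w2, b2):
    """Gradients of 0.5 * (y_hat - y)^2 with respect to W1, b1, w2, b2."""
    h, y_hat = forward(x, W1, b1, w2, b2)
    d_out = y_hat - y  # derivative of the loss with respect to y_hat
    g_w2 = [d_out * hi for hi in h]
    g_b2 = d_out
    # Chain rule through tanh: tanh'(a) = 1 - tanh(a)^2
    d_hidden = [d_out * w * (1.0 - hi * hi) for w, hi in zip(w2, h)]
    g_W1 = [[d * xi for xi in x] for d in d_hidden]
    g_b1 = d_hidden
    return g_W1, g_b1, g_w2, g_b2
```

Without the tanh the two layers would collapse into a single linear map, which is why a squashing function is needed.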
Reading:
- Intro to backprop
- MLAPP 28.1-28.5
- Tricks & Tips how to make ANN perform better
- Neural Network Playground