- Build a KD-type tree with only __pure__ leaves
- Descend the test point and make a decision based on the leaf label. The exact nearest neighbor is not really needed.
Binary decision tree. Only labels are stored.
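The descent step above can be sketched in a few lines. This is a minimal illustration (the `Node` class and `classify` helper are hypothetical names, not from the original notes): internal nodes store a split, leaves store only a label.

```python
# Hypothetical minimal tree: internal nodes store (feature, threshold),
# leaves store only a class label -- no training points are kept.
class Node:
    def __init__(self, label=None, feature=None, threshold=None,
                 left=None, right=None):
        self.label = label          # set only at leaves
        self.feature = feature      # split feature index (internal nodes)
        self.threshold = threshold  # split threshold (internal nodes)
        self.left = left            # subtree with x[feature] <= threshold
        self.right = right          # subtree with x[feature] > threshold

def classify(node, x):
    """Descend until a (pure) leaf is reached and return its label."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label
```

For example, a tree that splits on feature 0 at threshold 0.5 routes the point `[0.2]` to its left leaf and returns that leaf's label.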

- Maximally compact
- Only has pure leaves

Can such a tree always achieve zero training error? Yes, if and only if no two input vectors have identical features but different labels.

$S=S_1\cup \dots \cup S_c$, where $S_k$ is the subset of inputs in $S$ with label $k$

Fig: Gini Impurity Function

Gini impurity of a tree: \[G^T(S)=\frac{\left | S_L \right |}{\left | S \right |}G^T(S_L)+\frac{\left | S_R \right |}{\left | S \right |}G^T(S_R)\] where:

- $\left ( S=S_L\cup S_R \right )$
- $S_L\cap S_R=\varnothing$
- $\frac{\left | S_L \right |}{\left | S \right |}\leftarrow \textrm{fraction of inputs in left subtree}$
- $\frac{\left | S_R \right |}{\left | S \right |}\leftarrow \textrm{fraction of inputs in right subtree}$
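As a sketch, the standard set-level Gini impurity $G(S)=\sum_{k=1}^{c}p_k(1-p_k)$ (with $p_k=\frac{|S_k|}{|S|}$) and the weighted split impurity above can be computed as follows; the function names are illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset: G(S) = sum_k p_k (1 - p_k)."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def split_gini(left_labels, right_labels):
    """Weighted Gini impurity of a split, matching the formula above:
    |S_L|/|S| * G(S_L) + |S_R|/|S| * G(S_R)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)
```

A pure leaf has impurity 0, while a 50/50 binary mix has impurity 0.5, the maximum for two classes.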

Note: $KL$-Divergence is not a metric because it is not symmetric, i.e., $KL(p||q)\neq KL(q||p)$.

Let $q_1,\dots,q_c$ be the uniform label distribution, i.e., $q_k=\frac{1}{c}\ \forall k$. \[KL(p||q)=\sum_{k=1}^{c}p_k\log\frac{p_k}{q_k}\geq 0 \leftarrow \textrm{$KL$-Divergence}\] \[=\sum_{k}p_k\log(p_k)-p_k\log(q_k)\textrm{ where }q_k=\frac{1}{c}\] \[=\sum_{k}p_k\log(p_k)+p_k\log(c)\] \[=\sum_{k}p_k\log(p_k)+\log(c)\sum_{k}p_k \textrm{ where } \log(c)\leftarrow\textrm{constant}, \sum_{k}p_k=1\]

\[\max_{p}KL(p||q)=\max_{p}\sum_{k}p_k\log(p_k)\] \[=\min_{p}-\sum_{k}p_k\log(p_k)\] \[=\min_{p}H(p) \leftarrow\textrm{Entropy}\]
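The identity above, $KL(p||q)=\log(c)-H(p)$ for uniform $q$ (so maximizing the KL-divergence from uniform is the same as minimizing entropy), can be checked numerically. This is a small sketch with an arbitrary example distribution:

```python
import math

def kl(p, q):
    """KL(p||q) = sum_k p_k log(p_k / q_k); terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def entropy(p):
    """H(p) = -sum_k p_k log(p_k)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

c = 4
p = [0.7, 0.1, 0.1, 0.1]   # arbitrary example distribution
q = [1 / c] * c            # uniform distribution

# KL(p||q) = log(c) - H(p): maximizing KL over p minimizes H(p)
assert abs(kl(p, q) - (math.log(c) - entropy(p))) < 1e-12
```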

Try all features and all possible splits. Pick the split that minimizes impurity $(\textrm{e.g. } f>t)$ where $f\leftarrow$ feature and $t\leftarrow$ threshold
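The exhaustive search can be sketched as follows. It is a minimal illustration (the `best_split` helper is a hypothetical name): for each feature, candidate thresholds are the midpoints between consecutive distinct values, and the split with the lowest weighted Gini impurity wins.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset: G(S) = sum_k p_k (1 - p_k)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Try every feature f and every midpoint threshold t; return the
    (feature, threshold, impurity) minimizing the weighted Gini impurity."""
    best = (None, None, float("inf"))
    for f in range(len(X[0])):
        values = sorted(set(x[f] for x in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive distinct values
            left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best
```

On the 1-D data `X = [[1], [2], [3], [4]]` with labels `["A", "A", "B", "B"]`, the search returns threshold 2.5 on feature 0 with impurity 0.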

Fig 4: Example XOR

- First split does not improve impurity
- Decision trees are myopic
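The XOR effect is easy to verify directly: on the four XOR points, every axis-aligned split leaves the weighted Gini impurity exactly equal to the parent impurity, so a greedy (myopic) impurity criterion sees no gain from the first split even though two splits classify XOR perfectly. A small self-contained check:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: G(S) = sum_k p_k (1 - p_k)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# XOR data: label is 1 iff exactly one coordinate is 1
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

parent = gini(y)  # 0.5, the maximum for two balanced classes
for f in (0, 1):
    t = 0.5  # the only axis-aligned split for binary features
    left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
    right = [yi for xi, yi in zip(X, y) if xi[f] > t]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    assert weighted == parent  # no single split improves impurity
```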

Fig: CART

- CART trees are very lightweight classifiers
- Very fast during testing
- Usually not competitive in accuracy on their own, but can become very strong through __bagging__ (Random Forests) and __boosting__ (Gradient Boosted Trees)
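To illustrate the bagging idea in miniature, here is a sketch (all names hypothetical, not a real Random Forest implementation): each bootstrap resample trains a depth-1 tree (a decision stump), and predictions are made by majority vote over the ensemble.

```python
import random
from collections import Counter

def stump(X, y):
    """Train a one-split decision stump minimizing weighted Gini impurity.
    Returns a predictor function x -> label."""
    def gini(labels):
        n = len(labels)
        return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

    best = None  # (score, feature, threshold, left_label, right_label)
    for f in range(len(X[0])):
        values = sorted(set(x[f] for x in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # degenerate bag: all inputs identical
        majority = Counter(y).most_common(1)[0][0]
        return lambda x: majority
    _, f, t, ll, rl = best
    return lambda x: ll if x[f] <= t else rl

def bagged_predict(stumps, x):
    """Majority vote over the ensemble -- the core of bagging."""
    return Counter(h(x) for h in stumps).most_common(1)[0][0]

random.seed(0)
X = [(i,) for i in range(10)]
y = [0] * 5 + [1] * 5
ensemble = []
for _ in range(25):  # 25 bootstrap resamples, one stump per resample
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    ensemble.append(stump([X[i] for i in idx], [y[i] for i in idx]))
```

Real Random Forests additionally grow deep trees and subsample features at each split; this sketch only shows the resample-and-vote skeleton.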

Fig: ID3-trees are prone to overfitting as the tree depth increases. The left plot shows the learned decision boundary of a binary data set drawn from two Gaussian distributions. The right plot shows the testing and training errors with increasing tree depth.