- Classification ($y_i \in \{+1,-1\}$); weak learners $h \in \mathbb{H}$ are binary, $h(x_i) \in \{-1,+1\}$, $\forall x_i$
- Perform line-search to obtain best step size
- Loss function: exponential loss $\ell(H) = \sum_{i=1}^{n} e^{-y_i H(x_i)}$
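As a concrete numerical illustration of this loss, here is a minimal NumPy sketch; the arrays `y` and `H` are made-up labels and ensemble scores, not values from the notes:

```python
import numpy as np

# Hypothetical labels y_i in {+1, -1} and current ensemble scores H(x_i)
y = np.array([+1, -1, +1, +1])
H = np.array([0.5, -1.2, -0.3, 2.0])

# Exponential loss: l(H) = sum_i exp(-y_i * H(x_i))
loss = np.sum(np.exp(-y * H))
print(loss)  # each wrongly-signed prediction contributes a term > 1
```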
Finding the best weak learner
Gradient: $\frac{\partial \ell}{\partial H(x_i)} = -y_i \underbrace{e^{-y_i H(x_i)}}_{w_i \text{ is defined from this;}\; >0\;\forall x_i}$
$\underbrace{r_i}_{\text{raw weight}} = e^{-H(x_i)y_i}, \qquad \underbrace{w_i}_{\text{normalized weight}} = \frac{e^{-H(x_i)y_i}}{\underbrace{z}_{\text{normalization, for convenience}}}, \quad \forall x_i$

$z = \sum_{i=1}^{n} e^{-H(x_i)y_i}$ so that $\sum_{i=1}^{n} w_i = 1$
$$\arg\min_h \; -\sum_{i=1}^{n} y_i e^{-H(x_i)y_i} h(x_i) \;=\; \arg\max_h \underbrace{\sum_{i=1}^{n} w_i \underbrace{y_i h(x_i)}_{+1 \text{ if } h(x_i)=y_i,\; -1 \text{ o/w}}}_{\text{this is the training accuracy (up to scaling), weighted by the distribution } w_1, w_2, \dots, w_n}$$
So for AdaBoost, we only need a weak-learning algorithm that takes the training data and a distribution over the training set and returns a classifier $h \in \mathbb{H}$ with weighted training error less than 0.5.
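To make this reduction concrete, here is a small sketch with hypothetical 1-d data, where simple threshold stumps stand in for the class $\mathbb{H}$; it computes the normalized weights $w_i$ and picks the weak learner maximizing the weighted accuracy $\sum_i w_i y_i h(x_i)$:

```python
import numpy as np

def normalized_weights(H, y):
    # w_i = exp(-y_i H(x_i)) / z, with z chosen so that the w_i sum to 1
    r = np.exp(-y * H)          # raw weights r_i
    return r / r.sum()

def best_weak_learner(candidates, X, y, w):
    # pick the h in the pool maximizing the weighted accuracy sum_i w_i y_i h(x_i)
    scores = [np.sum(w * y * h(X)) for h in candidates]
    return candidates[int(np.argmax(scores))]

# Hypothetical 1-d data; threshold stumps stand in for the weak learner class
X = np.array([0.10, 0.30, 0.60, 0.80])
y = np.array([+1, +1, -1, -1])
H = np.zeros_like(X)                              # boosting starts from H = 0
stumps = [lambda X, t=t: np.where(X < t, +1, -1) for t in (0.2, 0.5, 0.7)]

w = normalized_weights(H, y)
h = best_weak_learner(stumps, X, y, w)
```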
Weighted training error: $\epsilon = \sum_{i:\, h(x_i) y_i = -1} w_i$

Condition: for any $w_1, \dots, w_n$ s.t. $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$, $h(D, w_1, \dots, w_n)$ is such that $\epsilon < 0.5$.
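A minimal sketch of this contract, with a hypothetical stump `h` and a uniform distribution, could look as follows; the names are illustrative only:

```python
import numpy as np

def weighted_error(h, X, y, w):
    # epsilon = total weight of the examples the weak learner gets wrong
    return np.sum(w[h(X) * y == -1])

# Hypothetical stump and a uniform distribution over four training points
h = lambda X: np.where(X < 0.5, +1, -1)
X = np.array([0.10, 0.40, 0.35, 0.80])
y = np.array([+1, +1, -1, -1])
w = np.full(4, 0.25)                  # w_i >= 0, sum_i w_i = 1

eps = weighted_error(h, X, y, w)
assert eps < 0.5                      # the weak-learning condition
```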
Finding the stepsize α
(by line search to minimize $\ell$)

Remember: $\epsilon = \sum_{i:\, y_i h(x_i) = -1} w_i$

Choose: $\alpha = \arg\min_{\alpha} \ell(H + \alpha h)$

$= \arg\min_{\alpha} \sum_{i=1}^{n} e^{-y_i[H(x_i) + \alpha h(x_i)]}$

↓ Differentiating w.r.t. $\alpha$ and equating with zero.

$$\sum_{i=1}^{n} e^{-(y_i H(x_i) + \alpha y_i h(x_i))} \cdot \big(-\underbrace{y_i h(x_i)}_{\in\{+1,-1\}}\big) = 0$$

$$-\sum_{i:\, h(x_i)y_i = 1} e^{-(y_i H(x_i) + \alpha \underbrace{y_i h(x_i)}_{1})} + \sum_{i:\, h(x_i)y_i \neq 1} e^{-(y_i H(x_i) + \alpha \underbrace{y_i h(x_i)}_{-1})} = 0 \quad \Big|\; \div \underbrace{\sum_{i=1}^{n} e^{-y_i H(x_i)}}_{\text{normalizer } z}$$

$$-\sum_{i:\, h(x_i)y_i = 1} w_i e^{-\alpha} + \sum_{i:\, h(x_i)y_i \neq 1} w_i e^{+\alpha} = 0$$

$$-(1-\epsilon)e^{-\alpha} + \epsilon e^{+\alpha} = 0$$

$$e^{2\alpha} = \frac{1-\epsilon}{\epsilon}$$

$$\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$$
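As a quick sanity check (not part of the original derivation), the closed-form $\alpha$ can be compared against a brute-force line search over the normalized objective $(1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha}$; the value of $\epsilon$ below is made up:

```python
import numpy as np

def alpha_closed_form(eps):
    # alpha = 1/2 ln((1 - eps) / eps), the minimizer derived above
    return 0.5 * np.log((1.0 - eps) / eps)

def alpha_line_search(eps):
    # brute-force line search over the normalized loss (1-eps) e^{-a} + eps e^{a}
    grid = np.linspace(-5.0, 5.0, 200001)
    losses = (1.0 - eps) * np.exp(-grid) + eps * np.exp(grid)
    return grid[np.argmin(losses)]

eps = 0.2                                  # made-up weighted error (< 0.5)
print(alpha_closed_form(eps))              # ~0.6931
print(alpha_line_search(eps))              # agrees up to the grid resolution
```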
AdaBoost Pseudo-code