In [1]:
using PyPlot

Let's look at using stochastic gradient descent with various step size schemes to optimize logistic regression. First, we'll generate a training set at random from the generative model associated with logistic regression.

In [2]:
# generate the data
srand(424242)
d = 20;
N = 10000;
wtrue = randn(d);
wtrue = d^2 * wtrue / norm(wtrue);
X = randn(N, d);
X ./= sqrt(sum(X.^2, 2));
Y = (1 ./ (1 + exp(-X * wtrue)) .>= rand(N)) * 2 - 1;
gamma = 0.001;

Let's do logistic regression with regularization here. Our objective samples will be of the form

$$f_i(w) = \log \left( 1 + \exp(-y_i x_i^T w) \right) + \frac{\gamma}{2} \| w \|^2$$

and the SGD updates will look like

$$w_{t+1} = w_t + \alpha_t \left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w_t)} - \gamma w_t \right).$$
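Before relying on this update, it's worth checking the sign conventions. Here's a small sketch in Python/NumPy (purely illustrative; `f_i` and all constants here are made up for the check, not part of the notebook) comparing the gradient implied by the update rule against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
gamma = 0.001
w = rng.standard_normal(d)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)   # unit-norm example, as in the data above
y = 1.0

def f_i(w):
    # single-example regularized logistic loss
    return np.log1p(np.exp(-y * (x @ w))) + 0.5 * gamma * (w @ w)

# gradient implied by the update rule: -(y x / (1 + exp(y x^T w)) - gamma w)
grad = -(y * x / (1.0 + np.exp(y * (x @ w))) - gamma * w)

# central finite differences, coordinate by coordinate
eps = 1e-6
fd = np.array([(f_i(w + eps * e) - f_i(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])

print(np.max(np.abs(grad - fd)))   # tiny: the two gradients agree
```

If the sign in front of the log were flipped, this check would fail immediately.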

Let's look at the constants of strong convexity and Lipschitz continuity for this problem, to get a handle on the theoretically optimal parameters. If we differentiate the objective twice, we get

$$\nabla^2 f_i(w) = x_i x_i^T \frac{1}{(1 + \exp(y_i x_i^T w)) (1 + \exp(-y_i x_i^T w))} + \gamma I.$$

It's pretty easy to see that

$$0 < \frac{1}{(1 + \exp(u)) (1 + \exp(-u))} \le \frac{1}{4},$$

and so, since we normalized the examples so that $\| x_i \|^2 = 1$, we can bound

$$\gamma I \preceq \nabla^2 f_i(w) \preceq \left(\gamma + \frac{1}{4} \right) I.$$
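As a quick numerical sanity check of the curvature bound (a NumPy sketch; the grid is illustrative): the factor $1/((1+e^u))(1+e^{-u}))^{-1}$-style term above is the derivative of the sigmoid, maximized at $u = 0$ where it equals $1/4$.

```python
import numpy as np

# evaluate the logistic curvature factor on a wide grid
u = np.linspace(-20.0, 20.0, 100001)
s = 1.0 / ((1.0 + np.exp(u)) * (1.0 + np.exp(-u)))

print(s.max())   # ~0.25, attained near u = 0
print(s.min())   # strictly positive, but tiny in the tails
```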

So we can set $\mu = \gamma$ and $L = \gamma + \frac{1}{4}$. What about bounding the variance of the gradient samples? (Again here I'm using the nonstandard definition of variance for vectors: $\mathbf{Var}(X) = \mathbf{E}[\| X \|^2] - \| \mathbf{E}[ X ] \|^2$.) Well,

\begin{align*} \mathbf{Var}(\nabla f_i(w)) &= \mathbf{Var}\left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} - \gamma w \right) \\ &= \mathbf{Var}\left( \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} \right) \\ &\le \mathbf{E}\left[ \left\| \frac{y_i x_i}{1 + \exp(y_i x_i^T w)} \right\|^2 \right] \\ &\le \mathbf{E}\left[ \left\| x_i \right\|^2 \right] \\ &\le 1 \end{align*}

where the last line follows because we normalized each $x_i$ to have unit norm. So we can set $\sigma = 1$.
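To back this bound up numerically, here's a small Monte Carlo sketch in NumPy (illustrative, not part of the notebook): with unit-norm examples, each gradient-noise term $y_i x_i / (1 + \exp(y_i x_i^T w))$ has squared norm at most 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10000, 20
w = rng.standard_normal(d)

# unit-norm examples and random labels, mirroring the setup above
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.choice([-1.0, 1.0], size=n)

# the non-regularization part of each gradient sample
G = (Y / (1.0 + np.exp(Y * (X @ w))))[:, None] * X
sq_norms = np.sum(G**2, axis=1)

print(sq_norms.max())   # always <= 1: ||x_i|| = 1 and the logistic factor is in (0, 1)
```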

In [3]:
mu = gamma;
L = gamma + 0.25;
sigma = 1;

What is the optimal step size for SGD under these conditions? From Lecture 2, we had

$$\alpha_t = \frac{2 \mu \| w_0 - w^* \|^2}{4 M + \mu^2 \| w_0 - w^* \|^2 t},$$

which we can equivalently write as

$$\alpha_t = \frac{\alpha_0}{1 + \gamma t}, \qquad \text{where} \qquad \alpha_0 = \frac{2 \mu \| w_0 - w^* \|^2}{4 M} \quad \text{and} \quad \gamma = \frac{\mu^2 \| w_0 - w^* \|^2}{4 M}.$$

(Here $M$ is the gradient variance bound, $M = \sigma^2$, and this $\gamma$ is the step-size decay constant from the lecture, not the regularization parameter; the code below reuses the name.)
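The two forms are algebraically identical: dividing the numerator and denominator of the first by $4M$ yields the second. A quick numeric confirmation with placeholder constants (all values here are made up for illustration):

```python
import numpy as np

# placeholder constants; R2 stands in for ||w_0 - w*||^2
mu, M, R2 = 0.001, 1.0, 50.0

alpha0 = 2 * mu * R2 / (4 * M)
decay = mu**2 * R2 / (4 * M)    # the lecture's decay constant (also written gamma)

t = np.arange(100000)
direct = 2 * mu * R2 / (4 * M + mu**2 * R2 * t)
rewritten = alpha0 / (1 + decay * t)

print(np.max(np.abs(direct - rewritten)))   # agree to machine precision
```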

In [4]:
w0 = randn(d);
In [5]:
function sgd_logreg(w0, alpha0, decay, X, Y, gamma, niters, wopt)
    # decay is the step-size decay constant; gamma is the regularization parameter
    w = w0
    (N, d) = size(X)
    dist_to_optimum = zeros(niters)
    for k = 1:niters
        alpha = alpha0 / (1 + decay * (k-1));
        i = rand(1:N)
        xi = X[i,:];
        yi = Y[i];
        w = (1 - alpha * gamma) * w + alpha * xi * yi / (1 + exp(yi * dot(xi, w)));
        dist_to_optimum[k] = norm(w - wopt);
    end
    return (w, dist_to_optimum);
end
Out[5]:
sgd_logreg (generic function with 1 method)
In [6]:
# find the true minimum
function newton_logreg(w0, X, Y, gamma, niters)
    N = size(X, 1);
    d = size(X, 2);
    w = w0;
    for k = 1:niters
        g = -X' * (Y ./ (1 + exp(Y .* (X * w)))) + N * gamma * w;
        H = X' * ((1 ./ ((1 + exp(Y .* (X * w))) .* (1 + exp(-Y .* (X * w))))) .* X) + N * gamma * eye(d);
        w = w - H \ g;
        println("gradient norm: $(norm(g))")
    end
    return w
end
Out[6]:
newton_logreg (generic function with 1 method)
In [7]:
wopt = newton_logreg(wtrue, X, Y, gamma, 10);
gradient norm: 3999.9774577369785
gradient norm: 896.3972587628689
gradient norm: 233.89715512117843
gradient norm: 59.370770718984
gradient norm: 5.959623034633453
gradient norm: 0.06910877657919792
gradient norm: 9.433183413409561e-6
gradient norm: 2.777799306773163e-13
gradient norm: 6.563211845551944e-14
gradient norm: 4.618954771558417e-14
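The gradient norms above shrink quadratically once the iterate is near the optimum, as expected for Newton's method. For readers following along in Python, here's a sketch of the same iteration in NumPy on a smaller synthetic problem (all sizes and seeds here are made up; this is not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, gamma = 500, 5, 0.01

# small synthetic logistic-regression dataset with unit-norm rows
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = np.where(1 / (1 + np.exp(-X @ w_true)) >= rng.random(n), 1.0, -1.0)

w = np.zeros(d)
for k in range(10):
    m = Y * (X @ w)
    g = -X.T @ (Y / (1 + np.exp(m))) + n * gamma * w     # gradient of the summed objective
    s = 1 / ((1 + np.exp(m)) * (1 + np.exp(-m)))         # per-example curvature weights
    H = X.T @ (s[:, None] * X) + n * gamma * np.eye(d)   # Hessian
    w = w - np.linalg.solve(H, g)
    print("gradient norm:", np.linalg.norm(g))
```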
In [8]:
M = sigma^2;  # gradient variance bound from above
alpha0 = 2 * mu * norm(w0 - wopt)^2 / (4 * M);
gamma = mu^2 * norm(w0 - wopt)^2 / (4 * M);
In [9]:
srand(123456);
(w, dto) = sgd_logreg(w0, alpha0, gamma, X, Y, gamma, 50000, wopt);
In [10]:
plot(dto)
xlabel("iteration");
ylabel("distance to optimum");
In [11]:
srand(123456);
(w2, dto2) = sgd_logreg(w0, 2*alpha0, 4*gamma, X, Y, gamma, 50000, wopt);
srand(123456);
(w3, dto3) = sgd_logreg(w0, 3*alpha0, 9*gamma, X, Y, gamma, 50000, wopt);
In [12]:
semilogy(dto; label = "optimal step size")
semilogy(dto2; label = "2x optimal")
semilogy(dto3; label = "3x optimal")
xlabel("iteration");
ylabel("distance to optimum");
legend();

Which scaling of the step size gives the best result after 20000 iterations?

In [13]:
etas = exp(collect(-1:0.05:3));
dists = [ sgd_logreg(w0, eta*alpha0, eta^2*gamma, X, Y, gamma, 20000, wopt)[2][end] for eta in etas ];
In [14]:
loglog(etas, dists);
scatter(etas[21], dists[21]);
imin = indmin(dists);
scatter(etas[imin], dists[imin]; color="red");
xlabel("step size scaled by");
ylabel("distance to optimum after 20000 steps");

Takeaway: the theoretically optimal step size gave us something good, but not the best.

In [ ]: