In [1]:
using PyPlot
using Statistics
using LinearAlgebra

┌ Warning: PyPlot is using tkagg backend, which is known to cause crashes on MacOS (#410); use the MPLBACKEND environment variable to request a different backend.
└ @ PyPlot /Users/cdesa/.julia/packages/PyPlot/4wzW1/src/init.jl:192


### Demonstration of the usefulness of subsampling.

Let's consider a setting in which we are using 0-1 loss for our empirical risk, and imagine that our error rate is $p = 0.3$ over the whole dataset of $n = 1000000$ examples. Without loss of generality, suppose that the first $30\%$ of the examples are errors and the remainder are not. We can construct the losses of these examples as follows.

In [2]:
n = 1000000;
p = 0.3;
L = vcat(ones(Int64(n * p)), zeros(n - Int64(n * p)));

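As a quick sanity check (not in the original notebook), the mean of this loss vector is exactly $p$, since exactly $np$ of the $n$ losses are $1$:

```julia
using Statistics

# construct losses: the first 30% of examples are errors (loss 1), the rest are correct (loss 0)
n = 1000000;
p = 0.3;
L = vcat(ones(Int64(n * p)), zeros(n - Int64(n * p)));

# the full-dataset empirical risk is exactly p
println(mean(L))   # 0.3
```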

Next, let's draw some random variables $Z_k$, each a loss sampled uniformly at random from the dataset, so that each $Z_k$ is an unbiased sample of the empirical risk.

In [3]:
Kmax = 100000;
Z = [rand(L) for k = 1:Kmax];


Next, we compute the partial averages $$S_K = \frac{1}{K} \sum_{k=1}^K Z_k.$$

In [4]:
S = cumsum(Z) ./ (1:Kmax);

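Before plotting, we can predict how fast $S_K$ should converge: the $Z_k$ are i.i.d. $\operatorname{Bernoulli}(p)$, so $S_K$ has standard deviation $\sqrt{p(1-p)/K}$. A small sketch of this standard variance calculation (not part of the original notebook):

```julia
# predicted standard deviation of S_K for a few subsample sizes K:
# Var(Z_k) = p(1-p) for Bernoulli losses, and Var(S_K) = p(1-p)/K
p = 0.3;
for K in (100, 1000, 100000)
    println("K = $K: predicted std of S_K ≈ $(sqrt(p * (1 - p) / K))")
end
```

These predictions (about 0.046, 0.014, and 0.0014) match the order of magnitude of the errors observed in the plots below.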

Now we can plot this average and see how it changes as we increase $K$.

In [5]:
Kplot = 100;
plot(collect(1:Kplot), S[1:Kplot]; label="average");
plot(collect(1:Kplot), p * ones(Kplot), "k:"; label="true empirical risk");
legend();
show();

# what's the error at the end?
println("true  empirical  risk: $p");
println("approx empirical risk: $(S[Kplot])");
println("error                : $(abs(S[Kplot]-p))");

true  empirical  risk: 0.3
approx empirical risk: 0.27
error                : 0.02999999999999997

In [6]:
Kplot = 1000;
plot(collect(1:Kplot), S[1:Kplot]; label="average");
plot(collect(1:Kplot), p * ones(Kplot), "k:"; label="true empirical risk");
legend();
show();

# what's the error at the end?
println("true  empirical  risk: $p");
println("approx empirical risk: $(S[Kplot])");
println("error                : $(abs(S[Kplot]-p))");

true  empirical  risk: 0.3
approx empirical risk: 0.282
error                : 0.018000000000000016

In [7]:
Kplot = Kmax;
plot(collect(1:Kplot), S[1:Kplot]; label="average");
plot(collect(1:Kplot), p * ones(Kplot), "k:"; label="true empirical risk");
legend();
show();

# what's the error at the end?
println("true  empirical  risk: $p");
println("approx empirical risk: $(S[Kplot])");
println("error                : $(abs(S[Kplot]-p))");

true  empirical  risk: 0.3
approx empirical risk: 0.30075
error                : 0.0007500000000000284

## How long does it take to do subsampling for empirical risk?

Let's choose some $x$ and $y$ completely at random, and evaluate the hypothesis $$h_w(x) = \operatorname{sign}(x^T w).$$

In [8]:
n = 1000000;
d = 256;
Xs = [randn(d) for i=1:n];
w = randn(d);
Ys = [sign(rand([-1.0,0.9,1.0,1.1])*sign(dot(Xs[i],w))) for i=1:n];

# rand([-1.0,0.9,1.0,1.1]) flips the label's sign with probability 1/4, so the error should be about 25%

In [9]:
function total_error(Xs::Array{Array{Float64,1},1}, Ys::Array{Float64,1}, w::Array{Float64,1})
    n = length(Ys);
    return mean(sign(dot(Xs[i],w)) != Ys[i] for i = 1:n);
end

Out[9]:
total_error (generic function with 1 method)
In [10]:
# warm up the compiler
total_error(Xs, Ys, w);
total_error(Xs, Ys, w);

@time total_error(Xs, Ys, w)

  0.166619 seconds (7 allocations: 240 bytes)

Out[10]:
0.250302
In [11]:
function estimate_error(Xs::Array{Array{Float64,1},1}, Ys::Array{Float64,1}, w::Array{Float64,1}, K::Int64)
    n = length(Ys);
    return mean(sign(dot(Xs[i],w)) != Ys[i] for i = rand(1:n, K));
end

Out[11]:
estimate_error (generic function with 1 method)
In [22]:
# warm up the compiler
estimate_error(Xs, Ys, w, 1000);
estimate_error(Xs, Ys, w, 1000);

@time estimate_error(Xs, Ys, w, 1000)

  0.004707 seconds (8 allocations: 8.172 KiB)

Out[22]:
0.261
In [23]:
@time estimate_error(Xs, Ys, w, 10000)

  0.047278 seconds (9 allocations: 78.438 KiB)

Out[23]:
0.2441
In [24]:
@time estimate_error(Xs, Ys, w, 100000)

  0.623598 seconds (9 allocations: 781.563 KiB)

Out[24]:
0.24941
In [25]:
@time estimate_error(Xs, Ys, w, 1000000)

  2.245438 seconds (9 allocations: 7.630 MiB)

Out[25]:
0.249732
In [27]:
50*log(200)

Out[27]:
264.9158683274018