Lecture 18: Memory and Memory Locality

CS4787 — Principles of Large-Scale Machine Learning Systems


  • Over the past two lectures, we've been talking about parallel computing in machine learning, which allows us to take advantage of the parallel capabilities of our hardware to substantially speed up training and inference.

  • This is an instance of the general principle: Use algorithms that fit your hardware, and use hardware that fits your algorithms.

  • But compute is only half the story of making algorithms that fit the hardware.

  • How data is stored and accessed can be just as important as how it is processed.

  • This is especially the case for machine learning tasks, which often run on very large datasets that can push the limits of the memory subsystem of the hardware.

Today, we'll be talking about how memory affects the performance of the machine learning pipeline.

How do modern CPUs handle memory?

CPUs have a deep cache hierarchy. In fact, many CPUs are mostly cache by area.

The motivation for this was the ever-increasing gap between the speed at which the arithmetic units on the CPU could execute instructions and the time it took to read/write data to system memory.

Without some faster cache to temporarily store data, the performance of the CPU would be bottlenecked by the cost of reading and/or writing to RAM after every instruction.

A simplified view of memory on a CPU

This is what the "shared memory" programming model sees.

But CPUs also have caches

Caches are small and fast memories that are located physically on the CPU chip, and which mirror data stored in RAM so that it can be accessed more quickly by the CPU.

The usual setup of memory on a CPU

  • a fast L1 cache (typically about 32KB) on each core

  • a somewhat slower, but larger L2 cache (e.g. 256 KB) on each core

  • an even slower and even larger L3 cache (e.g. 2 MB/core) shared among cores

  • DRAM — off-chip memory

  • Persistent storage — a hard disk or flash drive
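To get a feel for these sizes, the short sketch below counts how many numbers fit at each cache level. It uses the example sizes from the list above, and it assumes 8-byte `Float64` entries and 1 KB = 1024 bytes. Both are assumptions for illustration, and real cache sizes vary by CPU.

```julia
# Example cache sizes from the list above (assumed 1 KB = 1024 bytes).
cache_bytes = Dict("L1" => 32 * 1024, "L2" => 256 * 1024, "L3 (per core)" => 2 * 1024^2)
elem = sizeof(Float64)   # assumption: 8-byte matrix entries

# Print the levels from smallest to largest.
for (name, bytes) in sort(collect(cache_bytes); by = last)
    println(name, ": room for ", bytes ÷ elem, " Float64 values")
end

# A 2048 x 2048 matrix (the size used in this lecture's demo) under the same assumption.
matrix_bytes = 2048 * 2048 * elem
println("2048 x 2048 matrix: ", matrix_bytes ÷ 1024^2, " MB")
```

Under these assumptions a single such matrix is far larger than the L1 and L2 caches, so most of it must live further from the core at any given time.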

A model of a multi-socket computer

Multiple CPU chips on the same motherboard communicate with each other through physical connections on the motherboard.

The full view across multiple machines

Multiple machines communicate with each other over a network.

One important thing to notice here:

As we zoom out, much more of this diagram is "memory" boxes than compute boxes.

Hand-wavy consequence: as we scale up, the effect of memory becomes more and more important.

Another important take-away:

Memory has a hierarchical structure

  • Memories lower in the hierarchy are faster, but smaller

  • Memories higher in the hierarchy are larger, but slower, and are often shared among many compute units

There are two ways to measure the performance of a part of the memory hierarchy:

  • Latency: how much time does it take to access data at a new address in memory?

  • Throughput (a.k.a. bandwidth): how much data total can we access in a given length of time?

We saw these metrics earlier when evaluating the effect of parallelism.

Ideally, we'd like all of our memory accesses to go to the fast L1 cache, since it has high throughput and low latency.

What prevents this from happening in a practical program?

The L1 cache is small, so it cannot hold all the data a practical program touches.

Result: the hardware needs to decide what is stored in the cache at any given time.

It wants to avoid, as much as possible, a situation in which the processor needs to access data that's not stored in the cache—this is called a cache miss.

Hardware uses two heuristics:

  • The principle of temporal locality: if a location in memory is accessed, it is likely that that location will be accessed again in the near future.

  • The principle of spatial locality: if a location in memory is accessed, it is likely that other nearby locations will be accessed in the near future.

Memory Locality

Temporal locality and spatial locality are both types of memory locality.

  • We say that a program has good temporal locality, spatial locality, or (more generally) memory locality when its memory accesses conform to these heuristics.

  • When a program has good memory locality, it makes good use of the caches available on the hardware.

  • In practice, the throughput of a program is often substantially affected by the cache, and can be improved by increasing locality.
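As a small illustration of spatial locality, the sketch below sums the entries of a matrix in two different orders. Julia stores matrices in column-major order, so walking down a column touches adjacent memory, while walking along a row jumps by one full column each step. The function names and the matrix size are chosen here for illustration only, and the measured times will depend on the machine.

```julia
# Inner loop walks down a column: consecutive addresses in Julia's column-major layout.
function sum_colwise(X)
    s = zero(eltype(X))
    for j in axes(X, 2), i in axes(X, 1)
        s += X[i, j]
    end
    return s
end

# Inner loop walks along a row: each step jumps size(X, 1) elements ahead in memory.
function sum_rowwise(X)
    s = zero(eltype(X))
    for i in axes(X, 1), j in axes(X, 2)
        s += X[i, j]
    end
    return s
end

X = rand(2000, 2000)
@assert sum_colwise(X) ≈ sum_rowwise(X)   # same result; this call also compiles both functions

t_col = @elapsed sum_colwise(X)
t_row = @elapsed sum_rowwise(X)
println("column-wise: ", t_col, " s, row-wise: ", t_row, " s")
```

Both functions do exactly the same arithmetic, so any difference in the measured times comes from the order of memory accesses.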

Prefetching

A third important heuristic used by both the hardware and the compiler to improve cache performance is prefetching.

Prefetching loads data into the cache before it is ever accessed, and is particularly useful when the program or the hardware can predict what memory will be used ahead of time.

Question: What can we do in the ML pipeline to increase locality and/or enable prefetching?

  • Access training examples in the order they appear in memory

    • Prefetch the training examples (prefetch the ones we're going to be using next)
  • Efficient matrix multiplications

DEMO

A matrix multiply of $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, producing output $C \in \mathbb{R}^{m \times p}$, can be written by running

$$ C_{i,k} \mathrel{+}= A_{i,j} \cdot B_{j,k} $$

for each value of $i \in \{1, \ldots, m\}$, $j \in \{1, \ldots, n\}$, and $k \in \{1, \ldots, p\}$. The natural way to do this is with three for loops. But what order should we run these loops? And how does the way we store $A$, $B$, and $C$ affect performance?
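To make the two questions concrete, here is a minimal sketch of one of these variants written directly in Julia. It is not the C code used in the demo: the helpers `flat_index`, `to_flat`, and `matmul_ijk!` are hypothetical names introduced here, and the matrices are stored as flat buffers so that row-major and column-major layouts can both be expressed.

```julia
# 1-based position of element (i, j) of a rows x cols matrix inside a flat buffer.
# major == "r" means row-major storage; major == "c" means column-major storage.
function flat_index(i, j, rows, cols, major)
    return major == "r" ? (i - 1) * cols + j : (j - 1) * rows + i
end

# C += A * B with loops nested in the order i, j, k (k innermost), on flat buffers.
function matmul_ijk!(C, A, B, m, n, p, Amaj, Bmaj, Cmaj)
    for i in 1:m, j in 1:n, k in 1:p
        C[flat_index(i, k, m, p, Cmaj)] +=
            A[flat_index(i, j, m, n, Amaj)] * B[flat_index(j, k, n, p, Bmaj)]
    end
    return C
end

# Flatten an ordinary (column-major) Julia matrix into the requested layout.
to_flat(M, major) = major == "r" ? vec(permutedims(M)) : vec(copy(M))

m, n, p = 4, 3, 5
A = rand(m, n)
B = rand(n, p)
C = matmul_ijk!(zeros(m * p), to_flat(A, "c"), to_flat(B, "r"), m, n, p, "c", "r", "r")

# C is stored row-major, so view it as p x m and transpose to compare with A * B.
@assert reshape(C, p, m)' ≈ A * B
```

In this loop order the innermost index is $k$, so each inner loop walks along one row of $B$ and one row of $C$ while the entry of $A$ stays fixed. When $B$ and $C$ are row-major, those walks touch adjacent memory. Changing the loop order or the layouts changes which matrices are walked contiguously, and that is what the demo measures.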

In [1]:
using Libdl # open a dynamic library that links to the C code
tm_lib = Libdl.dlopen("demo/test_memory.lib");
In [2]:
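# Run one compiled matrix-multiply variant from the C library and return its average time in seconds.
#   loop_order:        which of the six orderings of the i, j, k loops to use
#   Amaj, Bmaj, Cmaj:  "r" (row-major) or "c" (column-major) storage for A, B, and C
#   m, n, p:           matrix dimensions; num_runs: number of timed repetitions
function test_mmpy(loop_order::String, Amaj::String, Bmaj::String, Cmaj::String, m, n, p, num_runs)
    @assert(loop_order in ["ijk","ikj","jki","jik","kij","kji"])
    @assert(Amaj in ["r","c"])
    @assert(Bmaj in ["r","c"])
    @assert(Cmaj in ["r","c"])
    # look up the C function for this combination by name, e.g. test_ijk_ArBrCr
    f = Libdl.dlsym(tm_lib, "test_$(loop_order)_A$(Amaj)B$(Bmaj)C$(Cmaj)")
    ccall(f, Float64, (Int32, Int32, Int32, Int32), m, n, p, num_runs)
end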
Out[2]:
test_mmpy (generic function with 1 method)
In [6]:
measurements = []
for loop_order in ["ijk","ikj","jki","jik","kij","kji"]
    for Am in ["r","c"]
        for Bm in ["r","c"]
            for Cm in ["r","c"]
                push!(measurements, (loop_order * "_" * Am * Bm * Cm, test_mmpy(loop_order, Am, Bm, Cm, 2048, 2048, 2048, 10)));
            end
        end
    end
end
time elapsed: 6.007000 seconds
time elapsed: 7.026000 seconds
time elapsed: 7.165000 seconds
time elapsed: 7.245000 seconds
time elapsed: 6.463000 seconds
time elapsed: 6.854000 seconds
time elapsed: 7.078000 seconds
time elapsed: 7.001000 seconds
time elapsed: 7.151000 seconds
time elapsed: 7.034000 seconds

average time: 6.902400 seconds

digest: 9.913091e+28
time elapsed: 57.874000 seconds
time elapsed: 58.401000 seconds
time elapsed: 58.862000 seconds
time elapsed: 58.836000 seconds
time elapsed: 58.331000 seconds
time elapsed: 58.038000 seconds
time elapsed: 57.944000 seconds
time elapsed: 58.647000 seconds
time elapsed: 58.112000 seconds
time elapsed: 58.226000 seconds

average time: 58.327100 seconds

digest: 9.889404e+28
time elapsed: 52.433000 seconds
time elapsed: 52.296000 seconds
time elapsed: 52.559000 seconds
time elapsed: 52.793000 seconds
time elapsed: 52.616000 seconds
time elapsed: 52.493000 seconds
time elapsed: 52.534000 seconds
time elapsed: 52.436000 seconds
time elapsed: 52.502000 seconds
time elapsed: 52.234000 seconds

average time: 52.489600 seconds

digest: 9.929104e+28
time elapsed: 123.728000 seconds
time elapsed: 148.349000 seconds
time elapsed: 148.852000 seconds
time elapsed: 149.856000 seconds
time elapsed: 137.153000 seconds
time elapsed: 144.027000 seconds
time elapsed: 145.449000 seconds
time elapsed: 141.324000 seconds
time elapsed: 139.429000 seconds
time elapsed: 143.682000 seconds

average time: 142.184900 seconds

digest: 9.913388e+28
time elapsed: 4.034000 seconds
time elapsed: 4.049000 seconds
time elapsed: 4.062000 seconds
time elapsed: 4.063000 seconds
time elapsed: 7.013000 seconds
time elapsed: 7.461000 seconds
time elapsed: 7.373000 seconds
time elapsed: 7.496000 seconds
time elapsed: 7.115000 seconds
time elapsed: 7.431000 seconds

average time: 6.009700 seconds

digest: 9.889719e+28
time elapsed: 57.338000 seconds
time elapsed: 58.313000 seconds
time elapsed: 57.769000 seconds
time elapsed: 57.690000 seconds
time elapsed: 57.526000 seconds
time elapsed: 57.613000 seconds
time elapsed: 57.506000 seconds
time elapsed: 56.667000 seconds
time elapsed: 57.112000 seconds
time elapsed: 58.553000 seconds

average time: 57.608700 seconds

digest: 9.890267e+28
time elapsed: 52.584000 seconds
time elapsed: 52.313000 seconds
time elapsed: 52.301000 seconds
time elapsed: 52.648000 seconds
time elapsed: 52.670000 seconds
time elapsed: 52.676000 seconds
time elapsed: 51.665000 seconds
time elapsed: 54.121000 seconds
time elapsed: 52.652000 seconds
time elapsed: 52.749000 seconds

average time: 52.637900 seconds

digest: 9.908002e+28
time elapsed: 136.274000 seconds
time elapsed: 124.671000 seconds
time elapsed: 125.077000 seconds
time elapsed: 125.358000 seconds
time elapsed: 130.296000 seconds
time elapsed: 110.085000 seconds
time elapsed: 109.068000 seconds
time elapsed: 108.855000 seconds
time elapsed: 118.505000 seconds
time elapsed: 139.920000 seconds

average time: 122.810900 seconds

digest: 9.874835e+28
time elapsed: 64.635000 seconds
time elapsed: 58.400000 seconds
time elapsed: 57.860000 seconds
time elapsed: 57.527000 seconds
time elapsed: 57.453000 seconds
time elapsed: 56.038000 seconds
time elapsed: 57.184000 seconds
time elapsed: 59.682000 seconds
time elapsed: 54.481000 seconds
time elapsed: 54.470000 seconds

average time: 57.773000 seconds

digest: 9.899344e+28
time elapsed: 55.857000 seconds
time elapsed: 54.812000 seconds
time elapsed: 56.501000 seconds
time elapsed: 57.090000 seconds
time elapsed: 53.701000 seconds
time elapsed: 55.528000 seconds
time elapsed: 55.967000 seconds
time elapsed: 55.966000 seconds
time elapsed: 802.586000 seconds
time elapsed: 54.014000 seconds

average time: 130.202200 seconds

digest: 9.904967e+28
time elapsed: 8.863000 seconds
time elapsed: 9.780000 seconds
time elapsed: 10.773000 seconds
time elapsed: 10.762000 seconds
time elapsed: 10.765000 seconds
time elapsed: 10.760000 seconds
time elapsed: 10.771000 seconds
time elapsed: 10.761000 seconds
time elapsed: 10.774000 seconds
time elapsed: 10.770000 seconds

average time: 10.477900 seconds

digest: 9.902124e+28
time elapsed: 13.371000 seconds
time elapsed: 13.385000 seconds
time elapsed: 13.362000 seconds
time elapsed: 13.449000 seconds
time elapsed: 13.447000 seconds
time elapsed: 13.469000 seconds
time elapsed: 13.460000 seconds
time elapsed: 13.477000 seconds
time elapsed: 13.476000 seconds
time elapsed: 13.442000 seconds

average time: 13.433800 seconds

digest: 9.911482e+28
time elapsed: 137.951000 seconds
time elapsed: 143.653000 seconds
time elapsed: 142.328000 seconds
time elapsed: 143.088000 seconds
time elapsed: 138.995000 seconds
time elapsed: 144.829000 seconds
time elapsed: 150.038000 seconds
time elapsed: 152.158000 seconds
time elapsed: 149.456000 seconds
time elapsed: 153.029000 seconds

average time: 145.552500 seconds

digest: 9.865839e+28
time elapsed: 148.099000 seconds
time elapsed: 142.020000 seconds
time elapsed: 137.476000 seconds
time elapsed: 125.487000 seconds
time elapsed: 127.121000 seconds
time elapsed: 133.665000 seconds
time elapsed: 133.965000 seconds
time elapsed: 146.447000 seconds
time elapsed: 145.243000 seconds
time elapsed: 150.609000 seconds

average time: 139.013200 seconds

digest: 9.899728e+28
time elapsed: 49.672000 seconds
time elapsed: 49.669000 seconds
time elapsed: 47.744000 seconds
time elapsed: 45.777000 seconds
time elapsed: 46.896000 seconds
time elapsed: 46.243000 seconds
time elapsed: 45.074000 seconds
time elapsed: 45.709000 seconds
time elapsed: 45.799000 seconds
time elapsed: 46.931000 seconds

average time: 46.951400 seconds

digest: 9.877856e+28
time elapsed: 47.501000 seconds
time elapsed: 47.470000 seconds
time elapsed: 47.574000 seconds
time elapsed: 115.393000 seconds
time elapsed: 47.998000 seconds
time elapsed: 47.587000 seconds
time elapsed: 48.171000 seconds
time elapsed: 50.004000 seconds
time elapsed: 47.553000 seconds
time elapsed: 47.666000 seconds

average time: 54.691700 seconds

digest: 9.899156e+28
time elapsed: 148.331000 seconds
time elapsed: 148.340000 seconds
time elapsed: 149.073000 seconds
time elapsed: 143.828000 seconds
time elapsed: 7348.777000 seconds
time elapsed: 29811.908000 seconds
time elapsed: 93.905000 seconds
time elapsed: 133.805000 seconds
time elapsed: 132.391000 seconds
time elapsed: 128.146000 seconds

average time: 3823.850400 seconds

digest: 9.903024e+28
time elapsed: 52.062000 seconds
time elapsed: 51.434000 seconds
time elapsed: 50.931000 seconds
time elapsed: 52.549000 seconds
time elapsed: 52.310000 seconds
time elapsed: 52.921000 seconds
time elapsed: 53.034000 seconds
time elapsed: 52.969000 seconds
time elapsed: 53.119000 seconds
time elapsed: 53.178000 seconds

average time: 52.450700 seconds

digest: 9.896969e+28
time elapsed: 110.639000 seconds
time elapsed: 112.696000 seconds
time elapsed: 113.396000 seconds
time elapsed: 110.950000 seconds
time elapsed: 109.240000 seconds
time elapsed: 111.902000 seconds
time elapsed: 109.641000 seconds
time elapsed: 110.748000 seconds
time elapsed: 111.543000 seconds
time elapsed: 111.693000 seconds

average time: 111.244800 seconds

digest: 9.901374e+28
time elapsed: 53.233000 seconds
time elapsed: 52.279000 seconds
time elapsed: 52.989000 seconds
time elapsed: 53.125000 seconds
time elapsed: 53.359000 seconds
time elapsed: 53.110000 seconds
time elapsed: 53.450000 seconds
time elapsed: 53.532000 seconds
time elapsed: 53.166000 seconds
time elapsed: 53.112000 seconds

average time: 53.135500 seconds

digest: 9.901417e+28
time elapsed: 59.131000 seconds
time elapsed: 59.767000 seconds
time elapsed: 61.889000 seconds
time elapsed: 61.672000 seconds
time elapsed: 62.681000 seconds
time elapsed: 63.191000 seconds
time elapsed: 62.883000 seconds
time elapsed: 65.805000 seconds
time elapsed: 65.429000 seconds
time elapsed: 63.383000 seconds

average time: 62.583100 seconds

digest: 9.899146e+28
time elapsed: 22.581000 seconds
time elapsed: 22.567000 seconds
time elapsed: 22.553000 seconds
time elapsed: 22.545000 seconds
time elapsed: 22.523000 seconds
time elapsed: 22.563000 seconds
time elapsed: 22.484000 seconds
time elapsed: 22.514000 seconds
time elapsed: 22.484000 seconds
time elapsed: 22.515000 seconds

average time: 22.532900 seconds

digest: 9.896839e+28
time elapsed: 65.569000 seconds
time elapsed: 60.487000 seconds
time elapsed: 61.836000 seconds
time elapsed: 60.691000 seconds
time elapsed: 61.870000 seconds
time elapsed: 64.449000 seconds
time elapsed: 61.099000 seconds
time elapsed: 63.669000 seconds
time elapsed: 63.313000 seconds
time elapsed: 65.083000 seconds

average time: 62.806600 seconds

digest: 9.903770e+28
time elapsed: 20.904000 seconds
time elapsed: 20.933000 seconds
time elapsed: 20.946000 seconds
time elapsed: 20.995000 seconds
time elapsed: 20.987000 seconds
time elapsed: 20.869000 seconds
time elapsed: 20.882000 seconds
time elapsed: 20.960000 seconds
time elapsed: 20.941000 seconds
time elapsed: 21.027000 seconds

average time: 20.944400 seconds

digest: 9.912550e+28
time elapsed: 6.346000 seconds
time elapsed: 5.742000 seconds
time elapsed: 6.100000 seconds
time elapsed: 6.613000 seconds
time elapsed: 6.268000 seconds
time elapsed: 6.787000 seconds
time elapsed: 6.342000 seconds
time elapsed: 6.713000 seconds
time elapsed: 7.260000 seconds
time elapsed: 6.820000 seconds

average time: 6.499100 seconds

digest: 9.910695e+28
time elapsed: 65.439000 seconds
time elapsed: 64.723000 seconds
time elapsed: 62.983000 seconds
time elapsed: 62.821000 seconds
time elapsed: 64.334000 seconds
time elapsed: 65.543000 seconds
time elapsed: 65.290000 seconds
time elapsed: 65.686000 seconds
time elapsed: 65.220000 seconds
time elapsed: 65.460000 seconds

average time: 64.749900 seconds

digest: 9.898595e+28
time elapsed: 49.621000 seconds
time elapsed: 49.303000 seconds
time elapsed: 49.686000 seconds
time elapsed: 49.347000 seconds
time elapsed: 49.043000 seconds
time elapsed: 49.175000 seconds
time elapsed: 49.026000 seconds
time elapsed: 48.630000 seconds
time elapsed: 49.188000 seconds
time elapsed: 48.858000 seconds

average time: 49.187700 seconds

digest: 9.901914e+28
time elapsed: 133.349000 seconds
time elapsed: 133.380000 seconds
time elapsed: 139.028000 seconds
time elapsed: 136.188000 seconds
time elapsed: 138.605000 seconds
time elapsed: 138.885000 seconds
time elapsed: 129.199000 seconds
time elapsed: 128.709000 seconds
time elapsed: 128.203000 seconds
time elapsed: 135.526000 seconds

average time: 134.107200 seconds

digest: 9.906016e+28
time elapsed: 6.703000 seconds
time elapsed: 6.473000 seconds
time elapsed: 7.473000 seconds
time elapsed: 7.743000 seconds
time elapsed: 7.139000 seconds
time elapsed: 7.070000 seconds
time elapsed: 7.265000 seconds
time elapsed: 7.170000 seconds
time elapsed: 7.421000 seconds
time elapsed: 7.566000 seconds

average time: 7.202300 seconds

digest: 9.922074e+28
time elapsed: 66.483000 seconds
time elapsed: 65.114000 seconds
time elapsed: 66.024000 seconds
time elapsed: 65.756000 seconds
time elapsed: 66.441000 seconds
time elapsed: 66.330000 seconds
time elapsed: 65.723000 seconds
time elapsed: 65.738000 seconds
time elapsed: 66.469000 seconds
time elapsed: 67.455000 seconds

average time: 66.153300 seconds

digest: 9.911091e+28
time elapsed: 49.305000 seconds
time elapsed: 49.267000 seconds
time elapsed: 49.222000 seconds
time elapsed: 49.046000 seconds
time elapsed: 47.291000 seconds
time elapsed: 47.363000 seconds
time elapsed: 47.355000 seconds
time elapsed: 483.112000 seconds
time elapsed: 43.719000 seconds
time elapsed: 46.825000 seconds

average time: 91.250500 seconds

digest: 9.926025e+28
time elapsed: 137.950000 seconds
time elapsed: 137.705000 seconds
time elapsed: 236.334000 seconds
time elapsed: 126.361000 seconds
time elapsed: 132.557000 seconds
time elapsed: 132.088000 seconds
time elapsed: 132.368000 seconds
time elapsed: 131.702000 seconds
time elapsed: 132.571000 seconds
time elapsed: 127.087000 seconds

average time: 142.672300 seconds

digest: 9.907281e+28
time elapsed: 46.504000 seconds
time elapsed: 46.474000 seconds
time elapsed: 46.635000 seconds
time elapsed: 46.551000 seconds
time elapsed: 46.413000 seconds
time elapsed: 46.247000 seconds
time elapsed: 46.535000 seconds
time elapsed: 46.775000 seconds
time elapsed: 47.474000 seconds
time elapsed: 43.336000 seconds

average time: 46.294400 seconds

digest: 9.926747e+28
time elapsed: 47.603000 seconds
time elapsed: 47.411000 seconds
time elapsed: 47.378000 seconds
time elapsed: 47.642000 seconds
time elapsed: 47.545000 seconds
time elapsed: 47.588000 seconds
time elapsed: 47.684000 seconds
time elapsed: 46.191000 seconds
time elapsed: 47.667000 seconds
time elapsed: 47.829000 seconds

average time: 47.453800 seconds

digest: 9.912329e+28
time elapsed: 10.220000 seconds
time elapsed: 10.226000 seconds
time elapsed: 10.220000 seconds
time elapsed: 10.230000 seconds
time elapsed: 10.226000 seconds
time elapsed: 10.212000 seconds
time elapsed: 10.146000 seconds
time elapsed: 10.903000 seconds
time elapsed: 10.604000 seconds
time elapsed: 10.549000 seconds

average time: 10.353600 seconds

digest: 9.911104e+28
time elapsed: 14.032000 seconds
time elapsed: 13.381000 seconds
time elapsed: 13.336000 seconds
time elapsed: 13.134000 seconds
time elapsed: 12.997000 seconds
time elapsed: 12.781000 seconds
time elapsed: 12.807000 seconds
time elapsed: 12.793000 seconds
time elapsed: 12.811000 seconds
time elapsed: 12.784000 seconds

average time: 13.085600 seconds

digest: 9.887391e+28
time elapsed: 143.643000 seconds
time elapsed: 141.039000 seconds
time elapsed: 138.597000 seconds
time elapsed: 142.200000 seconds
time elapsed: 147.193000 seconds
time elapsed: 131.053000 seconds
time elapsed: 138.224000 seconds
time elapsed: 140.886000 seconds
time elapsed: 139.012000 seconds
time elapsed: 149.055000 seconds

average time: 141.090200 seconds

digest: 9.901493e+28
time elapsed: 140.788000 seconds
time elapsed: 142.897000 seconds
time elapsed: 139.818000 seconds
time elapsed: 107.175000 seconds
time elapsed: 108.034000 seconds
time elapsed: 109.302000 seconds
time elapsed: 112.171000 seconds
time elapsed: 111.892000 seconds
time elapsed: 111.540000 seconds
time elapsed: 106.974000 seconds

average time: 119.059100 seconds

digest: 9.899543e+28
time elapsed: 50.840000 seconds
time elapsed: 50.629000 seconds
time elapsed: 50.762000 seconds
time elapsed: 50.752000 seconds
time elapsed: 50.762000 seconds
time elapsed: 49.554000 seconds
time elapsed: 49.726000 seconds
time elapsed: 50.727000 seconds
time elapsed: 50.733000 seconds
time elapsed: 50.849000 seconds

average time: 50.533400 seconds

digest: 9.907391e+28
time elapsed: 51.589000 seconds
time elapsed: 51.570000 seconds
time elapsed: 51.577000 seconds
time elapsed: 51.080000 seconds
time elapsed: 51.781000 seconds
time elapsed: 51.686000 seconds
time elapsed: 51.658000 seconds
time elapsed: 51.616000 seconds
time elapsed: 51.550000 seconds
time elapsed: 51.603000 seconds

average time: 51.571000 seconds

digest: 9.915525e+28
time elapsed: 131.084000 seconds
time elapsed: 127.458000 seconds
time elapsed: 130.587000 seconds
time elapsed: 135.175000 seconds
time elapsed: 136.610000 seconds
time elapsed: 138.783000 seconds
time elapsed: 137.482000 seconds
time elapsed: 145.750000 seconds
time elapsed: 144.681000 seconds
time elapsed: 145.007000 seconds

average time: 137.261700 seconds

digest: 9.899386e+28
time elapsed: 63.797000 seconds
time elapsed: 63.976000 seconds
time elapsed: 64.190000 seconds
time elapsed: 64.155000 seconds
time elapsed: 64.310000 seconds
time elapsed: 63.140000 seconds
time elapsed: 61.149000 seconds
time elapsed: 59.334000 seconds
time elapsed: 59.374000 seconds
time elapsed: 59.387000 seconds

average time: 62.281200 seconds

digest: 9.921057e+28
time elapsed: 141.745000 seconds
time elapsed: 145.864000 seconds
time elapsed: 146.092000 seconds
time elapsed: 144.201000 seconds
time elapsed: 147.164000 seconds
time elapsed: 147.341000 seconds
time elapsed: 147.352000 seconds
time elapsed: 148.253000 seconds
time elapsed: 146.093000 seconds
time elapsed: 148.659000 seconds

average time: 146.276400 seconds

digest: 9.899731e+28
time elapsed: 58.987000 seconds
time elapsed: 59.082000 seconds
time elapsed: 57.748000 seconds
time elapsed: 59.193000 seconds
time elapsed: 59.062000 seconds
time elapsed: 59.179000 seconds
time elapsed: 59.057000 seconds
time elapsed: 58.602000 seconds
time elapsed: 59.094000 seconds
time elapsed: 59.016000 seconds

average time: 58.902000 seconds

digest: 9.904197e+28
time elapsed: 51.268000 seconds
time elapsed: 50.811000 seconds
time elapsed: 51.172000 seconds
time elapsed: 51.228000 seconds
time elapsed: 51.240000 seconds
time elapsed: 51.140000 seconds
time elapsed: 51.341000 seconds
time elapsed: 51.057000 seconds
time elapsed: 49.245000 seconds
time elapsed: 45.089000 seconds

average time: 50.359100 seconds

digest: 9.918298e+28
time elapsed: 20.391000 seconds
time elapsed: 21.106000 seconds
time elapsed: 21.276000 seconds
time elapsed: 21.274000 seconds
time elapsed: 21.276000 seconds
time elapsed: 21.271000 seconds
time elapsed: 21.107000 seconds
time elapsed: 21.275000 seconds
time elapsed: 21.266000 seconds
time elapsed: 21.284000 seconds

average time: 21.152600 seconds

digest: 9.898525e+28
time elapsed: 53.226000 seconds
time elapsed: 53.173000 seconds
time elapsed: 52.859000 seconds
time elapsed: 53.067000 seconds
time elapsed: 53.127000 seconds
time elapsed: 52.689000 seconds
time elapsed: 52.862000 seconds
time elapsed: 53.302000 seconds
time elapsed: 53.239000 seconds
time elapsed: 52.996000 seconds

average time: 53.054000 seconds

digest: 9.914975e+28
time elapsed: 22.408000 seconds
time elapsed: 22.381000 seconds
time elapsed: 22.380000 seconds
time elapsed: 22.065000 seconds
time elapsed: 22.444000 seconds
time elapsed: 23.216000 seconds
time elapsed: 22.119000 seconds
time elapsed: 22.029000 seconds
time elapsed: 22.069000 seconds
time elapsed: 21.996000 seconds

average time: 22.310700 seconds

digest: 9.893765e+28
In [7]:
m_sorted = [(m,t) for (t,m) in sort([(t,m) for (m,t) in measurements])]
for (m,t) in m_sorted
    println("$m --> $t");
end
ijk_crr --> 6.0097
jik_rrr --> 6.4991
ijk_rrr --> 6.9024
jik_crr --> 7.202300000000001
kij_rcr --> 10.3536
ikj_rcr --> 10.477899999999998
kij_rcc --> 13.0856
ikj_rcc --> 13.4338
jki_ccc --> 20.9444
kji_crc --> 21.1526
kji_ccc --> 22.3107
jki_crc --> 22.5329
kij_rrr --> 46.2944
ikj_ccr --> 46.9514
kij_rrc --> 47.45380000000001
jik_rcr --> 49.1877
kji_crr --> 50.359100000000005
kij_ccr --> 50.53339999999999
kij_ccc --> 51.57099999999999
jki_rrc --> 52.450700000000005
ijk_rcr --> 52.489599999999996
ijk_ccr --> 52.6379
kji_ccr --> 53.05400000000001
jki_rcc --> 53.1355
ikj_ccc --> 54.691700000000004
ijk_crc --> 57.6087
ikj_rrr --> 57.77300000000001
ijk_rrc --> 58.32710000000001
kji_rcc --> 58.902
kji_rrc --> 62.28119999999999
jki_crr --> 62.5831
jki_ccr --> 62.80659999999999
jik_rrc --> 64.74990000000001
jik_crc --> 66.1533
jik_ccr --> 91.25050000000002
jki_rcr --> 111.24480000000001
kij_crc --> 119.05909999999999
ijk_ccc --> 122.81089999999999
ikj_rrc --> 130.2022
jik_rcc --> 134.1072
kji_rrr --> 137.26170000000002
ikj_crc --> 139.01319999999998
kij_crr --> 141.0902
ijk_rcc --> 142.18490000000003
jik_ccc --> 142.6723
ikj_crr --> 145.5525
kji_rcr --> 146.27640000000002
jki_rrr --> 3823.8504000000003
In [8]:
maximum(t for (m,t) in m_sorted) / minimum(t for (m,t) in m_sorted)
Out[8]:
636.2797477411519

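The raw ratio above is inflated by the jki_rrr average, which includes two anomalous runs of 7348 and 29811 seconds. Setting that configuration aside, the slowest variant (kji_rcr, about 146 seconds) is still over $20 \times$ slower than the fastest (ijk_crr, about 6 seconds), just from accessing memory in a different order!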
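One way to keep a few anomalous runs from dominating a summary is to report the median of the runs instead of the mean. The sketch below does this for the ten jki_rrr run times copied from the log above. Using the median is a suggestion added here, not something the original demo does.

```julia
using Statistics

# Per-run times (seconds) for the jki_rrr configuration, copied from the demo log.
runs = [148.331, 148.340, 149.073, 143.828, 7348.777,
        29811.908, 93.905, 133.805, 132.391, 128.146]

println("mean   = ", mean(runs))     # dominated by the two very long runs
println("median = ", median(runs))   # close to the typical run time
```

The mean reproduces the 3823.85 second average reported in the log, while the median is about 146 seconds, in line with the other slow configurations.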

Scan order

Scan order refers to the order in which the training examples are used in a learning algorithm.

As you saw in the programming assignment, using a non-random scan order is an option that can sometimes improve performance by increasing memory locality.

Here are a few scan orders that people use:

  • Random sampling with replacement (a.k.a. random scan): every time we need a new sample, we pick one at random from the whole training dataset.

  • Random sampling without replacement: every time we need a new sample, we pick one at random and then discard it (it won't be sampled again). Once we've gone through the whole training set, we replace all the samples and continue.

  • Sequential scan (a.k.a. systematic scan): sample the data in the order in which it appears in memory. When you get to the end of the training set, restart at the beginning.

  • Shuffle-once: at the beginning of execution, randomly shuffle the training data. Then sample the data in that shuffled order. When you get to the end of the training set, restart at the beginning.

  • Random reshuffling: at the beginning of execution, randomly shuffle the training data. Then sample the data in that shuffled order. When you get to the end of the training set, reshuffle the training set, then restart at the beginning.
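The scan orders above can be sketched as functions that return the sequence of example indices visited over several passes through a dataset of `n` examples. This is only an illustrative sketch: the function names are made up here, and a real training loop would generate indices lazily rather than building the whole sequence.

```julia
using Random

# Random sampling with replacement: every draw is independent of the others.
with_replacement(rng, n, epochs) = [rand(rng, 1:n) for _ in 1:(n * epochs)]

# Random sampling without replacement: draw from a shrinking pool, refill when it is empty.
function without_replacement(rng, n, epochs)
    order = Int[]
    for _ in 1:epochs
        pool = collect(1:n)
        while !isempty(pool)
            push!(order, splice!(pool, rand(rng, 1:length(pool))))
        end
    end
    return order
end

# Sequential scan: memory order, repeated every epoch.
sequential(n, epochs) = repeat(1:n, epochs)

# Shuffle-once: one random permutation, reused every epoch.
shuffle_once(rng, n, epochs) = repeat(randperm(rng, n), epochs)

# Random reshuffling: a fresh random permutation every epoch.
random_reshuffling(rng, n, epochs) = reduce(vcat, [randperm(rng, n) for _ in 1:epochs])

rng = MersenneTwister(0)
n, epochs = 5, 2
println("with replacement:    ", with_replacement(rng, n, epochs))
println("without replacement: ", without_replacement(rng, n, epochs))
println("sequential:          ", sequential(n, epochs))
println("shuffle-once:        ", shuffle_once(rng, n, epochs))
println("random reshuffling:  ", random_reshuffling(rng, n, epochs))
```

Every order except sampling with replacement visits each example exactly once per epoch. They differ in whether the order is random and whether it changes between epochs.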

How does the memory locality of these different scan orders compare?

  • Worst memory locality: random sampling with and without replacement
  • Okay memory locality: random reshuffling
  • Very good: shuffle once
  • Best: sequential scan

Two of these scan orders are actually statistically equivalent! Which ones?

  • Random sampling w/o replacement and random reshuffling

How does the statistical performance of these different scan orders compare?

random reshuffling = w/o replacement > shuffle once > sequential scan > random with replacement

A good first choice: shuffle once

Generally it performs quite well statistically (although it might have weaker theoretical guarantees), and it has good memory locality.
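A minimal sketch of why shuffle-once keeps good locality: the random permutation is applied to the data in memory once, and every later epoch is a plain sequential scan. The toy dataset and the `epoch_sum` function are hypothetical stand-ins for a real dataset and a real training step, and the sketch assumes each column of `X` holds one training example.

```julia
using Random

# Hypothetical toy dataset: d features per example, n examples stored as columns.
d, n = 4, 6
X = reshape(collect(1.0:(d * n)), d, n)
y = collect(1:n)

rng = MersenneTwister(1)
perm = randperm(rng, n)

# Pay the random-access cost once by physically reordering the examples.
Xs = X[:, perm]
ys = y[perm]

# Each epoch now reads columns in memory order, which is contiguous in Julia.
function epoch_sum(Xs)
    total = 0.0
    for j in axes(Xs, 2)
        total += sum(@view Xs[:, j])   # stand-in for a gradient step on example j
    end
    return total
end

println(epoch_sum(Xs))
```

Random reshuffling would instead need either a new physical copy every epoch or random accesses into the original array, which is where its locality cost comes from.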

Memory and sparsity

How does the use of sparsity impact the memory subsystem?

Two major effects:

  • Sparsity lowers the total amount of memory in use by the program.
  • Sparsity lowers the memory locality.
    • Why? Accesses are not dense and so are less predictable.
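Both effects show up in a small sketch using Julia's standard `SparseArrays` library. The vector size, the nonzero positions, and the `sparse_dot` helper are hypothetical choices made for illustration, and the byte counts ignore container overhead.

```julia
using SparseArrays

# Hypothetical example: a 1000-dimensional feature vector with only 3 nonzeros.
x = sparsevec([3, 70, 512], [1.0, -2.0, 0.5], 1000)
w = collect(1.0:1000.0)   # dense model weights

# Effect 1: only (index, value) pairs are stored, not every entry.
idx, vals = findnz(x)
sparse_bytes = length(idx) * sizeof(eltype(idx)) + length(vals) * sizeof(eltype(vals))
dense_bytes = length(x) * sizeof(eltype(x))
println("sparse: ", sparse_bytes, " bytes, dense: ", dense_bytes, " bytes")

# Effect 2: each nonzero causes an indirect read w[idx[t]], and these reads
# land at scattered addresses instead of walking through w in order.
function sparse_dot(idx, vals, w)
    s = 0.0
    for t in eachindex(idx)
        s += vals[t] * w[idx[t]]
    end
    return s
end

println(sparse_dot(idx, vals, w))
```

The sparse representation is much smaller, but the reads into `w` depend on the stored indices, so the hardware has a harder time predicting and prefetching them.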

What else can we do to lower the total memory usage of the machine learning pipeline?