{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Lecture 20: Memory\n",
"\n",
"## CS4787 — Principles of Large-Scale Machine Learning Systems\n",
"\n",
"$\\newcommand{\\R}{\\mathbb{R}}$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"* Last time we talked about parallel computing in machine learning, which allows us to take advantage of the parallel capabilities of our hardware to substantially speed up training and inference.\n",
"\n",
"* This is an instance of the general principle: **Use algorithms that fit your hardware, and use hardware that fits your algorithms**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* But compute is only half the story of making algorithms that fit the hardware.\n",
"\n",
"* How data is stored and accessed can be just as important as how it is processed.\n",
"\n",
"* This is especially the case for machine learning tasks, which often run on very large datasets that can push the limits of the memory subsystem of the hardware."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Today, we'll be talking about how memory affects the performance of the machine learning pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### How do modern CPUs handle memory? \n",
"\n",
"CPUs have a deep **cache hierarchy**. In fact, many CPUs are mostly cache by area.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The motivation for this was the ever-increasing gap between the speed at which the arithmetic units on the CPU could execute instructions and the time it took to read/write data to system memory.\n",
"\n",
"\n",
"Without some faster cache to temporarily store data, the performance of the CPU would be bottlenecked by the cost of reading and/or writing to RAM after every instruction."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### A simplified view of memory on a CPU\n",
"\n",
"This is what the \"shared memory\" programming model sees.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### But CPUs also have caches\n",
"\n",
"Caches are small and fast memories that are located physically on the CPU chip, and which mirror data stored in RAM so that it can be accessed more quickly by the CPU.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"## The usual setup of memory on a CPU\n",
"\n",
"* a fast L1 cache (typically about 32KB) on each core\n",
"\n",
"* a somewhat slower, but larger L2 cache (e.g. 256 KB) on each core\n",
"\n",
"* an even slower and even larger L3 cache (e.g. 2 MB/core) shared among cores\n",
"\n",
"* DRAM — off-chip memory\n",
"\n",
"* Persistent storage — a hard disk or flash drive"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### A model of a multi-socket computer\n",
"\n",
"Multiple CPU chips on the same motherboard communicate with each other through physical connections on the motherboard.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### The full view across multiple machines\n",
"\n",
"Multiple CPU chips on the same motherboard communicate with each other through physical connections on the motherboard.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"One important thing to notice here:\n",
"\n",
"### As we zoom out, much more of this diagram is \"memory\" boxes than compute boxes.\n",
"\n",
"\n",
"Hand-wavy consequence: as we scale up, the effect of memory becomes more and more important."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Another important take-away:\n",
"\n",
"### Memory has a hierarchical structure\n",
"\n",
"* Memories lower in the hierarchy are faster, but smaller\n",
"\n",
"* Memories higher in the hierarchy are larger, but slower, and are often shared among many compute units"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Two ways to measure performance of a part of the memory hierarchy.\n",
"\n",
"* **Latency**: how much time does it take to access data at a new address in memory?\n",
"\n",
"* **Throughput** (a.k.a. bandwidth): how much data total can we access in a given length of time?\n",
"\n",
"We saw these metrics earlier when evaluating the effect of parallelism."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Ideally, we'd like all of our memory accesses to go to the fast L1 cache, since it has high throughput and low latency.\n",
"\n",
"What prevents this from happening in a practical program?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Result: **the hardware needs to decide what is stored in the cache at any given time.**\n",
"\n",
"It wants to avoid, as much as possible, a situation in which the processor needs to access data that's not stored in the cache—this is called a **cache miss**.\n",
"\n",
"Hardware uses **two heuristics**:\n",
"\n",
"* The principle of temporal locality: **if a location in memory is accessed, it is likely that that location will be accessed again in the near future.**\n",
"\n",
"* The principle of spatial locality: **if a location in memory is accessed, it is likely that other nearby locations will be accessed in the near future.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Memory Locality\n",
"\n",
"Temporal locality and spatial locality are both types of **memory locality**. \n",
"\n",
"* We say that a program has good spatial locality and/or temporal locality and/or memory locality when it conforms to these heuristics.\n",
"\n",
"* When a program has good memory locality, it makes good use of the caches available on the hardware.\n",
"\n",
"* In practice, the throughput of a program is often substantially affected by the cache, and can be improved by increasing locality."
]
},
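{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of spatial locality, here is a minimal sketch (not part of the demo library; the function names are made up for this example). Both functions add up the same matrix and differ only in loop nesting. Julia stores an `Array` in column-major order, so `sum_by_columns` walks through memory contiguously, while `sum_by_rows` jumps ahead by a whole column on every access.\n",
"\n",
"```julia\n",
"# Sum all entries, visiting each column top to bottom (contiguous in memory).\n",
"function sum_by_columns(A::Matrix{Float64})\n",
"    total = 0.0\n",
"    for j in 1:size(A, 2)\n",
"        for i in 1:size(A, 1)\n",
"            total += A[i, j]\n",
"        end\n",
"    end\n",
"    return total\n",
"end\n",
"\n",
"# Sum all entries, visiting each row left to right (strided in memory).\n",
"function sum_by_rows(A::Matrix{Float64})\n",
"    total = 0.0\n",
"    for i in 1:size(A, 1)\n",
"        for j in 1:size(A, 2)\n",
"            total += A[i, j]\n",
"        end\n",
"    end\n",
"    return total\n",
"end\n",
"\n",
"A = rand(2000, 2000)\n",
"sum_by_columns(A); sum_by_rows(A)   # run once first so compilation is not timed\n",
"@time sum_by_columns(A)\n",
"@time sum_by_rows(A)\n",
"```\n",
"\n",
"The two results agree up to floating-point rounding. How far apart the timings are depends on the machine and on how the matrix size compares with the cache sizes."
]
},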
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Prefetching\n",
"\n",
"A third important heuristic used by both the hardware and the compiler to improve cache performance is **prefetching**.\n",
"\n",
"Prefetching loads data into the cache **before it is ever accessed**, and is particularly useful when the program or the hardware can predict what memory will be used ahead of time."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Question: What can we do in the ML pipeline to increase locality and/or enable prefetching?\n",
"\n",
"* Access training exmaples in the order they appear in memory\n",
" * Prefetch the training examples (prefetch the ones we're going to be using next)\n",
" \n",
"* Efficient matrix multiplications"
]
},
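{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first suggestion can be tried out with a small sketch like the one below. It assumes the training examples are stored as the columns of a matrix `X`, so that each example is contiguous in Julia's column-major layout. The function `scan_examples` and the sizes are hypothetical, chosen only for illustration. It visits the same examples once in memory order and once in a random order.\n",
"\n",
"```julia\n",
"using Random\n",
"\n",
"# Add up the inner products of w with each example (column of X), in the given order.\n",
"function scan_examples(X::Matrix{Float64}, w::Vector{Float64}, order)\n",
"    total = 0.0\n",
"    for i in order\n",
"        for k in 1:size(X, 1)\n",
"            total += w[k] * X[k, i]\n",
"        end\n",
"    end\n",
"    return total\n",
"end\n",
"\n",
"d, n = 16, 500_000\n",
"X = randn(d, n)            # each column is one example\n",
"w = randn(d)\n",
"perm = randperm(n)         # a random visiting order over the same examples\n",
"\n",
"# Warm-up calls for both argument types so compilation is not timed.\n",
"scan_examples(X, w, 1:10); scan_examples(X, w, perm[1:10])\n",
"\n",
"@time scan_examples(X, w, 1:n)     # in memory order\n",
"@time scan_examples(X, w, perm)    # in random order\n",
"```\n",
"\n",
"The in-order pass gives the hardware prefetcher a predictable access pattern, while the random pass does not. The size of the timing difference, if any, will vary with the hardware and with `d` and `n`."
]
},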
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"DEMO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A matrix multiply of $A \\in \\mathbb{R}^{m \\times n}$ and $B \\in \\mathbb{R}^{n \\times p}$, producing output $C \\in \\mathbb{R}^{m \\times p}$, can be written by running\n",
"\n",
"$$ C_{i,k} \\mathrel{+}= A_{i,j} \\cdot B_{j,k} $$\n",
"\n",
"for each value of $i \\in \\{1, \\ldots, m\\}$, $j \\in \\{1, \\ldots, n\\}$, and $k \\in \\{1, \\ldots, p\\}$.\n",
"The natural way to do this is with three for loops.\n",
"But what order should we run these loops?\n",
"And how does the way we store $A$, $B$, and $C$ affect performance?"
]
},
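{
"cell_type": "markdown",
"metadata": {},
"source": [
"For concreteness, here are two of the six loop orders written in plain Julia. This is only a sketch: the timings in the demo come from the compiled C library, not from this code, and the function names are made up for illustration. Both functions apply the same update $C_{i,k} \\mathrel{+}= A_{i,j} \\cdot B_{j,k}$ and differ only in which index varies fastest.\n",
"\n",
"```julia\n",
"# Loop order i, j, k: the innermost loop varies k, moving along rows of B and C.\n",
"function matmul_ijk!(C, A, B)\n",
"    fill!(C, 0)\n",
"    for i in axes(A, 1), j in axes(A, 2), k in axes(B, 2)\n",
"        C[i, k] += A[i, j] * B[j, k]\n",
"    end\n",
"    return C\n",
"end\n",
"\n",
"# Loop order j, k, i: the innermost loop varies i, moving down columns of A and C.\n",
"function matmul_jki!(C, A, B)\n",
"    fill!(C, 0)\n",
"    for j in axes(A, 2), k in axes(B, 2), i in axes(A, 1)\n",
"        C[i, k] += A[i, j] * B[j, k]\n",
"    end\n",
"    return C\n",
"end\n",
"\n",
"m, n, p = 3, 4, 2\n",
"A, B = rand(m, n), rand(n, p)\n",
"C1, C2 = zeros(m, p), zeros(m, p)\n",
"matmul_ijk!(C1, A, B)\n",
"matmul_jki!(C2, A, B)\n",
"\n",
"# Both loop orders should agree with the built-in product up to rounding.\n",
"isapprox(C1, A * B) && isapprox(C2, A * B)\n",
"```\n",
"\n",
"In Julia's multi-range `for` syntax the first range is the outermost loop and the last is the innermost, so the order in which the ranges are written is the loop order."
]
},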
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"using Libdl # open a dynamic library that links to the C code\n",
"tm_lib = Libdl.dlopen(\"demo/test_memory.lib\");"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"test_mmpy (generic function with 1 method)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"function test_mmpy(loop_order::String, Amaj::String, Bmaj::String, Cmaj::String, m, n, p, num_runs)\n",
" @assert(loop_order in [\"ijk\",\"ikj\",\"jki\",\"jik\",\"kij\",\"kji\"])\n",
" @assert(Amaj in [\"r\",\"c\"])\n",
" @assert(Bmaj in [\"r\",\"c\"])\n",
" @assert(Cmaj in [\"r\",\"c\"])\n",
" f = Libdl.dlsym(tm_lib, \"test_$(loop_order)_A$(Amaj)B$(Bmaj)C$(Cmaj)\")\n",
" ccall(f, Float64, (Int32, Int32, Int32, Int32), m, n, p, num_runs)\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"time elapsed: 0.295000 seconds\n",
"time elapsed: 0.283000 seconds\n",
"time elapsed: 0.299000 seconds\n",
"time elapsed: 0.277000 seconds\n",
"time elapsed: 0.278000 seconds\n",
"time elapsed: 0.276000 seconds\n",
"time elapsed: 0.283000 seconds\n",
"time elapsed: 0.278000 seconds\n",
"time elapsed: 0.273000 seconds\n",
"time elapsed: 0.276000 seconds\n",
"\n",
"average time: 0.281800 seconds\n",
"\n",
"digest: 1.549011e+27\n",
"time elapsed: 0.439000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.429000 seconds\n",
"time elapsed: 0.431000 seconds\n",
"time elapsed: 0.430000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.427000 seconds\n",
"time elapsed: 0.428000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.429000 seconds\n",
"\n",
"average time: 0.428700 seconds\n",
"\n",
"digest: 1.546028e+27\n",
"time elapsed: 0.306000 seconds\n",
"time elapsed: 0.306000 seconds\n",
"time elapsed: 0.304000 seconds\n",
"time elapsed: 0.304000 seconds\n",
"time elapsed: 0.302000 seconds\n",
"time elapsed: 0.306000 seconds\n",
"time elapsed: 0.303000 seconds\n",
"time elapsed: 0.303000 seconds\n",
"time elapsed: 0.305000 seconds\n",
"time elapsed: 0.304000 seconds\n",
"\n",
"average time: 0.304300 seconds\n",
"\n",
"digest: 1.547263e+27\n",
"time elapsed: 0.515000 seconds\n",
"time elapsed: 0.518000 seconds\n",
"time elapsed: 0.514000 seconds\n",
"time elapsed: 0.536000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.513000 seconds\n",
"time elapsed: 0.546000 seconds\n",
"time elapsed: 0.513000 seconds\n",
"time elapsed: 0.518000 seconds\n",
"time elapsed: 0.514000 seconds\n",
"\n",
"average time: 0.520300 seconds\n",
"\n",
"digest: 1.547960e+27\n",
"time elapsed: 0.274000 seconds\n",
"time elapsed: 0.277000 seconds\n",
"time elapsed: 0.274000 seconds\n",
"time elapsed: 0.273000 seconds\n",
"time elapsed: 0.273000 seconds\n",
"time elapsed: 0.277000 seconds\n",
"time elapsed: 0.276000 seconds\n",
"time elapsed: 0.273000 seconds\n",
"time elapsed: 0.277000 seconds\n",
"time elapsed: 0.272000 seconds\n",
"\n",
"average time: 0.274600 seconds\n",
"\n",
"digest: 1.547434e+27\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.431000 seconds\n",
"time elapsed: 0.427000 seconds\n",
"time elapsed: 0.431000 seconds\n",
"time elapsed: 0.429000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.427000 seconds\n",
"time elapsed: 0.427000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.434000 seconds\n",
"\n",
"average time: 0.428000 seconds\n",
"\n",
"digest: 1.550654e+27\n",
"time elapsed: 0.304000 seconds\n",
"time elapsed: 0.306000 seconds\n",
"time elapsed: 0.303000 seconds\n",
"time elapsed: 0.307000 seconds\n",
"time elapsed: 0.309000 seconds\n",
"time elapsed: 0.303000 seconds\n",
"time elapsed: 0.310000 seconds\n",
"time elapsed: 0.306000 seconds\n",
"time elapsed: 0.304000 seconds\n",
"time elapsed: 0.304000 seconds\n",
"\n",
"average time: 0.305600 seconds\n",
"\n",
"digest: 1.545477e+27\n",
"time elapsed: 0.521000 seconds\n",
"time elapsed: 0.520000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.522000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.537000 seconds\n",
"time elapsed: 0.520000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"\n",
"average time: 0.520900 seconds\n",
"\n",
"digest: 1.546651e+27\n",
"time elapsed: 0.193000 seconds\n",
"time elapsed: 0.205000 seconds\n",
"time elapsed: 0.210000 seconds\n",
"time elapsed: 0.210000 seconds\n",
"time elapsed: 0.191000 seconds\n",
"time elapsed: 0.178000 seconds\n",
"time elapsed: 0.174000 seconds\n",
"time elapsed: 0.174000 seconds\n",
"time elapsed: 0.176000 seconds\n",
"time elapsed: 0.174000 seconds\n",
"\n",
"average time: 0.188500 seconds\n",
"\n",
"digest: 1.543978e+27\n",
"time elapsed: 0.186000 seconds\n",
"time elapsed: 0.180000 seconds\n",
"time elapsed: 0.173000 seconds\n",
"time elapsed: 0.174000 seconds\n",
"time elapsed: 0.187000 seconds\n",
"time elapsed: 0.176000 seconds\n",
"time elapsed: 0.183000 seconds\n",
"time elapsed: 0.187000 seconds\n",
"time elapsed: 0.174000 seconds\n",
"time elapsed: 0.173000 seconds\n",
"\n",
"average time: 0.179300 seconds\n",
"\n",
"digest: 1.544756e+27\n",
"time elapsed: 0.163000 seconds\n",
"time elapsed: 0.162000 seconds\n",
"time elapsed: 0.162000 seconds\n",
"time elapsed: 0.157000 seconds\n",
"time elapsed: 0.156000 seconds\n",
"time elapsed: 0.159000 seconds\n",
"time elapsed: 0.161000 seconds\n",
"time elapsed: 0.165000 seconds\n",
"time elapsed: 0.166000 seconds\n",
"time elapsed: 0.160000 seconds\n",
"\n",
"average time: 0.161100 seconds\n",
"\n",
"digest: 1.545344e+27\n",
"time elapsed: 0.117000 seconds\n",
"time elapsed: 0.121000 seconds\n",
"time elapsed: 0.118000 seconds\n",
"time elapsed: 0.117000 seconds\n",
"time elapsed: 0.118000 seconds\n",
"time elapsed: 0.119000 seconds\n",
"time elapsed: 0.119000 seconds\n",
"time elapsed: 0.118000 seconds\n",
"time elapsed: 0.118000 seconds\n",
"time elapsed: 0.117000 seconds\n",
"\n",
"average time: 0.118200 seconds\n",
"\n",
"digest: 1.546831e+27\n",
"time elapsed: 0.335000 seconds\n",
"time elapsed: 0.324000 seconds\n",
"time elapsed: 0.370000 seconds\n",
"time elapsed: 0.341000 seconds\n",
"time elapsed: 0.384000 seconds\n",
"time elapsed: 0.391000 seconds\n",
"time elapsed: 0.394000 seconds\n",
"time elapsed: 0.374000 seconds\n",
"time elapsed: 0.396000 seconds\n",
"time elapsed: 0.382000 seconds\n",
"\n",
"average time: 0.369100 seconds\n",
"\n",
"digest: 1.546806e+27\n",
"time elapsed: 0.385000 seconds\n",
"time elapsed: 0.385000 seconds\n",
"time elapsed: 0.385000 seconds\n",
"time elapsed: 0.335000 seconds\n",
"time elapsed: 0.339000 seconds\n",
"time elapsed: 0.383000 seconds\n",
"time elapsed: 0.366000 seconds\n",
"time elapsed: 0.350000 seconds\n",
"time elapsed: 0.325000 seconds\n",
"time elapsed: 0.372000 seconds\n",
"\n",
"average time: 0.362500 seconds\n",
"\n",
"digest: 1.545685e+27\n",
"time elapsed: 0.212000 seconds\n",
"time elapsed: 0.199000 seconds\n",
"time elapsed: 0.201000 seconds\n",
"time elapsed: 0.205000 seconds\n",
"time elapsed: 0.201000 seconds\n",
"time elapsed: 0.178000 seconds\n",
"time elapsed: 0.191000 seconds\n",
"time elapsed: 0.179000 seconds\n",
"time elapsed: 0.179000 seconds\n",
"time elapsed: 0.181000 seconds\n",
"\n",
"average time: 0.192600 seconds\n",
"\n",
"digest: 1.544665e+27\n",
"time elapsed: 0.199000 seconds\n",
"time elapsed: 0.199000 seconds\n",
"time elapsed: 0.185000 seconds\n",
"time elapsed: 0.179000 seconds\n",
"time elapsed: 0.185000 seconds\n",
"time elapsed: 0.206000 seconds\n",
"time elapsed: 0.209000 seconds\n",
"time elapsed: 0.207000 seconds\n",
"time elapsed: 0.204000 seconds\n",
"time elapsed: 0.194000 seconds\n",
"\n",
"average time: 0.196700 seconds\n",
"\n",
"digest: 1.544349e+27\n",
"time elapsed: 0.543000 seconds\n",
"time elapsed: 0.537000 seconds\n",
"time elapsed: 0.594000 seconds\n",
"time elapsed: 0.566000 seconds\n",
"time elapsed: 0.568000 seconds\n",
"time elapsed: 0.558000 seconds\n",
"time elapsed: 0.572000 seconds\n",
"time elapsed: 0.558000 seconds\n",
"time elapsed: 0.563000 seconds\n",
"time elapsed: 0.560000 seconds\n",
"\n",
"average time: 0.561900 seconds\n",
"\n",
"digest: 1.543135e+27\n",
"time elapsed: 0.210000 seconds\n",
"time elapsed: 0.204000 seconds\n",
"time elapsed: 0.214000 seconds\n",
"time elapsed: 0.205000 seconds\n",
"time elapsed: 0.216000 seconds\n",
"time elapsed: 0.214000 seconds\n",
"time elapsed: 0.211000 seconds\n",
"time elapsed: 0.208000 seconds\n",
"time elapsed: 0.219000 seconds\n",
"time elapsed: 0.209000 seconds\n",
"\n",
"average time: 0.211000 seconds\n",
"\n",
"digest: 1.548973e+27\n",
"time elapsed: 0.558000 seconds\n",
"time elapsed: 0.566000 seconds\n",
"time elapsed: 0.567000 seconds\n",
"time elapsed: 0.566000 seconds\n",
"time elapsed: 0.571000 seconds\n",
"time elapsed: 0.572000 seconds\n",
"time elapsed: 0.572000 seconds\n",
"time elapsed: 0.570000 seconds\n",
"time elapsed: 0.564000 seconds\n",
"time elapsed: 0.565000 seconds\n",
"\n",
"average time: 0.567100 seconds\n",
"\n",
"digest: 1.549237e+27\n",
"time elapsed: 0.227000 seconds\n",
"time elapsed: 0.209000 seconds\n",
"time elapsed: 0.215000 seconds\n",
"time elapsed: 0.222000 seconds\n",
"time elapsed: 0.212000 seconds\n",
"time elapsed: 0.214000 seconds\n",
"time elapsed: 0.213000 seconds\n",
"time elapsed: 0.215000 seconds\n",
"time elapsed: 0.218000 seconds\n",
"time elapsed: 0.219000 seconds\n",
"\n",
"average time: 0.216400 seconds\n",
"\n",
"digest: 1.543348e+27\n",
"time elapsed: 0.448000 seconds\n",
"time elapsed: 0.447000 seconds\n",
"time elapsed: 0.451000 seconds\n",
"time elapsed: 0.446000 seconds\n",
"time elapsed: 0.452000 seconds\n",
"time elapsed: 0.446000 seconds\n",
"time elapsed: 0.453000 seconds\n",
"time elapsed: 0.455000 seconds\n",
"time elapsed: 0.446000 seconds\n",
"time elapsed: 0.433000 seconds\n",
"\n",
"average time: 0.447700 seconds\n",
"\n",
"digest: 1.548308e+27\n",
"time elapsed: 0.053000 seconds\n",
"time elapsed: 0.054000 seconds\n",
"time elapsed: 0.050000 seconds\n",
"time elapsed: 0.052000 seconds\n",
"time elapsed: 0.051000 seconds\n",
"time elapsed: 0.051000 seconds\n",
"time elapsed: 0.052000 seconds\n",
"time elapsed: 0.050000 seconds\n",
"time elapsed: 0.052000 seconds\n",
"time elapsed: 0.052000 seconds\n",
"\n",
"average time: 0.051700 seconds\n",
"\n",
"digest: 1.548871e+27\n",
"time elapsed: 0.422000 seconds\n",
"time elapsed: 0.441000 seconds\n",
"time elapsed: 0.420000 seconds\n",
"time elapsed: 0.417000 seconds\n",
"time elapsed: 0.429000 seconds\n",
"time elapsed: 0.426000 seconds\n",
"time elapsed: 0.416000 seconds\n",
"time elapsed: 0.416000 seconds\n",
"time elapsed: 0.421000 seconds\n",
"time elapsed: 0.418000 seconds\n",
"\n",
"average time: 0.422600 seconds\n",
"\n",
"digest: 1.543851e+27\n",
"time elapsed: 0.053000 seconds\n",
"time elapsed: 0.053000 seconds\n",
"time elapsed: 0.051000 seconds\n",
"time elapsed: 0.053000 seconds\n",
"time elapsed: 0.052000 seconds\n",
"time elapsed: 0.052000 seconds\n",
"time elapsed: 0.052000 seconds\n",
"time elapsed: 0.058000 seconds\n",
"time elapsed: 0.054000 seconds\n",
"time elapsed: 0.051000 seconds\n",
"\n",
"average time: 0.052900 seconds\n",
"\n",
"digest: 1.547475e+27\n",
"time elapsed: 0.279000 seconds\n",
"time elapsed: 0.275000 seconds\n",
"time elapsed: 0.276000 seconds\n",
"time elapsed: 0.275000 seconds\n",
"time elapsed: 0.281000 seconds\n",
"time elapsed: 0.279000 seconds\n",
"time elapsed: 0.275000 seconds\n",
"time elapsed: 0.279000 seconds\n",
"time elapsed: 0.275000 seconds\n",
"time elapsed: 0.275000 seconds\n",
"\n",
"average time: 0.276900 seconds\n",
"\n",
"digest: 1.543076e+27\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.421000 seconds\n",
"time elapsed: 0.422000 seconds\n",
"time elapsed: 0.435000 seconds\n",
"time elapsed: 0.423000 seconds\n",
"time elapsed: 0.419000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.422000 seconds\n",
"time elapsed: 0.426000 seconds\n",
"time elapsed: 0.444000 seconds\n",
"\n",
"average time: 0.426000 seconds\n",
"\n",
"digest: 1.547920e+27\n",
"time elapsed: 0.282000 seconds\n",
"time elapsed: 0.280000 seconds\n",
"time elapsed: 0.280000 seconds\n",
"time elapsed: 0.289000 seconds\n",
"time elapsed: 0.293000 seconds\n",
"time elapsed: 0.278000 seconds\n",
"time elapsed: 0.289000 seconds\n",
"time elapsed: 0.310000 seconds\n",
"time elapsed: 0.297000 seconds\n",
"time elapsed: 0.285000 seconds\n",
"\n",
"average time: 0.288300 seconds\n",
"\n",
"digest: 1.550322e+27\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.514000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.517000 seconds\n",
"time elapsed: 0.525000 seconds\n",
"\n",
"average time: 0.518300 seconds\n",
"\n",
"digest: 1.545686e+27\n",
"time elapsed: 0.294000 seconds\n",
"time elapsed: 0.291000 seconds\n",
"time elapsed: 0.292000 seconds\n",
"time elapsed: 0.294000 seconds\n",
"time elapsed: 0.293000 seconds\n",
"time elapsed: 0.293000 seconds\n",
"time elapsed: 0.294000 seconds\n",
"time elapsed: 0.293000 seconds\n",
"time elapsed: 0.291000 seconds\n",
"time elapsed: 0.293000 seconds\n",
"\n",
"average time: 0.292800 seconds\n",
"\n",
"digest: 1.548871e+27\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.426000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.419000 seconds\n",
"time elapsed: 0.423000 seconds\n",
"time elapsed: 0.428000 seconds\n",
"time elapsed: 0.422000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.428000 seconds\n",
"\n",
"average time: 0.424400 seconds\n",
"\n",
"digest: 1.544744e+27\n",
"time elapsed: 0.280000 seconds\n",
"time elapsed: 0.276000 seconds\n",
"time elapsed: 0.275000 seconds\n",
"time elapsed: 0.284000 seconds\n",
"time elapsed: 0.276000 seconds\n",
"time elapsed: 0.278000 seconds\n",
"time elapsed: 0.278000 seconds\n",
"time elapsed: 0.278000 seconds\n",
"time elapsed: 0.280000 seconds\n",
"time elapsed: 0.279000 seconds\n",
"\n",
"average time: 0.278400 seconds\n",
"\n",
"digest: 1.549813e+27\n",
"time elapsed: 0.514000 seconds\n",
"time elapsed: 0.518000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.512000 seconds\n",
"time elapsed: 0.519000 seconds\n",
"time elapsed: 0.520000 seconds\n",
"time elapsed: 0.517000 seconds\n",
"time elapsed: 0.521000 seconds\n",
"\n",
"average time: 0.516900 seconds\n",
"\n",
"digest: 1.546276e+27\n",
"time elapsed: 0.202000 seconds\n",
"time elapsed: 0.208000 seconds\n",
"time elapsed: 0.199000 seconds\n",
"time elapsed: 0.192000 seconds\n",
"time elapsed: 0.192000 seconds\n",
"time elapsed: 0.193000 seconds\n",
"time elapsed: 0.198000 seconds\n",
"time elapsed: 0.183000 seconds\n",
"time elapsed: 0.203000 seconds\n",
"time elapsed: 0.185000 seconds\n",
"\n",
"average time: 0.195500 seconds\n",
"\n",
"digest: 1.549451e+27\n",
"time elapsed: 0.198000 seconds\n",
"time elapsed: 0.181000 seconds\n",
"time elapsed: 0.196000 seconds\n",
"time elapsed: 0.187000 seconds\n",
"time elapsed: 0.194000 seconds\n",
"time elapsed: 0.194000 seconds\n",
"time elapsed: 0.195000 seconds\n",
"time elapsed: 0.195000 seconds\n",
"time elapsed: 0.188000 seconds\n",
"time elapsed: 0.199000 seconds\n",
"\n",
"average time: 0.192700 seconds\n",
"\n",
"digest: 1.550307e+27\n",
"time elapsed: 0.162000 seconds\n",
"time elapsed: 0.161000 seconds\n",
"time elapsed: 0.162000 seconds\n",
"time elapsed: 0.162000 seconds\n",
"time elapsed: 0.162000 seconds\n",
"time elapsed: 0.159000 seconds\n",
"time elapsed: 0.159000 seconds\n",
"time elapsed: 0.158000 seconds\n",
"time elapsed: 0.156000 seconds\n",
"time elapsed: 0.158000 seconds\n",
"\n",
"average time: 0.159900 seconds\n",
"\n",
"digest: 1.547840e+27\n",
"time elapsed: 0.117000 seconds\n",
"time elapsed: 0.118000 seconds\n",
"time elapsed: 0.117000 seconds\n",
"time elapsed: 0.116000 seconds\n",
"time elapsed: 0.119000 seconds\n",
"time elapsed: 0.117000 seconds\n",
"time elapsed: 0.116000 seconds\n",
"time elapsed: 0.117000 seconds\n",
"time elapsed: 0.116000 seconds\n",
"time elapsed: 0.117000 seconds\n",
"\n",
"average time: 0.117000 seconds\n",
"\n",
"digest: 1.541549e+27\n",
"time elapsed: 0.324000 seconds\n",
"time elapsed: 0.337000 seconds\n",
"time elapsed: 0.322000 seconds\n",
"time elapsed: 0.322000 seconds\n",
"time elapsed: 0.386000 seconds\n",
"time elapsed: 0.358000 seconds\n",
"time elapsed: 0.352000 seconds\n",
"time elapsed: 0.385000 seconds\n",
"time elapsed: 0.380000 seconds\n",
"time elapsed: 0.380000 seconds\n",
"\n",
"average time: 0.354600 seconds\n",
"\n",
"digest: 1.542940e+27\n",
"time elapsed: 0.391000 seconds\n",
"time elapsed: 0.370000 seconds\n",
"time elapsed: 0.378000 seconds\n",
"time elapsed: 0.375000 seconds\n",
"time elapsed: 0.369000 seconds\n",
"time elapsed: 0.363000 seconds\n",
"time elapsed: 0.322000 seconds\n",
"time elapsed: 0.382000 seconds\n",
"time elapsed: 0.393000 seconds\n",
"time elapsed: 0.370000 seconds\n",
"\n",
"average time: 0.371300 seconds\n",
"\n",
"digest: 1.547824e+27\n",
"time elapsed: 0.200000 seconds\n",
"time elapsed: 0.201000 seconds\n",
"time elapsed: 0.202000 seconds\n",
"time elapsed: 0.175000 seconds\n",
"time elapsed: 0.187000 seconds\n",
"time elapsed: 0.187000 seconds\n",
"time elapsed: 0.201000 seconds\n",
"time elapsed: 0.196000 seconds\n",
"time elapsed: 0.177000 seconds\n",
"time elapsed: 0.183000 seconds\n",
"\n",
"average time: 0.190900 seconds\n",
"\n",
"digest: 1.549576e+27\n",
"time elapsed: 0.175000 seconds\n",
"time elapsed: 0.179000 seconds\n",
"time elapsed: 0.190000 seconds\n",
"time elapsed: 0.186000 seconds\n",
"time elapsed: 0.179000 seconds\n",
"time elapsed: 0.186000 seconds\n",
"time elapsed: 0.190000 seconds\n",
"time elapsed: 0.182000 seconds\n",
"time elapsed: 0.189000 seconds\n",
"time elapsed: 0.186000 seconds\n",
"\n",
"average time: 0.184200 seconds\n",
"\n",
"digest: 1.545518e+27\n",
"time elapsed: 0.518000 seconds\n",
"time elapsed: 0.517000 seconds\n",
"time elapsed: 0.514000 seconds\n",
"time elapsed: 0.511000 seconds\n",
"time elapsed: 0.518000 seconds\n",
"time elapsed: 0.517000 seconds\n",
"time elapsed: 0.514000 seconds\n",
"time elapsed: 0.515000 seconds\n",
"time elapsed: 0.515000 seconds\n",
"time elapsed: 0.512000 seconds\n",
"\n",
"average time: 0.515100 seconds\n",
"\n",
"digest: 1.550872e+27\n",
"time elapsed: 0.184000 seconds\n",
"time elapsed: 0.195000 seconds\n",
"time elapsed: 0.195000 seconds\n",
"time elapsed: 0.198000 seconds\n",
"time elapsed: 0.197000 seconds\n",
"time elapsed: 0.192000 seconds\n",
"time elapsed: 0.197000 seconds\n",
"time elapsed: 0.182000 seconds\n",
"time elapsed: 0.182000 seconds\n",
"time elapsed: 0.180000 seconds\n",
"\n",
"average time: 0.190200 seconds\n",
"\n",
"digest: 1.548188e+27\n",
"time elapsed: 0.512000 seconds\n",
"time elapsed: 0.512000 seconds\n",
"time elapsed: 0.528000 seconds\n",
"time elapsed: 0.514000 seconds\n",
"time elapsed: 0.516000 seconds\n",
"time elapsed: 0.515000 seconds\n",
"time elapsed: 0.514000 seconds\n",
"time elapsed: 0.515000 seconds\n",
"time elapsed: 0.515000 seconds\n",
"time elapsed: 0.513000 seconds\n",
"\n",
"average time: 0.515400 seconds\n",
"\n",
"digest: 1.550548e+27\n",
"time elapsed: 0.195000 seconds\n",
"time elapsed: 0.192000 seconds\n",
"time elapsed: 0.192000 seconds\n",
"time elapsed: 0.185000 seconds\n",
"time elapsed: 0.194000 seconds\n",
"time elapsed: 0.186000 seconds\n",
"time elapsed: 0.193000 seconds\n",
"time elapsed: 0.186000 seconds\n",
"time elapsed: 0.187000 seconds\n",
"time elapsed: 0.195000 seconds\n",
"\n",
"average time: 0.190500 seconds\n",
"\n",
"digest: 1.547162e+27\n",
"time elapsed: 0.428000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.421000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.423000 seconds\n",
"time elapsed: 0.425000 seconds\n",
"time elapsed: 0.417000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"\n",
"average time: 0.423500 seconds\n",
"\n",
"digest: 1.549798e+27\n",
"time elapsed: 0.052000 seconds\n",
"time elapsed: 0.049000 seconds\n",
"time elapsed: 0.050000 seconds\n",
"time elapsed: 0.049000 seconds\n",
"time elapsed: 0.049000 seconds\n",
"time elapsed: 0.050000 seconds\n",
"time elapsed: 0.050000 seconds\n",
"time elapsed: 0.051000 seconds\n",
"time elapsed: 0.050000 seconds\n",
"time elapsed: 0.051000 seconds\n",
"\n",
"average time: 0.050100 seconds\n",
"\n",
"digest: 1.545493e+27\n",
"time elapsed: 0.428000 seconds\n",
"time elapsed: 0.423000 seconds\n",
"time elapsed: 0.423000 seconds\n",
"time elapsed: 0.420000 seconds\n",
"time elapsed: 0.419000 seconds\n",
"time elapsed: 0.422000 seconds\n",
"time elapsed: 0.418000 seconds\n",
"time elapsed: 0.419000 seconds\n",
"time elapsed: 0.424000 seconds\n",
"time elapsed: 0.421000 seconds\n",
"\n",
"average time: 0.421700 seconds\n",
"\n",
"digest: 1.549679e+27\n",
"time elapsed: 0.065000 seconds\n",
"time elapsed: 0.064000 seconds\n",
"time elapsed: 0.063000 seconds\n",
"time elapsed: 0.064000 seconds\n",
"time elapsed: 0.067000 seconds\n",
"time elapsed: 0.066000 seconds\n",
"time elapsed: 0.064000 seconds\n",
"time elapsed: 0.063000 seconds\n",
"time elapsed: 0.063000 seconds\n",
"time elapsed: 0.063000 seconds\n",
"\n",
"average time: 0.064200 seconds\n",
"\n",
"digest: 1.545748e+27\n"
]
}
],
"source": [
"d = 512;\n",
"num_runs = 10;\n",
"measurements = []\n",
"for loop_order in [\"ijk\",\"ikj\",\"jki\",\"jik\",\"kij\",\"kji\"]\n",
" for Am in [\"r\",\"c\"]\n",
" for Bm in [\"r\",\"c\"]\n",
" for Cm in [\"r\",\"c\"]\n",
" push!(measurements, (loop_order * \"_\" * Am * Bm * Cm, test_mmpy(loop_order, Am, Bm, Cm, d, d, d, num_runs)));\n",
" end\n",
" end\n",
" end\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"kji_crc --> 0.0501\n",
"jki_crc --> 0.0517\n",
"jki_ccc --> 0.0529\n",
"kji_ccc --> 0.0642\n",
"kij_rcc --> 0.11700000000000002\n",
"ikj_rcc --> 0.1182\n",
"kij_rcr --> 0.1599\n",
"ikj_rcr --> 0.1611\n",
"ikj_rrc --> 0.17930000000000001\n",
"kij_ccc --> 0.18419999999999997\n",
"ikj_rrr --> 0.18849999999999995\n",
"kji_rrc --> 0.19019999999999998\n",
"kji_rcc --> 0.1905\n",
"kij_ccr --> 0.19090000000000001\n",
"ikj_ccr --> 0.19260000000000005\n",
"kij_rrc --> 0.1927\n",
"kij_rrr --> 0.1955\n",
"ikj_ccc --> 0.1967\n",
"jki_rrc --> 0.211\n",
"jki_rcc --> 0.2164\n",
"ijk_crr --> 0.27460000000000007\n",
"jik_rrr --> 0.2769\n",
"jik_ccr --> 0.2784000000000001\n",
"ijk_rrr --> 0.28180000000000005\n",
"jik_rcr --> 0.2883\n",
"jik_crr --> 0.2928\n",
"ijk_rcr --> 0.3043\n",
"ijk_ccr --> 0.3056\n",
"kij_crr --> 0.3546\n",
"ikj_crc --> 0.3625\n",
"ikj_crr --> 0.36910000000000004\n",
"kij_crc --> 0.3713\n",
"kji_ccr --> 0.4217000000000001\n",
"jki_ccr --> 0.4226\n",
"kji_crr --> 0.42349999999999993\n",
"jik_crc --> 0.4244\n",
"jik_rrc --> 0.42600000000000005\n",
"ijk_crc --> 0.42799999999999994\n",
"ijk_rrc --> 0.42869999999999997\n",
"jki_crr --> 0.44770000000000004\n",
"kji_rrr --> 0.5151\n",
"kji_rcr --> 0.5153999999999999\n",
"jik_ccc --> 0.5169\n",
"jik_rcc --> 0.5183000000000001\n",
"ijk_rcc --> 0.5203\n",
"ijk_ccc --> 0.5208999999999999\n",
"jki_rrr --> 0.5619\n",
"jki_rcr --> 0.5671000000000002\n"
]
}
],
"source": [
"m_sorted = [(m,t) for (t,m) in sort([(t,m) for (m,t) in measurements])]\n",
"for (m,t) in m_sorted\n",
" println(\"$m --> $t\");\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"11.319361277445113"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maximum(t for (m,t) in m_sorted) / minimum(t for (m,t) in m_sorted)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
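"More than a $10 \\times$ difference just from accessing memory in a different order!\n",
"\n",
"Recall the two ways to lay out a matrix such as\n",
"\n",
"$$\\begin{bmatrix}a & b \\\\ c & d\\end{bmatrix}$$\n",
"\n",
"in memory:\n",
"\n",
"* **Row-major**: `[a,b,c,d]`\n",
"* **Column-major**: `[a,c,b,d]`"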
]
},
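{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"The same effect shows up in something much simpler than matrix multiply. The cell below is a minimal sketch, not part of the benchmark above: `sum_col_order` and `sum_row_order` are hypothetical helpers that add up the entries of one matrix in two different loop orders. It assumes Julia's built-in `Array`, which is stored column-major (the `[a,c,b,d]` flattening), so walking down a column touches adjacent memory while walking along a row jumps by a full column each step. On most machines the column-order loop should be noticeably faster, but the exact ratio depends on the cache sizes and the matrix dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# Hypothetical helpers (not part of the benchmark above): both return the same sum,\n",
"# but they traverse the matrix in different orders.\n",
"function sum_col_order(A::Matrix{Float64})\n",
"    s = 0.0\n",
"    for j in 1:size(A, 2)        # outer loop over columns\n",
"        for i in 1:size(A, 1)    # inner loop walks down one column: adjacent in memory\n",
"            s += A[i, j]\n",
"        end\n",
"    end\n",
"    return s\n",
"end\n",
"\n",
"function sum_row_order(A::Matrix{Float64})\n",
"    s = 0.0\n",
"    for i in 1:size(A, 1)        # outer loop over rows\n",
"        for j in 1:size(A, 2)    # inner loop walks along one row: stride of size(A, 1) elements\n",
"            s += A[i, j]\n",
"        end\n",
"    end\n",
"    return s\n",
"end\n",
"\n",
"# vec flattens in storage order, which is column-major for a Julia Array.\n",
"vec([1 2; 3 4])                  # [1, 3, 2, 4]\n",
"\n",
"A = ones(2048, 2048);            # size chosen arbitrarily for illustration\n",
"sum_col_order(A); sum_row_order(A);   # warm-up calls so compilation is not timed\n",
"@time sum_col_order(A)\n",
"@time sum_row_order(A)"
]
},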
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Scan order\n",
"\n",
"**Scan order** refers to the order in which the training examples are used in a learning algorithm.\n",
"\n",
"As you saw in the programming assignment, using a non-random scan order is an option that can sometimes improve performance by increasing memory locality.\n",
"\n",
"Here are a few scan orders that people use:\n",
"\n",
"* **Random sampling with replacement** (a.k.a. random scan): every time we need a new sample, we pick one at random from the whole training dataset.\n",
"\n",
"* **Random sampling without replacement**: every time we need a new sample, we pick one at random and then discard it (it won't be sampled again). Once we've gone through the whole training set, we replace all the samples and continue.\n",
"\n",
"* **Sequential scan** (a.k.a. systematic scan): sample the data in the order in which it appears in memory. When you get to the end of the training set, restart at the beginning.\n",
"\n",
"* **Shuffle-once**: at the beginning of execution, randomly shuffle the training data. Then sample the data in that shuffled order. When you get to the end of the training set, restart at the beginning.\n",
"\n",
"* **Random reshuffling**: at the beginning of execution, randomly shuffle the training data. Then sample the data in that shuffled order. When you get to the end of the training set, reshuffle the training set, then restart at the beginning."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"How does the memory locality of these different scan orders compare?\n",
"\n",
"* Worst memory locality: random sampling with and without replacement\n",
"* Okay memory locality: random reshuffling\n",
"* Very good: shuffle once\n",
"* Best: sequential scan\n",
"\n",
"Two of these scan orders are actually statistically equivalent! Which ones?\n",
"\n",
"* Random sampling w/o replacement and random reshuffling\n",
"\n",
"How does the statistical performance of these different scan orders compare?\n",
"\n",
"* Random reshuffling = random sampling w/o replacement > shuffle-once > sequential scan > random sampling with replacement"
]
},
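{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"One way to make the scan orders concrete is to write out the sequence of example indices each one visits. The cell below is a minimal sketch under stated assumptions: `scan_indices` is a hypothetical helper (it is not from the programming assignment), the default seed is arbitrary, and every order is run for a whole number of passes over `n` examples. Random sampling without replacement is not given its own branch because, as noted above, it is statistically equivalent to random reshuffling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"using Random\n",
"\n",
"# Hypothetical helper: returns the indices visited over `epochs` passes through\n",
"# n training examples. The default rng seed is an arbitrary choice.\n",
"function scan_indices(order::Symbol, n::Int, epochs::Int; rng = MersenneTwister(0))\n",
"    if order == :with_replacement\n",
"        # every draw is independent and uniform over the whole dataset\n",
"        return rand(rng, 1:n, n * epochs)\n",
"    elseif order == :sequential\n",
"        # memory order, repeated every pass\n",
"        return repeat(collect(1:n), epochs)\n",
"    elseif order == :shuffle_once\n",
"        # one permutation drawn up front, reused every pass\n",
"        return repeat(randperm(rng, n), epochs)\n",
"    elseif order == :random_reshuffling\n",
"        # a fresh permutation for every pass\n",
"        return vcat([randperm(rng, n) for _ in 1:epochs]...)\n",
"    else\n",
"        error(\"unknown scan order: $order\")\n",
"    end\n",
"end\n",
"\n",
"for order in [:with_replacement, :sequential, :shuffle_once, :random_reshuffling]\n",
"    println(order, \" --> \", scan_indices(order, 4, 2))\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"With shuffle-once, the permutation can be applied to the stored data a single time, after which every pass is a sequential read. That is one way to see why its locality is close to that of sequential scan."
]
},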
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## A good first choice when compute is light: shuffle once\n",
"\n",
"Shuffle-once generally performs quite well statistically (although it might have weaker theoretical guarantees), and it has good memory locality."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Another good choice: without-replacement sampling\n",
"\n",
"Without-replacement sampling is particularly useful when you're doing some sort of data augmentation, since you can construct the without-replacement minibatches on the fly."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Memory and sparsity\n",
"\n",
"How does the use of sparsity impact the memory subsystem?\n",
"\n",
"Two major effects:\n",
"\n",
"* Sparsity lowers the total amount of memory in use by the program.\n",
"* Sparsity lowers the memory locality.\n",
" * Why? Accesses are not dense and so are less predictable."
]
},
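{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Both effects can be seen in a small example. The cell below is a sketch that assumes the `SparseArrays` standard library; the vector length and the positions of the nonzeros are arbitrary choices made for illustration. It compares the memory footprint of a dense and a sparse copy of the same vector, then computes a dot product that reads a dense weight vector `w` only at the stored indices, which may be far apart in memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"using SparseArrays\n",
"\n",
"n = 1_000_000                          # arbitrary length for illustration\n",
"x_dense = zeros(n)\n",
"x_dense[[3, 70_000, 999_999]] .= 1.0   # only three nonzero entries\n",
"x_sparse = sparse(x_dense)             # stores just the nonzero indices and values\n",
"\n",
"# Effect 1: the sparse copy uses far less memory.\n",
"println(\"dense bytes:  \", Base.summarysize(x_dense))\n",
"println(\"sparse bytes: \", Base.summarysize(x_sparse))\n",
"\n",
"# Effect 2: a dot product with a dense vector w reads w only at the stored\n",
"# indices, so the accesses are scattered rather than a contiguous scan.\n",
"w = rand(n)\n",
"idx, vals = findnz(x_sparse)\n",
"s = sum(w[i] * v for (i, v) in zip(idx, vals))"
]
},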
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"What else can we do to lower the total memory usage of the machine learning pipeline?"
]
}
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Julia 1.5.4",
"language": "julia",
"name": "julia-1.5"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.5.4"
},
"rise": {
"scroll": true
}
},
"nbformat": 4,
"nbformat_minor": 2
}