Demo

"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Random projection: A post-script\n",
"\n",
"While using Gaussian random variables is sufficient, it's not very computationally efficient to generate the matrix $A$, communicate it, and multiply by it. There's a lot of work into making random projections faster by using other distributions and more structured matrices, so if you want to use random projection at scale, you should consider using these methods."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Principal Component Analysis\n",
"\n",
"We saw that random projection can do a good job of preserving distances, but weirdly, we *didn't use the data* at all to construct the random projection. \n",
"\n",
"* Can we use the data to do better than random projection?\n",
"\n",
"**Idea:** instead of using a random linear projection, pick an orthogonal linear map that maximizes the variance of the resulting transformed data.\n",
"\n",
"* Intuition: preserve as much of the \"interesting signal\" in the data as possible\n",
"\n",
"Concretely, if we're given some data $x_1, \\ldots, x_n \\in \\R^d$, we want to find an orthonormal matrix $A \\in \\R^{r \\times d}$ (i.e. a matrix with orthogonal rows all of norm $1$, i.e. $A A^T = I$) that maximizes\n",
"\n",
"$$\\frac{1}{n} \\sum_{i=1}^n \\norm{ A x_i - \\frac{1}{n} \\sum_{j=1}^n A x_j }^2\n",
"\\; \\text{ over orthogonal projections }A \\in \\R^{r \\times d}.$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Observe that this objective is equal to\n",
"\n",
"\\begin{align*}&\\frac{1}{n} \\sum_{i=1}^n \\left( A x_i - \\frac{1}{n} \\sum_{j=1}^n A x_j \\right)^T \\left( A x_i - \\frac{1}{n} \\sum_{j=1}^n A x_j \\right)\n",
" \\\\&\\hspace{2em}=\n",
" \\operatorname{trace}\\left( A \\left( \\frac{1}{n} \\sum_{i=1}^n \\left( x_i - \\frac{1}{n} \\sum_{j=1}^n x_j \\right) \\left( x_i - \\frac{1}{n} \\sum_{j=1}^n x_j \\right)^T \\right) A^T \\right).\\end{align*}\n",
" \n",
"So, if we let $\\Sigma$ be the empirical covariance matrix of the data,\n",
"\n",
"$$\\Sigma = \\frac{1}{n} \\sum_{i=1}^n \\left( x_i - \\frac{1}{n} \\sum_{j=1}^n x_j \\right) \\left( x_i - \\frac{1}{n} \\sum_{j=1}^n x_j \\right)^T,$$\n",
"\n",
"this is solved by maximizing $A \\Sigma A^T$."
]
},
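{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"To spell out the step behind this identity (a sketch, writing $\\bar{x} = \\frac{1}{n} \\sum_{j=1}^n x_j$ for the sample mean, a shorthand not used elsewhere in these notes): since $A$ is linear, $A x_i - \\frac{1}{n} \\sum_{j=1}^n A x_j = A (x_i - \\bar{x})$, and for any vector $v$ we have $v^T v = \\operatorname{trace}(v v^T)$. Taking $v = A (x_i - \\bar{x})$ gives\n",
"\n",
"$$\\norm{ A (x_i - \\bar{x}) }^2 = \\operatorname{trace}\\left( A (x_i - \\bar{x}) (x_i - \\bar{x})^T A^T \\right),$$\n",
"\n",
"and averaging over $i$ (the trace is linear) yields $\\operatorname{trace}(A \\Sigma A^T)$."
]
},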
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The solution to this is to pick the matrix $A$ such that the rows of $A$ are the $r$ eigenvectors of $\\Sigma$ associated with the $r$-largest eigenvalues.\n",
"That is, find the eigendecomposition of $\\Sigma$, and let the $r$ largest eigenvalues' eigenvectors be the rows of $A$.\n",
"\n",
"* One downside of this direct approach is that doing so requires $O(d^2)$ space (to store the covariance matrix) and even more time (to do the eigendecomposition).\n",
"* As a result, many methods for fast PCA have been developed, and you should consider using these if you want to use PCA at scale.\n"
]
},
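{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a check on this choice (a sketch, with notation assumed here rather than taken from the notes): write the eigendecomposition as $\\Sigma = \\sum_{k=1}^d \\lambda_k u_k u_k^T$ with orthonormal eigenvectors $u_k$ and eigenvalues $\\lambda_1 \\ge \\cdots \\ge \\lambda_d \\ge 0$. If the rows of $A$ are $u_1^T, \\ldots, u_r^T$, then\n",
"\n",
"$$\\operatorname{trace}(A \\Sigma A^T) = \\sum_{k=1}^r u_k^T \\Sigma u_k = \\sum_{k=1}^r \\lambda_k,$$\n",
"\n",
"so the fraction of the total variance $\\operatorname{trace}(\\Sigma) = \\sum_{k=1}^d \\lambda_k$ that is retained is $\\sum_{k=1}^r \\lambda_k / \\sum_{k=1}^d \\lambda_k$.\n",
"\n",
"For instance, if $d = 2$, $\\Sigma = \\begin{bmatrix} 4 & 0 \\\\ 0 & 1 \\end{bmatrix}$, and $r = 1$, then $A = \\begin{bmatrix} 1 & 0 \\end{bmatrix}$ retains $4$ of the total variance $5$, that is, 80 percent."
]
},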
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Demo

"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Autoencoders\n",
"\n",
"A more complicated approach, but not always one that works better.\n",
"\n",
"Idea: use **deep learning** to learn two nonlinear models, one of which (the **encoder**, $\\phi$) goes from our original data in $\\R^d$ to a compressed representation in $\\R^r$ for $r < d$, and the other of which (the **decoder**, $\\psi$) goes from the compressed representation in $\\R^r$ back to $\\R^d$.\n",
"\n",
"\n",
"\n",
"We want to train in such a way as to minimize the distance between the original examples in $\\R^d$ and the \"recovered\" examples that result from encoding and then decoding the example.\n",
"Formally, given some dataset $x_1, \\ldots, x_n$, we want to minimize\n",
"\n",
"$$\\frac{1}{n} \\sum_{i=1}^n \\norm{ \\psi(\\phi(x_i)) - x_i }^2$$\n",
"over some parameterized class of nonlinear transformations $\\phi: \\R^d \\rightarrow \\R^r$ and $\\psi: \\R^r \\rightarrow \\R^d$ defined by a neural network."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Sparsity\n",
"\n",
"An alternate approach to deal with large dimension $d$ is **sparsity**.\n",
"\n",
"* A vector or matrix is informally called **sparse** when few of its entries are non-zero.\n",
"\n",
"* The **density** of a sparse matrix or vector is the fraction of its entries that are non-zero.\n",
"For example, the vector\n",
"\n",
"$$\\begin{bmatrix} 3 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 7 & 0 \\end{bmatrix}$$\n",
"\n",
"has density $3/10 = 0.3$.\n",
"\n",
"\n",
"* When the density of a matrix is low, we can store and compute with it in a format specialized for sparse matrices.\n",
"\n",
"* This results in computations that have cost proportional to the number of nonzero entries of the matrix, rather than its dimensions.\n",
"\n",
" * So if $d$ is large, but the matrices involved have low density, we can often save a lot of compute and memory by using sparse matrix formats."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Storing sparse vectors.\n",
"\n",
"The standard way to store a sparse vector to store only the nonzero entries as pairs consisting of the index and the value of the nonzero entry.\n",
"\n",
"Concretely, for the vector above, we would store it as\n",
"\n",
"$$(\\texttt{index: } 0, \\texttt{value: } 3), (\\texttt{index: } 4, \\texttt{value: } 2), (\\texttt{index: } 8, \\texttt{value: } 7).$$\n",
"\n",
"Usually this is stored more compactly as two arrays: one array of indexes and one array of values.\n",
"\n",
"\\begin{align*}\n",
" \\texttt{indexes: } & \\begin{bmatrix} 0 & 4 & 8 \\end{bmatrix} \\\\\n",
" \\texttt{values: } & \\begin{bmatrix} 3 & 2 & 7 \\end{bmatrix}.\n",
"\\end{align*}"
]
},
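{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"One consequence of this layout, sketched here with a dense vector $w \\in \\R^{10}$ that is introduced only for illustration: an inner product with the sparse vector above only needs to visit the stored entries,\n",
"\n",
"$$\\begin{bmatrix} 3 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 7 & 0 \\end{bmatrix} w = 3 w_0 + 2 w_4 + 7 w_8,$$\n",
"\n",
"which takes $3$ multiplications rather than $10$ (indexing the entries of $w$ from $0$, as in the index array)."
]
},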
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Storing sparse matrices.\n",
"\n",
"There are many ways to store a sparse matrix. The simplest way is a direct adaptation of the way to store sparse vectors: we store the indexes (as a row/column pair) and values of all the nonzero entries.\n",
"This format is called \"coordinate list\" or **COO**.\n",
"For example, the matrix\n",
"\n",
"$$\\begin{bmatrix} 5 & 0 & 0 & 3 & 0 & 1 \\\\ 0 & 4 & 0 & 0 & 0 & 0 \\\\ 1 & 2 & 0 & 0 & 3 & 0 \\end{bmatrix}$$\n",
"\n",
"could be encoded in COO as\n",
"\n",
"\\begin{align*}\n",
" \\texttt{row indexes: } & \\begin{bmatrix} 0 & 0 & 0 & 1 & 2 & 2 & 2 \\end{bmatrix} \\\\\n",
" \\texttt{column indexes: } & \\begin{bmatrix} 0 & 3 & 5 & 1 & 0 & 1 & 4 \\end{bmatrix} \\\\\n",
" \\texttt{values: } & \\begin{bmatrix} 5 & 3 & 1 & 4 & 1 & 2 & 3 \\end{bmatrix}.\n",
"\\end{align*}\n",
"\n",
"Note that for COO, the order of the nonzero entries does not matter and no particular order is specified, although they are often sorted for performance reasons."
]
},
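{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A rough storage count for this example (a sketch that assumes every index and every value occupies one machine word, which real implementations may not do): the dense $3 \\times 6$ matrix holds $3 \\cdot 6 = 18$ numbers, while COO holds three numbers for each of the $7$ nonzero entries,\n",
"\n",
"$$3 \\cdot 7 = 21 > 18.$$\n",
"\n",
"So at density $7/18 \\approx 0.39$ COO is not yet smaller than the dense format. Under the same assumption, for an $m \\times n$ matrix with $k$ nonzero entries, COO only saves memory when $3k < mn$, i.e. when the density is below $1/3$."
]
},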
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Storing Sparse Matrices Continued\n",
"\n",
"Another common format is \"compressed sparse row\" or **CSR**. \n",
"CSR effectively stores all the rows of the matrix as sparse vectors concatenated into one big array, and then uses another row offset array to index into it.\n",
"The above matrix encoded in CSR would look like\n",
"\n",
"\\begin{align*}\n",
" \\texttt{row offsets: } & \\begin{bmatrix} 0 & 3 & 4 \\end{bmatrix} \\\\\n",
" \\texttt{column indexes: } & \\begin{bmatrix} 0 & 3 & 5 & 1 & 0 & 1 & 4 \\end{bmatrix} \\\\\n",
" \\texttt{values: } & \\begin{bmatrix} 5 & 3 & 1 & 4 & 1 & 2 & 3 \\end{bmatrix}\n",
"\\end{align*}\n",
"\n",
"where $0$, $3$, and $4$ are the offsets within the column index and values arrays at which rows $0$, $1$, and $2$ begin, respectively. \n",
"\n",
"\"Compressed sparse column,\" or **CSC** is just the transpose-dual of CSR: it stores the columns of a matrix as sparse vectors rather than the rows.\n"
]
},
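{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"For comparison, here is a sketch of how the same matrix would look in CSC, using the same offset convention as the CSR example (worked out by hand from the matrix above, not taken from any particular library):\n",
"\n",
"\\begin{align*}\n",
"  \\texttt{column offsets: } & \\begin{bmatrix} 0 & 2 & 4 & 4 & 5 & 6 \\end{bmatrix} \\\\\n",
"  \\texttt{row indexes: } & \\begin{bmatrix} 0 & 2 & 1 & 2 & 0 & 2 & 0 \\end{bmatrix} \\\\\n",
"  \\texttt{values: } & \\begin{bmatrix} 5 & 1 & 4 & 2 & 3 & 3 & 1 \\end{bmatrix}\n",
"\\end{align*}\n",
"\n",
"Column $2$ has no nonzero entries, so its offset $4$ is the same as the offset of column $3$."
]
},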
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Comparing Storage Formats\n",
"\n",
"While COO is better for building matrices by adding entries, CSR usually allows for faster matrix multiply with dense vectors and matrices."
]
},
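{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"To illustrate why (a sketch, with symbols introduced here for convenience): write $o$ for the row offsets, $c$ for the column indexes, and $v$ for the values of a CSR matrix $M$, and append the total number of nonzero entries as a final offset, so that $o = (0, 3, 4, 7)$ for the example matrix. Then each entry of $y = M x$ is a sum over one contiguous slice of the arrays,\n",
"\n",
"$$y_i = \\sum_{k = o_i}^{o_{i+1} - 1} v_k \\, x_{c_k}.$$\n",
"\n",
"For example, with $x$ the all-ones vector in $\\R^6$, this gives $y_0 = 5 + 3 + 1 = 9$, $y_1 = 4$, and $y_2 = 1 + 2 + 3 = 6$, using one multiply-add per nonzero entry."
]
},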
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Demo

"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Sparsity Epilogue: Other storage formats.\n",
"\n",
"There are many other ways to store sparse matrices, especially in the case where the matrix has some structure \n",
"\n",
"* for example, if it is banded matrix or a symmetric matrix\n",
"* or a tridiagonal matrix\n",
"\n",
"Some sparse matrix formats take advantage of dense sub-blocks within the matrix, which they store in a dense format to save memory and compute."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Questions?

"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"rise": {
"scroll": true,
"transition": "none"
}
},
"nbformat": 4,
"nbformat_minor": 2
}