{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lecture 24: ML Accelerators\n", "\n", "## CS4787 — Principles of Large-Scale Machine Learning Systems" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "using PyPlot\n", "using LinearAlgebra\n", "using Statistics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Final Exam Logistics\n", "\n", "Scheduled for release at 5/17/2020 at 2:00 PM, and due 48 hours later.\n", "\n", "I will have office hours as usual on Wednesday, but the TAs will mostly end their office hours tomorrow.\n", "\n", "Practice final released." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So far, we've talked about machine learning running on two types of classical hardware: **CPUs** and **GPUs**.\n", "\n", "But these are not the only options for training and inferring ML models.\n", "\n", "An **exciting new generation of computer processors** is being developed to accelerate machine learning calculations.\n", "\n", "
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## How do we program a TPU?\n", "\n", "* Google supports running TensorFlow code on TPUs. This makes it (relatively) easy to train and infer deep neural networks using tools you're already familiar with." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Why might we use a TPU instead of a CPU/GPU?\n", "\n", "* Pro for TPU: Google has some evidence that the TPU outperforms GPUs and other accelerators on benchmark tasks. From Google's blog: (https://cloud.google.com/blog/products/ai-machine-learning/mlperf-benchmark-establishes-that-google-cloud-offers-the-most-accessible-scale-for-machine-learning-training)\n", "\n", " * \"For example, it’s possible to achieve a 19\\% speed-up with a TPU v3 Pod on a chip-to-chip basis versus the current best-in-class on-premise system when tested on ResNet-50\"\n", "\n", " * But other hardware manufacturers claim that their hardware is better...so you'll need to do some research to determine what is likely to be best for your task and for the price point you care about.\n", "\n", "* Pro for TPU: Seems to have better power efficiency and somewhat better scalability than other options. E.g. you can scale up to 256 v3 TPUs in a pod.\n", "\n", "* Con for TPU: It can tie you to Google's Cloud Platform.\n", "\n", "* Con for TPU: Still might be a bit harder to program than GPUs/CPUs.\n", "\n", " * For example, if you are writing your code in PyTorch, TPU support goes through the separate PyTorch/XLA package rather than working out of the box the way GPU support does." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Other ML accelerators.\n", "\n", "Intel's **Nervana Neural Network Processor** (NNP). (https://www.intel.ai/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon/)\n", "\n", "* New AI hardware devices such as the Neural Compute Stick\n", "\n", "* A new class of hardware called \"Vision Processing Units\" (VPUs) for computer vision on edge devices\n", "\n", "* Intel recently acquired Habana Labs and has been building new AI processors\n", "\n", "* **Main take-away here: there's buy-in for the idea of ML accelerator hardware from the biggest players in the computer processor space.**" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Apple's Neural Engine** within the A11 Bionic system-on-a-chip (and subsequent chips) for neural networks on iPhones.\n", "\n", "* There is general buy-in from device manufacturers that hardware support for neural networks is needed on mobile devices." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Many start-ups in the ML hardware space** are developing innovative new architectures.\n", "\n", "* There are too many chips to list...and unlike even last year, we're now seeing many of these chips ready to buy on the market or use in the cloud." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Questions?" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Course Summary and Open Questions\n", "\n", "Scaling machine learning methods is increasingly important.\n", "\n", "In this course, we addressed the high-level question:\n", "**What principles underlie the methods that allow us to scale machine learning?**\n", "\n", "To answer this question, we used techniques from three broad areas: statistics, optimization, and systems.\n", "\n", "We articulated three broad principles, one in each area.\n", "\n", "* **Statistics Principle: Make it easier to process a large dataset by processing a small random subsample instead.**\n", "\n", "* **Optimization Principle: Write your learning task as an optimization problem, and solve it via fast general algorithms that update the model iteratively.**\n", "\n", "* **Systems Principle: Use algorithms that fit your hardware, and use hardware that fits your algorithms.**\n", "\n", "Now, some open questions in scalable ML that relate to these principles." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Open problem: Reproducibility and debugging of machine learning systems.\n", "\n", "* Most of the algorithms we discussed in class are randomized, and randomized algorithms are hard to reproduce.\n", " * Even when we don't use explicitly randomized methods, floating-point imprecision can still make results difficult to reproduce exactly.\n", " * For hardware efficiency, the compiler loves to reorder floating-point operations (this is sometimes called \"fast math\" mode), which can introduce slight differences in the output of an ML system; the code cell below shows why reordering changes results.\n", " * As a result, even running the same learning algorithm on the same data on different ML frameworks _can_ result in different learned models!\n", "* Reproducibility is also made more challenging when hyperparameter optimization is used.\n", " * Unless you have the random seed, it's impossible to reproduce someone else's random search.\n", " * Hyperparameter optimization provides lots of opportunity for (possibly unintentional) cheating, where the test set is used improperly.\n", "* ML models are difficult to debug because they often **learn around bugs**." ] },
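{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As a small demonstration of this reordering effect (our own example, not tied to any particular framework): floating-point addition is not associative, so any optimization that changes the order of a sum can change its result." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# Floating-point addition is not associative, so reordering a sum\n", "# (as a fast-math compiler or a parallel reduction might do) changes the result.\n", "x = (0.1 + 0.2) + 0.3\n", "y = 0.1 + (0.2 + 0.3)\n", "println(x == y)   # false: the two orders disagree in the last bit\n", "println(x - y)    # a tiny difference, about 1.1e-16\n", "\n", "# The same effect appears when an array is summed in a different order.\n", "v = randn(10^6)\n", "println(sum(v) == sum(reverse(v)))   # usually false" ] },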
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Open problem: More scalable distributed machine learning\n", "\n", "* Distributed machine learning has a fundamental tradeoff with the batch size.\n", " * A larger batch size is good for systems, because there's more parallelism.\n", " * A smaller batch size is good for statistics, because we can make more \"progress\" per gradient sample. (For the same reason that SGD is generally better than gradient descent.)\n", "* Communication among workers is expensive in distributed learning.\n", " * We need provably robust ways of compressing this communication to use fewer bits; one simple approach is sketched in the code cell below.\n", "* The datacenters of the future will likely have many heterogeneous workers available.\n", " * How can we best distribute a learning workload across heterogeneous workers?\n", "* When running many workers in parallel, performance will start to be bound by **stragglers**: workers that take longer to finish their work than their counterparts. How can we deal with this while still retaining performance guarantees?" ] },
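{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For intuition, here is a minimal Julia sketch of one such compression idea: unbiased stochastic quantization of a gradient vector, in the spirit of published schemes such as QSGD. This is a toy illustration, not a production algorithm; the helper names `quantize` and `dequantize` are made up here." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# Toy sketch: unbiased stochastic quantization of a gradient vector.\n", "# Each coordinate is scaled, then randomly rounded to a nearby integer level\n", "# so that the dequantized vector equals the original in expectation.\n", "function quantize(g::Vector{Float64}, bits::Int)\n", "    s = maximum(abs.(g))          # scale factor, sent along with the codes\n", "    s == 0 && return zeros(Int8, length(g)), s\n", "    levels = 2^bits               # integer levels range over -levels:levels\n", "    x = g ./ s .* levels\n", "    lo = floor.(x)\n", "    p = x .- lo                   # probability of rounding up, so E[q] = x\n", "    q = Int8.(lo .+ (rand(length(g)) .< p))\n", "    return q, s\n", "end\n", "\n", "dequantize(q, s, bits) = s .* Float64.(q) ./ 2^bits\n", "\n", "g = randn(8)\n", "q, s = quantize(g, 2)             # each entry now fits in a few bits\n", "g2 = dequantize(q, s, 2)\n", "println(maximum(abs.(g .- g2)))   # error is at most s / 2^bits" ] },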
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Open problem: Robustness to adversarial examples.\n", "\n", "* It's easy to construct examples that fool a deep neural network.\n", "* How can we make our scalable ML methods provably robust to these types of attacks?" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Thank you!\n", "\n", "### That's all for this course. You all have been a pleasure to work with, despite the challenges remote learning has presented." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Julia 1.4.0", "language": "julia", "name": "julia-1.4" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.4.0" }, "rise": { "scroll": true } }, "nbformat": 4, "nbformat_minor": 2 }