CS6465: Emerging Cloud Technologies and Systems Challenges

Room TBA, Tuesday/Thursday 1:25-2:40

CS6465 is a PhD-level class in systems that tracks emerging cloud computing technology, opportunities and challenges. It is unusual among CS graduate classes: the course is aimed at a small group of students, uses a discussion-oriented style, and the main "topic" is actually an unsolved problem in computer systems. The intent is to think about how one might reduce that open problem to subproblems, learn about prior work on those, and extract exciting research questions.  The PhD focus centers on that last agenda element.

Piazza Discussion Page for this class: here.

Focus for Fall 2019: AI Sys

In the past few offerings, we've looked mostly at the way that new hardware is reshaping the cloud "edge", but the area didn't have a name.  By now it does: people are using the term AI Sys for the whole of "AI and Systems working together", and so this recent emphasis has been on "Edge AI Sys."  Very trendy!

In Fall 2019, our particular focus will be on the dialog between IoT Edge computing infrastructures and cloud-connected devices such as cameras and video sensors.  If you think about it, the IoT device is both dumb and brilliant.  It is dumb in the sense that it probably operates with a tight power budget, won't have a fast general purpose computer onboard, lacks AI accelerators like TPU and GPU and reconfigurable FPGA, and doesn't know the big picture (that single sensor might be one of many, perhaps in a cloud system supporting a smart highway or smart city, where situational context involves fusing data from many perspectives, in real-time).  Yet the camera is brilliant in other ways: it can magically pivot to point to a specific spot, pre-focus or adjust its hyperspectral imaging configuration for particular lighting or other goals, and take precisely the image the cloud needs.  That limited processor on board can still do data compression and storage, and might even be able to decide if the image is "interesting", given a model from the cloud of what it needs to know.  And it could send back thumbnails, holding the big images or video until the cloud decides to download them.
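The thumbnail-first idea in this paragraph can be sketched in a few lines of Python; everything here (the cloud-supplied model reduced to a score threshold, the frame ids) is invented for illustration:

```python
# Hedged sketch of the "is this frame interesting?" triage loop: the camera
# scores each frame against a small cloud-supplied model, uploads only
# thumbnails for interesting frames, and holds full frames locally until
# the cloud asks for them. All names and thresholds are made up.

def interesting(frame_score: float, threshold: float = 0.8) -> bool:
    # the "model from the cloud" reduces, for this sketch, to a threshold
    return frame_score >= threshold

def triage(frames: list[tuple[str, float]]) -> tuple[list[str], list[str]]:
    upload_thumbnails, hold_locally = [], []
    for frame_id, score in frames:
        if interesting(score):
            upload_thumbnails.append(frame_id)  # cloud may request the full frame later
        else:
            hold_locally.append(frame_id)
    return upload_thumbnails, hold_locally

thumbs, held = triage([("f1", 0.95), ("f2", 0.10), ("f3", 0.85)])
```

The point of the sketch is the division of labor: the device does cheap local filtering, and the expensive download decision stays with the cloud, which has the situational context.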

Conversely, the cloud has amazing superpowers that the sensor lacks: the situational context just mentioned.  Precomputed models for all sorts of things: recognizing every imaginable kind of vehicle (or even a specialized version for recognizing a particular vehicle).  Current situational context, fresh as of a few milliseconds ago, and drawing on data from many sources.  Databases with things like speaker accents for every style of speech used worldwide. 

To carry out tomorrow's applications, we'll need the cloud and the IoT edge devices to cooperate -- and this will require a new kind of AI Sys "operating system" for the edge.  In 2019 our goal will be to work together as speculative researchers to try and identify the key elements of the needed solution, and then read and discuss papers relevant to those topics.  The hope would be that this might be useful background for work in the area, no matter who you plan to work with, and even if you don't plan to do research on the systems side of AI Sys.

This course is open to ANYONE in the CS PhD program, no matter what their research area is, as long as they have enough systems background to read the papers and join in discussions.  Everyone else needs permission from the instructor.

CS6465 is only open to CS PhD students and to other students with the instructor's explicit permission.  To get that permission you need to convince Ken that

  1. You are following a research track (not necessarily in computer systems; it could be in any area and in fact the emphasis on moving intelligence into the edge of the cloud has many ML students interested in this topic).
  2. You have the needed background (mostly ugrad operating systems and/or databases, networking, stuff like that).

Now, in fact, this PhD thing is interpreted in a very flexible way.  Historically we have often had CS undergraduates in the class: students who were already solidly on a track leading to PhD research in graduate school.  But in contrast, we have almost never had MEng students in this class: CS5412 in the spring is a much better choice for an MEng student, because MEng is just not a research track. 

We welcome PhD students from programs like ECE, CAM, or other programs, as long as they have background similar to the CS PhDs.  But you really do need the background if you want to attend a course like this, so don't underestimate the importance of having a solid grounding in computer systems before attending this particular class.  It isn't that you would hate the experience; you just won't be able to contribute in a meaningful way without that grounding.

Students who needed permission and didn't get it won't be permitted to remain in the course even if the registrar system allows them to enroll.  Permission to take this course comes from the instructor, not the automated registrar computing system.

How does the class work?

CS6465 was introduced in 2018.  In this third offering, the focus will be on issues raised by moving machine learning to the edge of the cloud. In this phrasing, edge computing still occurs within the data center, but for reasons of rapid response, involves smart functionality close to the client, under time pressure.  So you would think of an AI or ML algorithm written in as standard a way as possible (perhaps TensorFlow, or Spark/Databricks using Hadoop, etc.).  But whereas normally that sort of code runs deep in the cloud, many minutes or hours from when data is acquired, the goal now is to keep the code unchanged (or minimally changed) and run it on the stream of data as it flows into the system, milliseconds after it was acquired.  We might also push aspects of machine learned behavior right out to the sensors.

This idea is a big new thing in cloud settings -- they call it "edge" computing or "intelligent" real-time behavior.  But today edge computing often requires totally different programming styles than back-end computing.  Our angle in cs6465 is really to try and understand why this is so: could we more or less "migrate" code from the back-end to the edge?  What edge functionality would this require?  Or is there some inherent reason that the techniques used in the back-end platforms simply can't be used in the edge, even with some sort of smart tool trying to help?

The goal of this focus on an intelligent edge is, of course, to motivate research on the topic.  As a systems person, Ken's group is thinking about how to build new infrastructure tools for the intelligent edge.  Those tools could be the basis of great research papers and might have real impact.  But others work in this area too, and we'll want to read papers they have written. 

Gaps can arise at other layers too.  For example, TensorFlow is hugely popular at Google in the AI/ML areas, and Spark/Databricks plus Hadoop (plus Kafka, Hive, HBase, Zookeeper, not to mention MatLab, SciPy, GraphLab, Pregel, and a gazillion other tools) are insanely widely used.  So if we assume that someone is a wizard at solving AI/ML problems using this standard infrastructure, but now wants parts of their code to work on an intelligent edge, what exactly would be needed to make that possible?  Perhaps we would need some new knowledge representation, or at least some new way of storing knowledge, indexing it, and searching for it.  This would then point to opportunities for research at the AI/ML level as well as opportunities in databases or systems to support those new models of computing.

CS6465 runs as a mix of discussions and short mini-lectures (mostly by Ken), with some small take-home topics that might require a little bit of out-of-class research, thinking and writing. There won't be a required project, or any exams, and the amount of written material required will be small, perhaps a few pages to hand in per week. Grading will mostly be based on in-class participation.

CS6465 can satisfy the same CS graduate requirements (in the systems area) as any other CS6xxx course we offer.  Pick the course closest to your interests, no matter what you may have heard.  CS6410 has no special status.

Schedule and Readings/Slides (The overall pattern will match that of the fall 2018 version, but the details are different)

Date | Topic | Readings, other comments on the topic | Thought questions
Thu Aug 29 1. Overview of our topic: bringing machine learning to the edge.  Just a get-to-know-you meeting.  Ken will probably show some Microsoft slides from a recent MSR faculty summit where they told us about Azure IoT Edge and "Intelligent Edge". Chris, Bharath and Ken wrote a white paper on the topic we discussed today:

Cloud-Hosted Intelligence for Real-time IoT Applications. Ken Birman, Bharath Hariharan, Christopher De Sa. 2019. SIGOPS Oper. Syst. Rev. 53, 1 (July 2019), 7–13. DOI:https://doi.org/10.1145/3352020.3352023

If interested in topics we touched on and keen to read more, good places to start are these Wikipedia pages:

* The Price of Anarchy
* The Tragedy of the Commons
* Causal ordering model.  The classic paper on this topic is here.
* Consistent cuts.  The classic paper on this topic is here.
Tue Sept 3 2. Coordination in an edge environment We touched on the idea of consistency in one specific situation, but not in a very systematic way. In my lecture I'll pose the problem in a more careful way, and then we will look at variations that have developed for different settings: data centers, wireless swarms. 

A classic practical question in distributed computing involves the best ways to "package" mechanisms that can guarantee consistency as well as fault-tolerance.  For example, our little robot might need to agree with some other robots on which will look into the possible insect problem involving the flower.  Here we have a consistent agreement question, but because connectivity is erratic and hence participants might briefly "vanish", it has fault tolerance dimensions too.

The core ideas we'll focus on underlie many tools that often look superficially different, and yet could really all reside on a single basic substrate, were the field to reach agreement inside itself about how that substrate should look.  So in fact, the things we will discuss are key to the way that mechanisms such as locking, data replication, notification mechanisms and many other "tools" work.  Believe it or not, this list really could even include blockchain! 

We probably will read some papers on this topic for Thursday but none are required for today.
At this point, it would be good to read these papers, which were mentioned above:

On the topic of vector clocks, the wikipedia link actually is quite a good reference.

Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system.
Commun. ACM 21, 7 (July 1978), 558-565. DOI=http://dx.doi.org/10.1145/359545.359563

K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3, 1 (February 1985), 63-75. DOI=http://dx.doi.org/10.1145/214451.214456
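To make the logical-time idea from Lamport's paper concrete, here is a minimal sketch of the clock update rule (the process names and event sequence are made up for illustration):

```python
# Minimal sketch of Lamport logical clocks: each process keeps a counter,
# increments it on local events and sends, and on message receipt takes
# max(local, received) + 1. This captures only the clock rule from the
# 1978 paper, nothing else; names are invented.

class Process:
    def __init__(self, name: str):
        self.name = name
        self.clock = 0

    def local_event(self) -> int:
        self.clock += 1
        return self.clock

    def send(self) -> int:
        self.clock += 1
        return self.clock  # timestamp piggybacked on the message

    def receive(self, msg_ts: int) -> int:
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

p, q = Process("P"), Process("Q")
p.local_event()   # P's clock: 1
ts = p.send()     # P's clock: 2; message carries timestamp 2
q.local_event()   # Q's clock: 1
q.receive(ts)     # Q's clock: max(1, 2) + 1 = 3
```

The key property: if event a causally precedes event b, then clock(a) < clock(b); the converse does not hold, which is exactly the gap vector clocks close.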

Thu Sept 5 3.  Packaging coordination primitives We'll wrap up our discussion of how to implement a coordination primitive and shift to how to "package" it for use by programmers.

This leads to a few different abstractions: atomic multicast, append-once logs (also known as Paxos), 2-phase commit, distributed barriers, publish-subscribe.  We'll have a look at the options and will think about the pros and cons of using them in our edge IoT setting.
This isn't a one-size fits all story but more of a question of thinking about the patterns of computing we will need to support and how each matches, or fails to match, with the likely use cases.

Just the same, we will discuss:

The Freeze-Frame File System. Weijia Song, Theo Gkountouvas, Qi Chen, Zhen Xiao, Ken Birman. ACM Symposium on Cloud Computing (SoCC 2016). Santa Clara, CA, October 05 - 07, 2016.
Tue Sept 10 4. RDMA and the FaRM KVS This will be a two-part lecture.  First we will be learning about a powerful hardware-accelerated communication option called RDMA.  The technology can mimic TCP or UDP, but can also support distributed shared memory that is as fast or perhaps even faster than your NUMA laptop (which has more than one core, and more than one memory "unit" -- it is in fact a miniature distributed computer on a single board).

Next, we will talk about the "thought problem" (see the panel to the right).  Please come to class with your best answer already prepared, and email a short description of your proposed solution to Ken, with a short explanation of why you think it is the best possible.  Short means: "no longer than necessary to make your points."  This said, Ken is hoping to see 1 or 2 page answers, not 10 or 20!  By preparing and emailing your solution before class, you will be ready for this discussion.  Sending it after class would be like waiting to know the answer, and only then writing it up: easier, but less useful to you in terms of learning by doing.  So try and get this to Ken before class.

Finally, we will study a use-case for RDMA: a "key-value" storage system called FaRM.  The concept of a key-value system is obvious: these are just hash tables (you hash the key to find the value).  The novelty is that a KVS in a distributed setting spreads the data over the nodes of the cluster ("sharding"), with each shard replicated for fault-tolerance.  Usually shards have 2 or 3 members (replicas).  FaRM goes one step further and offers a full transactional API: atomic updates to multiple (key,value) pairs and atomic multi-shard reads, implemented on RDMA.
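The shard-placement idea can be sketched as follows; this is a generic hash-based scheme for illustration, not FaRM's actual placement algorithm:

```python
# Hypothetical sketch of key -> shard -> replica placement in a sharded,
# replicated KVS (the general idea behind systems like FaRM; not FaRM's
# real scheme). Node names and the ring-placement rule are invented.

import hashlib

NODES = ["n0", "n1", "n2", "n3", "n4", "n5"]
REPLICATION = 3  # each shard is kept on 3 nodes

def shard_of(key: str, num_shards: int) -> int:
    # stable hash so every client maps the same key to the same shard
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % num_shards

def replicas(shard: int) -> list[str]:
    # place a shard on REPLICATION consecutive nodes around a ring
    return [NODES[(shard + i) % len(NODES)] for i in range(REPLICATION)]

num_shards = len(NODES)
s = shard_of("vehicle:plate:ABC123", num_shards)
owners = replicas(s)  # the nodes holding this key's shard
```

A transactional KVS like FaRM then has to coordinate atomically across several such shards, which is where the interesting protocol work lives.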
One wonderful classic paper we will eventually be thinking about as we discuss FaRM and Derecho is this:

Hints for computer system design. Butler W. Lampson. In Proceedings of the 9th ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48.

Even if you don't have time to read the "hints" paper, please do read this paper before class:

FaRM: Fast Remote Memory. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI'14). USENIX Association, Berkeley, CA, USA, 401-414. 2014.

Thought question: consider a system doing a series of simple 2-phase vote operations.  It has a leader P, and the membership {P, Q, ... } is fixed and known to every member.  No failures occur.

Each interaction is as follows: P issues a request ("Please prepare to do X").  Every member gets ready and then responds ("I am ready to perform operation X").  When P has received all responses, it authorizes the action ("Now you should perform X").  Later members report completion ("Finished with X") and then P finalizes ("X is finished").  This allows garbage collection of any temporary data.  Assume that X itself is a uniquely identified update operation and that 100 bytes of space are needed to represent it.

P is in a loop and will be doing 1 billion of these updates.  Our goal is to do them as fast as possible.

Now, assume we are given RDMA and that it has a fixed low one-way latency of 1us (1/1,000,000 second) and comes in a series of speeds: 5 gigabytes per second (written 5GB or 40G, the latter representing bits/second), 12.5GB, 25GB, and with speeds like 1.6T (200GB) anticipated.  Yet the latency won't change: it would be stuck at 1us for all of these data rates.  Moreover, assume that each RDMA device has a smallest packet size, proportional to its speed.  So for example if you send 100 bytes and the minimum packet size is 100 bytes, you could get perfect efficiency from the RDMA hardware.  But if you only sent 50 bytes on that device, it would automatically pad your request to 100, and 50% of the potential bandwidth would be wasted.

You can also assume that the network can perform at most 100M RDMA operations per second, in total, no matter what the individual RDMA operation actually does, who issues it or to whom, and no matter the rated total speed.  So a "faster" RDMA device moves more data per second using bigger packets, but actually has fixed latency (delays) from node to node, and in fact would max out at 100M RDMA operations if you somehow tried to ask it to do more.  Of course if you don't issue 100M of them per second for whatever reason, this limit wouldn't matter.

What is the best implementation possible for P's series of updates, given this particular hardware setup?
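As a sanity check on the arithmetic (not an answer to the question), here is a tiny helper that computes, from the numbers given in the problem statement, how many 100-byte updates must ride in each RDMA operation before a given link speed can be saturated under the 100M ops/sec cap:

```python
# Back-of-envelope helper for the thought question above. Uses only the
# stated numbers: 100-byte updates, a cap of 100M RDMA ops/sec, and a
# minimum packet size proportional to link speed. Illustrative only.

import math

UPDATE_BYTES = 100
MAX_OPS_PER_SEC = 100_000_000

def min_batch_to_saturate(link_bytes_per_sec: int) -> int:
    # at the op-rate cap, each op must carry this many bytes to fill the link
    bytes_per_op = link_bytes_per_sec / MAX_OPS_PER_SEC
    # round up to whole updates; always send at least one
    return max(1, math.ceil(bytes_per_op / UPDATE_BYTES))

# 5 GB/s: 50 bytes/op at the cap, so single 100-byte updates already suffice
# 200 GB/s: 2000 bytes/op at the cap, so ~20 updates must be batched per op
```

The takeaway the question is fishing for: as links get faster at fixed latency and a fixed op-rate cap, batching updates into larger RDMA operations becomes the only way to keep the hardware busy.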
Thu Sept 12 5. Ken's Derecho system. Ken will be using this as an opportunity to test-drive a slide set for a tutorial he is going to teach on Derecho at the 2019 SOSP in a few weeks.  Like FaRM, Derecho is an RDMA technology, although it also works with TCP if RDMA is not available.   But while Derecho does support a key-value object store, the actual core of the system focuses on a group communication model.  We'll discuss the motivation, the model, the implementation and the performance.  Then we'll see some future opportunities related to aspects of Derecho that might be theoretically optimal, but that in fact in a practical sense are probably not remotely as fast as they could be. Derecho: Fast State Machine Replication for Cloud Services. Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Robbert Van Renesse, Sydney Zink, and Kenneth P. Birman. 2019. ACM Trans. Comput. Syst. 36, 2, Article 4 (April 2019), 49 pages. DOI: https://doi.org/10.1145/3302258
Tue Sept 17 6.  FarmBeats (guest lecture)  
Thu Sept 19 7.  Derecho object store (guest lecture)  
Tue Sept 24 8.  Zookeeper:  A deeper dive.  ZAB protocols.  Zookeeper API. A simple totally ordered broadcast protocol. Benjamin Reed and Flavio P. Junqueira. 2008. In Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware (LADIS '08). ACM, New York, NY, USA, Article 2, 6 pages. DOI=http://dx.doi.org/10.1145/1529974.1529978

The life and times of a zookeeper. Flavio P. Junqueira and Benjamin C. Reed. 2009. In Proceedings of the 28th ACM symposium on Principles of distributed computing (PODC '09). ACM, New York, NY, USA, 4-4. DOI: https://doi.org/10.1145/1582716.1582721 
ZooKeeper: Distributed Process Coordination. Flavio Junqueira and Benjamin Reed. 2017, O'Reilly. ISBN-13: 978-1449361303. ISBN-10: 1449361307 Apache Zookeeper Site: https://zookeeper.apache.org
Question: Is ZAB described in a clear and convincing way in this LADIS paper, or the PODC paper?

ZAB is a multicast protocol, not a durable storage protocol, but Zookeeper uses periodic checkpoints once every five seconds to provide persisted storage.  Does this policy actually solve Paxos?  If not, what are some of the ways that an application might notice the difference?
Thu Sept 26 9.  The two results we will talk about are: the FLP impossibility result and the weakest failure detector for guaranteeing progress.   Impossibility of distributed consensus with one faulty process. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. In Proceedings of the 2nd ACM SIGACT-SIGMOD symposium on Principles of database systems (PODS '83). ACM, New York, NY, USA, 1-7. 

The weakest failure detector for solving consensus. Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. 1996. J. ACM 43, 4 (July 1996), 685-722. 

Slides are here.
This lecture is about some very difficult, yet important, mathematics.  Think of it as a one-day theoretical "side tour" into the mathematics of the distributed computing area.   

Some CS6465 students lack the background to read these kinds of papers.  Don't panic!  Not everyone is prepared to read this sort of very difficult theory.  Even so, it can be useful to know about it.  If you can't follow the math, just try and understand the basic idea of what they are saying.

A thought question: what does "impossibility" mean in the FLP paper?
Tue Oct 1 10.  What abstractions will edge programmers actually want? Remote regions: A simple abstraction for remote memory. Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novakovic, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei.  NSDI 2018.

Slides from this talk.
We'll be looking at an ultra-simple remote memory idea, built on RDMA at VMWare.
Thu Oct 3 11.  Continued discussion of the roles that technologies like the ones we've discussed in class up to now might play.

We will situate our discussion in the context of an edge infrastructure using modern function computing models.
"Cloud functions" are a hot new model that seems to be the next big thing for programming cloud applications.  You can read about Azure Function Server, Function PaaS models, or Amazon AWS Lambda to see examples.

A cloud function is really just a short-running program triggered by some kind of event (think of a remote method invocation), that does anything you like, and then terminates.  These functions don't retain any local data (they are "stateless") but they can definitely write to the file system or to a database, etc.  They just don't create local data structures that would be used on the next event -- each event sees a "clean" initial state.

Functions can normally be coded in languages like Python, although Microsoft prefers C# .NET or F# .NET.   Functions run as programs inside container environments, and really are no different from other programs in private virtual machines.  But the model is intended to be very lightweight, with millisecond startup delays, and very elastic: "pay for cycles you actually use."
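Here is a minimal sketch of the stateless-function model described above, with an invented event shape and an in-memory stand-in for the durable store (neither is any real platform's API):

```python
# Hedged sketch of a stateless "cloud function": triggered per event, it
# retains no local state between invocations; anything durable goes to an
# external store. The event fields and store interface are made up.

def handle_event(event: dict, store) -> dict:
    # each invocation starts from a clean state -- no data structures survive
    image_id = event["image_id"]
    thumbnail = event["thumbnail_bytes"]
    # write-through to durable storage; nothing is kept in the function
    store.put(f"thumb/{image_id}", thumbnail)
    return {"status": "stored", "image_id": image_id}

class DictStore:
    # in-memory stand-in for a real database or blob store
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

store = DictStore()
result = handle_event(
    {"image_id": "cam7-0042", "thumbnail_bytes": b"\x89PNG..."}, store)
```

Statelessness is what buys the platform its elasticity: because no invocation depends on a previous one's local data, the system can spin handlers up and down freely.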

Functions run on very basic VMs, and hence don't have direct local access to things like GPU accelerators (one can definitely access a GPU accelerator from a container VM if the system is set up to allow that, but a function "server" wouldn't be configured that way).  Instead, think of a function as the director of an orchestra: it sends various tasks on their merry way, but does little direct work of its own.  So for GPU tasks, a function would typically hand objects like images off to GPU servers that have accelerators attached to the server nodes.  This is a source of delay, but common in today's solutions.

There isn't any special reading for this lecture.  The goal is simply to get a better sense of what functions can do, and what they can't do, and how this informs the broader issue of programming the cloud IoT edge and the reactive outer tier of the cloud itself.

Tue Oct 8 12.  Spark RDDs and file system caching performance.

Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010.

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, OSDI 2008, December 2008. 

Spark is the world champion for "big data" computing, but normally runs in a batch style and is viewed as a kind of back end layer of the cloud, doing big computations offline that you'll later draw on through massive files that represent the output (machine-learned models, precomputed indices, etc).  The RDD model is a cool and widely popular example of a different kind of function PaaS, even though they don't really pitch it that way.  In fact Spark RDDs could be of real interest near the edge, even without MapReduce (RDDs can be used from SciPy, GraphLab, MatLab, Mathematica...)

Questions to think about: RDDs give Spark a big benefit for Hadoop jobs, but those are used mostly in the back-end of the data center for analytics.  Could there also be an edge opportunity?  What would reuse of RDDs at the edge require?
Thu Oct 10 13.  Temporal queries.  The concept of adequate caching Same papers This topic is really based on work Theo Gkountouvas has been doing on caching for systems that deal with queries to temporal data:  Data with a time index, like a machine learning system fitting data to an autoregressive integrated moving average (ARIMA) model.
(Oct 12 - Oct 15) Fall break, no class Autumn in Vermont Let's hope for amazingly colorful leaves!  (It somewhat depends on the timing of the first real frost: we need one or two nights of really cold weather to trigger a "flash" from green to bright colors, and that doesn't happen some years.)
Thu Oct 17 14. The amazing power of GPUs and GPU clusters.  CUDA.  Dandelion: Programming tool for GPU management. Dandelion: a compiler and runtime for heterogeneous systems. Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 49-68. DOI: https://doi.org/10.1145/2517349.2522715

slides
This topic is kind of a pivot for us.  We'll start to look at hardware accelerators we might want to attach to our edge computing infrastructure.  GPUs are normally programmed using a language called CUDA, but there is a perception that CUDA is a barrier to widespread exploitation of the technology.  Dandelion is one example of a response (not super successful, but very well explained).
Tue Oct 22 15.  Challenges of integrating GPUs with other parts of the O/S stack, RDMA, etc.

 

GPUfs: the case for operating system services on GPUs. Mark Silberstein, Bryan Ford, and Emmett Witchel. 2014. Commun. ACM 57, 12 (November 2014), 68-79. DOI: https://doi.org/10.1145/2656206

slides

Once you are building the GPU service itself, you need ways to get to data, and to the network.  Here are a few additional papers on other aspects of those topics.

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs Shai Bergman, Tanya Brokhman, Tzachi Cohen, Mark Silberstein. USENIX ATC, 2013.

GPUnet: Networking Abstractions for GPU Programs.
Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, and Emmett Witchel, Amir Wated and Mark Silberstein. OSDI 2014.

Thu Oct 24 16.  TensorFlow, both as a tool to control GPU or TPU computations, and as a more general programming language for distributed computing.

TensorFlow: A System for Large-Scale Machine Learning
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265-283.

TensorFlow is portrayed as a distributed systems tool, and it certainly supports useful forms of compositionality.  Nonetheless, it turns out to be primarily used just on a single machine at a time to manage applications that talk to local hardware like GPU clusters or TPU accelerators.  Beyond looking at the work as they present it in this paper and in slides, in this lecture we will discuss the (mixed) experiences people have had with TensorFlow as a true distributed programming tool.

 

slides

TensorFlow emerged in part because Google has some skepticism about the CUDA + GPUfs + GPUnet concept. 

Mostly, TensorFlow is used on some single computer to control a single attached GPU or TPU cluster.  But it can also support fault tolerant distributed computing, in its own unique style.  Nobody really knows how effective it is in that fancier style of use.
Tue Oct 29 17.  Google paper on large-scale distributed learning.

Weijia Song will be teaching on this date because Ken will be up at SOSP 2019
Large Scale Distributed Deep Networks. Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng.  Neural Information Processing Systems (NIPS) 2012. This paper predates the TensorFlow papers by a few years, so it isn't necessarily an example of a system that could be implemented using TensorFlow.  But it does illustrate what Google has been doing in this setting.
Thu Oct 31 18. Berkeley's recent work on "Ray", a system for optimizing distributed machine learning tasks. Ray: A Distributed Framework for Emerging AI Applications. Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, UC Berkeley.  OSDI 2018. The U.C. Berkeley group wasn't convinced that TensorFlow and RDDs solve every edge computing need.  Ray is their recent proposal for an edge processing language oriented towards AI applications.
Tue Nov 5 19. Routing data through an FPGA: the Catapult model. A reconfigurable fabric for accelerating large-scale datacenter services.  Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. ISCA 2014. Published as SIGARCH Comput. Archit. News 42, 3 (June 2014), 13-24.

Version of the slides.
GPU and TPU clusters have the "advantage" of being basically similar to general purpose computers, except for supporting highly parallel operations in hardware (ones matched to the needs of graphics programming, or tensor transformations).  But there are other interesting accelerators, too.

We'll look at FPGA, which is a kind of hardware "filter" and "transformation" unit you can place right on the wire.

Also interesting: Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. Eric S. Chung, et al.  IEEE Micro 2018.  DOI:10.1109/mm.2018.022071131
Thu Nov 7 20.  Clusters of FPGAs and their relevance to ML/AI.  Microsoft's datacenter of FPGAs model.

A cloud-scale acceleration architecture Adrian M. Caulfield; Eric S. Chung; Andrew Putnam; Hari Angepat; Jeremy Fowers; Michael Haselman; Stephen Heil; Matt Humphrey; Puneet Kaur; Joo-Young Kim; Daniel Lo; Todd Massengill; Kalin Ovtcharov; Michael Papamichael; Lisa Woods; Sitaram Lanka; Derek Chiou; Doug Burger 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Here is what we get with "grown up" FPGAs, but the topic is fairly complex.  The key idea is that if you have enough FPGAs you can create big clusters that function as powerful hardware supercomputers for certain tasks, like audio (speech) and image (vision).  People have been figuring out how to map deep neural networks into FPGA clusters.

The work is quite technical and we'll sort of skim it, with the goal of just being able to think about what an edge needs to look like if it will use tricks like this for "amazing performance."
Tue Nov 12 21. Clusters of FPGAs and their relevance to ML/AI.  Microsoft's datacenter of FPGAs model.

 

Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. Johann Hauswald et al. (ASPLOS 2015). Published as SIGARCH Comput. Archit. News 43, 1 (March 2015), 223-238. DOI: https://doi.org/10.1145/2786763.2694347

Journal version of the same paper: here. This has a little more detail and includes some additional experiments, but the shorter conference version is probably fine unless you find something puzzling or incomplete and want to read a little more.

 
Thu Nov 14 22. In-class brainstorming session! Instead of discussing a paper, we will discuss your ideas for the homework problem Ken assigned on Piazza (also summarized to the right). We have seen that applications might want to configure some mixture of data paths (for RDMA, perhaps), GPUs with their own onboard memory (GMEM), FPGAs as bumps in the wire, FPGAs as chips with architectures that might be like GPU or even like a general purpose processor, etc.  Each needs its own software installed before it can do whatever we might want to ask from it -- a GPU can't invert a matrix unless you preload the matrix inversion code, hit the proper reset sequence, load the data, and then press "run" (all done in software, of course, from the host machine).  There are even ways to load code right into switches and routers and NICs.

But we haven't seen a nice programming language to control the setup.

So: imagine that we have a cloud data center hosting multiple applications that share the hardware.  Each application has access to this special-purpose "programming language" for telling the system what setup it wants in terms of the "wiring diagram" of network and devices, the code that needs to be loaded, the parameters to supply.  We might also need elasticity: "for each instance of my X role, I require 1 GPU and 2 FPGA chips configured as follows..."

Either show how this could be done with the standard JSON-style configuration files we have all run into many times, or invent a new specialized programming language we can use for this task.

Email your proposal to Ken, and plan to discuss it in class on Thursday.  There is no correct single answer.  We are just looking for awesome ideas!
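To make the JSON-style option concrete, here is one hypothetical sketch of what such a "wiring diagram" might look like, expressed as a Python dictionary and serialized to JSON.  Every field name here is invented for illustration -- there is no standard schema for this kind of hardware setup description, and a real system would need far richer validation:

```python
import json

# Hypothetical setup description for one elastic application role.
# All field names are invented for illustration purposes.
role_config = {
    "role": "video-analytics-worker",
    "replicas": {"min": 2, "max": 16},          # elasticity bounds
    "devices": [
        {"kind": "gpu", "count": 1, "gmem_gb": 16,
         "preload": "matrix_inversion.cubin"},   # code to load before use
        {"kind": "fpga", "count": 2, "mode": "bump-in-the-wire",
         "bitstream": "filter_v3.bit"},
    ],
    "datapaths": [
        # request an RDMA path between each worker and the aggregator role
        {"from": "video-analytics-worker", "to": "aggregator",
         "transport": "rdma", "min_gbps": 40},
    ],
}

def validate(cfg):
    """Minimal sanity checks a deployment controller might run."""
    assert cfg["replicas"]["min"] <= cfg["replicas"]["max"]
    for dev in cfg["devices"]:
        assert dev["kind"] in {"gpu", "fpga", "smartnic"}
        assert dev["count"] >= 1
    return True

validate(role_config)
print(json.dumps(role_config, indent=2))
```

The interesting design questions start where plain JSON runs out: expressing the elasticity rule ("per instance of role X...") and the ordering of setup steps (reset, load, run) really wants something more like a program than a static document.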
Tue Nov 19 23. Software as an out-of-band control plane for data flows.  Barrelfish and Arrakis.  IX.

The multikernel: a new OS architecture for scalable multicore systems.  Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009.  In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP '09). ACM, New York, NY, USA, 29-44. 

Arrakis: The Operating System Is the Control Plane. Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2015. ACM Trans. Comput. Syst. 33, 4, Article 11 (November 2015), 30 pages.

Slides.

These are two famous operating systems papers that argue for new OS designs aimed at better management of modern hardware.  We've been discussing this question in terms of how code on a general purpose CPU might control subsystems on FPGA or GPU on the same node.  Barrelfish (the first paper) is about the resulting O/S structure.  The second paper, Arrakis, removes the kernel from the data path: applications get direct, protected access to devices (including RDMA-capable NICs), with the OS acting purely as a control plane.
Thu Nov 21 25.  Machine Learning for Systems and Systems for Machine Learning.  The Case for Learned Index Structures.  Tim Kraska, Alex Beutel, Ed Chi, Jeff Dean, Neoklis Polyzotis.  Google Brain.

The slides for this paper.

More Google Slides on this whole ML everywhere concept.
Last week we talked about the puzzle of optimally mapping tasks to hardware and how we would need to estimate things like job size and object sizes and so forth.  Once you have a system like Barrelfish, those would seem like the next steps in using it.  The slide set is from Jeff Dean's keynote talk at NIPS 2017, while the paper (referenced in the slides) is the deep dive on one concrete topic he mentions.  Seems like a good fit for us!
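The core trick in the learned-index paper can be sketched in a few lines: fit a model that predicts a key's position in a sorted array, then correct the prediction with a search bounded by the model's worst-case error.  A toy version, assuming a simple linear model rather than the staged ("recursive") models the paper actually uses:

```python
# Toy learned index: a linear model predicts a key's position in a
# sorted array; a bounded local search corrects the prediction.
# The paper uses staged models; this is the simplest possible variant.

keys = sorted(range(0, 10000, 7))           # the sorted key set
n = len(keys)

# Fit position ~ a*key + b by closed-form least squares.
mean_k = sum(keys) / n
mean_p = (n - 1) / 2
cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
var = sum((k - mean_k) ** 2 for k in keys)
a = cov / var
b = mean_p - a * mean_k

# The worst-case prediction error fixes the size of the search window.
err = max(abs(i - (a * k + b)) for i, k in enumerate(keys))

def lookup(key):
    """Predict the position, then scan only within the error bound."""
    guess = int(a * key + b)
    lo = max(0, guess - int(err) - 1)
    hi = min(n, guess + int(err) + 2)
    for i in range(lo, hi):
        if keys[i] == key:
            return i
    return None

assert lookup(keys[123]) == 123
assert lookup(5) is None    # 5 is not in the key set
```

The systems angle is exactly the one we've been circling: the "index" becomes a model whose accuracy, not just its asymptotic complexity, determines lookup cost -- and that model could run on a GPU or FPGA.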
Tue Nov 26 26. Papers from SOSP 2019
PipeDream: Generalized Pipeline Parallelism for DNN Training Deepak Narayanan (Stanford University), Aaron Harlap (Carnegie Mellon University), Amar Phanishayee (Microsoft Research), Vivek Seshadri (Microsoft Research), Nikhil R. Devanur (Microsoft Research), Gregory R. Ganger (CMU), Phillip B. Gibbons (Carnegie Mellon University), Matei Zaharia (Stanford University) .  SOSP 2019.

Slides
This is a paper on matching the computation graphs used for DNN training to the available GPU hardware in order to maximize the utilization of the GPU platforms.  It led me to think about the way that we map applications to GPUs and the sense in which this can be a bit like compiler optimizations for a programming language.
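The pipelining idea at the heart of the paper is easy to sketch: split the model's layers into stages, one per GPU, and push many microbatches through so the stages overlap.  A toy utilization calculation, assuming equal-cost stages (PipeDream's real contribution -- balanced partitioning plus weight versioning for the backward pass -- is much more involved):

```python
# Toy pipeline-parallel schedule: S stages (one per GPU), M microbatches.
# With pipelining, the whole batch takes S + M - 1 "ticks" rather than
# S * M, so utilization approaches 100% as M grows.  Equal-cost stages
# are assumed throughout.

def pipeline_ticks(num_stages, num_microbatches):
    # Stage s processes microbatch m at tick s + m (0-indexed),
    # so the final piece of work finishes at tick (S-1) + (M-1).
    return num_stages + num_microbatches - 1

def utilization(num_stages, num_microbatches):
    busy = num_stages * num_microbatches          # stage-ticks of real work
    total = num_stages * pipeline_ticks(num_stages, num_microbatches)
    return busy / total

print(utilization(4, 1))    # no pipelining: 0.25
print(utilization(4, 32))   # deep pipeline: ~0.91
```

The gap between 0.25 and ~0.91 is the "pipeline bubble" the paper is fighting; unbalanced stages make the bubble worse, which is why the partitioning step matters so much.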
(Nov 27-Dec 1) Thanksgiving holiday  
Tue Dec 3 27. Papers from SOSP 2019
Niijima: Sound and Automated Computation Consolidation for Efficient Multilingual Data-Parallel Pipelines. Guoqing Harry Xu (UCLA), Margus Veanes (Microsoft Research), Michael Barnett (Microsoft Research), Madan Musuvathi (Microsoft Research), Todd Mytkowicz (Microsoft Research), Ben Zorn (Microsoft Research), Huan He (Microsoft), Haibo Lin (Microsoft).  SOSP 2019.

Slides
This is a paper on optimizing applications that mix code from more than one language and in fact, more than one kind of computing system.
Thu Dec 5 28. Papers from SOSP 2019
Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis. Haichen Shen (Amazon Web Services), Lequn Chen (University of Washington), Yuchen Jin (University of Washington), Liangyu Zhao (University of Washington), Bingyu Kong (Shanghai Jiao Tong University), Matthai Philipose (Microsoft Research), Arvind Krishnamurthy (University of Washington), Ravi Sundaram (Northeastern University).  SOSP 2019.

Slides
This is a paper that looks at accelerated video analysis using GPU clusters.
Tue Dec 10 29. Papers from SOSP 2019
Lineage Stash: Fault Tolerance Off the Critical Path. Stephanie Wang (UC Berkeley), John Liagouris (ETH Zurich), Robert Nishihara (UC Berkeley), Philipp Moritz (UC Berkeley), Ujval Misra (UC Berkeley), Alexey Tumanov (UC Berkeley), Ion Stoica (UC Berkeley).  SOSP 2019.

Slides
This paper looks at quick ways to recover missing chunks of results from complicated computations where rolling back to the start and redoing everything would be way too expensive.
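The lineage idea itself can be sketched very simply: record, for each output object, the task and inputs that produced it, and on failure recompute only the lost pieces rather than restarting the job.  A minimal illustration with an invented API (the paper's actual contribution is keeping this log off the critical path of execution):

```python
# Minimal lineage-based recovery: each object records the function and
# input object IDs that produced it.  If an object is lost, we rebuild
# it (recursively) from its lineage instead of rerunning everything.
# This API is invented for illustration only.

store = {}     # object_id -> value (a simulated object store)
lineage = {}   # object_id -> (function, input_ids)

def run(obj_id, fn, *input_ids):
    """Execute a task, storing both its result and its lineage record."""
    lineage[obj_id] = (fn, input_ids)
    store[obj_id] = fn(*(store[i] for i in input_ids))
    return obj_id

def get(obj_id):
    """Fetch an object, reconstructing it from lineage if it was lost."""
    if obj_id not in store:
        fn, input_ids = lineage[obj_id]
        store[obj_id] = fn(*(get(i) for i in input_ids))
    return store[obj_id]

run("a", lambda: 21)
run("b", lambda: 2)
run("c", lambda x, y: x * y, "a", "b")

del store["c"]          # simulate losing one result...
del store["a"]          # ...and one of its inputs
assert get("c") == 42   # recomputed transitively from lineage
```

The trade-off the paper explores is where this lineage log lives: logging synchronously to stable storage is safe but slow, while the "stash" keeps it local and flushes lazily, moving the cost off the critical path.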