CS6465: Emerging Cloud Technologies and Systems Challenges

CS6465 is a PhD-level class that will look closely at how technology shifts are creating new challenges and research opportunities in the cloud computing and distributed systems arena, and especially at how this is bringing the machine learning community together with the systems community. The intent is to to undertake an informed, fairly deep exploration of how hardware is reshaping the research prospects for applications that are dependent on hardware characteristics. Although the focus of the class will be on systems, the implications of these shifts reach into databases, Internet of Things, machine learning, NLP, robotics, and many other domains. The course is unusual in that we will have extensive class discussions that will look at emerging cloud application areas (such as "smart highways") and think about the technology shifts in terms of how such applications might use the new hardware and software models. Thus unlike a class that mostly reads papers, this class will mostly discuss papers from a fairly "applied" perspective, with these new use cases as the source of the applications.

Hardware we will consider includes new storage technologies, RDMA networking, FPGA computing, and GPUs. Also of interest are programmable network switches and routers, and network interface cards with customization capabilities. For completeness we will touch on Intel's SGX architecture.

Schedule and Readings/Slides

Date Topic Readings, other comments on the topic
Tue Aug 22 1. Overview of our topic: bringing machine learning to the edge. Our first meeting will center on breaking into groups and brainstorming about how one might build a smart highway system that captures data and uses machine learning to understand what the video cameras are seeing. U.C. Berkeley has a new center focused on this topic. Start by thinking about small numbers of cameras, still images, simple use case.
Thu Aug 24 2. Challenges of "online data filtering" for machine learning systems that capture large amounts of data. Brief lecture on the methods used for video analysis, followed by more discussion. We will expand on the still-image scenario from lecture 1. Now we have far more cameras, they capture video, not just still photos, and we are trying to do far more with them. What pressures does this put on the system infrastructure?
Tue Aug 29 3. Discussion of asynchronous machine learning techniques. Goal is just to understand the computational models favored in such systems. What patterns of communication do these induce?

Although we will not discuss them, Chris De Sa and Joey Gonzalez pointed to these and a few other papers that capture the general flavor of the algorithms they are working with:

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright. 2011.

Distributed Parallel Inference on Large Factor Graphs. Joseph E. Gonzalez Yucheng Low, Carlos Guestrin, David O'Hallaron. 2009.

Thu Aug 31 4. Distributed hash tables for scalability. Building DHT-hosted data pipelines that have strong consistency guarantees yet minimize pausing (locking, or other kinds of protocol-introduced wait states)

Chord: A scalable peer-to-peer lookup service for internet applications. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. 2001. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications (SIGCOMM '01). ACM, New York, NY, USA, 149-160. DOI=http://dx.doi.org/10.1145/383059.383071

In-class discussion on mapping our smart highway with video feeds into a sharded DHT representation and building a solution that is smart about its use of memory and communication. Could we use "sharding" to spread the work broadly in a smart memory?

Tue Sept 5 5. The Derecho System. Smart Memory concept over RDMA. Derecho overview, RDMC layer. Use of sharding / DHT model in Derecho.

Building Smart Memories and Scalable Cloud Services with Derecho. Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weija Song, Edward Tremel, Sydney Zink, Kenneth P. Birman, Robbert van Renesse. Submitted for publication, September 2017.

Thu Sept 7 6. Core of Derecho performance: Asynchronous protocols on the Derecho's SST layer. Agreement on ordering, terminating a disrupted epoch and switching to a new epoch.

Main focus here is on understanding the idea of monotonicity as used in the Derecho SST predicates.

In-class discussion topic: can a similar notion of asynchronous decision-making be created for our smart highway application? How does temporal urgency for decision-making play against the idea of receiver-oriented batching?

Tue Sept 12 7. State machine replication is one of several important models for consistency. The other really big one is the transactional model from databases. We will review the idea, with a focus on how transactions can be used in distributed machine learning systems, notably to maintain and update Bayesian graphs.

Most people in CS6465 will have learned about transactions in prior classes, but this lecture is just to get everyone onto the same page. To keep the discussion focused, we will look primarily at transactions on graphs, notably the forms of graphs seen in social-networking settings and in Bayesian machine learning systems.

In class discussion: can transactions of these kinds of graphs be implemented as non-blocking asynchronous flows? Or must they use blocking?

Thu Sept 14 8. The FaRM DHT (Microsoft)

FaRM: Fast Remote Memory. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. . In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI'14). USENIX Association, Berkeley, CA, USA, 401-414. 2014

Tue Sept 19 9. CMU's Herd DHT

Using RDMA Efficiently for Key-Value Services. Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 295-306. DOI: https://doi.org/10.1145/2619239.2626299

Discussion: In Herd, the use of RDMA primitives differs, with a focus on in-lining and tradeoffs between one-sided and two-sided operations. But is this comparison fair, or have one of the two papers (FaRM versus Herd) "stacked the deck" by focusing on a case with non-standard structure?

Thu Sept 21 10. FaSST: A case for one-sided unreliable datagrams on RDMA

FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 185-201.

Building on our discussion of Herd, this lecture will focus on the value of unreliable datagrams in RDMA settings. Again, the core question centers on the fairness of the problem setup and evaluation. Would this same set of claims be relevant to Derecho? Should Derecho be changed to run SST protocols and SSTMC over two-sided UD RDMA?

Tue Sept 26 11. Microsoft's architecture for leveraging FPGA in Azure

A cloud-scale acceleration architecture Adrian M. Caulfield; Eric S. Chung; Andrew Putnam; Hari Angepat; Jeremy Fowers; Michael Haselman; Stephen Heil; Matt Humphrey; Puneet Kaur; Joo-Young Kim; Daniel Lo; Todd Massengill; Kalin Ovtcharov; Michael Papamichael; Lisa Woods; Sitaram Lanka; Derek Chiou; Doug Burger 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Software/hardware support for FPGA programming in Azure. This is probably out of the best possible order and should move downward in the class topics in future offerings of cs6465.

Thu Sept 28 12. How Facebook represents social network graphs

TAO: Facebook's Distributed Data Store for the Social Graph. Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov Dmitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani. 2013 USENIX Annual Technical Conference (USENIX ATC '13).

Here we are back to looking at a social networking graph and its representation. Thinking back to the models used in FaRM and Herd, to what extent do their transactional evaluations actually match the intended use case, which would probably correspond to TAO? Would a smart highway generate the same style of graphs and transactions?

Tue Oct 3 13. Facebooks caching architecture, and its backend Haystack image storage layer.

An Analysis of Facebook Photo Caching. Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 167-181. DOI: https://doi.org/10.1145/2517349.2522722

Finding a needle in Haystack: Facebook's photo storage. Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 47-60.

Continues on the theme from Thursday Sept 28: How does the Facebook infrastructure teach us things that would transfer to settings such as smart highway systems? Where are there notable differences (hint: think about real-time response requirements, data update patterns, the associated machine-learning model updates and where they need to be computed).

Thu Oct 5 14. Optimizing cache implementations to match properties of flash (SSD) storage hardware

RIPQ: Advanced Photo Caching on Flash for Facebook. Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. 2015. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST'15). USENIX Association, Berkeley, CA, USA, 373-386.

Here we look at how the advanced caching concepts simulated in the prior Facebook papers turned out to stumble when actually implemented on modern SSD, because of performance issues relating to the way that SSDs implement small random writes. Then the authors identify the best possible SSD write patterns and show how the same algorithms can be approximated in a hardware-friendly way.

Tue Oct 10 Fall break, no class Autumn in Vermont
Thu Oct 12 15. Leveraging network routers to obtain fast ordering for conflicting operations.

Just say NO to Paxos Overhead: Replacing Consensus with Network Ordering. Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 467-483.

This paper argues that totally ordered updates can be accelerated by taking advantage of features associated with advanced data center networks and routers. Question to discuss: if we think back to our discussion of consistency for transactions on social networking graphs or Bayesian knowledge graphs, that property centers on coordination and ordering. But does the pattern of updates we would see match the requirements of this NO Paxos concept, or did the NO Paxos designers actually assume something that would be poorly matched to that case? Would NO Paxos be likely to offer a useful speedup for transactions in FaRM or Herd / FaSST?

Tue Oct 17 16. Getting RDMA technology to run over Ethernet (RoCE)

Congestion Control for Large-Scale RDMA Deployments. Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM '15). ACM, New York, NY, USA, 523-536. DOI: https://doi.org/10.1145/2785956.2787484

TIMELY: RTT-based Congestion Control for the Datacenter. Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM '15). ACM, New York, NY, USA, 537-550. DOI: https://doi.org/10.1145/2785956.2787510

These two papers are the most famous ones for showing that RDMA can potentially run just as well on Ethernet (RoCE) as on Infiniband (which is like Ethernet, but has a credit-based permission to send mechanism that ensures that no message is ever transmitted unless the receiver has a buffer waiting to receive it into). The intriguing pattern here, shared with the RIPQ paper, centers on how the operating system software has to do part of the job: we have advanced hardware, but in order to "use it effectively" we need a special kind of partnership from the software side. Is this the main thing to expect in the future?

Thu Oct 19 17. Microsoft's experience with a large-scale RDMA on Ethernet (RoCE) deployment that uses DCQCN.

RDMA over Commodity Ethernet at Scale. Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over Commodity Ethernet at Scale. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM '16). ACM, New York, NY, USA, 202-215. DOI: https://doi.org/10.1145/2934872.2934908

This continues on the topic from Tuesday by looking at actual experience with DCQCN at large scale. The solution really works, but there were a lot of surprises that the DCQCN paper didn't anticipate. I am not aware of any large-scale experience with TIMELY yet, but it would be interesting to know whether that method also leaves some puzzles that will only show up when people really bet on it!

Tue Oct 24 18. Characteristics of NUMA computing systems. SOSP paper on how Linux kernel locking had to evolve in order to operate efficiently on NUMA platforms.

An Analysis of Linux Scalability to Many Cores. S. Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI10, 2010, Oct. 4-6 2010, Vancouver, BC, Canada.

There has been a lot of recent work on NUMA machines and the issues that arise when multicore (or even just multithreaded) applications use the locking hardware to manage concurrency. This paper has a nice review of the work and focuses on a number of Linux-hosted applications to ask how well they co-exist with NUMA locking. The key finding is that a fairly small number of changes are needed, but that "NUMA-blind" locking would indeed be quite slow.

Thu Oct 26 19. ASPLOS paper on efficiency of Corfu over NUMA architectures.

Black-box concurrent data structures for NUMA architectures.
Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, Marcos Aguilera.
To appear in ASPLOS 2017: 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Xi'an, China, April 2017.

This paper asks how to create a NUMA-friendly shared log (not exactly Corfu, but similar), and shows that there are some basic building blocks that make the task much easier. Not only does this yield a high performance logging tool, the primitives themselves are candidates to view as "operating systems help" for exploiting NUMA architectures. A question that arises in our "smart highways" use case is this: with very large data objects like videos, NUMA architectures, and RDMA, do additional complications enter the picture? Or would these kinds of building blocks be just what we want?

Tue Oct 31 20. [Matt] Software Transactional Memory, Herlihy's hardware accelerator concept

Transactional memory: architectural support for lock-free data structures. Maurice Herlihy and J. Eliot B. Moss. 1993. In Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, New York, NY, USA, 289-300. DOI=http://dx.doi.org/10.1145/165123.165164

A classic paper for our overall topic. Matt will present it and lead a discussion on the connections to modern machine learning, but the core issue is this: could we somehow get correct behavior from programs that run with concurrency and yet ignore locking? The basic idea centers on having the compiler automate the locking needed to provide atomicity but with help from a special kind of hardware accelerator.

Thu Nov 2 21. [Matt] Software Transactional Memory, using the L2 cache as a simple way to sense transactional conflicts  
Tue Nov 7 22. Graphics processing units (GPU). Challenge of sharing GPUs in cloud computing data centers

Dandelion: a compiler and runtime for heterogeneous systems. Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 49-68. DOI: https://doi.org/10.1145/2517349.2522715

This paper proposes an operating system to help with GPU program and memory management, and measures many of the key performance-limiting paths.

Thu Nov 9 23. Difficulty of getting RDMA to work with GPU memory


GPUfs: the case for operating system services on GPUs.
Mark Silberstein, Bryan Ford, and Emmett Witchel. 2014. Commun. ACM 57, 12 (November 2014), 68-79. DOI: https://doi.org/10.1145/2656206

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs Shai Bergman, Tanya Brokhman, Tzachi Cohen, Mark Silberstein. USENIX ATC, 2013.

GPUnet: Networking Abstractions for GPU Programs.
Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, and Emmett Witchel, Amir Wated and Mark Silberstein. OSDI 2014.

So here we have three more papers looking at issues that arise when treating a GPU system as a first-class participant in a data center. The first explores challenges of creating a file system for GPU. The two others focus on networking GPUs to other kinds of datacenter resources. What interests me is that it turns out to be so challenging: from the Dandelion paper you might have thought these issues were all solved. But no, not at all... In fact GPU is a "poor fit" into a classic computing environment and isn't easy to integrate with Linux-based host applications.

Tue Nov 14 24. [Drew] Machine learning for tasks like speech understanding using hardware accelerators.

Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. SIGARCH Comput. Archit. News 43, 1 (March 2015), 223-238. DOI: https://doi.org/10.1145/2786763.2694347

Drew will lead the discussion for this class. We'll be looking at how clusters of FPGA or GPU devices could yield breakthroughs for spoken language understanding.

Thu Nov 16 25. Spark/DataBricks

Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010.

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, OSDI 2008, December 2008.

Spark has been an incredible success for the machine-learning community. There are many Spark papers, but this one was sort of the classic. It looks at how Spark gains big speedups for Hadoop, which is the main "back end" system in many machine learning systems, by a mixture of smart use of caching plus smart scheduling. But I should probably emphasize "back end" because a big issue that arises concerns the relevance of Spark to our kinds of front-end real-time machine learning applications. Could a Smart Highway somehow leverage Spark as a real-time tool?

Tue Nov 21 26. Tensorflow.

TensorFlow: A System for Large-Scale Machine Learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265-283.

The big topic these days is TensorFlow: a Python dialect for building applications that process images and other kinds of machine-learned content. We'll limit ourselves to a big-picture overview of the language itself. There are many distributed computing and fault-tolerance challenges for the TF community, but the topic gets really obscure if you go too far down that path.

Thu Nov 23 Thanksgiving break, no class Image result for turkey icon
Tue Nov 28 27.Hardware security technologies: Intel SGX. How Microsoft modified MapReduce to create an SGX-secured version.

 

Oblivious Multi-Party Machine Learning on Trusted Processors Olya Ohrimenko, Felix Schuster, Cédric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, Manuel Costa, in The 25th USENIX Security Symposium 2016, USENIX, July 13, 2016,

I want to cover one paper on SGX, but also wanted it to tie into modern machine learning systems. Those, of course, often run Hadoop. So I settled on this nice result by Microsoft's Cambridge research group. They adapt Hadoop to run in SGX-secured enclaves.

Thu Nov 30

28. Disagregated Data Centers: What are the challenges?

Open discussion in the class: If the future data center isn't centered around conventional computers with standard O/S device separation but has shared devices, RDMA everywhere and with many "smart" devices, and the future application mix centers on machine learning under increasingly "online" use patterns, what kinds of open research questions do we need to tackle?

</