CS6465: Emerging Cloud Technologies and Systems Challenges

Hollister Hall Room 320, Tuesday/Thursday 1:25-2:40

CS6465 is a PhD-level class in systems that tracks emerging cloud computing technology, opportunities and challenges. It is unusual among CS graduate classes: the course is aimed at a small group of students, uses a discussion-oriented style, and the main "topic" is actually an unsolved problem in computer systems. The intent is to think about how one might reduce that open problem to subproblems, learn about prior work on those, and extract exciting research questions.  The PhD focus centers on that last agenda element.

In this second offering, we plan to focus on issues raised by moving machine learning to the edge of the cloud. In this phrasing, edge computing still occurs within the data center, but for reasons of rapid response, involves smart functionality close to the client, under time pressure.  So you would think of an AI or ML algorithm written in as standard a way as possible (perhaps TensorFlow, or Spark/Databricks using Hadoop, etc.).  But whereas normally that sort of code runs deep in the cloud, many minutes or hours from when data is acquired, the goal now is to keep the code unchanged (or minimally changed) and be able to run it on the stream of data as it flows into the system, milliseconds after it was acquired.  We might also push aspects of machine-learned behavior right out to the sensors.

This idea is a big new thing in cloud settings -- they call it "edge" computing or "intelligent" real-time behavior.  But today edge computing often requires totally different programming styles than back-end computing.  Our angle in CS6465 is really to try to understand why this is so: could we more or less "migrate" code from the back-end to the edge?  What edge functionality would this require?  Or is there some inherent reason that the techniques used in the back-end platforms simply can't be used at the edge, even with some sort of smart tool trying to help?

The goal of this focus on an intelligent edge is, of course, to motivate research on the topic.  As a systems person, Ken's group is thinking about how to build new infrastructure tools for the intelligent edge.  Those tools could be the basis of great research papers and might have real impact.  But others work in this area too, and we'll want to read papers they have written. 

Gaps can arise at other layers too.  For example, TensorFlow is hugely popular at Google in the AI/ML areas, and Spark/Databricks plus Hadoop (plus Kafka, Hive, HBase, Zookeeper, not to mention MATLAB, SciPy, GraphLab, Pregel, and a gazillion other tools) are insanely widely used.  So if we assume that someone is a wizard at solving AI/ML problems using this standard infrastructure, but now wants parts of their code to work on an intelligent edge, what exactly would be needed to make that possible?  Perhaps we would need some new knowledge representation, or at least some new way of storing knowledge, indexing it, and searching for it.  This would then point to opportunities for research at the AI/ML level as well as opportunities in databases or systems to support those new models of computing.

CS6465 runs as a mix of discussions and short mini-lectures (mostly by the professor), with some small take-home topics that might require a little bit of out-of-class research, thinking and writing. There won't be a required project, or any exams, and the amount of written material required will be small, perhaps a few pages to hand in per week. Grading will mostly be based on in-class participation.

CS6465 can satisfy the same CS graduate requirements (in the systems area) as any other CS6xxx course we offer.  Pick the course closest to your interests, no matter what you may have heard.  CS6410 has no special status.

Schedule and Readings/Slides

Date Topic Readings, other comments on the topic Thought questions
Thu Aug 23 1. Overview of our topic: bringing machine learning to the edge.  Just a get-to-know-you meeting.  Ken will probably show some Microsoft slides from a recent MSR faculty summit where they told us about Azure IoT Edge and "Intelligent Edge".  
Tue Aug 28 2. Consistency requirements for distributed machine learning at the edge.  Microsoft's FaRM system.  The first part of this meeting will focus on a discussion of what consistency should mean for real-time distributed machine learning systems running close to the edge of the cloud. 

The second part will dive in and look at the FaRM paper, in part keeping in mind our idea of what forms of consistency are needed.  The link is here:

FaRM: Fast Remote Memory. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI'14). USENIX Association, Berkeley, CA, USA, 401-414. 2014
 
Thu Aug 30 3. Is FaRM the ideal solution to the RDMA DHT problem?  HERD and FaSST.

We shouldn't take FaRM for granted.  So we'll look at the competitor!  But keep those questions about consistency in mind...  If you only read one, read the first of  these.

Using RDMA Efficiently for Key-Value Services. Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 295-306. DOI: https://doi.org/10.1145/2619239.2626299

FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 185-201.

 
Tue Sept 4 4. A different view on consistency: Leslie Lamport's causal ordering model and the Chandy/Lamport concept of a consistent cut.  Kulkarni timestamps.  Slides.   [Taught by Theo Gkountouvas because Ken will be out of town] Again, if you don't have time to read several of these, read the first one, or the first and the second one.  The two are both classics!

Time, clocks, and the ordering of events in a distributed system. Leslie Lamport. Commun. ACM 21, 7 (July 1978), 558-565. DOI=http://dx.doi.org/10.1145/359545.359563.

Distributed snapshots: determining global states of distributed systems. K. Mani Chandy and Leslie Lamport. ACM Trans. Comput. Syst. 3, 1 (February 1985), 63-75. DOI=http://dx.doi.org/10.1145/214451.214456.

Logical Physical Clocks. S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone. In Principles of Distributed Systems. Springer, 2014, pp. 17–32.
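
As a warm-up for the causal ordering reading, here is a minimal sketch of a Lamport-style logical clock in Python.  The class and method names (LamportClock, tick, send, receive) are illustrative assumptions, not taken from the papers, and the sketch ignores the physical-time component that the Kulkarni paper adds.

```python
class LamportClock:
    """Minimal logical clock (illustrative names, not from the paper)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # On receipt, move past both the local clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()                  # p: local event, clock becomes 1
stamp = p.send()          # p: send event, clock becomes 2
q.tick()                  # q: unrelated local event, clock becomes 1
recv = q.receive(stamp)   # q: max(1, 2) + 1 = 3
```

The send timestamp is smaller than the receive timestamp, which is the property the causal ordering model relies on: if one event can influence another, its timestamp is lower.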
 
Thu Sept 6 5.  Freeze Frame File System is built around the idea of offering Lamport's concept as the basis of a consistency model for stored files.  We'll see how this works.  Slides from Theo are here. [Taught by Theo Gkountouvas because Ken will be out of town]

The Freeze-Frame File System. Weijia Song, Theo Gkountouvas, Qi Chen, Zhen Xiao, Ken Birman. ACM Symposium on Cloud Computing (SoCC 2016). Santa Clara, CA, October 05 - 07, 2016.

 
Tue Sept 11 6. State Machine Replication and the Paxos model.  Introduction and overview.  Roles of Paxos in the Apache Hadoop "ecosystem".  Zookeeper model of how to make Paxos look like a file system.

Slides from Theo on Paxos protocols. 

Replication management using the state-machine approach. Fred B. Schneider. In Distributed systems (2nd Ed.), Sape Mullender (Ed.). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA 169-197.

The Part-time Parliament. Lamport, L. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133–169.

Paxos made Moderately Complex.  Robbert van Renesse and Deniz Altinbuken. ACM Comput. Surv. 47, 3, Article 42 (February 2015), 36 pages.

Not simple.. just think of Paxos as "2 1/2 phase commit used to deliver a message to every process, in order, with durability."  But keep in mind, this doesn't include extra phases that may be needed by the Proposer (leader) to resolve concurrency conflicts or to clean up after failures.
Thu Sept 13 7.   Zookeeper:  A deeper dive.  ZAB protocols.  Zookeeper API. A simple totally ordered broadcast protocol. Benjamin Reed and Flavio P. Junqueira. 2008. In Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware (LADIS '08). ACM, New York, NY, USA, Article 2, 6 pages. DOI=http://dx.doi.org/10.1145/1529974.1529978

The life and times of a zookeeper. Flavio P. Junqueira and Benjamin C. Reed. 2009. In Proceedings of the 28th ACM symposium on Principles of distributed computing (PODC '09). ACM, New York, NY, USA, 4-4. DOI: https://doi.org/10.1145/1582716.1582721
ZooKeeper: Distributed Process Coordination. Flavio Junqueira and Benjamin Reed. 2017, O'Reilly. ISBN-13: 978-1449361303. ISBN-10: 1449361307 Apache Zookeeper Site: https://zookeeper.apache.org
Question: Is ZAB described in a clear and convincing way in this LADIS paper, or the PODC paper?

ZAB is a multicast protocol, not a durable storage protocol, but Zookeeper uses periodic checkpoints once every five seconds to provide persisted storage.  Does this policy actually solve Paxos?  If not, what are some of the ways that an application might notice the difference?
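
To make the thought question concrete, here is a toy Python model of the premise stated above: a store that acknowledges writes from memory and persists only at periodic checkpoints.  It is a sketch under that assumption, not ZooKeeper's actual code, and the names (CheckpointedStore, crash_and_recover) are hypothetical.

```python
class CheckpointedStore:
    """Toy in-memory store persisted only by periodic checkpoints (hypothetical)."""

    def __init__(self, checkpoint_interval=5.0):
        self.interval = checkpoint_interval
        self.memory = {}            # volatile state
        self.disk = {}              # state as of the last checkpoint
        self.last_checkpoint = 0.0

    def write(self, key, value, now):
        # The write is acknowledged as soon as it is in memory.
        self.memory[key] = value
        if now - self.last_checkpoint >= self.interval:
            self.disk = dict(self.memory)
            self.last_checkpoint = now
        return "ack"

    def crash_and_recover(self):
        # A full crash loses memory; recovery reloads the last checkpoint.
        self.memory = dict(self.disk)


store = CheckpointedStore()
store.write("x", 1, now=5.0)   # triggers a checkpoint at t = 5
store.write("y", 2, now=7.0)   # acknowledged, but not yet checkpointed
store.crash_and_recover()
survived = sorted(store.memory)
```

In this model the acknowledged write to "y" is gone after recovery, which is one way an application might notice the difference from a protocol that makes each update durable before acknowledging it.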
Tue Sept 18 8.   The two results we will talk about are: the FLP impossibility result and the weakest failure detector for guaranteeing progress. 

Some slides from Theo are here.
Impossibility of distributed consensus with one faulty process. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. In Proceedings of the 2nd ACM SIGACT-SIGMOD symposium on Principles of database systems (PODS '83). ACM, New York, NY, USA, 1-7.

The weakest failure detector for solving consensus. Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. 1996. J. ACM 43, 4 (July 1996), 685-722.
This lecture is about some very difficult, yet important, mathematics.  Think of it as a one-day theoretical "side tour" into the mathematics of the distributed computing area.  

Some CS6465 students lack the background to read these kinds of papers.  Don't panic!  Not everyone is prepared to read this sort of very difficult theory.  Even so, it can be useful to know about it.  If you can't follow the math, just try and understand the basic idea of what they are saying.

A thought question: what does "impossibility" mean in the FLP paper?
Thu Sept 20 9.  Corfu: An append-only log system that combines Paxos with Chain Replication.  CORFU: A distributed shared log. Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. ACM Trans. Comput. Syst. 31, 4, Article 10 (December 2013), 24 pages. DOI=http://dx.doi.org/10.1145/2535930.

Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse, Fred. B. Schneider. Sixth Symposium on Operating Systems Design and Implementation (OSDI 04). December 2004, San Francisco, CA.
The last lectures really focused on the idea of consensus, and on the idea that in some sense it can be exposed in several ways (the Paxos log, or as an atomic multicast like ZAB).

Corfu is back to the Paxos log, but has a very different way to implement it, using Paxos just for a kind of counter (the end of log pointer), and then using a simple copying method (chain replication) for fault-tolerance.  Is this "legal" or does it break the Paxos properties?
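
The split described above can be sketched in a few lines of Python: one component hands out log positions, and a separate chain of replicas stores the data.  This is an assumption-laden toy, not Corfu's implementation: the counter is a plain in-process integer standing in for whatever replicated mechanism maintains it, failures and reconfiguration are ignored, and the names (Sequencer, Replica, append, read) are made up for illustration.

```python
class Sequencer:
    """Hands out the next free log position (the 'end of log' counter)."""

    def __init__(self):
        self.tail = 0

    def next_position(self):
        pos = self.tail
        self.tail += 1
        return pos


class Replica:
    """One storage node; each log position is write-once."""

    def __init__(self):
        self.cells = {}

    def write(self, pos, value):
        if pos in self.cells:
            raise ValueError(f"position {pos} already written")
        self.cells[pos] = value


def append(sequencer, chain, value):
    # Reserve a position, then copy the value down the chain, head to tail.
    pos = sequencer.next_position()
    for replica in chain:
        replica.write(pos, value)
    return pos


def read(chain, pos):
    # Read from the tail: a value visible there has reached every replica.
    return chain[-1].cells.get(pos)


seq = Sequencer()
chain = [Replica(), Replica(), Replica()]
first = append(seq, chain, "a")
second = append(seq, chain, "b")
```

When thinking about the "legal" question, it helps to ask what happens in this sketch if a client reserves a position and then crashes before writing it.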
Tue Sept 25 10. vCorfu: A way of scaling Corfu up by using lots of logs ("sharding") and virtualizing the log individual applications deal with ("filtering"). vCorfu: A Cloud-Scale Object Store on a Shared Log.  Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, Dahlia Malkhi.  NSDI 2017.
Corfu became popular and it forced the developers to scale far beyond what they originally had in mind.  They ended up with a concept for running Corfu with a lot of logs, not just one.  Additionally, they have a kind of "materialized view" of the global log for efficiency, something they call the "object stream".  The basic idea is to let Tango (their transactional layer) have a rapid and complete cached log and then keep the full log elsewhere, to avoid inefficient access patterns and "holes", which are a problem for them.

When you take this to the limit, is Corfu still a log?

Thought question: why is vCorfu not making more aggressive use of checkpoints?
Thu Sept 27 11. World's fastest Paxos solution: Derecho C++ library. Derecho: Fast State Machine Replication for Cloud Services.  Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Sydney Zink, Kenneth P. Birman, Robbert van Renesse. Submitted for publication, September 2017, revised and resubmitted July 2018.

RDMC: A Reliable Multicast for Large Objects. Jonathan Behrens, Sagar Jha, Ken Birman, Edward Tremel.  To appear, IEEE DSN ’18, Luxembourg, June 2018.
Derecho looks at the mapping of Paxos to fast hardware: the modern RDMA technology.  What benefits does this bring?

We will also talk about virtual synchrony and the epoch model it uses.
Tue Oct 2 12. So, why is Derecho this fast?  Asynchronous flow programming and "refactoring" Paxos to match the hardware. (same papers) I want to devote one whole meeting to just understanding precisely why Derecho turns out to be so fast, because there is a "portable insight" here that applies to other systems.

So our topic will look at Derecho's speed, but with the goal of asking what applications need to do to leverage that speed.  This leads to an open research topic ("Zero copy software libraries and operating systems").
Thu Oct 4 13.  What abstractions will edge programmers actually want? Remote regions: A simple abstraction for remote memory. Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novakovic, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei.  To appear, NSDI 2018.

Slides from this talk.
I don't want to have three full lectures on Derecho, but my topic is still tied to Derecho: If our plan is to use Derecho near the edge, is the C++ library API a sensible way to expose it?  Or should it look like a file system (think back to Theo's lecture on FFFS), or like Zookeeper, or perhaps something else?

To bring a new perspective in, we'll also talk about work at VMWare that focuses on an ultra-simple remote memory idea, also built on RDMA.

(Oct 6 - Oct 9) Fall break, no class Autumn in Vermont Let's hope for amazingly colorful leaves!  (It somewhat depends on the timing of the first real frost: we need one or two nights of really cold weather to trigger a "flash" from green to bright colors, and that doesn't happen some years.)
Thu Oct 11 14. Continued discussion about the roles of technologies like the ones we've discussed in class up to now.

We will situate our discussion in the context of an edge infrastructure using modern function computing models.
"Cloud functions" are a hot new model that seems to be the next big thing for programming cloud applications.  You can read about Azure Function Server, Function PaaS models, or Amazon AWS Lambda to see examples.

A cloud function is really just a short-running program triggered by some kind of event (think of a remote method invocation), that does anything you like, and then terminates.  These functions don't retain any local data (they are "stateless") but they can definitely write to the file system or to a database, etc.  They just don't create local data structures that would be used on the next event -- each event sees a "clean" initial state.
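
As a small illustration of the stateless model, here is a Python sketch of an event handler.  It does not use any real cloud provider's API: the handler signature, the event fields, and the dictionary standing in for an external database are all assumptions made for the example.

```python
# Stand-in for an external database or file system (hypothetical).
EXTERNAL_STORE = {}


def handle_event(event, store=EXTERNAL_STORE):
    """Stateless handler: all lasting state lives in `store`, none in the function."""
    local_scratch = []                      # fresh on every invocation
    local_scratch.append(event["reading"])
    count = store.get(event["sensor"], 0) + 1
    store[event["sensor"]] = count          # side effect that outlives the call
    return {
        "sensor": event["sensor"],
        "events_seen": count,
        "scratch_len": len(local_scratch),
    }


r1 = handle_event({"sensor": "s1", "reading": 20.5})
r2 = handle_event({"sensor": "s1", "reading": 21.0})
```

The second call sees the count written by the first (through the external store) but starts with an empty scratch list, which is the "clean" initial state described above.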

Functions can normally be coded in languages like Python, although Microsoft prefers C# .NET or F# .NET.   Functions run as programs inside container environments, and really are no different from other programs in private virtual machines.  But the model is intended to be very lightweight, with millisecond startup delays, and very elastic: "pay for cycles you actually use."

Functions run on very basic VMs, and hence don't have direct local access to things like GPU accelerators (one can definitely access a GPU accelerator from a container VM if the system is set up to allow that, but a function "server" wouldn't be configured that way).  Instead, think of a function as the director of an orchestra: it sends various tasks on their merry way, but does little direct work of its own.  So for GPU tasks, a function would typically hand objects like images off to GPU servers that have accelerators attached to the server nodes.  This is a source of delay, but common in today's solutions.

There isn't any special reading for this lecture: it continues a topic we didn't have time to finish (we didn't even really start) a week ago Thursday, so we'll pick up the same subject today.

To avoid repetition, I'll run through Microsoft FarmBeats (a digital agriculture application) in the context of Azure IoT Edge, and then that will give us a bunch of example use cases to think about.  If you like, you can Google Microsoft FarmBeats to see some video demos and online materials.

Tue Oct 16 15.  Spark RDDs and file system caching performance.

Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010.

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, OSDI 2008, December 2008. 

Spark is the world champion for "big data" computing, but normally runs in a batch style and is viewed as a kind of back end layer of the cloud, doing big computations offline that you'll later draw on through massive files that represent the output (machine-learned models, precomputed indices, etc).  The RDD model is a cool and widely popular example of a different kind of function PaaS, even though they don't really pitch it that way.  In fact Spark RDDs could be of real interest near the edge, even without MapReduce (RDDs can be used from SciPy, GraphLab, MATLAB, Mathematica...)

Questions to think about: RDDs give Spark a big benefit for Hadoop jobs, but those are used mostly in the back-end of the data center for analytics.  Could there also be an edge opportunity?  What would reuse of RDDs at the edge require?
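
For anyone who hasn't seen RDDs before, the toy Python class below sketches the core idea from the Spark paper: a dataset remembers how it was derived (its lineage), is computed lazily, and can be cached and later rebuilt if the cached copy is lost.  This is not the PySpark API; ToyRDD and its evict method are hypothetical names, and everything runs on a single machine with no partitioning.

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy, lineage-based, optionally cached."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # base data, for a root dataset
        self.parent = parent      # lineage: the dataset this one derives from
        self.fn = fn              # transformation applied to the parent's rows
        self.cached = None
        self.keep = False

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Ask for the result to be kept in memory once it is computed.
        self.keep = True
        return self

    def collect(self):
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.fn(self.parent.collect())
        if self.keep:
            self.cached = rows
        return rows

    def evict(self):
        # Simulate losing the cached copy; lineage can rebuild it.
        self.cached = None


base = ToyRDD(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
first = evens_squared.collect()
evens_squared.evict()
rebuilt = evens_squared.collect()
```

One way into the edge question is to ask what plays the role of `source` when the input is a live sensor stream rather than a file that can be reread.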
Thu Oct 18 16.  More on RDD programming

Same papers

There is more to this whole RDD concept than we can cover in one lecture, so we'll continue on the topic and look at some of the complexities of getting good RDD behavior.  It comes down to understanding (more or less) the way that Spark itself really works.
Tue Oct 23 17. The amazing power of GPUs and GPU clusters.  CUDA.  Dandelion: Programming tool for GPU management. Dandelion: a compiler and runtime for heterogeneous systems. Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 49-68. DOI: https://doi.org/10.1145/2517349.2522715 This topic is kind of a pivot for us.  We'll start to look at hardware accelerators we might want to attach to our edge computing infrastructure.  GPUs are normally programmed using a language called CUDA, but there is a perception that CUDA is a barrier to widespread exploitation of the technology.  Dandelion is one example of a response (not super successful, but very well explained).
Thu Oct 25 18. Challenges of integrating GPUs with other parts of the O/S stack, RDMA, etc.

GPUfs: the case for operating system services on GPUs. Mark Silberstein, Bryan Ford, and Emmett Witchel. 2014. Commun. ACM 57, 12 (November 2014), 68-79. DOI: https://doi.org/10.1145/2656206

SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs Shai Bergman, Tanya Brokhman, Tzachi Cohen, Mark Silberstein. USENIX ATC, 2013.

GPUnet: Networking Abstractions for GPU Programs.
Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, and Emmett Witchel, Amir Wated and Mark Silberstein. OSDI 2014.

Once you are building the GPU service itself, you need ways to get to data, and to the network.  Here are a few projects that tackled those topics.
Tue Oct 30 19. TensorFlow.  Using TensorFlow to control GPU or TPU computations.

TensorFlow: A System for Large-Scale Machine Learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265-283.

TensorFlow emerged in part because Google has some skepticism about the CUDA + GPUfs + GPUnet concept. 
Thu Nov 1 20. Using TensorFlow to build distributed systems.

TensorFlow is portrayed as a distributed systems tool, but in fact is overwhelmingly used on a single machine at a time to manage applications that talk to local hardware like GPU clusters or TPU accelerators.  We'll look at experiences people have had with TensorFlow both as a "function" language for the edge and also as a true distributed programming tool.

Mostly, TensorFlow is used on some single computer to control a single attached GPU or TPU cluster.  But it can also support fault tolerant distributed computing, in its own unique style.  Nobody really knows how effective it is in that fancier style of use.
Tue Nov 6 21. Berkeley's recent work on "Ray"

Ray: A Distributed Framework for Emerging AI Applications. Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, UC Berkeley.  OSDI 2018.

The U.C. Berkeley group wasn't convinced that TensorFlow and RDDs solve every edge computing need.  Ray is their recent proposal for an edge processing language oriented towards AI applications.
Thu Nov 8 22. Routing data through an FPGA: the Catapult model.

A reconfigurable fabric for accelerating large-scale datacenter services.  Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. ISCA 2014. Published as SIGARCH Comput. Archit. News 42, 3 (June 2014), 13-24.

GPU and TPU clusters have the "advantage" of being basically similar to general purpose computers, except for supporting highly parallel operations in hardware (ones matched to the needs of graphics programming, or tensor transformations).  But there are other interesting accelerators, too.

We'll look at FPGAs, which are a kind of hardware "filter" and "transformation" unit you can place right on the wire.
Tue Nov 13 23. Clusters of FPGAs and their relevance to ML/AI.  Microsoft's datacenter of FPGAs model.

A cloud-scale acceleration architecture Adrian M. Caulfield; Eric S. Chung; Andrew Putnam; Hari Angepat; Jeremy Fowers; Michael Haselman; Stephen Heil; Matt Humphrey; Puneet Kaur; Joo-Young Kim; Daniel Lo; Todd Massengill; Kalin Ovtcharov; Michael Papamichael; Lisa Woods; Sitaram Lanka; Derek Chiou; Doug Burger 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. SIGARCH Comput. Archit. News 43, 1 (March 2015), 223-238. DOI: https://doi.org/10.1145/2786763.2694347

Journal version of the same paper: here. This has a little more detail and includes some additional experiments, but the shorter conference version is probably fine unless you find something puzzling or incomplete and want to read a little more.

Here is what we get with "grown up" FPGAs, but the topic is fairly complex.  The key idea is that if you have enough FPGAs you can create big clusters that function as powerful hardware supercomputers for certain tasks, like audio (speech) and image (vision).  People have been figuring out how to map deep neural networks into FPGA clusters.

The work is quite technical and we'll sort of skim it, with the goal of just being able to think about what an edge needs to look like if it will use tricks like this for "amazing performance."
Thu Nov 15 24. Software as an out-of-band control plane for data flows.  Barrelfish and Arrakis.  iX.

The multikernel: a new OS architecture for scalable multicore systems.  Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009.  In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP '09). ACM, New York, NY, USA, 29-44. 

Arrakis: The Operating System Is the Control Plane. Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2015. ACM Trans. Comput. Syst. 33, 4, Article 11 (November 2015), 30 pages.

These are two famous operating systems papers that argue for new OS designs aimed at better management of modern hardware.
Tue Nov 20 25. Software controlled data centers.

The Rise of the Programmable Data Center. Michael Vizard. Dice.com, 2012.  An old article but it sets the stage.

The Software Defined Data Center -- In Depth.  VMWare technology white paper, 2016.

SDDC - software-defined data center. Webopedia article.

All three of these are short industry white papers (well, the third is a Webopedia article but in that same style).  Today's topic isn't really technology, per se, but rather "mindset".  The 2012 paper is a tiny bit dated because by now, 6 years later, VMWare actually has a major product in this space, discussed in the second white paper.

Fundamentally: Is the future going to be dominated by a PaaS model based on microservices, or will it be dominated by a virtual private cloud model, in which you can choose to use microservices, but could equally well go with bare metal?
(Nov 21-25) Thanksgiving break, no class

 
Tue Nov 27 26. OS ideas for disaggregated data centers

LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation.  Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang.  OSDI 2018 (best paper award).   

 
Thu Nov 29 27. OS ideas for disaggregated data centers Sharing, Protection, and Compatibility for Reconfigurable Fabric with AmorphOS.  Ahmed Khawaja, Joshua Landgraf, and Rohith Prakash, Michael Wei and Eric Schkufza, Christopher J. Rossbach.  OSDI 2018  
Tue Dec 4 28. Identifying open research questions In this last lecture of the semester, I want to try and wrap up what we've explored through the course.  One dimension is to look at what turns out to make a research idea impactful.  Some comments on that question are in the column to the right.  But this is something we've discussed at length and we will just remind ourselves of the key elements.

In the main part of the lecture I want to discuss a few directions we didn't even get a chance to touch upon during the semester, particularly the implications of privacy for IoT deployments.  To ground this in a technical topic, I thought I might introduce  an idea I'll call "no trace left behind": Cloud computing systems that "help" an IoT system but rather than dynamically learning, kind of do the opposite and provably forget what they might have been transiently exposed to.  You could then use such a system in situations where private information needs to be classified.  As a concrete example of this form of computing, I'll share some aspects of the design of the Boeing 777 Safebus, a clever system from many years ago that actually includes several innovations in this respect.

Once we all understand the concept of "no trace" computing, I want to use the rest of the lecture to think about how this idea can be turned into contextualized research concepts in various situations.  It isn't the only direction one can go with research on the smart edge, but would be illustrative of a single idea and how you can turn it into various research agendas.  Computing with time limits (real-time) would be a different single idea that could go in various directions, and of course the main theme we've touched on repeatedly has been this issue of what the computing model needs to be at the edge, and how machine-learned data would be hosted (especially if it changes rapidly, or is used as configuration inputs for sensors).
The core dimensions we've touched on are these:  Given some research idea, you want to ask "What is the context for the idea?  Who would read this paper or use this technique, and will the paper actually reach that kind of person?  What perspectives does the work need to emphasize to be successful with that community?"

Then you can ask if the "expression" of an idea is right for the likely "users" of the idea.  Is the API natural?  Does it make sense to use that API in the context where the developer would want the functionality?  (With Derecho, we were critical of the idea of using a C++ API in a setting where developers often customize by adding C#, Java or Python code -- code that runs in containers in contexts where a DLL using a C++ API wouldn't be natural at all.)

There are other dimensions too: does it really work better than prior solutions, are the scenarios fair, did the new idea bring overheads or limitations that the authors perhaps swept under the rug, etc.  Those are valid concerns too.  But to appreciate impact, the first thing to be asking centers on those other questions: the big-picture ones.
