CS6465: Emerging Cloud Technologies and Systems Challenges

Hollister Hall Room 320, Tuesday/Thursday 1:25-2:40

CS6465 is a PhD-level class in systems that tracks emerging cloud computing technology, opportunities and challenges. It is unusual among CS graduate classes: the course is aimed at a small group of students, uses a discussion-oriented style, and the main "topic" is actually an unsolved problem in computer systems. The intent is to think about how one might reduce that open problem to subproblems, learn about prior work on those, and extract exciting research questions.  The PhD focus centers on that last agenda element.

In this second offering, we plan to focus on issues raised by moving machine learning to the edge of the cloud. In this phrasing, edge computing still occurs within the data center but, for reasons of rapid response, involves smart functionality close to the client, operating under time pressure.  So you would think of an AI or ML algorithm written in as standard a way as possible (perhaps in TensorFlow, or Spark/Databricks using Hadoop, etc.).  But whereas normally that sort of code runs deep in the cloud, many minutes or hours after the data is acquired, the goal now is to keep the code unchanged (or minimally changed) and be able to run it on the stream of data as it flows into the system, milliseconds after it was acquired.  We might also push aspects of machine-learned behavior right out to the sensors.

This idea is a big new thing in cloud settings -- they call it "edge" computing or "intelligent" real-time behavior.  But today edge computing often requires totally different programming styles than back-end computing.  Our angle in CS6465 is really to try to understand why this is so: could we more or less "migrate" code from the back-end to the edge?  What edge functionality would this require?  Or is there some inherent reason that the techniques used in the back-end platforms simply can't be used in the edge, even with some sort of smart tool trying to help?
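To make the "migrate the code unchanged" question concrete, here is a toy sketch in Python.  Everything in it is hypothetical: `score` stands in for a trained model, the threshold is made up, and `sensor` just fakes a data source.  The only point is that the same `score` function is called from a back-end style batch job and from an edge style streaming loop.

```python
from typing import Iterable, Iterator, List


def score(reading: float) -> str:
    """Stand-in for a trained model (hypothetical threshold of 30.0)."""
    return "alert" if reading > 30.0 else "ok"


def run_batch(readings: List[float]) -> List[str]:
    """Back-end style: the data is already stored, so process it in one pass."""
    return [score(r) for r in readings]


def run_stream(readings: Iterable[float]) -> Iterator[str]:
    """Edge style: label each reading as it arrives, without waiting for the rest."""
    for r in readings:
        yield score(r)


def sensor() -> Iterator[float]:
    """Fake sensor producing three readings (made-up values)."""
    yield from [21.5, 34.0, 29.9]


if __name__ == "__main__":
    # Same model code, two execution styles, identical labels.
    print(run_batch(list(sensor())))
    print(list(run_stream(sensor())))
```

Real systems are of course much harder than this: the toy ignores state, batching, latency budgets and disconnection, which are exactly the issues we want to study.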

The goal of this focus on an intelligent edge is, of course, to motivate research on the topic.  As a systems person, Ken's group is thinking about how to build new infrastructure tools for the intelligent edge.  Those tools could be the basis of great research papers and might have real impact.  But others work in this area too, and we'll want to read papers they have written. 

Gaps can arise at other layers too.  For example, TensorFlow is hugely popular at Google in the AI/ML areas, and Spark/Databricks plus Hadoop (plus Kafka, Hive, HBase, Zookeeper, not to mention MATLAB, SciPy, GraphLab, Pregel, and a gazillion other tools) are insanely widely used.  So if we assume that someone is a wizard at solving AI/ML problems using this standard infrastructure, but now wants parts of their code to work on an intelligent edge, what exactly would be needed to make that possible?  Perhaps we would need some new knowledge representation, or at least some new way of storing knowledge, indexing it, and searching for it.  This would then point to opportunities for research at the AI/ML level as well as opportunities in databases or systems to support those new models of computing.

CS6465 runs as a mix of discussions and short mini-lectures (mostly by the professor), with some small take-home topics that might require a little bit of out-of-class research, thinking and writing. There won't be a required project, or any exams, and the amount of written material required will be small, perhaps a few pages to hand in per week. Grading will mostly be based on in-class participation.

CS6465 can satisfy the same CS graduate requirements (in the systems area) as any other CS6xxx course we offer.  Pick the course closest to your interests, no matter what you may have heard.  CS6410 has no special status.

Schedule and Readings/Slides

The following schedule is just a conceptual overview.  It still has more blank slots than actual plan and the plan itself will evolve.

We will be reading a lot of papers from the main conferences, but part of the puzzle here is to figure out which papers are relevant to our topic.  So this may look like a series of lectures by Ken on standard cloud stuff, but in practice we will often be discussing one or more published papers germane to our main topic.  The class will work as a team to identify those papers -- a good experience for later in your studies when you will be doing literature searches on your own.  So at least some classes will devote a fair amount of time to discussing what papers we need to read, and everyone will need to find an example or two and make the case for looking at it. 

Which conferences are the ones most relevant to "intelligent edge" computing?  This is a curious question too.  Offhand, "intelligent computing" leads to NIPS and KDD.  But "edge computing" would focus us more on SoCC, NSDI, SOSP, OSDI.  Then there are networking conferences like SIGCOMM, and "broad agenda" conferences like DSN, ICDCS, EuroSys, ATC, LADIS (some of these are ACM conferences, some are from IEEE, and a few are from USENIX).  The data conferences could have relevant papers too (VLDB, SIGMOD), and the same goes for the real-time conferences, like RTSS.  So there are a lot of "candidate" conferences.  We'll probably focus mostly on papers that appeared in the past five years.  There may be some interesting papers in journals too: TOCS, TOPLAS...   Still, five years times perhaps 15 conferences and perhaps a further 5 journals would give us maybe 100 "venues" to scan (counting each year of each venue separately), with maybe an average of 20 papers each, hence 2000 or so candidate papers.  We only really plan to read one or two per lecture.
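The back-of-the-envelope count above can be written out as a few lines of Python.  The inputs are just the rough guesses from the paragraph, not measurements, and the figure of 29 meetings is an assumption taken from the numbered slots in the schedule on this page.

```python
# Rough guesses quoted in the text above (not measured values).
YEARS = 5
CONFERENCES = 15
JOURNALS = 5
PAPERS_PER_VENUE_PER_YEAR = 20

# Assumptions for the reading budget: 29 numbered meetings in the
# schedule, and at most two papers discussed per meeting.
MEETINGS = 29
MAX_PAPERS_PER_MEETING = 2

# Each (venue, year) pair is one proceedings or volume to scan.
venue_years = YEARS * (CONFERENCES + JOURNALS)
candidate_papers = venue_years * PAPERS_PER_VENUE_PER_YEAR
max_papers_read = MEETINGS * MAX_PAPERS_PER_MEETING
fraction_read = max_papers_read / candidate_papers

if __name__ == "__main__":
    print(f"venue-years to scan: {venue_years}")
    print(f"candidate papers:    {candidate_papers}")
    print(f"papers we can read:  {max_papers_read} ({fraction_read:.1%})")
```

So even reading two papers at every meeting covers only a few percent of the candidates, which is why picking the right papers is part of the work.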

The biggest issues for the edge, as opposed to a normal backend cloud, are:

  1. Ease of programming.  There is a HUGE machine learning and data analytics community by now, all centered on the big cloud platforms like Apache.  Obviously these include Hadoop (an open-source implementation of MapReduce) but you also find systems that are built around MATLAB, GraphLab, SciPy, etc.  Then there are big existing applications, like five or ten Neural Network classifiers, each of which has a whole mini-ecosystem of its own, but hosted on those standard platforms.  And then there are HPC systems integrated into that same world.  So there are just an incredible number of powerful, popular, existing technologies. 

    Our edge programmer will be someone familiar with that world and able to deploy those tools to build intelligent AI and ML tools that can train from the proper data sets and then be used "offline".  Now he or she wants to migrate some functionality to the edge.  How can we reduce the barriers to performing this task?  So that's the single, overarching, really-big-deal question.
  2. The edge probably needs to support disconnected operation.  Security matters too, but we won't really focus on IoT security because we don't have enough scheduled meetings to do every imaginable thing.
  3. Edge applications will be forced to operate under a unique mix of bandwidth and latency stresses.  In some ways the edge will be a real-time ecosystem.  How does this impact everything?
  4. The edge won't have unlimited resources, whereas the cloud really has no limits at all (none that we would care about).
  5. But it needs to integrate smoothly with the cloud, and to leverage it, at least when connected to it.

Then we have a different kind of issue to think about:

  1. As researchers, our job is to publish (lest we otherwise perish).
  2. What constitutes a publishable research topic in this space, in the main conferences people care about, or the important journals?  Not every practical question is also a publishable one!

Even this list betrays a bias: as a "platform" person, Ken's bias is a little bit towards systems.  The machine learning applications that run on those systems are important -- the client generates the workload.  But even so, we want to think of our client applications in pretty general, black-box terms.  The other puzzle is that while we know a lot about the successful back-end ecosystems (we'll focus on Apache in CS6465 but in fact Amazon, Azure, Google and others all have elaborate specialized ones), this concept of a smart edge is nascent and hence there is little detail because it has yet to be invented.  Figuring out what the technology "roles" will be is a good place to start, and then we can ask what candidates exist for populating those roles.

Date Topic Readings, other comments on the topic
Thu Aug 23 1. Overview of our topic: bringing machine learning to the edge.  Just a get-to-know-you meeting.  Ken will probably show some Microsoft slides from a recent MSR faculty summit where they told us about Azure IoT Edge and "Intelligent Edge".
Tue Aug 28 2. Today's edge ecosystem: microservice frameworks.  Background: We'll learn in a very quick overview about the evolution of the cloud from its early days as a large infrastructure to host Web servers and services to a microservice framework.
Thu Aug 30 3. Edge deep-dive: Amazon's AWS edge.  Continuing our backgrounder on the edge, we'll look at the kinds of things AWS supports.
Tue Sept 4 4. Edge deep-dive: Azure IoT Edge.  [Ken remote?]  Ken will be on the road but perhaps will run the class remotely.  Tentatively, we'll look at Azure IoT Edge in a bit more detail to understand what they seem to have in mind.
Thu Sept 6 5. Edge challenges: High level perspective.  [Ken remote?]  Ken still out of town.  Tentatively, we'll pull the threads together to try to think about how a real system (we might focus on an intelligent platform for agriculture called FarmBeats, from Microsoft) would generate edge workloads.
Tue Sept 11 6. Today's back-end ecosystem: Apache Hadoop, YARN, HDFS, Hive, HBase, Zookeeper.  This is a bit like our AWS deep dive, but this time looking at the most popular of the MapReduce infrastructures, namely the Apache Hadoop ecosystem.
Thu Sept 13 7. Deeper study of Zookeeper/ZAB, roles it plays. Zookeeper has such a central role that it deserves close study by itself.  The edge will need something analogous.
Tue Sept 18 8. Spark and Databricks: RDD concept and caching. A major Berkeley cloud laboratory project from around 2010 yielded Spark, a platform for accelerating the Apache Hadoop ecosystem.  This is fascinating work and we'll look closely at it.  First, an overview.
Thu Sept 20 9. RDDs: Deeper dive. We saw RDDs briefly on Tuesday, but today we will really look hard at "RDD programming".
Tue Sept 25 10. TensorFlow. RDDs and Hadoop were perceived as frustrating and limiting by people at Google, who created TensorFlow as their response.  It quickly became very popular.  We'll look at the overall model.
Thu Sept 27 11. Edge programming challenges. In some sense this is a repeat of lecture 5.  Having seen more of the backend we loop back to revisit the edge, asking what a comprehensive Edge ecosystem might really require.
Tue Oct 2 12.
Thu Oct 4 13.
(Oct 6 - Oct 9) Fall break, no class
Tue Oct 9 14.
Thu Oct 11 15.
Tue Oct 16 16.
Thu Oct 18 17.
Tue Oct 23 18.
Thu Oct 25 19.
Tue Oct 30 20.
Thu Nov 1 21.
Tue Nov 6 22.
Thu Nov 8 23.
Tue Nov 13 24.
Thu Nov 15 25.
Tue Nov 20 26.
(Nov 21-25) Thanksgiving break, no class

Tue Nov 27 27.  
Thu Nov 29 28.
Tue Dec 4 29.  
