CS5412 Spring 2012 Syllabus

Videonotes Link (for Cornell students only http://www.cs.cornell.edu/videonote/cornell

 

Date Topic pdf pptx Comments Readings

Tue, Jan 24

Cloud virtualization

pdf pptx (*25) Guest lecture: Hakim Weatherspoon. Normally would have been lecture 25. Hakim will talk about the role virtualization plays in modern cloud settings and also some of his recent research, which looks at how virtual machine migration can be used to let an application access sensitive files that you don’t trust the cloud to host

XenBlanket paper

Thu, Jan 26

Introduction

pdf

pptx

(1) Overview of the entire course, topics we’ll cover, the way grading will work, etc. Book: Preface & Intro

Tue, Jan 31

How it works

pdf

pptx

(2) How cloud client systems work. We’ll look at some typical cloud applications (music or video players, Shazzam) and ask how they do it. Book: Chapter 2, 3
Thu, Feb 2 Network routing

pdf

pptx

(3) BGP, high availability routers, future secure networks. Routing has a key role to play in the cloud. But how does the Internet do routing? What options exist for controlling routing from within a cloud application? We’ll look mostly at BGP but will also discuss the resilient overlay network idea (RON), from MIT. Then we’ll ask how one might make the Internet robust in other dimensions such as security, and how it could support cloud-based routing control.

Book: Chapter 4.

Routers for the Cloud. Can the Internet Achieve 5-Nines Availability? Agapi et. al, IEEE Internet Computing. Volume 15. Issue 5. pp.72 - 77. September, October 2011.

The League of SuperNets, Ken Birman, IEEE Internet Computing, vol. 7, no. 5, 2003, pp. 92-96.

Tue Feb 7 P2P file sharing and search

pdf

pptx

(4) Peer-to-peer protocols have transformed the Internet by making it easy for people who don’t know one another to still cooperate to share music or other kinds of files. Underlying this are two aspects: the Internet file transfer protocol (which is trivial) and the ability to search for things in a network where you only know about a few of the members. This latter functionality is often called a “key,value” store and is frequently implemented using a “distributed hash table”. We’ll study CAN and Chord, two widely known examples of DHTs. Book, Chapter 4
Thu Feb 9 Adaptive Overlays

pdf

pptx

(5) In Lecture 5 we ended by pointing to some issues that can arise when using Chord as a DHT. Lecture 6 explores the idea of using dynamic adaptation to compensate for skewed access distributions, skewed node-to-node latencies, etc. We’ll look at Beehive (a modified version of Chord that achieves O(c) lookup times for constant c), Pastry (a Plaxton-tree DHT structure that creates flexibility as to which nodes a given node should peer with) and Kelips (a gossip-based DHT with O(1) lookup times). Book, Chapter 4
Tue, Feb 14 Torrents and Tit-for-Tat Incentives.  BAR Gossip

pdf

pptx

(6) In this lecture, we’ll look closely at BitTorrent, to see exactly how its “tit for tat” concept of fairness works. This will lead us to look at an idea from UT Austin for a scheme they call “BAR” gossip

Book: Chapter 5

BAR Gossip

Thu Feb 16

Anatomy of a cloud. CAP theorem

pdf

pptx

(7) Modern data centers are structured into tiers. We’ll spend most of this lecture understanding the breakdown, their connection to the Web Services architecture, and the very different properties and constraints that apply in each tier. At every level of the cloud we struggle with deep tradeoffs: rapid response and scalability tend to fight against techniques that provide strong guarantees. Berkeley’s Eric Brewer captured one of these in his CAP theorem, which says you can have just two of Consistency, Availability, Fault or Partition Tolerance. We’ll see how CAP applies and how it was proved by Gilbert and Lynch. Although the proof applies only in WAN settings, we’ll also find that CAP has implications in the first tier of a typical cloud computing data center. Book, Sec. 5.3
Tue, Feb 21

BASE methodology versus ACID model

pdf

pptx

(8) Many cloud systems are rejecting the database ACID guarantees in favor of what has come to be known as the BASE methodology, which weakens consistency to enhance responsiveness, but can make mistakes because of the weaker properties that must later (“eventually”) be repaired.

We will also have a guest on February 21: Kate Jenkins will be visiting from Akamai and will give a short (half-hour) talk about how Akamai manages state in their cloud.
Book (BASE is discussed in several chapters, but this lecture draws mostly on the material in the Preface, Introduction and Chapter 5) BASE: An Acid Alternative - ACM Queue. Dan Pritchett, eBar. ACM Queue.

Overcoming the “D” in CAP: Using Isis2 To Build Locally Responsive Cloud Services. K. P. Birman, Q. Huang, D. Freedman. IEEE Internet Computing. Volume 12. pp. 50-58. February 2012.
Thu Feb 23

Continued review of cloud services

pdf

pptx

(9) We use the term “tier 2” to talk about services that are nearly as elastic and scalable as the ones in tier 1, but have roles focused on caching. Best-known examples include Memcached and Dynamo. We’ll also look briefly at GFS, BigTable, Chubby, Zookeeper, Sinfonia, MapReduce and other services and will see where each lives. We’ll only get detailed for a few of these (it would take too long to cover all of them), but we will try and identify common themes. Sec 5.7
Tue Feb 28

Gossip and Self-Stabilization. Code Partitioning Gossip.

pdf pptx (*19) Guest lecture: Lonnie Princehouse. Normally would have been lecture 19. Lonnie will remind us how gossip works (we first saw it in connection with services like BitTorrent and BAR gossip), and show how it extends into an area called “self-stabilization”. Lonnie has been designing a system that compiles Java code into scalable services of these kinds. MiCA paper (link works for Cornell students only)
Thu Mar 1 Consistency and time.  Logical and Vector clocks.  Consistent cuts. pdf pptx (10) To reason about stronger forms of consistency, we’ll need some abstractions. One tool that turns be especially useful was created by Leslie Lamport and offers a way to model time in terms of causal relationships. We’ll see how this leads to notions of logical clocks that sometimes make more sense that real clocks. Book, Sec 10.1 and 10.2
Thu Mar 6 2 and 3-phase commit pdf pptx (11)_We’ll look at the famous two and three phase commit protocols, which play a wide range of roles in many kinds of systems. These are used as building blocks, but are also for their own sake. Oddly, it turns out that neither protocol is able to guarantee that progress will occur, although 3PC does better than 2PC. This motivates looking more closely at just when fault-tolerance can be achieved. Section 10.3
Thu Mar 8 Consensus and the FLP theorem pdf pptx (12) We’ll look at a theoretical result most people refer to as the “FLP theorem”. FLP asserts that fault-tolerant consensus is “impossible” in asynchronous systems. What do these terms mean? How is FLP proved? What does FLP mean in practice? Section 11.3
Thu Mar 13 Paxos pdf pptx (13) This lecture will present the “basic” Paxos protocol. Paxos is a powerful protocol for managing replicated data and widely used in cloud computing systems. The Chubby lock service mentioned in lecture 7 runs on top of Paxos. Chapter 11
Thu Mar 15 Ordered reliable multicast (Isis2) pdf pptx (14) We’ll review the virtual synchrony model, a less stringent way to implement replication that brings complexity not seen in Paxos, but also gains performance advantages in doing so. Isis2 employs the virtual synchrony model, but also supports Paxos (called SafeSend within the system). Chapters 11-14
Tue Mar 27 How much durability is needed? pdf pptx (15) We’ll look at the question of weak versus strong durability as it arises when applying Isis2 to replicate data in tier-1 services. Here, we’ll look at the question of how durable a multicast needs to be if the corresponding data is managed in-memory and there isn’t any way to preserve state across node failures and restarts (i.e. there is only soft-state). Book, Section 14.3.

You may also want to remind yourself of this paper, which we discussed on February 21: Overcoming the “D” in CAP: Using Isis2 To Build Locally Responsive Cloud Services. K. P. Birman, Q. Huang, D. Freedman. IEEE Internet Computing. Volume 12. pp. 50-58. February 2012
Thu Mar 29 A short tour of Isis2 pdf pptx (16) We’ll take a short guided tour of the Isis2 system. As a starting point we'll see that in addition to various durability properties, Isis2 offers many ordering options. Not every system needs the strongest and most costly ordering guarantees, and by picking the right protocol to match a use case, one can gain big speedups: you ideally want to pay for what you need. In the time remaining, we'll then review many of the other features of the system.  All of this adds up to. our theme: "too many options"!  But we'll also see that you can use the system without using more than a tiny subset of the features. Isis2 is a substantial system, like an operating system for distributed computing, and the fancy features are really for advanced users. Chapter 15
Tue, Apr 3 Transactional model, ACID pdf pptx (17) We’ll look more closely at the ACID model mentioned in Lecture 10 and how it works. ACID is the standard way to think about database consistency. Chapter 20
Thu Apr 5 Implementing transactional services pdf pptx

(18) How can ACID-based systems be implemented? We’ll look at some basic mechanisms and will also touch upon snapshot isolation, an approach popular in cloud settings.

 
Tue Apr 10 Bimodal Multicast.
Astrolabe. Notion of “information diameter” of a system
pdf pptx (20) This starts a series of “single topic” lectures (the numbering skips because we covered lecture 19 early). Bimodal Multicast is a reliable multicast built using gossip. Astrolabe is a technology used at Amazon.com for scalable monitoring, data mining and aggregation in cloud settings. An amusing bug that was introduced by a well-intentioned optimization leads us to think about the “information distance” between nodes in the system and sheds light on how gossip mechanisms work when scaled up. Chapter 21
Thu Apr 12 Building complex overlay networks using Gossip pdf pptx (21) We'll look at the work on the T-Man system at the University of Bologna.  T-Man slide set. (and as a pdf)
Tue Apr 17 Challenges of extreme scale:(1) Elastic Replication.  (2) Gossip Objects.    

Guest lectures by Hussam Abu-Libdeh and Qi Huang.  Topics: Hussam will talk about Elastic Replication, a new technique for building scalable failure-tolerant strongly consistent systems. Elastic Replication makes simple replication protocols robust to inaccurate failure detection. Additionally, it leverages horizontal partitioning in data centers to support scalable, reconfigurable, systems with no centralized configuration manager. Qi will talk about GO (Gossip Objects) a per-node gossip platform that allows nodes to join multiple gossip groups without losing the appealing fixed bandwidth guarantee. Our heuristic is based on the observations that multiple rumors can often be squeezed into a single IP packet, and that indirect routing of rumors can speed up delivery.

Elastic Replication paper.

Gossip Objects paper
Thu Apr 19 The cloud value proposition. pdf pptx (22) The cloud "Value Proposition".  We'll discuss some of the major business models for cloud enterprises to try and understand what determines value in the cloud.  This turns out to be a much trickier question than you might have expected.  
Tue Apr 24

Security concerns for consolidated systems. Synthetic diversity.

pdf pptx (23) One issue we encountered in lecture 23 is that cloud security worries are a big barrier to adoption.  Virtualization creates a new kind of security risk, but also some benefits; we’ll discuss both. The paper we’ll read was written by Professors Schneider and Birman as a report on a study looking at these tradeoffs. Dangers of monocultures paper.
Thu Apr 26 Clouds under attack! pdf pptx

(25) We’ll conclude the course by looking at the broader spectrum of attacks possible on cloud systems.   Whereas lecture 24 focuses on the dangers of consolidating a lot of copies of something into one vulnerable data center, lecture 25 steps back and looks at the bigger picture and at the whole range of attacks that are feasible.  We'll also look at other remedies, beyond the synthetic diversity approach favored in lecture 24.

Chapter 19
Tue May 1 Summary, final quiz pdf pptx (26) A recap of the semester and an attempt to predict the future of cloud computing.  We'll also have the last of our in-class quizzes in the last 15 minutes of the class.  As usual 16 points, T/F, closed book.  The style and difficult shouldn't be much different from any of our other quizzes  
Thu May 3 Demos day: Please sign up for a good time when you can show us your work.  We would like each project or team to have a poster on the overall system and also be shown to us running on whatever machines you find convenient to use.  We can be a little flexible about the day for demos; no problem if you prefer Wednesday or Friday, and we could consider other dates too.  Talk to the TAs if you have a special need to do the demo on some non-standard day.  We normally need 15 minutes per project and want all team members to please attend.