Spring 2022 Syllabus

Attending lectures in person is mandatory. All lectures and recitations are recorded for offline review or asynchronous streaming in situations where synchronous participation isn't feasible, but you still need to attend the actual lecture in person unless: (1) you were sick and have an SDS letter telling us to excuse you from class; (2) Cornell asked students to attend remotely due to a Covid spike; (3) you are using up one of your 3 permitted "didn't attend" days.  Sick but no letter?  Then it counts towards one of these 3 days.  If you plan to interview in the spring, please keep in mind that Cornell doesn't excuse you from class to interview.  Interviews count against the 3 days, too.

  Date | Topic | Remarks, Recommended reading (optional, see note above)
The first group of lectures is concerned with some basic principles that govern cloud computing and scalability.  The cloud is built from Linux servers, but it isn't the same as just logging into a Linux system and starting to build programs like the ones you may have created in prior classes or jobs.  You need to use the vendor-provided tools, and you need to be unusually aware of the way the hardware works -- otherwise your systems will perform poorly or cost far more than the best possible design (and at cloud scale, costs mount rapidly!)
1. 1/25 [Cloud Overview.  Internet of Things and the Cloud IoT Edge]

Overview of the course.  Azure IoT model: Sensors, Azure IoT Edge roles, Azure Intelligent Edge and IoT Hub, u-services model, data center file system and database infrastructures, big-data analytics infrastructures.

We focus on Azure just for coherency, but Amazon AWS has completely analogous components.  If you know one cloud, you'll easily be able to adapt to any other cloud!

Slides: pptx  pdf  video
The first five lectures are really to help everyone get situated and onto the same page in terms of terminology and mindset.  In lecture one we look at an end-to-end perspective on how a smart farm would work in Microsoft Azure from data collection all the way back to data storage and big-data analytics.  The technical depth will be kind of shallow.

Azure.microsoft.com:  Home page for all of Azure and Azure IoT.  This is actually quite a useful resource for finding more details on the topics of the first few lectures.

Some of the examples in the lecture draw on work done by Professor Delimitrou in Cornell's ECE department.  A paper on her Seer system can be found here: Seer: Leveraging Big Data to Navigate the Complexity.  Seer depended on a suite of tools for benchmarking microservices discussed here: Benchmarking Microservices

One example discussed in Lecture 1 is Microsoft's smart farms project.  Read more at:  FarmBeats: AI & IoT for Agriculture.
One big challenge with cloud computing is learning to master so much complexity.  Read more about this "crisis of understandability" here!  In CS5412, we will be learning basic building blocks that make the cloud understandable despite all of this complexity.  By the end of the semester, you'll find that even very elaborate cloud architectures are approachable.
  1/26
(recitation)

7:30
In this first recitation meeting we will talk about how homework and projects are handled.  CS5412 will have 2 or 3 graded coding assignments (the first one will be assigned on the first day of classes!).  Alicia will explain how to access CMS, how to prepare uploads and what to do to actually upload them, and how team formation will occur for the larger project that runs in the last 8 or 9 weeks of the semester.  Alicia will be running the recitations.  Some are focused on class material and reviews of things that might have been confusing.  A few focus on the project.
Scalability through a technique called "key-value sharding" is foundational in the cloud.  We'll be using this extensively and testing on it, and every student needs to become proficient both in the concept and in finding clever ways to apply this even when the "match" to the problem statement may not be obvious.  Understanding the tradeoffs matters too: in key-value sharding some data structures end up split over many shards, and that can introduce costs if your application is naive about how it accesses them.
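
To make the idea concrete, here is a minimal Python sketch of hash-based key-value sharding.  It is purely illustrative: the shard count, key names, and in-memory dictionaries are stand-ins, and real stores add consistent hashing, replication, and rebalancing.

```python
# Minimal sketch of key-value sharding: hash each key to pick a shard, so that
# keys (and load) spread roughly evenly across the shard servers.
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # each dict stands in for one server

def shard_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value):
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("cow:1234:temperature", 38.6)
print(get("cow:1234:temperature"))   # -> 38.6
```
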
2. 1/27 [Scalability and Key-Value Sharding]

Introduction to cloud scalability techniques: hierarchy, point-of-presence mini-datacenters, full datacenters, (key-value) sharding and simple fault-tolerance techniques, use of a DHT plus notifications to implement a publish-subscribe message bus, a DDS, or a message queue.  Putting it all together: the Akamai CDN and Facebook's massive content delivery infrastructure.

Note that although Azure has a blob service (a key-value sharded store for binary large objects), a sharded data store (Cosmos), and even a NoSQL layer on top of Cosmos (CosmosDB) with all sorts of computing capability, we won't be discussing those in the main lectures -- it would take too long to look at all the options and details.  So you will either see such things in recitation, as demos by the TAs, or might have to learn them on your own from the huge amount of online material Azure and Amazon AWS offer.  You find examples, copy them, and then customize them: in CS5412 this isn't considered cheating, but is the normal way people build things.  But you do need to learn to find that documentation on your own, and to follow the demos and recommendations on your own.

Slides: pptx  pdf video
Continuing our broad but shallow review, lecture two looks at ways of breaking large data sets into what are called sharded key-value stores.

Much of what we discuss in Lecture 2 can be found on Wikipedia in the key-value database entry.  (In fact they go beyond what we will be talking about and look at the whole question of treating entire databases in a key-value manner, but in CS5412 we won't tackle the full question.)

The two papers we'll specifically cover are concerned with Facebook's caching policy, and the RIPQ mechanism they used to adapt S4LRU to work on flash SSD.  But you are only responsible for understanding the overall approach -- not the details.
Containers play a huge role in the cloud.  We'll look at the container concept, at how containers are managed by the function server, and how the function/container model differs from a cloud concept called a "microservice", which is also containerized but is managed in a different way, by the "App Server".
3. 2/1 [Pool model for Microservices]  In this lecture, we will look at the issues raised by this idea of managed microservices that live in elastic pools.  Some of the issues involve what is called stateless programming (we'll explain; it doesn't mean what the word sounds like).  Another issue is virtualization using a hypervisor such as Xen versus container virtualization using Kubernetes with help from an OS like Mesos.

In a nutshell, when we put a program on a cloud, we also place files it might use as inputs, and we can even tell the cloud to run multiple instances as a pool.  Those pool members could talk to one another if you like.  So we'll learn about how to hand a program to the cloud in the form of a container with any other resources it needs, and how network virtualization creates a private world in which your processes can find one another without being disrupted by other cloud users.

Slides: pptx pdf video
We used the term "micro-service", but where did this idea come from?  What does a typical micro-service do?  We'll also look at a few of the more important micro-services found in Azure.  Then we will look at the puzzle of how these are typically implemented, and will see that some are stateless (an easier case) and some are stateful (a much harder case).  Normally, stateless systems sit in front of stateful ones.  Finally, we will see that the App Service that manages these pools needs a simple way to deal with the whole application as a kind of "package", and does this using Container Virtualization. 

A lot of terminology, but the ideas are surprisingly simple.  For example, a container is just a way to fool a normal program into thinking it is running inside a virtual machine, but without the high costs of a true virtual machine.

  2/2 (recitation)

7:30
Discussion of cloud computing projects: Everything you could possibly want to know.  You will even learn about why it might be valuable to do face recognition on cows in a milking stall! In this recitation we will be joined by Professor Giordano who will introduce some of his research and then talk about the joint projects with students from his class (you get extra credit for teaming with them, but he only has 24 students so only some of our cloud computing teams can do this).  We will also talk about other kinds of project opportunities, including self-defined projects, projects focused on cloud infrastructure, and projects focused on drones based on the Microsoft Farmbeats technology.
4. 2/3 [Implementing a smart farm.] 

This lecture focuses on the puzzle of actually using DHTs and other cloud platform ideas to implement a smart farming system.  A big part of the challenge is that we end up with an event-by-event computing model, implemented using cloud functions (small pieces of code that the platform runs for you when the event occurs), but sometimes we have logic that requires some form of state that spans multiple events.  The function tier is stateless... so we use a DHT to hold the state!  We'll see one clever idea as part of this: if the DHT put API allows you to specify which version of a (key,value) object you think you are creating, we can avoid locking entirely and still get consistency.
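
To make that no-locking trick concrete, here is a sketch of a versioned put.  The VersionedDHT class and its method names are hypothetical stand-ins for illustration, not a specific product API.

```python
# Sketch of the "versioned put" idea: a stateless function reads the current
# (version, value), computes a new value, and asks the store to accept it only
# if the version has not changed in the meantime.  Clients hold no locks; the
# tiny lock below just stands in for the shard's own internal serialization.
import threading

class VersionedDHT:
    def __init__(self):
        self._data = {}                 # key -> (version, value)
        self._lock = threading.Lock()

    def get_with_version(self, key):
        return self._data.get(key, (0, None))

    def put_if_version(self, key, value, expected_version):
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False            # someone else got there first
            self._data[key] = (current_version + 1, value)
            return True

def increment(dht, key):
    # The caller simply retries if it loses a race; no locks are ever held by it.
    while True:
        version, value = dht.get_with_version(key)
        if dht.put_if_version(key, (value or 0) + 1, version):
            return

dht = VersionedDHT()
increment(dht, "events-seen:barn7")
increment(dht, "events-seen:barn7")
print(dht.get_with_version("events-seen:barn7"))   # -> (2, 2)
```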

Slides: pptx  pdf  video
Here we will be looking at the surprisingly small changes that were needed to adapt the same cloud architecture used for human interactions through web applications (the model we looked at in lectures 1, 2 and 3) so that IoT sensors and actuators can also talk to the cloud.  Basically, we replace the first-tier web page builder system with a new first tier that securely binds to the sensor or actuator, and then we can customize it by introducing new containerized programs called "functions".  The "function service" manages them.  The relevant Wikipedia page is here, but is actually more general than what we discuss in CS5412.  Wikipedia has a whole serverless computing story (implemented by functions); we focus on a somewhat more limited case.
5. 2/8 [More on DHTs:  How can we store complicated data sets in a DHT?]

This lecture shifts back to our DHT concept and drills down on some challenges of actually squeezing complicated data into a DHT structure, either for scaling (the aggregated capacity of the DHT can be huge) or for speed (the shards operate independently, plus if we are careful, all the data can be held in memory).

Slides: pptx  pdf video
In a nutshell, although you could probably implement any application in any number of ways, the cloud forces us to think hard about which approach will work well and not collapse as soon as the application becomes successful.  The puzzle is then this: we know how DHTs scale and why they perform well, but how can we use them for things that don't look like (key,value) data?
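
As a tiny illustration of one such trick, the sketch below (key layout and names invented for this example) flattens a per-sensor time series into many small (key, value) pairs, so that each bucket can land on some shard and be fetched independently.

```python
# Flatten a "complicated" object -- a per-sensor time series -- into many small
# (key, value) pairs.  One key per (sensor, hour) bucket keeps values small and
# spreads the load across shards.
from collections import defaultdict

dht = defaultdict(list)   # stand-in for a real sharded store

def append_reading(sensor_id: str, hour: int, reading: float):
    dht[f"readings/{sensor_id}/{hour:04d}"].append(reading)

def readings_for_range(sensor_id: str, first_hour: int, last_hour: int):
    out = []
    for hour in range(first_hour, last_hour + 1):
        out.extend(dht.get(f"readings/{sensor_id}/{hour:04d}", []))
    return out

append_reading("barn7-thermo", 12, 3.5)
append_reading("barn7-thermo", 13, 4.1)
print(readings_for_range("barn7-thermo", 12, 13))   # -> [3.5, 4.1]
```
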
  2/9
(recitation)
 7:30pm
DHT topics, performance, other practical challenges.  Progress on finding project topics. We originally had a plan to invite a guest lecturer for this recitation, but the ice and snow caused a tree to fall on his internet cable, so we need to wait until Zoom works for him again.  Alicia has plenty of material to review and discuss.
6. 2/10 [IoT sensor registration.  Risk of sensor inaccuracy.]

The Azure IoT hub and the concept of a secure sensor with a managed life-cycle.  Sensor properties.  Fault-tolerance.  The META system and its model of fault-tolerance for IoT devices.

Slides: pptx  pdf  video
To start drilling down, we'll look closely at how end-users connect devices like cameras, drones, microphones (Cortana/Siri/Alexa) and so forth to the cloud.  Azure IoT Hub is a microservice for secure sensor management.

Then we will study an example of a case where an IoT sensor malfunctions to start thinking about what this even means, how we could compensate, and what corrective actions might be appropriate.

Tools for Distributed Application Management. K. Marzullo, M. Wood, K. Birman and R. Cooper. IEEE Computer, Aug. 1991, 24(8):42-51.


Also valuable from this lecture is the training material for Azure IoT Edge, mentioned on slides 21 and 22.  The Microsoft documentation can be accessed via this GitHub site (scroll down, past the index of folders and files).
State Machine Replication Model, Paxos, Derecho.  These lectures start from the idea of a shard that replicates data, formalize that model, and then look at how to build optimal implementations of the best protocols.
7. 2/15 [The State Machine Replication Model]

Sharded data often must be replicated for higher availability.  We will discuss various ways that systems have approached this, starting with really simple ways and will also see some of the potential issues, such as inconsistency or data loss.  This will lead us to define the gold standard computing model for keeping backups: state machine replication.  Chain replication is one way to solve this problem.

Slides: pptx pdf video
One big puzzle with a system split between sensors at the edge, cloud-hosted middle services, and then perhaps back-end computing on massive data sets, is that sooner or later elements will definitely fail and restart.  This lecture looks at the best ways to have your system keep running even after a stumble.

Optional reading:

Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial.  Fred B. Schneider. ACM Comput. Surv. 22, 4 (Dec. 1990), 299–319. DOI:https://doi.org/10.1145/98163.98167

Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI '04). USENIX Association, USA, 7.
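
To make the chain replication idea from the van Renesse and Schneider paper concrete, here is a toy in-memory version.  It deliberately ignores failure detection and reconfiguration, which are the interesting parts covered in the paper.

```python
# Toy chain replication: writes enter at the head, flow down the chain, and are
# acknowledged by the tail; reads go to the tail, which only holds fully
# replicated data.
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None          # next replica in the chain, None at the tail

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)   # propagate toward the tail
        return "ack"                              # tail acknowledges

    def read(self, key):
        return self.store.get(key)

head, middle, tail = Replica("head"), Replica("mid"), Replica("tail")
head.next, middle.next = middle, tail

assert head.write("cow:1234", "healthy") == "ack"
print(tail.read("cow:1234"))   # reads served by the tail see committed data
```
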
  2/16
(recitation)
7:30pm
This recitation will focus on state machine replication (SMR), a commonly used method for fault-tolerance in datacenters.  We will look at the Paxos protocol and discuss in detail how it works.
8. 2/17 [Paxos and Derecho]

In our Tuesday lecture we only scratched the surface on state machine replication and fault-tolerance concepts. Today we will discuss Paxos and Cornell's Derecho system, which is a cutting-edge version of the Paxos idea, but restructured to match better with fast datacenter networking.  You'll also hear about how we are creating Cascade, a new cloud DHT, using Derecho.

In the suggested readings (pane to the right) you'll find a paper on understanding Paxos, a really nice paper comparing 10 different replication protocols (including Derecho, which we covered in class), and then the Derecho paper itself.

Slides: pptx pdf video
Paxos Made Moderately Complex. Robbert van Renesse and Deniz Altinbuken. ACM Comput. Surv. 47, 3, Article 42 (February 2015). DOI: 10.1145/2673577

Odyssey: The Impact of Modern Hardware on Strongly-Consistent Replication Protocols.
Vasilis Gavrielatos, Antonios Katsarakis, Vijay Nagarajan. EuroSys '21, April 26-28, 2021, Online, United Kingdom.

Derecho: Fast State Machine Replication for Cloud Services. Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Sydney Zink, Kenneth P. Birman, Robbert van Renesse.  ACM Transactions on Computer Systems (TOCS), 2019.
The next set of lectures looks at how higher-level concepts map to the cloud.  We'll focus on cloud storage and will talk about object-oriented systems, fault-tolerance, and concepts for dealing with time.
9. 2/22 [Time and Causality - Actual Clocks, Logical Clocks]

Timestamped data.  Clocks and clock synchronization. 
Causal ordering and causal clocks.  Snapshots and consistent cuts

Slides: pptx pdf video
Deep dive into underlying technology by looking at the issue of temporality in modern IoT settings, where sensors might have some form of clock. 

Time, clocks, and the ordering of events in a distributed system. L. Lamport. Commun. ACM 21, 7 (July 1978), 558-565.

Distributed snapshots: determining global states of distributed systems. K. Mani Chandy and Leslie Lamport. ACM Trans. Comput. Syst. 3, 1 (February 1985), 63-75.
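
A minimal sketch of the logical clock rule from the Lamport paper above: increment on each local or send event, and on receive take the maximum of the local clock and the message timestamp, plus one.

```python
# Lamport logical clock: causally related events get increasing timestamps.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time            # timestamp carried on the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a's clock: 1
t_recv = b.receive(t_send)  # b's clock: 2, so the receive is ordered after the send
print(t_send, t_recv)
```
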
  2/23
(recitation)
7:30pm
This recitation will continue the discussion of state machine replication (SMR), focusing on the proof of the Paxos protocol and introducing the chain replication protocol.
10. 2/24 [Challenges of Dealing with Timestamped IoT Data]

How Cascade implements the fancy snapshots we saw in lecture 8.  Indexing into DHT data.  Issues of heavy temporal computing and some of the solutions people are starting to propose.

Slides:  pptx pdf video
In this lecture we will look at how Cascade was able to provide accurate snapshots, but also at some of the challenges time and causality create when we build IoT systems and want to do things like searching for a repeated pattern over a long period of time.
February break: 2/26-3/2
11. 3/3 [Strongly Consistent Geoscale Computing]

Availability zones.  WAN replication.  Mirroring versus active update models.  Google's Spanner system.  5G mobility.

Slides: pptx pdf video
If you depend on the cloud, clearly you need your cloud to be reliable.  Yet datacenters do fail.  An availability zone is a set of 2 or 3 side-by-side cloud datacenters that the vendor manages so that (if possible) at most one is down at any time.  Because the distances are so tiny, latencies are similar to intra-datacenter delays.

WAN replication arises when datacenters are located at very long distances, maybe even globally.  Yet we can still do strongly consistent data replication even at that scale, as Google's Spanner demonstrates.

Spanner: Google’s Globally Distributed Database. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2013. ACM Trans. Comput. Syst. 31, 3, Article 8 (August 2013), 22 pages.
12. 3/8 [Tracking state in big data-center systems]

This lecture will look at the concept of gossip communication protocols, where we use little one-to-one data exchanges to track information like the total storage capacity available on the Amazon S3 storage servers.  We'll discuss the concept of gossip, and then will look at two examples of how gossip can be used.  One of them was invented by the CTO of Amazon when he was still at Cornell, and the S3 system was originally based on it (over time, of course, it evolved and became different).

Slides: pptx, pdf video
When systems get huge, even the overheads of keeping track of load and capacity can be a burden.  Gossip is appealing because it imposes very low, controlled costs.

The best package I know of for using Gossip in large datacenter settings is Lonnie Princehouse's open-source MiCA platform.
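
As a toy illustration of why gossip converges so nicely, here is a push-pull averaging sketch (the capacities and round count are made up): every node repeatedly averages its estimate with a random peer, and all estimates drift toward the true mean with no coordinator involved.

```python
# Push-pull gossip averaging: each node exchanges its "free capacity" estimate
# with a random peer and both adopt the average.  After a few rounds every
# node's estimate is close to the global mean.
import random

capacity = [100.0, 40.0, 10.0, 70.0]     # each node's local free-capacity estimate

def gossip_round(values):
    nodes = list(range(len(values)))
    random.shuffle(nodes)
    for i in nodes:
        j = random.choice([n for n in nodes if n != i])
        avg = (values[i] + values[j]) / 2  # push-pull exchange
        values[i] = values[j] = avg

for _ in range(20):
    gossip_round(capacity)
print(capacity)   # all estimates are now close to the true mean (55.0)
```
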
  3/9
(recitation)
7:30pm
Invited speaker: Ranveer Chandra.  (Zoom video limited to access by Cornell CS people only) Ranveer Chandra, Head of Research in Networking, Research for Industry, and Chief Scientist for Azure Global, has offered to join us for this evening recitation.  A Cornell PhD, Ranveer is widely known as a pioneer of new wireless communication technologies.  He went on to commercialize those wireless ideas as a Microsoft product, then shifted his group to explore new software-controlled battery concepts and to integrate them into new drones.  In his role as the visionary and leader for Azure Global, he is one of the key people who will decide the Microsoft strategy for cloud IoT.
13. 3/10 [Case studies in disaster: How datacenter gossip can go wrong!]

The idea of tracking storage capacity in S3 using gossip was a big success at Amazon... but it also revealed some really bizarre and unexpected issues with gossip that required more work in their deployment.  We'll talk about some of the stories that became public and how Amazon fixed the issues.

Slides: pptx, pdf video
 
14. 3/15 [BlockChains]

Definitions. Anonymity, Byzantine DDoS attacks.  Using Ethereum or Hyperledger to encode smart contracts.   Permissionless and permissioned models, and how they differ.    Gossip in wide-area environments.    Proof of work, proof of stake, proof of elapsed time.

Slides: pptx pdf video
You can read more about BlockChains of both permissioned and permissionless flavor on Wikipedia.   Datacenter blockchains are permissioned, but because they run in a known datacenter with known membership, they are actually more like Paxos replicated logs with crash failures.  What makes something a permissioned blockchain rather than a Paxos log is that we run it in a wide-area setting and anticipate Byzantine attacks.
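
To pin down the vocabulary, here is a toy hash-chained log with a deliberately easy proof-of-work: each block commits to the hash of its predecessor, and "mining" means searching for a nonce whose block hash has a few leading zeros.  Real chains differ in almost every quantitative respect.

```python
# Tiny hash chain with toy proof-of-work.  Tampering with any earlier block
# would change its hash and break every later block's "prev" link.
import hashlib, json

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def mine_block(prev_hash: str, payload: str, difficulty: int = 4) -> dict:
    nonce = 0
    while True:
        block = {"prev": prev_hash, "payload": payload, "nonce": nonce}
        if block_hash(block).startswith("0" * difficulty):   # proof of work found
            return block
        nonce += 1

genesis = mine_block("0" * 64, "genesis")
nxt = mine_block(block_hash(genesis), "cow 1234 milked at 06:10")
print(block_hash(nxt)[:12], nxt["nonce"])
```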

You can read about one issue with Blockchains (use by criminals) here.  In class we will be discussing the ethics of cryptocurrency and blockchain research.  There are always two perspectives to any story -- and sometimes more than two.  Just the same, it is important to be informed and to make decisions based on reality, so learning about reality is valuable no matter where you end up on the issue.
  3/16
(recitation)
7:30pm
Introducing Scalable Peer-to-Peer Protocols and Blockchain Applications  
These next lectures drill down on Blockchain, but not cryptocurrency -- we will focus on how blockchains work and how the cloud uses them, especially for IoT.
15. 3/17 [BlockChain Puzzles and Concerns]

Vegvisir.  Open questions: BlockChain has been adopted so enthusiastically that early users are seemingly ignoring a great many puzzles.  We'll discuss a few of them.

Slides: pptx pdf video
The main paper we will discuss is this:

Vegvisir: A Partition-Tolerant Blockchain for the Internet-of-Things. Kolbeinn Karlsson, Weitao Jiang, Stephen Wicker, Danny Adams, Edwin Ma, Robbert van Renesse, Hakim Weatherspoon. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, 2018.
16. 3/22 [Blockchain with multiple organizations.]

A puzzle for big enterprises is that many activities span more than one company, and each might need to keep its own blockchain.  We'll look at how this makes access control and querying more difficult than it would have been with just one company using the blockchain.

Slides: pptx pdf  video
 
  3/23
(recitation)
7:30pm
For this recitation we were joined by Microsoft education specialist Vanessa Villa.  The materials she shared are not currently public, so for now the slides and video are by Cornell netid access only, on the Cornell box server.


Slides: link.  Video: link.

Vanessa discussed the major components of building an Azure cloud application:

  • How to build an end-to-end solution

  • How to connect devices to the cloud

  • Sanitize and Store data

  • Extrapolate insights

  • Data visualizations

As she introduced this end-to-end solution, she used Azure IoT Edge, IoT Hub, Storage, Azure functions, ML, and other services. In each case she:

  • Demonstrated to us what each of these services does

  • Explained where to find documentation, demos and examples for customizing each stage

  • Showed how to use these services and how to configure and connect them

Knowing which services to use, and how to customize them for your application, will be very helpful for the final projects. This talk can save you a lot of time reading documentation and large codebases on the Azure websites, and will help you pin down the components needed for your project.

Privacy in the cloud.  This is really a security topic, and Cornell has entire classes on security.  We'll just have a single lecture on it, focused on a tool you can easily download and use.
17. 3/24 [Privacy in Cloud Computing]

Privacy isn't our main topic, but lecture 16 touched on it (HIPAA issues for electronic health care records).  What's the best we can do on a cloud platform?

Slides: pptx pdf  video
The main tool we will be talking about is CryptDB, an open source platform you can download from http://GitHub.com/CryptDB

The work was done at MIT by Raluca Ada Popa, who is now a professor at UC Berkeley.
** 3/29
[External speaker: Anna Blasiak, Akamai]

Load balancing at geoscale.

Slides:  pdf  video

Anna is a Cornell PhD who went on to become Principal Architect at Akamai, the leading company for web content and app hosting.  She will be telling us about some of the challenges seen in this domain, and how her group at Akamai solves them.

As for Akamai itself, you may recall that we mentioned this company briefly in Lecture 2, when we looked at how Facebook manages to be so fast for image and video retrieval in your feed.  Akamai was the second access pathway at the bottom of the Facebook architecture slide.  That second path happened to be shut down when we did the Facebook caching study I told you about -- Facebook temporarily pauses its use of Akamai when doing major upgrades, and this allowed us to trace every cache access (otherwise we would not have known how the accesses via Akamai were handled).  But that was an unusual situation.  Facebook only does major upgrades a few times per year, for each data center.  Normally, Akamai "soaks up" a huge part of the load for Facebook, and for a great many other cloud companies too.
  3/30
(recitation)
7:30pm
Tbd  
Collections are a valuable tool for accessing data in a highly parallel way.  We'll see the concept, but then we'll look at a particularly tricky case involving huge social-networking graphs and applications that run on those.
18. 3/31 [Accessing collections from modern programming languages]

In this lecture we will look at technologies for accessing databases or other kinds of collections from programming languages like Python or C++.  Specifically, we'll look first at Pandas, which is a Python add-on package for doing database accesses right in your program.  Then we'll pivot to Azure and will look at LINQ, a general framework available from every programming language offered by Microsoft.

Slides: pptx pdf  video
A big chunk of this lecture will be just looking at documentation web pages that show how Pandas and LINQ are used, for example:

https://docs.microsoft.com/en-us/dotnet/csharp/linq/query-expression-basics

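As a small taste of what those pages cover, here is a Pandas fragment in the same spirit as a LINQ query expression: filter a collection and aggregate it right inside the program (the column names and data are invented).

```python
# Pandas: filter rows, then group and aggregate -- the analog of a LINQ
# "where ... group by ... select average" query.
import pandas as pd

readings = pd.DataFrame({
    "sensor": ["barn7", "barn7", "barn9", "barn9"],
    "hour":   [12, 13, 12, 13],
    "temp_c": [3.5, 4.1, 2.9, 3.0],
})

warm = readings[readings.temp_c > 3.0]          # filter
print(warm.groupby("sensor").temp_c.mean())     # group and aggregate
```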

Spring Break: April 2-10
19. 4/12 [Making The Cloud Friendlier for Object-Oriented Computing]

Many modern systems are object oriented, yet Linux was born in a world of record-oriented databases and files containing things like binaries or text data.  As a result there has been a push to make the cloud more object-friendly.  We saw this in lecture 18 when we discussed the LINQ tools for embedding database access directly into programming languages by extending the concept of a collection with a variety of OO primitives.  Today we will see that this extends into settings like Apache. 

This is a two-part lecture.  First, we will discuss Ceph, an open-source file system designed using the Apache architecture that has a number of optimizations for object oriented computing (Ceph is not a component of Apache "per se" but is very often used in Apache applications).  Next, we will look at an overhead issue that can arise when an object oriented application buys fully into this always-distributed, multiple-component mindset, and also at how that issue was solved in a specific setting (an air traffic control system in Europe). 

Slides: pptx pdf  video
While many big-data systems start with unstructured data (like web pages), there are growing needs to work with higher-level "objects" through file system APIs.  Ceph is a new and very popular file system that scales super well, has HPC extensions for people doing supercomputing research, and has a built-in layer for "object" storage that bypasses the POSIX file system API.

Ceph: A Scalable High-Performance Distributed File System.  Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006.  In Proceedings of the 7th symposium on Operating systems design and implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 307-320.

Ceph Object storage
  4/13
(recitation)
7:30pm
Tbd  
20. 4/14 [Apache Ecosystem.]

In ML settings, we often need to train ML models on really big sharded data sets.  This motivated a very famous approach called the MapReduce pattern.  Today we'll see what MapReduce does, and then will learn about the ecosystem of tools for MapReduce programming in the Apache platform (Hadoop and associated mechanisms).

Slides: pptx pdf video
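
The pattern itself fits in a few lines of plain Python; the sketch below does a word count with an explicit map, shuffle, and reduce phase (Hadoop's contribution is doing this reliably across thousands of machines and huge sharded files).

```python
# MapReduce word count, expressed directly in Python.
from collections import defaultdict

documents = ["the cow jumped over the moon", "the moon is made of cheese"]

def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

def reduce_phase(word, counts):
    return (word, sum(counts))

# Shuffle: group all mapped values by key (Hadoop does this across the network).
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

results = [reduce_phase(w, c) for w, c in groups.items()]
print(sorted(results))
```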

21. 4/19 [Spark RDD concept.]

The clever idea in Spark is to package LINQ-style logic as small containerized functions that have "names" and yield cacheable results: simple files stored as objects in the HDFS or Ceph file system, but which can also be cached in memory.  RDD is the name Spark introduced for this kind of object.  An RDD can be recomputed if needed, cached, or saved to a disk file so that it doesn't have to be recomputed if it is evicted from the cache and would be costly to recreate.

Slides: pptx pdf video
The Hadoop version of MapReduce used to be slow until a Berkeley project called Spark came up with a clever new caching concept centered on resilient distributed datasets, or RDDs.  We'll look at how these work, and how they can be applied to temporal data from sensors.

Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010.

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, OSDI 2008, December 2008. 
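
A minimal PySpark sketch of the RDD idea (this assumes a local pyspark installation; configuration details vary): the cached word-count RDD is materialized once and then reused by a second query without being recomputed.

```python
# Build an RDD pipeline, cache the result, and reuse it from a second query.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.parallelize(["the cow jumped over the moon",
                        "the moon is made of cheese"])

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())                 # keep the result in memory for reuse

print(counts.collect())                  # first action materializes the RDD
print(counts.filter(lambda kv: kv[1] > 1).collect())   # reuses the cached RDD
sc.stop()
```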

  4/20
(recitation)
7:30pm
Tbd  
22. 4/21 More Apache big-data tools

Slides: pptx, pdf video.
In this lecture we will review a few more of the Apache infrastructure tools.  You'll want to check out the awesome video about Kafka!
23. 4/26 [FLP Result.  Write-Once Data and Rollback/Redo Fault Tolerance]

Fault-tolerance: hard limits on what can be achieved.  A strange theorem, known as the FLP result: "fault tolerance is impossible!"  How this is worked around, as a practical matter, in MapReduce/Hadoop.  Why Hadoop's style of computing only requires file appends, not general updates or replacement.

Slides: pptx pdf  video
Many people are surprised to learn that even though Hadoop's HDFS file system can be used more or less like a normal file system, in fact Hadoop only allows programs to append to files, not to do arbitrary updates.  Why did they impose this rule?  We'll see that it comes down to fault-tolerance in Hadoop.

In this talk we discussed the Fischer, Lynch and Paterson impossibility result.  The paper is not simple to read, although it is short.  Here is a pointer to it, and then a pointer to a much easier-to-follow paper about some other limitations on fault-tolerance that might interest you:

Impossibility of distributed consensus with one faulty process. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. J. ACM 32, 2 (April 1985), 374-382. DOI=http://dx.doi.org/10.1145/3149.214121

Easy impossibility proofs for distributed consensus problems. Michael J. Fischer, Nancy A. Lynch, and Michael Merritt. In Proceedings of the Fourth Annual ACM Symposium on Principles of Distributed Computing (PODC '85), Michael Malcolm and Ray Strong (Eds.). ACM, New York, NY, USA, 59-70, 1985.  DOI=http://dx.doi.org/10.1145/323596.323602
Hardware accelerators do most or all of the heavy lifting for scaled-out cloud applications.  We'll see how this is done, why adoption of accelerators has posed big challenges, how you use them in settings where they are available, and how accelerators are reshaping the cost-of-computing story.  But we will also see that accelerators can be hard to leverage without "special sophistication".
  4/27
(recitation)
7:30pm
[Blue Origin Space Exploration.  Amazon Robotics]

IoT shows up in many robotics and robotics-like settings.  In this evening recitation our special guest will be Larry Felser, who was a technical leader at the engineering CAD company Autodesk in Ithaca when Jeff Bezos recruited him to head the engineering effort at Blue Origin, the company Jeff created to extend humanity into space.  After many years leading at Blue Origin, Larry moved to Amazon, where he co-directed their robotics effort, which works on things like drones that deliver packages to your door and robotic warehouses that automate all the movement of products onto shelves and then into packages for delivery.   Now Larry is back in the Ithaca area but still engaged with Amazon remotely.

Video recording.
FAQ: Yes, you could also reach out to him about how to get a job at one of these awesome companies -- maybe even a job that would center on using the cloud for IoT in outer space, or in business enterprise settings!
24. 4/28 [Social networking data: How the cloud deals with huge graphs]

We will look at one example of an existing big data infrastructure (Facebook TAO) and how modern systems access social networking graphs.

Slides: pptx pdf video            
TAO: Facebook's Distributed Data Store for the Social Graph. Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani. 2013 USENIX Annual Technical Conference (USENIX ATC '13).
25. 5/3 [Hardware accelerators]

These days, anyone who follows the cloud literature sees endless rave reviews of hardware devices like RDMA, NVMe, GPU and GPU clusters, TPU and TPU clusters, FPGA. How important are accelerators for cloud intelligence?  How do you get access to them, and can you use them without learning obscure languages like Verilog?

Slides: pptx pdf  video          
In the cloud, accelerators matter -- a lot.  Many kinds of cloud intelligence applications center on very costly computations, and we have to find ways to do them quickly and cost-effectively.  This dimension of the cloud centers on its ability to leverage highly specialized hardware.  We'll do a mile-high review of the most important accelerators.  You don't normally access these directly: instead, you use u-services that already are integrated with them.  But there are exceptions: GPU and TPU are sometimes accessible to users, and there are many software layers that have special permission to access other devices, too.   This drives us towards u-services: there just isn't any other way to get the needed performance at reasonable cost.

TensorFlow: A System for Large-Scale Machine Learning
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265-283.
  5/4
(recitation)
7:30pm
Tbd  
26. 5/5 [The challenge (nightmare?) of rolling out new hardware]

Why can't we just use hardware accelerators everywhere for computing, and use RDMA for all our data movements, and avoid "copying"?  We'll focus on the RDMA version of this question.  First we should understand why copying is such a costly operation, and why zero-copy nonetheless remains a holy grail.  Then we'll look at the challenges of introducing RDMA into big data centers (and how to view those challenges as a "warning" for other future accelerators that people may want to deploy at scale!)

Slides: pptx pdf video
 
27. 5/10 [Wrapup: How the Business of Cloud Computing Shapes the Future of Cloud Computing]

This lecture will shift gears and focus on questions about how the cloud area has evolved as a line of business.  What we will see is that very big investment decisions tend to be determined by very big business opportunities and revenue streams: cloud companies don't just randomly wander in various directions.  Today's cloud is already a trillion-dollar industry and investment.

Some of the questions we will explore are: Up to now, how did big business ideas shape big data and cloud computing?  Where is the financial center of gravity for today's cloud, and how is it changing?  Can we assess IoT and edge computing using insights from the cloud business perspective?

Slides: pptx pdf video
 
The CS5412 final was May 4, with a makeup exam (a different exam of similar difficulty, comprehensive and covering all lectures) on May 17, 7:30pm, in G01.