Spring 2021 Syllabus

Attending lectures is mandatory.   We encourage synchronous participation, but all lectures and recitations are recorded for offline review or asynchronous streaming in situations where synchronous participation wasn't feasible.

  Date Topic Remarks, Recommended reading (optional, see note above)
The first group of lectures are concerned with some basic principles that govern cloud computing and scalability.  The cloud is built from Linux servers but isn't the same as just logging into a Linux system and starting to build programs like you may have created in prior classes or jobs.  You need to use the vendor-provided tools, and you need to be unusually aware of the way the hardware works -- otherwise your systems perform poorly or are very costly compared to the best possible (and at cloud scale, costs mount rapidly!)
1. 2/9 [Cloud Overview.  Internet of Things and the Cloud IoT Edge]

Overview of the course.  Azure IoT model: Sensors, Azure IoT Edge roles, Azure Intelligent Edge and IoT Hub, u-services model, data center file system and database infrastructures, big-data analytics infrastructures.

We focus on Azure just for coherency, but Amazon AWS has completely analogous components.  If you know one cloud, you'll easily be able to adapt to any other cloud!

Slides: pptx  pdf
The first five lectures are really to help everyone get situated and onto the same page in terms of terminology and mindset.  In lecture one we look at an end-to-end perspective on how a smart farm would work in Microsoft Azure from data collection all the way back to data storage and big-data analytics.  The technical depth will be kind of shallow.

Azure.microsoft.com:  Home page for all of Azure and Azure IoT.  This is actually quite a useful resource for finding more details on the topics of the first few lectures.

Some of the examples in the lecture draw on work done by Professor Delimitrou in Cornell's ECE department.  A paper on her Seer system can be found here: Seer: Leveraging Big Data to Navigate the Complexity.  Seer depended on a suite of tools for benchmarking microservices discussed here: Benchmarking Microservices

One example discussed in Lecture 1 is Microsoft's smart farms project.  Read more at:  FarmBeats: AI & IoT for Agriculture.

In this first recitation meeting we will talk about how homework and projects are handled.  CS5412 will have 3 or 4 graded coding assignments (the first one will be assigned on the first day of classes!).  Alicia and Yifang will explain how to access CMS, how to prepare uploads and what to do to actually upload them, and how team formation will occur for the larger project that runs in the last 8 or 9 weeks of the semester.  Alicia Yang and Yifan Wang will be running the recitations.  Some are focused on class material and reviews of things that might have been confusing.  A few focus on the project.  .
Scalability through a technique called "key-value sharding" is foundational in the cloud.  We'll be using this extensively and testing on it, and every student needs to become proficient both in the concept and in finding clever ways to apply this even when the "match" to the problem statement may not be obvious.  Understanding the tradeoffs matters too: in key-value sharding some data structures end up split over many shards, and that can introduce costs if your application is naive about how it accesses them.
2. 2/11 [Scalability and Key-Value Sharding]

Introduction to cloud scalability techniques: hierarchy, point of presence mini-datacenters, full datacenters, (key-value) sharding and simple fault-tolerance techniques, use of a DHT plus notifications to implement a publish-subscribe message bus, a DDS, or a message queue.  Putting it all together: Akamai CDN and Facebooks massive content delivery infrastructure.

Note that although Azure has a blob service (a key-value sharded store for binary large objects), and a sharded data store (Cosmos) and even has a NoSQL layer on top of Cosmos (CosmosDB) with all sorts of computing capability, we won't be discussing those in the main lectures -- it would take too long to look at all the options and details.  So you will either see such things in recitation, as demos by the TAs, or might have to learn them on your own, from the huge amounts of online materials Azure and Amazon AWS offer.  You find examples and basically copy and then customize them: in CS5412 this isn't considered to be cheating, but is the normal way people build things.  But you do need to learn to find that documentation on your own, and to follow the demos and recommendations on your own.

Slides: pptx  pdf
Continuing our broad but shallow review, lecture two looks at ways of breaking large data sets into what are called sharded key-value stores.

Much of what we discuss in Lecture 2 can be found on Wikipedia in the key-value database entry.  (In fact they go beyond what we will be talking about and look at the whole question of treating entire databases in a key-value manner, but in CS5412 we won't tackle the full question.)

The two papers we'll specifically cover are concerned with Facebook's caching policy, and the RIPQ mechanism they used to adapt S4LRU to work on flash SSD.  But you are only responsible for understanding the overall approach -- not the details.
Containers play a huge role in the cloud.  We'll look at the container concept, at how containers are managed by the function server, and how the function/container model differs from a cloud concept called a "microservice", which is also containerized but is managed in a different way, by the "App Server".
3. 2/16 [Pool model for Microservices]  In this lecture, we will look at the issues raised by this idea of managed microservices that live in elastic pools.  Some of the issues involve what is called stateless programming (we'll explain; it doesn't mean what the word sounds like).  Another issue is virtualization using a hypervisor such as Xen versus container virtualization using Kubnetes with help from an OS like Mesos.

In a nutshell, when we put a program on a cloud, we also place files it might use as inputs and can even tell the cloud to run multiple instances as a pool.  Those pool members could talk to one-another if you like.  So we'll learn about how to hand a program to the cloud in the form of a container with any other resources it needs, and how network virtualization creates a private world in which your processes can find one-another without being disrupted by other cloud users.

Slides: pptx pdf
We used the term "micro-service", but where did this idea come from?  What does a typical micro-service do?  We'll also look at a few of the more important micro-services found in Azure.  Then we will look at the puzzle of how these are typically implemented, and will see that some are stateless (an easier case) and some are stateful (a much harder case).  Normally, stateless systems sit in front of stateful ones.  Finally, we will see that the App Service that manages these pools needs a simple way to deal with the whole application as a kind of "package", and does this using Container Virtualization. 

A lot of terminology, but the ideas are surprisingly simple.  For example, a container is just a way to fool a normal program into thinking it is running inside a virtual machine, but without the high costs of a true virtual machine.

  2/17 (recitation)

Discussion of cloud computing projects: Everything you could possibly want to know.  You will even learn about why it might be valuable to do face recognition on cows in a milking stall! In this recitation we will be joined by Professor Giordano who will introduce some of his research and then talk about the joint projects with students from his class (you get extra credit for teaming with them, but he only has 24 students so only some of our cloud computing teams can do this).  We will also talk about other kinds of project opportunities, including self-defined projects, projects focused on cloud infrastructure, and projects focused on drones based on the Microsoft Farmbeats technology.
4. 2/18 [Implementing a smart farm.] 

This lecture focuses on the puzzle of actually using DHTs and other cloud platform ideas to implement a smart farming system.  A big part of the challenge is that we end up with an event-by-event computing model, implemented using cloud functions (small pieces of code that the platform runs for you when the event occurs), but sometimes we have logic that requires some form of state that spans multiple events.  The function tier is stateless... so we use a DHT to hold the state!  We'll see one clever idea as part of this: if the DHT put API allows you to specify which version of a (key,value) object you think you are creating, we can avoid locking entirely and still get consistency.

Slides: pptx  pdf
Here we will be looking at the surprisingly small changes that were needed to take the same cloud architecture used for human interactions through web applications (the model we looked at in lectures 1, 2 and 3) so that IoT sensors and actuators can also talk to the cloud.  Basically, we replace the first tier web page builder system with a new first tier that securely binds to the sensor or actuator, and then we can cusatomize it by introducing new containerized programs called "functions".  The "function service" manages them.  The relevant Wikipedia page is here, but is actually more general than what we discuss in CS5412.  Wikipedia has a whole serverless computing story (implemented by functions); we focus on a somewhat more limited case.
5. 2/25 [More on DHTs:  How can we store complicated data sets in a DHT?]

This lectures shifts back to our DHT concept and drills down on some challenges of actually squeezing complicated data into a DHT structure for scaling (the aggregated capacity of the DHT can be huge) or for speed (the shards operate independently, plus if we are careful, all the data can be held in memory).

Slides: pptx  pdf
In a nutshell, although one could probably implement any application in any way you like, doing so often forces us to think hard about what will work well in the cloud and not collapse as soon as it becomes successful.  The puzzle is then this: we know how DHTs scale and why they perform well, but how can we use them for things that don't look like (key,value) data?
DHT topics, performance, other practical challenges.  Progress on finding project topics. We originally had a plan to invite a guest lecturer for this recitation, but the ice and snow caused a tree to fall on his internet cable, so we need to wait until Zoom works for him again.  Alicia has plenty of material to review and discuss.
6. 2/27 [IoT sensor registration.  Risk of sensor inaccuracy.]

The Azure IoT hub and the concept of a secure sensor with a managed life-cycle.  Sensor properties.  Fault-tolerance.  The META system and its model of fault-tolerance for IoT devices.

Slides: pptx  pdf
To start drilling down, we'll look closely at how end-users connect devices like cameras, drones, microphones (Cortana/Siri/Alexa) and so forth to the cloud.  Azure IoT Hub is a microservice for secure sensor management.

Then we will study an example of a case where an IoT sensor malfunctions to start thinking about what this even means, how we could compensate, and what corrective actions might be appropriate.

Tools for Distributed Application Management. K. Marzullo, M. Wood, K. Birman and R. Cooper. IEEE Computer, Aug. 1991, 24(8):42-51.
State Machine Replication Model, Paxos, Derecho.  These lectures look at how we can start with the idea of a shard that replicates data and formalize that model, then build optimal solutions to implement the best protocols.
7. 3/2 [The State Machine Replication Model]

Sharded data often must be replicated for higher availability.  We will discuss various ways that systems have approached this, starting with really simple ways and will also see some of the potential issues, such as inconsistency or data loss.  This will lead us to define the gold standard computing model for keeping backups: state machine replication.  Chain replication is one way to solve this problem.

Slides: pptx pdf 
One big puzzle with a system split between sensors at the edge, cloud-hosted middle services, and then perhaps back-end computing on massive data sets, is that sooner or later elements will definitely fail and restart.  This lecture looks at the best ways to have your system keep running even after a stumble.

Optional reading:

Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial.  Fred B. Schneider. ACM Comput. Surv. 22, 4 (Dec. 1990), 299–319. DOI:https://doi.org/10.1145/98163.98167

Chain replication for supporting high throughput and availability
.Robbert van Renesse and Fred B. Schneider. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI’04). USENIX Association, USA, 7.
Getting started on Azure.  Help with team formation if anyone is still looking for a partner.   
8. 3/4 [Paxos and Derecho]

In our Tuesday lecture we only scratched the surface on state machine replication and fault-tolerance concepts. Today we will discuss Paxos and Cornell's Derecho system, which is a cutting-edge version of the Paxos idea, but restructured to match better with fast datacenter networking.  You'll also hear about how we are creating Cascade, a new cloud DHT, using Derecho.

Slides: pptx pdf 
Paxos Made Moderately Complex. R Van Renesse and D Altinbuken. ACM Comput. Surv. 47, 3, Article 42 (February 2015), DOI: 10.1145/2673577

Derecho: Fast State Machine Replication for Cloud Services. Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weija Song, Edward Tremel, Sydney Zink, Kenneth P. Birman, Robbert van Renesse.  ACM Transactions on Computing Systems (TOCS), 2019.
The next set of lectures look at how higher level concepts map to the cloud.  We'll focus on cloud storage and will talk about object oriented systems, fault-tolerance, concepts for dealing with time. 
  3/9 Wellness day.  No lecture  
Wellness day.  No lecture  
9. 3/11 [Time and Causality - Actual Clocks, Logical Clocks]

Timestamped data.  Clocks and clock synchronization. 
Causal ordering and causal clocks.  Snapshots and consistent cuts

Slides: pptx pdf 
Deep dive into underlying technology by looking at the issue of temporality in modern IoT settings, where sensors might have some form of clock. 

Time, clocks, and the ordering of events in a distributed system. L. LamportCommun. ACM 21, 7 (July 1978), 558-565.

Distributed snapshots: determining global states of distributed systems. K. Mani Chandy and Leslie Lamport. ACM Trans. Comput. Syst. 3, 1 (February 1985), 63-75.
10. 3/16 [Challenges of Dealing with Timestamped IoT Data]

How Cascade implements the fancy snapshots we saw in lecture 8.  Indexing into DHT data.  Issues of heavy temporal computing and some of the solutions people are starting to propose.

Slides:  pptx pdf
In this lecture we will look at how Cascade was able to provide accurate snapshots, but also at some of the challenges time and causality create when we build IoT systems and want to do things like searching for a repeated pattern over a long period of time.
[Smart Homes]

IoT isn't just a cloud story.  Smart homes are using IoT heavily, but because of security, it makes more sense to try and keep most of the data in the home, right where we use it.  To see how this plays out, we will be joined for a guest lecturer by Dr. Ashutosh Saxena, Founder and CEO of Caspar.ai.
In this lecture, we'll have a guest speaker from a company called Caspar.ai.  Ashutosh Saxena used to be a Cornell faculty member in the robotics / database area, and then left to launch his company, which looks at IoT within the home.

Wikipedia Article on Smart Devices, Smart Homes, Smart Highways, Smart Grid.

Caspar.ai is working on smart homes, but they are in the middle of this ecosystem.  They are rapidly growing and Ashutosh has a lot of openings for people who are really good at the kind of scalable computing we focus on in the class.

Caspar.ai web site
11. 3/18 [Strongly Consistent Geoscale Computing]

Availability zones.  WAN replication.  Mirroring versus active update models.  Google's Spanner system.  5G mobility.

Slides: pptx pdf 
If you depend on the cloud, clearly you need your cloud to be reliable.  Yet datacenters do fail.  An availability zone is a set of 2 or 3 side-by-side cloud datacenters that the vendor manages to ensure that (if possible) at most 1 would be down at any time.  Because the distances are so tiny, latencies are similar to intra-datacenter delays. 

WAN replication arises when datacenters are located at very long distances, maybe even globally.  Yet we can still do strongly consistent data replication even at that scale, as Google's Spanner demonstrates.

Spanner: Google’s Globally Distributed Database. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2013. ACM Trans. Comput. Syst. 31, 3, Article 8 (August 2013), 22 pages.
12. 3/23 [Tracking state in big data-center systems]

This lecture will look at the concept of Gossip communication protocols, where we use little one-to-one data exchanges to track information like the total storage capacity available on the Amazon S3 storage servers.  We'll discuss the concept of gossip, and then will look at two examples of how gossip can be used.  One of them was invented by the CTO of Amazon when he was still at Cornell, and the S3 system was originally based on it (over time, of course, it evolved and became different).

Slides: pptx, pdf.
When systems get huge, even the overheads of keeping track of load and capacity can be a burden.  Gossip is appealing because it imposes very low, controlled costs.

The best package I know of for using Gossip in large datacenter settings is Lonnie Princehouse's open-source MiCA platform.
Topic tbd  
13. 3/25 [Case studies in disaster: How datacenter gossip can go wrong!]

The idea of tracking storage capacity in S3 using gossip was a big success at Amazon... but it also revealed some really bizarre and unexpected issues with gossip that required more work in their deployment.  We'll talk about some of the stories that became public and how Amazon fixed the issues.

Slides: pptx, pdf.
14. 3/30 [BlockChains]

Definitions. Anonymity, Byzantine DDoS attacks.  Using Ethereum or Hyperledger to encode smart contracts.   Permissionless and permissioned models, and how they differ.    Gossip in wide-area environments.    Proof of work, proof of stake, proof of elapsed time.

Slides: pptx pdf 
You can read more about BlockChains of both permissioned and non-permissioned flavor on Wikipedia.   Datacenter blockchains are permissioned, but because they run in a known datacenter with known membership, are actually more like Paxos replicated logs with crash failures.  What makes something a permissioned blockchain rather than a Paxos log is that we run it in a wide-area setting, and anticipate Byzantine attacks.
[Blue Origin Space Exploration.  Amazon Robotics]

IoT shows up in many robotics and robotic-like settings.  In this evening recitation lecture our special guest will be Larry Felser, who was a technical leader at the Engineering CAD company Autodesk in Ithaca when Jeff Bezos recruited him to head the Engineering effort at Blue Origin, the company Jeff created to extend humanity into space.  After many years leading at Blue Origin, Larry then moved to Amazon, where he co-directed their robotics effort, which is doing stuff like drones to deliver packages to your door, and robotic warehouses that automate all the movement of products onto shelves and then into packages for delivery.   Now Larry is back in the Ithaca area but still engaged with Amazon remotely.

Larry brings AMAZING videos from his experiences at Blue Origin and Amazon Robotics.  This is a don't-miss event!
Yes, you could also reach out to him about how to get a job at one of these awesome companies -- maybe even a job that would center using the cloud for IoT in outer space, or in business enterprise settings!
These next lectures drill down on Blockchain, but not cryptocurrency -- we will focus on how blockchains work and how the cloud uses them, especially for IoT.
15. 4/1 [BlockChain Puzzles and Concerns]

Vegvisir.  Open questions: BlockChain has been adopted so enthusiastically that early users are seemingly ignoring a great many puzzles.  We'll discuss a few of them.

Slides: pptx pdf 
The main paper we will discuss is this:

Vegvisir: A Partition-Tolerant Blockchain for the Internet-of-Things, Kolbeinn Karlsson ; Weitao Jiang ; Stephen Wicker ; Danny Adams ; Edwin Ma ; Robbert van Renesse ; Hakim Weatherspoon. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, 20
16. 4/6 [Blockchain with multiple organizations.]

A puzzle for big enterprises is that many activities span more than one company, and each might need to keep its own blockchain.  We'll look at how this makes access control and querying more difficult than it would have been with just one company using the blockchain.

Slides: pptx pdf 
Topic tbd  
Privacy in the cloud,  This is really a security topic, and Cornell has entire classes on security.  We'll just have a single lecture on it, focused on a tool you can easily download and use.
17. 4/8 [Privacy in Cloud Computing]

Privacy isn't our main topic, but lecture 16 touched on it (HiPPA issues for electronic health care records).  What's the best we can do on a cloud platform?

Slides: pptx pdf 
The main tool we will be talking about is CryptDB, an open source platform you can download from http://GitHub.com/CryptDB

The work was done at MIT by Ralucca Popa, who is now a professor at UC Berkeley
18. 4/13
[Accessing collections from modern programming languages]

In this lecture we will look at technologies for accessing databases or other kinds of collections from programming languages like Python or C++.  Specifically, we'll look first at Pandas, which is a Python add-on package for doing database accesses right in your program.  Then we'll pivot to Azure and will look at LINQ, a general framework available from every programming language offered by Microsoft.

Slides: pptx pdf 
A big chunk of this lecture will be just looking at documentation web pages that show how Pandas and LINQ are used.  The ones we will focus on are these



Collections are a valuable tool for accessing data in a highly parallel way.  We'll see the concept, but then we'll look at a particularly tricky case involving huge social-networking graphs and applications that run on those.
19. 4/15 [Making The Cloud Friendlier for Object-Oriented Computing]

Many modern systems are object oriented, yet Linux was born in a world of record-oriented databases and files containing things like binaries or text data.  As a result there has been a push to make the cloud more object-friendly.  We saw this in lecture 18 when we discussed the LINQ tools for embedding database access directly into programming languages by extending the concept of a collection with a variety of OO primitives.  Today we will see that this extends into settings like Apache. 

This is a two-part lecture.  First, we will discuss Ceph, an open-source file system designed using the Apache architecture that has a number of optimizations for object oriented computing (Ceph is not a component of Apache "per se" but is very often used in Apache applications).  Next, we will look at an overhead issue that can arise when an object oriented application buys fully into this always-distributed, multiple-component mindset, and also at how that issue was solved in a specific setting (an air traffic control system in Europe). 

Slides: pptx pdf  
While many big-data systems start with unstructured data (like web pages), there are growing needs to work with higher-level "objects" through file system APIs.  Ceph is a new and very popular file system that scales super well, has HPC extensions for people doing supercomputing research, and with a built-in layer for "object" storage that bypasses the POSIX file system API.

Ceph: A Scalable High-Performance Distributed File System.  Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006.  In Proceedings of the 7th symposium on Operating systems design and implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 307-320.

Ceph Object storage
20. 4/20 [Apache Ecosystem.]

In ML settings, we often need to train ML models on really big sharded data sets.  This motivated a very famous approach called the MapReduce pattern.  Today we'll see what MapReduce does, but then will learn about the ecosystem of tools for MapReduce programming in the Apache platform (Hadoop and associated mechanisms)

Slides: pptx pdf 

21. 4/22 [Spark RDD concept.]

The clever idea in Spark is to package LINQ-style logic as small containerized functions that can have "names" and yield cacheable results, which are simple files stored as objects into the HDFS or Ceph file system but that can be cached in memory too.  RDDs are the name Spark introduced for this kind of object.  They can be recomputed if needed, cached, and saved on a disk file to avoid recomputing them, if they need to be evicted from cache and it would be costly to recreate the contents.

Slides: pptx pdf 
The Hadoop version of MapReduce used to be slow until a Berkeley project called Spark came up with a clever new caching concept centered on resilient distributed data objects or RDDs.  We'll look at how these work, and how they can talk to temporal data from sensors.

Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010.

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, OSDI 2008, December 2008. 

22. 4/27
[Microsoft vision for Azure, Digital Farming, and IoT]

Guest Lecturer from Microsoft Azure IoT senior leadership team.
Ranveer Chandra, Chief Scientist for Azure Global and head of research for Microsoft's precision farming effort (Farmbeats) and for Microsoft's networking research group has offered to join us for this lecture.  A Cornell PhD, Ranveer is widely known as a pioneer of new wireless communication technologies.  He went on to commercialize that concept as a Microsoft product, shifted his group to explore new software-controlled battery concepts and integrate them into new drones.  In  his role as the visionary and leader for Azure Global he is one of the key people who will decide the Microsoft strategy for cloud IoT. 
    Topic skipped in 2021:  More Apache big-data tools

Due to the Covid semester schedule, we had to skip one lecture on Apache tools like HBASE, Pig, Hive, RocksDB, etc. 

The slides we used to use for that topic are here:
pptx, pdf.

This material will not be required for any quizzes, but Professor Birman or either of the TAs should be able to answer questions you might have about the subject. 
As you can see, there was basically a choice between having Ranveer's guest lecture and covering this additional big-data tools material.  We decided that having a chance to meet and hear from the chief scientist for one of the main cloud computing companies was just way more exciting!
23. 4/29 [Write-Once Data and Rollback/Redo Fault Tolerance]

Fault-tolerance in MapReduce/Hadoop.  Why Hadoop's style of computing only requires file appends, not general updates or replacement. 

Slides: pptx pdf 
Many people are surprised to learn that even though Hadoop's HDFS file system can be used more or less like a normal file system, in fact Hadoop only allows programs to append to files, not to do arbitrary updates.  Why did they impose this rule?  We'll see that it comes down to fault-tolerance in Hadoop.

In this talk we discussed the Fischer, Lynch and Patterson impossibility result.  The paper is not simple to read, although it is short.  Here is a pointer to it, and then a pointer to a much easier to follow paper about some other limitations on fault-tolerance that might interest you:

Impossibility of distributed consensus with one faulty process. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. J. ACM 32, 2 (April 1985), 374-382. DOI=http://dx.doi.org/10.1145/3149.214121

Easy impossibility proofs for distributed consensus problems. Michael J. Fischer, Nancy A. Lynch, and Michael Merritt. In Proceedings of the fourth annual ACM symposium on Principles of distributed computing (PODC '85), Michael Malcolm and Ray Strong (Eds.). ACM, New York, NY, USA, 59-70.1985.  DOI=http://dx.doi.org/10.1145/323596.323602
Hardware accelerators do most or all of the heavy lifting for scaled-out cloud applications.  We'll see how this is done, why adoption of accelerators has posed big challenges, how you use them in settings where they are available, and how accelerators are reshaping the cost-of-computing story.  But we will also see that accelerators can be hard to leverage without "special sophistication".
24. 5/4 [Social networking data: How the cloud deals with huge graphs]

We will look at one example of an existing big data infrastructure (Facebook TAO) and how modern systems access social networking graphs.

Slides: pptx pdf             
TAO: Facebook's Distributed Data Store for the Social Graph. Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov Dmitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani. 2013 USENIX Annual Technical Conference (USENIX ATC '13).
25. 5/6 [Hardware accelerators]

These days, anyone who follows the cloud literature sees endless rave reviews of hardware devices like RDMA, NVMe, GPU and GPU clusters, TPU and TPU clusters, FPGA. How important are accelerators for cloud intelligence?  How do you get access to them, and can you use them without learning obscure languages like Verilog?

Slides: pptx pdf            
In the cloud accelerators matter, a lot.  Many kinds of cloud intelligence applications center on very costly computations, and we have to find ways to do them quickly and cost-effectively.  But this dimension of the cloud centers on its ability to leverage highly specialized hardware.  We'll do a mile-high review of the most important accelerators.  You don't normally access these directly: instead, you use u-services that already are integrated with them.  But there are exceptions: GPU and TPU are sometimes accessible to users, and there are many software layers that have special permission to access other devices, too.   This drives us towards u-services: there just isn't any other way to get the needed performance at reasonable cost.

TensorFlow: A System for Large-Scale Machine Learning
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265-283.
26. 5/11 [Accessing hardware accelerators from normal programs]

There are a few options for leveraging a hardware accelerator.  You can program it directly (for example, by writing CUDA code for a GPU), use an existing library already installed in the GPU, or use a higher level language that compiles into a mix of host code and accelerator code.  We'll talk about this different approaches.

Slides: pptx pdf
27. 5/13 [The challenge (nightmare?) of rolling out new hardware]

Why can't we just use hardware accelerators everywhere for computing, and use RDMA for all our data movements, and avoid "copying"?  We'll focus on the RDMA version of this question.  First we should understand why copying is such a costly operation, and why zero-copy nonetheless remains a holy grail.  Then we'll look at the challenges of introducing RDMA into big data centers (and how to view those challenges as a "warning" for other future accelerators that people may want to deploy at scale!)

Slides: pptx pdf 
CS5412 will not have a final in Spring 2021.  The "exam" portion of your grade will be based on take-home quizzes.