Spring 2020 Syllabus

Notice that there are links to the slides in the syllabus, so you can make notes right on a copy of the slides if you wish.

Attending lectures is mandatory.  However, when the syllabus points to additional readings, keep in mind that those are encouraged, yet optional. They are a good way to learn more, or to regain clarity on a lecture topic you weren't completely comfortable with (or to catch up if you miss a lecture entirely). To learn the material in CS5412 it is usually a good idea to at least look at these recommended readings. Just the same, you won't ever be "tested" on something that requires knowing things that were never covered in class, or asked about optional readings when presenting your project (you might be asked about topics covered in class that seem relevant to the project).

Some lectures are really part of a larger group of lectures with multiple readings, covered over multiple class meetings. In those cases the same reading won't be repeated again and again, but we might refer back to things that a prior lecture and reading explored.

Ken sometimes notices typos while lecturing, and will post updated slides for the lecture within a day or two if that happens.  Below, the lecture ordering differs from what we did in past years.  Ken will be updating his slide sets before these lectures are given, but if you visit a slide set before he has had time to get to it, you might notice that it seems out of place, has the wrong title, or shows some other oddity.  In fact, as of December 2019, none of the slide sets have been updated.  They should all be properly numbered and so forth by mid-February.

Notice that we have a few guest lectures during recitation sessions and/or in class.  The ones in class are required, and you could see a question about them on a quiz or the final.  We marked the corresponding date fields in green when a speaker will be joining us in the evening, during recitation.  Recitation attendance is optional but strongly encouraged.

  Date | Topic | Remarks, recommended reading (optional; see note above)
The first group of lectures is concerned with some basic principles that govern cloud computing and scalability.  The cloud is built from Linux servers, but using it isn't the same as just logging into a Linux system and starting to build programs like the ones you may have created in prior classes or jobs.  You need to use the vendor-provided tools, and you need to be unusually aware of the way the hardware works -- otherwise your systems will perform poorly or cost far more than the best possible design (and at cloud scale, costs mount rapidly!)
1. 1/21 [Cloud Overview.  Internet of Things and the Cloud IoT Edge]

Overview of the course.  Azure IoT model: Sensors, Azure IoT Edge roles, Azure Intelligent Edge and IoT Hub, u-services model, data center file system and database infrastructures, big-data analytics infrastructures.

We focus on Azure just for coherency, but Amazon AWS has completely analogous components except with less focus (as of today) on IoT.

Slides: pptx  pdf
The first five lectures are really to help everyone get situated and onto the same page in terms of terminology and mindset.  In lecture one we look at an end-to-end perspective on how a smart farm would work in Microsoft Azure, from data collection all the way back to data storage and big-data analytics.  The technical depth will be deliberately shallow.

Azure.microsoft.com:  Home page for all of Azure and Azure IoT.  This is actually quite a useful resource for finding more details on the topics of the first few lectures.

Some of the examples in the lecture draw on work done by Professor Delimitrou in Cornell's ECE department.  A paper on her Seer system can be found here: Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices.  Seer depended on a suite of tools for benchmarking microservices discussed here: Benchmarking Microservices

One example discussed in Lecture 1 is Microsoft's smart farms project.  Read more at:  FarmBeats: AI & IoT for Agriculture.
  1/22
(recitation)

7:30 G01
In this first recitation meeting we will talk about how homework and projects are handled.  CS5412 will have 3 or 4 graded coding assignments (the first one will be assigned on the first day of classes!).  Sagar will explain how to access CMS, how to prepare and upload your submissions, and how team formation will occur for the larger project that runs in the last 8 or 9 weeks of the semester.  The homeworks, in total, count for 1/6 of your total grade.  Your project counts for 2/6.  The remaining 50% of your grade is based on in-class quizzes and the final exam. Sagar Jha will be running the recitations.  Some are focused on class material and reviews of things that might have been confusing.  A few focus on the project.  And there are also some guest lecturers who join us in the main class or in recitation, starting on January 29 and then again on February 5.
Scalability through a technique called "key-value sharding" is foundational in the cloud.  We'll be using this extensively and testing on it, and every student needs to become proficient both in the concept and in finding clever ways to apply this even when the "match" to the problem statement may not be obvious.  Understanding the tradeoffs matters too: in key-value sharding some data structures end up split over many shards, and that can introduce costs if your application is naive about how it accesses them.
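To make the idea concrete, here is a minimal sketch of hash-based sharding in Python. The shard count, key names, and in-memory dicts are purely illustrative; real stores typically use consistent hashing so shards can be added or removed cheaply, and each shard runs on its own server.

```python
# A minimal sketch of key-value sharding. NUM_SHARDS and the dict-based
# "servers" are illustrative stand-ins for real shard servers.
import hashlib

NUM_SHARDS = 8  # real deployments size this to the workload

def shard_for(key: str) -> int:
    """Map a key deterministically to one of NUM_SHARDS shards."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Each shard is an independent key-value store; here, just a dict.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(key: str, value):
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("sensor-17/temp", 72.4)
assert get("sensor-17/temp") == 72.4
```

Notice the tradeoff mentioned above: a single `get` touches exactly one shard, but an operation that scans a data structure spread over many keys may have to visit every shard.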
2. 1/23 [Scalability and Key-Value Sharding]

Introduction to cloud scalability techniques: hierarchy, point-of-presence mini-datacenters, full datacenters, (key-value) sharding and simple fault-tolerance techniques, use of a DHT plus notifications to implement a publish-subscribe message bus, a DDS, or a message queue.  Putting it all together: the Akamai CDN and Facebook's massive content delivery infrastructure.

Note that although Azure has a blob service (a key-value sharded store for binary large objects), a sharded data store (Cosmos), and even a NoSQL layer on top of Cosmos (CosmosDB) with all sorts of computing capability, we won't be discussing those in the main lectures -- it would take too long to look at all the options and details.  So you will either see such things in recitation, as demos by the TAs, or you may have to learn them on your own from the huge amount of online material Azure and Amazon AWS offer.  You find examples, copy them, and then customize them: in CS5412 this isn't considered cheating, but is the normal way people build things.  But you do need to learn to find that documentation on your own, and to follow the demos and recommendations on your own.

Slides: pptx  pdf
Continuing our broad but shallow review, lecture two looks at ways of breaking large data sets into what are called sharded key-value stores.

Much of what we discuss in Lecture 2 can be found on Wikipedia in the key-value database entry.  (In fact they go beyond what we will be talking about and look at the whole question of treating entire databases in a key-value manner, but in CS5412 we won't tackle the full question.)

The two papers we'll specifically cover are concerned with Facebook's caching policy, and the RIPQ mechanism they used to adapt S4LRU to work on flash SSD.  But you are only responsible for understanding the overall approach -- not the details.
Containers play a huge role in the cloud.  We'll look at the container concept, at how containers are managed by the function server, and at how the function/container model differs from a cloud concept called a "microservice", which is also containerized but is managed in a different way, by the "App Service".
3. 1/28 [Pool model for Microservices]  In this lecture, we will look at the issues raised by this idea of managed microservices that live in elastic pools.  Some of the issues involve what is called stateless programming (we'll explain; it doesn't mean what the word sounds like).  Another issue is virtualization using a hypervisor such as Xen versus container virtualization using Kubernetes, with help from a cluster manager like Mesos.

In a nutshell, when we put a program on a cloud, we also place files it might use as inputs, and can even tell the cloud to run multiple instances as a pool.  Those pool members can talk to one another if you like.  So we'll learn how to hand a program to the cloud in the form of a container with any other resources it needs, and how network virtualization creates a private world in which your processes can find one another without being disrupted by other cloud users.

Slides: pptx pdf
We used the term "micro-service", but where did this idea come from?  What does a typical micro-service do?  We'll also look at a few of the more important micro-services found in Azure.  Then we will look at the puzzle of how these are typically implemented, and will see that some are stateless (an easier case) and some are stateful (a much harder case).  Normally, stateless systems sit in front of stateful ones.  Finally, we will see that the App Service that manages these pools needs a simple way to deal with the whole application as a kind of "package", and does this using Container Virtualization. 

A lot of terminology, but the ideas are surprisingly simple.  For example, a container is just a way to fool a normal program into thinking it is running inside a virtual machine, but without the high costs of a true virtual machine.

  1/29 (recitation)

7:30 G01
Discussion of cloud computing projects: Everything you could possibly want to know.  You will even learn about why it might be valuable to do face recognition on cows in a milking stall! In this recitation we will be joined by Professor Giordano, who will introduce some of his research and then talk about joint projects with students from his class (you get extra credit for teaming with them, but he only has 24 students, so only some of our cloud computing teams can do this).  We will also talk about other kinds of project opportunities, including self-defined projects, projects focused on cloud infrastructure, and projects focused on drones based on the Microsoft FarmBeats technology.
4. 1/30 [IoT Edge, IoT Cloud, Function servers.] 

Connecting a cloud to some form of device, like a drone or a camera.  IoT Edge server concept.

IoT Hub concept.  Function server.  Customization of event handlers.  Typical micro-services: Message queuing, message bus, storage, data compression, image segmentation and tagging, other data transformations.  Concept of event-driven state machines in the function-server setting.  Stateless functions, and where to save state.

Slides: pptx  pdf
Here we will look at the surprisingly small changes that were needed to adapt the cloud architecture used for human interactions through web applications (the model we looked at in lectures 1, 2 and 3) so that IoT sensors and actuators can also talk to the cloud.  Basically, we replace the first-tier web page builder with a new first tier that securely binds to the sensor or actuator, and then we can customize it by introducing new containerized programs called "functions".  The "function service" manages them.  The relevant Wikipedia page is here, but is actually more general than what we discuss in CS5412.  Wikipedia has a whole serverless computing story (implemented by functions); we focus on a somewhat more limited case.
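To illustrate the stateless style, here is a generic sketch (this is not the actual Azure Functions API; the `kv_store` name is a hypothetical stand-in for some external storage microservice). The key point: the handler keeps nothing in local variables between events, so any pool member can handle any event.

```python
# A sketch of a "stateless function": durable state lives in an external
# store (kv_store is a stand-in for a cloud key-value or blob service),
# never in the function instance itself.
kv_store = {}

def handle_sensor_event(event: dict) -> dict:
    device = event["device_id"]
    reading = event["temperature"]
    # Read prior state from the external store, not from a local variable.
    history = kv_store.get(device, [])
    history = (history + [reading])[-10:]   # keep the last 10 readings
    kv_store[device] = history              # write state back out
    avg = sum(history) / len(history)
    return {"device": device, "rolling_avg": avg}

print(handle_sensor_event({"device_id": "cam-3", "temperature": 71.5}))
```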
5. 2/4 [More on DHTs]

This lecture shifts back to our DHT concept and drills down on some of the challenges of actually squeezing complicated data into a DHT structure for scaling (the aggregated capacity of the DHT can be huge) or for speed (the shards operate independently, plus, if we are careful, all the data can be held in memory).

Slides: pptx  pdf
In a nutshell, although you could probably implement any application in any way you like, building for the cloud forces us to think hard about what will work well and not collapse as soon as it becomes successful.  The puzzle is then this: we know how DHTs scale and why they perform well, but how can we use them for things that don't look like (key,value) data?
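One common trick, sketched below under illustrative assumptions (the `dht` dict stands in for a real DHT client, and the key format is made up): data that isn't naturally (key,value) can often be encoded that way. Here a sensor time series is bucketed into per-hour keys, so load spreads across shards while each bucket stays small enough to fetch in one operation.

```python
# Encoding a time series as (key, value) pairs for a DHT.
dht = {}  # stand-in for a real DHT client

def bucket_key(device_id: str, timestamp: int) -> str:
    hour = timestamp // 3600          # one bucket per device per hour
    return f"{device_id}/{hour}"

def append_reading(device_id: str, timestamp: int, value: float):
    key = bucket_key(device_id, timestamp)
    dht.setdefault(key, []).append((timestamp, value))

def readings_for_hour(device_id: str, some_ts: int):
    return dht.get(bucket_key(device_id, some_ts), [])

append_reading("soil-7", 1_580_000_000, 0.31)
print(readings_for_hour("soil-7", 1_580_000_000))
```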
  2/5
(recitation. 7:30pm in Gates G01)
[Robotics at Blue Origin and Amazon]

Guest lecturer from Blue Origin / Amazon AWS.
For this recitation, our speaker will be Larry Felser, the past CTO ("Chief Technology Officer") of Blue Origin, who ran the engineering division of Blue Origin for its first ten years.  He then shifted to work at AWS on their robotic systems for home delivery and warehouse management.  He now lives in Enfield, working remotely for AWS, and we are reaching out to see if he can join us.
The course has an emphasis on IoT, so we'll learn about the IoT Hub, which links sensors to the function server, which in turn talks to microservices.
6. 2/6 [IoT sensor registration.  Risk of sensor inaccuracy.]

The Azure IoT hub and the concept of a secure sensor with a managed life-cycle.  Sensor properties.  Fault-tolerance.  The META system and its model of fault-tolerance for IoT devices.

Slides: pptx  pdf
To start drilling down, we'll look closely at how end-users connect devices like cameras, drones, microphones (Cortana/Siri/Alexa) and so forth to the cloud.  Azure IoT Hub is a microservice for secure sensor management.

Then we will study an example of a case where an IoT sensor malfunctions to start thinking about what this even means, how we could compensate, and what corrective actions might be appropriate.

Tools for Distributed Application Management. K. Marzullo, M. Wood, K. Birman and R. Cooper. IEEE Computer, Aug. 1991, 24(8):42-51.
7. 2/11 [The challenge of "always sharded" computing]

Very large-scale systems depend on never putting all the data or all the work at a single place -- you lose scalability otherwise.  This gave us the DHT storage abstraction.   But how can we maximize parallelism?  We'll start by looking at the famous MapReduce computing pattern for big data.  But then we'll take a big leap and ask the obvious question: Can we do anything like that for edge IoT situations, with huge numbers of sensors?

Slides: pptx  pdf
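For concreteness, here is a toy, single-process rendering of the MapReduce pattern (real deployments shard the map and reduce work across many machines); word count is the classic example.

```python
# MapReduce in miniature: map emits (key, value) pairs, a shuffle groups
# values by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(document: str):
    for word in document.split():
        yield (word.lower(), 1)        # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)         # group values by key, as the
    for key, value in pairs:           # framework does between phases
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cloud scales", "the cloud shards the data"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle(pairs)))    # {'the': 3, 'cloud': 2, ...}
```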
 
The next set of lectures looks at how higher-level concepts map to the cloud.  We'll focus on cloud storage and will talk about object-oriented systems, fault-tolerance, and concepts for dealing with time.
8. 2/13 [Fault-tolerance]

Fault-tolerance concepts.  Split-brain concept.  "Stateless" computing with replicated persistent data. Chain replication.  State machine replication and Paxos.  Derecho.

Slides: pptx  pdf
One big puzzle with a system split between sensors at the edge, cloud-hosted middle services, and then perhaps back-end computing on massive data sets, is that sooner or later elements will definitely fail and restart.  This lecture looks at the best ways to have your system keep running even after a stumble.
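As a preview of one of the readings below, here is a minimal sketch of chain replication (van Renesse and Schneider): writes enter at the head and flow down the chain, while reads go to the tail, which only ever sees fully replicated data. Failure handling (splicing a crashed node out of the chain) is omitted, and the classes are illustrative.

```python
# A toy chain of replicas: write at the head, read at the tail.
class Replica:
    def __init__(self):
        self.store = {}
        self.next = None  # successor in the chain; None at the tail

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            self.next.write(key, value)   # forward toward the tail

head, middle, tail = Replica(), Replica(), Replica()
head.next, middle.next = middle, tail

head.write("x", 42)              # clients always write at the head
print(tail.store.get("x"))       # clients read at the tail: 42
```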

Optional reading:

Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI '04). USENIX Association, 2004.

Paxos Made Moderately Complex. R Van Renesse and D Altinbuken. ACM Comput. Surv. 47, 3, Article 42 (February 2015), DOI: 10.1145/2673577

Derecho: Fast State Machine Replication for Cloud Services. Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Sydney Zink, Kenneth P. Birman, Robbert van Renesse. To appear, ACM Transactions on Computer Systems (TOCS), 2019.
9. 2/18 [Time and Causality]

Timestamped data.  Clocks and clock synchronization.  Sensor time, platform time.  Causal ordering and causal clocks.

Slides: pptx  pdf
We will start a somewhat deeper dive into underlying technology by looking at the issue of temporality in modern IoT settings, where sensors might have some form of clock.  We will end by looking at Lamport's definitions for causality and consistent cuts.
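Lamport's logical clock rules are simple enough to sketch directly. This follows the 1978 paper listed below; the Python class is just an illustration: tick on local events, attach the clock to outgoing messages, and on receipt advance to max(local, received) + 1.

```python
# Lamport logical clocks: if event a causally precedes event b,
# then clock(a) < clock(b).
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1
        return self.time

    def send(self) -> int:
        self.time += 1
        return self.time            # timestamp carried on the message

    def receive(self, msg_time: int) -> int:
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()                        # a's clock becomes 1
b.receive(t)                        # b's clock jumps to 2
print(a.time, b.time)               # 1 2
```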

Time, clocks, and the ordering of events in a distributed system. L. Lamport. Commun. ACM 21, 7 (July 1978), 558-565.

Distributed snapshots: determining global states of distributed systems. K. Mani Chandy and Leslie Lamport. ACM Trans. Comput. Syst. 3, 1 (February 1985), 63-75.
10. 2/20 [Temporality and causality in storage systems]

File systems with concepts of temporal data and causal consistency.

Slides: pptx  pdf
This lecture shows how the ideas from the previous lecture can be useful in understanding some of the issues that arise in modern file systems, which turn out to be pretty terrible at handling real-time data, or causality.  But there are file systems that do better (including one we created here at Cornell!), and more broadly, there are ways IoT application developers can overcome the limitations and problems.  We'll see how that might work for a file system, and then for a key-value "object store".

The Freeze-Frame File System. Weijia Song, Theo Gkountouvas, Ken Birman, Qi Chen, and Zhen Xiao. In Proceedings of the Seventh ACM Symposium on Cloud Computing (SoCC '16), Marcos K. Aguilera, Brian Cooper, and Yanlei Diao (Eds.). ACM, New York, NY, USA, 307-320.

As a side remark, machine learning systems often use the ARIMA model when accessing temporal data.
-- 2/25 February break, no classes.

Have fun!  We won't go on holiday as a class, but if you might be tempted, this is a photo from Lake Placid's "Whiteface" mountain.  That area, or the famous ski areas in Vermont and New Hampshire, are just a few hours from Ithaca by car.  And guess what?  They offer great lessons and have beginner slopes too...  Just bring warm clothes and drive carefully: the roads can be slippery up in the frosty north...
11. 2/27 [Smart Homes]

IoT isn't just a cloud story.  Smart homes use IoT heavily, but because of security, it makes more sense to try to keep most of the data in the home, right where we use it.  To see how this plays out, we will be joined by guest lecturer Dr. Ashutosh Saxena, Founder and CEO of Caspar.ai.
Ashutosh Saxena was a Cornell faculty member in the robotics / database area before leaving to launch his company, which looks at IoT within the home.

Wikipedia Article on Smart Devices, Smart Homes, Smart Highways, Smart Grid.

Caspar.ai is working on smart homes, and sits right in the middle of this ecosystem.  They are growing rapidly, and Ashutosh has many openings for people who are really good at the kind of scalable computing we focus on in the class.

Caspar.ai web site
12. 3/3 [Strongly Consistent Geoscale Computing]

Availability zones.  WAN replication.  Mirroring versus active update models.  Google's Spanner system.  5G mobility.

Slides: pptx  pdf
If you depend on the cloud, clearly you need your cloud to be reliable.  Yet datacenters do fail.  An availability zone is a set of 2 or 3 side-by-side cloud datacenters that the vendor manages to ensure that (if possible) at most one is down at any time.  Because the distances are so tiny, latencies are similar to intra-datacenter delays.

WAN replication arises when datacenters are located at very long distances, maybe even globally.  Yet we can still do strongly consistent data replication even at that scale, as Google's Spanner demonstrates.

Spanner: Google’s Globally Distributed Database. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2013. ACM Trans. Comput. Syst. 31, 3, Article 8 (August 2013), 22 pages.
13. 3/5 [Making File Systems Friendly for Object-Oriented Computing]

Many modern systems are object oriented, and the cloud has dealt with this by introducing object oriented file systems.  In this particular lecture we'll focus on a file system called Ceph (the paper is linked on the right).  Ceph is very widely used and popular.

Slides: pptx pdf
While many big-data systems start with unstructured data (like web pages), there are growing needs to work with higher-level "objects" through file system APIs.  Ceph is a very popular file system that scales extremely well, has HPC extensions for people doing supercomputing research, and includes a built-in layer for "object" storage that bypasses the POSIX file system API.

Ceph: A Scalable High-Performance Distributed File System.  Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006.  In Proceedings of the 7th symposium on Operating systems design and implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 307-320.

Ceph Object storage

These next lectures drill down on Blockchain, but not cryptocurrency -- we will focus on how blockchains work and how the cloud uses them, especially for IoT.
14. 3/10 [BlockChains]

Definitions.  Anonymity, Byzantine DDoS attacks.  Using Ethereum or Hyperledger to encode smart contracts.   Permissionless and permissioned models, and how they differ.  Proof of work, proof of stake, proof of elapsed time.

Slides: pptx pdf
You can read more about BlockChains of both permissioned and non-permissioned flavor on Wikipedia.
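To get a feel for proof of work, here is a toy Python sketch. The difficulty, hash framing, and block encoding are purely illustrative; real permissionless chains use vastly harder puzzles and different block formats, but the asymmetry is the same: finding the nonce is expensive, verifying it is cheap.

```python
# Toy proof of work: find a nonce whose SHA-256 hash, combined with the
# block contents, begins with `difficulty` zero hex digits.
import hashlib

def proof_of_work(block_data: str, difficulty: int = 4) -> int:
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce            # anyone can verify this with one hash
        nonce += 1

nonce = proof_of_work("prev_hash=abc123;tx=pay 5 to Bob")
print("found nonce:", nonce)
```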
15. 3/12 [BlockChain Puzzles and Concerns]

Vegvisir.  Open questions: BlockChain has been adopted so enthusiastically that early users are seemingly ignoring a great many puzzles.  We'll discuss a few of them.

Slides: pptx pdf             recorded version (Cornell only)
The main paper we will discuss is this:

Vegvisir: A Partition-Tolerant Blockchain for the Internet-of-Things. Kolbeinn Karlsson, Weitao Jiang, Stephen Wicker, Danny Adams, Edwin Ma, Robbert van Renesse, Hakim Weatherspoon. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, 2018.
16. 4/7 [Blockchain with multiple organizations.]

A puzzle for big enterprises is that many activities span more than one company, and each might need to keep its own blockchain.  We'll look at how this makes access control and querying more difficult than it would have been with just one company using the blockchain.

Slides: pptx pdf             recorded version (Cornell only)
 
Privacy in the cloud.  This is really a security topic, and Cornell has entire classes on security.  We'll just have a single lecture on it, focused on a tool you can easily download and use.
17. 4/8
optional in 2020sp
[Privacy in Cloud Computing]

Privacy isn't our main topic, but lecture 16 touched on it (HIPAA issues for electronic health care records).  What's the best we can do on a cloud platform?

Slides: pptx pdf            recorded version (Cornell only)
The main tool we will be talking about is CryptDB, an open source platform you can download from http://GitHub.com/CryptDB

The work was done at MIT by Raluca Ada Popa, who is now a professor at UC Berkeley.
Collections are a valuable tool for accessing data in a highly parallel way.  We'll see the concept, but then we'll look at a particularly tricky case involving huge social-networking graphs and applications that run on those.
18. 4/9
[Accessing collections from modern programming languages]

In this lecture we will look at technologies for accessing databases or other kinds of collections from programming languages like Python or C++.  Specifically, we'll look first at Pandas, a Python add-on package for doing database-style accesses right in your program.  Then we'll pivot to Azure and look at LINQ, a general framework available from every programming language offered by Microsoft.

Slides: pptx pdf                   recorded version (Cornell only)
A big chunk of this lecture will be just looking at documentation web pages that show how Pandas and LINQ are used.  The main one we will focus on is this:

https://docs.microsoft.com/en-us/dotnet/csharp/linq/query-expression-basics

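As a taste of the Pandas side, here is a small example in the spirit of a LINQ query expression (the data and column names are made up): filter rows, group, and aggregate, all from inside the program.

```python
# Pandas: declarative-style queries over an in-memory collection.
import pandas as pd

readings = pd.DataFrame({
    "device": ["cam-1", "cam-1", "cam-2", "cam-2"],
    "temp":   [70.1, 71.3, 68.9, 80.2],
})

# Roughly: "from r in readings where r.temp > 69 group by device select avg"
hot = readings[readings["temp"] > 69]
print(hot.groupby("device")["temp"].mean())
```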

19. 4/14 [Apache Ecosystem.]

A few weeks ago, we first learned about the MapReduce pattern.  Today we'll learn about the ecosystem of tools for MapReduce programming in the Apache platform (Hadoop and associated mechanisms).

Slides: pptx pdf                     recorded version (Cornell only)
 
20. 4/15
optional in 2020sp

[Spark RDD concept.]

The clever idea in Spark is to package LINQ-style logic as small containerized functions that can have "names" and yield cacheable results.  RDD is the name Spark introduced for this kind of object.  RDDs can be recomputed if needed, cached, and saved to a disk file so that if one is evicted from cache and would be costly to recreate, its contents don't have to be recomputed from scratch.

Slides: pptx pdf                        recorded version (Cornell only)
The Hadoop version of MapReduce used to be slow, until a Berkeley project called Spark came up with a clever new caching concept centered on resilient distributed datasets, or RDDs.  We'll look at how these work, and how they can talk to temporal data from sensors.
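Here is a sketch of the caching idea using the PySpark API (the file path and local master setting are illustrative): transformations are lazy, and cache() marks an RDD as worth keeping in memory so later actions reuse it instead of recomputing its lineage.

```python
# PySpark sketch: lazy transformations plus caching of a reused RDD.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines  = sc.textFile("sensor_log.txt")                # lazy: nothing runs yet
errors = lines.filter(lambda l: "ERROR" in l).cache() # mark for caching

print(errors.count())                                 # first action: computes, then caches
print(errors.filter(lambda l: "cam-3" in l).count())  # reuses the cached RDD
```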

Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010.

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, OSDI 2008, December 2008. 

21. 4/16
[Write-Once Data and Rollback/Redo Fault Tolerance]

Fault-tolerance in MapReduce/Hadoop.  Why Hadoop's style of computing only requires file appends, not general updates or replacement. 

Slides: pptx pdf                        recorded version (Cornell only)
Many people are surprised to learn that even though Hadoop's HDFS file system can be used more or less like a normal file system, in fact Hadoop only allows programs to append to files, not to do arbitrary updates.  Why did they impose this rule?  We'll see that it comes down to fault-tolerance in Hadoop.
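To see why append-only writes simplify recovery, here is a hedged sketch of a redo log (the file name and record format are made up, and Hadoop's actual recovery machinery is more elaborate): because output is only ever appended, recovery is just replaying the log, with no in-place overwrites to undo.

```python
# Rollback/redo in miniature: state is rebuilt by replaying an
# append-only log, so a crash never leaves a half-overwritten record.
import os

LOG = "appends.log"

def append_record(record: str):
    with open(LOG, "a") as f:
        f.write(record + "\n")     # append-only: never seek and overwrite

def recover() -> list:
    if not os.path.exists(LOG):
        return []
    with open(LOG) as f:
        return [line.rstrip("\n") for line in f]   # redo by replaying

append_record("cam-1,71.3")
append_record("cam-2,68.9")
print(recover())
```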

In this talk we discussed the Fischer, Lynch and Paterson impossibility result.  The paper is not simple to read, although it is short.  Here is a pointer to it, and then a pointer to a much easier to follow paper about some other limitations on fault-tolerance that might interest you:

Impossibility of distributed consensus with one faulty process. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. J. ACM 32, 2 (April 1985), 374-382. DOI=http://dx.doi.org/10.1145/3149.214121

Easy impossibility proofs for distributed consensus problems. Michael J. Fischer, Nancy A. Lynch, and Michael Merritt. In Proceedings of the fourth annual ACM symposium on Principles of distributed computing (PODC '85), Michael Malcolm and Ray Strong (Eds.). ACM, New York, NY, USA, 59-70.1985.  DOI=http://dx.doi.org/10.1145/323596.323602
22. 4/21 [Social networking data: How the cloud deals with huge graphs]

We will look at one example of an existing big data infrastructure (Facebook TAO) and how modern systems access social networking graphs.

Slides: pptx pdf                       recorded version (Cornell only).
TAO: Facebook's Distributed Data Store for the Social Graph. Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani. 2013 USENIX Annual Technical Conference (USENIX ATC '13).
Hardware accelerators do most or all of the heavy lifting for scaled-out cloud applications.  We'll see how this is done, why adoption of accelerators has posed big challenges, how you use them in settings where they are available, and how accelerators are reshaping the cost-of-computing story.  But we will also see that accelerators can be hard to leverage without "special sophistication".
23. 4/23 [Hardware accelerators]

These days, anyone who follows the cloud literature sees endless rave reviews of hardware devices like RDMA, NVMe, GPU and GPU clusters, TPU and TPU clusters, FPGA. How important are accelerators for cloud intelligence?  How do you get access to them, and can you use them without learning obscure languages like Verilog?

Slides: pptx  pdf                     recorded version (Cornell only)
In the cloud, accelerators matter a great deal.  Many kinds of cloud intelligence applications center on very costly computations, and we have to find ways to do them quickly and cost-effectively.  This dimension of the cloud centers on its ability to leverage highly specialized hardware.  We'll do a mile-high review of the most important accelerators.  You don't normally access these directly: instead, you use u-services that are already integrated with them.  But there are exceptions: GPU and TPU are sometimes accessible to users, and there are many software layers that have special permission to access other devices, too.   This drives us towards u-services: there just isn't any other way to get the needed performance at reasonable cost.

TensorFlow: A System for Large-Scale Machine Learning
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265-283.
24. 4/28 [Microsoft vision for Azure, Digital Farming, and IoT]

Guest Lecturer from Microsoft Azure IoT senior leadership team.

recording (Cornell only)
Ranveer Chandra, Chief Scientist for Azure Global, joined us for this lecture (it was presented remotely, and we do not yet have permission to post the slides).  A Cornell PhD, Ranveer is widely known as a pioneer of new wireless communication technologies.  He went on to commercialize that work as a Microsoft product, shifted his group to explore new software-controlled battery concepts and integrate them into new drones, and recently was promoted to a new role as the visionary and leader for Azure Global.
25. 4/30 [Accessing hardware accelerators from normal programs]

There are a few options for leveraging a hardware accelerator.  You can program it directly (for example, by writing CUDA code for a GPU), use an existing library already installed on the GPU, or use a higher-level language that compiles into a mix of host code and accelerator code.  We'll talk about these different approaches.

Slides: pptx pdf                         recorded version (Cornell only)
 
26. 5/5 [Why isn't RDMA everywhere?]

Why can't we just use hardware accelerators everywhere for computing, and use RDMA for all our data movements, and avoid "copying"?  We'll focus on the RDMA version of this question.  First we should understand why copying is such a costly operation, and why zero-copy nonetheless remains a holy grail.  Then we'll look at the challenges of introducing RDMA into big data centers (and how to view those challenges as a "warning" for other future accelerators that people may want to deploy at scale!)

Slides: pptx pdf                           recorded version (Cornell only)
 
27. 5/7 [Programmability of the Network]

We learned that RDMA NICs are basically small CPUs.  In this lecture we'll learn more about the huge effort to make networks more customizable and to even download code that will run "inside" the network, on programmable routers and NICs.  This is basically in addition to RDMA -- you probably would also have RDMA in a system of this kind.

Slides: pptx pdf                          recorded version (Cornell only)
The most obvious readings here relate to the P4 language that Cornell's very own Nate Foster has been using as a target for his PL research!  Here's a Wikipedia link.
28. 5/12 [The Future of the Cloud]

If the cloud is ultimately shaped by the flow of money, what can we learn about how the market for large-scale computing is evolving,  and what does this tell us about the future of the cloud?  The class will be as data-driven as possible (I plan to hunt for public materials about the evolution of the cloud business model and market, and we'll see what insights we can glean from the charts and predictions).

Slides: pptx pdf                           recorded version (Cornell only)
We voted that the final will be optional... if anyone wishes to take a final, let Ken know and he'll work out a day and time for you.