Kenneth P. Birman
N. Rama Rao Professor of Computer Science
435 Gates Hall, Cornell University
Ithaca, New York 14853
W: 607-255-9199; M: 607-227-0894; F: 607-255-9143
CV: Jan 2016
During fall 2016 and spring 2017 I won't be teaching (I plan to devote
my full time to my research projects).
Current Research (full
publications list). At the end of this subsection is a list of links
to recent recorded videos.
The Derecho project looks at ways of leveraging remote DMA (RDMA) technologies to
move large data objects at wire speeds.
The download site is derecho.codeplex.com.
Derecho is the world's fastest Paxos library: it outperforms Corfu (the prior
record-setting system) and also scales to very large numbers of members.
It also has an in-memory multicast mode that runs several orders of magnitude
faster than any previous atomic multicast, and is more than 15,000x faster than our
own Vsync library (embarrassing to admit, but the fact is that RDMA is just
surreal and will change our world, so if someone is going to eat my lunch,
it might as well be me!). A very cool feature is that Derecho
automatically manages subgroups of a group (you associate them with
classes in your code, so there might be a load-balancer subgroup, a cache
layer, and a back-end). Plus those can be sharded: the cache could be
1000 nodes in shards of 2 each, for example. We currently do this
as a one-level flat hierarchy but the protocols can actually support a
kind of recursive application of the same ideas. Moreover, Derecho
is highly efficient: if an application as a whole has 1025 members in the
top-level group but most of the traffic involves multicasts in little
shards of 2 or 3 members each, Derecho offers strong consistency for those
multicasts (atomic, or Paxos) and yet only the participants in a
particular exchange of bytes see those bytes or incur any load at
all. This is unlike prior multicast systems where subgroups involved
sending to everyone and then discarding unwanted data.
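Derecho's actual subgroup API isn't reproduced on this page, but the layout idea can be sketched in a few lines of Python (all names here are invented for illustration, not Derecho's real interface): a flat member list is carved into role subgroups, each role is split into fixed-size shards, and only a shard's own members ever see that shard's traffic.

```python
# Hypothetical sketch (NOT Derecho's real API): carve a flat top-level
# group into role-based subgroups, then split each role into consecutive
# fixed-size shards. A multicast in one shard would only touch (or load)
# that shard's members.

def layout(members, roles):
    """members: list of node ids; roles: list of (name, count, shard_size)."""
    plan, cursor = {}, 0
    for name, count, shard_size in roles:
        nodes = members[cursor:cursor + count]
        cursor += count
        # split this role's nodes into shards of shard_size
        plan[name] = [nodes[i:i + shard_size]
                      for i in range(0, len(nodes), shard_size)]
    return plan

# Example mirroring the text: a 1000-node cache layer in shards of 2,
# alongside a small load-balancer subgroup and a back-end pair.
members = list(range(1006))
plan = layout(members, [("load_balancer", 4, 4),
                        ("cache", 1000, 2),
                        ("back_end", 2, 2)])
```

In this toy model the cache role yields 500 two-node shards; the point, as in the real system, is that membership is computed once from the top-level group rather than by sending to everyone and discarding.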
Derecho also scales impressively. We've tested with up to 256
replicas. For smaller levels of replication, added replicas come
nearly for free: you can make one replica at 12.5GB/s, for example, or
you can make 4 with essentially identical delay. You'll see some
slight increase in delay and loss of throughput by 64 replicas, and
throughput is down by a factor of 3 or so in the 256 case, but we think
this loss could actually go away once we begin to use a new feature of
the verbs package called cross-channel operations.
Core papers: both are under submission, so
cite these as technical reports (preprints), dated Sept 21, 2016.
1) The Derecho subgroup and sharding model.
2) The Derecho Paxos and atomic multicast protocol.
Derecho is built from two components, one
to replicate data at RDMA speeds (RDMC) and one to build strong
Paxos-style protocols over the replication substrate (SST). Over
these components, we offer a very clean C++17 group communication
API. RDMC is our underlying data replication layer (accessible
stand-alone, but normally used through the Derecho API, which offers
RDMC as its "raw" mode). It can copy (replicate) huge blocks of memory (even
gigabytes) or files at very close to the full data rate of an optical
network. For example, on a 100Gbps Mellanox RDMA technology (more
than 3x faster than memory-to-memory copying with memcpy on a fast
server!) we can make 2 replicas of data in source memory at 112Gbps, and
in scalability experiments, we found that we could make 512 replicas for
about 3x or 4x the time needed to make just one, which is kind of crazy if
you think about it. With small events we are sending about 300,000
messages per second in groups of these sizes. It is almost embarrassing to point
out that these speeds are about 10,000x faster than Vsync or any version
of Paxos you've ever heard of.
The second component,
which we call SST, is used to transform RDMC into
a strong consistency model, much like the one used in Vsync: The
novelty here centers on a concept we are calling monotonic properties and
logic, which (as the name suggests) is all about things that stay true
once a program learns them (stable in a formal
sense), and ways to build up stronger protocols over these basic building blocks.
A talk on Derecho is here.
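The SST protocol details are in the papers, but the monotonicity idea itself is easy to illustrate with a toy model (invented names; the real SST is a small RDMA-shared table with one row per member). Because each member only ever increases its own counter, the minimum over all rows can only grow, so a fact like "min >= k" is stable: once a program learns it, it stays true forever.

```python
# Toy illustration of a monotonic predicate over an SST-like table
# (a sketch of the concept, not Derecho's actual data structure).

class ToySST:
    def __init__(self, members):
        # one row per member; each member owns and updates only its row
        self.received = {m: 0 for m in members}

    def advance(self, member, count):
        # monotone write: a member never moves its own counter backwards
        assert count >= self.received[member]
        self.received[member] = count

    def stable_frontier(self):
        # min over monotone rows is itself monotone nondecreasing, so
        # "all members have received messages below this index" is a
        # stable fact an in-order delivery protocol can safely act on
        return min(self.received.values())

sst = ToySST(["A", "B", "C"])
sst.advance("A", 3); sst.advance("B", 2); sst.advance("C", 2)
```

The frontier here is 2, and no later update by any member can ever pull it back below 2; stronger protocols are built by layering decisions on such never-retracted facts.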
Key people: The PhD students playing lead
roles on this work are Sagar Jha, Matt Milano and Edward Tremel.
Robbert van Renesse and I work closely with them. We are treating
Derecho as an open-source project and hoping to create a real community around
it that would include industry as well as other academic groups. Why
not join us in creating an RDMA revolution? The download site is
Derecho.codeplex.com. But I should warn that some parts of the code
base are still evolving as of fall 2016. We expect it to be stable
by the beginning of the winter.
The Freeze-Frame File System (FFFS). We also have a cool
new file system for real-time applications. It soaks up updates
from time-synchronized sensors or other sources and captures the data into
an in-memory data structure. Then you can read this data at any instant in time that you wish, with very good temporal precision and also with
logical consistency guarantees. We attached our new file system to
Spark and it actually outperforms the Berkeley Spark+HDFS setup, and we
think we can also support the widely popular Kafka API, which would make
the solution immediately useful to two large user communities. So
our vision is that perhaps you capture real-time data from some kind of
IoT setting, or from smart cars, or the smart grid. Then you flag
"events of interest" for instant analysis, and the analytic
program (running on Hadoop or on MPI) might want to visit snapshots of the
past -- maybe lots of them, at fine temporal resolution. We can
materialize that data instantly for you, on demand, replicate it onto a
cluster of nodes running Spark, and give all sorts of guarantees... a very
cool new option!
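FFFS's internals aren't detailed on this page, but the core idea of a temporal read can be sketched as a multi-version structure indexed by the synchronized sensor timestamps (a toy model with invented names, not the real FFFS code):

```python
import bisect

# Toy multi-version store: each update is retained with its sensor
# timestamp, and a temporal read at time t returns the latest value
# written at or before t -- so any past instant can be materialized
# on demand, as described for FFFS above.

class TemporalLog:
    def __init__(self):
        self.times, self.values = [], []

    def append(self, t, value):
        # updates arrive in timestamp order from time-synchronized sources
        assert not self.times or t >= self.times[-1]
        self.times.append(t)
        self.values.append(value)

    def read_at(self, t):
        # binary search for the newest version with timestamp <= t
        i = bisect.bisect_right(self.times, t)
        return self.values[i - 1] if i else None

log = TemporalLog()
log.append(10.0, "v1"); log.append(12.5, "v2"); log.append(20.0, "v3")
```

A read at t=15.0 returns "v2": the snapshot is computed on demand rather than stored, which is what makes fine temporal resolution over many past instants cheap. (The real system adds logical-consistency guarantees across files that this single-file sketch omits.)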
The two lead authors on this work are
Weijia Song and Theo Gkountouvas. Weijia will present our first
paper on FFFS at the ACM Symposium on Cloud Computing (SOCC) 2016.
Reliable distributed computing and Vsync. My longer-term research (talking now about the past 25 years!) is
mostly concerned with what people are calling "Cloud Computing."
Of course, Cloud Computing is just the buzzword of the day: Back before
these terms gained wide use, we would have said that I work on reliable
distributed computing, applications involving reliable information
collection or dissemination, and problems associated with security in
complex distributed systems. Over the years one of my main activities has
been to create software libraries for people who want to build this kind
of system; I've always been a bit of a hacker, and I enjoy writing code.
Until we created Derecho, my most recent such
library was the one now called Vsync, and I've recorded a series of
detailed videos explaining precisely how to use it. You'll find links to
the videos (and even suggestions for projects you could do) on vsync.codeplex.com and hopefully,
with this help, the library itself will be easy for you to install and start
using. I maintain it myself and welcome email with bug reports,
requests for help, suggestions, etc. Vsync includes simple ways to
replicate data and build fault-tolerant services, implementations of state
machine replication and virtual synchrony, a version of the famous Paxos
protocol, and more elaborate mechanisms like distributed locking and
key-value storage, real-time data aggregation and other features.
People work with it mostly from C#, C++/CLI (a version of C++ with some
extra .NET features to deal with managed memory) and IronPython (one of
the very popular versions of Python).
Even though Derecho has an API a lot like Vsync's, be aware that Vsync is
painfully slow compared to Derecho. In some sense, Vsync is a toy
system once you compare it with this sort of cutting-edge option.
- Connection to Paxos. These days there is intense interest in Paxos,
which makes it worth noting that there is a core protocol called gbcast (short for
"group atomic broadcast") in Vsync that dates back to 1983 and
was probably the first of the "Paxos-equivalent" protocols to
have become widely used -- Leslie Lamport's Paxos paper was first released
in TR form in 1990. Gbcast doesn't actually look like Paxos but in
fact it does bisimulate classic Paxos and can even be transformed into
Paxos (or Paxos into it) through a series of relatively simple
steps. (In this use, bisimulate means that every run Gbcast
could produce is a run Paxos could produce and vice versa). This
said, I don't think of Gbcast as the first Paxos protocol, mostly because
(1) Gbcast looks nothing like Paxos, (2) Tommy Joseph and I proved Gbcast
correct in a 1983 paper, but our proof wasn't the kind of proof by refinement
for which Leslie Lamport's work became so famous, and (3) The detailed
handling of durability, namely recovery from a complete outage, is
different from that used in Paxos: Paxos defines durability in terms of a
complete log that captures and saves protocol state, but in virtual
synchrony we focus more on application durability.
These days my view is that you really need
both. Virtual synchrony is the right model for partition-free
membership management in a group communication system (no "split
brain"), but the Paxos ordering and durability properties are
important too ("no rollback, state machine replication").
You simply want both. In virtual synchrony we struggled with speed
and ended up with an optimistic early delivery mode and a flush that you
have to call before taking actions with external consequences. This
is clumsy, and in Derecho we do away with it. But the Paxos log
makes little sense unless you actually want to use the log as the entire
store of application state. So you want an atomic multicast, and sometimes
you want true durable Paxos. To me this insight was a long time
arriving. I think Leslie Lamport actually agrees with me on all of
these points, by the way: they aren't massive topics for heated debate.
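The optimistic-delivery-plus-flush pattern criticized above can be made concrete with a toy sketch (invented names, not the Vsync API): messages are handed to the application as soon as they arrive, but before taking any externally visible action the application must wait until everything delivered so far has become stable.

```python
# Toy sketch of virtual synchrony's optimistic early delivery + flush.
# deliver() hands a message to the application immediately (fast, but
# the message may not yet be stable at all members); flush() reports
# whether it would be safe to take actions with external consequences,
# i.e. whether every optimistically delivered message is now stable.

class OptimisticChannel:
    def __init__(self):
        self.delivered = []   # delivered optimistically, maybe unstable
        self.stable = set()

    def deliver(self, msg):
        self.delivered.append(msg)

    def mark_stable(self, msg):
        # in a real system this happens once acks arrive from all members
        self.stable.add(msg)

    def flush(self):
        # a real flush would block here instead of returning a boolean
        return all(m in self.stable for m in self.delivered)

ch = OptimisticChannel()
ch.deliver("m1"); ch.deliver("m2")
ch.mark_stable("m1")
```

At this point flush cannot yet complete, because "m2" was delivered but is not stable; the application must not externalize its effects. This is exactly the clumsiness the text describes, and the reason Derecho drops the separate flush step.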
Today there are many protocols that are
equivalent to Paxos in this same bisimulation sense. Leslie Lamport
likes to call our version "virtually synchronous Paxos," and
while at first I found that a bit odd, now I like it more and more.
Of course there is a mind-boggling causality issue raised, but the
universe doesn't seem to have ended or anything like that. Calling
the gbcast protocol an early Paxos implementation might help people
appreciate that virtual synchrony isn't somehow in opposition to Paxos.
Using all of this technology for IoT applications,
notably the Smart Electric Power Grid. Under a new grant from the Department of Energy GENI program and with
further support from NSF and DARPA, we’re applying our insights concerning high
assurance cloud computing and Vsync to the challenge of controlling the smart
power grid. We're currently working to transition the GridCloud system
(we built it using Vsync but also incorporating best-of-breed electric power
technologies from our colleagues at Washington State University in Pullman) for
experimental use by the New England and New York ISOs and NYPA, a major
TO. Beyond the smart grid, we believe that this style of embedded
cloud-hosted real-time system could have many other applications. Some of
the more exciting components of GridCloud include CloudMake, a new management
tool, and the forensic file system mentioned above, for capturing real-time
data streams. Theo Gkountouvas leads on CloudMake, and he and Weijia Song
have collaborated on the file system. Z Teo's work on IronStack is also
very promising: He has a way to use SDN network routers in very demanding
settings like weather-challenged environments. And Edward Tremel is
looking at options for privacy-preserving data mining in large deployments.
- A talk I
gave at MesosCon on our new RDMA research (Derecho and the Freeze Frame
File System) here.
- SOSP '15 History Day talk
on fault-tolerance and consistency, the CATOCS controversy, and the
modern-day CAP conjecture. My video is here and an
accompanying essay is here.
- A video of Robbert van Renesse and me discussing how we got into this area of research: here.
- Videos about learning to use the Vsync library (previously known as
Isis2): on the documentation page here or on
YouTube (tagged "Vsync"). I'm planning to revise these
into a series of short 10-minute segments. We also plan to do a set
of short segments on using Derecho.
Earlier work. I've really worked in Cloud Computing for most of my career, although it
obviously wasn't called cloud computing in the early days. As a result, our
papers in this area date back to 1985. Some examples of mission-critical
systems on which my software was used in the past include the New York Stock
Exchange and Swiss Exchange, the French Air Traffic Control system, the AEGIS
warship and a wide range of applications in settings like factory process
control and telephony. In fact, every stock quote or trade on the NYSE from
1995 until early 2006 was reported to the overhead trading consoles through
software I personally implemented - a cool (but also scary) image, for me at
least! During the ten years this system was running, many computers crashed
during the trading day, and many network problems occurred - but the
design we developed and implemented reconfigured itself
automatically and kept the overall system up, without exception. They didn't
have a single trading disruption during the entire period. As far as I know,
the other organizations listed above have similar stories to report.
Today, these kinds of ideas are gaining
"mainstream" status. For example, IBM's Websphere 6.0 product
includes a multicast layer used to replicate data and other runtime state for
high-availability web service applications and web sites. Although IBM developed
its own implementation of this technology, we've been told by the developers
that the architecture was based on Cornell's Horus and Ensemble systems,
described more fully below. The CORBA architecture includes a fault-tolerance
mechanism based on some of the same ideas. And we've also worked with Microsoft
on the technology at the core of the next generation of that company's
clustering product. So, you'll find Cornell's research not just on these web
pages, but also on web sites worldwide and in some of the world's most
ambitious data centers and high availability computing systems.
In fact we still have very active dialogs
with many of these companies: Cisco, IBM, Intel, Microsoft, Amazon, and others.
An example of an ongoing dialog is this: we recently worked with Cisco to
invent a new continuous availability option for their core Internet routers,
the CRS-1 series. You can read about this work here.
My group often works with vendors and
industry researchers. We maintain a very active dialog with the US government
and military on research challenges emerging from future-generation
communication systems now being planned by organizations like the Air Force and
the Navy. We've even worked on new ways of controlling the electric power grid,
but not in time to head off the big blackout in 2003! Looking to the future, we
are focused on needs arising in financial systems, large-scale military
systems, and even health-care networks. (In this connection, I should perhaps
mention that although we do get research support from the government and the US
military, none of our research is classified or even sensitive, and all of it
focuses on widely used commercial standards and platforms. Most of our software
is released for free, under open source licenses.)
I'm just one of several members of a group
in this area at Cornell. My closest colleagues and co-leaders of the group are
Robbert van Renesse and Hakim Weatherspoon. We also collaborate with Gun Sirer,
Paul Francis, Al Demers and Johannes Gehrke, as well as with other systems
faculty members at Cornell: Andrew, Fred, Rafael, Joe, etc. The systems group
is close-knit, and many of our students are jointly advised by other faculty
members in the systems area. Werner Vogels worked with us until September 2004,
when he joined Amazon.com as Vice President and Director for Systems Research.
Four generations of reliable distributed systems research! Overall, our group has developed three generations of technology and is
now working on a fourth: the Isis Toolkit, developed mostly
during 1987-1993; the Horus system, developed from 1990 until around
1995; and the Ensemble system, 1995-1999. Right now we're developing a number of
new systems, including Isis2, Gradient, and the reliable TCP solution
mentioned above, and working with others to integrate those solutions into
settings where reliability, security, consistency and scalability are critical.
Older Research web pages:
Live Objects, Quicksilver, Maelstrom, Ricochet and Tempest projects
Isis Toolkit (really
old stuff! This is from the very first version of Isis).
A collection of papers on Isis that Robbert van Renesse and I edited may
still be available -- it was called Reliable
Distributed Computing with the Isis Toolkit and was in the IEEE Press
Computer Science series.
Studies in Computer Science at Cornell: At this time of the year, we get large numbers of inquiries about our
PhD program. I want to recommend that people interested in the program not contact
faculty members like me directly with routine questions like "can your
research group fund me". As you'll see from the web page, Cornell does
admissions by means of a committee, so individual faculty members don't
normally play a role. This is different from many other schools -- I realize
that at many places, each faculty member admits people into her/his own group.
But at Cornell, we admit you first, then you come here, and then you affiliate
with a research group after a while. Funding is absolutely guaranteed for
people in the MS/PhD program during the whole time they are at Cornell. On the
other hand, students in the MEng program generally need to pay their own way.
Obviously, some people have more direct,
specific questions, and there is no problem sending those to me or to anyone
else. But as for the generic "can I join your research group?" the
answer is that while I welcome people into the group if they demonstrate good
ideas and talent in my area, until you are here and take my graduate course and
spend time talking with me and my colleagues, how can we know if the match is
good? And most such inquiries are from people who haven't yet figured out quite
how many good projects are underway at Cornell. Perhaps, on arrival, you'll
take Andrew Myers's course in language-based security and will realize this is
your passion. So at Cornell, we urge you to take time to find out what areas we
cover and who is here, to take some courses, and only then affiliate with a
research group. But please knock on my door any time you like! I'm more than
happy to talk to any student in the department about anything we're doing here!