Map-reduce is a programming model that has its roots in functional programming. In addition to often producing short, elegant code for problems involving lists or collections, this model has proven very useful for large-scale highly parallel data processing. Here we will think of map and reduce as operating on lists for concreteness, but they are appropriate to any collection (sets, etc.). Map operates on a list of values in order to produce a new list of values, by applying the same computation to each value. Reduce operates on a list of values to collapse or combine those values into a single value (or more generally some number of values), again by applying the same computation to each value.
For large datasets, it has proven particularly valuable to think about processing in terms of the same operation being applied independently to each item in the dataset, either to produce a new dataset or to produce a summary result for the entire dataset. This way of thinking often works well on parallel hardware, where each of many processors can handle one or more data items at the same time. The Connection Machine, a massively parallel computer developed in the late 1980s, made heavy use of this programming paradigm. More recently, Google, with its large server farms, has made very effective use of it. Two Google Fellows, Dean and Ghemawat, have reported some of these uses in a 2004 paper at OSDI (the Symposium on Operating Systems Design and Implementation) and a 2008 paper in CACM (Communications of the ACM, the monthly magazine of the main computer science professional society). Much of the focus in those papers is on separating the fault tolerance and distributed processing issues of operating on large clusters of machines from the programming language abstractions. At Google they have found map-reduce to be a highly useful way of separating out the conceptual model of large-scale computations on big collections of data, using mapping and reducing operations, from the issue of how that computation is implemented in a reliable manner on a large cluster of machines.
In the introduction of their 2008 paper, Dean and Ghemawat write "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key in order to combine the derived data appropriately." In the paper they discuss a number of applications that are simplified by this functional programming view of massive parallelism. To give a sense of the scale of the processing done by these programs, they note that over ten thousand programs using map-reduce have been implemented at Google since 2004, and that in September 2007 over 130,000 machine-months of processing time at Google were used by map-reduce, processing over 450PB (450,000 TB) of data.
For our purposes here in a programming course, it is illustrative to see what kinds of problems Google found useful to express in the map-reduce paradigm. Counting the number of occurrences of each word in a large collection of documents is a central computational issue for indexing large document collections. This can be expressed as mapping a function that returns the count of a given word in each document across a document collection. Then the result is reduced by summing all the counts together. So if we have a list of strings, the map returns a list of integers with the count for each string. The reduce then sums up those integers. Many other counting problems can be implemented similarly, for instance counting the number of occurrences of some pattern in log files, such as the number of occurrences of a given user query or a particular URL. Again there are many log files on different hosts; together they can be viewed as a large collection of strings, with the same map and reduce operations as for document word counts.
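The word-count description above can be sketched on a single machine with the built-in List.map and List.fold_left. This is only an illustration, not how a distributed implementation works; the helper names count_word and total_count are hypothetical, and documents are assumed to be plain strings whose words are separated by single spaces.

```ocaml
(* Hypothetical helper: count occurrences of [word] among the
   space-separated tokens of one document. *)
let count_word (word : string) (doc : string) : int =
  String.split_on_char ' ' doc
  |> List.filter (fun w -> w = word)
  |> List.length

(* Map: one count per document. Reduce: sum the counts. *)
let total_count (word : string) (docs : string list) : int =
  let counts = List.map (count_word word) docs in
  List.fold_left (fun acc n -> acc + n) 0 counts

let docs = ["the cat sat"; "the dog and the bird"; "no match here"]

(* The per-document counts for "the" are [1; 2; 0], so the sum is 3. *)
let total = total_count "the" docs
```

Because each call to count_word depends only on its own document, the map step could in principle be run on many documents at once; only the final sum needs to see all of the counts.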
Reversing the links of the web graph is another problem that can be viewed this way. The web is a set of out-links, from a given page to the pages that it links to. A map function can output target-source pairs for each source page, and a reduce function can collapse these into a list of source pages corresponding to each target page (i.e., links in to pages).
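As a rough sketch of the link-reversal idea, assume the web graph is given as an association list from each source page to the pages it links to. The names out_links, emit, and add_pair are hypothetical, and pages are represented simply as strings.

```ocaml
(* Each entry pairs a source page with the pages it links to. *)
let out_links = [("a", ["b"; "c"]); ("b", ["c"]); ("c", ["a"])]

(* Map step: emit (target, source) pairs for one source page. *)
let emit (source, targets) = List.map (fun t -> (t, source)) targets

(* Reduce step: add one (target, source) pair to the table of in-links. *)
let add_pair table (target, source) =
  let sources =
    (match List.assoc_opt target table with
     | Some s -> s
     | None -> [])
  in
  (target, source :: sources) :: List.remove_assoc target table

(* All pairs from all pages, collapsed into target -> sources. *)
let in_links =
  List.fold_left add_pair [] (List.concat (List.map emit out_links))
```

Here List.assoc "c" in_links contains both "a" and "b", the two pages that link to "c". An inverted index can be built with the same pattern by emitting (term, document ID) pairs instead.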
An inverted index is a map from a term to all the documents containing that term. It is an important part of a search engine, as the engine must be able to quickly map a term to relevant pages. In this case a map function returns pairs of terms and document identifiers for each document. The reduce collapses the result into the list of document IDs for a given term.
In the Google papers they report that re-implementing the production indexing system using map-reduce resulted in code that was simpler, smaller, and easier to understand and modify, and in a service that was easier to operate (failure diagnosis, recovery, etc.), yet the approach results in fast enough code to be used for a key part of the service.
First let's look in more detail at the map operation. Map applies a
specified function f to each element of a list to produce a resulting
list. That is, each element of the result is obtained by applying f
to the corresponding element of the input list. We will consider the
curried form of map. In OCaml the built-in function List.map
produces the following value for a list of three elements:
map f [a; b; c] = [f a; f b; f c]
Recall that in the last lecture we introduced the list_ type:
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)
The map operation for this type can be written as:
let rec map (f: 'a -> 'b) (x: 'a list_): 'b list_ =
  match x with
  | Nil_ -> Nil_
  | Cons_(h,t) -> Cons_(f h, map f t)
Note the type signature of map, which is
('a -> 'b) -> 'a list_ -> 'b list_
The parameter f is a function from the element type of the input list, 'a, to the element type of the output list, 'b.
Using map we can define a function to make a copy of a
list_ (using an anonymous function):
let copy l = map (fun x -> x) l
Similarly we can create a
string list_ from an int list_:
map string_of_int (Cons_(1,Cons_(2,Cons_(3,Nil_))))
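To make these uses of map concrete, here is a small self-contained sketch that repeats the list_ type and map from above and applies them to a three-element list. The names ints, strings, and doubled are chosen only for this illustration.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec map (f : 'a -> 'b) (x : 'a list_) : 'b list_ =
  match x with
  | Nil_ -> Nil_
  | Cons_ (h, t) -> Cons_ (f h, map f t)

let ints = Cons_ (1, Cons_ (2, Cons_ (3, Nil_)))

(* int list_ to string list_: Cons_ ("1", Cons_ ("2", Cons_ ("3", Nil_))) *)
let strings = map string_of_int ints

(* int list_ to int list_: Cons_ (2, Cons_ (4, Cons_ (6, Nil_))) *)
let doubled = map (fun x -> 2 * x) ints
```

Note that the argument list must be parenthesized when it is written inline, since function application binds more tightly than the constructor application.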
Now let's consider the reduce operation, which like map applies a function to every element of a list, but in doing so accumulates a result rather than just producing another list. Thus in comparison with map, the reduce operator takes an additional argument, an accumulator. As with map, we will consider the curried form of reduce.
There are two versions of reduce, based on the nesting of the
applications of the function f in creating the resulting value. In
OCaml the built-in reduce functions that operate
on lists are called List.fold_right and
List.fold_left. These functions produce
the following values:
fold_right f [a; b; c] r = f a (f b (f c r))
fold_left f r [a; b; c] = f (f (f r a) b) c
From the forms of the two results it can be seen why the functions
are called fold_right, which uses a right-parenthesization
of the applications of f, and fold_left,
which uses a left-parenthesization of the applications
of f. Note that the formal parameters of the two
functions are in different orders: in fold_right the accumulator is to
the right of the list and in fold_left the accumulator is to the left
of the list.
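The difference in parenthesization only shows up when f is not associative. As a small illustration using the built-in List functions, folding subtraction over the same list gives different answers in the two directions. The names right and left are chosen just for this sketch.

```ocaml
(* fold_right nests to the right: 1 - (2 - (3 - 0)) = 2 *)
let right = List.fold_right (fun x acc -> x - acc) [1; 2; 3] 0

(* fold_left nests to the left: ((0 - 1) - 2) - 3 = -6 *)
let left = List.fold_left (fun acc x -> acc - x) 0 [1; 2; 3]
```

With an associative and commutative operation such as addition, both folds produce the same value.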
Again using the
list_ type we can define these two
functions as follows:
let rec fold_right (f: 'a -> 'b -> 'b) (lst: 'a list_) (r: 'b): 'b =
  match lst with
  | Nil_ -> r
  | Cons_(hd,tl) -> f hd (fold_right f tl r)

let rec fold_left (f: 'a -> 'b -> 'a) (r: 'a) (lst: 'b list_): 'a =
  match lst with
  | Nil_ -> r
  | Cons_(hd,tl) -> fold_left f (f r hd) tl
Note the type signature of fold_right, which is
('a -> 'b -> 'b) -> 'a list_ -> 'b -> 'b
The parameter f is a function from the element type of the input list, 'a, and the type of the accumulator, 'b, to the type of the accumulator. The type
signature is analogous for fold_left, except that the order of
the parameters to both f and to fold_left
itself is reversed compared with fold_right.
Given these definitions, operations such as summing all of the
elements of a list of integers il can naturally be defined using either fold:
fold_right (fun x y -> x+y) il 0
fold_left (fun x y -> x+y) 0 il
Folding is a very powerful operation. We can write many other list functions in terms of fold.
For example, map, while it initially sounded quite different from fold,
can naturally be defined using
fold_right, by accumulating a result that is a list. Continuing with our list_ type:
let mapp f l = (fold_right (fun x y -> Cons_((f x),y)) l Nil_)
The accumulator function simply applies
f to each element and
builds up the resulting list, starting from the empty list.
The entire map-reduce paradigm can thus actually be implemented using
fold_right. However, it is often conceptually
useful to think of map as producing a list and of reduce as producing a value.
What about using fold_left instead to implement map? In this case we get a function that not only
does a map but also produces an output that is in reverse order of the
input list. Note that because fold_left takes its arguments in a
different order than fold_right (the order of the list
and accumulator are swapped), it also requires a function
f that takes its arguments in the opposite order
from the f used in fold_right:
let maprev f l = fold_left (fun x y -> Cons_((f y),x)) Nil_ l
This resulting function can also be quite useful, particularly as it is tail recursive.
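One way to use maprev, sketched here under the assumption that we want an order-preserving map whose recursion is entirely in tail position, is to reverse its output with a second fold_left pass. The names rev and map_tr are hypothetical helpers, not part of the definitions above.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

(* Tail recursive: the recursive call is the last thing evaluated. *)
let rec fold_left f r lst =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> fold_left f (f r hd) tl

(* Maps f over the list, producing the results in reverse order. *)
let maprev f l = fold_left (fun acc x -> Cons_ (f x, acc)) Nil_ l

(* Hypothetical helper: reversing is maprev with the identity function. *)
let rev l = maprev (fun x -> x) l

(* Order-preserving map built from two tail-recursive passes. *)
let map_tr f l = rev (maprev f l)

let ints = Cons_ (1, Cons_ (2, Cons_ (3, Nil_)))
let reversed_strings = maprev string_of_int ints
let squares = map_tr (fun x -> x * x) ints
```

Here reversed_strings holds "3", "2", "1" in that order, while squares holds 1, 4, 9. The cost of map_tr is a second traversal of the list, in exchange for stack usage that does not grow with the list length.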
Another useful variation on mapping is filtering, which selects a subset of a list according to some Boolean criterion:
let filter f l = (fold_right (fun x y -> if (f x) then Cons_(x,y) else y) l Nil_)
Here f is the predicate for determining membership in the resulting list; it takes just one argument,
a list element, and returns a Boolean. Now we can easily filter a list of
integers for the even ones:
filter (fun x -> (x / 2)*2 = x) (Cons_(1,Cons_(2,Cons_(3,Nil_))))
Note that if we define a function that filters for even elements of a list:
let evens l = filter (fun x -> (x / 2)*2 = x) l;;
then the type of the parameter and result are restricted to
int list_ rather than the more general
'a list_ of the underlying
filter, because the
anonymous function takes an integer parameter (and returns a Boolean).
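The following self-contained sketch shows the same polymorphic filter specialized in two different ways by the predicate it is given. It uses x mod 2 = 0 as an equivalent evenness test, and nonempty is a hypothetical second example, not part of the notes above.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec fold_right f lst r =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> f hd (fold_right f tl r)

(* filter : ('a -> bool) -> 'a list_ -> 'a list_ *)
let filter f l =
  fold_right (fun x acc -> if f x then Cons_ (x, acc) else acc) l Nil_

(* evens : int list_ -> int list_, because the predicate uses mod on x *)
let evens l = filter (fun x -> x mod 2 = 0) l

(* nonempty : string list_ -> string list_, because s is compared to "" *)
let nonempty l = filter (fun s -> s <> "") l
```

In each case the element type is fixed by how the predicate uses its argument, while filter itself remains usable at any element type.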
Determining the length of a list is another operation that can easily be defined in terms of folding.
let length l = fold_left (fun x _ -> 1 + x) 0 l
What about using
fold_right for this?
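One possible answer, offered as a sketch rather than the only formulation: with fold_right the folded function receives the element first and the accumulated count second, so the ignored argument moves to the first position. The name length_r is hypothetical.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec fold_right f lst r =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> f hd (fold_right f tl r)

let rec fold_left f r lst =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> fold_left f (f r hd) tl

(* The version from the text: the count is the first argument. *)
let length l = fold_left (fun n _ -> 1 + n) 0 l

(* The fold_right version: the count is the second argument. *)
let length_r l = fold_right (fun _ n -> 1 + n) l 0
```

Both compute the same result, but this fold_right is not tail recursive, so length_r uses stack space proportional to the length of the list.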
You should try writing some functions using
fold_right. These primitives can be incredibly useful.
Even in languages that do not have these operations built-in, they are useful
ways of thinking about structuring many computations.