Map-reduce is a programming model that has its roots in functional programming. In addition to often producing short, elegant code for problems involving lists or collections, this model has proven very useful for large-scale highly parallel data processing. Here we will think of map and reduce as operating on lists for concreteness, but they are appropriate to any collection (sets, etc.). Map operates on a list of values in order to produce a new list of values, by applying the same computation to each value. Reduce operates on a list of values to collapse or combine those values into a single value (or some number of values), again by applying the same computation to each value.
For large datasets, it has proven particularly valuable to think about processing in terms of the same operation being applied independently to each item in the dataset, either to produce a new dataset or to produce a summary result for the entire dataset. This way of thinking often works well on parallel hardware, where each of many processors can handle one or more data items at the same time. The Connection Machine, a massively parallel computer developed in the late 1980s, made heavy use of this approach.
More recently, Google, with their large server farms, have made very effective use of the map-reduce paradigm. Two Google fellows, Dean and Ghemawat, have reported some of these uses in a 2004 paper at OSDI (the Symposium on Operating Systems Design and Implementation) and a 2008 paper in CACM (Communications of the ACM, the monthly magazine of the main Computer Science professional society). Much of the focus in those papers is on separating the fault tolerance and distributed processing issues of operating on large clusters of machines from the programming language abstractions. They found map-reduce to be a successful way of separating the conceptual model of large-scale computations on big collections of data, expressed with mapping and reducing operations, from the issue of how that computation is implemented reliably on a large cluster of machines.
In the introduction of their 2008 paper, Dean and Ghemawat write, "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key in order to combine the derived data appropriately." In the paper they discuss a number of applications that are simplified by this view of massive parallelism. To give a sense of the scale of the processing done by these programs, they note that over ten thousand programs using map-reduce have been implemented at Google since 2004, and that in September 2007 over 130,000 machine-months of processing time at Google were used by map-reduce, processing over 450 PB (450,000 TB) of data.
For our purposes here in this programming course, it is illustrative to see what kinds of problems Google found useful to express in the map-reduce paradigm. Counting the number of occurrences of each word in a large collection of documents is a central computational issue for indexing large document collections. This can be expressed as mapping a function that returns the count of a given word in each document across a document collection. The result is then reduced by summing the counts. So if we have a list of strings, the map returns a list of integers with the count for each string, and the reduce then sums up those integers. Many other counting problems can be implemented similarly, for instance counting the number of occurrences of some pattern in log files, such as the number of occurrences of a given user query or a particular URL. There may be many log files on different hosts; together these can be viewed as a large collection of strings, with the same map and reduce operations as for document word counts.
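As a small illustration of the word-count idea, here is a minimal sketch using the standard library's List.map and List.fold_left on ordinary OCaml lists. The names count_in_doc, count_word, and docs are made up for this example, and splitting documents on single spaces is a simplifying assumption, not how a real indexer tokenizes text.

```ocaml
(* Map step: count how often [word] occurs in one document.
   Assumption: words are separated by single spaces. *)
let count_in_doc (word : string) (doc : string) : int =
  String.split_on_char ' ' doc
  |> List.filter (fun w -> w = word)
  |> List.length

(* Map over the collection, then reduce the per-document counts by summing. *)
let count_word (word : string) (docs : string list) : int =
  List.map (count_in_doc word) docs
  |> List.fold_left (+) 0

(* A tiny, hypothetical document collection. *)
let docs = ["the cat sat"; "the dog and the bird"; "no match here"]

let per_doc = List.map (count_in_doc "the") docs   (* [1; 2; 0] *)
let total = count_word "the" docs                  (* 3 *)
```

Each call to count_in_doc is independent of the others, which is what makes the map step easy to spread across many machines.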
Reversing the links of the web graph is another problem that can be viewed this way. The web is a set of out-links, from a given page to the pages that it links to. A map function can output target-source pairs for each source page, and a reduce function can collapse these into a list of source pages corresponding to each target page (i.e., links in to pages).
An inverted index is a map from a term to all the documents containing that term. It is an important part of a search engine, as the engine must be able to quickly map a term to relevant pages. In this case a map function returns pairs of terms and document identifiers for each document. The reduce collapses the result into the list of document IDs for a given term.
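The inverted index can be sketched in the same style. This is only an illustration on in-memory lists: emit_pairs, add_pair, and inverted_index are hypothetical names, documents are assumed to be (id, text) pairs, and words are assumed to be separated by single spaces.

```ocaml
(* Map step: emit a (term, doc_id) pair for every word of a document.
   Assumption: words are separated by single spaces. *)
let emit_pairs ((doc_id, text) : int * string) : (string * int) list =
  String.split_on_char ' ' text
  |> List.map (fun term -> (term, doc_id))

(* Reduce step: merge one pair into an association list keyed by term. *)
let add_pair (index : (string * int list) list)
             ((term, doc_id) : string * int) : (string * int list) list =
  let ids = try List.assoc term index with Not_found -> [] in
  if List.mem doc_id ids then index
  else (term, doc_id :: ids) :: List.remove_assoc term index

(* Map every document to its pairs, flatten, then fold the pairs together. *)
let inverted_index (docs : (int * string) list) : (string * int list) list =
  List.concat (List.map emit_pairs docs)
  |> List.fold_left add_pair []

let index = inverted_index [(1, "red fish"); (2, "blue fish")]
let fish_docs = List.sort compare (List.assoc "fish" index)   (* [1; 2] *)
```

An association list keeps the sketch short; a hash table or balanced map would be the more usual choice for a large vocabulary.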
In the Google papers, they report that re-implementing the production indexing system resulted in code that was simpler, smaller, and easier to understand and modify. This in turn led to a service that was easier to operate, because failure diagnosis and recovery were correspondingly easier. Nevertheless, the approach resulted in fast enough code that there was no degradation in performance.
First let's look in more detail at the map operation. This operation applies a
specified function f to each element of a list to produce a resulting
list. Each element of the resulting list is obtained by applying f
to the corresponding element of the input list. The OCaml library function
List.map is curried to take the function first and the list second.
It produces the following value for a list of three elements:
List.map f [a; b; c] = [f a; f b; f c]
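For a concrete instance of this equation, the following short sketch applies the standard List.map to ordinary OCaml lists. The names squares, lengths, and double_all are chosen only for illustration.

```ocaml
(* Apply a function to every element; the result has the same length. *)
let squares = List.map (fun x -> x * x) [1; 2; 3]        (* [1; 4; 9] *)

(* The element type may change: strings in, integers out. *)
let lengths = List.map String.length ["map"; "reduce"]   (* [3; 6] *)

(* Because List.map is curried, supplying only the function
   yields a new function that still expects the list. *)
let double_all = List.map (fun x -> 2 * x)
let doubled = double_all [1; 2; 3]                       (* [2; 4; 6] *)
```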
Recall that in the last lecture we introduced our own polymorphic list type, list_:
type 'a list_ = Nil | Cons of ('a * 'a list_)
The map operation for this type can be written:
let rec map (f : 'a -> 'b) (x : 'a list_) : 'b list_ =
  match x with
  | Nil -> Nil
  | Cons (h, t) -> Cons (f h, map f t)
The type of map is
('a -> 'b) -> 'a list_ -> 'b list_
The parameter f is a function from the element type of
the input list, 'a, to the element type of the output
list, 'b.
Using map with an anonymous function, we can define a function to make a copy of a list:
let copy = map (fun x -> x)
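(This is the same as saying
let copy lst = map (fun x -> x) lst
but we don't really need to include the second argument
in the definition; the function is already well defined without it.
One caveat: because of OCaml's value restriction, the partially applied form
is given a weakly polymorphic type, printed by the toplevel as something like
'_weak1 list_ -> '_weak1 list_ and fixed to a single element type at its first use,
whereas the version with the explicit lst argument has the fully polymorphic
type 'a list_ -> 'a list_.)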
Similarly, we can create a
string list_ from an int list_:
# let string_list_of_int_list = map string_of_int;;
val string_list_of_int_list : int list_ -> string list_ = <fun>
# string_list_of_int_list (Cons (1, Cons (2, Cons (3, Nil))));;
- : string list_ = Cons ("1", Cons ("2", Cons ("3", Nil)))
Now let's consider the reduce operation, which like map applies a function to every element of a list, but in doing so accumulates a result rather than just producing another list. In comparison with map, the reduce operator takes an additional argument: the initial value of an accumulator. As with map, we will consider the curried form of reduce.
There are two versions of reduce, based on the nesting of the
applications of the function f in creating the resulting value. In
OCaml the built-in reduce functions that operate
on lists are called
List.fold_right and List.fold_left. These functions produce
the following values:
fold_right f [a; b; c] r = f a (f b (f c r))
fold_left f r [a; b; c] = f (f (f r a) b) c
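The difference in nesting is easiest to see with a function that is not associative. The sketch below uses the standard List.fold_right and List.fold_left on ordinary OCaml lists; the names right, left, show_right, and show_left are illustrative only.

```ocaml
(* Subtraction gives different answers under the two nestings. *)
let right = List.fold_right (-) [1; 2; 3] 0   (* 1 - (2 - (3 - 0)) = 2 *)
let left = List.fold_left (-) 0 [1; 2; 3]     (* ((0 - 1) - 2) - 3 = -6 *)

(* Building a string makes the parenthesization itself visible. *)
let show_right =
  List.fold_right (fun x acc -> "(" ^ x ^ " + " ^ acc ^ ")") ["a"; "b"; "c"] "r"
  (* "(a + (b + (c + r)))" *)

let show_left =
  List.fold_left (fun acc x -> "(" ^ acc ^ " + " ^ x ^ ")") "r" ["a"; "b"; "c"]
  (* "(((r + a) + b) + c)" *)
```

For an associative and commutative operation such as integer addition with 0 as the starting value, both folds give the same result.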
From the forms of the two results, it can be seen why the functions
are called fold_right, which uses a right-parenthesization
of the applications of f, and fold_left,
which uses a left-parenthesization. Note that the formal parameters of the two
functions are in different orders: in
fold_right the accumulator is to
the right of the list and in
fold_left the accumulator is to the left.
Again using the
list_ type, we can define these two
functions as follows:
let rec fold_right (f : 'a -> 'b -> 'b) (lst : 'a list_) (r : 'b) : 'b =
  match lst with
  | Nil -> r
  | Cons (hd, tl) -> f hd (fold_right f tl r)

let rec fold_left (f : 'a -> 'b -> 'a) (r : 'a) (lst : 'b list_) : 'a =
  match lst with
  | Nil -> r
  | Cons (hd, tl) -> fold_left f (f r hd) tl
The types of these two functions are, respectively:
('a -> 'b -> 'b) -> 'a list_ -> 'b -> 'b
('a -> 'b -> 'a) -> 'a -> 'b list_ -> 'a
In both functions, the parameter f takes an element of the input list and the
accumulator (in that order for fold_right, in the opposite order for fold_left)
and returns a new accumulator. The element type of the input list
and the type of the accumulator do not have to be the same.
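To illustrate that the element type and the accumulator type can differ, here is a brief sketch using the standard List.fold_left and List.fold_right on ordinary OCaml lists. The names total_chars and digits are chosen only for this example.

```ocaml
(* Elements are strings, the accumulator is an int. *)
let total_chars =
  List.fold_left (fun acc s -> acc + String.length s) 0 ["map"; "reduce"]
  (* 3 + 6 = 9 *)

(* Elements are ints, the accumulator is a string. *)
let digits =
  List.fold_right (fun n acc -> string_of_int n ^ acc) [1; 2; 3] ""
  (* "123" *)
```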
Given these definitions, operations such as summing all of the
elements of a list of integers can be defined naturally using
either fold_right or fold_left:
let sum_right_to_left il = fold_right (+) il 0
let sum_left_to_right = fold_left (+) 0
Here (+) is the same as
fun x y -> x + y.
Note that we don't need the
il in the second case because of the
ordering of the arguments in fold_left: the list comes last,
so it can be left off by partial application.
Folding is a very powerful operation. We can write many other list functions in terms of fold.
map, while it initially sounded quite different from fold,
can be defined naturally using
fold_right by accumulating a result that is a list.
Continuing with our list_ type:
let mapp f lst = fold_right (fun x y -> Cons (f x, y)) lst Nil
The accumulator function simply applies
f to each element and
builds up the resulting list, starting from the empty list.
The entire map-reduce paradigm can thus be implemented using
fold_right. However, it is often conceptually
useful to think of map as producing a list and of reduce as producing a value.
What about using
fold_left instead of fold_right to define
map? In this case we get a function that does the map,
but produces an output that is in reverse order, because
fold_left processes the elements of the input list
left-to-right, putting each new result at the front of the accumulated list,
whereas fold_right works right-to-left.
let maprev f = fold_left (fun x y -> Cons (f y, x)) Nil
(Again, we can leave off the second argument
because of the ordering of the arguments.)
This resulting function can also be quite useful,
particularly as it is tail recursive. If we want the
elements of the result list to be ordered in the same
direction as the input list, we can just reverse the list
when we are done.
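The reverse-at-the-end idea can be sketched as follows. The block repeats the list_ type and fold_left so that it stands alone; rev and map_tr are hypothetical helper names introduced for this example, not definitions from the lecture.

```ocaml
type 'a list_ = Nil | Cons of ('a * 'a list_)

(* Tail-recursive left fold over list_. *)
let rec fold_left f r lst =
  match lst with
  | Nil -> r
  | Cons (hd, tl) -> fold_left f (f r hd) tl

(* Map that produces its results in reverse order. *)
let maprev f = fold_left (fun acc x -> Cons (f x, acc)) Nil

(* Hypothetical helper: reverse a list_, also with fold_left. *)
let rev lst = fold_left (fun acc x -> Cons (x, acc)) Nil lst

(* Order-preserving map built only from tail-recursive pieces. *)
let map_tr f lst = rev (maprev f lst)

let input = Cons (1, Cons (2, Cons (3, Nil)))
let reversed = maprev (fun x -> x * 10) input   (* Cons (30, Cons (20, Cons (10, Nil))) *)
let in_order = map_tr (fun x -> x * 10) input   (* Cons (10, Cons (20, Cons (30, Nil))) *)
```

The cost is a second pass over the list, which is usually an acceptable price for constant stack usage on long inputs.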
Another useful variation on mapping is filtering, which selects a subset of a list according to some Boolean criterion.
let filter f lst = fold_right (fun x y -> if f x then Cons (x, y) else y) lst Nil
Here f is a test for membership in the output
list. It takes just one argument and returns a Boolean value.
For example, we can filter a list of
integers for the even ones:
let select_evens = filter (fun x -> (x / 2) * 2 = x)
Note that the types of the argument to
select_evens and of
its result are restricted to
int list_ rather than the more general
'a list_ of the underlying
filter because the
anonymous function takes an integer parameter and returns a Boolean.
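For experimenting with the same idea on OCaml's built-in lists, here is a minimal sketch that defines filter with the standard List.fold_right. It is an analogue of the list_ version, not a replacement for it, and it assumes x mod 2 = 0 as an equivalent evenness test.

```ocaml
(* Keep the elements for which the predicate p holds, preserving order. *)
let filter p lst =
  List.fold_right (fun x acc -> if p x then x :: acc else acc) lst []

(* Assumption: x mod 2 = 0 is used here as the evenness test. *)
let select_evens = filter (fun x -> x mod 2 = 0)

let evens = select_evens [1; 2; 3; 4; 5; 6]   (* [2; 4; 6] *)
```

Because the predicate mentions integers, select_evens has type int list -> int list even though filter itself is polymorphic.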
Determining the length of a list is another operation that can easily be defined in terms of folding.
let length = fold_left (fun x _ -> 1 + x) 0
What about using
fold_right for this?
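One possible answer, sketched on OCaml's built-in lists with the standard fold functions, is shown below. The names length_left and length_right are illustrative. The only change needed is the order of the arguments to the anonymous function and to the fold itself.

```ocaml
(* Left fold: the accumulator is the first argument of the function. *)
let length_left lst = List.fold_left (fun acc _ -> acc + 1) 0 lst

(* Right fold: the accumulator is the second argument of the function. *)
let length_right lst = List.fold_right (fun _ acc -> 1 + acc) lst 0

let n_left = length_left ["a"; "b"; "c"]     (* 3 *)
let n_right = length_right ["a"; "b"; "c"]   (* 3 *)
```

Both give the same count. A right fold written like the fold_right defined earlier is not tail recursive, so the left fold is the safer choice for very long lists.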
You should try writing some functions using
fold_right. These primitives can be incredibly useful.
Even in languages that do not have these operations built in, they are useful
ways of thinking about structuring many computations.