Map-reduce is a programming model that has its roots in functional programming. In addition to often producing short, elegant code for problems involving lists or collections, this model has proven very useful for large-scale highly parallel data processing. Here we will think of map and reduce as operating on lists for concreteness, but they are appropriate to any collection (sets, etc.). Map operates on a list of values in order to produce a new list of values, by applying the same computation to each value. Reduce operates on a list of values to collapse or combine those values into a single value (or more generally a smaller number of values), again by applying the same computation to each value.
For large datasets, it has proven particularly valuable to think about processing in terms of the same operation being applied to all the items in the dataset, either to produce a new dataset or to produce a summary result for the entire dataset. This way of thinking often maps well onto parallel hardware, where each processor can handle one or more data items in parallel. The Connection Machine, a massively parallel computer developed in the late 1980s, made heavy use of this programming paradigm. More recently, Google, with its large server farms, has made very effective use of it, and reported some of these uses in a 2004 paper at OSDI (the Operating Systems Design and Implementation conference). Much of the focus of that paper is on separating the fault tolerance and distributed processing issues of operating on large clusters of machines from the programming language abstractions. They found map-reduce to be a useful way of separating the conceptual model of mapping and reducing large collections or lists from the issue of how that computation is implemented reliably on a large cluster of machines.
For our purposes here in a programming course, it is illustrative to see what kinds of problems Google found useful to express in the map-reduce paradigm. Counting the number of occurrences of each word in a large collection of documents is a central computational issue for indexing large document collections. This can be expressed as mapping a function that returns the count of a given word in each document across a document collection; the result is then reduced by summing all the counts together. So if we have a list of strings, the map returns a list of integers with the count for each string, and the reduce sums up those integers. Many other counting problems can be implemented similarly, for instance counting the number of occurrences of some pattern in log files, such as a given user query or a particular URL. Again, since there are many log files on different hosts, this can be viewed as a large collection of strings, with the same map and reduce operations as for document word counts.
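As a small concrete sketch of this idea (in SML, the language we use below), word counting might look like the following. The helper countWord and the whitespace tokenization are illustrative assumptions, not Google's actual code.

```sml
(* Sketch: count total occurrences of a word across a collection of documents.
   Tokenizing documents on whitespace is a simplifying assumption. *)
fun countWord (w:string) (doc:string) : int =
  length (List.filter (fn tok => tok = w) (String.tokens Char.isSpace doc))

(* Map: one count per document.  Reduce: sum all the counts. *)
fun totalOccurrences (w:string) (docs:string list) : int =
  foldl op+ 0 (map (countWord w) docs)
```

The map step produces a list of per-document counts, and the reduce step collapses that list into a single total.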
Reversing the links of the web graph is another problem that can be viewed this way. The web is a set of out-links, from a given page to the pages that it links to. A map function can output target-source pairs for each source page, and a reduce function can collapse these into a list of source pages corresponding to each target page (i.e., links in to pages).
An inverted index is a map from a term to all the documents containing that term. It is an important part of a search engine, as the engine must be able to quickly map a term to relevant pages. In this case a map function returns pairs of terms and document identifiers for each document. The reduce collapses the result into the list of document ID's for a given term.
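A minimal sketch of the map phase for an inverted index, again in SML; the integer document IDs and whitespace tokenization are assumptions for illustration.

```sml
(* Map phase sketch: emit a (term, docID) pair for each term in a document.
   The reduce phase would then group these pairs by term, collapsing them
   into a list of document IDs per term. *)
fun emitPairs (docID:int, doc:string) : (string * int) list =
  map (fn term => (term, docID)) (String.tokens Char.isSpace doc)
```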
In the Google OSDI paper they report that re-implementing the production indexing system resulted in code that was simpler, smaller, and easier to understand and modify, and in a service that was easier to operate (i.e., failure diagnosis, recovery, etc.), yet the approach produces code fast enough to be used for a key part of the service.
Last time we looked at singly linked lists of integers, defining a datatype we called intlist. SML has built-in singly linked lists, which are of type list. A list must be composed of elements that are all the same type. That type is specified prior to the list type; for instance, the type int list specifies a list of integers. [] is the empty list, and for a nonempty list, h::tl is the first element h followed by the rest of the list tl. That is, :: is the operator that prepends an element to the front of a list.
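For example, assuming the following bindings are entered at the SML prompt:

```sml
val xs : int list = 1 :: 2 :: 3 :: []   (* the same list as [1,2,3] *)
val ys = 0 :: xs                        (* [0,1,2,3]; xs itself is unchanged *)
```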
First, let's look in more detail at the reduce operation. Suppose we want to write a function to sum a list of integers. By now you should be able to write the following code:
fun sumIntlist (s:int list):int =
  case s of
    [] => 0
  | h::t => h + (sumIntlist t)
Now say we want to concatenate a list of strings, again producing a single value from a list. We can write:
fun concatStringlist (s:string list):string =
  case s of
    [] => ""
  | h::t => h ^ (concatStringlist t)
These two functions look almost identical. With the exception of the different types and different operation (^ vs +), both functions are doing the same thing. In both cases, we walk down a list performing an operation that collapses the list into a single value by computing something based on the elements of the list. Since code reuse is a big help in reducing both bugs and coding time, we want to abstract this out.
As we consider the items of a list we can store a partial result in an accumulator. This accumulator changes for each item in the list. For example, if we want to walk across a list of integers and sum them, we could store the current sum in the accumulator. We start with the accumulator set to 0. As we come across each new element, we add the element to the accumulator. When we reach the end, we return the value stored in the accumulator.
Let's try to rewrite the sumIntlist function to introduce the idea of an accumulator.
fun sumIntlistAccum (accum:int, s:int list):int =
case s of
[] => accum
| h::t => sumIntlistAccum((accum+h),t)
Of course, to get the sum, we must call sumIntlistAccum with 0 for the initial accum value. Similarly, we can rewrite concatStringlist with this concept of the accumulator.
fun concatStringlistAccum (accum:string, s:string list):string =
case s of
[] => accum
| h::t => concatStringlistAccum((accum^h),t)
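For instance, calling these functions with the appropriate initial accumulator values:

```sml
val six = sumIntlistAccum (0, [1,2,3])               (* 6 *)
val abc = concatStringlistAccum ("", ["a","b","c"])  (* "abc" *)
```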
To use this function, we pass in the empty string for the initial
accumulator. Now we can see even more similarity between the two
functions.
We are now nearly in a position to eliminate any differences between the two, by passing in a function that acts on the head of the list and the accumulator. However, another difference between the two functions is that the types of their parameters differ. For the moment, let's write a function that captures the common pattern while ignoring the types of the variables:
fun accumulate (f, a, s) =
  case s of
    [] => a
  | h::t => accumulate (f, f(h,a), t)
Now we can rewrite sumIntlist and concatStringlist as
fun sumIntlist' (s:int list):int = accumulate ((fn (x,a) => a+x), 0, s)
fun concatStringlist' (s:string list):string = accumulate ((fn (x,a) => a^x), "", s)
Note that the function accumulate appears not to be type safe, in that it does not enforce agreement between the types of the parameters of the function f and the types of the accumulator a and the elements of the list s. Namely, the type of the first parameter of f needs to agree with that of the elements of s, and similarly for the second parameter and the accumulator a. However, the ML compiler has actually inferred parameterized types for the variables that enforce these constraints. That is, it has inferred the necessary type agreement without specifying what the specific types are. A natural way to specify agreement is to use variable names, thereby expressing that two things must be the same because they have the same variable name.
Next week we will talk about parameterized types in some detail, but today we will write functions without explicit types for the parameters, even though this is generally not good programming practice, and will let the compiler infer the types. In that way we can focus on issues regarding higher order functions rather than on types.
However, before doing so, it is worth considering the inferred type for accumulate, which is:
fn : ('a * 'b -> 'b) * 'b * 'a list -> 'b
A variable name that starts with a ' in ML denotes a parameterized type. We will consider parameterized types next week, but for now just look at the pattern. The first parameter of f is the same type, 'a, as the elements of the list s. Similarly, the second parameter is of the same type, 'b, as the value returned by f, the accumulator a, and the value returned by accumulate. However, the accumulator a need not be the same type as the elements of the list s (although in sumIntlist and concatStringlist these types are the same).
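To see that 'a and 'b really can differ, here is a use of accumulate where the list elements are strings but the accumulator is an integer (the particular function is just an illustrative choice):

```sml
(* 'a = string (the list elements), 'b = int (the accumulator) *)
val totalLen = accumulate ((fn (s,a) => a + String.size s), 0, ["ab","cde"])
(* totalLen = 5 *)
```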
There are two powerful built-in functions in SML for the reduce paradigm. One is foldl, which abstracts the pattern that we have captured in accumulate, considering the elements of a list in left-to-right order and building up a result in an accumulator. However, it is also curried. We saw currying in recitation: the arguments are applied in order, creating intermediate higher order functions, rather than all being evaluated before a single function call.
fun foldl' f a s =
  case s of
    [] => a
  | h::t => foldl' f (f (h,a)) t
It is worth spending a minute comparing this curried function to the uncurried accumulate. Note that the type of foldl' (and the built-in foldl function) is:
fn : ('a * 'b -> 'b) -> 'b -> 'a list -> 'b
Another way to think about this curried version of foldl is in terms of explicit currying.
fun curry3 f x y z = f(x,y,z)
fun foldl'' f a s = curry3 accumulate f a s
Finally, the case expression is not really needed in foldl; we can pattern match directly in the function clauses:
fun foldl''' f a [] = a
| foldl''' f a (h::t) = foldl''' f (f (h, a)) t
There is also a built-in function foldr that operates on a list from right-to-left and can be defined:
fun foldr' f a [] = a
| foldr' f a (h::t) = f(h, (foldr' f a t))
It is instructive to compare this to foldl. Note that in operating right-to-left, the recursive call to foldr on the tail is done first, and only then is f applied to the head and the result of that call. In contrast, in operating left-to-right, f is applied to the head and the accumulator and then the recursive call to foldl is made. This causes a large difference in the amount of memory used by right-to-left folding and left-to-right folding. More on this soon when we talk about tail recursion.
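The order difference matters for results as well as memory when the combining function is sensitive to order. A classic illustration, using the foldl' and foldr' definitions above with the list constructor op :: as the combining function:

```sml
(* With ::, folding left reverses the list, while folding right copies it. *)
val rev123  = foldl' (op ::) [] [1,2,3]   (* [3,2,1] *)
val copy123 = foldr' (op ::) [] [1,2,3]   (* [1,2,3] *)
```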
Now, returning to sumIntlist and concatStringlist using foldl:
fun sumIntlist'' (s:int list):int = foldl' (fn (x,a) => a+x) 0 s
fun concatStringlist'' (s:string list):string = foldl' (fn (x,a) => a^x) "" s
What happens if we replace foldl by foldr in these definitions (other than the memory issue we just mentioned for large lists)?
SML provides a notation so that infix binary operators can be used as functions, using the op keyword: op+, or op +, is a function that takes two arguments and adds them. Thus we can also define sumIntlist very concisely as
val sumIntlist''' = foldl' (op+) 0
Note the use of val rather than fun in this declaration, and the fact that foldl is curried, so the type of sumIntlist''' is fn : int list -> int. What would the definition look like using fun rather than val? For clarity, when declaring a function using val rather than the more customary fun, it is a good idea to make it evident that the value being declared is a function.
Recall that map applies a function to each element of a list, constructing a new list as a result. We can define an uncurried version of map as
fun mapU (f, []) = []
| mapU (f,h::t) = f(h) :: mapU(f,t)
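For instance, a couple of illustrative uses of mapU:

```sml
val doubled = mapU ((fn x => 2*x), [1,2,3])    (* [2,4,6] *)
val lens    = mapU (String.size, ["ab","cde"]) (* [2,3] *)
```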
As with foldl, we can define a curried version, either by currying the uncurried one or directly.
fun curry2 f x y = f(x,y)
fun map' f s = curry2 mapU f s

fun map'' f [] = []
  | map'' f (h::t) = f(h) :: map'' f t
Using map we can define a function to make a copy of a list (with an anonymous identity function):
fun id s = map' (fn (x) => x) s