CS 312 Lecture 4:
Map, Fold and the Map-reduce Paradigm

The Map-Reduce Paradigm

Map-reduce is a programming model that has its roots in functional programming.  In addition to often producing short, elegant code for problems involving lists or collections, this model has proven very useful for large-scale, highly parallel data processing.  Here we will think of map and reduce as operating on lists for concreteness, but they are appropriate to any collection (sets, etc.).  Map operates on a list of values to produce a new list of values, by applying the same computation to each value.  Reduce operates on a list of values to collapse or combine those values into a single value (or more generally a smaller number of values), again by applying the same computation to each value.

For large datasets, it has proven particularly valuable to think about processing in terms of the same operation being applied to all the items in the dataset, either to produce a new dataset or to produce a summary result for the entire dataset. This way of thinking often maps well onto parallel hardware, where each processor can handle one or more data items in parallel.  The Connection Machine, a massively parallel computer developed in the late 1980s, made heavy use of this programming paradigm.  More recently, Google, with its large server farms, has made very effective use of it.  Google reported some of these uses in a 2004 paper at OSDI (the Symposium on Operating Systems Design and Implementation).  Much of the focus of that paper is on separating the fault tolerance and distributed processing issues of operating on large clusters of machines from the programming language abstractions.  The authors found map-reduce to be a useful way of separating the conceptual model of mapping and reducing large collections or lists from the issue of how that computation is implemented reliably on a large cluster of machines.

For our purposes here in a programming course, it is illustrative to see what kinds of problems Google found useful to express in the map-reduce paradigm. Counting the number of occurrences of each word in a large collection of documents is a central computational issue for indexing large document collections.  This can be expressed as mapping a function that returns the count of a given word in each document across a document collection.  Then the result is reduced by summing all the counts together. So if we have a list of strings, the map returns a list of integers with the count for each string.  The reduce then sums up those integers.  Many other counting problems can be implemented similarly, for instance counting the number of occurrences of some pattern in log files, such as the number of occurrences of a given user query or a particular URL.  Although there are many log files on different hosts, they can be viewed as one large collection of strings, with the same map and reduce operations as for document word counts.
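The word-count idea can be sketched on a small in-memory list of strings. This is only an illustration using the built-in map and foldl; the helper names countWord and totalCount and the sample documents are made up here, and splitting on whitespace is an assumed simplification of what "word" means.

```sml
(* countWord: hypothetical helper that counts how often w occurs in one
   document, treating whitespace-separated tokens as words. *)
fun countWord (w : string) (doc : string) : int =
  length (List.filter (fn t => t = w) (String.tokens Char.isSpace doc))

(* Map: produce one count per document.  Reduce: sum the counts. *)
fun totalCount (w : string) (docs : string list) : int =
  foldl (op +) 0 (map (countWord w) docs)

(* Made-up sample data. *)
val docs = ["the cat sat", "the dog and the bird", "no match here"]
val n = totalCount "the" docs   (* 1 + 2 + 0 = 3 *)
```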

Reversing the links of the web graph is another problem that can be viewed this way.  The web is a set of out-links, from a given page to the pages that it links to.  A map function can output target-source pairs for each source page, and a reduce function can collapse these into a list of source pages corresponding to each target page (i.e., links in to pages).
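As a rough sketch of link reversal, assume a tiny web represented as a list of (source, out-links) pairs. The type page and the functions emitPairs, addSource and reverseLinks are names invented for this example, and the association-list table is chosen for brevity rather than efficiency.

```sml
(* A page paired with the pages it links to (illustrative representation). *)
type page = string * string list

(* Map step: emit a (target, source) pair for every out-link of one page. *)
fun emitPairs ((src, targets) : page) : (string * string) list =
  map (fn t => (t, src)) targets

(* Reduce step: add one source to the table entry for its target,
   creating the entry if the target has not been seen yet. *)
fun addSource ((target, src), table) =
  case table of
    [] => [(target, [src])]
  | (t, srcs) :: rest =>
      if t = target then (t, src :: srcs) :: rest
      else (t, srcs) :: addSource ((target, src), rest)

fun reverseLinks (pages : page list) : (string * string list) list =
  foldl addSource [] (List.concat (map emitPairs pages))

val web = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]
val inLinks = reverseLinks web
(* inLinks = [("b", ["a"]), ("c", ["b", "a"]), ("a", ["c"])] *)
```

The same shape, with (term, document ID) pairs in place of (target, source) pairs, would give a small inverted index.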

An inverted index is a map from a term to all the documents containing that term.  It is an important part of a search engine, as the engine must be able to quickly map a term to relevant pages.  In this case a map function returns pairs of terms and document identifiers for each document. The reduce collapses the result into the list of document IDs for a given term.

In the Google OSDI paper the authors report that re-implementing the production indexing system with map-reduce resulted in code that was simpler, smaller, and easier to understand and modify, and in a service that was easier to operate (i.e., failure diagnosis, recovery, etc.), yet the approach results in code that is fast enough to be used for a key part of the service.

Folding, map and reduce in SML

Last time we looked at singly linked lists of integers, defining a datatype we called intlist.  SML has built-in singly linked lists, which have type list.  A list must be composed of elements that are all of the same type. The element type is written before the list type; for instance, the type int list specifies a list of integers.  [] is the empty list, and a nonempty list h::tl consists of the first element h followed by the rest of the list tl.  That is, :: is the operator that prepends an element to the beginning of a list.
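A few small examples of the list notation may help. The names empty, xs, ys and firstOr are chosen only for illustration.

```sml
val empty : int list = []
val xs = 1 :: 2 :: 3 :: []     (* the same list as [1, 2, 3] *)
val ys = 0 :: xs               (* [0, 1, 2, 3]; xs itself is unchanged *)

(* Pattern matching takes a nonempty list apart into head and tail.
   firstOr returns the first element, or a default for the empty list. *)
fun firstOr (default : int, s : int list) : int =
  case s of
    [] => default
  | h :: _ => h
```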

First let's look in more detail at the reduce operation.  Suppose we want to write a function to sum a list of integers.  By now you should be able to write the following code:

fun sumIntlist (s:int list):int =
  case s of
    [] => 0
  | h::t => h + (sumIntlist t)

Now say we want to concatenate a list of strings, again producing a single value from a list.  We can write:

fun concatStringlist (s:string list):string =
  case s of
    [] => ""
  | h::t=> h ^ (concatStringlist t)

These two functions look almost identical.  With the exception of the different types and different operation (^ vs +), both functions are doing the same thing.  In both cases, we walk down a list performing an operation that collapses the list into a single value by computing something based on the elements of the list.  Since code reuse is a big help in reducing both bugs and coding time, we want to abstract this out.

As we consider the items of a list we can store a partial result in an accumulator.  This accumulator changes for each item in the list.  For example, if we want to walk across a list of integers and sum them, we could store the current sum in the accumulator.  We start with the accumulator set to 0.  As we come across each new element, we add the element to the accumulator.  When we reach the end, we return the value stored in the accumulator.

Let's try to rewrite the sumIntlist function to introduce the idea of an accumulator.

fun  sumIntlistAccum (accum:int, s:int list):int =
  case s of
   [] => accum
   | h::t => sumIntlistAccum((accum+h),t)

Of course, to get the sum, we must call sumIntlistAccum with 0 for the initial accum value.  Similarly, we can rewrite concatStringlist with this concept of the accumulator.

fun concatStringlistAccum (accum:string, s:string list):string =
  case s of
   [] => accum
   | h::t => concatStringlistAccum((accum^h),t)

To use this function, we pass in the empty string for the initial accumulator.  Now we can see even more similarity between the two functions. 
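To see the accumulator at work, it can help to trace one call by hand. The sketch below repeats sumIntlistAccum so that it stands alone; the wrapper name sumList is an assumption added here to hide the initial accumulator from callers.

```sml
fun sumIntlistAccum (accum : int, s : int list) : int =
  case s of
    [] => accum
  | h :: t => sumIntlistAccum (accum + h, t)

(* Hypothetical wrapper that supplies the initial accumulator 0. *)
fun sumList (s : int list) : int = sumIntlistAccum (0, s)

(* Evaluation trace:
     sumIntlistAccum (0, [1, 2, 3])
   = sumIntlistAccum (1, [2, 3])
   = sumIntlistAccum (3, [3])
   = sumIntlistAccum (6, [])
   = 6 *)
val six = sumList [1, 2, 3]
```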

We are now nearly in a position to eliminate any differences between the two, by passing in a function that acts on the head of the list and the accumulator.   However, another difference between the two functions is that the types of their parameters are different.  For the moment, let's write a function that captures the common pattern while ignoring the types of the variables:

fun accumulate (f, a, s) =
  case s of
    [] => a
  | h::t => accumulate (f, (f(h,a)),t)

Now we can rewrite sumIntlist and concatStringlist as

fun sumIntlist' (s:int list):int = accumulate((fn(x,a)=>a+x),0,s)
fun concatStringlist' (s:string list):string = accumulate((fn(x,a)=>a^x),"",s)

Note that the function accumulate appears not to be type safe, in that it does not explicitly enforce agreement between the types of the parameters of the function f and the types of the accumulator a and the elements of the list s.  Namely, the type of the first parameter of f needs to agree with that of the elements of s, and the type of the second parameter needs to agree with that of a.  However, the ML compiler has actually inferred parameterized types for the variables that enforce these constraints.  That is, it has inferred the necessary type agreement without specifying what the specific types are.  A natural way to specify agreement is to use variable names, thereby expressing that two things must be the same because they have the same variable name.

Next week we will talk about parameterized types in some detail, but today we will write functions without explicit types for the parameters, even though this is generally not good programming practice, and will let the compiler infer the types.  In that way we can focus on issues regarding higher order functions rather than on types.  However before doing so, it is worth considering the inferred type for accumulate, which is,

        fn : ('a * 'b -> 'b) * 'b * 'a list -> 'b

A name that starts with a ' in ML is a type variable, which stands for the parameter of a parameterized type.  We will consider parameterized types next week, but for now just look at the pattern.  The first parameter of f is of the same type, 'a, as the elements of the list s. Similarly, the second parameter is of the same type, 'b, as the value returned by f, the accumulator a, and the value returned by accumulate. However, the accumulator a need not be of the same type as the elements of the list s (although in sumIntlist and concatStringlist these types are the same).
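As a small illustration of the case where 'a and 'b differ, the sketch below repeats accumulate and uses it with a string list and an int accumulator, and then with an int list and a bool accumulator. The function names totalLength and anyNegative are invented for this example.

```sml
fun accumulate (f, a, s) =
  case s of
    [] => a
  | h :: t => accumulate (f, f (h, a), t)

(* Here 'a = string and 'b = int: total number of characters in the list. *)
fun totalLength (s : string list) : int =
  accumulate (fn (x, a) => a + size x, 0, s)

(* Here 'a = int and 'b = bool: is any element negative? *)
fun anyNegative (s : int list) : bool =
  accumulate (fn (x, a) => a orelse x < 0, false, s)
```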

There are two powerful built-in functions in SML for the reduce paradigm.  One is foldl, which abstracts the pattern that we have captured in accumulate, considering the elements of a list in left-to-right order and building up a result in an accumulator.  Unlike accumulate, however, it is curried.  We saw currying in recitation: the arguments are applied one at a time, creating intermediate functions, rather than all being passed together as a tuple in a single function call.  A version of foldl can be defined as:

fun foldl' f a s =
  case s of
    [] => a
  | h::t => foldl' f (f (h,a)) t

It is worth spending a minute comparing this curried function to the uncurried accumulate.  Note that the type of foldl' (and the built-in foldl function) is:

        fn: ('a * 'b -> 'b) -> 'b -> 'a list -> 'b

Another way to think about this curried version of foldl is in terms of explicit currying.

fun curry3 f x y z = f(x,y,z)
 
fun foldl'' f a s = curry3 accumulate f a s

Finally, the parameter s is not really needed in foldl; we can pattern match directly in the function clauses instead of using a case expression:

fun foldl''' f a [] = a
  | foldl''' f a (h::t) = foldl''' f (f (h, a)) t

There is also a built-in function foldr that operates on a list from right to left and can be defined as:

fun foldr' f a [] = a
  | foldr' f a (h::t) = f(h, (foldr' f a t))

It is instructive to compare this to foldl. Note that in operating right-to-left, the recursive call to foldr on the tail is made first, and only afterwards is f applied to the head and the result of that call.  In contrast, in operating left-to-right, f is applied to the head and the accumulator first, and then the recursive call to foldl is made.  This causes a large difference in the amount of memory used by right-to-left folding and left-to-right folding.  More on this soon, when we talk about tail recursion.
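One way to see the difference in order is to fold with the list constructor :: itself, using the built-in foldl and foldr. The value names below are chosen just for this sketch.

```sml
(* For a three-element list the two folds expand as follows:
     foldl f a [x1, x2, x3] = f (x3, f (x2, f (x1, a)))
     foldr f a [x1, x2, x3] = f (x1, f (x2, f (x3, a)))  *)
val xs = [1, 2, 3]

val reversed = foldl (op ::) [] xs   (* 3 :: (2 :: (1 :: [])) = [3, 2, 1] *)
val copied   = foldr (op ::) [] xs   (* 1 :: (2 :: (3 :: [])) = [1, 2, 3] *)
```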

Now returning to sumIntlist and concatStringlist using foldl:

fun sumIntlist'' (s:int list):int = foldl' (fn(x,a)=>a+x) 0 s
fun concatStringlist'' (s:string list):string = foldl' (fn(x,a)=>a^x) "" s

What happens if we replace foldl by foldr in these definitions (other than the memory issue we just mentioned for large lists)?

SML provides a notation so that infix binary operators can be used as functions, using the op keyword, so op+, or op +, is a function that takes two arguments and adds them.  Thus we can also define sumIntlist very concisely as

val sumIntlist''' = foldl' (op+) 0

Note the use of val rather than fun in this declaration and the fact that foldl is curried, so the type of sumIntlist''' is fn: int list -> int.  What would the definition look like using fun rather than val?  When a function is declared using val rather than the more customary fun, it is a good idea to make it clear to the reader that the value being declared is a function.
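The same op notation works for other infix operators. The sketch below uses the built-in foldl; the names product and sumReals are made up here. One practical detail: multiplication is written (op * ) with a space before the closing parenthesis, because the characters *) would otherwise be read as the end of a comment.

```sml
(* Partial application of the curried foldl yields a function. *)
val product  = foldl (op * ) 1      (* int list -> int *)
val sumReals = foldl (op +) 0.0     (* real list -> real; + is overloaded *)

val p = product [2, 3, 4]           (* 24 *)
val r = sumReals [1.5, 2.5]         (* 4.0 *)
```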

Map

Recall that map applies a function to each element of a list, constructing a new list as a result. We can define an uncurried version of map as

fun mapU (f, []) = []
  | mapU (f,h::t) = f(h) :: mapU(f,t)

As with foldl, we can define a curried version, either by currying the uncurried one or directly.

fun curry2 f x y = f(x,y)

fun map' f s = curry2 mapU f s

fun map'' f [] = []
  | map'' f (h::t) = f(h) :: map'' f t

Using map with an anonymous function that returns its argument unchanged, we can define a function that makes a copy of a list:

fun id s = map' (fn (x) => x) s
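A few more uses of map, here with the built-in map and foldl, show the element type changing and a map feeding a reduce. The value names are chosen only for illustration.

```sml
val squares = map (fn x => x * x) [1, 2, 3]     (* [1, 4, 9] *)
val lengths = map size ["map", "fold", ""]      (* [3, 4, 0]; string list to int list *)

(* Map followed by reduce: the sum of the squares. *)
val sumSquares = foldl (op +) 0 (map (fn x => x * x) [1, 2, 3])   (* 14 *)
```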