CS312 Lecture 10: Binary Search Trees

Trees are one of the most important data structures in computer science. We can think of a tree both as a mathematical abstraction and as a very concrete data structure used to efficiently implement other abstractions such as sets and dictionaries. The ML language turns out to be very well designed for manipulating trees.

Graphs

We start with the more general concept of a graph. A graph consists of a set of nodes (also called vertices) and edges that connect these nodes together. In an undirected graph, any two distinct nodes are either connected by an edge or not (a node may not be connected to itself). The nodes may be any kind of object; typically, the node objects contain some additional information that is to be stored at that location in the graph. We may draw a graph pictorially using labeled dots or circles for nodes and lines to represent the edges that connect them.

There is another important kind of graph, directed graphs, which we will talk about later. In these graphs the edges have a directionality and we draw them as arrows.

An undirected graph is connected if every node is reachable from every other node (a node is reachable from another node if it can be reached by following some sequence of edges). If some nodes are not reachable from other nodes, the graph is disconnected.

A cycle is a sequence of nodes in which an edge goes from each node in the sequence to the next, and an edge goes from the last node in the sequence to the first one. Pictorially it looks like a loop. If a graph has no cycles, it is said to be acyclic.

Trees

Trees are a particularly important kind of graph. A tree is an undirected, connected, acyclic graph in which one of the nodes is distinguished from the rest and is called the root node of the tree. Because a tree is a connected graph, every node is reachable from the root. Because the tree is acyclic, there is only one way to get from the root to any given node. It is a convention in computer science to draw trees upside down: the root node is drawn at the top and every other node is drawn below it, with any given node drawn below the other nodes on the path from the root to that node. The depth of a node is the number of edges that must be traversed to get from the root node to that node. The depth of the root is zero. The height of a tree is the largest depth of any node in the tree.

Every node N (except the root) is connected by an edge to a single node whose depth is one less. This node is called the parent node of N. The other nodes to which N is connected, if any, have depth one greater than N, and are called children of N. The nodes along the path from the root to a node N are called the ancestors of N, and the nodes whose paths to the root include N (other than N itself) are called the descendants of N. In general a node may have any number of children. The number of children of a node is called the degree of a node. If a node has degree zero (no children), it is called a leaf (or external) node. Other nodes are known as internal nodes. A subtree is a set of descendants of a particular node, plus that node as the root of the subtree.

Binary Trees

Generally a tree structure by itself is not very useful. It becomes interesting when we attach information to the nodes of the tree. We can then navigate down the tree to find information of interest within the tree. In order to find data in a tree efficiently we will need to impose some restrictions on where data is placed within the tree, and we will need to keep more information about the ordering of the children of a given node.

A binary tree is a tree in which every node has degree at most two, and in which each child is designated as either a left child or a right child. A node of degree two must have one of each. This sounds complicated, but as an ML datatype it is quite straightforward. A binary tree either contains no nodes at all (Empty), or it contains a root node with a left subtree and a right subtree. Here is how we might declare a tree that stores integers at every node:

datatype inttree = Empty | Node of inttree * int * inttree

or, if we felt that the tuple didn't document things enough, we could define the type using a record:

datatype inttree = Empty | Node of {left: inttree, value: int, right: inttree}

Binary trees can be generalized to trees in which each node has degree at most k. Such trees are called k-ary trees. Each node can have up to k children, and each child has a distinct index in the range 1..k (or 0..k-1). Thus, a binary tree is just a k-ary tree with k=2.

There was nothing about the datatypes above that required them to contain integers. We can define a parameterized tree type just as easily:

datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

A k-ary tree is full if every internal node has degree k and every leaf node has the same depth. Suppose that a tree has degree k at all internal nodes, and all leaf nodes have depth h. A tree of height 0 has 1 node, of height 1 has $k+1$ nodes, of height 2 has $k^2+k+1$ nodes, etc. That is, a full k-ary tree of height h has at least $k^h$ nodes; in fact, it has $\sum_{i=0}^{h} k^i$ nodes. With a few simple manipulations we see this is equal to $(k^{h+1}-1)/(k-1)$. Using this formula, we see that a full binary tree of height h contains $2^{h+1}-1$ nodes. Because this expression is exponential in h, even a relatively short path through a k-ary tree can get us to a huge amount of data.
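As a quick check of the node count for the binary case, here is a minimal sketch using the 'a tree type above. The helper names full, size and height are illustrative assumptions, not part of the lecture's code, and giving the empty tree height ~1 is just one possible convention.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* full (h, x) builds a full binary tree of height h with x at every node.
 * Requires: h >= 0 *)
fun full (h: int, x: 'a): 'a tree =
  if h = 0 then Node(Empty, x, Empty)
  else Node(full (h-1, x), x, full (h-1, x))

(* size t is the number of nodes in t *)
fun size (t: 'a tree): int =
  case t of
    Empty => 0
  | Node(lf, _, rg) => size lf + 1 + size rg

(* height t is the largest depth of any node in t.
 * Assumption: the empty tree is given height ~1, so that a
 * single node has height 0. *)
fun height (t: 'a tree): int =
  case t of
    Empty => ~1
  | Node(lf, _, rg) => 1 + Int.max (height lf, height rg)

val t3 = full (3, ())
val n3 = size t3      (* 2^(3+1) - 1 = 15 *)
val h3 = height t3    (* 3 *)
```

Each extra level doubles the number of leaves, which is why the count grows exponentially in the height.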

Traversals

It is very easy to write a recursive function to traverse a binary (or k-ary) tree; that is, to visit all of its nodes. There are three obvious ways to write the traversal code. In each case the recursive function visits the subtrees of the current node recursively, and also inspects the value of the current node. However, there is a choice about when to inspect the value of the current node. For example, we can write three versions of a fold function that operates on trees:

In a pre-order traversal, the value at the node is considered before the two subtrees:

fun fold_pre (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      let val b1:'b = f(v,b0)
          val b2:'b = fold_pre f b1 lf in
                      fold_pre f b2 rg
      end

In a post-order traversal, the value at the node is considered after the two subtrees:

fun fold_post (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      let val b1:'b = fold_post f b0 lf
          val b2:'b = fold_post f b1 rg in
                      f(v, b2)
      end

In an in-order traversal, the value at the node is considered between the two subtrees:

fun fold_in (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      let val b1:'b = fold_in f b0 lf
          val b2:'b = f(v, b1) in
                      fold_in f b2 rg
      end
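To see how the three orders differ, the following sketch runs all three folds over one small tree and records the order in which values are visited. The folds are written more compactly than above but are meant to behave the same way; the helper visits and the example tree t are illustrative assumptions.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

fun fold_pre f b0 t =
  case t of
    Empty => b0
  | Node(lf, v, rg) => fold_pre f (fold_pre f (f (v, b0)) lf) rg

fun fold_post f b0 t =
  case t of
    Empty => b0
  | Node(lf, v, rg) => f (v, fold_post f (fold_post f b0 lf) rg)

fun fold_in f b0 t =
  case t of
    Empty => b0
  | Node(lf, v, rg) => fold_in f (f (v, fold_in f b0 lf)) rg

(* The example tree has 2 at the root, 1 as its left child
 * and 3 as its right child. *)
val t = Node(Node(Empty, 1, Empty), 2, Node(Empty, 3, Empty))

(* Consing each visited value onto an accumulator yields the visit
 * order reversed, so reverse the result to read it in visit order. *)
fun visits fold = rev (fold (op ::) [] t)

val pre  = visits fold_pre    (* [2, 1, 3] *)
val post = visits fold_post   (* [1, 3, 2] *)
val ino  = visits fold_in     (* [1, 2, 3] *)
```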

Binary Search Trees

Of course, we don't really want to have to traverse the whole tree to find a data element. Suppose that we want to find an element in the tree (or check whether it is in the tree) and we have an ordering on elements that allows us to compare two elements to see whether one is less than the other. A binary search tree lets us exploit this ordering to find elements efficiently. A binary search tree is a binary tree that satisfies the following invariant:

For each node in the tree, the elements stored in its left subtree are all less than the element of the node, and the elements stored in its right subtree are all greater than the element of the node.

When a tree satisfies the data structure invariant, an in-order traversal inspects the value of each node in ascending order.
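The converse also holds: if an in-order traversal yields a strictly increasing sequence, the tree satisfies the invariant. That gives a simple (if not the most efficient) way to test the invariant. The names inorder, ascending and isBST below are illustrative assumptions, not part of the lecture's code.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* inorder t lists the elements of t in in-order *)
fun inorder (t: 'a tree): 'a list =
  case t of
    Empty => []
  | Node(lf, v, rg) => inorder lf @ (v :: inorder rg)

(* ascending cmp xs is whether xs is strictly increasing under cmp *)
fun ascending (cmp: 'a * 'a -> order) (xs: 'a list): bool =
  case xs of
    x :: (rest as y :: _) => cmp (x, y) = LESS andalso ascending cmp rest
  | _ => true

(* isBST cmp t is whether t satisfies the binary search tree invariant *)
fun isBST (cmp: 'a * 'a -> order) (t: 'a tree): bool =
  ascending cmp (inorder t)

val good = Node(Node(Empty, 1, Empty), 2, Node(Empty, 3, Empty))
val bad  = Node(Node(Empty, 3, Empty), 2, Node(Empty, 1, Empty))
val okGood = isBST Int.compare good   (* true *)
val okBad  = isBST Int.compare bad    (* false *)
```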

Finding elements

This invariant allows efficient navigation of the tree to find an element e if it is present. Arriving at a given node, we can compare e to the value v stored at the node. If e is equal to v, then we have found it. Otherwise, e is either less than or greater than v, in which case we know that the element, if present, must be found in the left subtree or the right subtree respectively.

   fun contains (t:'a tree, e:'a, cmp:'a*'a->order) =
      case t of
         Empty => false
       | Node(lf, v, rg) =>
            case cmp(e, v) of
               LESS => contains (lf, e, cmp)
             | EQUAL => true
             | GREATER => contains(rg, e, cmp)

Given a binary search tree of height h, this function makes at most h+1 recursive calls (the last one on an Empty subtree) as it walks down a path from the root toward a leaf of the tree.
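As a usage sketch, here is contains applied to a small hand-built search tree of integers, with Int.compare from the Basis Library as the ordering. The example tree t is an assumption made for illustration; the comments trace the path each lookup follows.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

fun contains (t: 'a tree, e: 'a, cmp: 'a * 'a -> order): bool =
  case t of
    Empty => false
  | Node(lf, v, rg) =>
      (case cmp (e, v) of
         LESS => contains (lf, e, cmp)
       | EQUAL => true
       | GREATER => contains (rg, e, cmp))

(* The example tree: 5 at the root; 3 is its left child, with left
 * child 1; 8 is its right child, with children 7 and 9. *)
val t = Node(Node(Node(Empty, 1, Empty), 3, Empty),
             5,
             Node(Node(Empty, 7, Empty), 8, Node(Empty, 9, Empty)))

val has7 = contains (t, 7, Int.compare)   (* true: visits 5, 8, 7 *)
val has4 = contains (t, 4, Int.compare)   (* false: visits 5, 3, then Empty *)
```

Neither lookup touches the nodes holding 1 or 9, since the comparisons rule out those subtrees.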

Inserting elements

Suppose that we have a binary search tree, and we would like to create a new binary search tree that contains one additional element. We can write this recursively too:

   fun add(t:'a tree, e:'a, cmp:'a*'a->order): 'a tree =
      case t of
         Empty => Node(Empty, e, Empty)
       | Node(lf, v, rg) =>
            case cmp(e, v) of
               LESS => Node(add (lf, e, cmp), v, rg)
             | EQUAL => t
             | GREATER => Node(lf, v, add(rg, e, cmp))

This code is simple and will make at most h+1 recursive calls when inserting into a tree of height h. However, there is a lurking performance problem. Suppose that we insert a series of n elements that are always increasing in value. In this case the code will always follow the GREATER arm of the case expression and will build a tree that looks just like a linked list of length n. Looking up an element might then require looking at the entire tree. We will see later how to do a better job.
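The effect of insertion order can be observed directly. The sketch below inserts the same seven integers in two different orders and compares the heights of the resulting trees. The helpers height and fromList are illustrative assumptions, and the empty tree is given height ~1 by convention.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

fun add (t: 'a tree, e: 'a, cmp: 'a * 'a -> order): 'a tree =
  case t of
    Empty => Node(Empty, e, Empty)
  | Node(lf, v, rg) =>
      (case cmp (e, v) of
         LESS => Node(add (lf, e, cmp), v, rg)
       | EQUAL => t
       | GREATER => Node(lf, v, add (rg, e, cmp)))

(* Assumption: the empty tree has height ~1, so a single node has height 0 *)
fun height (t: 'a tree): int =
  case t of
    Empty => ~1
  | Node(lf, _, rg) => 1 + Int.max (height lf, height rg)

(* fromList xs inserts the elements of xs one at a time, left to right *)
fun fromList (xs: int list): int tree =
  foldl (fn (x, t) => add (t, x, Int.compare)) Empty xs

val sorted = fromList [1, 2, 3, 4, 5, 6, 7]
val mixed  = fromList [4, 2, 6, 1, 3, 5, 7]
val hSorted = height sorted   (* 6: every node has only a right child *)
val hMixed  = height mixed    (* 2: a full tree on the same 7 elements *)
```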

Finding ranges

Recall that an in-order traversal visits nodes in ascending order of their elements. We can use this fact to efficiently find all the elements in a tree in a range between two elements. For example, we can write a fold operation that only considers such elements:

fun fold_range (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) (cmp:'a*'a->order) (a0:'a, a1:'a) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      case (cmp(a0,v), cmp(a1,v)) of
        (LESS, LESS) => fold_range f b0 lf cmp (a0,a1)
      | (GREATER, GREATER) => fold_range f b0 rg cmp (a0,a1)
      | (LESS, EQUAL) => f(v, fold_range f b0 lf cmp (a0, a1))
      | (EQUAL, GREATER) => fold_range f (f(v,b0)) rg cmp (a0,a1)
      | (_, _) => fold_range f (f(v, fold_range f b0 lf cmp (a0,a1))) rg cmp (a0,a1)

This code will only visit the nodes in the tree that are within the range and the ancestors of those nodes, which is potentially quite efficient.
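A usage sketch for fold_range, assuming a0 <= a1 and a small hand-built search tree of integers (the tree t and the chosen range are illustrative). Folding with op :: collects the elements in the range; folding with op + sums them.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

fun fold_range (f: 'a*'b -> 'b) (b0: 'b) (t: 'a tree)
               (cmp: 'a*'a -> order) (a0: 'a, a1: 'a): 'b =
  case t of
    Empty => b0
  | Node(lf, v, rg) =>
      (case (cmp (a0, v), cmp (a1, v)) of
         (LESS, LESS) => fold_range f b0 lf cmp (a0, a1)
       | (GREATER, GREATER) => fold_range f b0 rg cmp (a0, a1)
       | (LESS, EQUAL) => f (v, fold_range f b0 lf cmp (a0, a1))
       | (EQUAL, GREATER) => fold_range f (f (v, b0)) rg cmp (a0, a1)
       | (_, _) => fold_range f (f (v, fold_range f b0 lf cmp (a0, a1)))
                              rg cmp (a0, a1))

(* The example tree: 5 at the root; 3 is its left child, with left
 * child 1; 8 is its right child, with children 7 and 9. *)
val t = Node(Node(Node(Empty, 1, Empty), 3, Empty),
             5,
             Node(Node(Empty, 7, Empty), 8, Node(Empty, 9, Empty)))

(* op :: builds the list in reverse visit order, hence the rev *)
val inRange = rev (fold_range (op ::) [] t Int.compare (3, 8))   (* [3, 5, 7, 8] *)
val total   = fold_range (op +) 0 t Int.compare (3, 8)           (* 23 *)
```

In this run the nodes holding 1 and 9 are never inspected, because the comparisons at 3 and 8 show that their subtrees lie outside the range.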

Implementing sets

Since we can test for presence in a tree, and we can add new elements to a tree, we have the makings of an implementation for a very important abstraction in any programming language: the set. In fact, we can do a little better and implement an ordered set abstraction, for which the following is a possible signature:

signature ORD_SET = sig
   type elt

   (* a "set" is a set of elements of type elt where these
    * elements have a total ordering. *)
   type set

   (* empty is the empty set *)
   val empty : set

   (* single(x) is {x} *)
   val single : elt -> set

   (* add (s,x) adds element x to set s *)
   val add: set*elt -> set

   (* union is set union. *)
   val union : set*set -> set

   (* contains(s,x) is whether x is a member of s *)
   val contains: set*elt -> bool

   (* size(s) is the number of elements in s *)
   val size: set -> int

   (* left(s,e) is the set of all e' in s s.t. e'<=e *)
   val left: set*elt -> set

   (* right is the dual of left: right(s,e) is the set of
    * all e' in s s.t. e'>=e *)
   val right: set*elt -> set

   (* range (s,e1,e2) is the set of all elements of s that are
    * between e1 and e2, inclusive.
    * Requires: e1<=e2 *)
   val range: set*elt*elt -> set

   (* toList (s) returns the list of all elements of s *)
   val toList: set -> elt list
end

Notice that this signature does not define what the types elt and set are. However, the two are left unspecified for different reasons. elt is left unspecified so that the signature can be instantiated on a particular type elt that one wants to make sets of. The type set is unspecified to provide data abstraction -- so clients of the signature cannot find out how sets are implemented and possibly misuse them.

To instantiate a signature, we use a where clause. The signature ORD_SET where type elt = T is a signature just like ORD_SET, except that every place that elt appears is replaced by T. If we wanted to implement ORD_SET for a particular element type (such as int), we could write a structure that implements that signature as follows:

structure intset : ORD_SET where type elt=int = struct
  ...
end

However, that would be pretty inefficient if it resulted in having to implement binary trees for every possible element type. A better approach is to use a functor: a function that returns a structure. A functor takes some number of arguments (which are also structures) and returns a structure. The "types" of the arguments and result are signatures. A functor that implements the ordered set signature is given below.

We can implement binary search trees on any type that has an ordering function, so we can write a signature ORD_KEY to describe the input to the corresponding functor. The result is an ORD_SET structure where the element type is the same type that was specified in the input to the functor, O.ord_key. Notice that this functor can be applied to the same element type with a different ordering function to get a set whose elements are differently ordered.

Notice the use of the symbol :> in declaring that the functor produces an ORD_SET. This indicates that the details of the implementation are hidden: the only things that can be known about values of type set are those that can be observed through the ORD_SET signature. If we had written : instead of :>, we would be able to violate the abstraction barrier and see that the sets are actually binary trees. Sometimes this is useful for debugging purposes.

signature ORD_KEY = sig
   type ord_key
   val compare : (ord_key * ord_key) -> order
end

functor BinarySetFn (O: ORD_KEY) :> ORD_SET where type elt = O.ord_key = struct

   type elt = O.ord_key
   
   (* Representation invariant:
    *    For every Node (l, e, r), all elements of l are less
    *    than e and all elements of r are greater than e.
    *)
    
   datatype set = Empty | Node of set * elt * set

   val empty = Empty

   fun single (e: elt) = Node(Empty, e, Empty)

   fun add (s: set, e: elt) =
      case s of
         Empty => single e
       | Node(l, e', r) => (
            case O.compare (e',e) of
               LESS => Node(l, e', add (r,e))
             | EQUAL => s
             | GREATER => Node(add (l,e), e', r))

   fun union (sets: set*set) =
      case sets of
         (Empty, s) => s
       | (s, Empty) => s
       | (s1 as Node (l1, e1, r1), Node (l2, e2, r2)) => (
            case O.compare (e1,e2) of
               LESS => union (r1, Node
                                     (union (Node (l1, e1, Empty),
                                             l2), e2, r2))
             | EQUAL => Node (union (l1, l2), e1, union (r1, r2))
             | GREATER => union (l1,
                                 Node (l2, e2,
                                          union(Node(Empty,e1,r1), r2))))

   fun contains (s: set, e: elt) =
      case s of
         Empty => false
       | Node(l,e',r) => (
            case O.compare (e',e) of
               LESS => contains (r,e)
             | EQUAL => true
             | GREATER => contains (l,e))

   fun size (s: set) =
      case s of
         Empty => 0
       | Node (l, _, r) => size l + size r + 1

   fun left (s: set, max: elt) =
      case s of
         Empty => Empty
       | Node (l, e, r) => (
            case O.compare (e,max) of
               LESS => Node (l, e, left (r,max))
             | EQUAL => Node (l, e, Empty)
             | GREATER => left(l,max))

   fun right (s: set, min: elt) =
      case s of
         Empty => Empty
       | Node (l, e, r) => (
            case O.compare (e,min) of
               LESS => right(r,min)
             | EQUAL => Node(Empty, e, r)
             | GREATER => Node (right(l,min), e, r))

   fun range (s: set, min: elt, max: elt) =
      case s of
         Empty => Empty
       | Node (l, e, r) => (
            case (O.compare (min,e), O.compare (e,max)) of
               (LESS, LESS) => Node (right(l,min), e, left(r,max))
             | (LESS, EQUAL) => Node (right(l,min), e, Empty)
             | (LESS, GREATER) => range (l,min,max)
             | (EQUAL, LESS) => Node(Empty, e, left(r,max))
             | (EQUAL, EQUAL) => single e
             | (GREATER, LESS) => range (r,min,max)
             | _ => raise Fail "range precondition violated")

   fun toList (s: set) =
      case s of Empty => [] | Node (l, e, r) => (toList l) @ (e :: (toList r))

end
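To illustrate how such a functor is applied, here is a self-contained sketch built around a cut-down version of the idea. MINI_SET and MiniSetFn are hypothetical names introduced only to keep the example short; they keep just empty, add and toList from the signature above. The two instantiations show the same element type with two different orderings.

```sml
signature ORD_KEY = sig
  type ord_key
  val compare : ord_key * ord_key -> order
end

(* A cut-down signature for illustration only; not the lecture's ORD_SET. *)
signature MINI_SET = sig
  type elt
  type set
  val empty : set
  val add : set * elt -> set
  val toList : set -> elt list
end

functor MiniSetFn (O: ORD_KEY) :> MINI_SET where type elt = O.ord_key = struct
  type elt = O.ord_key
  datatype set = Empty | Node of set * elt * set
  val empty = Empty
  fun add (s: set, e: elt) =
    case s of
      Empty => Node(Empty, e, Empty)
    | Node(l, e', r) =>
        (case O.compare (e', e) of
           LESS => Node(l, e', add (r, e))
         | EQUAL => s
         | GREATER => Node(add (l, e), e', r))
  fun toList (s: set) =
    case s of Empty => [] | Node(l, e, r) => toList l @ (e :: toList r)
end

(* Integers in ascending order *)
structure IntSet = MiniSetFn (struct
  type ord_key = int
  val compare = Int.compare
end)

(* Same element type, reversed ordering *)
structure RevIntSet = MiniSetFn (struct
  type ord_key = int
  fun compare (x, y) = Int.compare (y, x)
end)

val up   = IntSet.toList
             (foldl (fn (x, s) => IntSet.add (s, x)) IntSet.empty [3, 1, 2, 3])
val down = RevIntSet.toList
             (foldl (fn (x, s) => RevIntSet.add (s, x)) RevIntSet.empty [3, 1, 2, 3])
(* up = [1, 2, 3], down = [3, 2, 1]; the duplicate 3 is stored once *)
```

Because the result is sealed with :>, the constructors Empty and Node are not visible through IntSet or RevIntSet, so clients can build sets only through empty and add.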