19. Sets and Maps
In today’s lecture, we’ll introduce two more ADTs, Set and Map. We’ll consider a few realizations of these ADTs backed by different data structures we have seen earlier in the course, and we’ll compare the performance trade-offs we get from each of these approaches. In the next lecture, we’ll introduce another amazing data structure, a hash table, that will provide even stronger performance guarantees under some assumptions. Later in the lecture, we’ll also consider some more specialized operations on Sets, and we’ll leverage Maps to develop an enhanced priority queue that allows the priorities of its elements to be updated.
The Set ADT
Sets are one of the central building blocks of mathematics that provide a way to collect different objects together into a single named entity. Sets are characterized by two main properties. First, their elements are unordered. There is no notion of a first, second, or last element in a set; in other words, sets are distinguished only by their contents (which elements they contain and don’t contain) and not by the particular order in which their elements are enumerated. Second, the elements of a set are distinct. A set cannot contain two or more copies of the same element.
A set is an unordered collection of distinct elements.
In designing a Set ADT, we must model these properties. As a collection, we will want a way to add() elements to our set, to remove() elements from it, and to check whether the set contains() a particular element. Since sets are unordered, our ADT should not provide a way to obtain an index of an element or expose notions of next, previous, first, or last elements. To ensure that the elements of our set are distinct, we will need to allow our add() method to fail if the client attempts to re-add an element that is already present in the set. We can do this by having add() return a boolean that indicates whether the addition was successful. Symmetrically, we’ll have our remove() method return a boolean that indicates whether the removal was successful (i.e., whether the element that the client asked to remove() was present in the set). This results in the following Set ADT.
Set.java
Note that our Set interface extends the Iterable interface so that a client can iterate over the elements of a Set using an enhanced-for loop. Since Sets are unordered, the specification makes no guarantee about the order in which the elements are returned by calls to the iterator’s next() method.
Now that we have introduced the basic Set interface, we can consider different data structures that can realize it.
ListSet
Similar to the priority queue from last lecture, we can implement the Set interface using a List (in this case, Java’s ArrayList) as the backing storage. Since our Set interface exposes less information than the List interface (it “forgets” about the notion of indices), a composition relationship is the natural choice. If we add() new elements at the end of the list, a particular element can be at an arbitrary position in the list. Therefore, checking for the presence of a particular element within the set (necessary for add(), contains(), and remove()) is a linear-time operation: a linear search. Our implementation is shown below.
ListSet.java
The size() and iterator() methods inherit the \(O(1)\) performance of their respective ArrayList counterparts. The runtimes of add(), contains(), and remove() are all dominated by the membership check, which is \(O(N)\) for the linear search. As we discussed above, the distinct-elements requirement of a Set requires us to frequently perform membership checks, so we’d prefer a data structure that can carry these out efficiently. Just as with the priority queue, a natural next idea is to impose a sorted-order invariant on the list entries.
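Before moving to the sorted variant, here is a minimal sketch of the unsorted, list-backed design described above. The interface name SimpleSet is our own placeholder (chosen to avoid clashing with java.util.Set), and the method bodies are one plausible realization of the description, not necessarily the course’s exact code.

```java
import java.util.ArrayList;
import java.util.Iterator;

// Hypothetical stand-in for the lecture's Set interface.
interface SimpleSet<T> extends Iterable<T> {
    boolean add(T elem);       // true if elem was newly added
    boolean contains(T elem);  // true if elem is currently in the set
    boolean remove(T elem);    // true if elem was present and has been removed
    int size();
}

// Assumed shape of a list-backed set: composition with an ArrayList.
class ListSet<T> implements SimpleSet<T> {
    // Class invariant: list contains no duplicate elements.
    private final ArrayList<T> list = new ArrayList<>();

    @Override
    public boolean contains(T elem) {
        return list.contains(elem);  // linear search, O(N)
    }

    @Override
    public boolean add(T elem) {
        if (contains(elem)) {
            return false;            // reject duplicates to keep elements distinct
        }
        list.add(elem);              // amortized O(1) append
        return true;
    }

    @Override
    public boolean remove(T elem) {
        return list.remove((Object) elem);  // O(N): search, then shift
    }

    @Override
    public int size() {
        return list.size();
    }

    @Override
    public Iterator<T> iterator() {
        return list.iterator();      // order is an implementation detail
    }
}
```

Note how add() pays for a full contains() scan on every call; this is the cost that the sorted and tree-backed designs try to reduce.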
SortedListSet
If our Set implementation composes with a sorted ArrayList, our membership queries can use a binary search instead of a linear search, improving their performance to \(O(\log N)\). We’ll extract this binary search into a private find() helper method that returns the index where this element is/would be located, as this extra information is required for our add() and remove() methods.
SortedListSet.java
As a review of binary search, take some time to complete the definition of the find() method according to its specification. To reduce the space complexity, we’ve developed an iterative definition using a loop invariant. Since the list field can store elements of any Comparable reference type, your definition will need to use the compareTo() method.
find() definition
Now, take some time to use this find() method to complete the definitions of the contains(), add(), and remove() methods.
contains() definition
add() definition
remove() definition
Let’s consider the runtime complexities of these methods. Since the find() method is performing a binary search on a list of \(N\) elements, it has an \(O(\log N)\) runtime. Since contains() just calls find() along with three extra \(O(1)\) checks, its runtime is also \(O(\log N)\). While we may be tempted to declare that add() and remove() also have \(O(\log N)\) runtimes (because of their call to find()), we must be careful. These methods insert or remove an element at its sorted position in the list, which may be at the start of the list and require an \(O(N)\) shift of the remaining elements. Thus, the worst-case runtimes of both add() and remove() are \(O(N)\). To address this, we’ll need to move away from a dense, linear backing storage.
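The analysis above can be checked against a concrete sketch. The following is one possible way to write the sorted-list set, assuming that find() returns the index of the element when present and its insertion point otherwise; the toString() method is our own addition so that the sorted order can be observed.

```java
import java.util.ArrayList;

// A sketch of a set backed by a sorted ArrayList (assumed design).
class SortedListSet<T extends Comparable<T>> {
    // Class invariant: list is strictly increasing (sorted, no duplicates).
    private final ArrayList<T> list = new ArrayList<>();

    /** Returns the index of elem if present, else the index where it would be inserted. */
    private int find(T elem) {
        int lo = 0;
        int hi = list.size();
        // Loop invariant: everything in list[0..lo) is < elem,
        // and everything in list[hi..size) is >= elem.
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (list.get(mid).compareTo(elem) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public boolean contains(T elem) {
        int i = find(elem);                       // O(log N)
        return i < list.size() && list.get(i).compareTo(elem) == 0;
    }

    public boolean add(T elem) {
        int i = find(elem);
        if (i < list.size() && list.get(i).compareTo(elem) == 0) {
            return false;                         // already present
        }
        list.add(i, elem);                        // O(N) shift in the worst case
        return true;
    }

    public boolean remove(T elem) {
        int i = find(elem);
        if (i == list.size() || list.get(i).compareTo(elem) != 0) {
            return false;                         // not present
        }
        list.remove(i);                           // removal by index; O(N) shift
        return true;
    }

    public int size() {
        return list.size();
    }

    @Override
    public String toString() {
        return list.toString();
    }
}
```

The binary search itself never touches more than \(O(\log N)\) entries; the linear cost comes entirely from list.add(i, elem) and list.remove(i).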
TreeSet
We know from previous lectures that binary search trees support add(), contains(), and remove() methods with an \(O(\textrm{height})\) worst-case time complexity, and that balanced binary trees achieve an \(O(\log N)\) complexity for these operations. We can leverage this to give a Set definition (a TreeSet) based on a composition relationship with a BST that achieves these same performance guarantees.
Note that this code is almost identical to the ListSet but with list swapped out for tree. Both of these classes delegate most responsibility to their field data structures. This illustrates the benefits of abstractions and helps to underscore the importance of understanding the performance guarantees of different data structures. If our BST is balanced, and if our elements are Comparable, then simply switching out the data structure with which we compose (from List to BST) offers us an exponential performance improvement with minimal additional work.
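To illustrate how little changes when the backing structure is swapped, here is a small sketch. The course’s own BST class is not reproduced here, so this sketch substitutes java.util.TreeSet (a red-black tree from the standard library) as the balanced tree; the class name TreeBackedSet is our own placeholder.

```java
import java.util.Iterator;
import java.util.TreeSet;

// A sketch of a tree-backed set that delegates to a balanced search tree.
// Assumption: java.util.TreeSet stands in for the lecture's balanced BST.
class TreeBackedSet<T extends Comparable<T>> implements Iterable<T> {
    private final TreeSet<T> tree = new TreeSet<>();

    public boolean add(T elem) {
        return tree.add(elem);        // O(log N); false if already present
    }

    public boolean contains(T elem) {
        return tree.contains(elem);   // O(log N)
    }

    public boolean remove(T elem) {
        return tree.remove(elem);     // O(log N); false if absent
    }

    public int size() {
        return tree.size();
    }

    @Override
    public Iterator<T> iterator() {
        return tree.iterator();       // happens to be ascending order here
    }
}
```

Every method is a one-line delegation, which mirrors the point made above: the performance guarantees come from the data structure we compose with, not from the wrapper.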
A Set definition backed by a balanced tree offers the optimal worst-case performance guarantees. In the next lecture, we’ll introduce another data structure, the hash table, that will allow us to greatly improve the expected performance of the Set operations. For today, we’ll next consider how to augment our Set definitions to support some other useful set operations.
Additional Set Operations
Mathematical sets support some additional basic operations that we may wish to support on our Set data types. All of the operations that we’ll consider are operators, meaning they take in one or more sets as inputs and return a new set as their output. In our brief overview, we’ll only consider adding these methods to the ListSet class. We leave it to you to extend these ideas to our other Set implementations, which we walk through in the lecture exercises.
Union
The first operation we’ll consider is the set union.
Given two sets \(S\) and \(T\), their union \(S \cup T\) consists of all elements that belong to either \(S\) or \(T\) (or both of these sets).
We can model the union operation with the following method of the ListSet class.
ListSet.java
Throughout, we’ll use \(N\) to denote this.size() and \(M\) to denote other.size(). For our complexity analysis, we’ll also assume that \(N \geq M\), since we can always switch the sets in constant time if this is not true. One straightforward way to define union() is to simply add() the elements of this and other to a new set.
ListSet.java
What will be the worst-case runtime of this method? This will occur when there are no common elements between this and other (in other words, this and other are disjoint), which will cause union to be as large as possible. In this case, the \(N+M\) add() calls will have runtime bounded by the \(N+M\) contains() calls. Since the \(i\)-th of these calls searches a union that already holds \(i\) elements, the total runtime is
\[ \sum_{i=0}^{N+M-1} O(i) = O\big( (N+M)^2 \big). \]
This approach performs a lot of unnecessary work. Notice that during the first loop, we call add() for each element of list, which will do \(O(N)\) contains() checks to make sure none of these elements are in union. However, we know that the elements of list are distinct, so they will definitely not already be in union during their add() call. We can skip these contains() checks and add the elements directly to union’s backing ArrayList. Similarly, when we are adding the elements of other, we only need to check whether they are in list, not in the larger union; we know that they cannot equal an earlier-added element of other since other has distinct elements. Overall, these modifications lead to the following definition.
ListSet.java
This definition is another great example of encapsulation. The ListSet class exposes only one way for a client to add elements, the add() method. Since the client may attempt to add() any element (perhaps an element already in the set), this method must carry out a (costly) contains() check. In our union() method, we are more than just a client of the ListSet class; we are the class implementer. Therefore, we can access the (private) list field of another ListSet object within this method to perform the add without the check (with the added responsibility of guaranteeing that our short-cut still preserves the class invariant).
The list.add() calls in the first loop each run in \(O(1)\) amortized time for an \(O(N)\) amortized complexity. Then, in each of the \(O(M)\) iterations of the second loop, we perform an \(O(N)\) call to this.contains(). These contains() calls dominate the runtime and result in an overall amortized complexity of \(O(NM)\).
Is this really a big improvement? It depends on the relative sizes of \(N\) and \(M\). If \(N\) and \(M\) are on the same order meaning \(N = O(M)\), then both \(O\big( (N+M)^2 \big)\) and \(O(NM)\) simplify to \(O(N^2)\), and the improvement is only in the constants. On the other hand, if \(N\) is significantly larger than \(M\), there can be a notable performance difference. For example, if \(N = M^2\), then \(O\big( (N+M)^2 \big)\) simplifies to \(O(M^4)\) but \(O(NM)\) simplifies to \(O(M^3)\).
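The two strategies compared above can be placed side by side. The following sketch is our own reconstruction from the prose; the method name unionNaive() is a hypothetical label for the straightforward version, and the ListSet class is trimmed to the members needed here.

```java
import java.util.ArrayList;

// A trimmed list-backed set, just large enough to compare the two unions.
class ListSet<T> {
    // Class invariant: list contains no duplicate elements.
    private final ArrayList<T> list = new ArrayList<>();

    public boolean contains(T elem) {
        return list.contains(elem);              // O(size) linear search
    }

    public boolean add(T elem) {
        if (contains(elem)) {
            return false;
        }
        list.add(elem);
        return true;
    }

    public int size() {
        return list.size();
    }

    /** Straightforward version: every add() re-scans the growing union. */
    public ListSet<T> unionNaive(ListSet<T> other) {
        ListSet<T> union = new ListSet<>();
        for (T elem : this.list) {
            union.add(elem);
        }
        for (T elem : other.list) {
            union.add(elem);
        }
        return union;                            // O((N+M)^2) in the worst case
    }

    /** Optimized version: bypasses add() and writes to the private list directly. */
    public ListSet<T> union(ListSet<T> other) {
        ListSet<T> union = new ListSet<>();
        for (T elem : this.list) {
            union.list.add(elem);                // elements of this are already distinct
        }
        for (T elem : other.list) {
            if (!this.contains(elem)) {          // only compare against this, O(N)
                union.list.add(elem);
            }
        }
        return union;                            // O(NM) overall
    }
}
```

Both methods leave this and other unchanged and return a new set; only the amount of redundant checking differs.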
We can leverage sorting or an order invariant on the set elements (as in both the SortedListSet and TreeSet) to further improve the performance of union().
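As a rough illustration of why an order invariant helps, the sketch below merges two sorted, duplicate-free lists in a single simultaneous pass, in the style of the merge step of merge sort. The class and method names (SortedUnion, merge) are hypothetical, and the sketch works on plain java.util.List arguments rather than on the lecture’s set classes.

```java
import java.util.ArrayList;
import java.util.List;

class SortedUnion {
    /**
     * Merges two strictly increasing lists into one strictly increasing list,
     * keeping a single copy of any element that appears in both.
     * Runs in O(N + M) time when get() is O(1), as it is for ArrayList.
     */
    static <T extends Comparable<T>> List<T> merge(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp < 0) {
                out.add(a.get(i++));     // smaller element comes from a
            } else if (cmp > 0) {
                out.add(b.get(j++));     // smaller element comes from b
            } else {
                out.add(a.get(i));       // common element: keep one copy
                i++;
                j++;
            }
        }
        while (i < a.size()) {
            out.add(a.get(i++));         // leftover tail of a
        }
        while (j < b.size()) {
            out.add(b.get(j++));         // leftover tail of b
        }
        return out;
    }
}
```

Each iteration advances at least one of the two indices, so no element is ever compared against an entire list, which is where the \(O(NM)\) cost of the unsorted version came from.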
Predicates and Restriction
When we restrict a set, we keep only those elements that satisfy a certain desired property, called a predicate. We end up with a subset: another set in which every element belongs to the original set, though some of the original set’s elements may be absent.
A predicate is a function that maps elements of a particular type to a boolean value. We say that the elements that are mapped to true satisfy the predicate, while the elements that are mapped to false do not satisfy it.
When we restrict a set on a predicate, we obtain the subset formed from the elements of the set that satisfy the predicate (and excluding all elements that do not satisfy the predicate).
We can model a predicate with a simple functional interface.
Predicate.java
We can instantiate this interface with a lambda expression. For example, we can create an isEven predicate over the Integer type by writing:
Now, we can add a restrict() method to our ListSet class that takes in a Predicate and builds a new set out of only those elements that satisfy this predicate. We can similarly optimize this method by operating directly on the list field of our new set.
ListSet.java
Take some time to try completing the definition of this method on your own before looking at our implementation.
restrict() definition
Overall, the runtime of the restrict() method is \(O(N)\) times the complexity of the satisfiedBy() method.
Intersection
Finally, we’ll consider the intersection operation.
Given two sets \(S\) and \(T\), their intersection \(S \cap T\) consists of all elements that belong to both \(S\) and \(T\).
Using the tools that we have already developed (in particular a call to restrict() with a particular Predicate), we can develop a one-line definition of an intersection() method. Take some time to come up with this definition.
intersection() definition
This intersection() definition performs a restriction based on a predicate with an \(O(M)\) time complexity. Thus, it requires \(O(NM)\) time.
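Pulling the last two subsections together, the following sketch shows one way the pieces could fit. The method name satisfiedBy() comes from the text above; everything else (the trimmed ListSet, the RestrictDemo class, and the exact bodies of restrict() and intersection()) is our own assumption about a reasonable implementation.

```java
import java.util.ArrayList;

// A predicate maps each element to true (satisfies) or false (does not satisfy).
@FunctionalInterface
interface Predicate<T> {
    boolean satisfiedBy(T elem);
}

// A trimmed list-backed set with restriction and intersection.
class ListSet<T> {
    // Class invariant: list contains no duplicate elements.
    private final ArrayList<T> list = new ArrayList<>();

    public boolean contains(T elem) {
        return list.contains(elem);
    }

    public boolean add(T elem) {
        if (contains(elem)) {
            return false;
        }
        list.add(elem);
        return true;
    }

    public int size() {
        return list.size();
    }

    /** Returns a new set holding exactly the elements of this set that satisfy p. */
    public ListSet<T> restrict(Predicate<T> p) {
        ListSet<T> result = new ListSet<>();
        for (T elem : list) {
            if (p.satisfiedBy(elem)) {
                result.list.add(elem);   // safe: a subset of distinct elements is distinct
            }
        }
        return result;
    }

    /** Intersection as a restriction: keep the elements that other also contains. */
    public ListSet<T> intersection(ListSet<T> other) {
        return restrict(other::contains);  // each predicate call costs O(M)
    }
}

class RestrictDemo {
    public static void main(String[] args) {
        ListSet<Integer> s = new ListSet<>();
        for (int i = 1; i <= 6; i++) {
            s.add(i);
        }
        Predicate<Integer> isEven = x -> x % 2 == 0;
        System.out.println(s.restrict(isEven).size());  // prints 3 (the elements 2, 4, 6)
    }
}
```

The method reference other::contains is itself a Predicate, which is what makes the one-line intersection() possible.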
Maps
Now, we’ll consider another ADT called a Map (or a Dictionary) that is closely related to a Set. Similar to a set, a Map consists of an unordered collection of distinct items called its keys. However, a Map contains additional information as well. Each key is associated with a value.
A Map is an unordered collection of distinct keys of one type that are each associated with one value of another (potentially different) type. We call one associated (key, value) pair an entry in the map.
Typically, the keys will be smaller, simpler objects whose primary purpose is to enable quick access to their entry in the map. The values will often be larger or more complicated objects that model richer information about the entry. For example, in a physical dictionary, an entry consists of a word (its key) and an arbitrarily long description of various aspects of that word (its one or more definitions, parts of speech, example usages, etymology, etc.) that make up its value. When we want to look something up in the dictionary, the alphabetical organization of its keys allows us to quickly locate our desired entry. Once we have located the entry, we interact mainly with its value.
This (key, value) abstraction provides a convenient way to organize large, complicated data entries, so it is ubiquitous in many software systems. For example, our course grade book uses Maps to associate each student netID with an inner Map that associates each assignment with a grade. Because of their close connection to spreadsheets and other tabular data, we can visualize Maps using two-column tables, in which the first column contains the keys and the second column contains the values. For example, a nutritional app may create a Map that associates different foods with their calorie count, and we can visualize this map as follows:
| Key (String) | Value (Integer) |
|---|---|
| Avocado (2 tbsp) | 40 |
| Broccoli (1 cup) | 31 |
| Chicken (8 oz) | 543 |
| Pineapple (1/2 cup) | 40 |
| Swiss Cheese (1/4 cup) | 110 |
| ⋮ | ⋮ |
When first encountering Maps, a common source of confusion is determining which type serves as the key and which type serves as the value. Filling in the following sentences can be helpful:
"In my map, every ( Key ) is paired with exactly one ( Value )." Remember, the keys in a map are distinct, but the values do not need to be (look at the calorie counts of pineapple and avocado in the above table). Here, every food has one calorie count, but not every calorie count belongs to exactly one food, so the foods must be the keys.
"In my map, I want to use a given ( Key ) to look up its ( Value )." Remember that keys are used for searching and values hold additional information. We'll see that the client passes keys into a Map and retrieves values. In this case, the user will enter foods to look up their calorie counts, so the foods must be the keys.
Some cases will be less clear than this example, and still others might require maps in both directions. Whenever you are working with a map but lacking the functionality that you need, it can be helpful to step back and think through these questions to check whether a map is useful in that scenario.
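To make the calorie example concrete, here is a small sketch that builds the table above. It uses the standard library’s java.util.Map and java.util.TreeMap rather than the Map interface developed in this lecture, and the class name CalorieLookup is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class CalorieLookup {
    /** Builds the food -> calorie map from the table above. */
    static Map<String, Integer> buildTable() {
        Map<String, Integer> calories = new TreeMap<>();
        calories.put("Avocado (2 tbsp)", 40);
        calories.put("Broccoli (1 cup)", 31);
        calories.put("Chicken (8 oz)", 543);
        calories.put("Pineapple (1/2 cup)", 40);     // same value as avocado: allowed
        calories.put("Swiss Cheese (1/4 cup)", 110);
        return calories;
    }

    public static void main(String[] args) {
        Map<String, Integer> calories = buildTable();
        // The client passes in a key (a food) and gets back its value (a calorie count).
        System.out.println(calories.get("Chicken (8 oz)"));  // prints 543
    }
}
```

Two foods share the value 40 without any conflict, but putting a second entry under an existing food name would overwrite the first, which is the behavior we expect when foods are the keys.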
The Map ADT
Our Map interface is the first example that we have seen of using multiple generic types. It will have both a generic key type, K, and a generic value type V.
Map.java
What operations should our Map support? First, a client should be able to add entries to the map by specifying their key and value. Similarly, a client may wish to update an entry with a given key to have a new value. We will capture both of these behaviors through a single method, put().
Map.java
For most of the other operations, the client will pass only the key to the Map. They’ll need a containsKey() method to check whether a particular key has an associated entry. Once they know that a key is present, they can use this to access its associated value (either without modification using get() or with modification using remove()).
Map.java
Finally, we’ll want a method to return the size() of the map, the number of (key, value) pairs it stores, as well as a method to efficiently iterate over the entries in a map by obtaining a set of its keys.
Map.java
Implementing a Map
Central to every Map implementation is an Entry class to represent (key, value) pairs, which we model as a nested record class.
Once we have this Entry class, a Map is essentially a Set of Entrys, so it can be defined using any of the data structures that we considered earlier (a list, a sorted list, a BST, or a hash table that we will discuss in the next lecture). Unfortunately, though, we cannot easily leverage a composition or inheritance relationship to reuse the Set operations in our Map definition. This is because the distinction between Entrys (the type stored in the Map) and their Keys (the type used to traverse the data structure) causes a new level of indirection that our previous code is unequipped to handle. As a result, our Map implementations will follow the same logic as our Set implementations, just with a few small changes to accommodate keys, values, and entries.
We give a complete definition of a ListMap class backed by an unordered list of Entrys below, and leave the other definitions as exercises.
ListMap.java
In this implementation, we extract the common subroutine of locating the map entry with the given key into a private find() helper method. This handles one indirection between keys and entries. This method returns a reference to an Entry object, the type stored in the list field, and we can use the reference to modify (in put()), access a value of (in get()), or remove (in remove()) an entry.
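The role of the find() helper can be seen in the following sketch of a list-backed map. Several details here are assumptions rather than the course’s code: Entry is written as a small mutable nested class (instead of a record) so that put() can overwrite a value in place, get() and remove() throw NoSuchElementException for a missing key, and keySet() is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.NoSuchElementException;

// A sketch of a map backed by an unordered list of entries.
class ListMap<K, V> {
    // Assumption: a mutable nested class standing in for the lecture's Entry type.
    private static final class Entry<K, V> {
        final K key;
        V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Class invariant: no two entries in list have equal keys.
    private final ArrayList<Entry<K, V>> list = new ArrayList<>();

    /** Returns the entry whose key equals the given key, or null if there is none. */
    private Entry<K, V> find(K key) {
        for (Entry<K, V> e : list) {     // linear search, O(N)
            if (e.key.equals(key)) {
                return e;
            }
        }
        return null;
    }

    /** Adds a new entry, or updates the value of an existing entry with this key. */
    public void put(K key, V value) {
        Entry<K, V> e = find(key);
        if (e == null) {
            list.add(new Entry<>(key, value));
        } else {
            e.value = value;
        }
    }

    public boolean containsKey(K key) {
        return find(key) != null;
    }

    public V get(K key) {
        Entry<K, V> e = find(key);
        if (e == null) {
            throw new NoSuchElementException("no entry for key");
        }
        return e.value;
    }

    /** Removes the entry with this key and returns its value. */
    public V remove(K key) {
        Entry<K, V> e = find(key);
        if (e == null) {
            throw new NoSuchElementException("no entry for key");
        }
        list.remove(e);                  // removal by reference; another O(N) pass
        return e.value;
    }

    public int size() {
        return list.size();
    }
}
```

All four key-based operations funnel through find(), so the key-to-entry indirection is handled in exactly one place.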
A SortedListMap definition follows a very similar design (see Exercise 19.8). A TreeMap definition is more complicated, since we will need to re-implement the BST to incorporate the Entry-Key indirection into its private find() method (see Exercise 19.9).
Dynamic Priority Queues
As a nice application of a Map, we’ll end today’s lecture by enhancing the heap-backed priority queue implementation that we wrote in the previous lecture. In particular, we will make the priority queue dynamic by allowing the client to update the priority of an element while it is queued. This, for example, can model an emergency room patient who experiences new symptoms in the waiting room and must be moved ahead of other patients.
To enable priority updates, we must add a restriction that the elements in the priority queue are distinct (so that we can unambiguously update their priorities). We’ll need to update the spec of our add() method to require that the entry is not already present in the queue. We’ll also add an update() method that allows the client to modify the priority of an element present in the queue.
DynamicPriorityQueue.java
Some sources prefer to define a single method (similar to put()) that is used both to add new elements to the priority queue and adjust the priorities of existing elements. We chose to use two separate methods so that our DynamicPriorityQueue is a more natural subtype of a PriorityQueue.
Now, we must consider how to implement these new methods. The heap order invariant doesn’t provide an efficient way to search for a particular element. If our target element is not the maximum element, then it can be in either of the heap’s subtrees. Thus, any search may need to visit all entries of the heap. To shortcut this process, we can use a Map to associate the entries with their indices in the heap. In particular, we’ll use Java’s TreeMap, which supports worst-case \(O(\log N)\) put(), get(), containsKey(), and remove() operations. We end up with the following state representation.
MaxHeapDynamicPriorityQueue.java
Note that we cannot leverage our Heap class from the previous lecture, since we will need to rewrite many of the Heap methods to update the index and preserve the class invariant. We leave it as an exercise to complete the definition of this dynamic priority queue and analyze the complexity of its methods (see Exercise 19.10). A dynamic priority queue will play a central role in our implementation of Dijkstra’s shortest path algorithm in a few lectures.
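For readers who want to see the index map in action before attempting the exercise, here is one possible sketch. It is not the course’s reference solution: the method signatures (add(elem, priority), update(elem, priority), peek(), remove()), the use of double priorities, the field names heap and index, and the exceptions thrown are all our own assumptions.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// A sketch of a max-heap priority queue whose priorities can be updated.
class MaxHeapDynamicPriorityQueue<T extends Comparable<T>> {
    private static final class Entry<T> {
        final T elem;
        double priority;

        Entry(T elem, double priority) {
            this.elem = elem;
            this.priority = priority;
        }
    }

    // Invariants: heap satisfies the max-heap order on priorities, and
    // index.get(e) is the position of e's entry in heap for every queued e.
    private final ArrayList<Entry<T>> heap = new ArrayList<>();
    private final Map<T, Integer> index = new TreeMap<>();

    /** Swaps two heap positions and keeps the index map consistent. */
    private void swap(int i, int j) {
        Entry<T> a = heap.get(i);
        Entry<T> b = heap.get(j);
        heap.set(i, b);
        heap.set(j, a);
        index.put(b.elem, i);
        index.put(a.elem, j);
    }

    private void bubbleUp(int k) {
        while (k > 0) {
            int parent = (k - 1) / 2;
            if (heap.get(k).priority <= heap.get(parent).priority) {
                break;
            }
            swap(k, parent);
            k = parent;
        }
    }

    private void bubbleDown(int k) {
        int n = heap.size();
        while (true) {
            int left = 2 * k + 1;
            int right = left + 1;
            int largest = k;
            if (left < n && heap.get(left).priority > heap.get(largest).priority) {
                largest = left;
            }
            if (right < n && heap.get(right).priority > heap.get(largest).priority) {
                largest = right;
            }
            if (largest == k) {
                return;
            }
            swap(k, largest);
            k = largest;
        }
    }

    public void add(T elem, double priority) {
        if (index.containsKey(elem)) {
            throw new IllegalArgumentException("element already queued");
        }
        heap.add(new Entry<>(elem, priority));
        index.put(elem, heap.size() - 1);
        bubbleUp(heap.size() - 1);
    }

    public void update(T elem, double priority) {
        Integer k = index.get(elem);           // O(log N) lookup instead of O(N) search
        if (k == null) {
            throw new NoSuchElementException("element not queued");
        }
        heap.get(k).priority = priority;
        bubbleUp(k);                           // moves only if the priority increased
        bubbleDown(index.get(elem));           // moves only if the priority decreased
    }

    public T peek() {
        return heap.get(0).elem;
    }

    public T remove() {
        T max = heap.get(0).elem;
        swap(0, heap.size() - 1);
        heap.remove(heap.size() - 1);          // removal by index from the end, O(1)
        index.remove(max);
        bubbleDown(0);
        return max;
    }

    public int size() {
        return heap.size();
    }
}
```

Because every movement inside the heap goes through swap(), the index map is repaired at the same moment the list changes, which keeps both invariants in step.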
Main Takeaways:
- A Set is an unordered collection of distinct elements. In its most basic form, the Set interface supports add(), contains(), and remove() operations.
- There are many data structures that can implement a Set. List-backed implementations have poor performance due to inefficient searching (when unsorted) or memory shifting (when sorted). Balanced BSTs enable a Set implementation with \(O(\log N)\) operations.
- Predicates filter the elements of a set, can be instantiated using lambda expressions, and form the basis for many set operations.
- A Map consists of a collection of (key, value) entries. The keys are unique and are used to navigate the collection. The values store additional information related to the keys.
- The two primary map operations are put() (to add or modify map entries) and get() (to access the value associated with a key).
- An index map associating elements with their heap indices can be used to enable updating priorities in a dynamic priority queue.
Exercises
Integers into both a Set<Integer> set and a CS2110List<Integer> list in the same order. Which of the following must be true?
Suppose we carry out the following operations on a map:
map.get("y")?
m, key k, and value v. Which of the following can you guarantee to be false in m?
Set implementations (ListSet, SortedListSet, and TreeSet):
Iterator must comply to? If not, is there a natural order that clients would want to have?
iterator().
ImmutableSet being returned.
Implement the constructor.
contains() and size().
Implement add() and remove(). Keep in mind that these methods should not modify the backing representation of this.
union()s
union() method. Implement the changes and update the rest of the method to leverage them. State the new runtime complexity.
ListSet.union(), we begin by sorting this.list.
merge() method of merge sort, we merge the two lists together in SortedListSet.union(). Instead of adding duplicate elements to the resulting list, we’ll discard them.
TreeSet.union(), first clone other using a pre-order iterator. Then use simultaneous in-order traversals of both the clone and this BST to add additional elements to the clone so it contains the union of both original BSTs.
Predicate.
String is a palindrome.
Point (refer to Lecture 11) is in quadrants 2 or 4 and not on any axes.
CS2110List<T> satisfies a different predicate Predicate<T> check.
restrict.
The symmetric difference of two sets \(A\) and \(B\), denoted \(A \triangle B\), consists of all the elements in \(A \cup B\) that are not in \(A \cap B\).
p1 and p2, apply these on a set s using the following ways.
restrict().
restrict() one time on a lambda expression that combines p1 and p2.
restrict() and intersection().
SortedListMap
Map can be implemented with a sorted list.
Write a helper method findKey() that returns the index of a given key or -1 if not in the list. This should run in \(O(\log N)\) time.
findKey() to implement containsKey() and get().
put() and remove(). Keep in mind the invariant of list.
keySet().
size.
TreeMap
MaxHeapDynamicPriorityQueue
MaxHeapDynamicPriorityQueue. Recall that this data structure supports changing the priority of elements within a max heap. It maintains a max heap and a map associating each heap element with its index in the list.
Trace the state of the map and heap of a MaxHeapDynamicPriorityQueue<Character> pq after each line is executed.
assertInv() method.
In our implementation of a max heap, we had a helper method swap() to swap two indices in the backing list. Now that we have another field that must be consistent with the list, the map must also be updated on a swap.
add() and remove() to re-satisfy the class invariant of index.
update().