CS410, Summer 1998, Homework 4

Where do the Most Students Live

Due: 11:35 AM, Wednesday July 22

Last Updated: 15 July, 9:45 AM.

Note: You are responsible for following the course policy on academic integrity.

Purpose

The purpose of this assignment is to use hash tables and heaps to solve efficiently some interesting problems. It is also to emphasize that although space is often plentiful, it should not be wasted without reason. Other goals include learning to use standard libraries and writing polymorphic data structures.

The Fun Part

The following two files contain the entire student directories for Cornell University and Ithaca Kollege, respectively:

cornell.txt (original file deleted)
ithaca.txt (original file deleted)

I want to know which addresses have the most Cornell students living at them. This should be done in time O(n log m) assuming hashing behaves, where n is the number of students and m is number of most-common-addresses asked for. I also want to know what addresses have both Cornell students and Ithaca students living at them. This should be done in time O(n) where n is the number of total students, assuming hashing behaves. Space requirements are discussed in more detail below.

Overview

You have 3 main things to write, all static methods in the provided class called BigHouses.

printBigHousesOnePass which solves the most-common-address problem while looking at the input file only once.
printBigHousesTwoPass which solves the most-common-address problem while looking at the input file twice.
printIntersection which solves the addresses-with-people-from-each problem while looking at each input file only once.

The first two methods take 3 arguments: the name of the file to be used as input, a value for m (the number of most-common-addresses desired), and a PrintWriter to which output should be sent. The last method takes 3 arguments: the two input file names, and a PrintWriter object.

You should use the Hashtable class provided by the java.util library. You will also need a Heap; you should write your own. Your Heap class should be polymorphic over different kinds of objects that have integer keys. This is described more below.

Here are more specific requirements for each part:

printBigHousesOnePass
- It must run in time O(n log m).
- It must not look at the input file more than once.
- The output should have the following format: Print the address of a place on one line. Then print the names of all the people living there, one-per-line, indented by a single tab. The m most common addresses must be in order with the most common appearing first. Here is an example:
```
15 YELLOW CARTOONS ROAD, SPRINGFIELD
       HOMER SIMPSON
       MARGE SIMPSON
       BART SIMPSON
       LISA SIMPSON
       MAGGIE SIMPSON
       ITCHY
       SCRATCHY
100 STONE STREET, ITHACA, NY 14850
       FRED FLINTSTONE
       WILMA FLINTSTONE
       BAM BAM
```
printBigHousesTwoPass
- It must run in time O(n log m).
- It must not look at the input file more than twice.
- The output should have the same format as the previous part.
- It must not take more space that O(a + b) where a is the number of distinct addresses in the input and b is the number of people living in the m most common addresses. Notice this means you cannot store all of the names in memory.
printIntersection
- It must run in time O(n) where n is the combined size of the two input files.
- It's output should have each address that appears in both input files on a separate line. You should not output any names for this part of the assignment.
- No address should be printed more than once.
- It must use space O(a) where a is the number of distinct addresses in the second input file, not in the two input files combined.

The turn-in procedures, style guidelines, etc. are the same as for the previous programming assignment.

Big Hints

Think of an address as a key and the set of people who live there as a value. So when you encounter a name and address, see if you have seen the address before. If so, add the name to the value for that address. Else, make a new entry in your table of "addresses and who lives there". Then when you're done with the input, walk through all the entries you have and use a heap to keep track of the m with the largest number of people. Then sort the m winners and print them out in reverse-sorted order.

For the second part, realize that you can find the m most common addresses without storing any names. Instead, just keep a counter with each address that maintains how many people are already known to live there. Then after you have the m most common addresses, go back through the input file a second time. However, use a new table that only has the m most common addresses in it. If an address is not in the table, ignore the corresponding name. If it is in the table, add the name to its value as in the previous part.

For the third part, put all of the addresses from the second input into a table. (Users of your method presumably know to make the second input the smaller one.) Then use this to decide if each address in the first input should be printed or not. You can completely ignore the names. Be a little clever to efficiently avoid printing out any address more than once. Note: since all we need are keys, you might be tempted to put null values in Java's Hashtable. This appears to cause errors. Just use a dummy Object instead. Use one dummy Object over and over again; do not waste the time and space to make a new dummy Object for each table entry.

For maintaining a list of names, feel free to use Java's Vector class. A Vector holds Objects, so you have to cast anything you take out of a Vector. The Hashtables are the same way, and your Heaps will be similar, so you will be doing quite a bit of casting.

In the first two parts, you will need Heaps. The trouble is that some Heaps need to hold objects that have addresses and names, while others need to hold objects that have addresses and counts. In order to avoid cut-and-paste, you should write a Heap that works for any class that implements the Heapable interface. We have written this interface for you. An interface is like an abstract class (everything listed in an interface is abstract), but classes may implement multiple interfaces. If you're unfamiliar with interfaces, you can just think of the Heapable interface as an abstract class; the differences aren't relevant to this assignment. Just say implements Heapable instead of extends Heapable in other class definitions. Then your Heap should manipulate Heapable objects. For example, deleteMin() should return a Heapable object. Of course, code using the Heap will need to cast this to a particular kind of Heapable object so it can use the actual data in the object.

Do not bother providing a buildHeap method in your Heap class; this is not an efficiency bottleneck for this assignment. You should, however, provide methods getMinKey() and replaceMin() for efficiency reasons.

Input and Output

The input files are in a very simple format. There is one name or address per line. It alternates name, address, name, address, etc. The name corresponds to the address on the following line. The provided code has a method that takes a file name and returns a BufferedReader object. This object has a readLine() method which returns a String which is the next line of the file. It returns null after the end-of-file has been reached.

For output in your methods, all you have to do is call the println(String s) method of the Printwriter object passed to the methods.

The methods described above could throw an IOException. Hence the methods you are writing also declare that they may throw such an exception. You should not catch such exceptions.

If you are using a version of Java that doesn't have BufferedReader, it should be simple enough to write your own readLine() method for an input object.

Don't forget that your methods should work for any files in the format described above. Only the main method should explicitly mention the files being used.

We assume that two addresses are equal if and only if they are identical Strings. Hence the following are all different addresses as far as we're concerned:

123 SESAME STREET APT. #2
123 SESAME STREET APT. #1
123 SESAME STREET APT. No. 1

Code We're Providing You

Extra Credit

Change the Heapable interface so that getKey() returns an Object. You should also add to the interface a method

boolean lessThan(Object
otherKey)

. Modify your Heap to work with this new definition of Heapable. You will also have to modify your implementations of Heapable and uses of Heaps, especially because an int is not an Object. (So you will have to use the Integer wrapper class as you would with, for example, Hashtables.) The advantage of doing this is that your Heap will work for data objects that have keys that are not ints.

Food For Thought

The main method provided to you gives very rough timing information for the first two parts of the assignment. Do the results surprise you? How do you explain them?
Your Heap seems to be a min-heap (meaning it has deleteMin instead of deleteMax). But how could you use it as a a max-heap without changing it at all?
Were you surprised by the second most-common address among Cornell students? Do you know why it is what it is?