Assignment 3: Spreadsheet Join Utility
Due: Thursday, September 29

Learning Objectives

Introduction

Spreadsheets are often used as a way to store and manipulate information. Sometimes information ends up spread across multiple spreadsheets, making it useful to be able to combine information from them. For example, if we had a spreadsheet listing the Gross Domestic Product (GDP) of all 50 states, and another spreadsheet containing median income of the residents of those states (not necessarily listed in the same order), it would be handy if we could easily generate a single spreadsheet containing both kinds of information for each state.

Joins

This operation corresponds to what is called a join in the database area. Like a spreadsheet, a database is made of tables with rows and columns. A join combines two tables to produce a new table, based on information matching in one or more columns. There is more than one kind of join, but we will be interested in computing a left join using the first column of each spreadsheet, under the assumption that the first column of each spreadsheet represents the same data item. For each row in the first table, the left join identifies every row in the second table with a matching first column. Then for each matching row, the new table contains a row that concatenates the row from the first table with the matching row's contents except for the first column. If there is no such row in the second table, a single row is added to the new table, but with empty entries for columns that would have come from the second table.

For example, consider the following two tables. The first one records some (past and present) capitals of states; the second has some population and economic data:

StateCapital
New YorkAlbany
CaliforniaSacramento
FloridaTallahassee
TexasAustin
TexasHouston
VermontMontpelier
StatePopulationGDP
New York19.51500
Florida21.21150
California39.43400
Texas28.62000
North Dakota0.856

The left join of these two tables is the following table:

StateCapitalPopulationGDP
New YorkAlbany19.51500
CaliforniaSacramento39.43400
FloridaTallahassee21.21150
TexasAustin28.62000
TexasHouston28.62000
VermontMontpelier

The order of rows in the result table always matches the order in the first table. There is an incomplete Vermont row because there is no Vermont row in the second table; there is no North Dakota row because North Dakota does not occur in the first table. Note that if there had been two “Texas” rows in the second table, the result table would contain four Texas rows: two Austin, Texas rows followed by two Houston, Texas rows.

CSV Input Format

Your program will need to be able to read in such tables in a simplified CSV (comma-separated values) format that is compatible with many spreadsheet programs. In this format, values on each row are separated by commas. For example, the first table above would look like the following in CSV format:

State,Capital
New York,Albany
California,Sacramento
Florida,Tallahassee
Texas,Austin
Texas,Houston
Vermont,Montpelier

Note that values may contain space characters, as in “New York”. The first row of this table is serving as a header row, but it will not be treated specially by your program; it will be just like any other row. For simplicity, we will require the CSV input to only contain values that do not use any special characters such as commas or newline characters. This restriction will allow you to break a line into a sequence of values just using the method String.split().

You will need to read this kind of data from the input file using the Java I/O library and to extract the data into a list of rows, each of which is a list of values. You must use the linked list data structure that you are implementing as part of this assignment to store the table data.

Command-Line Syntax

Your program should take the names of two files as command-line arguments, which can be set in the IntelliJ IDEA run configuration. For example, if you set the program arguments to “file1.csv file2.csv”, the strings "file1.csv" and "file2.csv" will then be available in the arguments array passed to the a3.Main.main method, at positions 0 and 1.

The new table that results from a join of the two tables should be output to the console, and this should be the only output to System.out. The output must be in the format of a valid CSV file like the ones your program reads. Notice that you should not try to use the toString() method to generate output, because it does not produce output in the format of a CSV file.

If the program is run with the wrong number of arguments, or if one of the two specified files does not exist, an appropriate helpful error message should be printed and the program should exit cleanly. A stack trace is not considered a helpful error message.

Optionally, you can run your program outside IntelliJ using a shell like bash or PowerShell. Run your program in the top top directory of your IntelliJ project with a command like the following:

java -classpath out/production/a3 a3.Main file1.csv file2.csv

Output redirection can be used to write the program output to a file:

java -classpath out/production/a3 a3.Main file1.csv file2.csv > output.csv

Linked Lists

In this assignment you will be implementing a list abstraction whose interface is called LList. The implementation will be in the class SLinkedList. To promote loose coupling, client code should mostly use lists through the LList interface.

The abstraction and implementation are both parameterized over an element type T, so they will work on any boxed type T. Inside the implementations of LList and SLinkedList, the type T stands for an unknown type that is determined when these abstractions are instantiated on a particular type. For example, the signature of LList<Integer>.head() looks just like the signature of LList<T>.head() except that T is replaced by Integer. Your code for SLinkedList will be written in terms of an unknown type T, and therefore needs to be written in a way that works for any legal choice of type T.

This list abstraction will be implemented using a linked-list data structure whose nodes are objects of the class Node<T>, which has been provided for you. Your main implementation challenge is to implement most of the methods of the SLinkedList class.

You will notice that SLinkedList already has an iterator() method implemented for you. This method makes it possible to use an SLinkedList or LList in an “enhanced for-loop”, which will be helpful to client code such as your main program. Enhanced for-loops support easy iteration over arrays and other data structures. For example, you can write a loop like the following to iterate over and printout all the integers in an array:

int[] a = ...
for (int s : a) {
    System.out.println(s);
}
Similarly, you can iterate over all the integers in a LList<Integer>:

LList<Integer> lst = ..
for (Integer s : lst) {
    System.out.println(s);
}

You may not use any classes from the java.util package in your implementation of SLinkedList. Importing interfaces and exceptions is fine, however.

Academic Integrity and Collaboration Policy

You may not collaborate with anyone on the code for the classes in this assignment— this includes looking at anyone else’s code or showing anyone else your code in any form. While you can talk to others, this should not be code specific but rather general strategies on how to tackle a problem. You may not look at solutions to similar previous assignments in 2110.

In this assignment, as in A2, you may work with another student with whom you will share test cases; however, for A3 it is optional to have a testing partner. You and this student can work together to write test cases for testing the classes. However, you and your partner may not share any code for the classes themselves.

If you do not know where to start, are stuck, do not understand testing, or are lost, please see someone on course staff immediately. This can be in the form of an instructor, TA, or consultant.

How to do this Assignment

To set up your project, download the provided file a3.zip and unzip it into a directory named a3, which will be the root of your IntellJ project. You can import it into IntelliJ by either dragging the folder onto the IntelliJ program or by using IntelliJ's “New Project from Existing Sources”, and pointing it at this folder. After it is set up, you should have a top-level folder in your IntelliJ project, with src and tests folders inside it. Inside the src folder, there should be a folder named a3 that contains Java source files. The src folder should be colored blue to indicate that it is a source folder.

a3
 ├─src
 │  └─a3
 │     ├─Main.java
 │     ├─LList.java
 │     └─...
 │
 └─tests
    └─a3
       └─SLinkedListTest.java

If your folders look more like the picture below, it won't work! Everything needs to be moved up a level to get rid of the extra a3 layer.

a3
 └─a3
    ├─src
    │ └─a3
    │   └─...
    │
    └─tests
        └─...

Scan the entirety of these assignment instructions before starting. The three assignment classes are commented with various TODO statements numbered 1 through 9. We recommend designing your classes and tests in this order to avoid confusion.

Testing will be an important part of this assignment. An important part of design is to develop your classes and test them incrementally. This means you should test each part of the code you write before continuing to the next implementation task.

Since you are building your CSV processor on top of your linked-list implementation, bugs in the linked list are likely to show up in the CSV processor, making the program hard to debug. Careful testing of the linked list is advised so that you have a solid foundation to work on.

We will pin a post on Ed with FAQs about this assignment. This post will be updated regularly with answers to questions including notes about what you can or can't use in your solution. Please read this post regularly to stay up to date on this information.

  1. At this point, you should have all of the release code classes in your project. The first step is to read the method specifications in the interface LList.java. Two of the methods, remove and contains, do not have specifications. You are expected to write specifications for these methods, using your engineering judgment. A description of the methods can be found below.
  2. Once you have identified and fixed the specifications, you are now ready to set up your test suite. Please see the section on testing for detailed instructions. You should write thorough tests for each method. You will need to use toString() to test some methods.
  3. Once your test suite is set up, it is time to start implementing the methods in SLinkedList.java. The methods that you have seen in lecture are provided to you here. You will implement the following additional methods of an SLinkedList:
    1. classInv(): A partial implementation of this standard method for checking the class invariant is already provided, but it is incomplete. Think carefully about what your class invariant is and complete the necessary checks. You should use classInv() in your other method implementations.
    2. toString(): You must use a StringBuilder, which is already initialized for you.
    3. append(T v): This method must run in constant time rather than iterating through the list.
    4. insertBefore(T x, T y): You must use a while-loop.
    5. get(int k): You must use a for-loop.
    6. contains(T elem): This method should return true if elem is in the linked list, and false otherwise.
    7. remove(T x): This method returns whether x was removed or not. If x is successfully removed, the method should return true, otherwise it should return false. If there are more than one items in the list that have value x, the method should remove the first instance of x from the list.
  4. You are now ready to start coding your Main class. In this class you will do 3 things:
    1. Write a method called csvToList that takes in a CSV file name as a string and returns a LList<LList<String>>. This method should use a FileReader to read each line of the CSV and separate it by commas.
    2. Write a method called join that takes in 2 CSV tables of type LList<LList<String>>. The method should then return a single LList<LList<String>> that is a table that combines the two input tables, retaining the row order of the first table.
    3. Write your main method so that it uses the two previous methods, and perhaps other methods you define, and outputs the resulting LList<LList<String>> table in CSV format to standard output (System.out).
    You may write any other helper methods you would like to add.

Testing

Unlike A1 where you tested using print statements, in A3, you will use JUnit tests to test your SLinkedList methods.

  1. There should be a folder called “tests” outside the a3 folder. Right-click the tests folder and select from the dropdown: Mark directory as → Test Sources Root
  2. In the SLinkedListTest.java file, create at least 3 different tests for each method that you implement.

Notes on testing:

How to Test CSV Left Join

There are six CSV files provided to you in the release code (in the inputs-tests/example and input-tests/states folders). In each folder you will find two files named input1.csv and input2.csv, and a file output.csv showing the expected output of the program on those input files.

To create your own CSV files that you can submit as test cases, open a Google Sheets file or Microsoft Excel file. Write name columns on the first row and fill in the corresponding data in the subsequent rows. To save as a CSV file:

Once you have your CSV files, drag them into a new folder under the input-tests folder. Each such folder should include input files input1.csv and input2.csv, and the expected output output.csv. You can create the output file by editing it manually, by copying the console output from running your program, or by using output redirection as described above.

Submission

Please make sure to fill in the time you spent using the timeSpent field in the Main class. Additionally, be sure to fill in your NetID and other information in the comment at the top of the SLinkedList class. Please also provide your testing partner’s NetID after your own. Then upload your files LList.java, SLinkedList.java, and Main.java to Assignment 3 on CMSX before the deadline.

You should also submit your tests to Assignment 3 Tests on CMSX before the deadline in cooperation with your partner if any. Your tests consist of the unit tests you have written in SLinkedListTest.java, plus some system tests, which are CSV files that should be placed in the input-tests folder. Each subfolder should contain files named input1.csv, input2.csv, and output.csv, where the file output.csv contains the expected output from running your program on the inputs input1.csv and input2.csv. The current release of the files already shows this format for two tests.

To locate these files on your computer, you can right click on any of the classes and select “Open In” and then select your file explorer. For Windows, that may be “Explorer” and for Mac, that may be “Finder.” Do not submit any files with different extensions at the end. To check if you submitted the correct files, download the files from CMSX after you have submitted them.