Assignment 7

Part B

Due: March 31 by the end of lecture.

Contents and Checklist

Preliminaries (do Part A first, bonus style points, saving typing and paper)

Introduction

The χ² Test

What To Do

Step 0 Get copies of the files we are giving you.

Step 1

(1) Run tables.m

(2) Compute engineering average.

(3) Compute another college's average. Is there a significant difference between the averages?

(4) Plot the data (this is optional). Is there a significant difference among the colleges?

Step 2

(1) Complete writing chi.m

(2) Compute χ² for each data set (9am, 11am, combined).

Step 3

(1) Run distrib.m

(2) Write a function to look up the percentile of χ² values in charts

(3) Look up the percentiles of the χ² values of the three data sets. Are they significant at the 5% level?

Step 4

(1) Call function truedistrib on the combined data

(2) Look up the percentile of the combined data according to the chart from (1) above. Compare with Step 3.

What to Turn in (Command window for Steps 1 - 4, optional graph, chi.m, lookup function)

Appendix: Generating Random Tables

Preliminaries

Part A: Before embarking on Part B, first do the exercises in Part A to get practice with some of the basic Matlab commands before trying to use them in a larger problem. The grading and guidelines for this assignment are all described in the handout for Part A. If you have trouble running the function you wrote in part A, read the instructions in Step 1 on how to make a directory or folder the working directory.

Bonus style points: We include "bonus points" on this Part for students who want to use more math and more features of Matlab. Although bonus points "erase" style errors, the maximum score for Part B is unchanged (still 3 correctness and 3 style points). Only do these if you have the time and interest. They are not required in order to get a perfect score.

Saving yourself typing and paper: For both Parts A and B, to receive full credit, you are strongly encouraged to work as follows. (1) Type your work into a script file. (Try things out in the Command window and paste the final results into your script file, or type things into the script file and then either run the script file or paste into the Command window). (2) When you are done, run the script file. Be sure to start with echo on so that the commands and results are interleaved in the Command window. That is, type in the words echo on into the command window before running your script file. (3) Turn in the interleaved commands/results in the Command window but not the script file. Reason: This will generate nice output, save paper, and let you save your work as you go. Note: If you already did part A before receiving this handout, you do not need to redo your work as described here.

Introduction

For this part of the assignment you are to do a statistical analysis of the sleep survey that we gave you in class. The idea is to determine whether there is any connection between the college a CS100 student is in and the amount of sleep he/she gets. In order to do this we need to teach you a little bit of statistics (but not very much).

It is commonly believed that students in some colleges and/or majors get less sleep than other students. Each group of students, of course, is firmly convinced that they work the hardest, particularly after pulling an all-nighter. So for fun we decided to take a survey in CS100. The survey you filled out asked how many hours of sleep you got (on average) and what college you were in. The results of our survey for the 9am lecture can be seen in the following table. The columns give the number of hours of sleep per night, the rows give the colleges.

Average Hours of Sleep Per Night

College                         ≤ 4     5     6     7   ≥ 8
Ag. & Life Sciences               1     4     4     9     0
Architecture, Art & Planning      0     0     0     0     0
Arts & Sciences                   1     6    11    13     4
Engineering                       5    18    24    32     5
Hotel                             0     0     2     0     0
Human Ecology                     0     1     1     1     0
ILR                               0     0     0     0     0

This table, plus a similar one for the 11:15 lecture, will be given to you in a file. You will also combine the two to get a table containing totals.

How should we analyze this data for significance? Even if we "see" a difference among the rows, is it large enough to mean anything? When we do statistical tests we always get some variation. For example, if we flip a fair coin 1000 times, we would expect to get roughly a 50/50 split, but we do not expect to get exactly 500 heads and 500 tails. The question is whether the differences are larger than one would expect by chance. In the case of our survey, if the differences are sufficiently large, we would conclude that there is likely some connection between the college a student is in and the amount of sleep he/she gets. We would then conclude that the two variables (sleep and college) are dependent rather than independent.
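To see this kind of chance variation for yourself, you can simulate the coin flips in Matlab. The following is only a sketch; the variable names and the choice of five trials are ours and are not part of the assignment.

```matlab
% Simulate 1000 flips of a fair coin, five times over.
nFlips  = 1000;
nTrials = 5;
flips = rand(nTrials, nFlips) < 0.5;   % 1 = heads, 0 = tails
heads = sum(flips, 2)';                % number of heads in each trial
disp(heads)                            % typically near 500, rarely exactly 500
disp(heads - nFlips/2)                 % deviation from a perfect 50/50 split
```

Each run gives different numbers. The point of a statistical test is to decide when a deviation is too large to blame on this kind of variation.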

A standard measure of independence that is often applied to tables like this is known as the χ² test (read "chi-square test"). Because our data set is small compared to the size of the table, this might not give us a very accurate measure, but we will try it to see what result we get. We will then investigate further to see if we can get a more accurate measure of the independence.

The χ² Test

The χ² test is based on a statistic (like "mean", "median", and "standard deviation") that reduces the data to a single number. This number is a measure of the deviation from a perfect split that occurs in the data we are analyzing. In addition we have an associated probability distribution that interprets the χ² statistic. This distribution is normally stored in a large chart that people use to determine the meaning of their result. We will give you the part of the chart needed for 5x5, 6x5, and 7x5 tables. (The table shown above is considered a 5x5 table because two of the rows are all 0's.)

We calculate the χ² statistic the following way.

(1) We first remove any rows and columns from the table that contain only 0’s. A row or column that has some 0’s is fine; we only throw out the ones that are entirely 0. This gives us a new table. For example, for our 9am lecture we will get a table that looks like the original one with rows 2 and 7 removed. Call the new table T. So T(i, j) is the entry in row i and column j of the new table.

(2) Let E(i, j) be the number of people we would expect to have in the entry for row i, column j if there is no relation between a student's college and how much sleep he/she gets, i.e. if the variables sleep and college are independent. E(i, j) will in general be a real number rather than an integer. (We will give you the code for producing an array containing these values.)

(3) Let S(i, j) = [T(i, j) - E(i, j)]^2 / E(i, j) for each i and j. Since we have removed the rows and columns that contain only 0’s, E(i, j) will never be 0.

(4) Now define χ²(T) = sum over all i and j of S(i, j). That is, if S is a matrix with each element S(i, j) having the value given in (3) above, we sum all the elements of S to get the result.
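To make the expected values in (2) concrete, here is a small sketch on a made-up 2x2 table (not the survey data). It computes E from the row and column sums and then works out the contribution of a single cell as in (3). The names T, E, and s11 are chosen only for this illustration.

```matlab
% Toy 2x2 table (made-up counts, not the survey data).
T = [10 20; 30 40];
rowsum = sum(T')';            % column vector of row sums: [30; 70]
colsum = sum(T);              % row vector of column sums: [40 60]
n = sum(rowsum);              % total count: 100
E = rowsum*colsum/n;          % expected counts: [12 18; 28 42]
disp(E)

% E has the same row and column sums as T.
disp(sum(E')' - rowsum)       % zeros
disp(sum(E) - colsum)         % zeros

% Contribution of the single cell (1,1), as in step (3).
s11 = (T(1,1) - E(1,1))^2 / E(1,1);   % (10 - 12)^2 / 12 = 1/3
disp(s11)
```

Notice that E(1,1) = 30*40/100: the fraction of people in row 1 times the fraction in column 1, times the total. That is exactly what independence of the two variables would predict.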

In order to do your calculations easily we have given you a script file named tables.m that you can use. When this file is executed, it creates two arrays data9 and data11 containing the tables for the 9:05 lecture and the 11:15 lecture. To get a table with the two lectures added together you can simply add these two arrays and store the result in a third array called dataTotal. Be sure to compute dataTotal before deleting rows from data9 and data11.

How do we execute this file and create the arrays? As with any script file, simply type the name of the file, without the ".m" extension, as if it is a command in the command window and the file will be executed. All variables created by the script file are now available just as if the commands had been typed into the command window directly. You should next type in a command to create the array dataTotal containing the sum of the other two arrays.

What To Do

Step 0

Get the files we are giving you off of the web page. Put all of them in the same folder or directory.

Step 1

Calculate some statistics and do an optional plot of the data as specified below. Do your calculations in the Command window, making sure to display your results.

(1) Run the tables.m script file to create the two arrays data9 and data11, then create the third array dataTotal yourself, containing the combined data. In order to run a script file, Matlab has to be able to find it. This means that you have to set your working directory to be the one that the file is in.

On a Mac, the easiest way is as follows. Open the file from inside Matlab (i.e. using OPEN under menu FILE). Then select the SAVE AND EXECUTE command. Thereafter, simply typing the filename should work. As you can see, it is most convenient if all the files you are using are in the same folder or directory, then you should only have to do this for one of them. Note: You do not actually have to open the file and save it. It is sufficient to use the OPEN command, search around using the dialogue box until you find the file you want and then hit cancel.

On a PC (or a Mac), you can tell Matlab where to look by typing in a command like cd a:\homework. The cd tells it to change directory. The rest is whatever path name you need to give it to get to your file. In this example it would look on the floppy disk in the a drive for a directory called homework. This is now your working directory.

(2) Compute the average number of hours of sleep engineering students get (use the combined data). Be careful how you do this. You don’t want to end up with just the average number of engineering students per column but rather the average amount of sleep. Be sure to exploit the Matlab operations: You do not need (m)any loops. NOTE: Script file tables.m also declares some useful constants, e.g. engineering=4, that you should use to make your code more readable. (It also declares an array of strings holding the names of the colleges; this array is for the bonus points.)
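The difference between the two averages is easy to see on a single row. The sketch below uses the Engineering row of the 9am table from the Introduction (not the combined data you are asked to use). It assumes that the first column counts as 4 hours and the last as 8 hours; the names counts, hours, wrong, and avgSleep are ours.

```matlab
% 9am Engineering row from the Introduction (not the combined data).
counts = [5 18 24 32 5];      % students in each hours-of-sleep column
hours  = 4:8;                 % assumption: first column = 4 hours, last = 8

wrong    = mean(counts);                        % 16.8 students per column
avgSleep = sum(counts .* hours) / sum(counts);  % 518/84, about 6.17 hours
disp([wrong avgSleep])
```

The first number is an average count of students, which is not what we want. The second weights each number of hours by how many students reported it.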

(3) Do the same for one of the other colleges. Type a comment into the command window that states whether you see a difference and whether you can tell if it is significant.

Bonus style point: Compute all the averages at once (this takes care of both (2) and (3)), using a single line of code. HINT: use matrix multiplication.

Bonus style point: Label the numbers with the colleges. To do so, you'll need to read about strings in Matlab. Note that tables.m puts the names of colleges into an array colleges. HINT: What does num2str([3.4 ; 5.7 ; 9.8], '%.1f') do?

Optional task: This exercise will give you good practice in drawing graphs. We recommend that you try to do it, but it is not required. Plot the combined data using either the ribbon or bar3 plotting function and hand in a printout or screen dump. Use online help to figure out how to use these functions.

Use the title command to give the graph a suitable title. Label the three axes appropriately and make sure the range on the hours of sleep is from 4 to 8 (not 1 to 5). Make sure all data for a given college is the same color (rather than all data for a given number of hours of sleep); you'll need to figure out how to manipulate the data. HINT: Use some combination of transposition and the commands flipud, fliplr, and rot90.
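To get a feel for the commands in the hint, try them on a small matrix first. The matrix A below is made up for illustration; which combination you need for the plot is still up to you.

```matlab
A = [1 2 3; 4 5 6];
disp(A')          % transpose:                    [1 4; 2 5; 3 6]
disp(flipud(A))   % rows in reverse order:        [4 5 6; 1 2 3]
disp(fliplr(A))   % columns in reverse order:     [3 2 1; 6 5 4]
disp(rot90(A))    % 90 degrees counterclockwise:  [3 6; 2 5; 1 4]
```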

Type a comment into the command window that states whether you see a difference among the colleges and whether you can tell if it is significant.

Use the text command to label each ribbon or set of bars with the name of the college. Note that tables.m declared an array colleges holding names.

At this point it should be clear that although there might be a difference among the colleges, it is not at all obvious that it is significant (meaningful). Therefore, let us apply the χ² test in Steps 2 and 3.

Step 2

(1) Complete the function called chi that we will give you in a file called chi.m. The function will have the following header:

function [x] = chi(table)

% CHI(TABLE): the chi-square statistic X for the given TABLE

Parameter table is the table (an array) for which we need to calculate the χ² statistic. x is the resulting value to be returned. This function will assume the table has no all-zero rows or columns.

This function must first define an array called expect that is the table of expected values E(i,j) described above. This is done by the following lines of code that you will find at the top of the function.

rowsum = sum(table')'; % a column vector containing the sums of the rows

colsum = sum(table); % a row vector containing the sums of the column

n = sum(rowsum); % total number of data points

expect = rowsum*colsum/n; % expected values from row and column sums

% NOTE: above we use * (not .*) to get matrix multiplication.

% Don't worry if you don't know what this means

Complete the function so that it uses table and expect to calculate the χ² statistic for the table and assigns this value to x. When you do this, be sure to exploit the Matlab operations: You do not need (m)any loops. If you use the Matlab commands to their fullest, you will not need to refer to elements individually (i.e. notation such as S(i, j) will not need to appear in your code).

To complete this function, do steps (3) and (4) in the description of the χ² statistic, using the name expect instead of E and table instead of T. The result should be stored in x. You should not need to add very much code to the function.

(2) Go back to the command window and call this function three times to calculate the χ² statistic for each of the three tables you created in Step 1. Store the three results in three appropriately named variables and display the values in the command window. Remember to delete rows of 0's in your tables before calling function chi.

Bonus style point: When deleting rows of 0's, write the code to work for any 2-dimensional table. HINT: try help logical and figure out what the following code does and how it works: noprime=1:20; noprime=noprime(~isprime(noprime))

(This particular code does not eliminate rows and columns of 0’s. But it should give you a tip as to how you might do this.)
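The idea behind the hint is called logical indexing: a vector of 1's and 0's of type logical can be used as a subscript, and only the positions holding a 1 are kept. Here is a small sketch with made-up values; the names v, mask, M, and keep are ours, and the mask keep is written by hand only to show the syntax for rows.

```matlab
v = [3 0 5 0 7];
mask = (v ~= 0);          % logical vector: [1 0 1 0 1]
nonzero = v(mask);        % keeps entries where mask is true: [3 5 7]
disp(nonzero)

M = [1 2; 3 4; 5 6];
keep = logical([1 0 1]);  % hand-written mask, for illustration only
disp(M(keep, :))          % rows 1 and 3: [1 2; 5 6]
```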

Step 3

We now need to interpret our results. Unfortunately the student version of Matlab that you will find in the labs does not include the charts used to interpret these results. So we will give you the row of the chart that we need. In a script file called distrib.m we define three vectors named chart5x5, chart6x5, and chart7x5, giving us the charts we need for the three tables we get after rows and columns of 0's are removed. (Notice the three tables end up with different sizes. That is why we need three charts.)

To understand how to use one of these vectors, let’s look at an example. The first part of a vector might look like this. (None of them actually contain these values, but we use this for illustration.)

percentile = [1 4 10 24 92 ........]

The values in this vector should be interpreted the following way. Each element in the array represents one percentile. Normally we start our chart at the 0th percentile, but to keep things simple we will start at the 1st percentile. Similarly, we will have a 99th percentile but nothing for the 100th.

We use the chart to find out what percent of the tables of a given shape (i.e. with a specific number of rows and columns) for two independent variables would produce a χ² result that is less than or equal to the number that we calculated with our chi function. For example, the chart above tells us that 1% of the time one gets a χ² score ≤ 1, 2% of the time we get a score ≤ 4, 3% of the time we get a score ≤ 10, etc.

For each of the three tables we are analyzing, we need to take our χ² result, see where it fits into the corresponding percentile array (one of the three charts you are given) and report this information. Do this as follows. Assume we have stored our result in a variable called chiVal. If chiVal = 24 and our chart looks like the vector percentile above, we would report that our result is at the 4% level. If chiVal = 92 we would report that it is at the 5% level. If chiVal = 70 what should we do? To keep it simple we will just interpolate linearly to get the result. That is, we want an answer between 4% and 5% but exactly what? We do this calculation:

4 + (chiVal - percentile(4)) / (percentile(5) - percentile(4))

Here chiVal = 70, percentile(4) = 24, and percentile(5) = 92. This is called linear interpolation because we travel along the line segment from (24, 4%) to (92, 5%) until we reach (70, answer), giving us an answer (number) between 4% and 5%.
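Worked out in Matlab for the illustrative chart, the calculation looks like the sketch below. Remember that Matlab subscripts use parentheses. The index k is fixed at 4 by hand here, and the names k, frac, and result are ours.

```matlab
percentile = [1 4 10 24 92];   % illustrative chart from the text
chiVal = 70;
k = 4;                         % index of the chart entry just below chiVal
frac = (chiVal - percentile(k)) / (percentile(k+1) - percentile(k));  % 46/68
result = k + frac;             % about 4.68
disp(result)
```

So a score of 70 sits a little more than two thirds of the way from the 4% entry to the 5% entry.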

To do this in general you need to use a variable to play the role of 4. How will you figure out what that index should be given the value of chiVal? Hint: You could search through the percentile array to find the right spot. But there is an easier way. Look at the value of the expression

sum(percentile <= x)

Remember that percentile <= x is a vector with a 1 in every place that the vector percentile has a value <= x and a 0 in every place that it has a value > x.
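You can try the expression on the illustrative chart to see what it produces. This is only a sketch; the names mask, k, kExact, and kBelow are ours.

```matlab
percentile = [1 4 10 24 92];      % illustrative chart from the text
x = 70;
mask = (percentile <= x);         % [1 1 1 1 0]
k = sum(mask);                    % 4 entries are <= 70
disp(mask)
disp(k)

kExact = sum(percentile <= 24);   % 4: an exact match counts too
kBelow = sum(percentile <= 0.5);  % 0: x is below every entry in the chart
disp([kExact kBelow])
```

Because the chart is in increasing order, the count of entries <= x is also the position of the last such entry. Think about what your function should do when the count comes out as 0 or as the length of the chart.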

Now that we have explained everything to you (with utmost clarity we hope) you should go to work and figure out where your results from Step 2 fit into the percentile array.

(1) Run the script file distrib.m so that you now have use of the three charts chart5x5, chart6x5, and chart7x5. These three charts are row vectors. (If you want to see what they look like just open this file.)

(2) Write a new function that has an array parameter called chart and a scalar parameter called chisq and that returns a scalar giving the percentile where the score chisq falls in chart. Compute the percentile by finding the right place in the chart (see our hint above) and then using linear interpolation as described above. Save it in the same folder or directory as your other files. Your function should have the following header.

function [percentile] = lookup (chisq, chart)

% LOOKUP (CHISQ, CHART): the percentile where the number CHISQ

% falls in the list of 1% percentiles stored in the vector CHART

(3) Test your three tables on the three corresponding charts by calling the function three times. Display the results in the command window. Type a comment into the command window that states for each result whether or not it is significant at the 5% level. It will be significant at this level if the result you get is ≥ 95%, i.e. if most (≥ 95%) differences are smaller, i.e. the probability of seeing such a large difference is small (≤ 5%).

NOTE: If you are doing things right, then at least the test for the combined data is significant, suggesting that there is a connection between college and sleep.

Step 4

This analysis we have just finished is not really accurate due to the fact that our data set is quite small compared to the size of the tables. So we want to do it a second time using a chart that you will create by means of a function that we are supplying to you in a file called truedistrib.m.

(1) Call the function truedistrib from the Command window giving it the combined data (which is a 7x5 table since it has no all 0 rows or columns). The call should look like this:

truechart = truedistrib(dataTotal);

where dataTotal is the array containing the total of the survey for the two lectures.

The function will generate a sequence of random tables with the same shape as the given table and check their χ² values; note that the function uses the chi function that you have defined, so it is important that they be in the same folder or directory. It will use these results to return a new chart for you to use in place of the given ones. This function may take a few minutes to run, but should not take much longer. (In case you are curious, a description of how the random tables are generated is given in the Appendix at the end of the handout.) Do not print out the chart: end the call with ";" to suppress the output.

(2) Now take the chart you just produced, i.e. truechart, and use it in place of chart7x5 to analyze the χ² result for the combined lectures. Do this by calling your function lookup again. Compare this answer to the one you got previously for the same table: Write a comment in your command window that compares the two results, e.g. how much they differ and whether or not the new one is significant at the 5% level.

What to Turn in

Turn in the following 4 things: (1) printout of the Command window containing commands, results, and comments for Steps 1–4, (2) graph from Step 1 if you decide to do it, (3) completed chi.m file, and (4) the file containing your lookup function.

Appendix: Generating Random Tables

Think about the original table. Effectively, every student chose a "college card" and also a "sleep card". The original table counts the number of students holding each possible combination of pairs of cards. Given the table, it is easy to (re)generate the pairs of cards. But now it is easy to generate a random table: collect all the college cards into one deck and all the sleep cards into another deck, shuffle each deck, deal each deck out, and count the number of students holding each combination.

Here is a simple optimization: Dealing shuffled sleep cards to students is the same as "shuffling" the students and then dealing out sleep cards in sorted order. But students are anonymous, so their order doesn't matter, so we don't need to shuffle them. Therefore, simply deal out shuffled college cards, and then deal out the sleep cards in sorted order (without shuffling either sleep cards or students).

Matlab supports another trick/optimization: full(sparse(cards1,cards2,1)) creates an array where each element is originally zero, and each card combination (i,j) = (cards1(k),cards2(k)) results in the (i,j) entry being incremented by 1. Thus, given the decks cards1 and cards2, we do not need to write any loops: Matlab does all the work!
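Here is a small sketch of the idea for five imaginary students. The decks are made up, and this is not the code in truedistrib.m, only an illustration of the shuffle and of the sparse trick. The names T, shuffled, and R are ours.

```matlab
% Toy decks for five students (made-up data).
cards1 = [1 2 2 1 2];                  % college cards
cards2 = [1 1 1 2 2];                  % sleep cards, in sorted order
T = full(sparse(cards1, cards2, 1));   % counts per combination: [1 1; 2 1]
disp(T)

% Shuffle the college cards only; the sleep cards stay in sorted order.
shuffled = cards1(randperm(length(cards1)));
R = full(sparse(shuffled, cards2, 1)); % one random table
disp(R)

% The shuffle leaves the row and column sums unchanged.
disp([sum(T, 2) sum(R, 2)])
disp([sum(T, 1); sum(R, 1)])
```

Every random table produced this way has the same number of students in each college and the same number at each amount of sleep as the original. Only the pairing between the two changes.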