Project 7: Cracking a Substititution Cipher

Summary:
In Project 6, you developed "tools" for analyzing text in terms of
unigram and bigram frequencies. Now you will use those tools plus others
shown in this write-up for cracking a substititution cipher.

What to do:
+ You are welcome to program in either MATLAB or Java
+ You must choose only one of those languages.
+ You may design your program however you wish.
+ Your program must be clear, concise, and use good style.
+ You MUST use the algorithm shown below.

Bonus:
For bonus points, you may use any other approach to improve the decryption.
We will assign bonus points based on the clarity of your deciphered text and
the elegance of your code.

Prize(s):
Whoever best deciphers the text with the "best" code wins a prize...free Java
textbook(s)!

What to hand in:
+ All code.
+ All output.
+ (For bonus code) An explanation of your algorithm.

We will supply the ciphertext we wish you to decipher sometime during the
week of 4/25.

_______________________________________________________________________________

Algorithm for cracking a substititution cipher:

(The motivation and justification will be in a separate write-up.)

0. Obtain our solutions to Project 6. Rewrite Project 6 methods and classes
using fractions, e.g. percentages, for frequencies -- do not use tallies.

1. Obtain a large "training" plaintext: Pick a large webpage written in
English. For example, you can use Romeo & Juliet:

http://etext.library.cornell.edu/cgi-bin/shakeEd-idx?type=HTML&byte=186196093&rgn=play

or Genesis from the New English Bible:

http://etext.library.cornell.edu/cgi-bin/bie-idx?type=HTML&byte=116542360&rgn=book

Note 1: enter each URL as a single line.

Note 2: Save documents as *text*, not as HTML.

Compute the table of bigram frequencies for the
large "training" plaintext. Include the column and row labels as part of the
"table." How? Use separate arrays or arrays of objects, objects with arrays,
or anything else you prefer.

2. Compute the table of bigram frequencies for ciphertext. Include row and
column labels for this step, too. You may use the following for ciphertext:

http://courses.cs.cornell.edu/cs100/2000sp/cryptannounce.txt

3. To crack the ciphertext, you need to bring the ciphertext bigram frequencies
"close" to training frequencies. To measure "closeness", use the notion of
array distance from Milestone 4 of Project 6. Here is how to bring
the ciphertext bigram table close to the training bigram table:

repeat until the ciphertext table stops getting closer:

for each pair of letters
swap the labels and swap the columns and rows
compute the distance between ciphertext and training bigram frequencies

choose the table closest to training frequencies

Each time you swap a pair of labels, you are really testing a new decryption
key. So, this algorithm tries out multiple decryption keys.

Notes:

+ Milestone 5 from Project 6 accomplishes "swap the columns and the rows";
don't forget to also swap the labels. See "Elaboration" below for an
example.

+ You might need to "unswap" the labels and "unswap" the columns and rows.

(Think of choosing your next move in a game of chess. For each piece, you
try moving it to another place ("swap its location"), and then move it back
to the original place ("unswap its location") before trying another move.)

4. Set the decryption key to:

top line = the labels of the ciphertext bigram table
bottom line = the labels of the training bigram table

6. Transform the ciphertext according to the decryption key.

_______________________________________________________________________________

Elaboration of the body of the inner loop in Step 3.

Suppose the ciphertext bigram table is currently

a b c d
a 1 2 3 4
b 5 6 7 8
c 9 0 1 2
d 3 4 5 6

If we choose the pair ('a', 'd'), then swapping the labels and the
columns and rows yields:

d b c a
d 6 4 5 3
b 8 6 7 5
c 2 0 1 9
a 4 2 3 1

Performing the swaps again (swap the labels and the columns and rows)
undoes the swaps to yield the original table

a b c d
a 1 2 3 4
b 5 6 7 8
c 9 0 1 2
d 3 4 5 6