Problem Set 4: Data Structures, RSA

Due Wednesday, October 25, 2006, 11:00 pm.

Oct 17: Part 2: Please download the signature file hashmap.sig and have your HashMap implement this signature: HashMap(HB: HASH_BIND) :> HASHMAP.
Oct 15: Part 2: The unfinished sentence in the first paragraph of Part 2 has been fixed.
Oct 12: Part 4: questions b) and c) have been updated to correct an error.

This problem set has 4 parts. In parts 1 and 2 you'll implement advanced data structures. In Part 3 you'll learn about the RSA cryptosystem and implement algorithms in this system. Part 4 is a written problem.

It is required that you work with one partner for this assignment. We ask you to use CVS for concurrent code development within your team (see the end of the assignment).

As before, you will write your solutions in the appropriate files and submit the files separately on CMS. To get you started, we are providing a zip file containing several files. You can edit the necessary files, and submit your solution using CMS when you have finished. Non-compiles and warnings will receive an automatic zero.

[Download the zip file ps4.zip]

Part 1: Balanced Trees

Recall that an AVL tree is a binary search tree where all nodes have the property that the height difference between their children (of balance factor) is -1, 0, or 1. In class you have seen an implementation of AVL trees that keeps information about height in each node. It is, in fact, possible to have an implementation where each node contains the balancing factor instead. Since balancing factors can be represented using only two bits, this approach can be more space-efficient.

In this part of the assignment you will implement a simple map abstraction using AVL trees with balance factors. The AVL tree map is parameterized on a signature ORD_BIND containing types key and value, plus a function that compares two keys. The datatype definition of tree maps is as follows:

datatype bal = LEFT | BAL | RIGHT
datatype bind = key * value 
datatype avl = Nil | Node of bind * bal * avl * avl
type map = avl

Each node of the tree contains a binding of an integer key to a value, plus the balance factor of the current node. The balance factor indicates whether the current node is skewed to the right (RIGHT), to the left (LEFT), or if it is balanced (BAL).

repOk. Implement a repOk function that checks whether a tree is a valid AVL tree map. For full score, your implementation must be O(n) in the number of nodes of the tree.
Insert. To get you started, we have provided a function put that inserts a new binding in the tree. The code places the new binding as a leaf, and then walks the tree up, re-balancing nodes along the way. In the absence of explicit height information, the rebalancing process becomes more difficult: one must keep track of whether insertion has caused sub-trees to increase in height. The helper function put'(t,e) inserts an element e into an AVL tree t, and returns a pair consisting of the resulting AVL tree t' and a flag that indicates whether t' is taller than t. Please look at the code and understand how the recursive insertion process works.

A call balance(s,n) rebalances node n after insertion. All of the subtrees of n are AVL trees, but the current node n contains the old balance factor. The first parameter s indicates the side where the new binding has been inserted.
You task is to fill in the code for balance. This requires a detailed case analysis. For each case you must determine if rotations are needed, how many rotations need, what are the resulting balance factors, and whether the resulting tree increases in height as a result of the insertion. The use of pattern matching can greatly simplify your code (to 1-2 lines per case). Your implementation of balance must take constant time.
Other functions. Implement the other functions that complete the map interface map.sig, that is functions get, fold, and size. The fold function will traverse the nodes of the tree using in-order traversal.

To do: turn in the solution in the file avlmap.sml.

Part 2: Resizable Hash Tables

Hash tables represent a simple, efficient data structure for building abstractions such as sets or maps. In this part of the assignment you will implement such a structure. The structure is parameterized for types key and value, such that a function is provided to compute hashcodes for key types, and an equals function determines if two keys are the same. This functionality is described by the following interface:

signature HASH_BIND = sig
   type key
   type value
   val hash : key * int -> int
   val equals : key * key -> bool
end

The hash function hash(k,n) returns a hash code between 0 and n-1. Any implementation of the above signature must be such that if equals(k,k')=true then hash(k,n) = hash(k',n) for any value of n. This condition is required for correctness, but is not checked by the program.

The hash table will be implemented in the standard manner, as an array of buckets where each bucket contians a list of key-value elements. In addition, your implementation must support dynamic rehashing. When a new hash map is created, one must specify the initial number of buckets and the maximum load factor of the hash table. When the load factor reaches the maximum value, the number of buckets must be doubled, and the table must be rehashed. Your task is to do the following:

Fill in an appropriate type definition for resizable hash tables.
Document the representation invariant.
Implement the basic functionality. Function new(n,f) creates a hash table with n buckets and with maximum load factor f. Function put(m,k,v) inserts value v for key k into the map m; if m already contains a binding for k, the old value is replaced. Function get(m,k) returns an option that contains the binding of k in m. Function delete(t,k) removes the binding of k in m.
```
val new : int * real -> map
val put : map * key * value -> unit
val get : map * key -> value option
val delete : map * key -> unit
```
Implement the other functions. Function size returns the number of data elements in the hash table. Function buckets returns the number of buckets in the hash table. Function fold performs the fold operation over all the values in the table, starting from bucket zero, and traversing each bucket from left to right.
```
val size : map -> int
val buckets : map -> int
val fold : (key * value * 'a -> 'a) -> 'a -> map -> 'a
```

To do: submit your solution in the file hashmap.sml.

Part 3: The RSA Cryptosystem

In a world that relies increasingly upon digital information, public key encryption and digital signatures play an important part in achieving private communication. RSA is a popular public-key encryption system. It is based on the fact that there are fast algorithms for exponentiation and for testing prime numbersbut no known fast algorithms for factoring large numbers. In this problem set you will implement a version of the RSA system. The algorithms will you implement have both a deep mathematical foundation and practical importance.

Cryptographic systems typically use keys for encryption and decryption. An encryption key is used to convert the original message (the plaintext) to coded form (the ciphertext). A corresponding decryption key is used to convert the ciphertext back to the original plaintext. In traditional cryptographic systems, the same key is used for both encryption and decryption. Two parties can exchange coded messages only if they share a secret key. Since anyone who learned that key would be able to decode the messages, keys must be carefully guarded and transmitted only under tight security: for example, couriers handcuffed to locked, tamper-resistant briefcases!

Diffie and Hellman (1976) discovered a new approach to encryption and decryption: public key cryptography. In this approach, the encryption and decryption keys are different. Knowing the encryption key cannot help you find the decryption key. Thus you can publish your encryption key on the web, and anyone who wants to send you a secret message can use it to encode a message to send to you. You do not have to worry about key security at all, for even if everyone in the world knew your encryption key, no one could decrypt messages sent to you without knowing your decryption key, which you keep private to yourself. The RSA system, due to Rivest, Shamir, and Adelman, is the best-known public-key cryptosystem; the security of your web browsing probably depends on it.

How RSA Works

RSA does not work with characters, but with integers. That is, it encrypts numbers. To encrypt a piece of text, just combine the ASCII codes of several characters into a big number, and then encrypt the resulting number. A big number is an integer of arbitrary magnitude (i.e., is not restricted to 32 bits or 64 bits).

Suppose you want to obtain public and private keys in the RSA system. You select two very large prime numbers p and q. (Recall that a prime number is a positive integer greater than 1 with no divisors other than itself and 1). You then compute numbers n and m:

n = pq
m = (p−1)(q−1)

It turns out that no one knows how to compute m, p, or q efficiently, knowing n.

(Notation: [a=b] mod n means that (a mod n) = (b mod n), where a mod n is the remainder obtained when dividing a by n using ordinary integer division. Equivalently, [a=b] mod n if a−b is divisible by n. For example, [17 = 32] mod 5.)

Now you pick a number e < m relatively prime to m; that is, such that e and m have no factors in common except 1. The significance of relative primality is that e is relatively prime to m iff e is invertible mod m; that is, iff there exists a d such that [de = 1] mod m. Moreover, it is possible to compute d from e and m using Euclid's algorithm, described below. Then:

The number n is the modulus.
The public key, which you can advertise to the world, is the pair (n,e), containing the modulus and the public exponent.
The private key is (n,d), containing the modulus and the private exponent.

Anyone who wants to send you a secret message s (represented by an integer) encrypts it by computing a number E(s). The cyphertext E(s) is obtained by raising s to the power e, then taking the remainder modulo n :

    E(s) = s^e mod n

The decryption process is exactly the same, except that d is used instead of e:

    D(s) = s^d mod n

The operations E and D are inverses:

    D(E(s))    = (s^e)^dmod n
                    = s^de mod n
                    = s^1+km mod n
                    = s(s^m)^k mod n
                    = s(1)^k mod n
                    = s mod n
                    = s

The integer s representing the plaintext must be less than n. That's why we break the message up into chunks. In the above reasoning, the equality [s^m=1] mod n holds only if s is relatively prime to n (using a theorem due to Euler and Fermat). Otherwise, if s divides n, that is, s is either p or q, then D(E(s)) = s^de mod n = s, trivially. Therefore, it is always the case that D(E(s)) = s.

In summary, to use the RSA system:

    1. pick large primes p and q;
    2. compute n = pq and m = (p−1)(q−1);
    3. choose e relatively prime to m and use this to compute d such that [de = 1] mod m;
    4. publish the pair (n,e) as your public key, but keep d, p, and q secret.

How secure is RSA? At present, the only known way to obtain d from e and n is to factor n into its prime factors p and q, then compute m and proceed as above. But despite centuries of effort by number theorists, factoring large integers efficiently is still an open problem. Until someone comes up with an efficient way to factor numbers, or discovers some other way to compute d from e and n, the cryptosystem appears to be secure for large numbers n. Although several "better" factorization algorithms have been proposed, the problem remains computationally intractable for numbers that are large enough. For example, it is infeasible with today's best supercomputers and best algorithms known to date to factor an arbitrary 1024-bit number. In fact, the RSA Factoring Challenge offers money prizes for factoring large numbers. To date, the largest number from this challenge that has been successfully factored had 640 bits (or 193 decimal digits). It took about 5 months for several tens of machines working in parallel to factor this number.

Implement RSA

In the file rsa.sml you have been given supporting code for implementing the RSA algorithm. The code relies on the implementation of big integers and random numbers from the SML/NJ library. Note that this library is not loaded by default! Using the SML Compilation Manager, the library is loaded by the line smlnj-lib.cm in the sources.cm file (check the comments in that file if you have trouble loading the library).

The given code provides a variety of functions, including functions to generate prime numbers, and to test numbers for primality; code that generates key pairs; to read and write keys to and from files; or to convert between lists of bytes and big numbers. We briefly discuss a few of these functions.

Function expmod(b, e, m) computes b^e mod m. Function euclid(m,n) yields numbers (s,t,g) such that sm + tn = g, where g is the greatest common divisor of m and n.
Function isPrime(n) tests is number is prime. This is a probabilistic test: it yields true if there is a high probability that n is prime; and yields false if n is definitely not prime. This function is based on Fermat's theorem, which states that if n is prime, then all numbers a in the range 0 < a < n satisfy [a^n-1 = 1] mod n; and if n is not prime, then at most half do. Therefore, successfully checking if [a^n-1 = 1] mod n for k numbers a indicates that n is prime with a probability of error of 1 / 2^k.
The function generateKeyPair(r) generates an RSA key pair. It picks random primes p and q that are in the range from 2^r to 2*2^r so that n = pq will be in the range 2^2r to 2^2r+2. To find the keys e and d given m = (p-1)(q-1), function selectKeys picks a random e, and computes d using Euclid's algorithm.

You must add the encryption and decryption functions specified below. Your functions encode an input list of bytes into an output list of bytes. Therefore, these functions operate on blocks of bytes. Each block consists of M bytes or (M-1) bytes, where M is the number of bytes required to store the modulus n of the key (n,e) or (n,d).

Encryption then works as follows: take each (M-1)-byte block at a time (i.e. a list of M-1 bytes), treat it as a large integer, encode it with RSA, and then convert the resulting number to an M-byte list. The resulting block must have exactly M bytes, so you need to pad the resulting number with leading zeros if necessary. To keep the code simpler, assume that the input is a multiple of M-1 bytes (one can relax this condition by padding the input message with an appropriate number of trailing spaces) addition.
To make the encrypted message more difficult to break, we'll add a small tweak to the encryption scheme: before encoding a block, the value of the previously encoded block is added to the value of the current block. For instance, if we denote by {X} the RSA encoding of one single input block X, then a sequence of 3 blocks:

A, B, C

is encoded as:

A' = {A}, B'={ (A'+B) mod n }, C' = { (B'+C) mod n }

You must implement the function that encrypts an input message represented as a list of bytes:
```
  (* encrypt k l encrypts byte list l with key k.
   * If M is the number of bytes in k's modulus, then break l into
   * (M-1)-byte blocks, and encrypt each block into an M-byte block.
   * Requires: length(l) must be a multiple of M-1.
   * Returns: the encrypted list of bytes *)
  val encrypt : key -> word list -> word list
```

Decryption performs the opposite operation: it decodes each M-byte block into an M-1 byte block. Remember that each block in the cyphertext is the RSA-transform of the current block and the encrypted previous block. Decrypting a block must yield exactly M-1 bytes.

  (* decrypt k l decrypts byte list l with key k.
   * If M is the number of bytes of k's modulus, then break l into
   * M-byte blocks, and decrypt each of them into an (M-1)-byte block.
   * Returns: the list of bytes from the decrypted blocks *)
  val decrypt : key -> word list -> word list

Note: You may find it useful to use a single function that performs the RSA-transform, and use this function for both encryption and decryption:

(* transform k n l encodes a list of bytes l into an
 * n-byte output list l', using key k. *)
  val transform : key -> int -> word list -> word list

To do: Complete encrypt and decrypt in rsa.sml.

Digital Signatures

Public key encryption can be used to solve another problem of secure communication. Suppose you wish to send a message by electronic mail and sign it so that the recipient can be sure that it really came from you. What is required is some scheme for signing a message in a way that cannot be forged. This is called a digital signature.

Suppose you want to sign a digital message M in a way that convinces anyone else that the message came from you. This can be achieved by providing a signature that could only have been sent by someone with your private key. This signature is simply D(M). Anyone who receives M and D(M) can verify that the signature corresponding to the message by applying E to it, obtaining E(D(M)) = M. Only someone in possession of the private key d could have constructed a signature with this property. If someone wanted to forge a signature to a bogus message M, they would have to compute D(M), but assuming that RSA is secure, this cannot be done without the private key d.

We can combine the two techniques as follows. Suppose Amy wants to send Bob a message that only Bob can read, and sign the message so that Bob knows it could only be from her. She encrypts the message using Bob's public key. Then she signs the encrypted result using her own private key using the signature method just described. When Bob receives the message, he first uses Amy's public key to authenticate the signature, then decrypts the message using his own private key. Bob can be sure that only someone with Amy's private key could have sent the message. Amy can be sure that only someone with Bob's private key can read the message. This is accomplished without exchanging any secret information between Bob and Amy. It's this capacity for achieving secure communication without having to worry about exchanging secret keys that makes public key cryptography such an appealing technique.

Note: For efficiency and other reasons, real implementations of digital signatures do not sign the message itself, but just sign a small hashcode of the message. That is, for a message M, the digital signature is not D(M), but D(H(M)), where H is a secure one-way hash function (an example hash function is MD5). The pair (M, D(H(M))) is the signed document.

To Do: Implement the functions encrypt_and_sign and unsign_and_decrypt in rsa.sml.

Part 4: Recurrences and Complexity

Denote by T_h the AVL tree of height h that has the minimum number of nodes (among those of height h). The height of a tree with 1 node is 1 by definition. Derive a recurrence relation for |T_h|, the number of nodes in T_h.
Let us define C_{h =}|T_h|+1. Find two distinct numbers c₁ and c₂ such that (c_i)^h satisfies the recurrence relation of C_h (for i=1,2).
Show that |T_h| = ((c₁)^h+2 - (c₂)^h+2) / √5 - 1
For what value of c is |T_h| = Θ(c^h)?

To do: turn in the solution in the file complexity.txt.

Other Requirements

CVS: We ask you to use CVS while developing code with your partner. CVS is a popular, convenient versioning system that allow you and your partner to share code in a common repository. It allows you to share common code, commit your code into the repository, or update your version with the changes of your partner.

To do: submit a file log file cvslog showing your cvs activity during the assignment. For this, look into the "cvs log" command. This will count for 5% of the grade for this assignment.
After submission, you will meet with the consultants to explain your work. This count for another 5% of the grade.