Computer Science 280: Homework 4

Homework 4:
2/18/09 (due 2/25/09) The following problems are all taken from the handout on Number Theory from Rosen's book.

Section   Number       Points    Comments
2.5       20             4       Don't use Fermat's Little Theorem for this one.
          28             5       Hint: what is the congruence class of 10 mod 11


2.6       2(b),(f)       6       For (f), use the Euclidean algorithm to
                                 compute the gcd and then back substitute
          8              3       Use the same techniques as problem 2.
          10             4
          15             4
          28(a)          3
          29(a)          3
          48             4      There seems to be a bug in Rosen's description of the 
                                Euclidean algorithm.  Rosen says (just before Exercise 48) 
                                that sj = s(j-2) - q(j-1)s(j-1) and q(j-1) tj = t(j-1) -q(j-1)t(j-2).  
                                If we assume (using the notation of the class notes)
                                that num(j) = qj demon(j) + r(j), then it should really
                                say tj = t(j-2) - q(n-j+1) t(j-1) and sj = s(j-2) - q(n-j+2)s(j-1).   
                                (Note: this is a correction as of 2/24/09, although I doubt anyone will
                                actually use Rosen's equations.)  I'll accept any reasonable 
                                backsubstitution approach to computing the coefficients.


[Two problems from Jon Kleinberg:] 
(1) [5 points] The folks  at MyCrawl are having a bizarre performance problem.
As they collect news headlines, they want to know if they've seen  a
given headline before. They don't want to have to consult all the 
headlines in their index to do this, so they're using a very primitive
form of hashing.  Their scheme works as follows. They pick a number n,
and maintain n virtual "buckets" of headlines, labeled 0, 1, 2, ..., n-1.
Each headline in their index belongs to exactly one of these
buckets. When a new headline a comes in, they map it to a number f(a) and
check whether it's already in bucket f(a). If it's not, they add it to
this bucket. (So if the same headline a had come in previously, it would
already be sitting in bucket f(a) when they go to check.)
This is the essence of hashing in general: you divide up your data so
that you only have to check a small fraction of it in situations like
this. The hope is that if your function f -- the \hash function" 
-- spreads things out nicely, then you're only consulting a 1/n fraction 
of your data when you look in a single bucket.
To make all this concrete, one needs to specify the function
f. Here's how they're  doing it at MyCrawl. (Despite the long
description that follows for the sake of completeness, 
it's actually describing a very standard type of thing one might do.)
  • Each headline is in the following structured style. It begins with the acronym of the wire service that carried the headline, then has a colon and a space, then the text of the headline itself, and then it ends with the date in the following format: two characters for the day of the month, then the first three letters of the name of the month, then four characters for the year, and then the first three letters of the day of the week. For example: AP: Turing's biographer wanted Russell Crowe in biopic 16Sep2005Fri
  • They then convert the text into a sequence of numbers, by mapping each character to a number as follows. The letters `a' through `z' are mapped to the numbers 1 through 26 (ignoring whether they're upper-case or lower-case), the "space" between words is mapped to 27, the numbers 0 through 9 are mapped to 28 through 37, and all other characters (e.g. punctuation) are mapped to a single "special symbol," which is encoded as the number 38.
  • They then convert each number to binary, using exactly six bits per number, inserting leading 0's if necessary. (So `a' becomes 000001, `b' becomes 000010, ..., space becomes 011011, and so forth.)
  • These sequences of bits are then all concatenated together, causing the headline to be represented as a single bit string. So for example, the bit string corresponding to the headline above would begin 000001010000100110011011.
  • Finally, one views this long bit string as a (very large) number M written in binary notation.
  • The value f(a) is defined simply to be M mod n.
Now, there's the issue of what to use as the value for n. The MyCrawl people had read somewhere that it was a good idea in practice to have n be a prime number. They weren't exactly sure why, but they did it, choosing the prime number n = 4099, and things worked very nicely. Profiling indicated that when they went to consult a bucket, it tended to have roughly (1/4099)th of their headlines. Then at some point, it's not clear how, someone accidentally changed the code so that 4099 became 4096. Immediately, performance became much worse. As far as they could tell, every time they mapped a headline a to f(a), and then consulted bucket f(a), this bucket seemed to contain at least 10% of all their headlines. This is where they were hoping you could help them out. It's no trouble for them to switch back to n = 4099, but in addition to doing this they want to know what went so wrong with n = 4096. Is there a simple explanation for it, or does it hint at some bug deep in the internals of their system? Explain why they observed a big performance problem with n = 4096. Your explanation should be based on the information you have above; if you want, you can make very mild further assumptions about their system, but the fewer the better. Also, it's not necessary to explaid why 4099 worked so well, just why 4096 worked so badly.) (2) [5 points] The phone rings one day and you pick it up. The voice at the other end says, ``Hello, is this Professor Rivest?" You meant to say ``No," but for some reason you said ``Yes" instead; the next thing you know, people from this mysterious company arrange to y you out to the West Coast to advise them on their security system. ``We have a system where we put 100 agents out in the field," they explain to you, in a sub-basement surrounded by electromagnetic shielding. ``Each agent announces an RSA public key, and then they use the RSA cryptosystem to send messages to each other using these public keys. We're not concerned about whether the agents know each other's private keys or not -- we're all on the same team, after all -- but we're very worried about outsiders learning the private keys. Here's the basic code we planned to use to assign a public key to each agent:

Method I:
For i = 1, 2, 3, ... , 100
Generate a prime number pi.
Generate a prime number pi'.
Compute the public and private keys for agent i from pi and pi'

After looking at this for a while, we thought: `Isn't Method I sort of wasteful of primes?' So we came up with Method II:

Method II:
First generate prime numbers p1, p2, ..., p101
For i = 1, 2, 3, ... , 100
Compute the public and private keys for agent i from pi and pi+1.

Notice how we only use about half as many primes in Method II. But we're worried that Method II is somehow insecure -- that is, if our enemies knew we were using Method II to compute keys, they could compute the private keys from the public keys by a good algorithm."

So here's the question: show that in fact the company's fears are well-founded. Given a set of 100 public keys known to be generated by Method II, show how to compute the 100 corresponding private keys by a good algorithm. (Although this is not critical for answering the question, you can assume for simplicity that each public key is publically labeled with the ID number, from 1 through 100, of the agent that possesses it.)

Note: The point is that, in trying to break the keys, you know that Method II was used, but you (obviously) don't know at the outset what the primes p1, p2, ... , p101 are. Nevertheless, your algorithm should be able to figure them out. Also, here is a crucial thing: your good algorithm is allowed to have a factor proportional to the number of agents (or some small power of the number of agents) in its running time, since the number of agents is simply the (small) constant 100.)