Computer Science 280: Homework 4

Homework 4:
2/18/09 (due 2/25/09) The following problems are all taken from the handout on Number Theory from Rosen's book.

Section   Number       Points    Comments
2.5       20             4       Don't use Fermat's Little Theorem for this one.
          28             5       Hint: what is the congruence class of 10 mod 11


2.6       2(b),(f)       6       For (f), use the Euclidean algorithm to
                                 compute the gcd and then back substitute
          8              3       Use the same techniques as problem 2.
          10             4
          15             4
          28(a)          3
          29(a)          3
          48             4      There seems to be a bug in Rosen's description of the 
                                Euclidean algorithm.  Rosen says (just before Exercise 48) 
                                that sj = s(j-2) - q(j-1)s(j-1) and q(j-1) tj = t(j-1) -q(j-1)t(j-2).  
                                If we assume (using the notation of the class notes)
                                that num(j) = qj demon(j) + r(j), then it should really
                                say tj = t(j-2) - q(n-j+1) t(j-1) and sj = s(j-2) - q(n-j+2)s(j-1).   
                                (Note: this is a correction as of 2/24/09, although I doubt anyone will
                                actually use Rosen's equations.)  I'll accept any reasonable 
                                backsubstitution approach to computing the coefficients.


[Two problems from Jon Kleinberg:] 
(1) [5 points] The folks  at MyCrawl are having a bizarre performance problem.
As they collect news headlines, they want to know if they've seen  a
given headline before. They don't want to have to consult all the 
headlines in their index to do this, so they're using a very primitive
form of hashing.  Their scheme works as follows. They pick a number n,
and maintain n virtual "buckets" of headlines, labeled 0, 1, 2, ..., n-1.
Each headline in their index belongs to exactly one of these
buckets. When a new headline a comes in, they map it to a number f(a) and
check whether it's already in bucket f(a). If it's not, they add it to
this bucket. (So if the same headline a had come in previously, it would
already be sitting in bucket f(a) when they go to check.)
This is the essence of hashing in general: you divide up your data so
that you only have to check a small fraction of it in situations like
this. The hope is that if your function f -- the \hash function" 
-- spreads things out nicely, then you're only consulting a 1/n fraction 
of your data when you look in a single bucket.
To make all this concrete, one needs to specify the function
f. Here's how they're  doing it at MyCrawl. (Despite the long
description that follows for the sake of completeness, 
it's actually describing a very standard type of thing one might do.)

Each headline is in the following structured style. It begins with
the acronym of the wire service that carried the headline, then has a
colon and a space, then the text of the headline itself, and then it
ends with the date in the following format: two characters  
for the day of the month, then the first three letters of the name of
the month, then four characters for the year, and then the first three
letters of the day of the week.  For example:
AP: Turing's biographer wanted Russell Crowe in biopic 16Sep2005Fri
They then convert the text into a sequence of numbers, by mapping
each character to a number as follows. The letters `a' through `z' are
mapped to the numbers 1 through 26 (ignoring whether they're upper-case
or lower-case), the "space" between words is mapped to 27, the numbers 0
through 9 are mapped to 28 through 37, and all other characters
(e.g. punctuation) are mapped to a single "special symbol," which is
encoded as the number 38. 
They then convert each number to binary, using exactly six bits per
number, inserting leading 0's if necessary. (So `a' becomes 000001, `b'
becomes 000010, ..., space becomes 011011, and so forth.)
These sequences of bits are then all concatenated together, causing
the headline to be represented as a single bit string. So for example,
the bit string corresponding to the headline above would begin
000001010000100110011011. 
Finally, one views this long bit string as a (very large) number M
written in binary notation.
The value f(a) is defined simply to be M mod n.  

Now, there's the issue of what to use as the value for n. The MyCrawl
people had read somewhere that it was a good idea in practice to have n
be a prime number. They weren't exactly sure why, but they did it,
choosing the prime number n = 4099, and things worked 
very nicely.  Profiling indicated that when they went to consult a
bucket, it tended to have roughly (1/4099)th of their headlines.
Then at some point, it's not clear how, someone accidentally changed the
code so that 4099 became 4096. Immediately, performance became much
worse.  As far as they could tell, every time they mapped a headline a
to f(a), and then consulted bucket f(a), this bucket seemed to contain
at least 10% of all their headlines.  

This is where they were hoping you could help them out. It's no trouble
for them to switch back to n = 4099, but in addition to doing this they
want to know what went so wrong with n = 4096.  Is there a simple
explanation for it, or does it hint at some bug deep in the internals of
their system?  Explain why they observed a big performance problem with
n = 4096. Your explanation should be based on the information you have
above; if you want, you can make very mild further assumptions about
their system, but the fewer the better. Also, it's not necessary to
explaid why 4099 worked so well, just why 4096 worked so badly.) 

(2) [5 points] The phone rings one day and you pick it up. The voice at
the other end says, ``Hello, is this Professor Rivest?" You meant to say
``No," but
for some reason you said ``Yes" instead; the next thing you know, people
from this mysterious company arrange to y you out to the 
West Coast to advise them on their security system. 

``We have a system where we put 100 agents out in the field," they
explain to you, in a sub-basement surrounded by electromagnetic
shielding. ``Each agent announces an RSA public key, and then they use
the RSA cryptosystem to send messages to each other using these public
keys. We're not concerned about whether the agents know each other's
private  keys or not -- we're all on the same team, after all -- but
we're very worried about outsiders learning the private keys. 
Here's the basic code we planned to use to assign a public key to each
agent:  

Method I:

  For i = 1, 2, 3, ... , 100 

  Generate a prime number pi. 

  Generate a prime number pi'. 

  Compute the public and private keys for agent i from pi and pi'



After looking at this for a while, we thought: `Isn't Method I sort of
wasteful of primes?'  
So we came up with Method II: 


Method II: 

   First generate prime numbers p1, p2, ...,  p101
 
   For i = 1, 2, 3, ... , 100 

   Compute the public and private keys for agent i from pi and pi+1.


Notice how we only use about half as many primes in Method II. But we're
worried that Method II is somehow insecure -- that is, if our enemies knew 
we were using Method II to compute keys, they could compute the private
keys from the public keys by a good algorithm."

  
So here's the question: show that in fact the company's fears are
well-founded. Given a set of 100 public keys known to be generated by
Method II, show how to compute the 100 corresponding private keys by a
good algorithm. (Although this is not critical for answering the
question, you can assume for simplicity that each public key is
publically labeled with the ID number, from 1 through 100, of the agent
that possesses it.)  

Note: The point is that, in trying to break the keys, you know that
Method II was used, but you (obviously) don't know at the outset what
the primes p1, p2, ... , p101 are.  Nevertheless, your algorithm should
be able to figure them out.  Also, here is a crucial thing: your good
algorithm is allowed to have a factor proportional to the number of
agents (or some small power of the number of agents) in its running
time, since the number of agents is simply the (small) constant 100.)