Computer Science 280: Homework 4

Homework 4:
2/20/08 (due 2/27/08) The following problems are all taken from the handout on Number Theory from Rosen's book.

Section   Number       Points    Comments
2.4       44             3
          46             3

2.5       20             4
          28             5       Hint: what is the congruence class of 10 mod 11


2.6       2(b),(f)       6       For (f), use the Euclidean algorithm to
                                 compute the gcd and then back substitute
          8              3
          10             4
          12             3
          15             4
          18             3
          20             5       Note that you can't immediately apply the CRT,
                                 since 6, 10, and 15 are not pairwise relatively prime.
                                 You need to convert it to a form where you can apply 
                                 the CRT. (Hint: if x is congruent to 5 mod 6,
                           	 what is x congruent to mod 2 and mod 3?)
          26             3
          28             5      (Hint: for part (b), set up a system of
                                congruences mod 5, 7, and 11 of which
				3**302 is a solution; then find another solution.
          48             4      There seems to be a bug in Rosen's description of the 
                                Euclidean algorithm.  Rosen says (just before Exercise 48) 
                                that tj = t(j-1) -q(j-1)t(j-2).  It should say
                                tj=t(j-1) - qj t(j-2).  I'll accept any reasonable 
                                backsubstitution approach to computing
                                 the coefficients.


[A problem from Jon Kleinberg:] [5 points]
The folks  at MyCrawl are having a bizarre performance problem.
As they collect news headlines,they want to know if they've seen  a
given headline before. They don't want to have to consult all the 
headlines in their index to do this, so they're using a very primitive
form of hashing.  Their scheme works as follows. They pick a number n,
and maintain n virtual "buckets" of headlines, labeled 0, 1, 2, ..., n-1.
Each headline in their index belongs to exactly one of these
buckets. When a new headline a comes in, they map it to a number f(a) and
check whether it's already in bucket f(a). If it's not, they add it to
this bucket. (So if the same headline a had come in previously, it would
already be sitting in bucket f(a) when they go to check.)
This is the essence of hashing in general: you divide up your data so
that you only have to check a small fraction of it in situations like
this. The hope is that if your function f -- the \hash function" 
-- spreads things out nicely, then you're only consulting a 1/n fraction 
of your data when you look in a single bucket.
To make all this concrete, one needs to specify the function
f. Here's how they're  doing it at MyCrawl. (Despite the long
description that follows for the sake of completeness, 
it's actually describing a very standard type of thing one might do.)
  • Each headline is in the following structured style. It begins with the acronym of the wire service that carried the headline, then has a colon and a space, then the text of the headline itself, and then it ends with the date in the following format: two characters for the day of the month, then the first three letters of the name of the month, then four characters for the year, and then the first three letters of the day of the week. For example: AP: Turing's biographer wanted Russell Crowe in biopic 16Sep2005Fri
  • They then convert the text into a sequence of numbers, by mapping each character to a number as follows. The letters `a' through `z' are mapped to the numbers 1 through 26 (ignoring whether they're upper-case or lower-case), the "space" between words is mapped to 27, the numbers 0 through 9 are mapped to 28 through 37, and all other characters (e.g. punctuation) are mapped to a single "special symbol," which is encoded as the number 38.
  • They then convert each number to binary, using exactly six bits per number, inserting leading 0's if necessary. (So `a' becomes 000001, `b' becomes 000010, ..., space becomes 011011, and so forth.)
  • These sequences of bits are then all concatenated together, causing the headline to be represented as a single bit string. So for example, the bit string corresponding to the headline above would begin 000001010000100110011011.
  • Finally, one views this long bit string as a (very large) number M written in binary notation.
  • The value f(a) is defined simply to be M mod n.
Now, there's the issue of what to use as the value for n. The MyCrawl people had read somewhere that it was a good idea in practice to have n be a prime number. They weren't exactly sure why, but they did it, choosing the prime number n = 4099, and things worked very nicely. Profiling indicated that when they went to consult a bucket, it tended to have roughly (1/4099)th of their headlines. Then at some point, it's not clear how, someone accidentally changed the code so that 4099 became 4096. Immediately, performance became much worse. As far as they could tell, every time they mapped a headline a to f(a), and then consulted bucket f(a), this bucket seemed to contain at least 10% of all their headlines. This is where they were hoping you could help them out. It's no trouble for them to switch back to n = 4099, but in addition to doing this they want to know what went so wrong with n = 4096. Is there a simple explanation for it, or does it hint at some bug deep in the internals of their system? Explain why they observed a big performance problem with n = 4096. Your explanation should be based on the information you have above; if you want, you can make very mild further assumptions about their system, but the fewer the better. Also, it's not necessary to explaidn why 4099 worked so well, just why 4096 worked so badly.)