Computer Science 280: Homework 4

Homework 4:

2/20/08 (due 2/27/08) The following problems are all taken from the handout on Number Theory from Rosen's book.

Section Number Points Comments
2.4 44 3
46 3

2.5 20 4
28 5 Hint: what is the congruence class of 10 mod 11

2.6 2(b),(f) 6 For (f), use the Euclidean algorithm to
compute the gcd and then back substitute
8 3
10 4
12 3
15 4
18 3
20 5 Note that you can't immediately apply the CRT,
since 6, 10, and 15 are not pairwise relatively prime.
You need to convert it to a form where you can apply
the CRT. (Hint: if x is congruent to 5 mod 6,
what is x congruent to mod 2 and mod 3?)
26 3
28 5 (Hint: for part (b), set up a system of
congruences mod 5, 7, and 11 of which
3**302 is a solution; then find another solution.
48 4 There seems to be a bug in Rosen's description of the
Euclidean algorithm. Rosen says (just before Exercise 48)
that tj = t(j-1) -q(j-1)t(j-2). It should say
tj=t(j-1) - qj t(j-2). I'll accept any reasonable
backsubstitution approach to computing
the coefficients.

[A problem from Jon Kleinberg:] [5 points]
The folks at MyCrawl are having a bizarre performance problem.
As they collect news headlines,they want to know if they've seen a
given headline before. They don't want to have to consult all the
headlines in their index to do this, so they're using a very primitive
form of hashing. Their scheme works as follows. They pick a number n,
and maintain n virtual "buckets" of headlines, labeled 0, 1, 2, ..., n-1.
Each headline in their index belongs to exactly one of these
buckets. When a new headline a comes in, they map it to a number f(a) and
check whether it's already in bucket f(a). If it's not, they add it to
this bucket. (So if the same headline a had come in previously, it would
already be sitting in bucket f(a) when they go to check.)
This is the essence of hashing in general: you divide up your data so
that you only have to check a small fraction of it in situations like
this. The hope is that if your function f -- the \hash function"
-- spreads things out nicely, then you're only consulting a 1/n fraction
of your data when you look in a single bucket.
To make all this concrete, one needs to specify the function
f. Here's how they're doing it at MyCrawl. (Despite the long
description that follows for the sake of completeness,
it's actually describing a very standard type of thing one might do.)

Each headline is in the following structured style. It begins with
the acronym of the wire service that carried the headline, then has a
colon and a space, then the text of the headline itself, and then it
ends with the date in the following format: two characters
for the day of the month, then the first three letters of the name of
the month, then four characters for the year, and then the first three
letters of the day of the week. For example:
AP: Turing's biographer wanted Russell Crowe in biopic 16Sep2005Fri
They then convert the text into a sequence of numbers, by mapping
each character to a number as follows. The letters `a' through `z' are
mapped to the numbers 1 through 26 (ignoring whether they're upper-case
or lower-case), the "space" between words is mapped to 27, the numbers 0
through 9 are mapped to 28 through 37, and all other characters
(e.g. punctuation) are mapped to a single "special symbol," which is
encoded as the number 38.
They then convert each number to binary, using exactly six bits per
number, inserting leading 0's if necessary. (So `a' becomes 000001, `b'
becomes 000010, ..., space becomes 011011, and so forth.)
These sequences of bits are then all concatenated together, causing
the headline to be represented as a single bit string. So for example,
the bit string corresponding to the headline above would begin
000001010000100110011011.
Finally, one views this long bit string as a (very large) number M
written in binary notation.
The value f(a) is defined simply to be M mod n.

Now, there's the issue of what to use as the value for n. The MyCrawl
people had read somewhere that it was a good idea in practice to have n
be a prime number. They weren't exactly sure why, but they did it,
choosing the prime number n = 4099, and things worked
very nicely. Profiling indicated that when they went to consult a
bucket, it tended to have roughly (1/4099)th of their headlines.
Then at some point, it's not clear how, someone accidentally changed the
code so that 4099 became 4096. Immediately, performance became much
worse. As far as they could tell, every time they mapped a headline a
to f(a), and then consulted bucket f(a), this bucket seemed to contain
at least 10% of all their headlines.

This is where they were hoping you could help them out. It's no trouble
for them to switch back to n = 4099, but in addition to doing this they
want to know what went so wrong with n = 4096. Is there a simple
explanation for it, or does it hint at some bug deep in the internals of
their system? Explain why they observed a big performance problem with
n = 4096. Your explanation should be based on the information you have
above; if you want, you can make very mild further assumptions about
their system, but the fewer the better. Also, it's not necessary to
explaidn why 4099 worked so well, just why 4096 worked so badly.)