BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR92-1288 
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Using Content-Addressable Search Engines To Encrypt and Break DES
AUTHOR:: Wayner, Peter C. 
DATE:: June 1992
PAGES:: 11
ABSTRACT::
A very simple parallel architecture using a modified version of 
content-addressable memory can be used to cheaply and efficiently encipher and 
decipher data with DES-like systems. The paper will describe how to implement 
DES on these modified content-addressable memories at speeds approaching some 
of the best specialized hardware. The chips can also be used to build a large 
scale engine for exhaustively searching the entire keyspace of DES.
END:: CORNELLCS//TR92-1288 
BODY::
Using Content-Addressable Search Engines
To Encrypt and Break DES
Peter 0. Wayner
TR 92-1288
June 1992
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
Using Content-Addressable Search Engines
To Encrypt and Break DES
Peter C. Wayner
Computer Science Department
Cornell University
Ithaca NY 14853
Abstract
A very simple parallel architecture using a modified version of content-addressable
memory can be used to cheaply and efficiently encipher and decipher data with
DES4ike systems. The paper will describe how to implemeiit DES on these mod-
ified content-addressable memories at speeds approaching some of the best spe-
cialized hardware. The chips can also be used to build a large scale engine for
exhaustively searching the entire keyspace of DES.
The original DES system was designed to be easily implemented in hardware [NBS77]
and the current silicon manifestations of the cipher use modern processor design tech-
niques to encipher and decipher information at about 1 to 20 megabits per second.
Implementations of DES in software for standard CP Us, however, are markedly slower
than specialized chips because many of the operations involved in DES are bit-level ma-
nipulations. As a result, many of the DES-like systems such as Merkle's Khufu [Mer9O]
were designed as replacements that could be easily implemented on conventional hard-
ware.
There is one class of general architecture, however, that implements bit-level op-
erations. The machines like the CM-1, CM-2 and CM-200 from Thinking Machines
Corporation and the Maspar machine all have thousands of one-bit processors. The
designers intended that large number of processors would make up for the deficendes
of the individual nodes.
Another example of this small architecture is now emerging from the labs of mem-
ory designers who are trying to build sophisticated content addressable memory. The
individual processors of these machines are even weaker than the ones of the CM-1, but
they can be packed very densely on a chip. The tiny processors have only a fraction of
the memory of a CM-1 (42 bits versus thousands) and only a one dimensional intercon-
nection network (vs 12), but this is sufficient to implement DES successfully. Moreover,
the packing density (1024 processors per chip) ensures that a significant amount of
parallelism is very cheap.
Implementing the cipher on generalized parallel architectures like this have one
main advantag? cost. Many computer designers often find that the tradeoff between
the speed of a specialized DES chip is often not worth the price. Generalized, content-
addressable machines, however, have many other applications and this makes them a
good comprornise for the system designer. This paper will describe how to implement
the algorithm 011 this architecture and produce results that are 011 par with the best
specialized hardware.
1 Content-Addressable Memory Machines
Standard memory maps an address to a value. Unfortunately, there are many appli-
cations when an algorithm needs to know which memory location holds a particular
value. The only recourse is to search all the memory to find the value in question.
Content-addressable memory is a hardware solution to this problem that will invert
the search and provide the address holding a value in ouly one operation. The tech-
nique has been well-researched over the years and the book by Kohonen [Koh87] notes
many approaches and summarizes some of the more salient research. Several companies
including AMD are making basic content-addressable memory modules.
Recently teams at Syracuse University (some publications include [Old86, OWN87,
OSB87]), MIT and Cornell ([Bri90, WS89, Zip90]) have developed more sophicated
and powerful implementations in silicon. They allow the programmer to chain the
result of several searches together in a simple fashion so that larger data structures
and more complicated searches can be performed in hardware. Some of this hardware
was originally intended to speed up logic programming, but many people have found
surprising and interesting applications for the simple hardware. Oldfield and his team
at Syracuse, for instance, are currently working on compressing data.
A company, Coherent Research Incorporated of Syracuse, New York, is building
sophisticated content-addressable memory chips called tile Coherent Processor for
widespread use. This paper will use their chip as an example because it is commercially
available, but there is no reason why the algorithms cannot be modified slightly for use
on similar chips.
At the basic level, the Coherent Processor is a large, single dimensional array of very
simple parallel processors. Each processor has 42 bits of memory (W?[Oj.. .W?[41], the i
denotes the processor number) and three one-bit registers (R1, R2 and R3). It also has
a processing unit that can execute instructions on the registers, transfer data between
the registers and the memory, communicate with the two neighboring processors or
match a value on the internal bus. The instructions are simple operations that read
three bits of memory and store the result in a third. The match instructions can be
used to simultaneously compare one 42-bit value against the entire array of processors.
If there is a match, then the appropriate value is placed in a register.
The following table shows the basic Coherent Processor instructions and the number
of clock cycles used to complete them.
1. MATCH Simultaneously compare the 42 general bits at each processor with the
values on a bus and store the result of this match in R1. This is used to look up
2
items quickly. The match routine can include wild-card matches for individual
bits so it is possible to match for strings of bits like "OOOO******ll*****? (a "*"
matches both a "0" and a "1"). If you want to move the value of bit Wi[2] into
R3, then you would "match" a pattern with 1 in bit Wi[2j and wild-card matches
specified for the rest and store the result in R3. If the value of bit Wj[2] was 1 in
a particular word, then the match would be successful and a 1 would be stored in
R3. If a zero was in bit W??[2], then the match would be unsuccessful and a zero
would be stored. The values of the other columns would not be affected. Cost:
4 cycles
2. CALC Calculate a threebit function of the three registers and store the result in
a third register. Cost: 2 cycles.
3. READ Take the result of a selected word and place it on the bus. This operation
usually follows a MATCH operation. Cost: 3 cycles.
4. WRITE Move the result from the bus into the selected word(s). Cost: 2
cycles
5. SHIFT The first registers of each word are interconnected. They can shift the bit
in their register to adjacent words in one step. Cost: 2 cycles
6. WRITECOL UMN Moves a bit from a register into one of the 42 bits of memory.
Cost: 2 cycles.
These commands can be strung together to manipulate data in simple and straight-
forward methods.
2 Implementing Plain DES
There are three main operations involved in encrypting a block of 64 bits with the basic
mode of the Data Encryption Standard known as the Electronic Code Book (ECB).
They are 1) permuting the bits, 2) passing a 32-bit block through an s-box and 3)
permuting the key structure. Each of these steps is easy to program on the Coherent
Processor, in a large part because the architecture is so limited. Several features of the
instruction set, however, make implementing the algorithm very easy.
Let the plaintext blocks of data be denoted, ??....., Bn? and the individual bits of
block Bj be ?B?[oj... Bi[63]1. The key is K and the individual bits are ....... K[55].
There are sixteen rounds of encryption and a key scheduling algorithm chooses a
47 bit subset of key bits to be used on each round. Let K(')[o].. .K(1)[47] be the 48
bits used in round 1. Each block of 64 bits is broken into two, 32-bit halves (called BL
and Bn) and in each round the value of one of the halves is mixed with a subset of the
key bits, passed through the s-box and then mixed with the other 32-bit half. More
precisely, in each round:
BL BL e f(E(BR) e
3
("e"=xOR) Then BL and BR are exchanged. The () function is all
function that maps 32 bits into 48 bits so it can be combined with the 48 bits of key.
Some bits are used more than others.
The data to be encrypted is broken into 64-bit blocks and each block is stored in
32-bit halves in two adjacent 42 bit words in the array, W? and Wj+1.
2.1 Permuting the Bits
At the beginning and the end of the encryption process, the 64 bits in the block are
passed through a bit-wise permutation. This step is often considered the slowest part of
many software implementations for general purpose machines and many people believe
that it was included to slow down software implementations and force general CPUs to
move bits one by one. The Coherent Processor must also move each bit one at a time,
but at least this is the best that it can do. In practice, the large number of parallel
processors makes up for the weakness.
Let the permutation be written as a set of cycles: Wi[po]			W?j[pi]			...
Wt[Pi] W??[po]. There are 64 bits to be exchanged, but they do not move in one
cycle. The process can be accomplished by stringing together a chain of bit moving
commands. When the bits to be exchanged are on different words, then the CAM must
also execute a bit-passing command to swap the bit to the adjacent word. The work
can be summarized in pseudo-code:
Move Wj[p0] into a bit
for k:=1 to 63 do
Move Wi[pk] into a bit.
Move W?[pk?1] into its destination.
If Wj[pk] is on the wrong word,
then pass it to the correct one.
Move Wi[p63] into Wi[p0]
There are only 32 bits that need to be shifted between words. It is possible to do
this quickly. The next section which computes the values of the s-boxes is much more
time intensive.
Cost: 129 MATCH or WRITECOLUMN instructions, 32 SHIFT instructions. 580
cycles.
2.2 Computing the S-boxes
The s-box are responsible for providing the non-linear mixing of the bits that is necessary
to provide adequate security. At the highest level, the s-box is a function that maps 32
bits to 32 other bits. The s-boxes used in DES are, though, much simpler and they can
be described as eight functions that take 6 ollt of the 32 bits and return four. Some
bits are used more than others. These eight s-boxes can be further simplified into 32
functions that map six bits to one bit and this is the best level of abstraction to use
when programming the Coherent Processor.
4
Meyer and Matyas [MM82] describe the design of the s-boxes ill terms of minterms,
which are roughly the same as clauses of boolean variables. All equation describing
output of one bit of an s-box might look something like this:
Bj[1] ?Bi[2] Bj[3] Bj[4] + Bj[1] ?Bj[5] ?Bj[6] + Bi[2] Bj[5].
(1)
("."=boolean and, "+"=boolean or, "?"`=boolean not.) There are three minterms
in the example and it is generally believed that the number of minterms in a minimal
expression is one measure the complexity of the s-box. The recent papers by Biham and
Shamir ?BS91] and others, show that there are additional ones that are more important.
Meyer and Matyas note that there are 52 and 53 minterms in the description of each of
the 8 s-boxes.
These minterm descriptions of the s-boxes can be directly converted into operations
for the Coherent Processor . Each clause of variables to be ANDed together can be
computed with a MATCH equation with appropriate set of ones for the variables in the
clause, zeros for the negated variables in the clause and wildcards for the unrepresented
variables. The expression from equation 1 can be encoded:
MATCH"1011 * * * ... * **" R1
CALCR1 R2
MATCH"1***00**...***" ?R1
CALCR1 R2 R2
MATCH" * 1 * **** . . . *
CALCR1 R2 R1
R1
(2)
This takes 6 cycles per minterm. At 53 minterms per s-box and 8 s-boxes per
encryption round, this takes 2544 cycles per encryption round to calculate the values
of the bits. It takes one SHIFT, one MATCH, one CALC and one COLUMNWRITE
to XOR each of the 32 bits into the adjacent word. That is an additional 384 cycles
for 2928 per encryption round. There are 16 rounds in DES, the permutations take 580
cycles and the overall encryption process takes 47,528 cycles.
2.3 Handling the Key
When the result of one of the 32 functions is comp?ted it must be XOR-ed with the
key and then passed to the adjacent word to be XOR-ed with the appropriate bit. The
same key encrypts all the blocks at the same time and it can be included by XORing the
key vector, K('?, into the match words. For instance, assume that "11001100 10101110
01001100 11100101" is the 32 bits of key being used in a round and the minterms from
equation 1 define the s-box equations. Then the operations in example 2 become:
MATCH"0111 * * * ... * **" R1
5
CALCI?1			R2
MATCH "0 * * * **** ... * **"			R1
CALCR1 R2			R2
?A???????????????????? ?R1
CALCI?1 R2			R1
(3)
The same key is used to encrypt or decrypt each block of data ill the simple version
of DES. There are 56 key bits, but only 32 of them are used during each of the 16
different rounds. The bits being used are maintained by the program running on the
general machine that is driving the Coherent Processor . It selects the subset of 32
bits that are used in each encryption and modi?es the s-box functions accordingly.
This method presupposes that the sixteen 32-bit subsets of the keys are precomputed
and "compiled" into the code. This process is non-trivial and certain to cost some time.
When the amount of data encrypted or decrypted per key change is large, then this
"compilation" time is minimal. If the key is changed frequently,then there may be some
signifIcant impact on the encryption times. A better understanding of the effects of this
will need to wait until the software is completely implemented on a working system.
2.4 The Total Cost
The current version of the Coherent Processor will run at speeds up to 50 MHtz. If
an encryption takes about 47,428 cycles, then each pair of words in the processor array
can encrypt about 1,000 64-bit blocks per second. Writing a word into the array and
reading it out takes 5 cycles in total. One chip of the current model has 1024 words
or processors, so it can read in, encrypt and write out blocks of 32K in 52,548 cycles.
This is equivalent to 31.2 megabits per second something that is in line with the
middle range of current DES chips. The Cryptech CRY12C102 data sheet reports that
it runs at 22.5 megabits per second and the Pijnenburg PCC100 gets 20 megabits per
second. Moreover, the Coherent Processor is designed to be easily expanded by linking
together multiple copies of the chip and n chips will n times faster for small numbers
of n. When there are hundreads or thousands of chips, the cost of writing and reading
the information from the Coherent Processor becomes the limiting factor. Coherent
Research reports that the new chip will cost about $100 per copy in small quantities
and substantially less in large ones.
3 Exhaustive Attack 011 DES
When DES was introduced in 1977, some computer scientists protested that 56 bits
were not sufficient because it would be possible to do an exhaustive search of the key
space in a short amount of time using a massively parallel computer. [DH77] In their
book, Meyer and Matyas [MM82] disco?nt that possiblity and predict that it would
just not be physically possible to build the machine until the 1990's because there were
too many physical limitations. Heat and power usage are two major barriers.
6
How easy would it be to build one today? Standard off-the-shelf encryption chips
are plentiful and relatively cheap, but they require a second processor feeding them the
keys and the test cases. Anyone who wants to build such a machine must undertake a
project of building such a large array of distributed computers. This would require a
large amount of custom design work. A truly dedicated attacker could even fabricate
custom DES testing chips which have a built in circuit for incrementing the key by one
bit and testing the result against another register. Only governments could afford a
budget this large. Moreover, the slightest change in the algorithm would render this
machine worthless.
Garon and Outerbridge calculated the approximate costs of desiguing such a ma-
chine and found that it would cost about $129,000 for a machine that would break DES
within 1 year if the machine was built in 1990. [GO91] They assume that it is possible
to build a node that encrypts 2 million key tests for $25 in order to complete such a
machine. They do not describe the details of how to design the board or manufacture
it is sufficient quanties.
The Content Addressable Memory array chips, however, are designed to be built
into large parallel arrays of chips. It is already possible to buy a board for a PC which
has 64 chips of a previous model of the Coherent Processor. Large arrays should not
be hard to create. Moreover, the algorithm is implemented in software, so the machine
can also be used to attack many other subtle and not-so-subtle variations of DES.
What is the best way to do an exhaustive search with the current architecture of
the Coherent Processor? The version described for simple encryption and decryption is
able to work very quickly because it can encode the key in the stream of instructions fed
to the Coherent Processor. This approach must be abandoned because an exhaustive
search of the key space requires that each processing node must use a different key.
One alternative is to store the key bits in the 10 extra tag bits stored at each node.
Two nodes are used to hold the two 32-bit half-blocks of each case, so there are up
to 20 extra key bits which can be stored at each node. Let there be 2? processors in
the machine. That means there are 2fl--Hi potential keys that can be tested with each
round because two nodes are used for each encryption. Assume that n < 21 and the
problem does not overflow the physical space of the real machine. (Later versions of the
architecture could have more free bits available.) At each pair of nodes, store a unique
set of n --H 1 key bits. These bits will be used by this pair of nodes alone. The other
56 --H (n --H 1) bits are shared by all the instances and they are encoded in the instruction
stream as before.
At the beginning of each round of encryption, the local key bits must be XOR?-
ed into the appropriate half-block of bits before that half-block is passed through the
s-boxes. These four or five instructions will XOR. in the key bit Kj in to position B3:
MATCHK1
SHIFT
MATCHTh' R2
CALCR1XORR2 R2
WRITECOLUMNR2 Bj
7
(4)
The SHIFT instruction is only necessary if the key bit is on the opposite node from
the destination bit. This process is repeated at the end of the s-box calculation to
remove the bit from the data. Only 32 of the 56 key bits are used at each round, but
it is possible that up to n --H 1 of these bits will come from the bits stored locally. The
operations in equation 4 take 16 cycles. They must be repeated 2n --H 2 times for each
round. The result takes 512n --H 512 extra cycles for each encryption. If a machine was
bullt with a full complement of 221 processors, then it would take 57,126 cycles to test
220 potential keys. This step must be repeated 236 times and the machine is capable of
doing about 875 of these tests per second or about 76 million per day. Exhausting the
entire space would require 904 days. If the well-known trick of exploiting symmetry in
the keys is used to reduce the key space to 2?? keys, then one machine will test all in
452 days.
How much would such a machine cost? There are 210 processors on a chip that will
cost between $30 and $100. 211 chips are necessary and this would cost between about
$60,000 and $200,000. Control hardware would add additional $10,000 to $20,000. 45
machines would cost about $3 million dollars and exhaustively search the space in 10
days or $30 million to search the space in 1 day with 450 machines. I'm assuming that
volume discounts would apply at this scale. This is a price that should apply at the
end of 1992 when the chips become widely available.
The standard assumptions about time and transistor density should apply to this
model as well. It is entirely conceivable that we will see large changes in density and
price of these machines in the near future and this should prove to be a major threat
to the security of DES and by logical extension, the UNIX password system. This large
machine made up of CAM is reprogrammable so it is possible to retool it to mount
an exhaustive search on DES-like system where either the s-boxes or the scheduling
algorithms were constrained.
4 DES with Modified Chaining
The last several section described how to encrypt a large block of data in parallel using
a simple DES with no feedback. A more robust version of DES feeds the result of
encrypting each block into the key selection of the next block. Let Ej = f(K, Bi)
represents the cipher text blocks. A feedback cipher sets Ej = f(K, Bj ? Ej?1). "0"
represents boolean XOR. E0 is set to a pre-arranged constant. This process is called
Cipher Block Chaining (CBC).
The modification adds a great deal of strength to the plain DES because it reduces
the redundancies that can developed if there is an 64 bit block that occurs often in the
plaintext. The feedback mode ensures that a different value will permute each block
and obscure the redundancy. It should be obvious that this system cannot be used
when all the blocks are computed in parallel. Here is a modified version of chaining
that can be implemented in parallel.
One solution is to exchange and XOR bits with neighbors at the end of certain
rounds of encryption. In round 1, the left half of each block is used to compute the
8
value XORed into the right half. After this, the left blocks are exchanged with the
neighboring blocks and XOR'ed into the right halves of the neighboring block. This
can be done with pseudo-code like this. Wi is the left half and Wj+1 is the right half.
for k:=O to 31 do
MATCH Wi[k]
CALC COPY R1
SHIFT
SHIFT
CALC XOR R1R2 R1
WRITECOLUMN R1
This command shifts one bit to the next pair of words over and XOR's it with the value
of a neighboring block. It takes 16 cycles per bit to do it. This can be repeated as
often as desired at the cost of slowing down the entire encryption. Doing this at the
end of each round of encryption costs 8,192 cycles and this slows the encryption rate
to 27.0 megabits per second. In this case, a change in block Bj will propigate through
blocks Bj to Bj+16 and effect their encrypted values. Arbitrarily complex shifting can
be included as long as care is taken to ensure that the results can be reversed. If this
step is done often in the process, it can effectively turns the encryption into one large
block at a small decrease in speed.
5 Conclusion
This paper has shown that a simple architecture intended for information storage and
retrieval can encrypt and decrypt messages faster than many of the best specialized
chips. Naturally, the results show that the process can be extended to other DES-like
systems. The only problem is expressing the s-boxes so they can be implemented with
minterms.
Chips like the Coherent Processor also makes it very easy to create a large-scale
processor for exhaustive cryptanalysis of the key space because the chips were designed
to be grouped together in a large array. The hypothetical machine described here
is much different from the other machines described in the literature because it is
substantially closer to being created. The next several generations of this architecture
should pose a large threat to wide-spread use of the old DES.
There are several changes to the Coherent Processor that would make it better able
for encrypting DES. Currently, the key is "compiled" into the program for the CAM
and this may be a non-trivial event. If future versions of the architecture have more
that 42 bits per word, then it could be practical to store the key locally and add the
key in bit by bit as it is done in the brute force attack. Mso, the current version of
the Coherent Processor will only compute 3 bit functions. 4 or 5 bit functions may be
quite practical and they would speed the results of the process.
This draft of the paper represents a preliminary look at the problem of implementing
DES on the special CAM architecture. Better numbers will be available when working
hardware is available.
9
