MathBus

[ Term Structure | Algebra | Geometry | Logic | Drawing ]
[ Language Bindings | C++ | Java | Lisp | ML ]


Additional Term Structure Functionality

This section documents the basic libraries provided in each programming language. These libraries include a hash code library that computes a canonical, architecture invariant hash code for term, a stream I/O library that converts a term structure to and from a byte stream, and an iterator library that allows one to apply a function to each sub-node in a term structure.

Hash Coding Mechanism

A very common mechanism used in symbolic computation is to use a hash code obtained from the expression as an index for quick comparisons, searches and generally to manage collections of symbolic expressions. To simplify use of the MathBus term structure, a hash coding algorithm is specified and an implementation provided. This algorithm is specified so that hash codes computed on one machine may be used by other machines.

There are two parameters to the hash coding algorithm, the maximum depth of the MathBus term that is examined and the maximum sub-terms examined of each node when computing the hash code. For this release of the MathBus term structure both of these constants are equal to 7. In future releases, when we have more experience with the hash coding structure the may be changed to produce better hash codes or to make the hash coding process more efficient.

The hash coding algorithm is quite simple. The hash code of each of the sub-terms of the node is computed recursively. The 32-bit hash codes of these sub-terms are then rotated (low bit positions to high bit positions) according to their position. Thus the hash code of the sub-term indexed by zero (the numeric node label) is left unchanged, that indexed by one is rotated one bit position to the left, that indexed by two, two bit positions and so on. (Note that the hash code of the thirty second sub-term is left unchanged, and that of the thirty third is only rotate by a single bit position.) These values are then pair-wise logically XOR'ed and the resulting 32-bit value is returned.

During the traversal of the term, certain special node labels are given special handling. When nodes labeled MBVariable are encountered, then the meaning of the variable is hash-coded, not the MBVariable term. Thus the hash-code of a directed acyclic graph will be the same as that of the corresponding tree, with shared nodes replicated. Nodes labeled Bind have the same hash code as their body terms.

Stream Based Conversions

The tools discussed in this section transform MathBus terms into an architecture independent sequence of printable ASCII bytes. These byte sequences can be stored in files, embedded in mail message or exchanged among processes written in different programming languages and/or running on machines with different architectures. The MathBus terms can then be reconstructed from the byte streams. All implementations of the MathBus term structure, in all programming languages (C++, Java, Lisp and ML will be soon available), will use the same byte stream specification.

The MathBus byte stream has three components: a fixed size header structure, the body of the stream and a fixed size trailer. The header of the byte stream identifies byte stream as a MathBus byte stream and provides some initialization information. The byte stream's trailer contains check-sum information. The body represents a modified version of the term that is being transmitted. Because the term that is being transmitted may refer to information that is maintained in the local registry, this information must be in included in addition to the term itself. The process by which this is done is described in the section Preparing Terms for Transmission.

Since all MathBus terms are represented by trees, the byte stream can be processed by a stack based process, which we call the byte stream processor. When generating a byte stream processor converts the MathBus terms into a sequence of 32-bit words. These 32-bit words are coded into a compact sequence of 8-bit bytes, where small integers are efficiently coded. Finally, this sequence of bytes is converted into a sequence of printable ASCII text (Base64), which is the final form of the byte stream. In section 32-bit Structure of Byte Stream we discuss how the MathBus term structure is converted into a sequence of 32-bit integers, and section Base64 Conversion we discuss the final conversion to Base64 format.

Preparing Terms for Transmission

When transmitting a MathBus term, which we call the content term, any references to information contained in the local registry must be extracted and included with the term that is being transmitted. In addition, any references to global variables contained in the context must also be de-referenced. This is done by making the content term the last sub-term of a Sequence node, where all of the preceding sub-terms are RegistryStoreLocal or GlobalAssign terms.

The information that appears in this sequence is generated as follows.

Node Type Action
Local Node Label Determine the corresponding symbolic label and include two RegistryStoreLocal commands-one each for the String and one SubTypes entries.
Local String Identifier A RegistryStoreLocal command is included to indicate that a local string identifier has been encountered.
Global Variable Include a MBGlobalAssign command. At some point, the value included could be a RemoteNode so that the data would only be transmitted on demand.

When a term is received, the RegistryStoreLocal commands are handled a little carefully. However, if the registry type is String then local numeric string identifiers must be allocated and the numeric values present in the transmitted term must be converted to the local numeric values on the receiving machine. If the registry type is SubTypes then the three sub-terms are immediately passed to the function RegistryStoreLocal, since no translation is necessary.

[Still need to deal with contexts. Otherwise where will the GlobalAssign's go?]

32-bit Structure of Byte Stream

The header structure consists of three 32-bit unsigned integers. The first is the constant 0xFADEBAC0, which identifies this as a version 1.0 MathBus term. The second of these numbers is at least as large as the number of sub-terms at any node of the MathBus terms about to be transmitted. The third is at least as large the maximum depth of any MathBus term to be transmitted. These two numbers may be used to initialize the byte stream processor's internal data structures.

The body of the byte stream is a sequence of 32-bit words that represent a single MathBus term. (Multiple terms can be handled using the OrderedCollection node and perhaps GlobalAssign terms.)

Conversion of a term proceeds in a depth first fashion. First two 32-bit words are emitted. The first contains the node's label and the second is the number of sub-terms of the node. Next the byte stream for the first sub-term is emitted, followed by that for the second sub-term and so on. The leaves of a MathBus tree are 32-bit unsigned integers and are emitted without any transformation. For instance, assume the node labels LIST1 and LIST2 are associated with the numeric values 0x10FAD100 and 0x10FAD200 respectively. Ignoring the header and trailer information, the byte stream used to represent the term

(LIST1 0xABCD (LIST2 0x1234 0x5678) 0xCAFE)

would be:

< 10FAD100 00000003 
           0000ABCD
           00FAD200 00000002 00001234 00005678
           0000CAFE >

where all of the numbers above are in hexadecimal. Which sub-terms of a node are integers and which are themselves nodes can be determined by referring to the SubTypes entry of the node label in the registry.

Throughout the processing of the incoming byte stream, a 32-bit checksum is maintained. This checksum is the last four bytes of the byte stream. Actually, we should probably use the NIST Secure Hash Standard for the checksum/digest instead. This is more robust and it is standard.

Base64 Conversion

The sequence of 32-bit words generated by the above process is first converted into a sequence 8-bit bytes and then the byte sequence is converted into a sequence of ASCII printable characters which is the final form of the byte stream. The simplest approach would be to convert each 32-bit word into a sequence of 4 bytes, but this would be inefficient since so many of the 32-bit words have a large number of leading zeros. Recall that each node contains a count of the number sub-terms in a node, and this is most often quite small. Consequently, a simple coding scheme is used to minimize the space used by small 32-bit numbers. The precise conversion is given in the following table.

32-bit number byte sequence
0x0000 0000 - 0x0000 00FC B1B2
0x0000 00FC - 0x0000 FFFF FD B1B2 B3B4
0x0001 FFFF - 0x00FF FFFF FE B1B2 B3B4 B5B6
0x0100 FFFF - 0xFFFF FFFF FF B1B2 B3B4 B5B6 B7B8

 

The binary byte stream just discussed is then encoded using the MIME (RFC 1563) Base64 coding scheme. This scheme codes sequences of three 8-bit bytes into four 6-bit bytes that are represented as a printable characters. Although this approach increases the size of the byte stream by 33% it also ensures that the byte stream will not be corrupted by operating system specific conventions such as the end of line convention used. The mapping of six bit bytes into printable characters is given by the following table.

  00 01 02 03 04 05 06 07
+00 A B C D E F G H
+08 I J K L M N O P
+16 Q R S T U V W X
+24 Y Z a b c d e f
+32 g h i j k l m n
+40 o p q r s t u v
+48 w x y z 0 1 2 3
+56 4 5 6 7 8 9 + /

In addition, an "=" is used to indicate situations when the sequence of eight bit bytes is not an even multiple of six bit bytes. In particular, if the number of bytes in the sequence is exactly a multiple of 3 then nothing is added. If there is one 8-bit byte left at the end of the 8-bit byte stream then a single "= " is appended. If there are two characters left, then two "=" characters are appended.

Iterator Structure

The iterator mechanisms provides tools that allow one to apply a supplied function to every sub-node in a term. At each invocation information is made available about bound variables. At the moment the we do not have a language independent specification of the different iterator structures, but there are specifications of the iterators that we have implemented in the language specific specifications elsewhere.


Copyright Cornell University
For problems or questions regarding this web contact
[ SimLab project].
Last updated: November 03, 1997 by .