Dalí: An Overview

This document provides an overview of the Dalí Virtual Machine (or more simply, Dalí). Dalí is a high performance library of routines for manipulating video, audio, and image data. This text describes the design of Dalí, not its current implementation. As such, there may be places where the current implementation does not match the design.

Getting into the right mindset

To understand Dalí, you have to get in the right mindset. Dalí was designed to be the run-time library for the next version of Rivl. The goal of the Rivl project is to design a programming language where video, audio, and images are data types in the language. The first version of Rivl was written as an interpreter. In the next version, a compiler will read a Rivl program and compile it, producing a program written in some virtual machine language. This language is Dalí. The Dalí language is a low level, high performance language with abstractions and operators to store and manipulate audio, video, and images. Since we didn't want to write yet-another-byte-code-interpreter for the Dalí Virtual Machine (VM), we decided to make Dalí an extension of a traditional programming language. Thus, the implementation of Dalí is a set of library routines written in portable C. A binding to the Tcl language is also available, and we are currently designing a Java binding.

These design goals led us to make different design decisions than the designers of more traditional image/audio/video libraries. First, Dalí primitives are lean and mean. They are designed for maximal performance, not ease of use. Thus, they provide almost no type checking and generally do not handle boundary cases, since such checking would slow down the code. Instead, the we expect the compiler (or Dalí programmer) to do such checks.

Second, Dalí primitives are designed to give predictable performance. This means that functions whose performance is difficult to predict, such as I/O operations, are separated from computationally intensive functions.

Third, we wanted to give the programmer full control over resource utilization. Thus, programmers explicitly allocate and free resources such as temporary image buffers. This design also helps lead to more predictable performance, since the amount of memory allocated is entirely under programmer control so paging can be minimized.

Fourth, we wanted composable primitives -- primitives that did enough work so that the function call overhead was small, but could be composed in interesting and unforeseen ways. Thus, Dalí's data abstractions have minimal semantics. Instead, the semantics are in the operators.

This split is analogous to conventional assembly languages where the type of data stored in a register depends on the context in which it is used. Depending on the operator that uses the register value, a register can store an address or an integer (or both at the same time). This same design paradigm is used in Dalí. For example, Dalí has no abstraction to represent color images (or even color pixels). Instead, multiple ByteImages (2D arrays of numbers) are used to represent color images. Thus, to represent a YUV image, 3 ByteImages are required, one for each color plane. Different sampling rates (e.g., 4:2:2 vs. 4:2:0) are implied by the operators. For example, one Dalí operator converts from 4:2:0 YUV color space to RGB. This operator takes 6 arguments (three planes for the YUV image, and three for the RGB image), and the images that represent the U and V planes are half the width and height of the images that represent the Y, R, G, and B planes.

These design decisions allow programmers to easily write very fast code. For example, in the current implementation of Dalí, an MPEG decoder can be expressed in about 100 lines of code and it runs 20% faster than hand tuned C code (i.e., the Berkeley MPEG player, mpeg_play).

Overview of the Abstractions

Now that you're in the right mindset, the next thing you need to understand is the abstractions Dalí provides (just as you must understand the concepts of register and memory to understand a CPU). Once you know the abstractions, you can look at the operators (like the opcodes in a traditional assembly language). There are two major kinds of abstractions in Dalí: data abstractions (like images) and operator abstractions (like lookup tables). We will first give an overview of each, and then describe them in some detail.

The Dalí data abstractions are Images, Audio Buffers, Bitstreams/BitParsers.

Audio Buffers are used to store audio data. That is, they contain a one dimensional array of samples. Each sample can be an 8- or 16-bit value and can be a mono or stereo value (the details are below).
Bitstreams represent an uninterpreted array of bytes. They are typically used as buffers for the input and output of external representations (e.g., JPEG or MPEG data). To access the Bitstream data, you usually must attach a BitParser to the Bitstream. The BitParser provides a cursor into the Bitstream and very fast bit- or byte-at-a-time read/write functions.
Images represent, well, images. That is, a two dimensional array of pixels. Dalí currently has four kinds of images.

In a ByteImage (BitImage), each pixel is a number in the range 0..255 (0..1).
In a VectorImage, each pixel is a motion vector (VectorImages are typically extracted from an MPEG bitstream).
In a DCTImage each pixel is an array of 64 DCT coefficients. DCTImages are typically extracted from MPEG or JPEG images. In the current implementation, DCTImages are called "SCImages".

In addition to these basic data abstractions, Dalí also contains a few other abstractions to store data needed by the operators. For example, to adjust the contrast of an image, a Lookup Table (LUT) is normally used. Thus, Dalí has an abstraction to represent a LUT. Dalí has four such operator abstractions: Image Lookup Tables (ILUTs), Audio Lookup Tables (ALUTs), Image Kernels, and Audio Kernels.

ILUTs (called ImageMaps in the current release) are one dimensional arrays with up to 256 values. An ILUT can be applied to a ByteImage to produce a new ByteImage. ILUTs are used to represent colormaps in GIF images and used for operations like histogram equalization and contrast adjustments. ILUTs can be composed, so that when a computation requires the application of two or more ILUTs in a row, they can be combined into a single ILUT (which is much faster than applying the separate ILUTs).
Image Kernels (called Kernels in the current release) represent convolution kernels. They can be used for operations like smoothing, edge enhancement and detection, etc.
ALUTs and Audio Kernels are lookup tables and convolution kernels for Audio Buffers. Audio Kernels are not implemented in the current release.

Overview of the Operators

Finally, Dalí operators create and modify these abstractions. Dalí operators are grouped into packages. Some of the packages are:

ByteGeom: operators that scale and warp byte images
ILUTs and ALUTs: lookup tables for Images and Audio Buffers
Image Kernels: convolution kernels for Byte Images
Color conversion: converts between color spaces
Parsing: functions that read/write external data formats.

The operators in first four packages are straightforward -- they take the required operands and perform the requested computation. One interesting feature of these operators is specialization, where a class of operations is implemented as fast, special purpose operators and slower, more general operators. The combined operators can perform any function in the class, although no single operator can. Scaling is an example of a specialized operator. Suppose you wanted to shrink an image by a factor of 5. Instead of providing a single operator that shrinks an image by an arbitrary factor (which would use complex image filtering if implemented correctly), Dalí provides several special purpose operators that shrink the image by a factor of 2 or 4 (which can be implemented using only adds and a shift), and a general scaling operator that can shrink the image by a factor between 1 and 2. The latter operator uses bilinear interpolation to calculate the values of the intermediate pixels. Thus, to scale by a factor of 5, you would use the special purpose operator to scale by a factor of 4, and then apply the more complex operator to scale by the remaining factor 1.25. Since the latter function is processing much less data, the overall operation runs very fast.

The design of the Dalí codec functions is also interesting. We will explain their operation in the context of decoding a GIF bitstream (i.e., an chunk of memory containing a GIF image). This bitstream contains three structural elements: a header, some color tables, and compressed color-mapped image data. For each structural element, Dalí provides five generic functions: parse, find, skip, dump, or encode. Thus, the gif_img_hdr_parse operator parses the GIF header, the gif_img_skip operator skips the GIF image data, and the gif_img_ct_dump dumps the color table.

The parse operator decodes the element into a structure (typically a C data structure).
The skip operator skips over the element, advancing the Bitstream cursor to the next structural element
The find operator advances the Bitstream cursor to the specified structural element. For example, gif_img_ct_find advances the Bitstream cursor to the GIF color table.
The dump operator copies a structural element from one memory location to another without interpreting it.
The encode is the inverse of parse: it encodes a structure into a Bitstream

Dalí contains codec functions to parse, skip, dump, find, and encode structural elements in MPEG video, MPEG audio, MPEG system, JPEG, GIF, PPM, and WAV data files. All these functions use Bitstreams and Bitparsers to store the encoded data. Recall that a Bitstream is just an uninterpreted arrays of bytes, and that a Bitparser store the cursor position within a Bitstream. It is the responsibility of the programmer to load the data into the bitstreams.

Another interesting feature of the Dalí codec functions is how they handle interleaved streams. For example, consider the problem of extracting a MPEG video stream from an MPEG system stream (i.e., demultiplexing the stream). MPEG system streams contain interleaved audio and video data streams. In Dalí, one would use an index structure to do the demultiplexing. An index is a sequence of (offset,length) pairs. Dalí has a function to copy one Bitstream to another using an index. This function only copies the subset of the defined by the index. In addition, Dalí contains functions to parse an MPEG systems stream to produce an index which indicates the location of all the video packets in the bitstream. Thus, to demultiplex an MPEG systems stream, one would read a block of MPEG data into a Bitstream, calculate an index that defines the locations of the video packets within that Bitstream, and copy the relevant blocks into another (output) Bitstream.

Now that you have an overview of some of the Dalí structures, we will describe these abstractions and operators in more detail.

Image Data Types

The Dalí VM has 4 ways to represent images:

A BitImage represents a 2D array of 0/1 values.
A ByteImage represents a 2D array of values in the range 0..255, inclusive.
An DCTImage represents a 2D array of DCT blocks.
An VectorImage represents a 2D array of MPEG motion vectors.

All images can be either physical or virtual. When a physical image is created, a chunk of memory is allocated for its pixels and a chunk is allocated for a "header" structure, as shown in the left figure, below. One field in this header is a pointer to the pixel data of the image. A virtual image can be created from a physical image to represent a rectangular region of the physical image's pixel data. When a virtual image is created, data is allocated for the header (to indicate the position and size of the rectangle, and the width of the parent buffer), but the memory is shared with the physical image. Thus, writing data into this virtual image will modify the pixel data of the parent image (in the obvious way).

wpe5.gif (3843 bytes)

Virtual images have three advantages. First, they can be used to avoid unnecessary data copies. Second, they allow one to reuse allocated memory as much as possible, which leads to better caching performance. Third, they can be used to avoid conditionals in the inner loop of operators. To see why, consider the operation of smoothing an image. The value of the pixel at coordinates (i,j) in the output image are given by

out[i,j] = in[i,j]/2 + (in[i-1,j] + in[i+1,j] + in[i,j-1] + in[i,j+i])/8

Now, the question is, what happens in the first row (i=0)? Since the formula refers to in[i-1,j], the formula is undefined and will probably produce a memory error. A conventional implementation must check for this, and might use 0 as the value of the undefined pixel. Such checks are expensive if they occur in the inner loop of the execution (note: although you could easily pull this check outside the loop, there are other cases, such as MPEG decoding, where this optimization is impossible).

In Dalí, one would handle it differently. Suppose the image is 256x256. You could allocate a 258x258 physical image, and a 256x256 virtual image with a one pixel border around it. If you initialize the border to black, then you can simply compute the convoultion using the virtual image without any checks in the inner loop.

BitImages

Internally, the pixels in a BitImage are packed into an array of 8 bit values. Basic functions are provided to read and write bit images form PBM files, to AND and OR bit images, and to create a bit image mask from a byte image (so that all pixels within a specified range in the byte image have a 1 in the corresponding pixel in the bit image).

The primary use of bit images is as a mask. Two operations, bit_copy_with_mask and bit_set_with_mask can use a bit image as a mask, to determine which pixels in the output are affected.

ByteImages

ByteImages are simply an array of 8 bit values. Basic functions provide I/O, initialization, geometric transformations, lookup table transformations (e.g., contrast adjustment or histogram equalization), and convolutions.

Color images are represented by 3 ByteImages. For instance, an RGB image is easily represented using 3 ByteImages of equal size. Images where some components are subsampled (e.g., 4:2:0 YUV data) are represented using 3 byte images of differing sizes. Functions are provided to convert between color spaces.

DCTImages

DCTImages are the basic image type for representing the blocks of DCT coefficients, as used in block based compression schemes such as MPEG and JPEG. Each "pixel" of an DCTImage is an SCBlock. Thus, an DCTImage is a factor of 8 smaller in width and height than its corresponding image. An SCBlock is an array of 64 DCT coefficients, stored as (index,value) pairs, plus several flag values. The DCT values are dequantized and independent of other blocks (i.e., the DC coefficient is the true DC value, not the difference from the previous block). Functions are provided to convert DCTImages to pixel images. DCTImages are used in conjunction with the MPEG and JPEG package.

VectorImages

VectorImages represent a 2D array of motion vectors. Each motion vector is contains an (left,down) pair, and a flag to indicate whether the pair has been correctly initialized. This flag is required because blocks in MPEG P frames can be intra-coded (i.e., I-blocks), so the corresponding motion vectors are meaningless. Similar concerns hold for MPEG B-frames. VectorImages are used in conjunction with the MPEG decoder.

Audio Buffers

Dalí defines a single abstraction for storing and manipulation of audio data, an AudioBuffer. It can be used to store either mono or stereo, 8-bit or 16-bit audio data. Internally, stereo data is represented in an interleaved format.

Audio buffers can be either physical or virtual. When a physical audio buffer is created, a chunk of memory is allocated for its samples and a chunk is allocated for a "header" structure. Part of this header is a pointer to the audio samples of the buffer. A virtual audio buffer can be created from a physical audio buffer to represent a segment of the physical audio buffer. When a virtual audio buffer is created, data is allocated for the header (to indicate the position and size of the audio data chunk), but the memory is shared with the physical audio buffer. Thus, writing data into this virtual audio buffer will modify the audio data of the parent audio buffer.

There are two separate sets of commands for 8-bit and 16-bit audio respectively. Corresponding command should be invoked in order to get correct results.

BitStreams and BitParsers

Dalí uses the BitStream and BitParser abstractions to support input-output operations. A BitStream is a chunk of memory that represents input-output buffer. A BitParser provides the functions needed by the primitives to read and write bit-at-a-time or byte-at-a-time quantities from a BitStream. For example, a programmer could decode a large MPEG video file by creating a BitStream that memory-maps the file into virtual memory, attach the BitStream in a BitParser, and then pass the BitParser to the MPEG decoding functions.

BitStreams interface with the actual i/o system (such as file and networks) to fill up or flush the buffer. Filters can be applied to access only a portion of the input (e.g., the video portion of an MPEG systems stream). Functions are provided to read and write filters, and various media parsers can construct filters (e.g., the MPEG systems stream parser constructs a filter so that the audio and video portions of the bitstream can be independently accessed).

BitStreams can be cast into other types of object, such as ByteImages (for a PGM file) and AudioBuffers (for PCM audio data). Casting avoids unnecessary data copies.

Data Format Support

MPEG support

Dalí has support for decoding MPEG audio and video bitstreams. An MPEG video bitstream starts with a sequence header, followed by a sequence of GOPs (group of pictures). Each GOP is a GOP header, followed by a sequence of pictures. Each picture is a picture header followed by picture data. The Dalí-VM provides functions to find, skip, decode, or dump any of these structural elements. The find functions search the bitstream for the next occurrence of the structure element (e.g., the mpeg_pic_hdr_find function finds the next picture header in the bitstream). The decode functions parse the structural element from the bitstream and return the structure (e.g., the mpeg_seq_hdr_decode function decodes a sequence header from a bitstream). The skip functions move the bitstream past that structural element, and the dump function copies that structural element from one bitstream to another. Finally, a function is provided to return the type of the next structural element in the bitstream, and to find the next structural element (regardless of its type).

The functions to decode pictures return DCTImages for the Y, U, and V color planes, and VectorImages to represent the motion vectors. Functions are provided to convert I, P, and B frames to Y, U, and V ByteImages.

For instance, to copy the second and third GOPs from one MPEG bitstream to another, a programmer would create two bitstreams (one for the source and one for the output), dump the sequence header from the source bitstream to the output bitstream, skip the GOP header, find the next GOP header, and then dump the following two GOPs. Dumping a GOP requires dumping the GOP header and all the intermediate picture headers and pictures. This type of operation is extremely fast in Dalí. For instance, a function to count the number of frames in an MPEG bitstream operates at about 15 times real-time speed on a 266 MHz Pentium II.

Decoding audio streams is similar to decoding video streams. An MPEG audio stream is a series of audio headers and layer 1, 2, or 3 audio frames that contain compressed audio data. Primitives are provided to skip, dump, parse, and find these elements.

MPEG Systems streams are supported using the TOC (Table of Contents) abstraction. A TOC contains information about multiplexed stream (video or audio) in a MPEG system stream. Information in a TOC includes:

offsets and lengths of every packet for every stream
packet header of every packet for every stream
number of video stream and audio streams in the system stream.

A TOC can generate a BitStream filter (described above) to represent any stream in the sequence.

JPEG support

The design of the JPEG decoder is similar to that of the MPEG decoder. A JPEG stream consists of a picture header followed by one or more scans. Each scan contains a scan header followed by scan data. The scan data contains one or more encoded components. Functions are provided to find, decode, skip, and dump these structural elements.

WAVE support

Dalí provides functions to parse and encode header information. The contents of the WAV files can be read into a BitStream, and then cast into a AudioBuffer for processing.

GIF Support

PNM Support

AVI Support

Usage Note

Cornell University specifically reserves all commercial rights to the Dalí software, and these rights may be licensed by Cornell Research Foundation to third parties. Certain of these commercial rights have been licensed by Cornell Research Foundation to a third party for a specific application of Dalí.

Last Updated :