This is a tutorial and an online demonstration of PREDATOR.
It leads you through
several examples of database queries you can run on a PREDATOR
1.0 server.
This tutorial should help you figure out the syntax for various commands
and acts
as the user manual for the system. We expect to keep modifying this
tutorial as
new functionality is introduced into the system. Details of design
decisions and
implementation issues are in a different document.
PREDATOR
is a client-server database system; most of the database functionality resides
in the server. If you click on the query buttons, you can interactively run them
against a
sample database maintained on a server at Cornell. The results of the
interactive
version of the tutorial will differ slightly from the text of the
tutorial, because all the
tables are pre-created and the data is pre-loaded.
Update, create, delete and drop
statements are not provided with buttons.
In order to try these examples on your local installation, start a server and
then use
one of the provided client programs to connect to the server. Type in
any of the
sample SQL queries and commands. Most of the examples in this
document are also
provided as part of the regression test scripts in the
release. The regression tests
provide more exhaustive examples of various
features.
In order to use PREDATOR, you need to take three steps: start a server, connect
a client to it, and type SQL commands at the client.
Client-specific functionality is described in the individual client tutorial
pages listed
below. This tutorial will only describe the simplest client, the
"Text Client".
The Text Client Program is a very simple command prompt SQL interface to the
PREDATOR
server. You can start the Text Client Program in its own window by
typing
txtclient -s <port number> [-h <server host name>]
The port number you supply should be the port number of a running PREDATOR
server. If no host name is provided, the server is presumed to be running on the
same host as the client. If the connection to the server is made, you will see
the "Q>" prompt at the client, while there will be some diagnostic
messages at the
server window to indicate the connection establishment. Type
"list rels;" at the
client. If the appropriate environment variables were provided to the server,
you should see the names of the catalog tables
on the data device. Multiple clients can
connect at the same time to the same
server. They must all connect to the same
server port. All commands typed into
the Text Client Program must be terminated
with a semi-colon ";". You
can terminate the client connection by typing "_quit;".
The most useful initial command is "list rels" -- the sample run is shown below:
Q> list rels;
_STABLES _SATTRS _SSTATS _SASTATS _SINDXS _SVIEWS
This tells you that there are six tables present (all of which happen to be
default
catalog tables created automatically by the system). If you want to see
the schemas
and catalog information for all the tables, type "list cats;".
These two are very useful
commands, and we recommend that you try them at
various stages through the
tutorial to understand the effects of various
commands on the underlying tables.
Other useful commands are:
list views;   ----- lists all views defined in the system
list schemas; ----- lists all schemas of tables
list env;     ----- lists the execution environment settings
New tables can be created using the CREATE TABLE command.
CREATE TABLE Foo (field1 int, field2 double) KEY(field1);
This creates a table called Foo -- check using "list rels" to see
that it was indeed
created. You can also check the schema of the new table using
"list cats". The SQL
parser is case-insensitive with respect to
keywords, but table names and field names
are case-sensitive. The key of this
table is "field1". Physically, this means that a unique
index is built
on this table on "field1". The current support for keys is very
minimal ---
keys can be specified only on a single attribute, and only on
integer, double or string
data types. A table can also be created without a key
specified.
CREATE TABLE Bar (field1 int, field2 double);
The fields or attributes of a table can belong to any of the data types known
to the
system. The following primitive data types are supported:
int:     4 byte integer
double:  8 byte double precision floating point
char[X]: fixed length (X) character string
string:  variable length character string (up to XXX_TEXTSTRING_MAX: 5000 chars)
boolean: boolean value
The type system is extensible and new data types can be added. The current
release also has several other data types, as described later in this tutorial.
CREATE TABLE Locations (field1 int, field2 polygon("1000"), field3 string);
list rels;
Here, we create a table called Locations, which has a field of a polygon data
type.
Note that the polygon field has additional meta-information which
specifies further
information about the polygons -- in this case, it tells the
system to expect an average
of about 1000 points in each polygon.
Partial support for NULL values has been added; NULL values can be created and
stored, but the specification of certain fields as "NOT NULL" is not
supported.
Further, indexes may not work correctly on NULL values.
An index can be created on any existing table, whether data has already been
loaded
into it or not. PREDATOR
supports extensible indexing --- new kinds of indexes can
be added to the
system. Therefore, when creating an index, the nature of the index
needs to be
specified too. Currently, two types of indexes -- B+ tree ("B") and R*
tree ("R") --- are supported. If the index type is not specified, it
is assumed to be a
B+ tree.
CREATE UNIQUE B INDEX Loc_fld1 ON Locations (Locations.field1);
CREATE B INDEX Loc_fld1plus ON Locations (Locations.field1 + 100);
CREATE R INDEX Loc_fld2 ON Locations (Locations.field2);
CREATE B INDEX Loc_fld3 ON Locations (Locations.field3);
B+ tree indexes can be created on integer, string, and double fields. They
can also
be built on any expression that returns one of these types. Note that
the fields must
be prefixed by the table name in the indexing expression. R*
trees can be built on any
data type that provides a bounding box. This is
defined in the ADT method interface,
and we will not go into the details here.
Many of the geographic data types like point,
polygon and box support this
capability. All indexes can be defined to be "unique" ---
this means
that duplicate index key values will cause an insertion error.
There are two ways to add data to a table: by loading it from a file, or by
inserting it
one tuple at a time. Loading from a file is the preferred route,
especially when there
is a large amount of data to load. Since we have yet to
implement bulk-loading of
indexes, the tuples are still inserted into the
indexes one at a time.
LOAD TABLE Foo FROM "data/foo.data";
The filename "foo.data" must represent a relative pathname from the
directory where
the server is running, or an absolute path name on the server
machine. The next
release will allow a URL to be specified instead. The format
of "foo.data" is a simple
tab-separated file with each tuple on a
different line.
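For illustration, a load file like "foo.data" for the table Foo might contain
the following (tab-separated, one tuple per line; the values here are made up):
1	1.5
2	2.5
3	3.5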
INSERT INTO Bar VALUES (1, 2.2);
The use of "insert into" allows specific values to be added to a
table. The values can
be constructed using any arbitrary expressions that
produce the correct types. To
further illustrate this, consider the following:
INSERT INTO Bar VALUES (1+2, 3.5*7.2);
INSERT INTO Bar VALUES ("wrong", "types");
The first statement succeeds, while the second gives an error.
INSERT INTO Bar SELECT * FROM Foo;
The last example shows that the results of a query can be treated as input
for
insertion. "Select * from Foo" is the simplest kind of SQL query
--- more complex
ones are in the later sections. However, it is useful to be
able to look at the data you
just inserted. Try:
SELECT * FROM Foo;
SELECT * FROM Bar;
The values of columns of tuples in a table can be updated using the
"update"
statement.
UPDATE Bar SET field2 = field2*2 WHERE field1 = 1;
SELECT * FROM Bar;
You can delete data by deleting individual tuples or by totally removing a table.
DELETE FROM Bar WHERE Bar.field1 = 1;
SELECT * FROM Bar;
This deletes all tuples in Bar that have 1 as the first field. If you wish to
remove the
entire table:
DROP TABLE Bar;
list rels;
Of course, this is a persistent database system, and this means that data
persists
across multiple invocations of the system. Try quitting the client
("_quit;") and starting
it again. "list rels" should show
you the same tables. Try quitting both the
client("_quit;") and the
server("shutdown"), and then restarting both. The state of the
database should be unchanged.
We will use the two tables Foo and Bar created in the previous section.
Since
we dropped Bar at the end of the section, let us recreate it as follows:
CREATE TABLE Bar (field1 int, field2 double);
INSERT INTO Bar VALUES (1, 2.2);
INSERT INTO Bar SELECT * FROM Foo;
INSERT INTO Bar VALUES (1+2, 3.5*7.2);
Ensure using "list rels" that both Foo and Bar exist.
Projections are specified in the SELECT clause target list. We have already
seen
a simple version (select *) of a projection --- this projects all the
attributes of the
tables in the FROM clause. Here are some more sample
projections: try them out:
SELECT Foo.field1, Foo.field2 FROM Foo;
SELECT F.field2, F.field1 FROM Foo F;
SELECT B.field1 FROM Bar B;
SELECT F.field1 * 2 FROM Foo F;
#note the type cast from int to double
SELECT F.field1 + F.field2 FROM Foo F;
SELECT B.#0 FROM Bar B;
SELECT Bar.field2 FROM Bar;
SELECT field1 FROM Foo;
"B.#0" refers to the 0'th field of B. Note that in almost all
cases, we have referred to
attributes using the table name (or a table alias) as a prefix wherever
possible. Although the
very last query works without this syntax, we do not guarantee
that other more
complex queries will work if the table name is omitted. We do not yet
support
nested subqueries in the SELECT clause.
The WHERE clause is used to specify selection predicates. Here are some
examples. If you like, combine them with the projection examples:
SELECT * FROM Foo WHERE Foo.field1 = 1;
SELECT * FROM Foo F WHERE F.field2 > 2.0;
SELECT B.field1 FROM Bar B WHERE B.field1 > 1 AND B.field2 < 10.0;
SELECT * FROM Bar B WHERE B.field1 = 1 OR B.field2 > 3.0;
SELECT * FROM Foo F WHERE NOT (F.field1 = 1);
SELECT F.field2 FROM Foo F WHERE F.field1 + 1 = 2;
The predicates in the WHERE clause can get arbitrarily complicated, with
nested
AND, OR and NOT. We also support nested subqueries in the WHERE clause,
but
more about those later.
The FROM clause can specify multiple tables which will be joined using
predicates
specified in the WHERE clause.
SELECT F.field1, B.field2 FROM Foo F, Bar B WHERE F.field1 = B.field1;
SELECT * FROM Foo F, Bar B WHERE F.field2 > B.field2;
SELECT F.field1 FROM Foo F, Bar B WHERE F.field1 = B.field1 AND B.field2 > 2.0;
SELECT * FROM Foo F, Bar B;
The last query is an example of a cross-product; there is no predicate in the
WHERE
clause that connects the two tables.
Entire tables (or tables resulting from queries) may be aggregated to produce
composite answers.
SELECT count(F.field1) FROM Foo F;
SELECT sum(B.field2), avg(B.field2) FROM Bar B;
SELECT min(F.field1), max(F.field2) FROM Foo F;
The aggregates supported are sum(), avg(), prod(), max(), min() and count().
Note
that the count(*) syntax of standard SQL is not supported. Also,
"distinct" aggregates
are not yet supported.
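For example, where standard SQL would use count(*), count a specific column
instead:
# count(*) is not supported; count a column instead
SELECT count(B.field1) FROM Bar B;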
Groups of tuples may also be aggregated using a GROUP BY clause. All tuples
that
have the same value of the expression in the GROUP BY clause form a single
group
which is aggregated over, producing a single answer tuple. Duplicate
elimination is a
special case of grouped aggregation that has a special syntax:
all identical tuples
form a group, and only one of them is retained.
SELECT F.field1, sum(F.field2) FROM Foo F GROUP BY F.field1;
SELECT B.field1, count(B.field2) FROM Bar B GROUP BY B.field1;
SELECT B.field1, avg(B.field2) FROM Bar B GROUP BY B.field1 HAVING B.field1 > 1;
SELECT DISTINCT F.field1 FROM Foo F;
SELECT DISTINCT * FROM Bar;
SQL places certain restrictions on the contents of the GROUP BY, HAVING and
SELECT clauses. Any non-aggregates in the SELECT clause must be named in the
GROUP BY clause. Likewise for any non-aggregates in the HAVING clause. Our
implementation does not allow aggregates in the HAVING clause.
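To illustrate these restrictions, consider this pair of queries (illustrative):
the first is legal, while the second is not, because B.field2 appears outside
an aggregate without being named in the GROUP BY clause.
# legal: B.field1 appears in the GROUP BY clause
SELECT B.field1, sum(B.field2) FROM Bar B GROUP BY B.field1;
# illegal: B.field2 is a non-aggregate not named in the GROUP BY clause
SELECT B.field1, B.field2 FROM Bar B GROUP BY B.field1;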
The results of any query can be ordered for presentation to the user by
adding an
ORDER BY clause. The results can even be ordered by an attribute that has been
projected out
of the final result! The results can also be ordered by any
complex expression that
generates an ordered data type.
SELECT * FROM Foo ORDER BY Foo.field1;
SELECT * FROM Foo F ORDER BY F.field2;
SELECT F.field1, F.field2 FROM Foo F ORDER BY F.field2, F.field1;
SELECT B.field1 FROM Bar B ORDER BY B.field2;
SELECT F.field1 FROM Foo F ORDER BY F.field1 + F.field2;
SELECT * FROM Bar B ORDER BY B.field2 * 2;
PREDATOR
supports nested subqueries in the WHERE clause of an SQL query.
## Example of IN subquery
SELECT * FROM Foo F WHERE F.field1 IN (SELECT B.field1 FROM Bar B);
## Example of ANY subquery
SELECT * FROM Foo F WHERE F.field1 < ANY (SELECT B.field1 FROM Bar B);
## Example of ALL subquery
SELECT * FROM Foo F WHERE F.field1 >= ALL (SELECT B.field1 FROM Bar B);
## Example of scalar subquery (subquery cardinality must be 1)
## Not handled yet.
SELECT * FROM Foo F WHERE F.field1 = (SELECT max(B.field1) FROM Bar B);
Now for the correlated subqueries --- these subqueries can reference
"correlation"
columns from the enclosing query block. The best way to
think of this is to mentally
model each subquery invocation as a function call.
The result of the function call is
the table produced by computing the subquery.
In the uncorrelated examples so far,
the function takes no parameters (in other
words, the subquery is constant across
multiple invocations). In fact, many
commercial systems (but not PREDATOR
yet)
will cache the results of such an uncorrelated subquery so that it does not
need to
be recomputed. Correlated subqueries, on the other hand, take one or
more
parameters. Try the examples below.
## Example of EXISTS subquery (F.field1 is the correlation value)
SELECT * FROM Foo F WHERE EXISTS (SELECT * FROM Bar B WHERE B.field1 = F.field1);
## Example of NOT EXISTS subquery
SELECT * FROM Foo F WHERE NOT EXISTS (SELECT * FROM Bar B WHERE B.field1 = F.field1);
## Example of UNIQUE subquery (true if subquery has exactly 1 tuple)
SELECT * FROM Foo F WHERE UNIQUE (SELECT * FROM Bar B WHERE B.field1 = F.field1);
Finally, for those of you who have endured an introductory database course,
the
famous "division" query. Assume there is a table FooBar that
relates Foo tuples with
Bar tuples. We want to find those Foo tuples that are
related to all Bar tuples that
satisfy certain conditions. A classic real-world analogy is finding suppliers
who supply all parts.
CREATE TABLE FooBar (FooField int, BarField int);
INSERT INTO FooBar VALUES (0, 0);
INSERT INTO FooBar VALUES (0, 2);
INSERT INTO FooBar VALUES (0, 3);
## division (illustrative): Foo tuples related to all Bar tuples with field1 > 1
SELECT * FROM Foo F
WHERE NOT EXISTS (SELECT * FROM Bar B WHERE B.field1 > 1
                  AND NOT EXISTS (SELECT * FROM FooBar FB
                                  WHERE FB.FooField = F.field1
                                  AND FB.BarField = B.field1));
So far, all the queries have involved a single query block. This makes it
tough to
express complex queries. One option is to use views. A view is just a
name for the
result table defined by a query -- except that it is just defined,
not executed. Once
you define a view, that view may be used to construct another
query, just as if it is a
stored table. Use the CREATE VIEW statement:
CREATE VIEW PartialResult AS
  (SELECT F.field1 AS attr1, max(F.field2) AS attr2
   FROM Foo F, Bar B
   WHERE F.field2 > B.field2
   GROUP BY F.field1);
SELECT * FROM PartialResult;
You can construct arbitrarily complicated queries using views.
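For instance, a view can participate in joins just like a stored table; here is
a small illustrative query over the view defined above:
SELECT P.attr1, F.field2
FROM PartialResult P, Foo F
WHERE P.attr1 = F.field1;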
Table expressions are supposed to be "transient views" -- in other
words, if you do
not want to have persistent views, but just want the
convenience of structuring your
query into understandable blocks, table
expressions are the mechanism of choice.
# a table expression in the FROM clause (illustrative)
SELECT T.attr1
FROM (SELECT F.field1 AS attr1, F.field2 AS attr2 FROM Foo F WHERE F.field1 > 1) T;
Every ADT can define methods which can then be used in SQL queries. The type
system is extensible, of course, allowing new data types to be added. Currently,
there
are quite a few data types distributed with the system. How are tables with
ADT fields created? How is ADT data loaded? How are ADT methods used in
queries? In this section, we address these questions.
Table creation is straightforward -- in fact, in Section 1, we had an example
of creating a table with a polygon field. Here, we use image and complex
number as
number as
examples.
CREATE TABLE ImageTest (id string, picture image("format jpeg")) key(id);
CREATE TABLE ComplexTest (id int, value complex) key(id);
CREATE B INDEX ComplRealIndx on ComplexTest (ComplexTest.value.real());
The key of a relation must be a field on which a unique B+ tree index can be
built.
Currently, this is restricted to a few data types (int, double and
string). Consequently,
an image or a complex number cannot be the key of a
relation. We hope to ease this
restriction in the next release. As the example
shows, an index can be created on the
result of the real() method of
the complex number ADT. This method returns an
integer, which is the real
component of the complex number. Indexes can be built on
any complex expression
that returns a suitable type.
Loading data to one of these tables is no different from the earlier
examples.
However, one question remains: how are the ADT values represented? The
ADT
developer usually makes a choice of whether the ADT objects are going to be
"large" or not. if they are "large" (as in the case of the
image ADT), it doesn't make
sense to place the actual image within the load
file. Instead, each image is placed
in its own data file, and the load file only
contains the file name. Similarly, in order to
insert a single tuple, a
"constant" image value can be specified by specifying a type
cast of a
file name. This file name could instead be a URL and it would be
downloaded into
the database.
LOAD TABLE ImageTest from "data/renoir.data";
INSERT INTO ImageTest VALUES ("girl", [image("format jpeg")]"data/image_data/watercan.jpg");
SELECT * FROM ImageTest;
Complex number values, on the other hand, are pretty small, and it is silly
to put each
complex number into its own file. Instead, they are specified inline
in the data file. It is
the responsibility of the ADT developer to decide how
the values are going to be
represented.
LOAD TABLE ComplexTest from "data/complex.data";
INSERT INTO ComplexTest VALUES (10, [complex]"10.0+20.1i");
SELECT * FROM ComplexTest;
We have already seen the use of an ADT method in creating an index. The use
of
methods in queries is similar. Wherever an expression, column value, or
constant
occurs, a method expression that produces a valid data type may appear.
Here are
some examples:
SELECT C.id, C.value.real() FROM ComplexTest C;
SELECT C.value.imag() FROM ComplexTest C;
SELECT * FROM ComplexTest C WHERE C.value.real() > 5;
ADTs can define their own aggregates. For instance, the complex number ADT
defines "csum" which computes the sum of a group of complex numbers.
These can
be used like the other aggregates. The second example below shows the
application
of functions to the result of an aggregation.
SELECT csum(C.value) FROM ComplexTest C;
# this computes the imag() method on the sum of the complex numbers
SELECT csum(C.value).imag() FROM ComplexTest C;
PREDATOR
uses sorting in order to compute GROUP BY clauses.
Apart from ADT methods, there is also support for standalone functions. While
the
mechanism is in place, we have only provided a couple of such functions as a
proof
of concept, since our focus has been on ADTs. One function, intplus() adds
two
integers, producing their sum. Another, matches(), checks if the
first string argument
is a substring of the second string argument. If so, it
returns 1, else 0.
SELECT intplus(F.field1, 10) FROM Foo F;
SELECT matches("gir", I.id) FROM ImageTest I;
By default, every "request" sent from the client to the server is
treated as an ACID
transaction (i.e., it either commits on success or aborts on
failure). Failure could occur
due to errors or lock timeouts. If you are using
the txtclient, a "request" is usually a
single command terminated by a
semi-colon. From the Java GUI, it is possible to
send multiple commands as a
single "request" -- however, if you want to do this, try
using
explicit multi-statement xactions instead (see next section). Here are some
examples of single-statement transactions:
INSERT INTO Foo VALUES (10, 5.5);
#Hit a CTRL-C, kill the server, and restart it
#The inserted tuple will be in Foo
SELECT * FROM Foo;
#Now look for abort/rollback.
# Insertion of duplicate key is an error that causes
# a rollback
INSERT INTO Foo VALUES (10, 10.5);
SELECT * FROM Foo;
If you want to bracket several statements as a single transaction, you can do so.
#Begin a transaction
trans begin;
DELETE FROM Foo WHERE Foo.field1 = 10;
SELECT * FROM Foo;
#Abort the transaction
trans abort;
# Voila, the deleted tuple is back in place
SELECT * FROM Foo;
If there is an error in the middle of a multi-statement xaction, the
transaction is
aborted. Any subsequent request is then run as a separate xaction.
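For example, a sketch of this behavior using the duplicate-key error from
earlier (Foo has a unique key on field1):
trans begin;
INSERT INTO Foo VALUES (20, 1.5);
INSERT INTO Foo VALUES (20, 2.5);  # duplicate key error aborts the whole xaction
SELECT * FROM Foo;                 # runs as a new, separate xaction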
It's often convenient to set flags to specify different kinds of behavior in
the system.
This is especially useful in measuring the effects of various
features. The preferences
for the relational engine can be viewed and set as
follows:
#list the current settings
list env;
#many of these are for internal use and debugging.
# Turn off display of results
setenv show_query_results 0;
# Turn on display of results
setenv show_query_results 1;
When running performance experiments, it is often very useful to set
"show_query_results"
to 0. If you are extending the system you may want to add to
these flags. Look
at the design document for details of how to do so.
Each E-ADT can provide its own set of options to control its behavior.
Currently, there
are the following default options for large E-ADTs.
# show the options for the "image" E-ADT
;
# now prevent images from being displayed
set_adtenv image show_output 0;
One of the issues with large E-ADTs like images is: how will they be
displayed when
a query result involves them? The solution is for the
server to treat the large
E-ADT value as a MIME-type value and refer to it as a
URL. As you see if you run the
query interactively through this page, the
hyperlinks to images appear as long URLs
which have the form of hexadecimal
strings. This is a simple encoding of the binary
image handle as an ASCII URL.
The PREDATOR server also
acts as an HTTP
server purely for the purpose of serving these large objects.
#collect statistics on a specific table Foo
runstats Foo;
#collect statistics on all tables
runstats;
The runstats command collects data distribution statistics which are
used in query
optimization. As new data is inserted or existing data is deleted,
the statistics become
inaccurate. Consequently, one should periodically execute runstats
for good query
optimization.
The window in which the server was started is called the "monitor"
window --- here,
there is a separate thread reading input requests and
processing them. Actually, the
language supported here is called "XXX"
and it is processed by a different engine from
the relational engine. Here are the
commands that can be run from the monitor
window:
adts : list available ADTs
clients : list current clients
engines : list available engines
help : shows this list
shutdown : terminate server
sys_stats : print server statistics
sys_disk : show disk volume stats
sys_shore : show SHORE stats
sys_reset : reset all stats
The most useful of these is, of course, "shutdown", which is the
way to shut down the
server (Ctrl-D works just as well). Some of the commands to
print statistics are useful
for system developers trying to understand where
disk space is being used, etc.
However, the only reasonable statistics currently
gathered are by SHORE -- we hope
to put in statistics for the rest of PREDATOR
in the next release.
We have a simple tool, QPDisp, that displays the effects of query
optimization and
processing. In essence, it is a tree drawing package written in
Java. When any query
is executed at the server, a log of query optimization and
processing information is
dumped out to a diagnostic file (in a file query_proc_info.clientXX.txt
in your
XXXTMPDATADIR directory). Any such information is in the form
of query trees.
Currently, three categories of trees are monitored,
corresponding to three different
phases of query optimization. More categories
will be added in the future. I really
encourage you to try this tool -- it is
almost like opening up the innards of the system.
QPDisp is a Java applet. You should set up a web server to run on the same
machine
as the database server (there are a number of free web servers ---
choose any
one that can serve Java applets). We have provided a template HTML
page
qpdisp.html that embeds the QPDisp applet. You will need to
specify the name of
the server dump file that will be visualized.
For example, consider the query:
SELECT F.field1, B.field2 FROM Foo F, Bar B WHERE F.field1 = B.field1;
Here
is an example of the QPDisp applet when applied to the dump file generated
by this query.
This tool is at an early stage of development. We plan to enhance it to
visualize the
query optimization process and query execution as well. Currently,
it behaves well with
Netscape 4.0, but oddly with IE 4.0 (we'll try to find out
why and fix it).