Tutorial Introduction to PREDATOR

This is a tutorial and an online demonstration of PREDATOR. It leads you through several examples of database queries you can run on a PREDATOR 1.0 server. This tutorial should help you figure out the syntax for various commands and acts as the user manual for the system. We expect to keep modifying this tutorial as new functionality is introduced into the system. Details of design decisions and implementation issues are in a different document.

The layout is as follows:

  1. Overview
  2. SQL data definition
  3. SQL data manipulation
  4. SQL queries
  5. Enhanced ADTs
  6. Transactions
  7. Other Useful Commands
  8. Monitor commands
  9. Studying query processing

Overview

PREDATOR is a client-server database system. Most of the database functionality exists at the server, which accepts SQL queries and returns the results to the client. Several different kinds of clients have been implemented (including a Java applet client, a Java application client, and clients built on ODBC and JDBC). All the examples presented here work from all the clients, unless explicitly mentioned.

If you click on the query buttons, you can interactively run the queries against a sample database maintained on a server at Cornell. The results of the interactive version of the tutorial will differ slightly from the text of the tutorial, because all the tables are pre-created and the data is pre-loaded. Update, create, delete and drop statements are not provided with buttons.

In order to try these examples on your local installation, start a server and then use one of the provided client programs to connect to the server. Type in any of the sample SQL queries and commands. Most of the examples in this document are also provided as part of the regression test scripts in the release. The regression tests provide more exhaustive examples of various features.


Section 2: SQL Data Definition

[ Listing Information | Creating Tables | Creating Indexes ]

Listing Information

The most useful initial command is "list rels" -- a sample run is shown below:



Q> list rels;

_STABLES
_SATTRS
_SSTATS
_SASTATS
_SINDXS
_SVIEWS

This tells you that there are six tables present (all of which happen to be default catalog tables created automatically by the system). If you want to see the schemas and catalog information for all the tables, type "list cats;". These two are very useful commands, and we recommend that you try them at various stages through the tutorial to understand the effects of various commands on the underlying tables. Other useful commands are:


list views;          ----- lists all views defined in the system
list schemas;        ----- lists all schemas of tables
list env;            ----- lists the execution environment settings

Creating Tables

New tables can be created using the CREATE TABLE command.


CREATE TABLE Foo (field1 int, field2 double) KEY(field1);

This creates a table called Foo -- check using "list rels" to see that it was indeed created. You can also check the schema of the new table using "list cats". The SQL parser is case-insensitive with respect to keywords, but table names and field names are case-sensitive. The key of this table is "field1". Physically, this means that a unique index is built on this table on "field1". The current support for keys is very minimal --- keys can be specified only on a single attribute, and only on integer, double or string data types. A table can also be created without a key specified.


CREATE TABLE Bar (field1 int, field2 double);

The fields or attributes of a table can belong to any of the data types known to the system. The following primitive data types are supported:

  int:        4 byte integer
  double:     8 byte double precision floating point
  char[X]:    fixed length (X) character string
  string:     variable length character string (up to XXX_TEXTSTRING_MAX: 5000 chars)
  boolean:    boolean value

The type system is extensible and new data types can be added. The current release also has several other data types, as described here.

CREATE TABLE Locations (field1 int, field2 polygon("1000"), field3 string);
list rels;

Here, we create a table called Locations, which has a field of a polygon data type. Note that the polygon field has additional meta-information which specifies further information about the polygons -- in this case, it tells the system to expect an average of about 1000 points in each polygon.

Partial support for NULL values has been added; NULL values can be stored and created, but the specification of certain fields as "NOT NULL" is not supported. Further, indexes may not work correctly on NULL values.

Creating Indexes

An index can be created on any existing table, whether data has already been loaded into it or not. PREDATOR supports extensible indexing --- new kinds of indexes can be added to the system. Therefore, when creating an index, the nature of the index needs to be specified too. Currently, two types of indexes -- B+ tree ("B") and R* tree ("R") --- are supported. If the index type is not specified, it is assumed to be a B+ tree.


CREATE UNIQUE B INDEX Loc_fld1 ON Locations (Locations.field1);

CREATE B INDEX Loc_fld1plus ON Locations (Locations.field1 + 100);

CREATE R INDEX Loc_fld2 ON Locations (Locations.field2);

CREATE B INDEX Loc_fld3 ON Locations (Locations.field3);

;

B+ tree indexes can be created on integer, string, and double fields. They can also be built on any expression that returns one of these types. Note that the fields must be prefixed by the table name in the indexing expression. R* trees can be built on any data type that provides a bounding box. This is defined in the ADT method interface, and we will not go into the details here. Many of the geographic data types like point, polygon and box support this capability. All indexes can be defined to be "unique" --- this means that duplicate index key values will cause an insertion error.


Section 3: SQL Data Manipulation

[ Loading Data | Inserting Data | Updating Data | Deleting Data ]

Loading Data

There are two ways to add data to a table: by loading it from a file, or by inserting it one tuple at a time. Loading from a file is the preferred route, especially when there is a large amount of data to load. Since bulk-loading of indexes has not yet been implemented, tuples are still inserted into the indexes one at a time.


LOAD TABLE Foo FROM "data/foo.data";

The filename "foo.data" must represent a relative pathname from the directory where the server is running, or an absolute path name on the server machine. The next release will allow a URL to be specified instead. The format of "foo.data" is a simple tab-separated file with each tuple on a different line.
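For instance, given the schema of Foo (field1 int, field2 double), a load file might look like the following (the values are purely illustrative):

```text
1	1.5
2	2.5
3	3.5
```

Each line becomes one tuple, with a tab separating field1 from field2.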

Inserting Data

INSERT INTO Bar VALUES (1, 2.2);

The use of "insert into" allows specific values to be added to a table. The values can be constructed using any arbitrary expressions that produce the correct types. To further illustrate this, consider the following:


INSERT INTO Bar VALUES (1+2, 3.5*7.2);

INSERT INTO Bar VALUES ("wrong", "types");

The first statement succeeds, while the second gives an error.


INSERT INTO Bar SELECT * FROM Foo;

The last example shows that the results of a query can be treated as input for insertion. "Select * from Foo" is the simplest kind of SQL query --- more complex ones are in the later sections. However, it is useful to be able to look at the data you just inserted. Try:

SELECT * FROM Bar; 

SELECT * FROM Foo; 

Updating Data

The values of columns of tuples in a table can be updated using the "update" statement.


UPDATE Bar SET field2 = field2*2 WHERE field1 = 1;

SELECT * FROM Bar;

Deleting and Dropping Data

You can delete data by deleting individual tuples or by totally removing a table.


DELETE FROM Bar WHERE Bar.field1 = 1;
SELECT * FROM Bar;

This deletes all tuples in Bar that have 1 as the first field. If you wish to remove the entire table:


DROP TABLE Bar;
list rels;

Of course, this is a persistent database system, which means that data persists across multiple invocations of the system. Try quitting the client ("_quit;") and starting it again. "list rels" should show you the same tables. Try quitting both the client ("_quit;") and the server ("shutdown"), and then restarting both. The state of the database should be unchanged.


Section 4: SQL Queries

[ Projection | Selection | Joins and Cross-Products | Aggregation and Grouping | Ordering Results | Subqueries | Views | Table Expressions ]

We will use the two tables Foo and Bar created in the previous section. Since we dropped Bar at the end of the section, let us recreate it as follows:


CREATE TABLE Bar (field1 int, field2 double);

INSERT INTO Bar VALUES (1, 2.2);

INSERT INTO Bar SELECT * FROM Foo;

INSERT INTO Bar VALUES (1+2, 3.5*7.2);

SELECT * FROM Bar;

Use "list rels" to ensure that both Foo and Bar exist.

Projection

Projections are specified in the SELECT clause target list. We have already seen a simple version (select *) of a projection --- this projects all the attributes of the tables in the FROM clause. Here are some more sample projections: try them out:


;

;

; 

; 

#note the type cast from int to double
; 

; 

; 

; 

"B.#0" refers to the 0'th field of B. Note that in almost all cases, we have referred to attributes using the "Table.field" dot syntax. We recommend sticking to this syntax wherever possible. Although the very last query works without this syntax, we do not guarantee that other more complex queries will work if the table name is omitted. We do not yet support nested subqueries in the SELECT clause.
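For instance, projections along these lines (illustrative sketches written against the Foo and Bar schemas, not the exact queries of the interactive version) are all legal:

```sql
-- project a single column, prefixed by the table name
SELECT Foo.field1 FROM Foo;

-- project an arbitrary expression; the int field is cast to double
SELECT Foo.field1 + Foo.field2 FROM Foo;

-- rename Bar to B in the FROM clause and project its 0'th field
SELECT B.#0 FROM Bar B;
```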

Selection

The WHERE clause is used to specify selection predicates. Here are some examples. If you like, combine them with the projection examples:

; 

; 

; 

; 

; 

; 

The predicates in the WHERE clause can get arbitrarily complicated, with nested AND, OR and NOT. We also support nested subqueries in the WHERE clause, but more about those later.
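As a sketch, selection predicates over Foo and Bar might look like the following (illustrative queries, not the exact ones from the interactive version):

```sql
-- a simple comparison predicate
SELECT Foo.field1 FROM Foo WHERE Foo.field2 > 2.0;

-- nested AND, OR and NOT
SELECT * FROM Bar
WHERE (Bar.field1 = 1 OR Bar.field1 = 3) AND NOT (Bar.field2 < 2.0);
```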

Join and Cross-Product

The FROM clause can specify multiple tables which will be joined using predicates specified in the WHERE clause.

; 

; 

; 

; 

The last query is an example of a cross-product; there is no predicate in the WHERE clause that connects the two tables.
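Illustrative sketches of a join and a cross-product over Foo and Bar:

```sql
-- a join: the WHERE clause connects the two tables
SELECT Foo.field1, Bar.field2 FROM Foo, Bar
WHERE Foo.field1 = Bar.field1;

-- a cross-product: no connecting predicate
SELECT * FROM Foo, Bar;
```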

Aggregation and Grouping

Entire tables (or tables resulting from queries) may be aggregated to produce composite answers.

; 

; 

; 

The aggregates supported are sum(), avg(), prod(), max(), min() and count(). Note that the count(*) syntax of standard SQL is not supported. Also, "distinct" aggregates are not yet supported.
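For example (an illustrative sketch; since count(*) is not supported, a column is counted instead):

```sql
-- aggregate the entire table into a single answer tuple
SELECT count(Foo.field1), sum(Foo.field2), avg(Foo.field2), max(Foo.field2) FROM Foo;
```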

Groups of tuples may also be aggregated using a GROUP BY clause. All tuples that have the same value of the expression in the GROUP BY clause form a single group which is aggregated over, producing a single answer tuple. Duplicate elimination is a special case of grouped aggregation that has a special syntax: all identical tuples form a group, and only one of them is retained.

; 

; 

; 

; 

; 

SQL places certain restrictions on the contents of the GROUP BY, HAVING and SELECT clauses. Any non-aggregates in the SELECT clause must be named in the GROUP BY clause. Likewise for any non-aggregates in the HAVING clause. Our implementation does not allow aggregates in the HAVING clause.
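As illustrative sketches of grouping (the DISTINCT form is our assumption for the "special syntax" of duplicate elimination mentioned above):

```sql
-- one answer tuple per group of equal field1 values
SELECT Bar.field1, sum(Bar.field2) FROM Bar GROUP BY Bar.field1;

-- duplicate elimination, a special case of grouping
SELECT DISTINCT Bar.field1 FROM Bar;

-- non-aggregates in SELECT and HAVING must appear in GROUP BY;
-- aggregates are not allowed in HAVING
SELECT Bar.field1, avg(Bar.field2) FROM Bar
GROUP BY Bar.field1
HAVING Bar.field1 > 1;
```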

Ordering Results

The results of any query can be ordered for presentation to the user by adding an ORDER BY clause. The results can be ordered by an attribute that is projected out of the final result! The results can also be ordered by any complex expression that generates an ordered data type.

; 

;

; 

; 

; 

; 
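Sketches of ordered queries over Bar (illustrative, not the exact interactive examples):

```sql
-- order by a projected attribute
SELECT Bar.field1, Bar.field2 FROM Bar ORDER BY Bar.field2;

-- order by a complex expression over the tuple
SELECT Bar.field1 FROM Bar ORDER BY Bar.field1 + Bar.field2;
```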

Views

So far, all the queries have involved a single query block. This makes it tough to express complex queries. One option is to use views. A view is just a name for the result table defined by a query -- except that the view is only defined, not executed, at creation time. Once you define a view, it may be used to construct another query, just as if it were a stored table. Use the diagnostic tool described later to see how views are actually processed. Here is a simple example.

CREATE VIEW PartialResult AS
 (SELECT F.field1 as attr1, max(F.field2) as attr2
  FROM Foo F, Bar B
  WHERE F.field2 > B.field2
  GROUP BY F.field1);

; 

You can construct arbitrarily complicated queries using views.
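For instance, the view defined above exports columns attr1 and attr2, so a query over it might look like this (illustrative):

```sql
-- the view is used just like a stored table
SELECT P.attr1 FROM PartialResult P WHERE P.attr2 > 3.0;
```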

Table Expressions

Table expressions are supposed to be "transient views" -- in other words, if you do not want to have persistent views, but just want the convenience of structuring your query into understandable blocks, table expressions are the mechanism of choice.

; 
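Assuming the standard form of a parenthesized query block in the FROM clause (the exact PREDATOR syntax is elided above), a table expression might look like:

```sql
-- the nested block acts as a transient view, here named T
SELECT T.attr1
FROM (SELECT F.field1 AS attr1, F.field2 AS attr2 FROM Foo F) T
WHERE T.attr2 > 2.0;
```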

Section 5: Enhanced ADTs

[ Table and Index Creation | Load and Insert | Using Methods | ADT Aggregates | Ordering | Standalone Functions ]

Methods of ADTs

Every ADT can define methods which can then be used in SQL queries. The type system is extensible, of course, allowing new data types to be added. Currently, there are quite a few data types distributed with the system --- look here for a brief description of them. In this and the next section, we will show you how to use ADTs through a couple of canonical examples. One is the image ADT. The other is the complex number ADT (one of the simplest possible ADTs). The issues to consider include table and index creation, loading and insertion, the use of methods in queries, aggregation, and ordering. In this section, we address these questions.

Table and Index Creation

Table creation is straightforward -- in fact, in Section 2, we had an example of creating a table with a polygon field. Here, we use complex numbers as the example.

CREATE TABLE ComplexTest (id int, value complex) key(id);

CREATE B INDEX ComplRealIndx on ComplexTest (ComplexTest.value.real());

The key of a relation must be a field on which a unique B+ tree index can be built. Currently, this is restricted to a few data types (int, double and string). Consequently, a complex number cannot be the key of a relation. We hope to ease this restriction in the next release. As the example shows, an index can be created on the result of the real() method of the complex number ADT. This method returns an integer, which is the real component of the complex number. Indexes can be built on any complex expression that returns a suitable type.

Load and Insert

Loading data into one of these tables is no different from the earlier examples. The only question is: how are the ADT values represented? The ADT developer usually chooses whether the ADT objects are going to be "large" or not. If they are "large" (as in the case of the image ADT), it does not make sense to place the actual image within the load file. Instead, each image is placed in its own data file, and the load file only contains the file name. Similarly, in order to insert a single tuple, a "constant" image value can be specified by a type cast of a file name. This file name could instead be a URL, in which case the image would be downloaded into the database.

Complex number values, on the other hand, are pretty small, and it is silly to put each complex number into its own file. Instead, they are specified inline in the data file. It is the responsibility of the ADT developer to decide how the values are going to be represented.


LOAD TABLE ComplexTest from "data/complex.data";

INSERT INTO ComplexTest VALUES (10, [complex]"10.0+20.1i");

; 
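Given the inline representation used in the insert above, the file "data/complex.data" might contain lines like these (illustrative values, tab-separated per the load format described earlier):

```text
1	1.0+2.0i
2	3.5+4.5i
```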

Using Methods

We have already seen the use of an ADT method in creating an index. The use of methods in queries is similar. Wherever an expression, column value, or constant occurs, a method expression that produces a valid data type may appear. Here are some examples:

; 

; 
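As sketches of method usage (using the real() method seen in the index example; the exact interactive queries are elided):

```sql
-- project the real component of each complex value
SELECT C.value.real() FROM ComplexTest C;

-- use a method result in a selection predicate
SELECT C.id FROM ComplexTest C WHERE C.value.real() > 5;
```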

ADT Aggregates

ADTs can define their own aggregates. For instance, the complex number ADT defines "csum" which computes the sum of a group of complex numbers. These can be used like the other aggregates. The second example below shows the application of functions to the result of an aggregation.

; 

# this computes the imag() method on the sum of the complex numbers
; 
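Sketches of ADT aggregation (the method-on-aggregate syntax in the second query is our assumption, following the comment above):

```sql
-- sum a group of complex numbers using the ADT-defined aggregate
SELECT csum(C.value) FROM ComplexTest C;

-- apply the imag() method to the aggregated sum
SELECT csum(C.value).imag() FROM ComplexTest C;
```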

Ordering

PREDATOR uses sorting to compute GROUP BY clauses. Obviously, sorting is also needed for ORDER BY clauses. The grouping or ordering fields can belong to any ADT that is an "ordered ADT". The ADT developer specifies this property. If the ADT is ordered, internal methods are provided to compare two values, and these methods are used for sorting. Consequently, you can specify an ADT field in an ORDER BY or GROUP BY clause, as long as it is of an ordered ADT. Neither image nor complex falls into this category (though one could provide sort orders for both, our implementation of these ADTs does not). However, try to construct your own example with the date ADT and observe the sorting on dates.

Standalone Functions

Apart from ADT methods, there is also support for standalone functions. While the mechanism is in place, we have only provided a couple of such functions as a proof of concept, since our focus has been on ADTs. One function, intplus(), adds two integers, producing their sum. Another, matches(), checks if the first string argument is a substring of the second string argument. If so, it returns 1; otherwise, it returns 0.

; 
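Sketches of standalone function usage (illustrative; note that matches() returns an integer, so it is compared with 1):

```sql
-- intplus() adds two integers
SELECT intplus(Foo.field1, 100) FROM Foo;

-- matches(s1, s2): 1 if s1 is a substring of s2, else 0
SELECT * FROM Locations WHERE matches("lake", Locations.field3) = 1;
```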


Section 6: Transactions

[ Single Statement Xacts | Multiple Statement Xacts ]

Single Statement Xactions

By default, every "request" sent from the client to the server is treated as an ACID transaction (i.e., it either commits on success or aborts on failure). Failure could occur due to errors or lock timeouts. If you are using the txtclient, a "request" is usually a single command terminated by a semi-colon. From the Java GUI, it is possible to send multiple commands as a single "request" -- however, if you want to do this, try using explicit multi-statement xactions instead (see the next section). Here are some examples of single-statement transactions:


INSERT INTO Foo VALUES (10, 5.5);

#Hit a CTRL-C, kill the server, and restart it
#The inserted tuple will be in Foo
SELECT * FROM Foo;

#Now look for abort/rollback.
# Insertion of duplicate key is an error that causes
# a rollback
INSERT INTO Foo VALUES (10, 10.5);

SELECT * FROM Foo;

Multiple Statement Xactions

If you want to bracket several statements as a single transaction, you can do so.


#Begin a transaction
trans begin;

DELETE FROM Foo WHERE Foo.field1 = 10;

SELECT * FROM Foo;

#Abort the transaction
trans abort;

# Voila, the deleted tuple is back in place
SELECT * FROM Foo;

If there is an error in the middle of a multi-statement xaction, the transaction is aborted. Any subsequent request is then run as a separate xaction.


Section 7: Other Useful Commands

[ Setting Preferences | Sundry Commands ]

Setting Preferences

It's often convenient to set flags that select different kinds of behavior in the system. This is especially useful when measuring the effects of various features. The preferences for the relational engine can be viewed and set as follows:


#list the current settings
; 

#many of these are for internal use and debugging.

# Turn off display of results
setenv show_query_results 0;


# Turn on display of results
setenv show_query_results 1;

When running performance experiments, it is often very useful to set "show_query_results" to 0. If you are extending the system you may want to add to these flags. Look at the design document for details of how to do so.

Each E-ADT can provide its own set of options to control its behavior. 

Sundry Commands


#collect statistics on a specific table Foo
runstats Foo;

#collect statistics on all tables
runstats;

The runstats command collects data distribution statistics which are used in query optimization. As new data is inserted or existing data is deleted, the statistics become inaccurate. Consequently, one should periodically execute runstats for good query optimization.


Section 8: Monitor Commands

The window in which the server was started is called the "monitor" window --- here, a separate thread reads input requests and processes them. Actually, the language supported here is called "XXX", and it is processed by a different engine from the relational engine. Here are the commands that can be run from the monitor window:


        adts       : list available ADTs
        clients    : list current clients
        engines    : list available engines
        help       : shows this list
        shutdown   : terminate server
        sys_stats  : print server statistics
        sys_disk   : show disk volume stats
        sys_shore  : show SHORE stats
        sys_reset  : reset all stats

The most useful of these is, of course, "shutdown", which is the way to shut down the server (Ctrl-D works just as well). Some of the commands to print statistics are useful for system developers trying to understand where disk space is being used, etc. However, the only reasonable statistics currently gathered are by SHORE -- we hope to put in statistics for the rest of PREDATOR in the next release.


Section 9: Studying Query Processing

We have a simple tool, QPDisp, that displays the effects of query optimization and processing. In essence, it is a tree drawing package written in Java. When any query is executed at the server, a log of query optimization and processing information is dumped out to a diagnostic file (in a file query_proc_info.clientXX.txt in your XXXTMPDATADIR directory). Any such information is in the form of query trees. Currently, three categories of trees are monitored, corresponding to three different phases of query optimization. More categories will be added in the future. I really encourage you to try this tool -- it is almost like opening up the innards of the system.

QPDisp is a Java application. The Readme file in %InstallDir%\predator\java\qpdisp describes the QPDisp tool, how to install it and how to use it. 

For example, consider the query:

;