This is a tutorial and an online demonstration of PREDATOR.
It leads you through
several examples of database queries you can run on a PREDATOR
1.0 server.
This tutorial should help you figure out the syntax for various commands
and acts
as the user manual for the system. We expect to keep modifying this
tutorial as
new functionality is introduced into the system. Details of design
decisions and
implementation issues are in a different document.
PREDATOR
is a client-server database system; most of the database functionality resides
in the server. If you click on the query buttons, you can interactively run them
against a
sample database maintained on a server at Cornell. The results of the
interactive
version of the tutorial will differ slightly from the text of the
tutorial, because all the
tables are pre-created and the data is pre-loaded.
Update, create, delete and drop
statements are not provided with buttons.
In order to try these examples on your local installation, start a server and
then use
one of the provided client programs to connect to the server. Type in
any of the
sample SQL queries and commands. Most of the examples in this
document are also
provided as part of the regression test scripts in the
release. The regression tests
provide more exhaustive examples of various
features.
In order to use PREDATOR, you need to take three steps: start a server, connect
a client to it, and type SQL commands at the client.
Client-specific functionality is described in the individual client tutorial
pages listed
below. This tutorial will only describe the simplest client, the
"Text Client".
The Text Client Program is a very simple command prompt SQL interface to the
PREDATOR
server. You can start the Text Client Program in its own window by
typing
txtclient -s <port number> [-h <server host name>]
The port number you supply should be the port number of a running PREDATOR
server. If no host name is provided, the server is presumed to be running on the
same host as the client. If the connection to the server is made, you will see
the "Q>" prompt at the client, while there will be some diagnostic
messages at the
server window to indicate the connection establishment. Type
"list rels;" at the
client. If the appropriate environment variables were provided to the server,
you should see the names of the catalog tables
on the data device. Multiple clients can
connect at the same time to the same
server. They must all connect to the same
server port. All commands typed into
the Text Client Program must be terminated
with a semi-colon ";". You
can terminate the client connection by typing "_quit;".
The most useful initial command is "list rels" -- the sample run is shown below:
Q> list rels;
_STABLES _SATTRS _SSTATS _SASTATS _SINDXS _SVIEWS
This tells you that there are six tables present (all of which happen to be
default
catalog tables created automatically by the system). If you want to see
the schemas
and catalog information for all the tables, type "list cats;".
These two are very useful
commands, and we recommend that you try them at
various stages through the
tutorial to understand the effects of various
commands on the underlying tables.
Other useful commands are:
list views;   ----- lists all views defined in the system
list schemas; ----- lists all schemas of tables
list env;     ----- lists the execution environment settings
New tables can be created using the CREATE TABLE command.
CREATE TABLE Foo (field1 int, field2 double) KEY(field1);
This creates a table called Foo -- check using "list rels" to see
that it was indeed
created. You can also check the schema of the new table using
"list cats". The SQL
parser is case-insensitive with respect to
keywords, but table names and field names
are case-sensitive. The key of this
table is "field1". Physically, this means that a unique
index is built
on this table on "field1". The current support for keys is very
minimal ---
keys can be specified only on a single attribute, and only on
integer, double or string
data types. A table can also be created without a key
specified.
CREATE TABLE Bar (field1 int, field2 double);
The fields or attributes of a table can belong to any of the data types known
to the
system. The following primitive data types are supported:
int:     4 byte integer
double:  8 byte double precision floating point
char[X]: fixed length (X) character string
string:  variable length character string (up to XXX_TEXTSTRING_MAX: 5000 chars)
boolean: boolean value
The type system is extensible and new data types can be added. The current
release also has several other data types, as described later in this tutorial.
CREATE TABLE Locations (field1 int, field2 polygon("1000"), field3 string);
list rels;
Here, we create a table called Locations, which has a field of a polygon data
type.
Note that the polygon field has additional meta-information which
specifies further
information about the polygons -- in this case, it tells the
system to expect an average
of about 1000 points in each polygon.
Partial support for NULL values has been added; NULL values can be created and
stored, but the specification of certain fields as "NOT NULL" is not
supported.
Further, indexes may not work correctly on NULL values.
An index can be created on any existing table, whether data has already been
loaded
into it or not. PREDATOR
supports extensible indexing --- new kinds of indexes can
be added to the
system. Therefore, when creating an index, the nature of the index
needs to be
specified too. Currently, two types of indexes -- B+ tree ("B") and R*
tree ("R") --- are supported. If the index type is not specified, it
is assumed to be a
B+ tree.
CREATE UNIQUE B INDEX Loc_fld1 ON Locations (Locations.field1);
CREATE B INDEX Loc_fld1plus ON Locations (Locations.field1 + 100);
CREATE R INDEX Loc_fld2 ON Locations (Locations.field2);
CREATE B INDEX Loc_fld3 ON Locations (Locations.field3);
B+ tree indexes can be created on integer, string, and double fields. They
can also
be built on any expression that returns one of these types. Note that
the fields must
be prefixed by the table name in the indexing expression. R*
trees can be built on any
data type that provides a bounding box. This is
defined in the ADT method interface,
and we will not go into the details here.
Many of the geographic data types like point,
polygon and box support this
capability. All indexes can be defined to be "unique" ---
this means
that duplicate index key values will cause an insertion error.
There are two ways to add data to a table: by loading it from a file, or by
inserting it
one tuple at a time. Loading from a file is the preferred route,
especially when there
is a large amount of data to load. Since we have yet to
implement bulk-loading of
indexes, the tuples are still inserted into the
indexes one at a time.
LOAD TABLE Foo FROM "data/foo.data";
The filename "foo.data" must represent a relative pathname from the
directory where
the server is running, or an absolute path name on the server
machine. The next
release will allow a URL to be specified instead. The format
of "foo.data" is a simple
tab-separated file with each tuple on a
different line.
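For illustration, a load file like "foo.data" for the table Foo might contain
the following (tab-separated, one tuple per line; the values here are made up):
1	1.5
2	2.5
3	3.5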
INSERT INTO Bar VALUES (1, 2.2);
The use of "insert into" allows specific values to be added to a
table. The values can
be constructed using any arbitrary expressions that
produce the correct types. To
further illustrate this, consider the following:
INSERT INTO Bar VALUES (1+2, 3.5*7.2);
INSERT INTO Bar VALUES ("wrong", "types");
The first statement succeeds, while the second gives an error.
INSERT INTO Bar SELECT * FROM Foo;
The last example shows that the results of a query can be treated as input
for
insertion. "Select * from Foo" is the simplest kind of SQL query
--- more complex
ones are in the later sections. However, it is useful to be
able to look at the data you
just inserted. Try:
SELECT * FROM Foo;
SELECT * FROM Bar;
The values of columns of tuples in a table can be updated using the
"update"
statement.
UPDATE Bar SET field2 = field2*2 WHERE field1 = 1;
SELECT * FROM Bar;
You can delete data by deleting individual tuples or by totally removing a table.
DELETE FROM Bar WHERE Bar.field1 = 1;
SELECT * FROM Bar;
This deletes all tuples in Bar that have 1 as the first field. If you wish to
remove the
entire table:
DROP TABLE Bar;
list rels;
Of course, this is a persistent database system, and this means that data
persists
across multiple invocations of the system. Try quitting the client
("_quit;") and starting
it again. "list rels" should show
you the same tables. Try quitting both the
client("_quit;") and the
server("shutdown"), and then restarting both. The state of the
database should be unchanged.
We will use the two tables Foo and Bar created in the previous section.
Since
we dropped Bar at the end of the section, let us recreate it as follows:
CREATE TABLE Bar (field1 int, field2 double);
INSERT INTO Bar VALUES (1, 2.2);
INSERT INTO Bar SELECT * FROM Foo;
INSERT INTO Bar VALUES (1+2, 3.5*7.2);
Ensure using "list rels" that both Foo and Bar exist.
Projections are specified in the SELECT clause target list. We have already
seen
a simple version (select *) of a projection --- this projects all the
attributes of the
tables in the FROM clause. Here are some more sample
projections: try them out:
SELECT Foo.field1, Foo.field2 FROM Foo;
SELECT F.field2, F.field1 FROM Foo F;
SELECT B.field1 FROM Bar B;
SELECT F.field1 * 2 FROM Foo F;
#note the type cast from int to double
SELECT F.field1 + F.field2 FROM Foo F;
SELECT B.#0 FROM Bar B;
SELECT Bar.field2 FROM Bar;
SELECT field1 FROM Foo;
"B.#0" refers to the 0'th field of B. Note that in almost all
cases, we have referred to
attributes using the table name (or a table alias) as a prefix wherever
possible. Although the
very last query works without this syntax, we do not guarantee
that other more
complex queries will work if the table name is omitted. We do not yet
support
nested subqueries in the SELECT clause.
The WHERE clause is used to specify selection predicates. Here are some
examples. If you like, combine them with the projection examples:
SELECT * FROM Foo WHERE Foo.field1 = 1;
SELECT * FROM Foo F WHERE F.field2 > 2.0;
SELECT B.field1 FROM Bar B WHERE B.field1 > 1 AND B.field2 < 10.0;
SELECT * FROM Bar B WHERE B.field1 = 1 OR B.field2 > 3.0;
SELECT * FROM Foo F WHERE NOT (F.field1 = 1);
SELECT F.field2 FROM Foo F WHERE F.field1 + 1 = 2;
The predicates in the WHERE clause can get arbitrarily complicated, with
nested
AND, OR and NOT. We also support nested subqueries in the WHERE clause,
but
more about those later.
The FROM clause can specify multiple tables which will be joined using
predicates
specified in the WHERE clause.
SELECT F.field1, B.field2 FROM Foo F, Bar B WHERE F.field1 = B.field1;
SELECT * FROM Foo F, Bar B WHERE F.field2 > B.field2;
SELECT F.field1 FROM Foo F, Bar B WHERE F.field1 = B.field1 AND B.field2 > 2.0;
SELECT * FROM Foo F, Bar B;
The last query is an example of a cross-product; there is no predicate in the
WHERE
clause that connects the two tables.
Entire tables (or tables resulting from queries) may be aggregated to produce
composite answers.
SELECT count(F.field1) FROM Foo F;
SELECT sum(B.field2), avg(B.field2) FROM Bar B;
SELECT min(F.field1), max(F.field2) FROM Foo F;
The aggregates supported are sum(), avg(), prod(), max(), min() and count().
Note
that the count(*) syntax of standard SQL is not supported. Also,
"distinct" aggregates
are not yet supported.
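For example, where standard SQL would use count(*), count a specific column
instead:
# count(*) is not supported; count a column instead
SELECT count(B.field1) FROM Bar B;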
Groups of tuples may also be aggregated using a GROUP BY clause. All tuples
that
have the same value of the expression in the GROUP BY clause form a single
group
which is aggregated over, producing a single answer tuple. Duplicate
elimination is a
special case of grouped aggregation that has a special syntax:
all identical tuples
form a group, and only one of them is retained.
SELECT F.field1, sum(F.field2) FROM Foo F GROUP BY F.field1;
SELECT B.field1, count(B.field2) FROM Bar B GROUP BY B.field1;
SELECT B.field1, avg(B.field2) FROM Bar B GROUP BY B.field1 HAVING B.field1 > 1;
SELECT DISTINCT F.field1 FROM Foo F;
SELECT DISTINCT * FROM Bar;
SQL places certain restrictions on the contents of the GROUP BY, HAVING and
SELECT clauses. Any non-aggregates in the SELECT clause must be named in the
GROUP BY clause. Likewise for any non-aggregates in the HAVING clause. Our
implementation does not allow aggregates in the HAVING clause.
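To illustrate these restrictions, consider this pair of queries (illustrative):
the first is legal, while the second is not, because B.field2 appears outside
an aggregate without being named in the GROUP BY clause.
# legal: B.field1 appears in the GROUP BY clause
SELECT B.field1, sum(B.field2) FROM Bar B GROUP BY B.field1;
# illegal: B.field2 is a non-aggregate not named in the GROUP BY clause
SELECT B.field1, B.field2 FROM Bar B GROUP BY B.field1;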
The results of any query can be ordered for presentation to the user by
adding an
ORDER BY clause. The results can even be ordered by an attribute that has been
projected out
of the final result! The results can also be ordered by any
complex expression that
generates an ordered data type.
SELECT * FROM Foo ORDER BY Foo.field1;
SELECT * FROM Foo F ORDER BY F.field2;
SELECT F.field1, F.field2 FROM Foo F ORDER BY F.field2, F.field1;
SELECT B.field1 FROM Bar B ORDER BY B.field2;
SELECT F.field1 FROM Foo F ORDER BY F.field1 + F.field2;
SELECT * FROM Bar B ORDER BY B.field2 * 2;
PREDATOR
supports nested subqueries in the WHERE clause of an SQL query.
## Example of IN subquery
SELECT * FROM Foo F WHERE F.field1 IN (SELECT B.field1 FROM Bar B);
## Example of ANY subquery
SELECT * FROM Foo F WHERE F.field1 < ANY (SELECT B.field1 FROM Bar B);
## Example of ALL subquery
SELECT * FROM Foo F WHERE F.field1 >= ALL (SELECT B.field1 FROM Bar B);
## Example of scalar subquery (subquery cardinality must be 1)
## Not handled yet.
SELECT * FROM Foo F WHERE F.field1 = (SELECT max(B.field1) FROM Bar B);
Now for the correlated subqueries --- these subqueries can reference
"correlation"
columns from the enclosing query block. The best way to
think of this is to mentally
model each subquery invocation as a function call.
The result of the function call is
the table produced by computing the subquery.
In the uncorrelated examples so far,
the function takes no parameters (in other
words, the subquery is constant across
multiple invocations). In fact, many
commercial systems (but not PREDATOR
yet)
will cache the results of such an uncorrelated subquery so that it does not
need to
be recomputed. Correlated subqueries, on the other hand, take one or
more
parameters. Try the examples below.
## Example of EXISTS subquery (F.field1 is the correlation value)
SELECT * FROM Foo F WHERE EXISTS (SELECT * FROM Bar B WHERE B.field1 = F.field1);
## Example of NOT EXISTS subquery
SELECT * FROM Foo F WHERE NOT EXISTS (SELECT * FROM Bar B WHERE B.field1 = F.field1);
## Example of UNIQUE subquery (true if subquery has exactly 1 tuple)
SELECT * FROM Foo F WHERE UNIQUE (SELECT * FROM Bar B WHERE B.field1 = F.field1);
Finally, for those of you who have endured an introductory database course,
the
famous "division" query. Assume there is a table FooBar that
relates Foo tuples with
Bar tuples. We want to find those Foo tuples that are
related to all Bar tuples that
satisfy certain conditions. A classic real-world analogy is finding suppliers
who supply all parts.
CREATE TABLE FooBar (FooField int, BarField int);
INSERT INTO FooBar VALUES (0, 0);
INSERT INTO FooBar VALUES (0, 2);
INSERT INTO FooBar VALUES (0, 3);
## division (illustrative): Foo tuples related to all Bar tuples with field1 > 1
SELECT * FROM Foo F
WHERE NOT EXISTS (SELECT * FROM Bar B WHERE B.field1 > 1
                  AND NOT EXISTS (SELECT * FROM FooBar FB
                                  WHERE FB.FooField = F.field1
                                  AND FB.BarField = B.field1));
So far, all the queries have involved a single query block. This makes it
tough to
express complex queries. One option is to use views. A view is just a
name for the
result table defined by a query -- except that it is just defined,
not executed. Once
you define a view, that view may be used to construct another
query, just as if it is a
stored table. Use the CREATE VIEW statement:
CREATE VIEW PartialResult AS
  (SELECT F.field1 AS attr1, max(F.field2) AS attr2
   FROM Foo F, Bar B
   WHERE F.field2 > B.field2
   GROUP BY F.field1);
SELECT * FROM PartialResult;
You can construct arbitrarily complicated queries using views.
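For instance, a view can participate in joins just like a stored table; here is
a small illustrative query over the view defined above:
SELECT P.attr1, F.field2
FROM PartialResult P, Foo F
WHERE P.attr1 = F.field1;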
Table expressions are supposed to be "transient views" -- in other
words, if you do
not want to have persistent views, but just want the
convenience of structuring your
query into understandable blocks, table
expressions are the mechanism of choice.
# a table expression in the FROM clause (illustrative)
SELECT T.attr1
FROM (SELECT F.field1 AS attr1, F.field2 AS attr2 FROM Foo F WHERE F.field1 > 1) T;
Every ADT can define methods which can then be used in SQL queries. The type
system is extensible, of course, allowing new data types to be added. Currently,
there
are quite a few data types distributed with the system. How are tables with
ADT fields created? How is ADT data loaded? How are ADT methods used in
queries? In this section, we address these questions.
Table creation is straightforward -- in fact, in Section 1, we had an example
of creating a table with a polygon field. Here, we use image and complex
number as
number as
examples.
CREATE TABLE ImageTest (id string, picture image("format jpeg")) key(id);
CREATE TABLE ComplexTest (id int, value complex) key(id);
CREATE B INDEX ComplRealIndx on ComplexTest (ComplexTest.value.real());
The key of a relation must be a field on which a unique B+ tree index can be
built.
Currently, this is restricted to a few data types (int, double and
string). Consequently,
an image or a complex number cannot be the key of a
relation. We hope to ease this
restriction in the next release. As the example
shows, an index can be created on the
result of the real() method of
the complex number ADT. This method returns an
integer, which is the real
component of the complex number. Indexes can be built on
any complex expression
that returns a suitable type.
Loading data to one of these tables is no different from the earlier
examples.
However, one question remains: how are the ADT values represented? The
ADT
developer usually makes a choice of whether the ADT objects are going to be
"large" or not. if they are "large" (as in the case of the
image ADT), it doesn't make
sense to place the actual image within the load
file. Instead, each image is placed
in its own data file, and the load file only
contains the file name. Similarly, in order to
insert a single tuple, a
"constant" image value can be specified by specifying a type
cast of a
file name. This file name could instead be a URL and it would be
downloaded into
the database.
LOAD TABLE ImageTest from "data/renoir.data";
INSERT INTO ImageTest VALUES ("girl", [image("format jpeg")]"data/image_data/watercan.jpg");
SELECT * FROM ImageTest;
Complex number values, on the other hand, are pretty small, and it is silly
to put each
complex number into its own file. Instead, they are specified inline
in the data file. It is
the responsibility of the ADT developer to decide how
the values are going to be
represented.
LOAD TABLE ComplexTest from "data/complex.data";
INSERT INTO ComplexTest VALUES (10, [complex]"10.0+20.1i");
SELECT * FROM ComplexTest;
We have already seen the use of an ADT method in creating an index. The use
of
methods in queries is similar. Wherever an expression, column value, or
constant
occurs, a method expression that produces a valid data type may appear.
Here are
some examples:
SELECT C.id, C.value.real() FROM ComplexTest C;
SELECT C.value.imag() FROM ComplexTest C;
SELECT * FROM ComplexTest C WHERE C.value.real() > 5;
ADTs can define their own aggregates. For instance, the complex number ADT
defines "csum" which computes the sum of a group of complex numbers.
These can
be used like the other aggregates. The second example below shows the
application
of functions to the result of an aggregation.
SELECT csum(C.value) FROM ComplexTest C;
# this computes the imag() method on the sum of the complex numbers
SELECT csum(C.value).imag() FROM ComplexTest C;
PREDATOR
uses sorting in order to compute GROUP BY clauses.
Apart from ADT methods, there is also support for standalone functions. While
the
mechanism is in place, we have only provided a couple of such functions as a
proof
of concept, since our focus has been on ADTs. One function, intplus() adds
two
integers, producing their sum. Another, matches(), checks if the
first string argument
is a substring of the second string argument. If so, it
returns 1, else 0.
SELECT intplus(F.field1, 10) FROM Foo F;
SELECT matches("gir", I.id) FROM ImageTest I;
By default, every "request" sent from the client to the server is
treated as an ACID
transaction (i.e., it either commits on success or aborts on
failure). Failure could occur
due to errors or lock timeouts. If you are using
the txtclient, a "request" is usually a
single command terminated by a
semi-colon. From the Java GUI, it is possible to
send multiple commands as a
single "request" -- however, if you want to do this, try
using
explicit multi-statement xactions instead (see next section). Here are some
examples of single-statement transactions:
INSERT INTO Foo VALUES (10, 5.5);
#Hit a CTRL-C, kill the server, and restart it
#The inserted tuple will be in Foo
SELECT * FROM Foo;
#Now look for abort/rollback.
# Insertion of duplicate key is an error that causes
# a rollback
INSERT INTO Foo VALUES (10, 10.5);
SELECT * FROM Foo;
If you want to bracket several statements as a single transaction, you can do so.
#Begin a transaction
trans begin;
DELETE FROM Foo WHERE Foo.field1 = 10;
SELECT * FROM Foo;
#Abort the transaction
trans abort;
# Voila, the deleted tuple is back in place
SELECT * FROM Foo;
If there is an error in the middle of a multi-statement xaction, the
transaction is
aborted. Any subsequent request is then run as a separate xaction.
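For example, a sketch of this behavior using the duplicate-key error from
earlier (Foo has a unique key on field1):
trans begin;
INSERT INTO Foo VALUES (20, 1.5);
INSERT INTO Foo VALUES (20, 2.5);  # duplicate key error aborts the whole xaction
SELECT * FROM Foo;                 # runs as a new, separate xaction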
It's often convenient to set flags to specify different kinds of behavior in
the system.
This is especially useful in measuring the effects of various
features. The preferences
for the relational engine can be viewed and set as
follows:
#list the current settings
list env;
#many of these are for internal use and debugging.
# Turn off display of results
setenv show_query_results 0;
# Turn on display of results
setenv show_query_results 1;
When running performance experiments, it is often very useful to set
"show_query_results"
to 0. If you are extending the system you may want to add to
these flags. Look
at the design document for details of how to do so.
Each E-ADT can provide its own set of options to control its behavior.
Currently, there
are the following default options for large E-ADTs.
# show the options for the "image" E-ADT
;
# now prevent images from being displayed
set_adtenv image show_output 0;
One of the issues with large E-ADTs like images is: how will they be
displayed when
a query result involves them? The solution is for the
server to treat the large
E-ADT value as a MIME-type value and refer to it as a
URL. As you see if you run the
query interactively through this page, the
hyperlinks to images appear as long URLs
which have the form of hexadecimal
strings. This is a simple encoding of the binary
image handle as an ASCII URL.
The PREDATOR server also
acts as an HTTP
server purely for the purpose of serving these large objects.
#collect statistics on a specific table Foo
runstats Foo;
#collect statistics on all tables
runstats;
The runstats command collects data distribution statistics which are
used in query
optimization. As new data is inserted or existing data is deleted,
the statistics become
inaccurate. Consequently, one should periodically execute runstats
for good query
optimization.
The window in which the server was started is called the "monitor"
window --- here,
there is a separate thread reading input requests and
processing them. Actually, the
language supported here is called "XXX"
and it is processed by a different engine from
the relational engine. Here are the
commands that can be run from the monitor
window:
adts : list available ADTs
clients : list current clients
engines : list available engines
help : shows this list
shutdown : terminate server
sys_stats : print server statistics
sys_disk : show disk volume stats
sys_shore : show SHORE stats
sys_reset : reset all stats
The most useful of these is, of course, "shutdown", which is the
way to shut down the
server (Ctrl-D works just as well). Some of the commands to
print statistics are useful
for system developers trying to understand where
disk space is being used, etc.
However, the only reasonable statistics currently
gathered are by SHORE -- we hope
to put in statistics for the rest of PREDATOR
in the next release.
We have a simple tool, QPDisp, that displays the effects of query
optimization and
processing. In essence, it is a tree drawing package written in
Java. When any query
is executed at the server, a log of query optimization and
processing information is
dumped out to a diagnostic file (in a file query_proc_info.clientXX.txt
in your
XXXTMPDATADIR directory). Any such information is in the form
of query trees.
Currently, three categories of trees are monitored,
corresponding to three different
phases of query optimization. More categories
will be added in the future. I really
encourage you to try this tool -- it is
almost like opening up the innards of the system.
QPDisp is a Java applet. You should set up a web server to run on the same
machine
as the database server (there are a number of free web servers ---
choose any
one that can serve Java applets). We have provided a template HTML
page
qpdisp.html that embeds the QPDisp applet. You will need to
specify the name of
the server dump file that will be visualized.
For example, consider the query:
SELECT F.field1, B.field2 FROM Foo F, Bar B WHERE F.field1 = B.field1;
Here
is an example of the QPDisp applet when applied to the dump file generated
by this query.
This tool is at an early stage of development. We plan to enhance it to
visualize the
query optimization process and query execution as well. Currently,
it behaves well with
Netscape 4.0, but oddly with IE 4.0 (we'll try to find out
why and fix it).