This is a tutorial and an online demonstration of PREDATOR. It leads you through several examples of database queries you can run on a PREDATOR 1.0 server. This tutorial should help you figure out the syntax for various commands and acts as the user manual for the system. We expect to keep modifying this tutorial as new functionality is introduced into the system. Details of design decisions and implementation issues are in a different document.
If you click on the query buttons, you can interactively run the queries against a sample database maintained on a server at Cornell. The results of the interactive version of the tutorial will differ slightly from the text of the tutorial, because all the tables are pre-created and the data is pre-loaded. Update, create, delete and drop statements are not provided with buttons.
In order to try these examples on your local installation, start a server and then use one of the provided client programs to connect to the server. Type in any of the sample SQL queries and commands. Most of the examples in this document are also provided as part of the regression test scripts in the release. The regression tests provide more exhaustive examples of various features.
The most useful initial command is "list rels" -- a sample run is shown below:
Q> list rels;
_STABLES _SATTRS _SSTATS _SASTATS _SINDXS _SVIEWS
This tells you that there are six tables present (all of which happen to be default catalog tables created automatically by the system). If you want to see the schemas and catalog information for all the tables, type "list cats;". These two are very useful commands, and we recommend that you try them at various stages through the tutorial to understand the effects of various commands on the underlying tables. Other useful commands are:
list views;   ----- lists all views defined in the system
list schemas; ----- lists all schemas of tables
list env;     ----- lists the execution environment settings
New tables can be created using the CREATE TABLE command.
CREATE TABLE Foo (field1 int, field2 double) KEY(field1);
This creates a table called Foo -- check using "list rels" to see that it was indeed created. You can also check the schema of the new table using "list cats". The SQL parser is case-insensitive with respect to keywords, but table names and field names are case-sensitive. The key of this table is "field1". Physically, this means that a unique index is built on this table on "field1". The current support for keys is very minimal --- keys can be specified only on a single attribute, and only on integer, double or string data types. A table can also be created without a key specified.
CREATE TABLE Bar (field1 int, field2 double);
The fields or attributes of a table can belong to any of the data types known to the system. The following primitive data types are supported:
int:     4 byte integer
double:  8 byte double precision floating point
char[X]: fixed length (X) character string
string:  variable length character string (up to XXX_TEXTSTRING_MAX: 5000 chars)
boolean: boolean value
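As a quick sketch of these types in use (the table and field names below are invented for illustration, and are not part of the tutorial's sample database):

```sql
-- Illustrative only: a table exercising each primitive type
CREATE TABLE Inventory (
    id      int,       -- 4 byte integer
    weight  double,    -- 8 byte floating point
    code    char[8],   -- fixed length string of 8 characters
    name    string,    -- variable length string
    instock boolean    -- boolean value
) KEY(id);
```

As with Foo above, the key must be a single int, double or string attribute.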
The type system is extensible and new data types can be added. The current release also has several other data types, as described later in this tutorial.
CREATE TABLE Locations (field1 int, field2 polygon("1000"), field3 string);
list rels;
Here, we create a table called Locations, which has a field of a polygon data type. Note that the polygon field has additional meta-information which specifies further information about the polygons -- in this case, it tells the system to expect an average of about 1000 points in each polygon.
Partial support for NULL values has been added; NULL values can be stored and created, but the specification of certain fields as "NOT NULL" is not supported. Further, indexes may not work correctly on NULL values.
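As an illustration (assuming the system accepts a NULL literal in an INSERT statement -- this is our assumption here, not documented syntax), a NULL could be stored and retrieved as follows:

```sql
-- Assumed syntax: store a NULL in field2 of Bar
INSERT INTO Bar VALUES (5, NULL);
SELECT * FROM Bar;
```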
An index can be created on any existing table, whether data has already been loaded into it or not. PREDATOR supports extensible indexing --- new kinds of indexes can be added to the system. Therefore, when creating an index, the nature of the index needs to be specified too. Currently, two types of indexes -- B+ tree ("B") and R* tree ("R") --- are supported. If the index type is not specified, it is assumed to be a B+ tree.
CREATE UNIQUE B INDEX Loc_fld1 ON Locations (Locations.field1);
CREATE B INDEX Loc_fld1plus ON Locations (Locations.field1 + 100);
CREATE R INDEX Loc_fld2 ON Locations (Locations.field2);
CREATE B INDEX Loc_fld3 ON Locations (Locations.field3);
B+ tree indexes can be created on integer, string, and double fields. They can also be built on any expression that returns one of these types. Note that the fields must be prefixed by the table name in the indexing expression. R* trees can be built on any data type that provides a bounding box. This is defined in the ADT method interface, and we will not go into the details here. Many of the geographic data types like point, polygon and box support this capability. All indexes can be defined to be "unique" --- this means that duplicate index key values will cause an insertion error.
There are two ways to add data to a table: by loading it from a file, or by inserting it one tuple at a time. Loading from a file is the preferred route, especially when there is a large amount of data to load. Since we have not yet implemented bulk-loading of indexes, the tuples are still inserted into the indexes one at a time.
LOAD TABLE Foo FROM "data/foo.data";
The filename "foo.data" must represent a relative pathname from the directory where the server is running, or an absolute path name on the server machine. The next release will allow a URL to be specified instead. The format of "foo.data" is a simple tab-separated file with each tuple on a different line.
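For the Foo table created above (field1 int, field2 double), a file in this format might look as follows; the values are invented for illustration, and the two columns are separated by a tab character:

```
1	1.5
2	2.5
3	3.5
```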
INSERT INTO Bar VALUES (1, 2.2);
The use of "insert into" allows specific values to be added to a table. The values can be constructed using any arbitrary expressions that produce the correct types. To further illustrate this, consider the following:
INSERT INTO Bar VALUES (1+2, 3.5*7.2);
INSERT INTO Bar VALUES ("wrong", "types");
The first statement succeeds, while the second gives an error.
INSERT INTO Bar SELECT * FROM Foo;
The last example shows that the results of a query can be treated as input for insertion. "Select * from Foo" is the simplest kind of SQL query --- more complex ones are in the later sections. However, it is useful to be able to look at the data you just inserted. Try:
SELECT * FROM Foo;
SELECT * FROM Bar;
The values of columns of tuples in a table can be updated using the "update" statement.
UPDATE Bar SET field2 = field2*2 WHERE field1 = 1;
SELECT * FROM Bar;
You can delete data by deleting individual tuples or by totally removing a table.
DELETE FROM Bar WHERE Bar.field1 = 1;
SELECT * FROM Bar;
This deletes all tuples in Bar that have 1 as the first field. If you wish to remove the entire table:
DROP TABLE Bar;
list rels;
Of course, this is a persistent database system, and this means that data persists across multiple invocations of the system. Try quitting the client ("_quit;") and starting it again. "list rels" should show you the same tables. Try quitting both the client("_quit;") and the server("shutdown"), and then restarting both. The state of the database should be unchanged.
We will use the two tables Foo and Bar created in the previous section. Since we dropped Bar at the end of the section, let us recreate it as follows:
CREATE TABLE Bar (field1 int, field2 double);
INSERT INTO Bar VALUES (1, 2.2);
INSERT INTO Bar SELECT * FROM Foo;
INSERT INTO Bar VALUES (1+2, 3.5*7.2);
Use "list rels" to check that both Foo and Bar exist.
Projections are specified in the SELECT clause target list. We have already seen a simple version (select *) of a projection --- this projects all the attributes of the tables in the FROM clause. Here are some more sample projections; try them out:
(In the interactive tutorial, the sample projection queries appear here as buttons; one of them illustrates a type cast from int to double.)
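The original button queries are not reproduced in this text; as an illustrative sketch (these specific queries are our reconstructions, not the originals), projections of the kind described might look like this, using the Foo and Bar tables from earlier:

```sql
SELECT F.field1 FROM Foo F;
SELECT F.field2, F.field1 FROM Foo F;
SELECT F.field1 + F.field2 FROM Foo F;   -- the int is cast to double
SELECT B.#0 FROM Bar B;                  -- positional reference to the 0'th field
SELECT field1 FROM Foo;                  -- works, but omitting the table name is risky
```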
"B.#0" refers to the 0'th field of B. Note that in almost all cases, we have referred to attributes using the . syntax. We recommend sticking to this syntax wherever possible. Although the very last query works without this syntax, we do not guarantee that other more complex queries will work if the table name is omitted. We do not yet support nested subqueries in the SELECT clause.
The WHERE clause is used to specify selection predicates. Here are some examples. If you like, combine them with the projection examples:
(In the interactive tutorial, the sample selection queries appear here as buttons.)
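As an illustrative sketch (reconstructed examples, not the original button queries), selection predicates over Foo might look like:

```sql
SELECT * FROM Foo F WHERE F.field1 = 1;
SELECT * FROM Foo F WHERE F.field2 > 2.0;
SELECT F.field1 FROM Foo F WHERE F.field1 > 1 AND F.field2 < 10.0;
SELECT * FROM Foo F WHERE NOT (F.field1 = 1 OR F.field1 = 3);
```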
The predicates in the WHERE clause can get arbitrarily complicated, with nested AND, OR and NOT. We also support nested subqueries in the WHERE clause, but more about those later.
The FROM clause can specify multiple tables which will be joined using predicates specified in the WHERE clause.
(In the interactive tutorial, the sample join queries appear here as buttons.)
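As an illustrative sketch (reconstructed examples, not the original button queries), joins over Foo and Bar might look like:

```sql
SELECT F.field1, B.field2 FROM Foo F, Bar B WHERE F.field1 = B.field1;
SELECT F.field1 FROM Foo F, Bar B WHERE F.field2 > B.field2;
SELECT * FROM Foo F, Bar B;   -- cross-product: no predicate connects the tables
```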
The last query is an example of a cross-product; there is no predicate in the WHERE clause that connects the two tables.
Entire tables (or tables resulting from queries) may be aggregated to produce composite answers.
(In the interactive tutorial, the sample aggregate queries appear here as buttons.)
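As an illustrative sketch (reconstructed examples, not the original button queries), aggregates over Foo might look like:

```sql
SELECT count(F.field1) FROM Foo F;   -- note: the count(*) syntax is not supported
SELECT sum(F.field1), avg(F.field2) FROM Foo F;
SELECT max(F.field2), min(F.field2) FROM Foo F;
```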
The aggregates supported are sum(), avg(), prod(), max(), min() and count(). Note that the count(*) syntax of standard SQL is not supported. Also, "distinct" aggregates are not yet supported.
Groups of tuples may also be aggregated using a GROUP BY clause. All tuples that have the same value of the expression in the GROUP BY clause form a single group which is aggregated over, producing a single answer tuple. Duplicate elimination is a special case of grouped aggregation that has a special syntax: all identical tuples form a group, and only one of them is retained.
(In the interactive tutorial, the sample grouping and duplicate-elimination queries appear here as buttons.)
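As an illustrative sketch (reconstructed examples; in particular, we assume the standard SELECT DISTINCT spelling for the special duplicate-elimination syntax mentioned above):

```sql
SELECT F.field1, sum(F.field2) FROM Foo F GROUP BY F.field1;
SELECT F.field1, max(F.field2) FROM Foo F
    GROUP BY F.field1 HAVING F.field1 > 1;
SELECT DISTINCT B.field1 FROM Bar B;   -- duplicate elimination
```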
SQL places certain restrictions on the contents of the GROUP BY, HAVING and SELECT clauses. Any non-aggregates in the SELECT clause must be named in the GROUP BY clause. Likewise for any non-aggregates in the HAVING clause. Our implementation does not allow aggregates in the HAVING clause.
The results of any query can be ordered for presentation to the user by adding an ORDER BY clause. The results can even be ordered by an attribute that has been projected out of the final result (i.e., one that does not appear in the output). The results can also be ordered by any complex expression that generates an ordered data type.
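As an illustrative sketch (reconstructed examples, not taken from the original tutorial):

```sql
SELECT F.field1, F.field2 FROM Foo F ORDER BY F.field2;
SELECT F.field1 FROM Foo F ORDER BY F.field2;             -- ordering attribute absent from the output
SELECT F.field1 FROM Foo F ORDER BY F.field1 + F.field2;  -- ordering by an expression
```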
So far, all the queries have involved a single query block. This makes it tough to express complex queries. One option is to use views. A view is just a name for the result table defined by a query -- except that it is just defined, not executed. Once you define a view, that view may be used to construct another query, just as if it were a stored table. Use the CREATE VIEW statement:
CREATE VIEW PartialResult AS
  (SELECT F.field1 as attr1, max(F.field2) as attr2
   FROM Foo F, Bar B
   WHERE F.field2 > B.field2
   GROUP BY F.field1);
You can construct arbitrarily complicated queries using views.
Table expressions are supposed to be "transient views" -- in other words, if you do not want to have persistent views, but just want the convenience of structuring your query into understandable blocks, table expressions are the mechanism of choice.
(In the interactive tutorial, a sample table-expression query appears here as a button.)
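As an illustrative sketch (a reconstruction, not the original button query), a table expression in the FROM clause mirroring the PartialResult view defined above might look like:

```sql
SELECT T.attr1
FROM (SELECT F.field1 as attr1, max(F.field2) as attr2
      FROM Foo F, Bar B
      WHERE F.field2 > B.field2
      GROUP BY F.field1) T
WHERE T.attr2 > 2.0;
```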
Every ADT can define methods which can then be used in SQL queries. The type system is extensible, of course, allowing new data types to be added, and quite a few data types are distributed with the current release. How are tables with ADT fields created? How is data loaded into them? How are ADT methods used in queries? In this section, we address these questions.
Table creation is straightforward -- in fact, in Section 1, we saw an example of creating a table with a polygon field. Here, we use complex numbers as an example.
CREATE TABLE ComplexTest (id int, value complex) key(id);
CREATE B INDEX ComplRealIndx on ComplexTest (ComplexTest.value.real());
The key of a relation must be a field on which a unique B+ tree index can be built. Currently, this is restricted to a few data types (int, double and string). Consequently, a complex number cannot be the key of a relation. We hope to ease this restriction in the next release. As the example shows, an index can be created on the result of the real() method of the complex number ADT. This method returns an integer, which is the real component of the complex number. Indexes can be built on any complex expression that returns a suitable type.
Loading data into one of these tables is no different from the earlier examples. The only question is: how are the ADT values represented? The ADT developer usually makes a choice of whether the ADT objects are going to be "large" or not. If they are "large" (as in the case of the image ADT), it doesn't make sense to place the actual image within the load file. Instead, each image is placed in its own data file, and the load file only contains the file name. Similarly, in order to insert a single tuple, a "constant" image value can be specified by a type cast of a file name. This file name could instead be a URL, and it would be downloaded into the database.
Complex number values, on the other hand, are pretty small, and it is silly to put each complex number into its own file. Instead, they are specified inline in the data file. It is the responsibility of the ADT developer to decide how the values are going to be represented.
LOAD TABLE ComplexTest from "data/complex.data";
INSERT INTO ComplexTest VALUES (10, [complex]"10.0+20.1i");
We have already seen the use of an ADT method in creating an index. The use of methods in queries is similar. Wherever an expression, column value, or constant occurs, a method expression that produces a valid data type may appear. Here are some examples:
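The example buttons are not reproduced in this text; as an illustrative sketch (using the ComplexTest table and the real() and imag() methods mentioned in this section), queries of the following shape convey the idea:

```sql
SELECT C.id, C.value.real() FROM ComplexTest C;
SELECT C.id FROM ComplexTest C WHERE C.value.imag() > 0;
```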
ADTs can define their own aggregates. For instance, the complex number ADT defines "csum" which computes the sum of a group of complex numbers. These can be used like the other aggregates. The second example below shows the application of functions to the result of an aggregation.
(In the interactive tutorial, the sample queries appear here as buttons; one of them computes the imag() method on the sum of the complex numbers.)
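Based on the surviving comment, the two queries were of roughly this shape (an illustrative reconstruction, not the original text):

```sql
SELECT csum(C.value) FROM ComplexTest C;          -- sum of a group of complex numbers
SELECT csum(C.value).imag() FROM ComplexTest C;   -- imag() applied to the aggregate result
```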
Apart from ADT methods, there is also support for standalone functions. While the mechanism is in place, we have only provided a couple of such functions as a proof of concept, since our focus has been on ADTs. One function, intplus(), adds two integers, producing their sum. Another, matches(), checks if the first string argument is a substring of the second string argument. If so, it returns 1, else 0.
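As an illustrative sketch (the exact invocation syntax is our assumption, and we reuse the Locations table from earlier for the string example):

```sql
SELECT intplus(F.field1, 10) FROM Foo F;
SELECT L.field3 FROM Locations L WHERE matches("Lake", L.field3) = 1;
```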
By default, every "request" sent from the client to the server is treated as an ACID transaction (i.e., it either commits on success or aborts on failure). Failure could occur due to errors or lock timeouts. If you are using the txtclient, a "request" is usually a single command terminated by a semi-colon. From the Java GUI, it is possible to send multiple commands as a single "request" -- however, if you want to do this, try using explicit multi-statement xactions instead (see next section). Here are some examples of single-statement transactions:
INSERT INTO Foo VALUES (10, 5.5);
# Hit a CTRL-C, kill the server, and restart it
# The inserted tuple will be in Foo
SELECT * FROM Foo;
# Now look for abort/rollback.
# Insertion of a duplicate key is an error that causes a rollback
INSERT INTO Foo VALUES (10, 10.5);
SELECT * FROM Foo;
If you want to bracket several statements as a single transaction, you can do so.
# Begin a transaction
trans begin;
DELETE FROM Foo WHERE Foo.field1 = 10;
SELECT * FROM Foo;
# Abort the transaction
trans abort;
# Voila, the deleted tuple is back in place
SELECT * FROM Foo;
If there is an error in the middle of a multi-statement xaction, the transaction is aborted. Any subsequent request is then run as a separate xaction.
It is often convenient to set flags to specify different kinds of behavior in the system. This is especially useful in measuring the effects of various features. The preferences for the relational engine can be viewed and set as follows:
# list the current settings
list env;
# many of these are for internal use and debugging.
# Turn off display of results
setenv show_query_results 0;
# Turn on display of results
setenv show_query_results 1;
When running performance experiments, it is often very useful to set "show_query_results" to 0. If you are extending the system you may want to add to these flags. Look at the design document for details of how to do so.
Each E-ADT can provide its own set of options to control its behavior.
# collect statistics on a specific table Foo
runstats Foo;
# collect statistics on all tables
runstats;
The runstats command collects data distribution statistics which are used in query optimization. As new data is inserted or existing data is deleted, the statistics become inaccurate. Consequently, one should periodically execute runstats for good query optimization.
The window in which the server was started is called the "monitor" window --- here, there is a separate thread reading input requests and processing them. Actually, the language supported here is called "XXX" and it is processed by a different engine from the relational engine. Here are the commands that can be run from the monitor window:
adts      : list available ADTs
clients   : list current clients
engines   : list available engines
help      : shows this list
shutdown  : terminate server
sys_stats : print server statistics
sys_disk  : show disk volume stats
sys_shore : show SHORE stats
sys_reset : reset all stats
The most useful of these is, of course, "shutdown", which is the way to shut down the server (Ctrl-D works just as well). Some of the commands to print statistics are useful for system developers trying to understand where disk space is being used, etc. However, the only reasonable statistics currently gathered are by SHORE -- we hope to put in statistics for the rest of PREDATOR in the next release.
We have a simple tool, QPDisp, that displays the effects of query optimization and processing. In essence, it is a tree drawing package written in Java. When any query is executed at the server, a log of query optimization and processing information is dumped out to a diagnostic file (in a file query_proc_info.clientXX.txt in your XXXTMPDATADIR directory). Any such information is in the form of query trees. Currently, three categories of trees are monitored, corresponding to three different phases of query optimization. More categories will be added in the future. I really encourage you to try this tool -- it is almost like opening up the innards of the system.
QPDisp is a Java application. The Readme file in %InstallDir%\predator\java\qpdisp describes the QPDisp tool, how to install it and how to use it.
For example, consider the query: