Project 4: Search Engine Query Processor

This assignment is due on December 1, 2006.


Introduction

In this assignment, you will implement the query processor for the MiniSearch keyword-based search engine. Since some of the material is not in the textbook, read this assignment description carefully, and make sure you understand the design.

After you implement your query processor, we will hold a mini-contest as follows. You are required to measure the performance of your query processor running a given set of queries against the given input data, and report the average time cost. The group that builds the fastest query processor will get an extra 10% credit for this assignment!

Database Files

At a high level, MiniSearch stores and indexes a collection of web pages. Each query to MiniSearch is a set of key words. Given a query, the search engine returns the set of web pages, each of which contains all key words in the query. A naive query processing algorithm would be to sequentially scan all stored web pages and, for each web page, decide whether all key words in the query occur in it. This, however, is not efficient. We would like to build indexes to speed up query processing.

An important type of index in a search engine is called an Inverted List. It is a collection of mappings from key words to the documents in which they occur, and their positions within those documents. For example, for a document D1 with content "hello world", the inverted list will have two entries. The first entry maps the word "hello" to the (document ID, position) pair (D1, 1), and the second entry maps the word "world" to (D1, 2).

We encode such inverted lists as database heap files, referred to in this assignment as key word heap files, as follows. For each word w, we create a heap file named "word_w", storing all the (document ID, position) pairs of this word in the inverted list. Note that w here denotes a variable, and the corresponding key word heap file name should contain the content of that variable, not the character 'w' itself. Also note that the prefix "word_" for the heap file name can be replaced with any other fixed prefix. In the above example, we will have a heap file named "word_hello" containing one record (D1, 1), and another heap file named "word_world" containing one record (D1, 2). It is easy to see that these heap files together encode all the information in the inverted list. Since there is usually a large number of distinct key words in the web pages managed by the search engine, this scheme for encoding the inverted list into heap files can result in a large number of heap files in MiniSearch. It is therefore crucial for MiniSearch to be able to manage a large number of database files efficiently. Recall that in the last assignment you implemented a B+ tree that supports string keys, which enables efficient catalog management.
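
In C++, for example (assuming the word is held in a std::string named w), the heap file name is simply the fixed prefix concatenated with the word itself:

    // The heap file name embeds the content of w, not the letter 'w':
    std::string heapFileName = "word_" + w;   // w == "hello" yields "word_hello"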

Besides the above key word heap files, MiniSearch has one more heap file, referred to as WebPage. It stores the URLs of the web pages indexed by the search engine; that is, the web page URLs are stored as heap file records. Given this heap file, the document IDs stored in the records of the key word heap files are simply the record IDs of the corresponding records in WebPage.

Document loading and query processing algorithms

So far we have introduced the database files used to store the web page URLs and the inverted list indexes. In this assignment, we require you to implement loading web pages into these files, and using these files for query processing, in class QueryProcessor. You are free to add member functions and variables to QueryProcessor to help with the implementation.

Loading documents

You need to implement the document loading logic in QueryProcessor::loadWebPages(const string& webPagesFileName).

In an industrial search engine, web page documents are usually obtained by a front-end component called a web crawler, which continuously crawls the target web sites (e.g., the entire Internet) for updated web pages. Each retrieved web page is then parsed and loaded into the inverted list. The web page content may be cached in the search engine as well. In MiniSearch, we load each web page into the inverted list (the key word heap files), and store its URL in the "WebPage" heap file.

In this assignment, we do not provide you with a web page crawler or parser. Instead, the web pages to load into the engine have been preprocessed to retain only their URLs and their entries in the inverted list. This information is encoded in a binary stream, stored in a disk file with default file name "docStream.bin" under the directory "input". The binary stream format is as follows. It consists of a sequence of entries (we use the word entry to distinguish them from records) with the structure (url, num_pairs, (word, pos)*). Each such entry corresponds to one web page: the url field stores its URL, the num_pairs field is an integer denoting the number of (word, pos) pairs to follow, and each (word, pos) pair encodes one position of a particular word in this web page. For example, to encode document D1 in the example above, the binary stream will consist of only one entry, which looks like (the_url_of_document_D1, 2, hello, 1, world, 2). For more details, refer to DocRecord.cpp for how this binary stream is read and written.

Each entry of the binary stream is read by invoking DocRecord::Read(). The URL is stored in DocRecord::url, and the (word, pos) pairs in DocRecord::word_pos_pairs. The algorithm for loading this binary stream is as follows. For each entry e, insert a new record into the "WebPage" heap file with url = e.url. Let docID denote the record ID of the record you just created in "WebPage". Then, for each (word, position) pair (w, p) in e, insert a new record with content (docID, p) into the key word heap file "word_w". A minimal sketch of this loop appears below.
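
In this sketch, the HeapFile constructor, the InsertRecord(char*, int, RID&) method, the way DocRecord::Read() consumes the stream, and the types of DocRecord::url and DocRecord::word_pos_pairs are all assumptions for illustration; match them against your Minibase headers and DocRecord.cpp (which, like the rest of the project sources, are assumed to pull in the standard headers and use namespace std):

    // Sketch of QueryProcessor::loadWebPages. All signatures here are
    // assumptions -- adapt them to the declarations in your headers.
    void QueryProcessor::loadWebPages(const string& webPagesFileName) {
        ifstream in(webPagesFileName.c_str(), ios::binary);
        HeapFile webPageFile("WebPage");        // opens or creates "WebPage"
        DocRecord doc;
        while (doc.Read(in)) {                  // one entry per web page
            RID docID;                          // this record's ID is the docID
            webPageFile.InsertRecord((char*)doc.url.c_str(),
                                     (int)doc.url.size() + 1, docID);
            for (size_t i = 0; i < doc.word_pos_pairs.size(); i++) {
                const string& w   = doc.word_pos_pairs[i].first;
                int           pos = doc.word_pos_pairs[i].second;
                if (w.size() > 50) continue;    // discard over-long words (see FAQ)
                HeapFile wordFile(("word_" + w).c_str());
                char rec[sizeof(RID) + sizeof(int)];  // record content: (docID, pos)
                memcpy(rec, &docID, sizeof(RID));
                memcpy(rec + sizeof(RID), &pos, sizeof(int));
                RID unused;
                wordFile.InsertRecord(rec, sizeof(rec), unused);
            }
        }
        // Finally, external-merge-sort each key word heap file on docID.
    }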

After all the key word heap files have been loaded, you will need to implement the external merge sort algorithm to sort each file on the docID attribute. We keep each file sorted in order to reduce the time cost of query processing, described in the next subsection.

Here we provide some design guidance for external merge sort. First, a brief overview of the external merge sort process described in the textbook. Let the input heap file be inF. In the first iteration, you break inF into a set of initial runs and sort each run individually. In each subsequent iteration, you merge B-1 input runs into an output run, and repeat this process until only one run remains. In this assignment, we require you to set B, the number of buffer pages used in the external merge sort, to MINIBASE_BM->GetNumOfBuffers() / 5. Each run is stored in a heap file. Remember to delete these files after you are done with them.

Now we provide some details on implementing external merge sort.

First Iteration: As described in the textbook, each initial run is B pages long. To create each initial run, you can read K records from inF into a C++ container object (e.g., a C/C++ array, or an STL container such as std::vector or std::set), and sort these records in memory. K can be set to B*MINIBASE_PAGESIZE/size_of_record_in_inF. In a real database system, such a container object would be allocated in the database buffer pool. In this assignment, however, you do not have to put it into the Minibase buffer pool -- just create the object with malloc or new. After the records in this container are sorted, read them out sequentially and store them in the initial run; see the sketch below.
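
Here, WordRec is a hypothetical record layout, and the RID field names (pageNo, slotNo) are an assumption to be checked against your Minibase headers; the point is the container-plus-std::sort pattern:

    // Sketch: build one initial run by sorting K records in memory on docID.
    struct WordRec { RID docID; int pos; };    // hypothetical record layout

    bool byDocID(const WordRec& a, const WordRec& b) {
        // Assumes RID exposes pageNo and slotNo; order pairs lexicographically.
        if (a.docID.pageNo != b.docID.pageNo)
            return a.docID.pageNo < b.docID.pageNo;
        return a.docID.slotNo < b.docID.slotNo;
    }

    int B = MINIBASE_BM->GetNumOfBuffers() / 5;
    int K = B * MINIBASE_PAGESIZE / sizeof(WordRec);
    vector<WordRec> buf;
    buf.reserve(K);
    // ... read up to K records from inF into buf ...
    sort(buf.begin(), buf.end(), byDocID);
    // ... write buf sequentially into the heap file holding this initial run ...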

Later Iterations: In each later iteration, you repeatedly merge B-1 unprocessed input runs to create an output run, until all input runs of that iteration have been processed. Then move on to the next iteration, unless only one output run was created in this iteration, in which case the merge sort has finished: you just need to store the records of that single output run in the output heap file.

In order to merge B-1 input runs into an output run, you can open a scan on each input run. Create an array, denoted recs, of B-1 records to hold the current record from each input run. To decide which record to output next, scan recs for a record of minimal value and output it. Then read the next record into recs from the input run that the output record came from, as in the sketch below.
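
In this sketch, getNext() and emit() are hypothetical placeholders for your heap file scan and insert logic, and byDocID is the comparator from the first-iteration sketch:

    // Sketch: merge numRuns (up to B-1) sorted input runs into one output run.
    vector<WordRec> recs(numRuns);   // current record from each input run
    vector<bool>    valid(numRuns);  // false once a run is exhausted
    // ... open a scan on each run, prime recs[i] with its first record,
    //     and set valid[i] accordingly ...
    while (true) {
        int min = -1;
        for (int i = 0; i < numRuns; i++)       // pick the smallest current record
            if (valid[i] && (min < 0 || byDocID(recs[i], recs[min])))
                min = i;
        if (min < 0) break;                     // every run is exhausted
        emit(recs[min]);                        // append to the output run
        valid[min] = getNext(min, recs[min]);   // refill from that same run
    }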

Processing queries

You need to implement the query processing logic in QueryProcessor::query(const vector<string>& words).

Here is the high-level query processing strategy using the inverted list. For each key word w in the query, retrieve the content of the key word heap file "word_w". Then intersect all of these heap files on the docID value. The result of the intersection is a collection of docIDs; these "surviving" docIDs denote the documents that contain all the key words in the query. For simplicity, during query processing we are not concerned with the position field of the key word heap files in this assignment. For example, given two key word heap files, where the first contains two tuples {(D1, 1), (D1, 2)} and the second contains one tuple {(D1, 10)}, their intersection on docID should contain docID D1. You will need to remove duplicates: {D1} is valid output, but {D1, D1} is not. Finally, go through this collection of docIDs, find the corresponding URL of each from the WebPage heap file, and feed the obtained URL into the function _ioMgr->genResultURL(recURL, recURLLen) to output it to the user. Remember to free any temporary resources (e.g., heap files) that you allocate for processing each query.

Recall that you have already sorted each key word heap file during loading. Therefore, if the query involves k key words, you only need to implement the intersection of the k corresponding key word heap files. You could implement a separate intersection function, or re-use the "merge" logic of the external merge sort you implemented for document loading.
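
For example, here is a minimal sketch of the iterative binary strategy (the minimum requirement mentioned in the FAQ below). In it, a and b are hypothetical cursors over two docID-sorted key word heap files, and ridLess/ridEqual are hypothetical docID comparisons:

    // Sketch: intersect two docID-sorted lists, emitting each docID once.
    vector<RID> result;
    while (!a.done() && !b.done()) {
        if      (ridLess(a.cur(), b.cur())) a.advance();
        else if (ridLess(b.cur(), a.cur())) b.advance();
        else {                                   // equal docIDs: a match
            RID d = a.cur();
            if (result.empty() || !ridEqual(result.back(), d))
                result.push_back(d);             // de-duplicate the output
            a.advance();
            b.advance();
        }
    }
    // Intersect result with the next key word's list, and so on, until all
    // k lists are processed; then look up each surviving docID's URL in
    // WebPage and pass it to _ioMgr->genResultURL(recURL, recURLLen).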

Measurement

There are no test cases to pass in this assignment. Instead, you are required to process the set of queries stored in the file "input/query.txt" against the input document binary stream "input/DocStream_big.bin", and report the query results and the total time cost of query processing.

Note that you are also required to measure the total cost of loading the documents from "input/DocStream_big.bin" into Minibase. Specifically, measure the cost of the function QueryProcessor::loadWebPages.

The TA will measure these numbers again on his own machine, and use that machine as the final judge of which group has built the fastest query processor (minimum sum of document loading cost and query processing cost).

TA Evaluation

Please fill out the final TA evaluation form when it becomes available. It accounts for 5% of the assignment score this time. We would also appreciate your feedback. :-)

Getting Started!

The zip file for the project can be downloaded from the course management system. Once you unzip the file, the directory contains the project source files, among them QueryProcessor.h, QueryProcessor.cpp, blockjoin.cpp, and DocRecord.cpp.

You are free to add additional member fields and methods to class QueryProcessor. The files you may modify are QueryProcessor.h, QueryProcessor.cpp, and blockjoin.cpp.

Front ends

You are provided with three front-end components for interacting with the search engine. A console-based front end lives in the same process as the search engine, and passes user input obtained from standard input directly to the query processor via function calls. This is a lightweight way to interact with the search engine, and you may find it convenient during debugging. To use the console front end, start the executable with input parameter 0. In the Visual Studio development environment, you can set run-time input parameters for the compiled executable with the following steps.

  1. Right-click the "QueryProcessor" project in "Solution Explorer".
  2. Choose "Properties" from the pop-up menu.
  3. Navigate to "Configuration Properties" -> "Debugging".
  4. Set your input parameter in the "Command Arguments" text box.

A file-based front end also lives in the same process as the search engine, and reads input queries from a disk file with default name query.txt, stored under the "input" directory. To use the file front end, start the executable with input parameter 1. This front end provides a way to batch-process a number of queries and measure the total time cost of query processing.

A web-based front end lives in a separate process, and passes user input to the query engine via TCP socket communication (on port 13000). To use the web front end, start the executable with input parameter 2. A console will pop up to display the status of the search engine; at this point, it will display "[query server] waiting for network connection". Next, start the provided Microsoft ASP.NET-based front end by opening SearchEngineFront.sln. Set the page Search.aspx as the start page by right-clicking Search.aspx in "Solution Explorer" and choosing "Set As Start Page". Now start the front end with F5 or Ctrl+F5, and you should see the MiniSearch home page load in the default web browser. Type queries into the text box and hit either the Search button or Enter on the keyboard to start a search. After you issue your first query, the search engine console will display "[query server] connection setup" to indicate that the connection has been established. The web front end is just a fancier way to interact with the search engine, and provides no more functionality than the console front end; whether to use it is entirely up to you.

Creating and Uploading the Zip File

You should submit exactly the same set of files given to you in this assignment, plus the report file containing the query results and time costs mentioned above. You may also include an optional project document.

Zip only the above files in your directory and upload the archive to the course management system by the deadline. No late submissions will be accepted.

FAQ of Assignment 4

Reading Binary Stream

Q: Some key words are larger than 50 bytes. When we try to create key word heap files for them, the heap file names are too long, which causes the database to return an error. What should we do?

A: If a word is larger than 50 bytes, feel free to discard it; such words are most likely noise, such as http links, which are not of concern for key word search. The error you encountered will not be treated as your fault in the assignment.

Q: Some key words from the input binary stream seem to be "noise". For example, they contain commas and other punctuation, or are http links. Can we discard them when reading the binary stream?

A: Unfortunately, we cannot allow this kind of optimization, since what to discard is not well-defined. In the extreme case, you might throw away anything "irrelevant" to the given set of queries, thereby reducing your document loading time; that is not what we expect for this assignment. In short, you will need to load all key words (of less than 50 bytes) into key word heap files.

Q: When we read the input binary stream, do we need to worry about duplicate documents (i.e., documents that have the same URL)?

A: No. You can assume the input binary stream does not contain two documents of the same URL. That is, when you insert records into WebPage heap file, you can assume these records all have different URLs.

Q: The file docStream_big.bin does contain some duplicate URLs, such as http://www.cs.cornell.edu/gries/. How should we deal with these duplicate URLs?

A: In this case, you can simply insert one record per occurrence of such a duplicate URL into the WebPage heap file, given the assumption stated in the previous FAQ entry. Consequently, some queries may produce duplicate URLs as output; this will be treated as fine.

External Merge Sort

Q: For the external merge sort, during a pass, does each run have to be stored in a heap file, or can we use a C++ container object?

A: You will have to use heap files to store runs, since it is not clear that all the intermediate runs can fit in the buffer pool. Feel free to come up with your own naming convention for these intermediate heap files, such as "tmp_1", "tmp_2", and so on.

Q: How do I create a new heap file, or open an existing heap file? Do I need to access the database catalog by calling MINI_DB->getFileEntry and so on?

A: You just need to create a HeapFile object with the desired heap file name. If no heap file with that name exists in the database, a new file is created and a new entry is inserted into the database catalog by the HeapFile constructor; otherwise, the existing file is opened. The semantics of the HeapFile constructor are thus fairly similar to those of fopen() in the C standard library and the ifstream constructor in C++.

When you want to delete a heap file, invoke HeapFile::DeleteFile(). It deallocates the heap pages in that file and deletes the heap file entry from the database catalog.
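
For example (the exact constructor arguments here are an assumption -- some Minibase versions also take a Status& out-parameter, so check HeapFile's declaration):

    // Open the heap file "tmp_1", creating it if it does not exist yet.
    HeapFile* run = new HeapFile("tmp_1");
    // ... insert and scan records through the usual HeapFile interface ...
    run->DeleteFile();   // deallocates its pages and drops its catalog entry
    delete run;          // destroy the in-memory handle afterwards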

Query Processing

Q: In query processing, we need to intersect a set of heap files. Usually the number of key words is less than B, the number of available buffer pages, in which case we only need to intersect once. Can we assume this is always the case, or should we implement a general k-way intersection strategy, where k is a parameter that can be set between 2 and B-1?

A: You should implement the general k-way intersection. The minimum requirement is to implement an iterative binary intersection strategy, in which k is always 2 and cannot be changed. Binary intersection produces correct results, but its cost will be higher than that of a (B-1)-way intersection.

Measurement

Q: For the time measurements, can we just use the front end with parameter 1?

A: Yes, that is what we expect you to do, so that any delay caused by manual input is excluded. Note that you can measure the document loading time and the query processing time separately, and then add up the two numbers; this way, no waiting for manual input is involved. (Measuring the document loading cost is optional, but it is easy to do anyhow.)
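
One simple way to take these measurements is with the C library's clock(), sketched here; qp is a hypothetical stand-in for your QueryProcessor instance, and the file names are the ones given in the Measurement section:

    #include <ctime>

    clock_t t0 = clock();
    qp.loadWebPages("input/DocStream_big.bin");           // document loading
    clock_t t1 = clock();
    // ... let the file front end process every query in input/query.txt ...
    clock_t t2 = clock();
    double loadSecs  = double(t1 - t0) / CLOCKS_PER_SEC;  // loading cost
    double querySecs = double(t2 - t1) / CLOCKS_PER_SEC;  // query processing cost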

Front Ends

Q: The web front end does not seem to display results of the input queries. What could go wrong?

A: Recall that you use _ioMgr->genResultURL(recURL, recURLLen) to output results. After all URL results for the current query have been generated, invoke the function once more as _ioMgr->genResultURL(NULL, 0) to mark the end of the results sent to the web front end. This way, the front end knows when it has received all the results, and can display them in the HTML page.


Mingsheng Hong
Last modified: 11/30/2006