312-oogle
Part of the code (in the file server-fn.sml) implements a simple web server. The server is implemented as a functor ServerFn, parametrized over a query engine, which we will develop later in this problem set. The server serves whatever pages are stored in the local_pages directory under z:\cs312\ps4 (or wherever you downloaded your source code). A sample set of web pages is installed there by default, namely the documentation pages for the SML Basis Library.
Warning: do not leave the web server running -- anyone can connect to it and start browsing through the directory local_pages. In particular, do not start the server in some other directory as you're opening up the contents of your machine to the outside world.
To start the server, first instantiate the functor with a query engine. The structure BasicQuery is the skeleton of a query engine that you will fill out in the exercises below. Currently it does nothing, but the server will still function correctly if you use it (it just won't answer search queries). Instantiate the functor by typing, for example,
structure S = ServerFn (structure Q = BasicQuery);
The only function exported by the server is start, which takes a single integer argument: the port on which the server will respond to HTTP requests. Use a non-standard port such as 8080:
S.start (8080);
If all worked correctly, the web server should now be running and awaiting HTTP requests on that port. If the server failed to start, it may be because a previous run failed to release the port. Try again, increasing the port number by 1 until you find an available port.
To connect to the web server, fire up your favorite web browser and point it at your machine: type the URL http://127.0.0.1:8080/ (using the port number on which you started the server, of course). This should bring up a simple page offering you the option to browse the library or to search it. The search option is the one you will implement in this problem set.
The URL given above uses the loopback address, an IP address that always refers to the local machine. You can also connect to the server using a hostname. Say your machine is called fizbin. Typing the URL http://fizbin:8080/ in your browser should get you the same page as above.
IMPORTANT: to shut down the web server gracefully, either press the button marked SHUTDOWN on the initial web page, or invoke the URL http://127.0.0.1:8080/SHUTDOWN from your browser. This will stop the web server. (And let us not mention the security problems caused by allowing browsers to shut down web servers. We never claimed this was a production server!)
Note that whenever you make changes to your code that affect the server, not only do you need to recompile the code using CM.make(), but you also need to reinstantiate the server by applying the functor ServerFn to a query engine. (Can you explain why?)
Recall the foldl function for lists:
fun foldl (f:'a*'b->'b) (accum:'b) (x:'a list) : 'b =
  case x of
    nil => accum
  | hd::tl => foldl f (f(hd,accum)) tl
Fold can be used to crawl over a list, visiting each element to accumulate some information about the list. For instance, to calculate the sum of the elements in a list, we simply write:
val sum : int list -> int = foldl (op +) 0
Calling sum on a list, say [1,2,3,4] = 1::2::3::4::nil, results in:
sum (1::2::3::4::nil) =
foldl (op +) 0 (1::2::3::4::nil) =
foldl (op +) (1+0) (2::3::4::nil) =
foldl (op +) (2+(1+0)) (3::4::nil) =
foldl (op +) (3+(2+(1+0))) (4::nil) =
foldl (op +) (4+(3+(2+(1+0)))) nil =
4+(3+(2+(1+0))) = 10
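The Basis Library's own List.foldl behaves the same way (it applies f to the pair (element, accumulator)), and partial application gives other one-pass list functions for free. For example, a length function (named len here to avoid shadowing the built-in length):

val len : int list -> int = List.foldl (fn (_, n) => n + 1) 0

(* len [5,6,7] evaluates to 3 *)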
If we wanted to collect all of the elements of a list into a set (where the set is implemented using a structure SetInt with signature ORD_SET), we might write something like this:
fun collect (x: int list) : SetInt.set =
foldl SetInt.add' SetInt.empty x
Really, we should write this as:
val collect : int list -> SetInt.set = foldl SetInt.add' SetInt.empty
If we wanted to add up all the elements in the list and collect a set of all of the elements in one pass, then we can do this as follows:
local
  fun combine (i:int, (sum:int, s:SetInt.set)) : int * SetInt.set =
    (sum + i, SetInt.add (s, i))
in
  val sum_and_set : int list -> int * SetInt.set =
    foldl combine (0, SetInt.empty)
end
Try out these definitions in SML and play around with them. Understanding how they work will help a lot with this assignment.
The SML/NJ Library provides efficient implementations of sets and maps that will be very helpful. They are based on purely functional red-black trees, which are balanced binary search trees. You've seen red-black trees, so you know why they are so nice.
The SML/NJ library provides functors to construct sets and maps over any type for which a compare function can be defined.
To create a structure implementing sets over a type T with an operation compareT, we use the declaration:
structure SetT = RedBlackSetFn (struct
type ord_key = T
val compare = compareT
end)
The resulting structure has the signature ORD_SET.
To create a structure implementing maps (dictionaries) over a type T with an operation compareT, we use the similar declaration:
structure MapT = RedBlackMapFn (struct
type ord_key = T
val compare = compareT
end)
The resulting structure has the signature ORD_MAP.
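For instance, here is one plausible way to build string sets and string-keyed maps of the kind used later in this problem set. This is a sketch; it assumes the SML/NJ Library is visible in your session (e.g., via smlnj-lib.cm in your CM description file):

(* Sets of strings, ordered by String.compare *)
structure StringSet = RedBlackSetFn (struct
                                       type ord_key = string
                                       val compare = String.compare
                                     end)

(* Maps keyed by strings *)
structure StringMap = RedBlackMapFn (struct
                                       type ord_key = string
                                       val compare = String.compare
                                     end)

(* A quick check in the REPL:
   StringSet.member (StringSet.addList (StringSet.empty, ["a","b"]), "a")
   evaluates to true *)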
Exercise 1: Graph crawling
For this problem, you will define a function crawl that is similar
to foldl, except that it works on (rooted) directed graphs instead of
lists. A directed graph consists of a set of nodes and a set of edges
between nodes. For example, we can think of web pages as nodes, and the
URLs on a page as edges leading to other nodes (i.e., pages). The signature
for directed graphs that we will be using is in the file
sig-graph.sml and looks like this:
(* signature for directed graphs *)
signature GRAPH =
sig
type edge
type node
(* comparison on edges *)
val compare_edge : edge * edge -> order
(* return the node that the edge is pointing to *)
val node_of_edge : edge -> node
(* return the list of edges leaving a node *)
val edges_of_node: node -> edge list
end
The crawl function you write will allow us to fold a function across a set of nodes to accumulate information about a graph, such as the number of URLs in the graph, or the set of all words that occur on its pages, etc. The crawl function should have type:
val crawl : (edge * node * 'a -> 'a) -> 'a -> edge -> 'a
(Note that the accumulating function takes as arguments both the current edge being traversed and the node that edge points to. This is purely for efficiency, to avoid having to recompute the node in the accumulating function.)
We're not going to tell you how to write crawl -- you should figure this out for yourself. But in general, crawl should visit each edge e in the graph, applying a function to the edge and accumulator to get a new accumulator. You'll need to use the functions node_of_edge and edges_of_node to traverse the graph. We don't care what order you use to traverse the edges of the graph.
The tricky part in all of this is making sure that your crawler (a) terminates and (b) doesn't process an edge more than once. Note that (b) is different from saying that crawl doesn't process a node more than once (can you explain why?).
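We won't reveal how to write crawl, but the standard idiom for guaranteeing both (a) and (b) is to thread a set of already-visited items through the recursion. As a neutral illustration (over a plain list, not the graph), here is duplicate elimination with an accumulated set; it assumes a StringSet structure built with RedBlackSetFn over strings, as described earlier:

(* Keep only the first occurrence of each string: an item is
 * processed only if it is not already in the 'seen' set. *)
fun visitOnce (items : string list) : string list =
  let
    fun go (nil, _, acc) = List.rev acc
      | go (x :: rest, seen, acc) =
          if StringSet.member (seen, x)
          then go (rest, seen, acc)                        (* already visited *)
          else go (rest, StringSet.add (seen, x), x :: acc)
  in
    go (items, StringSet.empty, nil)
  end

The same seen-set discipline, applied to edges, is what keeps a graph traversal from looping forever on cycles.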
TURN IN: in the file crawler-fn.sml, you will
find a partially completed functor CrawlerFn which takes a
GRAPH structure as an argument. You should complete the definition of
crawl as described above.
The signature INT_GRAPH specializes the GRAPH signature to graphs where the nodes contain integers:
(* Specialized signature for graphs where the nodes contain integers. *)
signature INT_GRAPH = sig
include GRAPH
val root : edge
val node_value : node -> int
end
The "include GRAPH" in the signature is a short-hand for writing down everything that appears in the GRAPH signature. From this, it is clear that any functor expecting a GRAPH can also be passed an INT_GRAPH, because an INT_GRAPH is a sub-type of GRAPH. An example of an INT_GRAPH structure is the following:
structure SampleIntGraph :> INT_GRAPH = struct
type edge = string
type node = int * edge list
val graph : (edge * node) list =
[ ("foo",(1,["bar","baz"])),
("bar",(2,[])),
("baz",(3,["foo","bar","baf"])),
("baf",(4,["baz"])) ]
val root : edge = "foo"
exception Bad_Edge
fun node_of_edge(e:edge):node =
case List.find (fn (e2,_) => e2 = e) graph of
NONE => raise Bad_Edge
| SOME(_,n) => n
fun edges_of_node((_,es):node):edge list = es
fun node_value ((v,_):node):int = v
val compare_edge = String.compare
end
Here, we represent edges as strings, and nodes as a pair of an integer (the value of the node) and a list of edges. The entire graph is hidden in the structure.
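A quick sanity check on SampleIntGraph (only the operations listed in INT_GRAPH are visible through the opaque ascription):

val n  = SampleIntGraph.node_of_edge SampleIntGraph.root
val v  = SampleIntGraph.node_value n        (* 1 *)
val es = SampleIntGraph.edges_of_node n     (* the edges leading to "bar" and "baz" *)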
The file crawl-demo-fn.sml declares a functor
IntGraphCrawlDemoFn that takes as a parameter an integer graph and
returns a structure implementing three functions on that integer graph. Your
task for this exercise is to implement those three functions. The structure
internally defines a structure IntGraphCrawler which is an instance
of the functor CrawlerFn (implementing your crawling function)
applied to the integer graph supplied as a parameter to
IntGraphCrawlDemoFn. The three functions you need to implement are specified in crawl-demo-fn.sml.
Make sure to test your code (e.g., using the SampleIntGraph structure). For example, declare a structure:
structure Test = IntGraphCrawlDemoFn (structure I = SampleIntGraph);
and try calling the functions in the resulting structure Test. You don't have to turn in SampleIntGraph, so feel free to modify the int graph to help you test your functions.
TURN IN: submit your code contained in
crawl-demo-fn.sml.
We're done with the warm-ups. Now you're going to write a web crawler. The
signature WEB_GRAPH refines the GRAPH signature for the web and
can be found in the file sig-web-graph.sml:
(* Specialized signature for graphs which are actually web
* graphs.
*
* Extends the signature GRAPH by specifying the types of
* both edges and nodes, and by an extra operation words_of_node
*)
signature WEB_GRAPH = sig
type url = string
type page
(* Edges are simply URLs *)
type edge = url
(* Nodes are page options. This is to account for the fact that
* fetching a page may fail (i.e. return NONE).
* A node n with value NONE is such that edges_of_node (n) = []
* and words_of_node (n) = [] *)
type node = page option
(* As in the GRAPH signature *)
val compare_edge : edge * edge -> order
val node_of_edge : edge -> node
val edges_of_node: node -> edge list
(* Returns the words on the page, filtering out tags
* but leaving the words in order otherwise.
*)
val words_of_node: node -> string list
end
Here, edges are URLs and nodes are web pages. Calling node_of_edge fetches the web page associated with the URL. The edges_of_node function pulls out all of the URLs on a given page. The function words_of_node pulls out all of the "words" on a web page; for this assignment, a "word" is anything not inside an HTML tag that is surrounded by white space (i.e., spaces, tabs, newlines, etc.). We'll provide you the code to do all of this as part of the web server that you're going to extend.
Your job is to implement an indexer for the web in order to support web searching. When you go to your favorite search engine, you can type in a query like "foo or bar" and get back a web page listing the URLs of all web pages that contain those words. In this case, the search engine should return the URLs of any pages that contain the word "foo" or the word "bar".
What really happens is that when you type in your query, a special HTTP request is sent to the server. The server parses the request and then tries to answer the query. Once it has computed the set of URLs that satisfy the query, it builds a new HTML document and sends that back to the client's web browser to be displayed.
In order to answer the queries, the search engine first builds an index for (some fraction of) the web. An index is some data structure that provides enough information that we can quickly and easily determine whether a given page satisfies a given query. In real search engines, we must be extremely careful to make the process of building the index fast, and the data structure for representing the index efficient. In addition, we must be able to answer queries quickly.
The indexer and query engine is a parameter to the ServerFn
functor, and has a signature QUERY found in the file
sig-query.sml. The signature is defined as follows:
(* signature for query engines *)
signature QUERY = sig
(* the type of parsed queries *)
type query
(* the type of compiled indexes *)
type index
(* Crawl the web starting from the given URL and compile an
* index of all the words on all the pages visited *)
val indexWeb : StringSet.set -> WebGraph.url -> index
(* Parse a query from a string into the internal
* representation for queries. Returns NONE in case of parse
* error, SOME (q) for success *)
val parseQuery : string -> query option
(* Given a query and a compiled index, returns the list of
* all the URLs pointing to a page matching the query *)
val processQuery : query * index -> WebGraph.url list
end
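Once the structure is implemented, a session with a query engine might look like the following sketch. The URL is just a placeholder, and the meaning of the StringSet.set argument to indexWeb is determined by your implementation; this is only meant to show how the three operations in the signature fit together:

(* Hypothetical: build an index by crawling from a root URL *)
val idx : BasicQuery.index =
    BasicQuery.indexWeb StringSet.empty "http://127.0.0.1:8080/index.html"

(* Parse a query string, then run it against the index *)
val results : WebGraph.url list =
    case BasicQuery.parseQuery "foo or bar" of
        NONE => []            (* parse error *)
      | SOME q => BasicQuery.processQuery (q, idx)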
The file basic-query.sml contains a structure BasicQuery implementing the above signature. For now, we ask you to implement the indexing part of that structure. The system defines the functor WebCrawler (in the file web-crawler.sml), which applies the crawling functor you defined previously to web pages. Specifically, you'll need to implement indexWeb using WebCrawler.crawl.
Warning: Don't try crawling the real web, for two reasons. First, it's huge, and you don't have enough space on your machine to do the indexing. Second, real crawlers obey certain protocols that we have not described here; in particular, they look for a file (robots.txt) that directs which pages they may and may not visit. For more information, read up on the robots exclusion standard.
The user sends queries to the query engine typically by interacting via a form on a web page, like the one on the simple web page initially served by the server. When you type a query in the text box and send it off, this query is sent to the web server.
BasicQuery supports queries consisting of single words combined with the operators 'and' and 'or'.
When the web server receives a text string representing a query from the user, it first calls the function parseQuery in the query engine to translate the string representation to an internal representation of queries. We have chosen a datatype representation for queries, with the following definition:
datatype query = Word of string
| Or of query * query
| And of query * query
The web server calls parseQuery to obtain a query value representing the request, and then calls processQuery, which returns a list of URLs corresponding to pages matching the query.
A web page p matches a query q as follows:
p matches <word> if <word> occurs on the page
p matches (q1 and q2) if p matches q1 and p matches q2
p matches (q1 or q2) if p matches q1 or p matches q2
Your queries should be case insensitive (like Google). Note that the words returned by the function words_of_node are in whatever case they were found in on the webpage, and the words in the datatype query are also as the user entered them.
Implement the function processQuery, using the index (constructed
during crawling and passed in as an argument).
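One natural design, offered only as a sketch (the index type is yours to choose), keeps the index as a map from each lowercased word to the set of URLs on which it occurs; processQuery is then a recursion over the query datatype, with Or as set union and And as set intersection. It assumes StringSet and StringMap structures built with the red-black functors over strings:

(* Sketch: an index maps a lowercased word to the set of URLs containing it *)
type index = StringSet.set StringMap.map

fun urlsFor (idx : index) (w : string) : StringSet.set =
    case StringMap.find (idx, String.map Char.toLower w) of
        NONE => StringSet.empty
      | SOME urls => urls

fun eval (idx : index) (q : query) : StringSet.set =
    case q of
        Word w       => urlsFor idx w
      | Or (q1, q2)  => StringSet.union (eval idx q1, eval idx q2)
      | And (q1, q2) => StringSet.intersection (eval idx q1, eval idx q2)

fun processQuery (q : query, idx : index) : string list =
    StringSet.listItems (eval idx q)

Lowercasing words both at indexing time and at query time is one easy way to get the required case insensitivity.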
Test your implementation by instantiating the web server using the basic query engine, as follows:
structure S = ServerFn (structure Q = BasicQuery);
Experiment with basic queries.
TURN IN: submit your code in
basic-query.sml.
Let us now add support for 'not' queries. The file
negative-query.sml defines a structure NegativeQuery
(with signature QUERY) implementing a query engine that recognizes
the following grammar for queries:
NegativeQuery supports queries that extend the basic syntax with negation, written !q.
The corresponding datatype definition, specifying the value returned by NegativeQuery.parseQuery, is as follows:
datatype query = Word of string
| Or of query * query
| And of query * query
| Not of query
We have provided an implementation of parseQuery that translates a string representation of a query into a value of the query type.
Complete the implementation of the structure NegativeQuery to
support negative queries. This will probably require you to use an index
type different from the one you used in BasicQuery, and therefore
force you to write a slightly different indexing function than the
indexWeb you implemented in BasicQuery. However, the changes
are minimal. (Don't worry about code duplication here. We want the
implementations separate so that we can more easily test them.) There is
more than one way to do this. The function processQuery will need to
take into account the new constructor Not: A web page p
matches the query !q if p does not match the query
q.
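One possibility for handling Not, sketched here, also suggests why the index type may need to change: if the index remembers the set of all URLs visited during crawling, then !q is just a set difference. (The record type and field names below are illustrative, not required.)

(* Sketch: keep the set of all indexed URLs alongside the word map *)
type index = { words   : StringSet.set StringMap.map,
               allUrls : StringSet.set }

(* URLs matching !q = all indexed URLs minus the URLs matching q *)
fun evalNot (idx : index, urlsMatchingQ : StringSet.set) : StringSet.set =
    StringSet.difference (#allUrls idx, urlsMatchingQ)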
Test your implementation by instantiating the web server using this new query engine, as follows:
structure S = ServerFn (structure Q = NegativeQuery);
Experiment with queries containing 'not' subqueries.
TURN IN: submit your code in
negative-query.sml.
If the previous exercise was a simple muscle-flexing exercise, this is
the bench press exercise. Let us now add support for 'phrase' queries.
Intuitively, a query "<word 1> ... <word n>" is
matched by any page that contains the words <word 1>, ..., <word n>
immediately next to each other in the given order.
The file phrase-query.sml defines a structure
PhraseQuery (with signature QUERY) implementing a query
engine.
PhraseQuery supports queries that additionally allow quoted phrases.
The corresponding datatype definition, specifying the value returned by PhraseQuery.parseQuery, is as follows:
datatype query = Word of string
| Or of query * query
| And of query * query
| Not of query
| Phrase of string list
We have provided an implementation of parseQuery that translates a string representation of a query into a value of the query type.
Complete the implementation of the structure PhraseQuery to support phrase queries. You have full freedom to use whatever type of index you want. Implement the function indexWeb accordingly (still using WebCrawler.crawl). The function processQuery will need to take into account the new constructor Phrase.
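To make the matching condition concrete, here is a self-contained check that a phrase occurs contiguously (and case-insensitively) in a page's word list, as returned by words_of_node. This only illustrates the condition itself; a real index would precompute something smarter (e.g., word positions per page) rather than rescanning word lists at query time:

(* Does the phrase ws occur contiguously in the word list page? *)
fun containsPhrase (page : string list, ws : string list) : bool =
  let
    fun lower s = String.map Char.toLower s
    (* does the word list start with the phrase? *)
    fun startsWith (_, nil) = true
      | startsWith (nil, _ :: _) = false
      | startsWith (p :: ps, w :: rest) =
          lower p = lower w andalso startsWith (ps, rest)
    (* try the phrase at every starting position *)
    fun scan nil = startsWith (nil, ws)
      | scan (l as _ :: tl) = startsWith (l, ws) orelse scan tl
  in
    scan page
  end

(* containsPhrase (["The","SML","Basis","Library"], ["sml","basis"])
   evaluates to true *)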
Test your implementation by instantiating the web server using this new query engine, as follows:
structure S = ServerFn (structure Q = PhraseQuery (structure P = Parser3));
You'll notice that PhraseQuery is a functor that takes in Parser3. This is because in the next exercise, you'll be modifying a different parser, ImplicitAndParser, which can be used instead of Parser3 to provide even more functionality to phrase queries.
Experiment with queries containing phrases.
TURN IN: submit your code in
phrase-query.sml.
So far, all the code for parsing has been given to you. Since this course does not teach the details of parsing (for that you should take CS412), you won't have to deal with parsing issues much. However, it is impossible to implement a language (such as the set of all possible queries) without parsing, so it is good experience to at least attempt to figure out how a parser works. The queries we are using lend themselves to a class of parsers known as "recursive descent parsers," which are easy to code by hand in comparison to other kinds of parsers.
You will be modifying the PhraseQuery parser so that it assumes two adjacent words have an implicit AND between them. Such queries would be rejected by the parser used in problem 5. All queries accepted in problem 5 should produce exactly the same results in this problem.
Test your implementation by instantiating the web server on PhraseQuery using this new parser, as follows:
structure S = ServerFn (structure Q = PhraseQuery (structure P = ImplicitAndParser));
The file implicit-and-parser.sml is currently the same as
the parser3.sml that you used in the previous problem. You
should change it so that it adds implicit AND's between words. This change
should only amount to a few lines of code (probably three to five lines,
depending on how you space it). Please insert a comment above the section of
code that you change.
TURN IN: submit your updated code in
implicit-and-parser.sml.
Create yet another parser, based on implicit-and-parser.sml, that does not add an implicit AND between a common word (such as "and", "or", "where", "how", "of", "a", "an", etc.) and a non-common word. If a query contains only common words, then implicit ANDs are added between all of them.
TURN IN: submit your code in karma-parser.sml.
Problem set prepared by Frances, Saikat, and Jeff.