CS211 Assignment 5: Finding Website Links

Due Thursday, 11 October





What to hand in: We will tell you later what to hand in.

Purpose of assignment: You will write a program that reads a website (a set of web pages reachable from a given one) and constructs the set of all links on the website. Here's how it woks. The user types a Url u1 (say) into the first String field of the GUI, types a second Url u2 into the second String field, and clicks button Ready. The program then prints on the Java console all links on webpages that are reachable from Url u1 and that have u2 as a prefix.

For example, if u1 and u2 are

http://www.cs.cornell.edu/Courses/cs211/2001fa/
http://www.cs.cornell.edu/Courses/cs211/2001fa/
then only webpages reachable from http://www.cs.cornell.edu/Courses/cs211/2001fa/ whose Urls begin with http://www.cs.cornell.edu/Courses/cs211/2001fa/ will be processed --those are all the webpages on our course website.  If u2 is
http://www.cs.cornell.edu/Courses/cs211/
then all reachable pages on http://www.cs.cornell.edu/Courses/cs211/2001fa/ will be processed. If u2 is

             http://www.cs.cornell.edu/

then all reachable pages on the CS Department's domain name will be processed. Finally, if u2 is the empty sequence of characters, all reachable websites will be processed. So, u2 gives the user a chance to narrow the search for webpages to process. We suggest that when you first try your program, you restrict the reachable sites processed to just one file; then, gradually widen it.

You may want to test using our course website. You can also try some html files on your own machine --if you don't know what url to use for them, drag one of them over an internet-explorer or netscape window, and it will open in that browser. Then, you can copy the url from the Address or Location field.

Html tags and links

For assignment 4, you wrote two enumerations, TagEnumeration and LinkEnumeration. These shouldbe used in this project. If you are not sure of your own answer, you can use ours, which will be available Thursday morning: TagEnumeration.java and LinkEnumeration.java.

The program

You are free to write the program any way you wish. However, it can take more time than you have available to develop this program from scratch, so we provide you with a number of things.

Class Set. A "set" is an (unordered) collection of Objects (with no duplicates). An instance of class Set, in file Set.java, represents a set. There are methods for

An instance of class Set could contain the set of links (as Strings) on a web page, for example.

You don't have to implement class Set; we give you the whole thing. However, you should study its implementation, so that you could write it yourself if you had to. Basically, the elements of the set are maintained in a Vector. Note that the Enumeration used for the set is simply the one provided in Vector.

Classes JLiveWindow and MyJLiveWindow. We give you these two classes, in files JLiveWindow.java (which you don't have to change) and MyJLiveWindow.java. Method buttonPressed of class MyJLiveWindow is already set up to get the values from the first two String Fields of the GUI, construct the Set of links, and print the links on the Java Console. Read it to gain some understanding.

Class Webpage. An instance of class Webpage represents a webpage (an html page, a jpeg file, etc.). The constructor has a String argument, which is the Url for the webpage. In file Webpage.java (bring up this document, from the course webpage, in your browser to use the links),we give you a skeleton of this class, with all the methods defined but some method bodies uncompleted. One of your first tasks is to complete this class. CHECK IT OUT: Make sure it is correct before proceeding.

Class LinkChecker. Class LinkChecker in file LinkChecker.java is the main one for you to program. You can start with our version of it from which some pieces have been removed. We want you only to fill in the bodies of the following methods; the rest is done for you.

(1) Method shouldProcess. This method, as its specification indicates, yields true iff its parameter is a web page that can have links. The specification defines what this means.

(2) Method addAllLinkedURLs, with looks like this:

private void addAllLinkedURLs(Webpage wp, Webpage onWp)
is the recursive procedure that processes the graph of web pages that are reachable from the root web page, placing all links on them in Set variable links. The complete specification is given on the method iteself (or in the Javadoc comments), but this picture and the discussion below it should help:
This picture shows the representation whenever the method is called (recursively). Web page wp is the next one to process, and it is known that
there is a link to it on webpage onWp --that's what the red arrow signifies. It is also known that web page wp is not in set urls --it hasn't been processed yet-- and that its name has prefixString as a prefix.

Everything in black represents webpages that have not yet been processed.

Everything in blue-green represents web pages that have already been processed, so their urls are all in set urls (as Strings).

In order to make progress in this situation, the body of this method has to:

(1) Add the url of wp (as a String) to urls, denoting that it has (or is being) processed;

(2) Save the set of links on webpage wp in a local Set variable --if this turn out to be null, meaning that there is something wrong with the url of wp --malformed, or can't be found, or whatever-- print an error message to the effect that the url of wp, which is on web page onWp, is bad. Then return, terminating the method body.

(3) Add to Set links all the links that are found on web page wp.

(4) For each link L (say) on web page wp, if L should be processed as a page that may contain links (that's what method shouldProcess checks for), then use a recursive call to add all links reachable from L.

That's it. To start off, one would:

Set field linkSet to contain a single element: the url (as s String) for the root of the web site;
Set field urls to represent the empty set;
addAllLinkedURLs(url for the root of the web site, null);