CS 430 / INFO 430
Information Retrieval
Fall 2006

Hints on Extracting URLs


Forms of URLs

Various forms of URL may point to the same file. The pages referenced from test4.txt are fairly straightforward, but you must be prepared for the following:

Relative URLs

The full form of a URL begins with a protocol name, e.g., http, followed by a colon and two slashes, ://. For example:

  http://www.cs.cornell.edu/wya/index.html

However, within a web site it is usual to refer to pages relative to the current directory. Thus, if a page is stored in the directory www.cs.cornell.edu/wya/, the relative URL:

   index.html

refers to the file:

  www.cs.cornell.edu/wya/index.html

The notation ../ at the beginning of a relative URL refers to the parent directory. Thus, if a page is stored in the directory www.cs.cornell.edu/wya/, the relative URL:

   ../index.html

refers to the file:

   www.cs.cornell.edu/index.html
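
The two cases above can be checked with a short sketch. It assumes Python and its standard urllib.parse module, which the assignment does not require; the function name resolve is hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

def resolve(page_url, link):
    """Return the full URL that `link` refers to when it appears on `page_url`."""
    # urljoin applies the standard rules for relative references,
    # including the ../ notation for the parent directory.
    return urljoin(page_url, link)

# The page is stored in the directory www.cs.cornell.edu/wya/
base = "http://www.cs.cornell.edu/wya/"
print(resolve(base, "index.html"))     # http://www.cs.cornell.edu/wya/index.html
print(resolve(base, "../index.html"))  # http://www.cs.cornell.edu/index.html
```

A link that is already in full form is returned unchanged, so the same function can be applied to every link extracted from a page.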

Default files

Sometimes a URL specifies a directory, but does not specify a file within that directory, as in the following:

   http://www.cs.cornell.edu/wya/

In this situation, the URL refers to a default file within the specified directory. The most common defaults are files named index.html or index.htm, though the choice depends on how the web server is configured. Thus this example usually refers to:

   http://www.cs.cornell.edu/wya/index.html
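
One possible way to treat the two forms as the same page is to remove a trailing default file name before comparing URLs. The sketch below is a heuristic only: it assumes Python, and it assumes that index.html and index.htm are the defaults, which a server may configure differently. The names DEFAULT_FILES and strip_default_file are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these are the default file names; a server may use others.
DEFAULT_FILES = ("index.html", "index.htm")

def strip_default_file(url):
    """Rewrite .../dir/index.html (or index.htm) as .../dir/ for comparison."""
    parts = urlsplit(url)
    path = parts.path
    for name in DEFAULT_FILES:
        if path.endswith("/" + name):
            # Drop the file name but keep the trailing slash of the directory.
            path = path[: -len(name)]
            break
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

a = strip_default_file("http://www.cs.cornell.edu/wya/")
b = strip_default_file("http://www.cs.cornell.edu/wya/index.html")
print(a == b)  # True
```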

Anchors within a file

Usually URLs refer to the beginning of a page. However, it is possible to refer to an anchor within a page, by appending the # sign followed by the name of the anchor. Thus the following two URLs refer to the same page, though they reference different locations within the page:

   http://www.cs.cornell.edu/wya/papers.html
   http://www.cs.cornell.edu/wya/papers.html#year2000
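
To recognize that both URLs refer to the same page, the anchor can be separated from the rest of the URL. A minimal sketch, assuming Python's standard urllib.parse module; the function name page_and_anchor is hypothetical.

```python
from urllib.parse import urldefrag

def page_and_anchor(url):
    """Split a URL into the page it refers to and the anchor name (may be empty)."""
    page, anchor = urldefrag(url)
    return page, anchor

print(page_and_anchor("http://www.cs.cornell.edu/wya/papers.html"))
# ('http://www.cs.cornell.edu/wya/papers.html', '')
print(page_and_anchor("http://www.cs.cornell.edu/wya/papers.html#year2000"))
# ('http://www.cs.cornell.edu/wya/papers.html', 'year2000')
```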




William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 10, 2006