Speaker: Junghoo Cho
Affiliation: Stanford University
Time and Location: 4:15 PM, B11 Kimball Hall
Title: Crawling the Web: Discovery and Maintenance of Large-Scale Web Data
In this talk I will discuss the challenges and issues faced in implementing an effective Web crawler. A crawler is a program that retrieves Web pages, commonly for a Web search engine. Often, a crawler has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putting too much pressure on the visited Web sites and the crawler's local network, because they are intrinsically shared resources.
These requirements pose many interesting challenges in the design and implementation of a Web crawler. For example, how can we parallelize the crawling activity to achieve the maximum download rate with minimal overhead? How should the crawler revisit pages to maintain the highest "freshness" of pages? Which pages should the crawler download to improve the "quality" of the downloaded pages?
In the talk I will first go over these challenges and present some solutions that I have developed. In particular, I will describe results from an experiment in which I monitored more than half a million pages for four months. I will also present some theoretical results that show how to design and operate a Web crawler.