Date: Fri, 24 Sep 2004 13:49:34 -0400 (EDT)
Subject: Wed 9/22 Web Archive project meeting report
From: "Jerrin Kallukalam"
To: mg327@cornell.edu, kj44@cornell.edu, pcr4@cornell.edu, ncw4@cornell.edu, dmitriev@cs.cornell.edu, lagoze@cs.cornell.edu, wya@cs.cornell.edu, asr32@cornell.edu, cjl2@cs.cornell.edu, mk227@cornell.edu
X-Priority: 3 (Normal)
Importance: Normal

Minutes from the meeting on 9/22/2004
By Jerrin.

We had a talk with Prof. Dan Huttenlocher in the meeting. He talked about how we plan to use the data made available by the Internet Archive.

Excerpts:

We plan to study the temporal pattern of how data on the internet evolves and is visited over time. We should build web graphs of the internet and develop models to analyze them. At least three crawls are needed for this purpose.

We have some difficulty using URLs as identities while building the graph, since multiple URLs may point to the same place. We also need good hashes that can detect similarity between pages and tell whether the data on a page has changed (a rough sketch of one possible approach is attached at the end of these minutes).

We plan to build several web graphs and make them available to the researchers.

We need to work out how to build a graph from a crawl. We can do a sequential scan of the data and hash the URLs into a table. URLs take only a few bytes to store, so we should be able to keep the entire graph structure in RAM. We need to think of an efficient representation of the graph; an adjacency matrix, for example, would be incredibly sparse. (Sketches of the sequential scan and of a compact representation are attached at the end of these minutes.)

We are also interested in the page data itself (mainly ASCII). We should be able to analyze this data and recognize the sections that are important. Prof. Dan said a compressed ASCII snapshot of the web would take only a few hundred GB of space.

Questions:

- Is it the case that IA is making only one crawl available to us?
- How feasible is it to put some low-power machines at the Internet Archive's center to format and compress the data (keeping only the ASCII) and transfer it over the internet? Since some researchers may need the images/video, we have to ask the researchers for their preferences in this case.
- What is the fee structure for network transfer over Internet2? What bandwidth is available to the Theory Center?

Other proceedings:

Nurwati has made a website for the Web Search Project:
URL: http://www.people.cornell.edu/pages/ncw4/
All progress reports and other important information will be uploaded to this website.

Patt said he downloaded sample data from the Internet Archive, consisting of 8 ARC files and the corresponding DAT files (a sketch of reading an ARC file is attached at the end of these minutes).

Mayank, Nurwati, and Patt presented a report on the analysis of the costs of transferring data from the Internet Archive, and on feature extraction.

Jerrin presented a report on the hardware facilities available at CIT, mainly the tape drives.

Karthik and Pavel have made a questionnaire for interviewing the researchers about their preferences for using the web archives.

This week's plan
------------

Ari, Mayank
  Interview the faculty/researchers individually about what they would like to do with the internet archives.

Jerrin, Nurwati
  Look at the sample data from the Internet Archive.
  - How much information is there?
  - How clean is the data?
  - What is the format of the data, and how do we interpret it?

Karthik
  Find an efficient representation for creating and storing the web graph. Explore different possibilities and analyze performance for creating and accessing the graphs.

Patt
  Work with Prof. Arms to
  - Contact IA about how much data we have access to. Investigate whether they plan to give us only one snapshot.
  - Contact CIT and ask about the fee structure of Internet2.
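
Code sketches (for reference)
------------
These are rough sketches only, written for discussion; they have not been run against the real data, and any names, file layouts, and parameters in them are my assumptions unless noted.

First, the sequential scan Prof. Dan described: hash each URL into a table that assigns it a small integer id, and collect out-links as adjacency lists. The crawl is modeled here as an iterable of (page URL, list of out-link URLs) pairs; extracting those pairs from the raw crawl files is not shown. Python is used purely for readability.

    def build_graph(pages):
        url_id = {}     # hash table: URL -> small integer id
        adjacency = []  # adjacency[i] = ids of pages that page i links to

        def intern(url):
            # Assign each distinct URL a dense integer id on first sight.
            # (Canonicalizing URLs first would help with the
            # multiple-URLs-for-one-page problem noted above.)
            if url not in url_id:
                url_id[url] = len(adjacency)
                adjacency.append([])
            return url_id[url]

        for page_url, out_links in pages:
            src = intern(page_url)
            for link in out_links:
                adjacency[src].append(intern(link))
        return url_id, adjacency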
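Second, a compact representation. As noted above, an explicit adjacency matrix would be almost entirely zeros. One standard alternative (my suggestion, not something decided at the meeting) is a compressed sparse row layout: one flat array of link targets plus an offset array, costing O(n + m) integers for n pages and m links instead of the n^2 cells of a matrix.

    from array import array

    def to_csr(adjacency):
        # offsets[i]..offsets[i+1] delimit the out-links of page i
        # in the flat targets array.
        offsets = array('l', [0])
        targets = array('l')
        for links in adjacency:
            targets.extend(links)
            offsets.append(len(targets))
        return offsets, targets

    def links_of(offsets, targets, i):
        # All out-links of page i, as a slice of the flat array.
        return targets[offsets[i]:offsets[i + 1]]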
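Third, hashes for change detection and similarity. The meeting did not settle on a technique; as one possibility, an exact fingerprint of the normalized text tells whether a page changed between crawls, and word shingles (Broder-style) give a similarity estimate for near-duplicate pages.

    import hashlib

    def fingerprint(text):
        # Exact-change detection: equal fingerprints across two crawls
        # mean the page text is unchanged (up to whitespace).
        normalized = " ".join(text.split())
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def shingles(text, k=4):
        # Hash every window of k consecutive words.
        words = text.split()
        return {hash(tuple(words[i:i + k]))
                for i in range(len(words) - k + 1)}

    def similarity(a, b):
        # Jaccard overlap of the two shingle sets, in [0, 1].
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / max(1, len(sa | sb))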
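Finally, reading the sample ARC files. This assumes the version-1 ARC record layout (a one-line header "URL IP-address archive-date content-type length" followed by exactly that many bytes of content); the sample files may instead be gzip-compressed or use the longer version-2 header, so this needs to be checked against the DAT files and the IA documentation.

    def read_arc(path):
        # Yields (url, date, mime, content) for each record of an
        # uncompressed v1 ARC file. The first record is the ARC's own
        # "filedesc://" header and can be skipped by the caller.
        with open(path, "rb") as f:
            while True:
                line = f.readline()
                if not line:
                    break                 # end of file
                fields = line.split()
                if len(fields) < 5:
                    continue              # blank separator line
                url, ip, date, mime = (x.decode("ascii", "replace")
                                       for x in fields[:4])
                length = int(fields[4].decode("ascii"))
                content = f.read(length)
                yield url, date, mime, content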