Date: Sat, 18 Sep 2004 22:26:55 -0400 (EDT)
Subject: Wed 15th Report
From: "Karthik Jeyabalan"
To: jmk227@cornell.edu, mg327@cornell.edu, kj44@cornell.edu, pcr4@cornell.edu, ncw4@cornell.edu, dmitriev@cs.cornell.edu, lagoze@cs.cornell.edu, wya@cs.cornell.edu, asr32@cornell.edu, cjl2@cs.cornell.edu
X-Priority: 3 (Normal)
Importance: Normal

Wed 15th Report: By Karthik

A new member has been added to the team: Nurwati Widodo, email: ncw4@cornell.edu.

A report should contain a title, the name(s) of the author(s), and page numbers.

The following are some of the things presented by the two groups; most of what was presented is in the reports given by the groups.

Pat presented Ari's and his work. The following are some of the things he said.

Each snapshot of the web is between 40 and 50 TB. Web data is archived in ARC files of 100 MB each. This file format makes the archiving process easier. It is also streamable and concatenable. Every ARC file has a corresponding DAT file that can be generated from the ARC file. The DAT file contains the metadata of the ARC file. It also allows faster access to the ARC file.

Some Potential Problems

Given a date, they cannot retrieve all the URLs that were active at that time.
Due to lack of natural features, it will take lots of processing time.

Mirror:
Egypt - Only 3 year old files are there.
Amsterdam - All the current data is archived there.

OAI: is it for the DAT files or the content files?

Shipping disks is an unlikely alternative according to the people in CA. So, is there money for Dispatches?

Possible ways of getting the data to Cornell:
-Mail disks
-Download from the Internet
-Do feature extraction in CA

Their bandwidth is 1.5 Gbps, of which they are using 900 Mbps. Use Internet2 to send data?

The second group's paper was presented by Jerrin; see the paper sent out by Jerrin.

Nurwati joins the Internet Archive team.
She will also be building our group's web page.

Jobs for next week:

Jerrin's group: Suppose there is some data; what programming environment will be used? Find out the actual throughput of the machines.
Pat's group: More clarifications.
Karthik: Prepare a set of questions with Pavel to ask CS faculty who will be interested in using the Internet Archive.
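
A rough sketch of the arithmetic behind the download option, using the figures from Pat's presentation (40 to 50 TB per snapshot, 100 MB ARC files, 1.5 Gbps link with 900 Mbps in use). It assumes decimal units, that the whole unused 600 Mbps could be given to the transfer and sustained end to end, and no protocol overhead; none of these assumptions were confirmed at the meeting, and the function names are only for illustration.

```python
def transfer_days(size_tb: float, rate_mbps: float) -> float:
    """Days needed to move size_tb terabytes at a sustained rate_mbps."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (rate_mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 86400               # seconds -> days


def arc_file_count(size_tb: float, arc_mb: float = 100.0) -> int:
    """Approximate number of ARC files in a snapshot of size_tb terabytes."""
    return round(size_tb * 1e6 / arc_mb)  # 1 TB = 1e6 MB (decimal)


# Assumption: all of the currently unused capacity is available to us.
spare_mbps = 1500 - 900

for size_tb in (40, 50):
    days = transfer_days(size_tb, spare_mbps)
    files = arc_file_count(size_tb)
    print(f"{size_tb} TB: about {files} ARC files, {days:.1f} days at {spare_mbps} Mbps")
```

Under these assumptions one snapshot takes roughly 6 to 8 days to download. The actual throughput of the machines, which is one of the jobs for next week, would replace the 600 Mbps figure.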