Date: Wed, 8 Sep 2004 16:36:20 -0400 (EDT)
From: Ariel Shemaiah Rabkin
To: mg327@cornell.edu, kj44@cornell.edu, jmk227@cornell.edu, asr32@cornell.edu, pcr4@cornell.edu, dmitriev@cs.cornell.edu, wya@cs.cornell.edu, dph@cs.cornell.edu, cjl2@cornell.edu
Subject: minutes from research group meeting

Hi folks, here are the minutes of our group meeting earlier today.

0: Notes from the Notetaker
I: Who's involved
II: Procedures
III: Tasks
IV: This week's tasks

0: Notes from the Notetaker

I'm not entirely sure which of the people listed below should get weekly minutes, so I've sent them to everyone at our meeting today, plus Dan Huttenlocher. I'd appreciate guidance about which other faculty should get them. Also, if any of you know Jon Aizen's email address, I'd appreciate it if you could pass that along.

I: Who's involved

Students involved:
  Mayank Gandhi       mg327@cornell.edu
  Karthik Jeyabalan   kj44@cornell.edu
  Jerrin Kallukalam   jmk227@cornell.edu
  Ari Rabkin          asr32@cornell.edu
  Patrick Reilly      pcr4@cornell.edu

Faculty who may be involved:
  Bill Arms           wya@cs.cornell.edu
  Alan Demers         ademers@cs.cornell.edu
  Dan Huttenlocher    dph@cs.cornell.edu
  Carl Lagoze         cjl2@cornell.edu
  Jai                 jai@cs.cornell.edu
  Johannes E. Gehrke  johannes@cs.cornell.edu

PhD student and guru:
  Pavel Dmitriev      dmitriev@cs.cornell.edu

Our contact at the Internet Archive (www.archive.org):
  Jonathan Aizen
  I've been unable to find an email address for him; I'll do some more digging later.

II: Procedures

A) We're likely going to be working in smaller ad-hoc groups on a week-to-week basis.
B) At the weekly meetings, each member will comment on what was accomplished in the last week.
C) Goals will be set for the future.
D) Until further notice, we'll be meeting at 1:15 on Wednesdays, in the IS building small conference room.

III: Tasks

This research project (the "web research infrastructure" project, until someone finds a better name) is pursuing three broad avenues of inquiry.

A: What data does the Internet Archive supply, and how can we get it here?
  1) How is the Internet Archive organized?
  2) What does it contain?
  3) How large is it?
  4) Can they break out subsets for us?
  5) How do they do mirroring, and can we pull data as though we were a mirror?

B: How do we handle the data here?
  1) What hardware resources will we have, and when will we have them?
  2) How quickly can we absorb data from them?
  3) Do we do any preprocessing?
  4) What sort of feature extraction?
  5) Can we break out subsets?

C: How do we use the data?
  1) What sort of interface do we export to researchers?
  2) What do researchers need?

For this week, we're concentrating on the first two of these.

IV: This week's tasks

Ari and Pat are getting in touch with the Internet Archive and finding out what they have, and what they can do for us. In particular, we want a sense of how large their datasets are, how large the monthly updates are, how they update their mirrors, how they move large volumes of data, and what format their data is in. We're hoping to arrange a conference call later this week, ideally late Thursday afternoon. We'll also be in touch with the IA project by email.

Mayank, Karthik, and Jerrin are getting in touch with the Theory Center here to find out what resources they have and what they're acquiring, and are doing some rough calculations on throughput. We're hoping for 4-10 pages from you folks about what they've got, what they're getting, and what sort of capacity we're going to need. It doesn't have to be pretty; a lot of it will just be the performance metrics of the TC systems, what the programming environment is like, how much fits in RAM, and how much on disk. [There may also be other topics you guys want to look at.]

Also, and this applies to everyone, it's worth looking at the Internet Archive [archive.org] and at the NSF grant proposal that's going to be mailed to you all.

If I've omitted anything, feel free to point it out.
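
For the rough throughput and capacity calculations mentioned under IV, the arithmetic could be sketched along these lines in Python. This is only an illustration: the function names are made up, and every input (dataset size, link speed, node count, RAM per node) is a placeholder to be filled in once we hear back from the Internet Archive and the Theory Center. None of the figures comes from either group.

```python
SECONDS_PER_DAY = 86400.0


def transfer_days(dataset_terabytes, link_megabits_per_sec, efficiency=1.0):
    """Days needed to move a dataset over a link at a sustained rate.

    efficiency is the fraction of the nominal link speed we assume we can
    actually sustain (a guess, between 0 and 1).
    """
    if dataset_terabytes < 0:
        raise ValueError("dataset size must be non-negative")
    if link_megabits_per_sec <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    dataset_bits = dataset_terabytes * 1e12 * 8  # decimal terabytes to bits
    bits_per_sec = link_megabits_per_sec * 1e6 * efficiency
    return dataset_bits / bits_per_sec / SECONDS_PER_DAY


def fraction_in_ram(dataset_terabytes, nodes, ram_gigabytes_per_node):
    """Fraction of the dataset that fits in aggregate RAM, capped at 1.0."""
    if dataset_terabytes <= 0:
        raise ValueError("dataset size must be positive")
    total_ram_terabytes = nodes * ram_gigabytes_per_node / 1000.0
    return min(1.0, total_ram_terabytes / dataset_terabytes)
```

As a sanity check with arbitrary inputs, transfer_days(1, 8) (one terabyte over a fully used 8 Mbit/s link) comes out to about 11.6 days, and fraction_in_ram(10, 100, 50) is 0.5.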