About the Internet Archive and its collection

Each "snapshot" of the web is between 40 and 50 TB, and snapshots are taken every two months. The full collection is currently around 300-400 TB. Web pages are aggregated into ARC files, each of which is associated with a DAT file and various indexes. Each ARC file is approximately 100 MB; the DAT files and indexes are a small fraction of that. The assignment of web pages to ARC files is essentially random, as is the assignment of ARC files to individual machines.

The Internet Archive does not do much mirroring of their web content. The Alexandria site was shipped a complete datacenter and has not been updated since. Amsterdam was also shipped complete machines, and mirrors some things, but not everything. What mirroring they do is done via OAI harvests and rsync.

Jon's opinion is that shipping disks should be a "last resort"; he expects they would have sufficient bandwidth on their end (a 1.5 Gbps link, of which they typically use over 60%). Most of the ARC data sits on idle machines. On the other hand, they don't have the staffing or the inclination to copy the data to hard disks and ship them to us. Jon thinks we'll be able to get permission to pull most of their data via rsync; they have the bandwidth.

A quick calculation (sketched at the end of these notes) suggests an rsync transfer would be comparable in both price and time with moving hundreds of hard disks. The bandwidth tariff is currently $2 per GB, although we may be able to get a bulk price. Disks are currently on the order of $1 per GB, but shipping would likely increase that substantially, and my hunch is that, if they're willing to copy to disks for us at all, the Internet Archive might expect compensation for their time.

The largest obstacle is likely to be legal: they don't have permission to redistribute all of their web content, and the non-redistributable content is mixed in with the rest.

The details of how we'd do the rsync haven't been worked out yet; they're pending the Archive's internal discussions, and we'll email them within a few days to ask what they decided. It also depends on whether we're doing this once only or on a regular basis. But probably they can grant us access and we pull the whole contents of their system, machine by machine (a rough sketch appears below).

Open Archives Initiative website, with info on harvesting metadata, etc.: http://www.openarchives.org/OAI/openarchivesprotocol.html
Format of the ARC files: http://www.archive.org/web/researcher/ArcFileFormat.php
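
To make the price/time comparison concrete, here is a back-of-the-envelope sketch in Python. The figures are the ones quoted above; the 350 TB total is the midpoint of the stated 300-400 TB, and the 40% spare capacity on the link is an assumption inferred from "typically use over 60%".

    # Back-of-the-envelope comparison of an rsync transfer vs. shipping disks.
    TOTAL_GB = 350_000        # midpoint of the 300-400 TB collection
    LINK_GBPS = 1.5           # Internet Archive's uplink
    HEADROOM = 0.40           # assumed spare capacity on that link

    spare_gbps = LINK_GBPS * HEADROOM          # ~0.6 Gbps available to us
    gb_per_day = spare_gbps / 8 * 86_400       # bits -> bytes, per day
    days = TOTAL_GB / gb_per_day               # ~54 days of continuous pull

    network_cost = TOTAL_GB * 2.0   # $2/GB bandwidth tariff
    disk_cost = TOTAL_GB * 1.0      # $1/GB raw disks, before shipping/labor

    print(f"Transfer time at {spare_gbps:.1f} Gbps: {days:.0f} days")
    print(f"Network: ${network_cost:,.0f}  vs  disks: ${disk_cost:,.0f} + shipping/labor")

At roughly 54 days and $700,000 versus $350,000 of raw disks plus shipping, hundreds of physical copies, and their staff time, the two options do come out in the same ballpark, as claimed above.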
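
A minimal sketch of the machine-by-machine pull, assuming they grant us rsync access per storage node. The hostnames, module name, and destination path are hypothetical placeholders; the real arrangement is pending their internal discussions.

    import subprocess

    # Hypothetical list of Archive storage nodes we'd be granted access to.
    MACHINES = ["node001.archive.example.org",
                "node002.archive.example.org",
                "node003.archive.example.org"]
    DEST = "/data/ia-mirror"   # hypothetical local destination

    for host in MACHINES:
        # --archive preserves timestamps, so a re-run only fetches new ARC
        # files; --partial lets an interrupted 100 MB transfer resume.
        subprocess.run(
            ["rsync", "--archive", "--partial",
             f"rsync://{host}/arcs/", f"{DEST}/{host}/"],
            check=True)

Note that the once-only versus regular question matters little to the command itself: rsync's incremental behavior makes repeated runs cheap, since existing ARC files are immutable and only new ones would be fetched.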
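
For reference, the OAI harvesting they use for mirroring boils down to simple HTTP requests against an OAI-PMH endpoint (see the openarchives.org link above). A minimal example; the endpoint URL here is hypothetical, while the verb and metadataPrefix parameters are standard protocol arguments.

    from urllib.request import urlopen
    from urllib.parse import urlencode

    BASE = "http://archive.example.org/oai"   # hypothetical endpoint

    # ListRecords returns metadata records (here in Dublin Core). Large
    # result sets are paged: each page carries a resumptionToken, which
    # is resent along with the verb to fetch the next page.
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    with urlopen(BASE + "?" + urlencode(params)) as resp:
        print(resp.read()[:500])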
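
And a rough sketch of walking the records in an uncompressed v1 ARC file, going by the format document linked above: each record is a one-line header (URL, IP address, archive date, content type, length) followed by that many bytes of content. Treat the field handling as an assumption rather than a validated parser; the first record in each file is a "filedesc://" version block in the same form, which a caller would normally skip.

    def arc_records(path):
        """Yield (url, date, content_type, body) for each ARC record."""
        with open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:
                    break                    # end of file
                if not header.strip():
                    continue                 # blank separator between records
                url, ip, date, ctype, length = header.decode("latin-1").split()
                yield url, date, ctype, f.read(int(length))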