About the Internet Archive and its collection

Each "snapshot" of the web is between 40 and 50 TB, and snapshots are taken every two months. The full collection is currently around 300-400 TB. Web pages are aggregated into ARC files, each of which is associated with a DAT file and various indexes. Each ARC file is approximately 100 MB; the DAT files and indexes are a small fraction of that. The assignment of web pages to ARC files is essentially random, as is the assignment of ARC files to individual machines.

The Internet Archive does not do much mirroring of their web content. The Alexandria site was shipped a complete datacenter and has not been updated since. Amsterdam was also shipped complete machines, and mirrors some things, but not everything. What mirroring they do is done via OAI harvests and rsync.

Jon's opinion is that shipping disks should be a "last resort"; he expects they would have sufficient bandwidth on their end (a 1.5 Gbps link, of which they typically use over 60%). Most of the ARC data sits on idle machines. On the other hand, they don't have the staffing or the inclination to copy the data to hard disks and ship them to us. Jon thinks we'll be able to get permission to pull most of their data via rsync; they have the bandwidth.

A quick calculation (sketched at the end of these notes) suggests an rsync transfer would be comparable in both price and time with moving hundreds of hard disks. The bandwidth tariff is currently $2 per GB, although we may be able to get a bulk price. Disks are currently on the order of $1 per GB, but shipping would likely increase that substantially, and my hunch is that, if they're willing to copy to disks for us at all, the Internet Archive might expect compensation for their time.

The largest obstacle is likely to be legal: they don't have permission to redistribute all of their web content, and the non-redistributable content is mixed in with the rest.

The details of how we'd do the rsync haven't been worked out yet; they're pending the Archive's internal discussions, and we'll email them within a few days to ask what they decided. It also depends on whether we're doing this once only or on a regular basis. But probably they can grant us access and we pull the whole contents of their system, machine by machine (a rough sketch appears below).

Open Archives Initiative website, with info on harvesting metadata, etc.: http://www.openarchives.org/OAI/openarchivesprotocol.html
Format of the ARC files: http://www.archive.org/web/researcher/ArcFileFormat.php
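
To make the price/time comparison concrete, here is a back-of-the-envelope sketch in Python. The figures are the ones quoted above; the 350 TB total is the midpoint of the stated 300-400 TB, and the 40% spare capacity on the link is an assumption inferred from "typically use over 60%".

    # Back-of-the-envelope comparison of an rsync transfer vs. shipping disks.
    TOTAL_GB = 350_000        # midpoint of the 300-400 TB collection
    LINK_GBPS = 1.5           # Internet Archive's uplink
    HEADROOM = 0.40           # assumed spare capacity on that link

    spare_gbps = LINK_GBPS * HEADROOM          # ~0.6 Gbps available to us
    gb_per_day = spare_gbps / 8 * 86_400       # bits -> bytes, per day
    days = TOTAL_GB / gb_per_day               # ~54 days of continuous pull

    network_cost = TOTAL_GB * 2.0   # $2/GB bandwidth tariff
    disk_cost = TOTAL_GB * 1.0      # $1/GB raw disks, before shipping/labor

    print(f"Transfer time at {spare_gbps:.1f} Gbps: {days:.0f} days")
    print(f"Network: ${network_cost:,.0f}  vs  disks: ${disk_cost:,.0f} + shipping/labor")

At roughly 54 days and $700,000 versus $350,000 of raw disks plus shipping, hundreds of physical copies, and their staff time, the two options do come out in the same ballpark, as claimed above.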
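
A minimal sketch of the machine-by-machine pull, assuming they grant us rsync access per storage node. The hostnames, module name, and destination path are hypothetical placeholders; the real arrangement is pending their internal discussions.

    import subprocess

    # Hypothetical list of Archive storage nodes we'd be granted access to.
    MACHINES = ["node001.archive.example.org",
                "node002.archive.example.org",
                "node003.archive.example.org"]
    DEST = "/data/ia-mirror"   # hypothetical local destination

    for host in MACHINES:
        # --archive preserves timestamps, so a re-run only fetches new ARC
        # files; --partial lets an interrupted 100 MB transfer resume.
        subprocess.run(
            ["rsync", "--archive", "--partial",
             f"rsync://{host}/arcs/", f"{DEST}/{host}/"],
            check=True)

Note that the once-only versus regular question matters little to the command itself: rsync's incremental behavior makes repeated runs cheap, since existing ARC files are immutable and only new ones would be fetched.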
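
For reference, the OAI harvesting they use for mirroring boils down to simple HTTP requests against an OAI-PMH endpoint (see the openarchives.org link above). A minimal example; the endpoint URL here is hypothetical, while the verb and metadataPrefix parameters are standard protocol arguments.

    from urllib.request import urlopen
    from urllib.parse import urlencode

    BASE = "http://archive.example.org/oai"   # hypothetical endpoint

    # ListRecords returns metadata records (here in Dublin Core). Large
    # result sets are paged: each page carries a resumptionToken, which
    # is resent along with the verb to fetch the next page.
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    with urlopen(BASE + "?" + urlencode(params)) as resp:
        print(resp.read()[:500])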
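
And a rough sketch of walking the records in an uncompressed v1 ARC file, going by the format document linked above: each record is a one-line header (URL, IP address, archive date, content type, length) followed by that many bytes of content. Treat the field handling as an assumption rather than a validated parser; the first record in each file is a "filedesc://" version block in the same form, which a caller would normally skip.

    def arc_records(path):
        """Yield (url, date, content_type, body) for each ARC record."""
        with open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:
                    break                    # end of file
                if not header.strip():
                    continue                 # blank separator between records
                url, ip, date, ctype, length = header.decode("latin-1").split()
                yield url, date, ctype, f.read(int(length))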