The Internet Archive Research Group

Background

The Internet Archive is building a digital library of Internet sites and other cultural artifacts in digital form. The collections currently include books, movies, concerts, and web sites. From the beginning, the Internet Archive has called upon researchers to mine the collections, both to study historical content for scholarly purposes and to find ways to improve the accessibility of information in the Archive. Various scholars have been involved in mining the web collection, but until this Fall little work had been done on highlighting interesting content to the website's users.

This past summer Jon Aizen began working with the Internet Archive to transform their previously static website into a true digital library, complete with a catalog, search, reviews, discussion capabilities, and dynamically updated statistics for items. The Archive was expanded to include books, some software, and concerts, and the movies collection doubled in size.

Jon's research with Jon Kleinberg and Dan Huttenlocher began in the Fall of 2002. The two professors had visited the Internet Archive in San Francisco and were familiar with the website and the organization's ideals.

User Download Behavior and Engagement Research

Our research began with a focus on the behavior of users when visiting details pages. It quickly became apparent that users exhibit distinct patterns of behavior on these pages. We first examined the amount of time users spend at a given item's details page, with particular regard to whether or not they chose to download the item. We started by trying to plot this information visually and ended up with graphs like this one:

[Graph: time spent at details pages vs. download decisions]

Makes a lot of sense, right? Well, we knew what the graphs were trying to say, but had to constantly remind ourselves. "So what does it mean if a details page is in this corner?" was our question, over and over again. Here is an annotated version of the graph (each red dot is a details page):

[Annotated version of the graph; each red dot is a details page]

Unfortunately, we never made this annotation automatic, so we constantly had to remind ourselves what the plot meant. We soon found a better solution, which took the form of text-based visualizations:

               details--TO--movie.downloads (18%) ---------|-----------------------------------------
-------------------X----------------------------------------------------------------------------X----
                   19                                                                           96    

The above example provided us with a visual representation of the following information: 18% of users who visited a details page continued on to download a movie. Of those 18% who actually proceeded to download the movie, 19% spent less than or equal to 5 seconds on the details page, 77% spent between 6 and 180 seconds, and 4% spent more than 180 seconds (which we took to be an indication of lack of interest). We fiddled with the 5- and 180-second values, terming the middle range the engagement window. Analyzing the statistics across all the pages, we found that users were engaged when they spent between 8 and 45 seconds on a given page. This period of time is long enough not to be a "click through" to another page, yet short enough to suggest that the user didn't get up for a cup of coffee. Here are some additional examples (all from September 28th, using 5 and 180 seconds as the boundaries; a sketch of the bucketing appears after the examples):

               details-audio--TO--downloads (30%) ---------------|-----------------------------------
---------------------------------X----------------------------------------------------------------X--
                                33                                                               98


               details-texts--TO--downloads (23%) ------------|--------------------------------------
----------------------------X----------------------------------------------------------------------X-
                            28                                                                     99
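
For concreteness, here is a minimal sketch (in Python; not the code that actually ran on the site) of the bucketing behind these text visualizations. It assumes a hypothetical list of per-visit records, each carrying the visit duration in seconds and a flag for whether the visit led to a download:

    def bucket_download_durations(visits, low=5, high=180):
        """Summarize details-page visits for one item.

        `visits` is a list of (duration_seconds, downloaded) pairs.
        Returns the download rate plus, among downloading visits, the
        fractions that spent <= `low` seconds, between `low` and `high`
        seconds, and more than `high` seconds on the details page.
        """
        if not visits:
            return 0.0, 0.0, 0.0, 0.0
        download_durations = [d for d, downloaded in visits if downloaded]
        download_rate = len(download_durations) / len(visits)   # e.g. 18%
        if not download_durations:
            return download_rate, 0.0, 0.0, 0.0
        n = len(download_durations)
        fast = sum(1 for d in download_durations if d <= low)    # <= 5 seconds
        slow = sum(1 for d in download_durations if d > high)    # > 180 seconds
        engaged = n - fast - slow                                # in between
        return download_rate, fast / n, engaged / n, slow / n

The 18%, 19%, 77%, and 4% figures in the movie example above correspond to the four returned values.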

Batting Averages

These graphs proved pretty interesting to stare at, but presented little information that would be of relevance to the users of the website. We were looking for something that could tell users how interesting a given item is. This led us to the notion of (what we would soon call) engaged batting averages. Every item could be given an EBA computed as the number of visits to the item's details page whose duration fell in some range (15 to 75 seconds was one range used) divided by the total number of visits to the item's details page.
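
As a sketch (again assuming hypothetical per-visit duration records), the engaged batting average is simply the fraction of visits that fall inside the engagement window:

    def engaged_batting_average(durations, window=(15, 75)):
        """EBA: fraction of visits to an item's details page whose
        duration in seconds falls inside the engagement window."""
        low, high = window
        if not durations:
            return 0.0
        engaged = sum(1 for d in durations if low <= d <= high)
        return engaged / len(durations)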

As it turned out, the engaged batting averages were less informative than a simpler notion which we called download batting averages (or just batting averages). Analogous to batting averages in baseball (the number of hits divided by the times at bat of a particular player), download batting averages are computed by dividing the number of downloads (coming from the item's details page) by the number of visits to the item's details page. We looked at lists such as this one from November 6th, a sample of which is provided here:

Item         URL                                    Visits  Downloads  Batting Average
audio 734    /audio/etree-details-db.php?id=734     95      55         0.57
audio 928    /audio/etree-details-db.php?id=928     11      6          0.54
texts 53332  /texts/texts-details-db.php?id=53332   11      6          0.54
texts 58554  /texts/texts-details-db.php?id=58554   11      6          0.54
texts 58648  /texts/texts-details-db.php?id=58648   11      6          0.54
texts 58609  /texts/texts-details-db.php?id=58609   15      8          0.53
texts 53372  /texts/texts-details-db.php?id=53372   13      7          0.53
texts 58579  /texts/texts-details-db.php?id=58579   13      7          0.53
audio 888    /audio/etree-details-db.php?id=888     34      18         0.52
audio 626    /audio/etree-details-db.php?id=626     156     81         0.51
audio 860    /audio/etree-details-db.php?id=860     29      15         0.51
audio 350    /audio/etree-details-db.php?id=350     181     92         0.50
audio 885    /audio/etree-details-db.php?id=885     22      11         0.50
texts 59525  /texts/texts-details-db.php?id=59525   20      10         0.50
movies 2636  /movies/details-db.php?id=2636         16      8          0.50
texts 54939  /texts/texts-details-db.php?id=54939   16      8          0.50
texts 55805  /texts/texts-details-db.php?id=55805   16      8          0.50
texts 59158  /texts/texts-details-db.php?id=59158   16      8          0.50
texts 56583  /texts/texts-details-db.php?id=56583   14      7          0.50
texts 53975  /texts/texts-details-db.php?id=53975   12      6          0.50
texts 54589  /texts/texts-details-db.php?id=54589   12      6          0.50
movies 2571  /movies/details-db.php?id=2571         10      5          0.50
texts 10611  /texts/texts-details-db.php?id=10611   10      5          0.50
texts 53140  /texts/texts-details-db.php?id=53140   10      5          0.50
texts 53604  /texts/texts-details-db.php?id=53604   10      5          0.50

We soon realized that these figures would be of interest to users. Specifically, we thought that ranked content boxes of items sorted by batting averages would be more useful than the previous (highly visible) standard of ranking items by number of downloads. Ranking items by number of downloads makes "the rich get richer" by ensuring that the most downloaded items remain the most downloaded items (because they are the most visible). Because batting averages are affected by visits as well as downloads, they change much more frequently (in fact, one click to an item's details page can remove an item from the top five!). Furthermore, batting averages can be very high for items that have been viewed by very few users (so long as the majority of those few users actually downloaded the item). This feature of batting averages was extremely exciting to us because it provided a way to highlight information that might otherwise remain buried in the Archive.

In late November we incorporated ranked batting average lists into the live website. However, before doing so we felt we needed to correct for items with relatively few downloads. It just didn't sit right with us to rank an item with 5 downloads out of 10 visits (and hence a batting average of 50%) higher than an item with 490 downloads out of 1,000 visits (batting average of 49%). Yet, at the same time, we still wanted to highlight items that hadn't been viewed much. To do this we introduced an adjustment to the formula, making the computation of batting averages equal to [(# of downloads) / (# of visits)] - (1.1 / sqrt(# of visits)). We initially captured the resulting "normalized" batting averages in this list, a sample of which follows (a sketch of the computation appears after the table):

Item         URL                                    Visits  Downloads  Batting Average  Normalized
audio 734    /audio/etree-details-db.php?id=734     95      55         0.57             0.476349533212537
audio 200    /audio/etree-details-db.php?id=200     1089    535        0.49             0.460973370064279
audio 626    /audio/etree-details-db.php?id=626     156     81         0.51             0.439166692328226
audio 350    /audio/etree-details-db.php?id=350     181     92         0.50             0.433957878192963
audio 270    /audio/etree-details-db.php?id=270     1112    508        0.45             0.426846525178897
audio 612    /audio/etree-details-db.php?id=612     591     268        0.45             0.412334193634033
audio 366    /audio/etree-details-db.php?id=366     1442    629        0.43             0.409865689950003
movies 372   /movies/details-db.php?id=372          26912   10885      0.40             0.398370660923184
audio 517    /audio/etree-details-db.php?id=517     164     78         0.47             0.397522875153258
audio 315    /audio/etree-details-db.php?id=315     190     89         0.46             0.395873427620578
audio 497    /audio/etree-details-db.php?id=497     178     83         0.46             0.391338965931875
audio 470    /audio/etree-details-db.php?id=470     148     70         0.47             0.390773479320294
audio 547    /audio/etree-details-db.php?id=547     235     107        0.45             0.390086341630826
audio 793    /audio/etree-details-db.php?id=793     315     140        0.44             0.388100827462543
audio 212    /audio/etree-details-db.php?id=212     2459    993        0.40             0.383656646981346
audio 336    /audio/etree-details-db.php?id=336     6672    2600       0.38             0.37744569672862
audio 421    /audio/etree-details-db.php?id=421     466     197        0.42             0.376422675654672
audio 459    /audio/etree-details-db.php?id=459     2312    909        0.39             0.372368831695205
audio 684    /audio/etree-details-db.php?id=684     504     209        0.41             0.370138999363802
movies 5     /movies/details-db.php?id=5            7422    2821       0.38             0.368478707872035
audio 216    /audio/etree-details-db.php?id=216     549     225        0.40             0.367157105596139
audio 456    /audio/etree-details-db.php?id=456     137     62         0.45             0.367118978753871
movies 129   /movies/details-db.php?id=129          13088   4879       0.37             0.364043184812913
audio 548    /audio/etree-details-db.php?id=548     440     181        0.41             0.363690506901357
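
For reference, here is a minimal sketch of the plain and normalized batting average computations described above. The penalty constant is left as a parameter: the prose above cites 1.1/sqrt(visits), while the sampled list appears consistent with 1.0/sqrt(visits), so the exact constant in use at any given time is uncertain.

    import math

    def batting_average(visits, downloads):
        """Download batting average: downloads initiated from the item's
        details page divided by visits to that details page."""
        return downloads / visits if visits else 0.0

    def normalized_batting_average(visits, downloads, penalty=1.1):
        """Batting average with a small-sample penalty of
        penalty / sqrt(visits), as used for the ranked lists."""
        if visits == 0:
            return 0.0
        return batting_average(visits, downloads) - penalty / math.sqrt(visits)

    # Example from the table above: audio item 734 has 95 visits and 55
    # downloads; with penalty=1.0 the normalized value comes out to about
    # 0.476, matching the sampled list.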

Progressive Batting Averages

As we developed the notion of batting averages and examined the figures for different pages, both before and after the incorporation into the live website, we became interested in the way batting averages change over time, and in the forces that might influence these highly volatile figures. We started by examining progressive batting averages: a snapshot of an item's batting average after each visit in the item's history. The resulting graphs looked like this:

[Graph: progressive batting average for a sample item]

Looking at progressive batting averages proved interesting and enhanced our understanding of batting averages. In particular, these graphs led to an experiment in which we determined the average number of visits before an item's batting average stabilized.
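
A minimal sketch of the progressive batting average computation, assuming a chronologically ordered list of per-visit download flags for an item (an illustration, not our actual scripts):

    def progressive_batting_averages(downloaded_flags):
        """Running batting average after each visit. `downloaded_flags`
        is a chronologically ordered list of booleans, True when the
        visit led to a download from the details page."""
        averages = []
        downloads = 0
        for visit_number, downloaded in enumerate(downloaded_flags, start=1):
            downloads += downloaded
            averages.append(downloads / visit_number)
        return averages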

Sliding Windows

Moving on, we wanted to look at batting averages over sliding windows. The notion was fairly simple: for visits i through i+x, plot the batting average over just that window. We graphed these windows for all values of i between 1 and the number of visits to the page. The motivation behind this concept was to visually isolate periods when the batting average was changing dramatically. We got pretty interesting results:

[Graph: sliding-window batting averages for a sample item]
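
A sketch of the visit-based sliding window just described, using the same hypothetical per-visit download flags; x is the window size in visits:

    def sliding_window_batting_averages(downloaded_flags, x):
        """Batting average over each window of x consecutive visits;
        window i covers visits i through i + x - 1."""
        n = len(downloaded_flags)
        if n < x:
            return []
        in_window = sum(downloaded_flags[:x])
        averages = [in_window / x]
        for i in range(x, n):
            # Slide the window forward by one visit.
            in_window += downloaded_flags[i] - downloaded_flags[i - x]
            averages.append(in_window / x)
        return averages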

We were highly optimistic about the rises and the drops until we realized that random data looks strikingly similar. Nevertheless, we decided to plot the appearance of reviews in these graphs to see if such an event might explain a major change in an item's batting average. We felt that the appearance of a review (which pushes the item into the spotlight by placing a temporary link on the main page and on the collection landing page in the "recently updated reviews" content box) was likely to greatly reduce an item's batting average by driving a lot of traffic to the item (and therefore increasing the denominator in the batting average equation). The resulting graphs looked like the following example (green lines represent reviews, and the height of a green line denotes the number of stars associated with the review):

[Graph: sliding-window batting averages with reviews marked as green lines; line height indicates the review's star rating]

Since, as mentioned earlier, the rises and drops were also present in the random data we plotted, we decided to devise a method of measuring the impact of each review. We came up with a number for each review, which we termed its review sigma. The number essentially represented how different the batting average for the item was during some period after the appearance of the review, calculated as the number of standard deviations away from the overall mean batting average. The resulting graphs were interesting but not much more informative:

[Graph: sliding-window batting averages annotated with review sigmas]
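
A rough sketch of one way a review sigma could be computed. The exact period we averaged over after each review isn't recorded here, so the lookahead length is a made-up parameter:

    import statistics

    def review_sigma(window_averages, review_index, lookahead=50):
        """How unusual the batting average is after a review, in standard
        deviations from the item's overall mean. `window_averages` is the
        item's list of sliding-window batting averages and `review_index`
        is the window at which the review appeared."""
        if len(window_averages) < 2:
            return 0.0
        mean = statistics.mean(window_averages)
        stdev = statistics.stdev(window_averages)
        after = window_averages[review_index:review_index + lookahead]
        if not after or stdev == 0:
            return 0.0
        return (statistics.mean(after) - mean) / stdev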

Finally, we wanted to make one more change to the graphs. Up to this point the X values were spaced equally for each visit, regardless of the time between visits (so essentially the X-axis represented visit number, not time of visit, although we did label the axis with visit times to be a little more informative). As mentioned earlier, the sliding windows were based on numbers of visits, not durations of time. So one window (with z visits) might have spanned three weeks while another window for the same details page (also with z visits) might have spanned three days, depending on traffic to the page. Using an algorithm devised by Jon K., we changed from sliding windows based on visits to sliding windows based on fixed spans of time. We proceeded to plot graphs with a sliding window of one week. The results were a lot easier to conceptualize. Compare the following graph to the one directly above this paragraph to see the difference.

[Graph: batting averages over a fixed one-week sliding window]
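
For comparison, a straightforward sketch of the fixed-time variant (not necessarily the algorithm Jon K. devised), assuming each visit record carries a Unix timestamp and a download flag:

    def time_window_batting_averages(visits, window_seconds=7 * 24 * 3600):
        """Batting average over a sliding window of fixed duration (one
        week by default). `visits` is a chronologically sorted list of
        (timestamp, downloaded) pairs; one value is produced per visit,
        for the window ending at that visit."""
        averages = []
        start = 0
        downloads = 0
        for end, (t, downloaded) in enumerate(visits):
            downloads += downloaded
            # Advance the left edge past visits that fall out of the window.
            while visits[start][0] < t - window_seconds:
                downloads -= visits[start][1]
                start += 1
            averages.append(downloads / (end - start + 1))
        return averages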

At this point we were ready to start doing some experiments to really determine the impact of reviews on the batting average of an item. All our work thus far had been aimed at getting a sense of how we might go about measuring impact, and the above graphs were certainly a catalyst for this thought process. From the beginning we had wanted to find a way to use Jon K.'s recent work on burst analysis, and it seemed like we had found the perfect application...

Burst Analysis

This line of work was inspired by Jon Kleinberg's work on burst analysis (Bursty and Hierarchical Structure in Streams, Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2002). From Jon's page: "the high-level idea is to analyze a stream of documents and find features whose behavior is `bursty': they occur with high intensity over a limited period of time." Applied to user behavior on the Internet Archive website, bursty activity corresponds to periods of unusually many or unusually few visits to a given details page. The idea is that if we can identify bursts of activity for a given details page, then the events that caused those bursts are very likely to be interesting. We've started comparing burst analysis with the appearance of reviews. This research is still in its first steps, but an example graph looks like this:

[Graph: hidden state at each visit to an item's details page]

The above graph depicts the hidden state (which corresponds to batting average) at each visit to the details page of an item. Issues with the above graph, such as the massive short-period drops in state, are currently being dealt with. In particular, Dan's recent work on improving the performance of the dynamic programming algorithm that generates this state transition information has been crucial in making this work possible with a large number of states (we are currently using 80 states, a number only feasible with a linear-time algorithm).
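
As an illustration only (neither our actual model nor Dan's linear-time optimization is reproduced here), the following is a minimal Viterbi-style dynamic program over discretized batting-average states, with a download/no-download observation per visit and a per-state transition penalty. The 80-state grid matches the figure above, but the penalty value is a made-up parameter:

    import math

    def estimate_hidden_states(downloaded_flags, num_states=80, gamma=2.0):
        """Cheapest sequence of hidden 'batting average' states, one per
        visit. State k corresponds to a download probability p_k; emitting
        a download costs -log(p_k), a non-download -log(1 - p_k), and
        moving between states costs gamma per state of change. This is the
        plain O(n * k^2) dynamic program, not the linear-time version
        mentioned above."""
        if not downloaded_flags:
            return []
        # Download probabilities strictly inside (0, 1) to avoid log(0).
        probs = [(k + 1) / (num_states + 1) for k in range(num_states)]
        cost = [0.0] * num_states
        back = []
        for downloaded in downloaded_flags:
            emit = [-math.log(p if downloaded else 1 - p) for p in probs]
            new_cost = [float("inf")] * num_states
            prev = [0] * num_states
            for j in range(num_states):
                for i in range(num_states):
                    c = cost[i] + gamma * abs(j - i) + emit[j]
                    if c < new_cost[j]:
                        new_cost[j], prev[j] = c, i
            cost = new_cost
            back.append(prev)
        # Trace the cheapest path back through the visits.
        state = min(range(num_states), key=lambda j: cost[j])
        path = [state]
        for prev in reversed(back[1:]):
            state = prev[state]
            path.append(state)
        path.reverse()
        return [probs[s] for s in path]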

Our goal is to eventually use the burst analysis information to come up with numbers representing the long- and short-term impact of each review of an item. This could potentially allow us to automatically "rate" a review on the website, telling users whether or not the review influenced user download behavior. This is a particularly novel concept and one that excites us greatly.

Batting Averages Comparison

At the end of the semester we implemented a feature on the website that allows us to track where users come to details pages from (more accurately than simply using referrers: this feature lets us know which part of a referring page the user clicked on to get to the details page). Appropriately, we've been calling this specific piece of information an item's "from" variable. Out of interest we decided to see which "froms" are more likely to result in a download at the details page, and the resulting data and graphs were very interesting. We began by examining the batting average for each item for each "from." Some sample results from early February can be accessed here. We then made graphs for the sake of visual comparison.

[Graph: batting average comparison across "from" values for a sample item]

For each value of "from", f, we made a graph like the one above. The green bars represent "froms" y for which arriving at a details page from y resulted in a lower batting average than arriving from f; the opposite holds for the remaining bars. A full list of graphs is available here.
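
A minimal sketch of the per-"from" tallies behind these comparisons, assuming each logged details-page visit carries the item id, the "from" value, and a download flag (the field layout is hypothetical):

    from collections import defaultdict

    def batting_averages_by_from(visit_log):
        """Batting average for each (item, "from") pair. `visit_log` is an
        iterable of (item_id, from_value, downloaded) tuples."""
        counts = defaultdict(lambda: [0, 0])  # (item, from) -> [visits, downloads]
        for item_id, from_value, downloaded in visit_log:
            tally = counts[(item_id, from_value)]
            tally[0] += 1
            tally[1] += int(downloaded)
        return {key: downloads / visits
                for key, (visits, downloads) in counts.items()}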

In particular, this research revealed several interesting facts:

Rank Histories

At the same time that we generated the information described above, we also performed a short experiment (the results are available here) on the number of hours an item appears among the top 5 items (ranked by batting average) for its collection and for its mediatype. This is of particular interest because when an item has one of the top five batting averages, it appears in the "Top Batting Averages" content box on its collection landing page and on its mediatype landing page. This results in very high visibility for the item.

One item that was of particular interest is a movie called A Visit to Santa in the Prelinger Archives collection. This item was "spotlighted" during Christmas time, which means that a staff member picked it to be featured in the spotlight content box on the Prelinger landing page and on the movies landing page. The result was a dramatic increase in the item's batting average, which pushed it into the top 5 for both its mediatype and its collection. When the holidays passed, the item lost steam and was replaced by a different film. The graph of these changes is provided here:

[Graph: rank history of A Visit to Santa around the holidays]

Batting Averages Visibility Experiment

Finally, one of our most recent experiments (still in progress) concerns the visibility of batting averages. On every details page the batting average for that item appears above the links to download the item. We were curious what impact, if any, displaying the batting average there has on the likelihood of downloading the item. So we decided to randomly show the batting average to half of the users and to hide it from the other half. The results (which update dynamically with every click to a details page or download) are available at the Archive: http://www.archive.org/~jon/onoff.php. Preliminary results suggest that the difference between batting averages when the figure is displayed and when it is not is quite negligible. Nevertheless, the experiment provides the framework for future trials on which features of a details page are likely to induce downloads.
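
The show/hide split could be implemented in many ways; as one illustration (not a description of the PHP actually running on the site), here is a deterministic 50/50 assignment based on a session identifier:

    import hashlib

    def show_batting_average(session_id):
        """Deterministically place half of all sessions in the 'show'
        bucket and half in the 'hide' bucket, so a user sees a consistent
        details page within a session."""
        digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 2 == 0

    # Each details-page view and each download is then logged along with
    # its bucket, and the two per-bucket batting averages are compared.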

Future Research

This summer Jon A. will graduate and leave to work full-time with the Internet Archive. Dan and the two Jons plan to continue working together in the future, both remotely and in person during down times in school and work. The Archive is committed to providing researchers with access to the website and to finding new and productive ways of interacting with scholars and researchers.

If you're interested in conducting research with the Archive, one well-established program is research access to the Wayback Machine collection (terabytes of archived web pages dating back to 1996). Researchers are given accounts on Internet Archive servers, with access to a vast array of powerful tools (such as p2, a particularly interesting piece of code written by Andy Jewel, which allows users to perform tasks in parallel across many machines with one command).

Researchers

Jon Aizen - joa4@cornell.edu
Cornell Computer Science Major (2003)
Internet Archive Web Engineer

Antal Novak - afn2@cornell.edu
Cornell Computer Science Major (2003)

Jon Kleinberg - kleinber@cs.cornell.edu
Cornell Computer Science Professor

Daniel Huttenlocher - dph@cs.cornell.edu
Cornell Computer Science Professor

Terms and Definitions

Details pages: At the Internet Archive every item (except for archived web pages) has its own "details page." Similar to details pages at Amazon, these serve as the source of all information for a given item. The details pages contain recent reviews and ratings, download counts, links to download the item, descriptions, release dates, synopses, keywords, categories, and other information. For example, here are some links to details pages:

Content boxes: Each mediatype and collection landing page on the website is composed of content boxes featuring lists of items, most of which are sorted by some criteria. These lists include the most downloaded items, the most frequently accessed categories, items related to this day in history, and items with the highest download batting averages. When a content box appears on a mediatype landing page, the list is composed of items from all the collections belonging to that mediatype. Naturally, when a content box appears on a landing page for a collection, the items are exclusively items from that collection.

Landing pages: Every mediatype in the Archive (audio, texts, moving images, and software) has its own page. This page is termed a landing page because it is where users "land" when they click on a link in the navigation bar. Additionally, each mediatype is composed of different collections (except for software and audio which each only have one collection). Each of these collections has its own landing page as well. The landing pages for collections show information and navigational links only for items and categories in that collection. On the other hand, the landing pages for mediatypes include information and links for all items of that mediatype. The following are some examples of landing pages:

Reviews: Every item in the Archive can be reviewed by a registered user. Users are required to give items a star rating between 1 and 5. When an item is reviewed (or when a review is updated), it appears on lists of recently updated reviews. One of these lists exists on the main page of the website and another exists on the landing page of the collection to which the item belongs.

Links