Newswire - Collaborative real-time content delivery

Overview

Below is an early description of the system after we did the first design earlier in 2003. It made extensive use of Astrolabe and other tools which by now have been replaced by custom technology. Still this overview serves as a reasonable introduction to the problems and gives a high-level view of the technology.

The early design of Newswire was also presented at Internet2 workshop on "Collaborative Computing in Higher Education: Peer-to-Peer and Beyond".

This description was prepared for and presented at the IEEE Workshop on Resource Sharing in Massively Distributed Systems (RESH'02), Vienna, Austria, July 2002. A pdf version can be found here

A Collaborative Infrastructure for Scalable and Robust News Delivery

In this overview we describe the model used for the NewsWire collaborative content delivery system. The system builds on the robustness and scalability of Astrolabe to weave a peer-to-peer infrastructure for real-time delivery of news items. The goal of the system is to deliver news updates to hundreds of thousands of subscribers within tens of seconds of the moment of publishing. The system significantly reduces the compute and network load at the publishers and guarantees delivery even in the face of publisher overload or denial of service attacks

The Current Model for Internet Content Delivery

The traditional model of web-based publish/subscribe is poorly matched to websites that update their information very frequently. The pull-model requires subscribers to return to the publisher periodically to retrieve new information. The information they receive may be not have changed since their last access, and even if it has changed, will often contain a large redundant subset when compared to an earlier retrieval action. For example a community news site such as Slashdot.org, where the front-page summarizes recently added news articles, receives about a million hits a day on this page, many from returning consumers. It is estimated that a consumer who returns 4 times during a day receives about 70% redundant data. Consumers who return more frequently (and Slashdot.org has many) receive a much higher rate of redundant data. From an end-to-end perspective, the bandwidth necessary to support such a site is heavily underutilized, as it is primarily employed to transport redundant data. Slashdot.org has a policy that requests its consumers, and certainly the automated consumers, to not pull the site more than once per hour.

Another problem with the centralized approach is that the publishers are very sensitive to overload and denial of service attacks. As we have seen during the terrorist attacks in September 2001, Internet news sites become completely useless under overload, failing even to service a small percentage of the visitors.

One way to address these problems involves what are called RSS channels. Here an automated consumer pulls a summary of the available information in XML format, which can be used to determine whether full information needs to be retrieved. Again using Slashdot as our example, the RSS channel data would contain only headlines and URLs to full articles. A second approach in dealing with frequent pulls by subscribers is to use of a “last-modified/if-modified-since/not-modified” http request/response sequence combined with a delta-encoding for transmitting only the changes to the information source. In both the RSS and delta-encoding case the model for accessing the summaries and the data, namely a pull-mechanism, remains unchanged.

To guarantee robust and scalable access to information, one would prefer to use a push mechanism for publishing data that changes frequently. In this approach, the consumer receives exactly the desired information without any unnecessary overhead, in a timely manner, and avoids having to frequently pull the producer. Such a service is actually offered by some news site through proprietary means (for considerable fees). Other sites offer e-mail based services to provide client with latest news updates. Many of the highest-volume news sites use a hybrid push/pull approach to push their information to geographically distributed content delivery nodes, from which the consumer still has to pull the data.

The Case for Collaborative Content Delivery

Our premise is that current push solutions fail to take advantage of the collaborative power of the Internet. The solutions are often proprietary, and employ a one-to-many model where the producer is expected to deliver “personalized” content directly to each of the consumers. The approach clearly has scalability limitations. Yet despite these problems, there has been little activity by publishers of the many real-time news sites to provide a coordinated solution for this problem. We believe that the time has come for an Internet-wide infrastructure for efficient real-time content delivery.

The approach we favor is based on a scalable and robust push-based publish/subscribe system, which delivers data directly to consumers. The system would support multiple publishers (news sources), each of which would input updates into the system. A consumer would subscribe to the appropriate subjects. This position paper tackles the distribution aspects of the problem, using peer-to-peer techniques. Notice that existing off-the-shelf publish/subscribe systems are not scalable to Internet-wide use, and often require extensive manual configuration and dedicated server infrastructure.

This paper describes a peer-to-peer publish/subscribe system that, we believe, really could operate at Internet scale. Our solution is robust to disruption, rapid, offers new ways of customizing delivery to subsets of the subscribers, and needs no centralized infrastructure or dedicated servers. The system is based on a set of peer-to-peer component technologies, with which it weaves an infrastructure for the delivery of message updates through the direct use of cooperating end-nodes. Specifically, our solution is based on Astrolabe; a software system designed for ultra-scalable infrastructure and distributed systems management. Astrolabe already exists, and our new publish-subscribe technology is under development now. We expect to make a prototype available for public download in the spring of 2003.

Astrolabe – Scalable Distributed Management and Control

Astrolabe is technology that is designed for the monitoring, management and data-mining of large-scale distributed systems. The technology has four principal properties:

1. Robust, through the use of epidemic techniques

2. Scalable, through the use of information aggregation and fusion

3. Secure, through pervasive use of certificates

4. Flexible, through secure mobile code.

A detailed description of Astrolabe is outside the scope of this paper and can be found in [1]. In the remained of this section we will try to give sufficient background to establish an intuitive notion of Astrolabe’s capabilities.

Astrolabe is best visualized as a collection of hierarchical database tables. At the leaf table, a row is assigned to a particular process or user, which is allowed to update this row with attributes & values. These leaf tables (and there may be a great many of them, perhaps tens of thousands) are aggregated, with each leaf table contributing a read-only summary row to its parent table Each of these tables is limited to some small size (say, 64-rows); thus the hierarchy may be several levels deep. The same aggregation mechanism is used between each level. We use the term zone to denote one of these tables (DNS has a similar abstraction, the domain).

Stepping back, Astrolabe can be understood as a database distributed through the network, but not residing on any server. In contrast, the database in question is virtual: like a jigsaw puzzle, each participant stores just a part of the data structure, and the illusion of a tree of tables is constructed at runtime through a peer-to-peer protocol. The protocol also disseminates updates, which trigger re-computation of parent tables much as a spreadsheet updates dependent cells when the cells on which they depend are updated. In a similar sense, one could say that a peer-to-peer file system like PAST [2] resides in the network, but is not fully represented by any server. Details can be found in [1].

Our use of Astrolabe in this paper focuses on the data associated with each computer connected to the system. This data is organized into a single row per machine (or per user), and can be understood as containing a time-varying list of attributes exported by the machine. For our purposes, those attributes define the machine’s subscription. Attributes could be classical “subjects”, or could be very remote from classical subjects, containing any sort of value or descriptive information or even program-generated information associated with the machine.

Recall that parent tables contain summaries or aggregations of information in their child tables. Astrolabe computes these summaries using aggregation functions, which are expressions in SQL that take any number of attributes from the child table and produce new attributes for inclusion into the appropriate row in the parent table. These aggregation functions are recomputed whenever a row changes in a child table. The aggregations functions are thus a form of mobile code, distributed throughout system using the same epidemic techniques as are used for updates to the data in the rows themselves, and executed on the machines that run the system – the leaf nodes. The system is fully secured through the use of public key certificates.

Astrolabe’s epidemic communication techniques guarantee that the state represented is eventually consistent, e.g. if one were to freeze the system, all nodes would eventually enter into consistent states. Of course, through the information hierarchy imposed by the zoning structure, each instance will only “see” the state of its neighbors and of tables between itself and the root.

Astrolabe as an infrastructure management service

Earlier, we commented that a weakness of off-the-shelf publish-subscribe systems is that they require extensive configuration. One of the premier applications of Astrolabe technology is in the realm of infrastructure management. The fact that Astrolabe represents state, instead of state-transitions, and provides dynamic aggregation of the individual state into information hierarchies, makes the system highly appropriate for managing a distributed infrastructure. For example, we have explored the use of Astrolabe to manage dynamic bandwidth in media applications, and for flow management and control in telecommunications settings. The robustness and scalability of the underlying epidemic technology make it a very attractive system for environments where guaranteed eventual consistency is essential to the operation of a critical infrastructure. Thus, one use for Astrolabe in a scalable publish-subscribe setting is to simply manage the publish-subscribe subsystem. In doing so, we can overcome one of the important obstacles to performing publish-subscribe at Internet scale.

Examples of infrastructure management attributes that can easily be stored in Astrolabe include the availability and configuration of local communication paths, as well as performance measurements of local networking and computing elements. The aggregation functions used in this setting would typically compute aggregated availability and performance of network, and might offer real-time guidance concerning which elements are in the min/max category, and hence represent targets for new operations.

Application level multicast service based on Astrolabe

A second use of Astrolabe is to support an application level multicast service. The basic primitive used here is a method SendToZone(zone,data), where the data is disseminated through all the children zones to all the leaf nodes that are in the tree under this zone. If the method is executed with the root zone as the parameter, all the nodes in the Astrolabe system will receive the data.

The information that the multicast system requires from Astrolabe is the set of multicast representatives in each zone. The representatives are selected in each zone through an aggregation function that combines the local knowledge of availability of independent network paths to a node, the load on those paths and the load on each node. This function will post the results to its entry in the parent zone; together with some basic attributes on which higher-level zone aggregation can be performed.

When a SendToZone is executed the system will visit each of the entries in zone table, each representing a child of this zone. For each of the entries the attribute with the set of multicast representatives will be retrieved and the data will be forwarded to one of the representatives based on a set of local criteria such as where there currently are open connections to one of the representatives. At the arrival of the data at the representative, the process is repeated recursively for all the children in the zone it represents, until the data arrives at the leaf nodes, where it will be delivered to the application Astrolabe is part of. In effect, multicast is performed as a kind of recursive computation on the aggregation in the zone.

For reasons of brevity, we omit additional details of this mechanism. The actual implementation includes a number of optimizations that accelerate multicasts in cases where the same aggregation function is used by a series of SendToZone operations. Although work remains to be done, the protocol thus obtained should have many of the properties of Bimodal Multicast, a peer-to-peer reliable multicast protocol developed by our group several years ago.

Publish/Subscribe based on Astrolabe

Jointly, these mechanisms permit us to define a publish/subscribe architecture based on Astrolabe. Basically, the solution extends the Astrolabe-based application-level multicast with a selective forwarding mechanism. To support this leaf nodes publish attributes that represent the subscriptions in which the associated machine is interested. Aggregation functions than perform a simple summary that posts, in the parent zone, a list of the child zones in which there are nodes interested in this subscription. Eventually (within tens of seconds) the root zone will have all the information on whether there are leaf nodes in the system that have subscribed to particular publications.

Publishing data into the system is similar to the multicast summarized above, except for that the decision to forward the data to a child zone now is conditional on a leaf below that child zone to have subscribed to the publication. At the forwarding node this is a simple test that inspects the subscription attribute in the aggregated information for that child zone. If the attribute is present and it shows active subscribers, the data will be forwarded to a child zone representative.

Having an attribute for each possible subscription would be poorly scalable because the work done for purposes of filtering would be at least linear in the number of subscriptions. Accordingly, our system replaces the attributes with a Bloom filter that represents all the subscriptions in the system. A Bloom filter is a probabilistic mechanism for rapidly testing membership in a large set using multiple hash functions into a single array of bits. In the pub/sub system we can use a large single bit array in the order of a thousand bits or more. At a leaf node a subscription is hashed to a single bit in the array, and the subscription arrays are aggregated into parent zones trough a simple binary-or operation on the child arrays. At the publishing node an attribute is added to the data representing the bit position in the subscription array this publication corresponds to. This information is then used at each of the forwarding nodes to test whether the particular bit position in the subscription array is set, and whether the data should thus be forwarded.

The use of Bloom filters is not perfect, insofar as multiple subscriptions can hash to the same bit in the array. Accordingly, a final test is needed at the leaf node whether the data that arrives at the node truly matches a subscription. However, the accuracy can be made as good as desired by varying the size of the bit array, and we believe that a relatively small array will be more than adequate for the target domain of our effort: Internet news services.

News item subscriptions

Our publish/subscribe system is, among other things, a research vehicle for understanding the trade-offs in different mappings between news article meta-data and subscription expressions. The news articles are published in the ICE, NITF and NewsML formats [3], which are all XML standards used in the news industry, which not only deal with the description of the content but also provide a mechanism for the standard description of the news-item meta-data that is used in the construction of subscriptions.

An early internal prototype of the system, built as a proof of concept, uses the simpler NITF format, with the subscriptions expressed as a set of interest areas on a per-publisher basis. Each available publisher is represented as an attribute in Astrolabe, where the value of the attribute is a small bit mask that corresponds to a specific set of news categories this publisher provides. The bit masks are aggregated in the same way as the Bloom filters, as described in the previous section. This prototype has limited scalability in the selection of publishers and is not flexible in term of the expressiveness of subscriptions. However, we expect to do much more as we move towards NewsML and begin to enrich the subscription “space” within which our Bloom filters operate.

A Collaborative Delivery Infrastructure for News Items

Our publish-subscribe system is intended as a single application that people can download and use to insert themselves into the Collaborative Content Delivery Network. Users would subscribe to a set of publishers and provide more complex selection criteria based on the meta-data associated with the news-items, in the form of an SQL query. By inserting themselves into the network they possibly provide forwarding services to other nodes in the network depending on the criteria used to construct set of zone representatives. A user will have access to a set of configuration parameters that provides input into the selection process.

The automatic configuration of application instances into zones and the location in the zone hierarchy, as well as the configuration necessary to handle firewalls, has been addressed in the context of our overall Astrolabe research effort, but is outside of the scope of this paper [1].

News producers would download and run a different application capable of publishing information according to a restrictive set of rules. These restrictions are necessary to handle the authentication of publishers, to assure the authenticity of the data they publish, and to perform flow control. The infrastructure necessary to support these functions is still a research topic.

Under the covers of the publisher is an application identical to the subscriber application core, insofar as it is just another Astrolabe leaf node, using its local aggregation zone tables to drive the dissemination of its data. The selection and filtering mechanisms used in each forwarding component protect the system from flooding by publishers.

A publisher is able to restrict the scope of the dissemination of the data by selecting another zone than the root zone to publish data into. This for example allows the publisher to disseminate localized news items in Asia. A future feature planned for the system is to allow the publisher more control over the dissemination by adding a predicates to the metadata that needs to be evaluated using the attribute values of a child zone before it can be forwarded to that zone. This would allow the publisher to select the set of subscribers to which an item will be delivered. For example, a publisher could send some item only to “premium” subscribers, or onto to subscribers which have previously shown an interest in certain products.

The Forwarding & End System Components

Our system seeks to deliver news items to the subscribers in the order of tens of seconds, even if tens or hundreds of thousands of subscribers are active. Each forwarding component maintains a log file and a set of forwarding queues, one for each of the representatives at a child zone. The best strategy to fill queues is still under research. We are experimenting with weighted round-robin strategies, as well as some more aggressive techniques. In this context we are also looking at what information the representative can post into their tables to aid the selection process.

News items are uniquely identified by the publisher as part of the news item meta-data; this can be used to remove duplicates, when (in the manner of the MIT scalable publish-subscribe work [4]) we use multiple representatives to forward a new item, to increase the robustness of the delivery.

At the end system the news items are delivered to a message cache, which is feeds the applications that use the news items. Automatic cache management can be configured to provide item management based on the meta-data of the news items, which includes information about item revision history. On the basis of this metadata, the news item can be garbage collected, or fused or aggregated into a more compact form.

The same cache is used for assisting in achieving end-to-end reliability in the case of forwarding node failures, and for a limited state transfer to participants that are joining the system.

Experimentation and deployment

As described earlier in this paper we are currently working with an early prototype which is to serve as a proof of concept for the system. We are experimenting to understand issues such as the complexity of forwarding node selection, the use of redundant message publishing, node failure & automatic zone reconfiguration and the impact of those issues on end-to-end reliability, publisher authentication, and of course the overall performance of the system.

In parallel with this research we have started to build a first production version of the system that is targeted for wide-scale experimentation by actual users. We intent to make two system configurations available late in the spring of 2002; the first will be targeted towards the publishing of technical news articles by sites such as Slashdot.org, Wired, The Register, SilliconValley.com, News.com, etc. The second configuration will be targeted towards the general news distribution with publishing by Reuters, Associated Press, the New York Times, etc. We are working to get the collaboration of the major news sites, but we have already developed some agents that are capable of transforming the current RSS/HTML information from some publishers into message streams for the system to bootstrap it.

The interface to the system is currently still under development but it will be a full user control application in the same style as many of the current file sharing applications, with an additional web interface for access. We are also looking for integration into popular content aggregation systems such as Radio Userland using XML-RPC mechanisms

References

[1] Robbert van Renesse and Kenneth Birman, “Astrolabe, A Robust and Scalable Technology for Distributed Monitoring, Management and Data Mining”, submitted for publication, 2002.

[2] Rowstron and P. Druschel, Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility, in Proceedings of the 18th ACM Symposium on Operating Systems Principles, pp. 160-173, Banff, Canada, October 2001

[3] International Press and Telecommunication Counsil, http://www.iptc.org.

[4] Alex C. Snoeren, Kenneth Conley, and David K. Gifford, “Mesh-Based Content Routing using XML”, in Proceedings of the 18th ACM Symposium on Operating Systems Principles, pp. 160-173, Banff, Canada, October 2001.