NewsWire - Overview
Collaborative real-time information delivery
Overview
Below is an early description of the system after we did the
first design earlier in 2003. It made extensive use of Astrolabe and
other tools which by now have been replaced by custom technology.
Still this overview serves as a reasonable introduction to the
problems and gives a high-level view of the technology.
The early design of Newswire was also presented at Internet2
workshop on "Collaborative Computing in Higher Education:
Peer-to-Peer and Beyond".
This description was prepared for and presented at the IEEE
Workshop on Resource Sharing in Massively Distributed Systems
(RESH'02), Vienna, Austria, July 2002. A pdf version can
be found here
A Collaborative Infrastructure for Scalable and Robust News
Delivery
In this overview we describe the model used for the NewsWire
collaborative content delivery system. The system builds on the
robustness and scalability of Astrolabe to weave a peer-to-peer
infrastructure for real-time delivery of news items. The goal of the
system is to deliver news updates to hundreds of thousands of
subscribers within tens of seconds of the moment of publishing. The
system significantly reduces the compute and network load at the
publishers and guarantees delivery even in the face of publisher
overload or denial of service attacks
The Current Model for Internet Content Delivery
The traditional model of web-based
publish/subscribe is poorly matched to websites that update their
information very frequently. The pull-model requires
subscribers to return to the publisher periodically to retrieve new
information. The information they receive may be not have changed
since their last access, and even if it has changed, will often
contain a large redundant subset when compared to an earlier
retrieval action. For example a community news site such as
Slashdot.org, where the front-page summarizes recently added news
articles, receives about a million hits a day on this page, many
from returning consumers. It is estimated that a consumer who
returns 4 times during a day receives about 70% redundant data.
Consumers who return more frequently (and Slashdot.org has many)
receive a much higher rate of redundant data. From an end-to-end
perspective, the bandwidth necessary to support such a site is
heavily underutilized, as it is primarily employed to transport
redundant data. Slashdot.org has a policy that requests its
consumers, and certainly the automated consumers, to not pull the
site more than once per hour.
Another problem with the centralized approach
is that the publishers are very sensitive to overload and denial of
service attacks. As we have seen during the terrorist attacks in
September 2001, Internet news sites become completely useless under
overload, failing even to service a small percentage of the
visitors.
One way to address these problems involves what
are called RSS channels. Here an automated consumer pulls a summary
of the available information in XML format, which can be used to
determine whether full information needs to be retrieved. Again
using Slashdot as our example, the RSS channel data would contain
only headlines and URLs to full articles. A second approach in
dealing with frequent pulls by subscribers is to use of a
“last-modified/if-modified-since/not-modified” http request/response
sequence combined with a delta-encoding for transmitting only the
changes to the information source. In both the RSS and
delta-encoding case the model for accessing the summaries and the
data, namely a pull-mechanism, remains unchanged.
To guarantee robust and scalable access to
information, one would prefer to use a push mechanism for publishing
data that changes frequently. In this approach, the consumer
receives exactly the desired information without any unnecessary
overhead, in a timely manner, and avoids having to frequently pull
the producer. Such a service is actually offered by some news site
through proprietary means (for considerable fees). Other sites offer
e-mail based services to provide client with latest news updates.
Many of the highest-volume news sites use a hybrid push/pull
approach to push their information to geographically distributed
content delivery nodes, from which the consumer still has to pull
the data.
The
Case for Collaborative Content Delivery
Our premise is that current push solutions fail to
take advantage of the collaborative power of the Internet. The solutions
are often proprietary, and employ a one-to-many model where the producer
is expected to deliver “personalized” content directly to each of the
consumers. The approach clearly has scalability limitations. Yet despite
these problems, there has been little activity by publishers of the many
real-time news sites to provide a coordinated solution for this problem.
We believe that the time has come for an Internet-wide infrastructure
for efficient real-time content delivery.
The approach we favor is based on a scalable and
robust push-based publish/subscribe system, which delivers data directly
to consumers. The system would support multiple publishers (news
sources), each of which would input updates into the system. A consumer
would subscribe to the appropriate subjects. This position paper tackles
the distribution aspects of the problem, using peer-to-peer techniques.
Notice that existing off-the-shelf publish/subscribe systems are not
scalable to Internet-wide use, and often require extensive manual
configuration and dedicated server infrastructure.
This paper describes a peer-to-peer
publish/subscribe system that, we believe, really could operate at
Internet scale. Our solution is robust to disruption, rapid, offers new
ways of customizing delivery to subsets of the subscribers, and needs no
centralized infrastructure or dedicated servers. The system is based on
a set of peer-to-peer component technologies, with which it weaves an
infrastructure for the delivery of message updates through the direct
use of cooperating end-nodes. Specifically, our solution is based on
Astrolabe; a software system designed for ultra-scalable infrastructure
and distributed systems management. Astrolabe already exists, and our
new publish-subscribe technology is under development now. We expect to
make a prototype available for public download in the spring of 2003.
Astrolabe – Scalable Distributed Management and Control
Astrolabe is technology that is designed for the
monitoring, management and data-mining of large-scale distributed
systems. The technology has four principal properties:
1.
Robust, through the use of epidemic techniques
2.
Scalable, through the use of information aggregation and fusion
3.
Secure, through pervasive use of certificates
4.
Flexible, through secure mobile code.
A detailed description of Astrolabe is outside the
scope of this paper and can be found in [1]. In the remained of this
section we will try to give sufficient background to establish an
intuitive notion of Astrolabe’s capabilities.
Astrolabe is best visualized as a collection of
hierarchical database tables. At the leaf table, a row is assigned to a
particular process or user, which is allowed to update this row with
attributes & values. These leaf tables (and there may be a great many of
them, perhaps tens of thousands) are aggregated, with each leaf table
contributing a read-only summary row to its parent table Each of these
tables is limited to some small size (say, 64-rows); thus the hierarchy
may be several levels deep. The same aggregation mechanism is used
between each level. We use the term zone to denote one of these
tables (DNS has a similar abstraction, the domain).
Stepping back, Astrolabe can be understood as a
database distributed through the network, but not residing on any
server. In contrast, the database in question is virtual: like a jigsaw
puzzle, each participant stores just a part of the data structure, and
the illusion of a tree of tables is constructed at runtime through a
peer-to-peer protocol. The protocol also disseminates updates, which
trigger re-computation of parent tables much as a spreadsheet updates
dependent cells when the cells on which they depend are updated. In a
similar sense, one could say that a peer-to-peer file system like PAST
[2] resides in the network, but is not fully represented by any server.
Details can be found in [1].
Our use of Astrolabe in this paper focuses on the
data associated with each computer connected to the system. This data
is organized into a single row per machine (or per user), and can be
understood as containing a time-varying list of attributes exported by
the machine. For our purposes, those attributes define the machine’s
subscription. Attributes could be classical “subjects”, or could be
very remote from classical subjects, containing any sort of value or
descriptive information or even program-generated information associated
with the machine.
Recall that parent tables contain summaries or
aggregations of information in their child tables. Astrolabe
computes these summaries using aggregation functions, which are
expressions in SQL that take any number of attributes from the child
table and produce new attributes for inclusion into the appropriate row
in the parent table. These aggregation functions are recomputed whenever
a row changes in a child table. The aggregations functions are thus a
form of mobile code, distributed throughout system using the same
epidemic techniques as are used for updates to the data in the rows
themselves, and executed on the machines that run the system – the leaf
nodes. The system is fully secured through the use of public key
certificates.
Astrolabe’s epidemic communication techniques guarantee that the state
represented is eventually consistent, e.g. if one were to freeze
the system, all nodes would eventually enter into consistent states. Of
course, through the information hierarchy imposed by the zoning
structure, each instance will only “see” the state of its neighbors and
of tables between itself and the root.
Astrolabe
as an infrastructure management service
Earlier, we commented that a weakness of
off-the-shelf publish-subscribe systems is that they require extensive
configuration. One of the premier applications of Astrolabe technology
is in the realm of infrastructure management. The fact that Astrolabe
represents state, instead of state-transitions, and provides dynamic
aggregation of the individual state into information hierarchies, makes
the system highly appropriate for managing a distributed infrastructure.
For example, we have explored the use of Astrolabe to manage dynamic
bandwidth in media applications, and for flow management and control in
telecommunications settings. The robustness and scalability of the
underlying epidemic technology make it a very attractive system for
environments where guaranteed eventual consistency is essential to the
operation of a critical infrastructure. Thus, one use for Astrolabe in
a scalable publish-subscribe setting is to simply manage the
publish-subscribe subsystem. In doing so, we can overcome one of the
important obstacles to performing publish-subscribe at Internet scale.
Examples of infrastructure management attributes
that can easily be stored in Astrolabe include the availability and
configuration of local communication paths, as well as performance
measurements of local networking and computing elements. The aggregation
functions used in this setting would typically compute aggregated
availability and performance of network, and might offer real-time
guidance concerning which elements are in the min/max category, and
hence represent targets for new operations.
Application level multicast service based on Astrolabe
A second use of Astrolabe is to support an
application level multicast service. The basic primitive used here is a
method SendToZone(zone,data), where the data is disseminated
through all the children zones to all the leaf nodes that are in the
tree under this zone. If the method is executed with the root
zone as the parameter, all the nodes in the Astrolabe system will
receive the data.
The information that the multicast system requires
from Astrolabe is the set of multicast representatives in each zone. The
representatives are selected in each zone through an aggregation
function that combines the local knowledge of availability of
independent network paths to a node, the load on those paths and the
load on each node. This function will post the results to its entry in
the parent zone; together with some basic attributes on which
higher-level zone aggregation can be performed.
When a SendToZone is executed the system
will visit each of the entries in zone table, each representing a child
of this zone. For each of the entries the attribute with the set of
multicast representatives will be retrieved and the data will be
forwarded to one of the representatives based on a set of local criteria
such as where there currently are open connections to one of the
representatives. At the arrival of the data at the representative, the
process is repeated recursively for all the children in the zone it
represents, until the data arrives at the leaf nodes, where it will be
delivered to the application Astrolabe is part of. In effect, multicast
is performed as a kind of recursive computation on the aggregation in
the zone.
For reasons of brevity, we omit additional details
of this mechanism. The actual implementation includes a number of
optimizations that accelerate multicasts in cases where the same
aggregation function is used by a series of SendToZone
operations. Although work remains to be done, the protocol thus
obtained should have many of the properties of Bimodal Multicast, a
peer-to-peer reliable multicast protocol developed by our group several
years ago.
Publish/Subscribe based on Astrolabe
Jointly, these mechanisms permit us to define a
publish/subscribe architecture based on Astrolabe. Basically, the
solution extends the Astrolabe-based application-level multicast with a
selective forwarding mechanism. To support this leaf nodes publish
attributes that represent the subscriptions in which the associated
machine is interested. Aggregation functions than perform a simple
summary that posts, in the parent zone, a list of the child zones in
which there are nodes interested in this subscription. Eventually
(within tens of seconds) the root zone will have all the information on
whether there are leaf nodes in the system that have subscribed to
particular publications.
Publishing data into the system is similar to the
multicast summarized above, except for that the decision to forward the
data to a child zone now is conditional on a leaf below that child zone
to have subscribed to the publication. At the forwarding node this is a
simple test that inspects the subscription attribute in the aggregated
information for that child zone. If the attribute is present and it
shows active subscribers, the data will be forwarded to a child zone
representative.
Having an attribute for each possible subscription
would be poorly scalable because the work done for purposes of filtering
would be at least linear in the number of subscriptions. Accordingly,
our system replaces the attributes with a Bloom filter that represents
all the subscriptions in the system. A Bloom filter is a probabilistic
mechanism for rapidly testing membership in a large set using multiple
hash functions into a single array of bits. In the pub/sub system we
can use a large single bit array in the order of a thousand bits or
more. At a leaf node a subscription is hashed to a single bit in the
array, and the subscription arrays are aggregated into parent zones
trough a simple binary-or operation on the child arrays. At the
publishing node an attribute is added to the data representing the bit
position in the subscription array this publication corresponds to. This
information is then used at each of the forwarding nodes to test whether
the particular bit position in the subscription array is set, and
whether the data should thus be forwarded.
The use
of Bloom filters is not perfect, insofar as multiple subscriptions can
hash to the same bit in the array. Accordingly, a final test is needed
at the leaf node whether the data that arrives at the node truly matches
a subscription. However, the accuracy can be made as good as desired
by varying the size of the bit array, and we believe that a relatively
small array will be more than adequate for the target domain of our
effort: Internet news services.
News item subscriptions
Our publish/subscribe system is, among other
things, a research vehicle for understanding the trade-offs in different
mappings between news article meta-data and subscription expressions.
The news articles are published in the ICE, NITF and NewsML formats [3],
which are all XML standards used in the news industry, which not only
deal with the description of the content but also provide a mechanism
for the standard description of the news-item meta-data that is used in
the construction of subscriptions.
An early internal prototype of the system, built as
a proof of concept, uses the simpler NITF format, with the subscriptions
expressed as a set of interest areas on a per-publisher basis. Each
available publisher is represented as an attribute in Astrolabe, where
the value of the attribute is a small bit mask that corresponds to a
specific set of news categories this publisher provides. The bit masks
are aggregated in the same way as the Bloom filters, as described in the
previous section. This prototype has limited scalability in the
selection of publishers and is not flexible in term of the
expressiveness of subscriptions. However, we expect to do much more as
we move towards NewsML and begin to enrich the subscription “space”
within which our Bloom filters operate.
A Collaborative Delivery Infrastructure for News Items
Our publish-subscribe system is intended as a
single application that people can download and use to insert themselves
into the Collaborative Content Delivery Network. Users would subscribe
to a set of publishers and provide more complex selection criteria based
on the meta-data associated with the news-items, in the form of an SQL
query. By inserting themselves into the network they possibly provide
forwarding services to other nodes in the network depending on the
criteria used to construct set of zone representatives. A user will have
access to a set of configuration parameters that provides input into the
selection process.
The automatic configuration of application
instances into zones and the location in the zone hierarchy, as well as
the configuration necessary to handle firewalls, has been addressed in
the context of our overall Astrolabe research effort, but is outside of
the scope of this paper [1].
News producers would download and run a different
application capable of publishing information according to a restrictive
set of rules. These restrictions are necessary to handle the
authentication of publishers, to assure the authenticity of the data
they publish, and to perform flow control. The infrastructure necessary
to support these functions is still a research topic.
Under the covers of the publisher is an application
identical to the subscriber application core, insofar as it is just
another Astrolabe leaf node, using its local aggregation zone tables to
drive the dissemination of its data. The selection and filtering
mechanisms used in each forwarding component protect the system from
flooding by publishers.
A publisher is able to restrict the scope of the
dissemination of the data by selecting another zone than the root zone
to publish data into. This for example allows the publisher to
disseminate localized news items in Asia. A future feature planned for
the system is to allow the publisher more control over the dissemination
by adding a predicates to the metadata that needs to be evaluated using
the attribute values of a child zone before it can be forwarded to that
zone. This would allow the publisher to select the set of subscribers
to which an item will be delivered. For example, a publisher could send
some item only to “premium” subscribers, or onto to subscribers which
have previously shown an interest in certain products.
The
Forwarding & End System Components
Our system seeks to deliver news items to the
subscribers in the order of tens of seconds, even if tens or hundreds of
thousands of subscribers are active. Each forwarding component maintains
a log file and a set of forwarding queues, one for each of the
representatives at a child zone. The best strategy to fill queues is
still under research. We are experimenting with weighted round-robin
strategies, as well as some more aggressive techniques. In this context
we are also looking at what information the representative can post into
their tables to aid the selection process.
News items are uniquely identified by the publisher
as part of the news item meta-data; this can be used to remove
duplicates, when (in the manner of the MIT scalable publish-subscribe
work [4]) we use multiple representatives to forward a new item, to
increase the robustness of the delivery.
At the end system the news items are delivered to a
message cache, which is feeds the applications that use the news items.
Automatic cache management can be configured to provide item management
based on the meta-data of the news items, which includes information
about item revision history. On the basis of this metadata, the news
item can be garbage collected, or fused or aggregated into a more
compact form.
The
same cache is used for assisting in achieving end-to-end reliability in
the case of forwarding node failures, and for a limited state transfer
to participants that are joining the system.
Experimentation and deployment
As described earlier in this paper we are currently
working with an early prototype which is to serve as a proof of concept
for the system. We are experimenting to understand issues such as the
complexity of forwarding node selection, the use of redundant message
publishing, node failure & automatic zone reconfiguration and the impact
of those issues on end-to-end reliability, publisher authentication,
and of course the overall performance of the system.
In parallel with this research we have started to
build a first production version of the system that is targeted for
wide-scale experimentation by actual users. We intent to make two
system configurations available late in the spring of 2002; the first
will be targeted towards the publishing of technical news articles by
sites such as Slashdot.org, Wired, The Register, SilliconValley.com,
News.com, etc. The second configuration will be targeted towards the
general news distribution with publishing by Reuters, Associated Press,
the New York Times, etc. We are working to get the collaboration of the
major news sites, but we have already developed some agents that are
capable of transforming the current RSS/HTML information from some
publishers into message streams for the system to bootstrap it.
The interface to the system is currently still
under development but it will be a full user control application in the
same style as many of the current file sharing applications, with an
additional web interface for access. We are also looking for integration
into popular content aggregation systems such as Radio Userland using
XML-RPC mechanisms
References
[1]
Robbert van Renesse and Kenneth Birman, “Astrolabe, A Robust and
Scalable Technology for Distributed Monitoring, Management and Data
Mining”, submitted for publication, 2002.
[2]
Rowstron and P. Druschel, Storage management and caching in PAST,
a large-scale, persistent peer-to-peer storage utility, in
Proceedings of the 18th ACM Symposium on Operating Systems Principles,
pp. 160-173, Banff, Canada, October 2001
[3]
International Press and Telecommunication Counsil, http://www.iptc.org.
[4]
Alex C. Snoeren, Kenneth Conley, and David K. Gifford,
“Mesh-Based Content Routing using XML”, in Proceedings of the 18th
ACM Symposium on Operating Systems Principles, pp. 160-173, Banff,
Canada, October 2001.