Creating a global knowledge network

Paul Ginsparg
Los Alamos National Laboratory, Los Alamos, NM

Invited contribution for Conference held at UNESCO HQ, Paris, 19-23 Feb 2001, Second Joint ICSU Press - UNESCO Expert Conference on Electronic Publishing in Science, during session Responses from the scientific community, Tue 20 Feb 2001

Abstract
Some arXiv background
The real lesson of the electronic distribution format
The near future?
References

Abstract: If we were to start from scratch today to design a quality-controlled archive and distribution system for research findings, would it be realized as a set of "electronic clones" of print journals? Could we imagine instead some form of incipient knowledge network for our research communications infrastructure? What differences should be expected in its realization for different scientific research fields? Is there an obvious alternative to the false dichotomy of "classical peer review" vs. no quality control at all? What is the proper role of governments and their funding agencies in this enterprise, and what might be the role of suitably configured professional societies? These are some of the key questions raised by the past decade of initial experience with new forms of electronic research infrastructure. Below, I suggest only some partial answers to these questions, with more complete answers expected on the 5-10 year timescale.

Some arXiv background

Since my talk at the first conference in this series five years ago [1], "Electronic Publishing in Science" has evolved in perception from intriguing possibility to inevitability. This period has also seen widespread acceptance of the internet as a communications medium, both inside and outside of academia, fostered largely by applications such as the WorldWideWeb. While progress on some fronts has been more rapid than might have been anticipated, the core structure and policies of scientific publishing remain essentially unchanged, as do the conclusions and recommendations of this follow-up meeting. In what follows, I will nonetheless suggest some looming instabilities of the current system, and reasons to anticipate much further evolution in the coming decade.

The essential question for "Electronic Publishing in Science" is how our scientific research communications infrastructure should be reconfigured to take maximal advantage of newly evolving electronic resources. Rather than "electronic publishing" which connotes a rather straightforward cloning of the paper methodology to the electronic network, many researchers would prefer to see the new technology lead to some form of global "knowledge network", and sooner rather than later.

Some of the possibilities offered by a unified global archive are suggested by the e-print arXiv (where "e-print" denotes self-archiving by the author), which since its inception in 1991 has become a major forum for dissemination of results in physics and mathematics. This resource has been entirely scientist driven, and is flexible enough either to co-exist with the pre-existing publication system, or to help it evolve to something better optimized for researcher needs. The arXiv is an example of a service created by a group of specialists for their own use: when researchers or professionals create such services, the results often differ markedly from the services provided by publishers and libraries. It is also important to note that the rapid dissemination it provides is not in the least inconsistent with concurrent or post facto peer review, and in the long run offers a possible framework for a more functional archival structuring of the literature than is provided by current peer review processes.

As argued by Odlyzko [2], the current methodology of research dissemination and validation is premised on a paper medium that was difficult to produce, difficult to distribute, difficult to archive, and difficult to duplicate -- a medium that hence required numerous local redistribution points in the form of research libraries. The electronic medium is opposite in each of the above regards, and, hence, if we were to start from scratch today to design a quality-controlled distribution system for research findings, it would likely take a very different form both from the current system and from the electronic clone it would spawn without more constructive input from the research community.

An overview of the growth of the arXiv can be found in the monthly submission statistics, showing the number of submissions received during each month since the inception of service in August 1991. The total number of submissions received during the first 10 years of operation is roughly 170,000. The submission rate continues to increase, and roughly 35,000 new submissions are expected during calendar year 2001. Additional signal is contained in the submission data sorted by subject area. The primary observation is that submission growth during the period 1995-2000 was dominated by new users in Condensed Matter physics and Astrophysics: the sum of whose submission rates grew to exceed those in High Energy Physics by late 1997. Extrapolating current growth rates, within a very few years Condensed Matter submissions alone will likely exceed those of High Energy physics, which had begun to reach saturation (i.e., 100% participation of the community) by the mid 1990's. This suggests that the widespread preexisting practice of exchanging hard copy preprints, as was the case in High Energy physics, may not be essential for the electronic analog of this behavior to be adopted by other research communities.

Where do all the submissions come from? According to the submission statistics sorted by e-mail domain of the submitting author, roughly 30% of the submissions come from United States based submitters; 12% from Germany; 6% from each of the U.K., Italy, Japan; 5% from France; and submissions overall arrive from about 100 different countries. The distribution is similar to that for refereed physics journals, and the participation of any given country is typically proportional to its Gross Domestic Product. Reflecting the international nature of the enterprise, the arXiv maintains a 16-country mirror network to facilitate remote access, which together with the main site typically handles many millions of accesses per week in aggregate.

The real lesson of the electronic distribution format

The need to reexamine the current methodology of scholarly publishing is reinforced by considering the hierarchy of costs and revenues in Figure 1. The figure depicts five orders of magnitude in US dollars per scientific article. To first approximation in an all-electronic future, the editorial cost per article should be roughly independent of length (as is already the bulk of the current costs, with the exception of the time and energy spent by referees, a significant hidden cost omitted from the numbers below).

Figure 1. Hierarchy of per article costs and revenues (log scale), explained in text

At the top of the scale is the minimum $50,000 on average to produce the underlying research for the article, money typically in the form of salary and overhead, and also for experimental equipment. This sets a scale for the overall order of magnitude of the funding involved, and is roughly independent of whether the research is conducted at a university, government lab, or industrial lab.

The next figure on the scale is a rough estimate of the revenues for "high end" commercial journals. (In this case "high end" refers to the pricing, rather than to any additional services provided.) The $10,000--$20,000/article published range is obtained by multiplying the subscription cost per year for some representative "high end" journals by an estimated number of institutional library subscribers and dividing by the number of articles published per year.
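The estimate just described reduces to a single ratio. The following is a minimal sketch in Python, in which the function name and all three input values are hypothetical placeholders, chosen only so that the result lands inside the quoted range; they are not data for any actual journal.

```python
def revenue_per_article(subscription_price, subscribers, articles_per_year):
    """Estimated publisher revenue per published article, in US dollars."""
    if articles_per_year <= 0:
        raise ValueError("articles_per_year must be positive")
    return subscription_price * subscribers / articles_per_year


# Hypothetical inputs, for illustration only (not taken from any real journal).
example = revenue_per_article(
    subscription_price=15_000,  # dollars per institutional subscription per year
    subscribers=500,            # estimated number of institutional library subscribers
    articles_per_year=500,      # articles published per year
)
print(f"${example:,.0f} per article")
```

Because the result scales linearly with the subscriber estimate, an error in that estimate carries over directly into the per-article figure, which is one reason the text quotes a range rather than a single number.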

Odlyzko's estimate [2] for average aggregate publisher revenues in his survey of Mathematics and Computer Science journals is of the order of $4000/article. (Odlyzko has also pointed out that since acquisition costs are typically 1/3 of library budgets, the current system expends an additional $8000/article in other library costs, another cost omitted from these considerations.)

If that average holds more generally, then there must be publishers operating at well below that value. At least one professional society publisher in Physics brings in about $2000/article in revenue. (Note as an aside that insofar as revenues=costs for non-profit operations, and if the level of services provided is the same as by the "high end" commercial publishers, then it is possible to estimate the potential profit margin of the latter.) By eliminating the print product, and by restructuring the workflow to take greater advantage of electronically facilitated efficiencies, it is likely that the costs for a relatively large existing publisher could be brought down closer to $1000/article. This number includes long-term infrastructural needs, such as an editor-in-chief for long-range planning, a small research and development (R&D) staff, and maintenance of an archival database.

We can also ask whether an idealistic electronic start-up venture, without the legacy problems of an existing publisher, might be even more efficient. At least one such venture in Physics, currently publishing about 700 articles per year, operates in the $500/article range, including support of computer operations and overhead for use of space and network connections. But private communications suggest that this number is likely to creep upward rather than downward, as some of the labor volunteered from initial enthusiasm is replaced by paid labor and salaries for existing labor are adjusted to competitive levels for retention, so it might also move closer to the $1000/article published range.

The point of these observations is not by any means to argue that any of the above operations are hopelessly inefficient. To the contrary, the object is to assess in an operational (rather than theoretical) context what are the likely editorial costs if the current system is taken all-electronic. The order of magnitude conclusion is that costs on the order of some irreducible $1000 per peer-reviewed published article should be expected, using current methodology. The number is not too surprising after taking into account that the human labor (plus overhead) that dominates the costs is effectively quantized in chunks of order $100,000 per person (including overhead, since space, utilities, and network connections all cost money). A functional operation, to peer review and publish many hundreds of articles per year, will ordinarily require at least parts of a few people for high level editor, low level secretarial work, system administration, and some amount of R&D on an ongoing basis (still of course assuming volunteered referee time). The costs are therefore immediately in the range of many hundreds of thousands of dollars, confirming the rough $1000/article order of magnitude, up to a factor of 2 one way or another. Moreover, there do not appear to be dramatic economies of scale from taking such an already idealized skeletal editorial operation to larger sizes of thousands or tens of thousands of published articles per year: the labor costs just scale proportionally. (It is even possible that there are certain diseconomies of scale, i.e., that organizing peer review for many thousands of articles per year leads to additional overhead for centralized offices, managerial staff, and more complicated communications infrastructure; but with the potential benefit of a more coherent set of policies and long-range planning for a larger fraction of the literature.)
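The staffing argument above can be checked with a few lines of arithmetic. In this sketch, only the $100,000 per person-year figure comes from the text; the function name, the split of staff across roles, and the article counts are hypothetical values picked to illustrate the order of magnitude.

```python
COST_PER_FTE = 100_000  # dollars per person-year including overhead (figure from the text)


def editorial_cost_per_article(fte_by_role, articles_per_year, cost_per_fte=COST_PER_FTE):
    """Annual labor cost divided by the number of articles published per year."""
    total_fte = sum(fte_by_role.values())
    return total_fte * cost_per_fte / articles_per_year


# Hypothetical staffing for a journal publishing several hundred articles per year.
staff = {"editor": 2.0, "secretarial": 2.0, "system administration": 1.0, "R&D": 1.0}
small = editorial_cost_per_article(staff, articles_per_year=600)

# If labor scales proportionally with volume, the per-article cost does not move.
scaled_staff = {role: 10 * fte for role, fte in staff.items()}
large = editorial_cost_per_article(scaled_staff, articles_per_year=6_000)

print(f"small operation: ${small:,.0f} per article")
print(f"10x operation:   ${large:,.0f} per article")
```

The second calculation simply restates the no-economies-of-scale assumption: multiplying staff and volume by the same factor leaves the ratio unchanged.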

Another data point in Figure 1 is the current revenue for a representative "web printer", i.e., an operation that takes the data feed from an existing print publisher and converts it to HTML and/or PDF for rendering by a suitable browser. At least one such operation currently functions in the $100/article range. It is eminently reasonable that the costs should be somewhat lower than any of the above peer review figures, since these services are conducted after the peer review and other editorial functions have taken place. The revenue may seem high, but that is because the operation currently involves reverse engineering part of a legacy process intended for print, and can require a slightly different re-engineering for each participating publisher. With better standardized formats, and better authoring tools to produce them, the associated costs may diminish. There are also other "transitional" R&D costs to working with a still evolving technology, and additional costs associated with experimentation on formats, search engines, alert services, and other forms of reader personalization.

Finally, at the bottom of the scale in Figure 1 is an estimate of the cost per current arXiv submission: in the $1-$5/submission range, based on the direct labor costs per year involved only in processing incoming submissions and operating an e-mail "help desk". (Hardware and labor costs for maintaining the static archival database add on only a small percentage.) The estimate is given as a range because the labor per submission is a skewed distribution. There are subsets, such as the original hep-th (High Energy Physics - Theory), which operate according to the original "fully automated" design, with users requiring no assistance at all. Indeed the vast majority of submissions require zero labor time and only a very small number of new users or problematic submissions are responsible for all labor time spent. This has to be the case since there are upwards of 200 new submissions and replacements per weekday -- if each took even just 15 minutes of human labor at the arXiv end, that would mean over 50 hours of work per day, i.e., at least 7 full-time employees. The current tiny percentage of problematic submissions, and smattering of other user questions, in reality requires less than a single full-time equivalent, placing the cost in the middle of the above cited range. This is also assuming a relatively matured and static system, without the need for constant R&D -- it is not clear whether this is a realistic long-term assumption, but including more R&D would only push the costs closer to the upper end of the $1-$5 range.
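The back-of-envelope labor estimate in the preceding paragraph can be reproduced directly. The submission count and the 15 minutes per item are the figures used in the text; the 8-hour working day is an assumption introduced here to convert hours into full-time staff.

```python
import math

submissions_per_weekday = 200  # "upwards of 200" in the text, so this is a lower bound
minutes_per_submission = 15    # the hypothetical per-item labor considered in the text
hours_per_workday = 8          # assumption: length of one full-time working day

labor_hours = submissions_per_weekday * minutes_per_submission / 60
staff_needed = math.ceil(labor_hours / hours_per_workday)

print(f"{labor_hours:.0f} hours of labor per weekday")
print(f"at least {staff_needed} full-time employees")
```

With exactly 200 items the total is 50 hours, so any volume above that gives the "over 50 hours" quoted in the text. Rounding up the number of 8-hour days gives 7 people.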

A key point of the electronic communication medium is that the cost to archive an article and make it freely available to the entire world in perpetuity is a tiny fraction of the amount to produce the research in the first place. This is, moreover, consistent with public policy goals [3] for what is in large part publicly funded research.

In the future there is likely to be a more ideal case, in which the steady state labor is not dominated by the current ever-expanding profile of new users, but would rely instead on an experienced userbase in possession of better local authoring tools. Such tools would make it possible for the user to prepare a more sophisticated and fully portable document format, with accurate and automatically parsable metadata, auto-linked references, better treatment of figures and other attachments, and more. Then the rest of the research community could interact with the automated system as autonomously as the original hep-th community, with the result that the system as a whole could operate in the $1/submission range (or below).

The conclusion of the above is that the per article costs for a pure dissemination system are likely to be at least a factor of 100 to 1000 lower than for a conventionally peer reviewed system. This is the real lesson of the move to electronic formats and distribution, i.e., not that everything should somehow be free, but that with many of the production tasks automatable or off-loadable to the authors, the editorial costs will then dominate the costs of an unreviewed distribution system by many orders of magnitude. This crucial point is the subtle difference from the paper system, in which the expenses directly associated to print production and distribution were roughly the same order of magnitude as the editorial costs (estimates for the cost of the print component are typically 30% of the total). It wasn't as essential to ask whether the production and dissemination system should be decoupled from the intellectual authentication system when the two were comparable in cost. Now that the former may be feasible at less than 1% of the latter, the unavoidable question is whether the utility provided by the latter, in its naive extrapolation to electronic form, continues to justify the associated time and expense. Since many communities rely in an essential way on the structuring of the literature provided by the editorial process, the related question is whether there might be some hybrid methodology that can provide all of the benefits of the current system but for a cost somewhere in between the order $1000/article cost of current editorial methodology and the order $1/article cost of a pure distribution system.
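The ratios quoted in this conclusion follow from the per-article figures collected in the discussion of Figure 1. The sketch below gathers those figures, all taken from the text, with point values stored as ranges whose ends coincide, and computes the ratio of editorial to dissemination cost. The variable names are illustrative only.

```python
import math

# Per-article figures quoted in the text, in US dollars, as (low, high) ranges.
FIGURE_1 = {
    "underlying research": (50_000, 50_000),
    "high-end commercial journal revenue": (10_000, 20_000),
    "average publisher revenue (Odlyzko)": (4_000, 4_000),
    "professional society publisher revenue": (2_000, 2_000),
    "all-electronic peer-reviewed publication": (1_000, 1_000),
    "electronic start-up journal": (500, 500),
    "web printer": (100, 100),
    "arXiv submission": (1, 5),
}

editorial = FIGURE_1["all-electronic peer-reviewed publication"][0]
arxiv_low, arxiv_high = FIGURE_1["arXiv submission"]

ratio_min = editorial / arxiv_high  # editorial cost relative to the costliest submission
ratio_max = editorial / arxiv_low   # editorial cost relative to the cheapest submission
share_max = arxiv_high / editorial  # dissemination as a fraction of editorial cost
span = math.log10(FIGURE_1["underlying research"][0] / arxiv_low)

print(f"editorial / dissemination: {ratio_min:.0f}x to {ratio_max:.0f}x")
print(f"dissemination share of editorial cost: at most {share_max:.1%}")
print(f"Figure 1 spans about {span:.1f} orders of magnitude")
```

Taking the $1000 figure at face value gives a factor of 200 to 1000. The stated uncertainty of a factor of 2 in the editorial cost extends the lower end to about 100, as in the text.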

The above questions cannot yet be answered, but some closing observations regarding the above revenue per article estimates may be relevant. The key for any automated system to getting the per article cost down (presuming for the moment that is the objective) will always be to handle far greater volume than can a conventionally edited journal. As mentioned above, any significant fraction of an employee immediately puts the costs per year in the $100,000 range, including overhead, so that requires of order 100,000 articles per year in order to get down towards the bottom of Figure 1. It is also worthwhile to clarify that the above comparisons involve cost per arXiv submission on the one hand, and editorial cost per article published on the other. It is not so much an issue that adding in the numbers for rejected articles would reduce the nominal cost/article (e.g., by a factor of 2 for a journal with a 50% acceptance rate); but rather that the bulk of the editorial time, hence cost, is evidently spent on the articles rejected for publication. (Lest it seem a hopelessly paradoxical and inefficient effort to devote the majority of time to the material that won't be seen, recall that we don't ordinarily regard sculptors as involved in a futile effort just because the vast majority of their time is also spent removing extraneous material.) This does suggest, however, that there might be some modification of the existing editorial methodology to somehow take advantage of the open distribution sector as a pre-filter in order to maximize the time and effort spent on the articles that will be selected for publication, while still maintaining high standards (presuming that remains the objective).
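Two pieces of arithmetic in this paragraph can be made explicit: the volume needed to reach a target per-article cost, and the effect of counting rejected articles. The function names are hypothetical, and the inputs are the round numbers used in the text.

```python
def articles_needed(annual_labor_cost, target_cost_per_article):
    """Yearly volume at which a fixed labor cost reaches the target cost per article."""
    return annual_labor_cost / target_cost_per_article


def cost_per_handled_article(cost_per_published_article, acceptance_rate):
    """Spread the editorial cost over every article handled, accepted or rejected."""
    return cost_per_published_article * acceptance_rate


# A labor cost of $100,000 per year and a target of $1 per article.
print(f"{articles_needed(100_000, 1):,.0f} articles per year")

# A journal with a 50% acceptance rate and $1000 per published article.
print(f"${cost_per_handled_article(1_000, 0.5):,.0f} per article handled")
```

The second function shows only the nominal factor-of-2 reduction. It says nothing about how editorial time is actually divided between accepted and rejected articles, which is the more important point made in the text.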

It should also be noted that for the most part the current peer review system has itself escaped a systematic assessment. Despite its widespread use, and the widespread dependence on it both for publication and for grant allocation, much of the evidence for its efficacy remains largely anecdotal. In the Health Sciences, recent studies suggest that conventional editorial peer review can be "expensive, slow, subjective and biased, open to abuse, patchy at detecting important methodological defects, and almost useless at detecting fraud or misconduct" [4]. While it can improve the quality of those articles that do eventually get published, studies suggest that a competent lone editor can perform as well or better. Peer review is by no means a monolithic practice, however, and the Health Sciences differ from, say, Physics in a number of potentially crucial respects. The journals in the former discipline frequently have much lower acceptance rates, as low as 10%, and provide a conduit for a small number of researchers to speak to a much larger number of clinicians. In Physics and closely related disciplines, by contrast, the acceptance rates are typically higher, and the author and reader communities essentially coincide (and hence the referee community as well is composed of the same set of researchers). It will consequently be very valuable to assess peer review more systematically in other disciplines, to determine whether it is for them as well "a process with so many flaws that it is only the lack of an obvious alternative that keeps the process going" [4].

Another corollary of the above observation concerning volume is that a physically distributed set of repositories, even if seamlessly aggregated via some interoperability protocol, is not likely to be as cost-efficient as a centralized one. The argument again is that any manual labor involved will bring in some fraction of an employee at a cost of a few tens of thousands of dollars per year, and then a few tens of thousands of articles per year would be required to get the cost down to the few dollars per article range. But volume in that range is closer to the world output for a given discipline, not to the output of a single department. Distributed archiving, even if not as cost-efficient, could of course have other advantages, including redundancy and non-centralized control. In addition, such "in-sourcing" of research communication infrastructure could also make more effective use of existing support labor resources than does the current system.

The near future?

Currently, the research literature continues to owe its structure to the editorial work funded by publisher revenues to organize peer review. The latter of course depends on the donated time and energy of the research community, and is subsidized by the same grant funds and institutions that sponsor the research in the first place. The question crystallized by the new communications medium is whether this arrangement remains the most efficient way to organize the review and certification functions, or if the dissemination and authentication systems can be naturally disentangled to create a more forward-looking research communications infrastructure.

Figure 2. Possible hierarchical structuring of research communications infrastructure.

Figure 2 is meant to illustrate one such possible hierarchical structuring of our research communications infrastructure. At left it depicts three electronic service layers, and at right the eyeball of the interested reader/researcher is given the choice of most auspicious access method for navigating the electronic literature. The three layers, depicted in blue, green, and red, are respectively the data, information, and "knowledge" networks (where "information" is usually taken to mean data + metadata [i.e. descriptive data], and "knowledge" here signifies information + synthesis [i.e. additional synthesizing information]). The figure also represents graphically the key possibility of disentangling and decoupling the production and dissemination on the one hand from the quality control and validation on the other (as was not possible in the paper realm).

At the data level, the figure suggests a small number of potentially representative providers, including the e-print arXiv (and implicitly its international mirror network), a university library system (CDL = California Digital Library eScholarship project), and a typical foreign funding agency (the French CNRS = Centre National de la Recherche Scientifique CCSD project). These are intended to convey the likely importance of library and international components. Note that there already exist cooperative agreements with each of these to coordinate via the "open archives" protocols (http://www.openarchives.org/) to facilitate aggregate distributed collections.

Representing the information level, the Figure shows a generic public search engine (Google), a generic commercial indexer (ISI = Institute for Scientific Information), and a generic government resource (the PubScience initiative at the DOE), suggesting a mixture of free, commercial, and publicly funded resources at this level. For the biomedical audience at hand, I might have included services like Chemical Abstracts and PubMed at this level. A service such as GenBank is a hybrid in this setting, with components at both the data and information layers. The proposed role of PubMedCentral would be to fill the electronic gaps in the data layer highlighted by the more complete PubMed metadata.

At the "knowledge" layer, the Figure shows a tiny set of existing Physics publishers (APS = American Physical Society, JHEP = Journal of High Energy Physics, and ATMP = Advances in Theoretical and Mathematical Physics; the second is based in Italy and the third already uses the arXiv entirely for its electronic dissemination); and BMC (= BioMedCentral) should also have been included at this level. These are the third parties that can overlay additional synthesizing information on top of the information and data levels, and partition the information into sectors according to subject area, overall importance, quality of research, degree of pedagogy, interdisciplinarity, or other useful criteria; and can maintain other useful retrospective resources (such as suggesting a minimal path through the literature to understand a given article, and suggesting pointers to outstanding lines of research later spawned by it). The synthesizing information in the knowledge layer is the glue that assembles the building blocks from the lower layers into a knowledge structure more accessible to both experts and non-experts.

The three layers depicted are multiply interconnected. The green arrows indicate that the information layer can harvest and index metadata from the data layer to generate an aggregation which can in turn span more than one particular archive or discipline. The red arrows suggest that the knowledge layer points to useful resources in the information layer. As mentioned above, the knowledge layer in principle provides much more information than that contained in just the author-provided "data": e.g. retrospective commentaries, etc. The blue arrows -- critical here -- represent how journals of the future can exist in an "overlay" form, i.e. as a set of pointers to selected entries at the data level. Abstracted, that is the current primary role of journals: to select and certify specific subsets of the literature for the benefit of the reader. A heterodox point that arises in this model is that a given article at the data level can be pointed to by multiple such virtual journals, insofar as they're trying to provide a useful guide to the reader. (Such multiple appearance would no longer waste space on library shelves, nor be viewed as dishonest.) This could tend to reduce the overall article flux and any tendency on the part of authors towards "least publishable units". The future author could thereby be promoted on the basis of quality rather than quantity: instead of 25 articles on a given subject, the author can point to a single critical article that "appears" in 25 different journals.
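The "overlay" idea described above can be pictured as a very small data structure. In this sketch the archive holds the articles, and each journal is nothing more than a set of pointers into it. All identifiers, titles, and journal names are invented for illustration and do not refer to real records or publications.

```python
from collections import defaultdict

# Data layer: archive records keyed by identifier (made-up identifiers and titles).
archive = {
    "example/0001": "A key result",
    "example/0002": "A follow-up note",
    "example/0003": "A pedagogical review",
}

# Knowledge layer: each overlay journal is only a set of pointers into the archive.
overlay_journals = {
    "Journal A": {"example/0001", "example/0003"},
    "Journal B": {"example/0001"},
    "Journal C": {"example/0001", "example/0002"},
}


def journals_pointing_to(journals):
    """Invert the journal -> article pointers into an article -> journals index."""
    index = defaultdict(set)
    for name, pointers in journals.items():
        for article_id in pointers:
            index[article_id].add(name)
    return index


appearances = journals_pointing_to(overlay_journals)
for article_id in sorted(archive):
    print(article_id, sorted(appearances[article_id]))
```

Because the journals store pointers rather than copies, a single archived article can "appear" in several journals at no additional storage cost, which is the heterodox point made in the text.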

Finally, the black arrows suggest how the reader might best proceed for any given application: either trolling for gems directly from the data level (as many graduate students are occasionally wont, hoping to find a key insight missed by the mainstream), or instead beginning the quest at the information or knowledge levels, in order to benefit from some form of pre-filtering or other pre-organization. The reader most in need of a structured guide would turn directly to the highest level of "value-added" provided by the "knowledge" network. This is where capitalism should return to the fore: researchers can and should be willing to pay a fair market value for services provided at the information or knowledge levels that facilitate and enhance the research experience. For reasons detailed above, however, we expect that access at the raw data level can be provided without charge to readers. In the future this raw access can be further assisted not only by full text search engines but also by automatically generated reference and citation linking. The experience from the Physics e-print archives is that this raw access is extremely useful to research, and the small admixture of noise from an unrefereed sector has not constituted a major problem. (Research in science has certain well-defined checks and balances, and is ordinarily pursued by certain well-defined communities.)

Ultimately, issues regarding the correct configuration of electronic research infrastructure will be decided experimentally, and it will be edifying to watch the evolving roles of the current participants. Some remain very attached to the status quo, as evidenced by responses to successive forms of the PubMedCentral proposal from professional societies and other agencies, ostensibly acting on behalf of researchers but sometimes disappointingly unable to recognize or consider potential benefits to them. (Media accounts have been equally telling and disappointing in giving more attention to the "controversy" between opposing viewpoints than to a substantive accounting of the proposed benefits to researchers, and to taxpayers.) It is also useful to bear in mind that much of the entrenched current methodology is largely a post World War II construct, including both the large-scale entry of commercial publishers and the widespread use of peer review for mass production quality control (neither necessary to, nor a guarantee of, good science). Ironically, the new technology may allow the traditional players from a century ago, namely the professional societies and institutional libraries, to return to their dominant role in support of the research enterprise.

The original objective of the e-print arXiv was to provide functionality that was not otherwise available, and to provide a level playing field for researchers at different academic levels and different geographic locations -- the dramatic reduction in cost of dissemination came as an unexpected bonus. (The typical researcher is entirely unaware and sometimes quite upset to learn that the average article generates many thousands of dollars in publisher revenues.) As Andy Grove of Intel has pointed out [5], when a critical business element is changed by a factor of 10, it is necessary to rethink the entire enterprise. The e-print arXiv suggests that dissemination costs can be lowered by more than two orders of magnitude, not just one.

But regardless of how different research areas move into the future (perhaps by some parallel and ultimately convergent evolutionary paths), and independent of whether they also employ "pre-refereed" sectors in their data space, on the one- to two-decade time scale it is likely that other research communities will also have moved to some form of global unified archive system without the current partitioning and access restrictions familiar from the paper medium, for the simple reason that it is the best way to communicate knowledge and hence to create new knowledge.

References

[1] P. Ginsparg, "Winners and Losers in the Global Research Village", Electronic Publishing in Science, at UNESCO HQ, Paris, 1996 (eds. Dennis Shaw and Howard Moore), copy at http://arXiv.org/blurb/pg96unesco.html

[2] A. Odlyzko, "Tragic loss or good riddance? The impending demise of traditional scholarly journals," Intern. J. Human-Computer Studies (formerly Intern. J. Man-Machine Studies) 42 (1995), pp. 71-122, and in the electronic J. Univ. Comp. Sci., pilot issue, 1994.
A. Odlyzko, "Competition and cooperation: Libraries and publishers in the transition to electronic scholarly journals," Journal of Electronic Publishing 4(4) (June 1999), and in J. Scholarly Publishing 30(4) (July 1999), pp. 163-185.
Articles also available at http://www.research.att.com/~amo/doc/eworld.html.

[3] S. Bachrach et al., "Who Should 'Own' Scientific Papers?", Science, Volume 281, Number 5382, Issue of 4 Sep 1998, pp. 1459-1460.
See also "Bits of Power: Issues in Global Access to Scientific Data", by the Committee on Issues in the Transborder Flow of Scientific Data; U.S. National Committee for CODATA; Commission on Physical Sciences, Mathematics, and Applications; and the National Research Council; National Academy Press (1997).

[4] "Peer Review in Health Sciences," Ed. by Fiona Godlee and Tom Jefferson, BMJ Books, 1999.