Electronic Clones vs. the Global Research Archive
by Paul Ginsparg, Los Alamos National Laboratory
Our research communications infrastructure is evolving
rapidly to take advantage of newly emerging electronic resources. The question
before us is how to reconfigure that infrastructure to maximize the advantages
electronic communications offer. Instead of developing a new system of
"electronic publishing," which connotes a straightforward cloning of the paper
method to the electronic network, many researchers would prefer to see the new
technology lead soon to a form of global "knowledge network."
Some of the possibilities offered by a unified global
archive are suggested by the Los Alamos e-print arXiv (where "e-print" denotes
self-archiving by the author).
Since its inception in 1991 it has become a major forum for dissemination of
results in physics and mathematics. These e-print archives are entirely
scientist-driven and are flexible enough either to co-exist with the
pre-existing publication system, or to help it change to something better
optimized for each researcher. The arXiv is an example of a service created by
a group of specialists for their own use: when researchers or professionals
create such services, the results often differ markedly from the services
provided by publishers and libraries. It is also important to note that the
rapid dissemination they provide is not inconsistent with concurrent or post
facto peer review; in fact they offer a possible framework for a more
functional archival structuring of the literature than is provided by current
peer review processes.
As argued by Odlyzko [1], the current method of research dissemination and
validation is
premised on a paper medium that once was difficult to produce, difficult to
distribute, difficult to archive, and difficult to duplicate—a medium requiring
numerous local redistribution points in the form of research libraries. The
electronic medium is opposite in each of the above regards. If we were to start
from scratch today to design a quality-controlled distribution system for
research findings, it likely would take a very different form both from the
current system and from the electronic clone it might spawn without more
constructive input from the research community.
Reinforcing the need to reconsider the current method
is the fact that the research behind each article typically costs many tens of
thousands of dollars (minimum) in salary, and much more in equipment, overhead,
etc. A key aspect of
the electronic communication medium is that for a minuscule additional fraction
of this amount, it is possible to archive the article and make it freely
available to the entire world in perpetuity. This is, moreover, consistent with
the public policy goals [2] for what largely is publicly funded research. The
nine-year lesson from the Los
Alamos archives is that this additional cost, including the global mirror
network, can be as little as one dollar per article, and there is no indication
that maintenance of the archival portion of the database will require an
increasing fraction of the time, cost, or effort.
Odlyzko [1] also has pointed out that average aggregate publisher revenues are
roughly
$4000 per article, and that since acquisition costs are typically one third of
library budgets, the current system expends an additional $8000 per article in
other library costs. Of course, some of the publisher revenues are necessary to
organize peer review, though the latter depends on the donated time and energy
of the research community and is subsidized by the same grant funds and
institutions that sponsor the research in the first place. The question
crystallized by the new communications medium is whether this arrangement
remains the most efficient way to organize the review and certification
functions, or if the dissemination and authentication systems can be
disentangled to create a more forward-looking research communications
infrastructure.
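
The arithmetic behind the $8000 figure can be sketched as follows. This is only
an illustration of the reasoning above, under the assumption (implied but not
stated in the text) that library acquisition spending per article equals the
publisher revenue per article; the function and variable names are
hypothetical.

```python
from fractions import Fraction

# Figures quoted in the text from Odlyzko [1].
PUBLISHER_REVENUE_PER_ARTICLE = 4000   # dollars
ACQUISITION_SHARE = Fraction(1, 3)     # acquisitions as a share of library budgets


def library_cost_breakdown(revenue, acquisition_share):
    """Return (total library spending, non-acquisition spending) per article.

    Assumption for illustration: acquisition spending per article equals
    publisher revenue per article.
    """
    total = revenue / acquisition_share   # acquisitions are one third of the total
    other = total - revenue               # everything that is not acquisitions
    return total, other


total_cost, other_cost = library_cost_breakdown(
    PUBLISHER_REVENUE_PER_ARTICLE, ACQUISITION_SHARE
)
print(f"total library spending per article: ${total_cost}")
print(f"other library costs per article:    ${other_cost}")
```

Exact fractions are used instead of floating point so that the one-third share
does not introduce rounding noise into the result.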

FIGURE 1
Figure 1 is meant to illustrate one such possible
hierarchical structuring of our research communications infrastructure. At left
it depicts three electronic service layers, and at right the eyeball of the
interested reader/researcher may choose the most auspicious access method for
navigating the electronic literature. The three layers, depicted in blue,
green, and red, are respectively the data, information, and "knowledge"
networks (where "information" is usually taken to mean data + metadata [i.e.,
descriptive data], and "knowledge" here signifies information + synthesis
[i.e., additional synthesizing information]). The figure also represents
graphically the possibility of disentangling and decoupling the production and
dissemination from the quality control and validation (as was not possible in
the paper realm).
At the data level, the Figure suggests a small number
of potentially representative providers, including the Los Alamos e-print arXiv
(and implicitly its international mirror network), a university library system
(CDL = California Digital Library), and a typical foreign funding agency (the
French CNRS = Centre National de la Recherche Scientifique). These are intended to
convey the likely importance of library and international components. Note that
there already exist cooperative agreements with each of these components to
coordinate their efforts to facilitate aggregate distributed collections via
the "open archives" protocols (http://www.openarchives.org/).
Representing the information level, the Figure shows a
generic public search engine (Google), a generic commercial indexer (ISI =
Institute for Scientific Information), and a generic government resource (the
PubScience initiative at the Department of Energy), suggesting a
mixture of free, commercial, and publicly funded resources at this level. For a
biomedical audience the Figure might include services like Chemical Abstracts
and PubMed at this level. A service such as GenBank is a hybrid in this
setting, with components at both the data and information levels. The proposed
role of PubMedCentral would be to fill the electronic gaps in the data layer
highlighted by the more complete PubMed metadata.
At the "knowledge" level, the Figure shows a tiny set
of existing physics publishers (APS = American Physical Society, JHEP = Journal
of High Energy Physics, and ATMP = Advances in Theoretical and Mathematical
Physics; the second is based in Italy and the third already uses the arXiv
entirely for its electronic dissemination); BMC (BioMedCentral) also is
included at this level. These are the third
parties that can overlay additional synthesizing information on top of the
information and data levels; can partition the information into sectors
according to subject area, overall importance, quality of research, degree of
pedagogy, interdisciplinarity, or other useful criteria; and can maintain other
useful retrospective resources (such as suggesting a minimal path through the
literature to understand a given article, and suggesting pointers to later
lines of research spawned by it). The synthesizing information in the knowledge
layer is the glue that assembles the building blocks from the lower layers into
a knowledge structure more accessible to both experts and non-experts.
The three layers depicted are multiply interconnected.
The green arrows indicate that the information layer can harvest and index
metadata from the data layer to generate an aggregation which can in turn span
more than one particular archive or discipline. The red arrows suggest that the
knowledge layer points to useful resources in the information layer. As
mentioned above, the knowledge layer in principle provides much more
information than that contained in just the author-provided "data": e.g.,
retrospective commentaries, etc. The blue arrows—critical here—represent how
journals of the future can exist in an "overlay" form, i.e., as a set of
pointers to selected entries at the data level. Abstracted, that is the current
primary role of journals: to select and certify specific subsets of the
literature for the benefit of the reader. A heterodox point that arises in this
model is that a given article at the data level can be pointed to by multiple
such virtual journals, insofar as each is trying to provide a useful guide for
the reader. (Such multiple appearances would no longer waste space on library
shelves, nor would they be viewed as dishonest.) This could tend to reduce the
overall article flux and any tendency on the part of authors towards "least
publishable units." In the future, authors could be promoted on the basis of
quality rather than quantity: instead of 25 articles on a given subject, an
author can point to a single critical article that "appears" in 25 different
journals.
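
The layering described above can be made concrete with a small in-memory
sketch. It is not a description of how the arXiv or any existing journal is
implemented: the archive names, identifiers, and journal names are all
hypothetical, and the "harvest" step only stands in for a metadata-harvesting
protocol such as the open archives one mentioned earlier. The point it
illustrates is that an overlay journal need hold nothing but pointers, so one
article can "appear" in several journals without being duplicated.

```python
from collections import defaultdict

# Data layer: archives holding author-deposited records (hypothetical entries).
DATA_LAYER = {
    "archive-a": {
        "a/0001": {"title": "Paper one", "subject": "physics"},
        "a/0002": {"title": "Paper two", "subject": "mathematics"},
    },
    "archive-b": {
        "b/0001": {"title": "Paper three", "subject": "physics"},
    },
}


def harvest(data_layer):
    """Information layer: aggregate metadata from every archive into one index."""
    index = {}
    for archive, records in data_layer.items():
        for identifier, metadata in records.items():
            index[identifier] = {"archive": archive, **metadata}
    return index


# Knowledge layer: each overlay journal is only a set of pointers to records.
OVERLAY_JOURNALS = {
    "journal-x": {"a/0001", "b/0001"},
    "journal-y": {"a/0001"},
}


def appearances(journals):
    """Map each article identifier to the set of journals pointing to it."""
    pointed_by = defaultdict(set)
    for journal, pointers in journals.items():
        for identifier in pointers:
            pointed_by[identifier].add(journal)
    return dict(pointed_by)


index = harvest(DATA_LAYER)
pointed_by = appearances(OVERLAY_JOURNALS)

for identifier in sorted(index):
    journals = sorted(pointed_by.get(identifier, set()))
    print(identifier, index[identifier]["title"], journals)
```

Here the article `a/0001` is stored once at the data level yet is selected by
two journals, while `a/0002` remains available in the raw archive without
having been selected by any.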
Finally, the black arrows suggest how the reader might
best proceed for any given application: either trolling for gems directly from
the data level (as many graduate students occasionally do, hoping to find a key
insight missed by the mainstream), or beginning the quest at the information or
knowledge levels in order to benefit from some form of pre-filtering or other
pre-organization. The reader most in need of a structured guide would turn
directly to the highest level of "value-added" resources provided by the
"knowledge" network. This is where capitalism should return to the fore:
researchers can and should be willing to pay a fair market value for services
provided at the information or knowledge levels that facilitate and enhance the
research experience. For reasons detailed above, however, we expect that access
at the raw data level can be provided without charge to readers. In the future
this raw access can be assisted further not only by full-text search engines
but also by automatically generated reference and citation linking. The
experience from the physics e-print archives is that this raw access is
extremely useful to research, and the small admixture of noise from an
unrefereed sector has not constituted a major problem. (Research in science has
certain well-defined checks and balances and ordinarily is pursued by certain
well-defined communities.)
Ultimately, issues regarding the correct configuration
of electronic research infrastructure will be decided experimentally, and it
will be edifying to watch the evolving roles of the current participants. Some
remain very attached to the status quo, as seen by responses to successive
forms of the PubMedCentral proposal from professional societies and other
agencies, ostensibly acting on behalf of researchers but sometimes
disappointingly unable to recognize or consider potential benefits to them.
(Media accounts have been equally telling and disappointing in giving more
attention to the "controversy" between opposing viewpoints than to a
substantive assessment of the proposed benefits to researchers and taxpayers.)
It is also useful to bear in mind that much of the entrenched current method is
a post-World War II construct, including the large-scale entry of commercial
publishers and the widespread use of peer review for mass production quality
control (neither necessary to, nor a guarantee of, good science). Ironically,
the new technology may allow the traditional players from a century ago, namely
the professional societies and institutional libraries, to return to their
dominant role in supporting the research enterprise.
The original objectives of the Los Alamos archives
were to provide functionality that was not otherwise available and to provide a
level playing field for researchers at different academic levels and different
geographic locations. The dramatic reduction in cost of dissemination came as
an unexpected bonus. (The typical researcher is entirely unaware, and sometimes
quite upset to learn, that the average article generates a few thousand dollars
in publisher revenues.) As Andy Grove of Intel has pointed out [3], when a
critical business element is changed by a factor of ten, it
is necessary to rethink the entire enterprise. The Los Alamos e-print archives
suggest that dissemination costs can be lowered by more than two orders of
magnitude, not one.
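
As a rough check on the "orders of magnitude" claim, the two per-article
figures quoted earlier can be compared directly. This is a sketch under a
loudly stated assumption: the $4000 publisher revenue is used as a stand-in for
the per-article cost of the current system, even though that revenue also pays
for organizing peer review, so the result is an illustrative upper bound rather
than a measured saving. The names below are hypothetical.

```python
import math

# Per-article figures quoted in the text; using revenue as a proxy for
# dissemination cost is an assumption made only for this illustration.
REVENUE_PER_ARTICLE = 4000       # dollars, publisher revenue quoted from [1]
ARCHIVE_COST_PER_ARTICLE = 1     # dollars, Los Alamos archive figure


def orders_of_magnitude(before, after):
    """Base-10 logarithm of the reduction factor before/after."""
    return math.log10(before / after)


reduction = orders_of_magnitude(REVENUE_PER_ARTICLE, ARCHIVE_COST_PER_ARTICLE)
print(f"{reduction:.1f} orders of magnitude")   # prints "3.6 orders of magnitude"
```

Even if only a small part of the revenue figure is attributed to dissemination,
the ratio stays well above the factor of ten in Grove's rule of thumb.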
But regardless of how different research areas move in
the future (by either parallel or convergent evolutionary paths), and
independent of whether they also employ "pre-refereed" sectors in their data
space, within one or two decades, it is likely that other research communities
will also move to some form of global unified archive system without the
current partitioning and access restrictions familiar to the paper medium,
because such an archive is the best way to communicate knowledge and, hence, to
create new knowledge.
Footnotes:
1. A. Odlyzko, "Tragic Loss Or Good Riddance? The Impending Demise of
Traditional Scholarly Journals," Intern. J. Human-Computer Studies (formerly
Intern. J. Man-Machine Studies) 42 (1995), pp. 71-122, and the electronic J.
Univ. Comp. Sci. pilot issue, 1994; A. Odlyzko, "Competition and Cooperation:
Libraries and Publishers in the Transition to Electronic Scholarly Journals,"
J. Electronic Publishing 4:4 (June 1999) and J. Scholarly Publishing 30:4
(July 1999), pp. 163-85; articles also available at
http://www.research.att.com/~amo/doc/eworld.html.
2. S. Bachrach et al., "Who Should Own Scientific Papers?", Science 281:5382
(4 Sept. 1998), pp. 1459-60. See also "Bits of Power: Issues in Global Access
to Scientific Data," the Committee on Issues in the Transborder Flow of
Scientific Data; U.S. National Committee for CODATA; Commission on Physical
Sciences, Mathematics, and Applications; and the National Research Council
(National Academy Press, 1997).
3. Andy Grove, "Only the Paranoid Survive: How to Exploit the Crisis Points
That Challenge Every Company and Career," Bantam Doubleday Dell, 1996 (as
cited in A. Odlyzko, "The Economics of Electronic Journals," First Monday 2:8
(Aug. 1997) and J. Electronic Publishing 4:1 (Sept. 1998); definitive version
on pp. 380-93 in Technology and Scholarly Communication, R. Ekman and R. E.
Quandt, eds., Univ. Calif. Press, 1999).