digital archive. A fortiori, the model also states that implementations will vary depending
upon the needs of an archival community.
Research conducted during the planning year has identified four different approaches to
preservation: emulation, migration, hard copy, and computer museums. These different
approaches should not, however, be viewed as mutually exclusive. They can be used in
conjunction with each other. Additionally, because we are pursuing the present study not simply as an academic exercise but rather as a very practical investigation of what it will take to build, operate, and maintain a fully functioning production-mode digital archive, we cannot discount the financial implications of the different approaches or the impact of standards on the choice of approach.
In choosing our approach, we quickly discounted both hard copy and computer museums.
Hard copy decays over time and multimedia objects cannot be printed. Computer
museums were also discounted as impractical. To function as an archive, computer
museum equipment would need to be operational, not simply a collection of static
exhibits. In turn, operation would lead to inevitable wear and tear on the equipment, with
the consequential need for maintenance and repair work. When considering the repair of
"antique" computer equipment, one has to ask about the source of spare parts — do all of
them have to be hand-made at enormous expense? Even if money were available for such
expensive work, does the "antique" equipment come with adequate diagnostic and testing
equipment, wiring diagrams, component specifications, and the like, which would make
the "museum" choice technically feasible in the first place? Even if the antique equipment
were to receive only the most minimal usage, the silicon chips would deteriorate over
time. In such a museum environment, one would question whether we would be in danger
of losing our focus, ending up as a living history-of-computers museum rather than an
archive of digital materials.
Rejecting the hard copy and museum options left the team with two very different approaches
to the storage of content in an OAIS archive. One approach to content preservation is to
store objects based upon emerging standards such as XML and then migrate them to new
formats as new paradigms emerge. The other approach, advanced by and closely associated with its chief proponent, Jeff Rothenberg, is to preserve content through emulation.
Both approaches have potential and value, but probably to different subcultures in the
archival community. A goal of standards is to preserve the essential meaning or argument
contained in the digital object. Archives that are charged with the responsibility of
preserving text-based objects such as e-journals are likely to adopt a migratory approach.
Archives that need to preserve an exact replication or clone of the digital objects may
choose, in the future, to deploy emulation as an archival approach. Both approaches are
rooted in the use of standards. Contrary to an argument advanced in a research paper by
Rothenberg,[5] YEA team members maintain that standards, despite their flaws,
represent an essential component of any coherent preservation strategy.
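The migratory approach can be made concrete with a minimal sketch (in Python). This is not the YEA design; the format names and converter functions below are invented placeholders, intended only to show how a stored object might be walked forward through a chain of format conversions as standards evolve.

MIGRATIONS = {
    # source format -> (successor format, converter function); all hypothetical
    "article-sgml": ("article-xml-1.0", lambda text: "<article>" + text + "</article>"),
    "article-xml-1.0": ("article-xml-2.0", lambda text: text),  # schema unchanged in this step
}

def migrate(text, fmt):
    """Apply registered conversions until the object reaches the newest known format."""
    while fmt in MIGRATIONS:
        successor, convert = MIGRATIONS[fmt]
        text, fmt = convert(text), successor
    return text, fmt

content, current_format = migrate("Preserved article body", "article-sgml")
print(current_format)  # article-xml-2.0
print(content)         # <article>Preserved article body</article>

The essential design point the sketch illustrates is that migration is an ongoing institutional commitment: each time a format is superseded, a new conversion must be written, tested, and applied across the holdings.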
Rothenberg, for his part, criticized the use of migration as an approach to digital longevity, yet he makes an insightful and enterprising case for the practice of emulation as the "true" answer to the digital longevity problem. The "central idea of the approach," he
writes, "is to enable the emulation of obsolete systems on future, unknown systems, so
that a digital document's original software can be run in the future despite being
obsolete." Rothenberg avers that only by preservation of a digital object's context — or,
simply stated, an object's original hardware and software environment — can the object's
originality (look, feel, and meaning) be protected and preserved from technological decay
and software dependency.
The foundation of this approach rests on hardware emulation, which is a common
practice in the field of data processing. Rothenberg logically argues that once a hardware
system is emulated, all else just naturally follows. The operating system designed to run
on that hardware works, and the software applications written for the operating system also work. Consequently, the digital object behaves and interacts with the
software as originally designed.
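The layering Rothenberg describes can be illustrated with a deliberately toy example: reproduce the instruction set faithfully, and the software written for it runs unchanged. The sketch below (in Python) emulates an invented three-instruction machine and is purely illustrative; it does not correspond to any real hardware or to any emulator built by the project.

def run(program, memory):
    """Execute a program written for a made-up three-instruction machine."""
    pc = 0  # program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":     # LOAD address, value
            memory[args[0]] = args[1]
        elif op == "ADD":    # ADD destination, source
            memory[args[0]] += memory[args[1]]
        elif op == "PRINT":  # PRINT address
            print(memory[args[0]])
        pc += 1
    return memory

# "Original software" written for the obsolete machine runs unmodified, provided
# the emulator reproduces the instruction set exactly.
legacy_program = [("LOAD", "a", 2), ("LOAD", "b", 3), ("ADD", "a", "b"), ("PRINT", "a")]
run(legacy_program, {})  # prints 5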
However, emulation cannot escape standards. Processors and peripherals are designed
with the use of standards. If the manufacturer of a piece of hardware did not adhere 100
percent to the standard, then the emulation will reflect that imperfection or flaw.
Consequently, there is no guarantee, contrary to Rothenberg's suggestion, that a fully generalized specification for an emulator of a hardware platform can be constructed. In
the data processing trenches, system programmers are well acquainted with the
imperfections and problems of emulation. For example, the IBM operating system MVS
never ran without problems under IBM's VM operating system. It was a good emulation
but it was not perfect. Another major practical problem with emulation is its cost: the specification, development, and testing of an emulator require
large amounts of very sophisticated and expensive resources.
At this stage, the YEA team believes the most productive line of research is a migratory
approach based upon standards. Standards development must, therefore, feature front and
center in the next phase of e-journal archiving activities. If one listens closely to
academic discourse, the most seductive adverb of all is one not found in a dictionary; it is
spelled "just" and pronounced "jist" and is heard repeatedly in optimistic and transparent
schemes for making the world a better place. If scientists would "jist" insist on
contributing to publishing venues with the appropriate high-minded standards of broad
access, we would all be better off. If users would "jist" insist on using open source
operating systems like Linux, we would all be better off. If libraries would "jist" spend
more money on acquisitions, we would all be better off.
Many of those propositions are undoubtedly true, but the adverb is their Achilles' heel. In
each case the "jist" masks the crucial point of difficulty, the sticking point to movement.
To identify those sticking points reliably is the first step to progress in any realistic plan
for action. In some cases, the plan for action itself is arguably a good one, but building
the consensus and the commonality is the difficulty; in other cases, the plan of action is
fatally flawed because the "jist" masks not merely a difficulty but an impossibility.
It would be a comparatively easy thing to design, for any given journal and any given
publisher, a reliable system of digital information architecture and a plan for preservation
that would be absolutely bulletproof — as long as the other players in the system would
"jist" accept the self-evident virtue of the system proposed. Unfortunately, the acceptance
of self-evident virtue is a practice far less widely emulated than one could wish.
It is fundamental to the intention of a project such as the YEA that the product — the
preserved artifact — be as independent of mischance and the need for special supervising
providence as possible. That means that, like it or not, YEA and all other seriously
aspiring archives must work in an environment of hardware, software, and information
architecture that is as collaboratively developed and as broadly supported as possible, as
open and inviting to other participants as possible, and as likely to have a clear migration
forward into the future as possible.
The lesson is simple: standards mean durability. Adhering to commonly and widely
recognized data standards will create records in a form that lends itself to adaptation as
technologies change. Best of all is to identify standards that are in the course of
emerging, i.e., that appear powerful at the present moment and are likely to have a strong
future in front of them. Identifying those standards carries an element of risk, since we may choose a version that turns out to have no future, but at the moment some of the choices seem fairly clear.
Standards point not only to the future but also to the present in another way. The well-
chosen standard positions itself at a crossroads, from which multiple paths of data
transformation radiate. The right standards are the ones that allow transformation into as
many forms as present and foreseeable users could wish. Thus PDF, though widely used, is a less desirable standard because it does not convert readily into structured text. The XML suite of technology standards is the most desirable because it is portable, extensible, and transformable: it can generate everything from ASCII to HTML to PDF and beyond.
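As a small illustration of that transformability, the sketch below (in Python, using only the standard library) turns a single structured XML record into both plain text and HTML. The element names are invented for the example and do not represent a real e-journal DTD or schema.

import xml.etree.ElementTree as ET

record = ET.fromstring(
    "<article>"
    "<title>On Digital Longevity</title>"
    "<author>A. Scholar</author>"
    "<abstract>Standards mean durability.</abstract>"
    "</article>"
)

def to_plain_text(article):
    """Render the record as plain text."""
    return "\n".join(element.text for element in article)

def to_html(article):
    """Render the same record as a simple HTML page."""
    return ("<html><body><h1>" + article.findtext("title") + "</h1>"
            "<p><em>" + article.findtext("author") + "</em></p>"
            "<p>" + article.findtext("abstract") + "</p></body></html>")

print(to_plain_text(record))
print(to_html(record))

The same record, held once in a well-chosen structured form, can thus be delivered in whatever presentation a future user requires, which is precisely the "crossroads" property described above.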
Plan of Work
The Project Manager chart describes the planning project's working efforts during the
year, and it highlights certain key events: http://www.diglib.org/preserve/2bplan.pdf.
Part II: Lines of Inquiry
Trigger Events
Makers of an archive need to be very explicit about one question: what is the archive for?
The correct answer to that question is not a large idealistic answer about assuring the
future of science and culture but a practical one: when and how and for what purpose will
this archive be put to use? Any ongoing daily source needs to be backed up reliably,
somewhere away from the risks of the live server, and that backup copy becomes the de
facto archive and the basis for serious preservation activities.
Types of Archives
The team discovered during the course of its explorations that there is no single type of
archive. While it is true that all or most digital archives might share a common mission,
i.e., the provision of permanent access to content, as we noted in our original proposal to
the Mellon Foundation, "This simple truth grows immensely complicated when one
acknowledges that such access is also the basis of the publishers' business and that, in the
digital arena (unlike the print arena), the archival agent owns nothing that it may preserve
and cannot control the terms on which access to preserved information is provided."
In beginning to think about triggers, business models, and sustainability, the project team
modeled three kinds of archival agents. The first two types are a de facto archival agent, defined as a library or consortium holding a current license to load all of a publisher's journals locally, and a self-designated archival agent. Both of these types rest on commercial transactions, even though they do not conduct their business in the same ways or necessarily to meet the same missions. The third type, and the focus of our investigation, is a publisher-archival agent partnership. Whether this type can now
be brought into existence turns on the business viability of an archive that is not heavily
accessed. Project participants varied in their views about whether an archive with an as
yet uncertain mission can be created and sustained over time and whether, if created, an
individual library such as Yale or a wide-reaching library enterprise like OCLC would be
the more likely archival partner.
Accessing the Archive
So when does one access the archive? Or does one ever access it? If the archive is never
to be accessed (until, say, material passes into the public domain, which currently in the
United States occurs seventy years after the death of the author), then the
incentives for building it diminish greatly, or at least the cost per use becomes infinite.
There is talk these days of "dark" archives, that is, collections of data intended for no use
but only for preservation in the abstract. Such a "dark" archive concept is at the least
risky and in the end possibly absurd.
Planning for access to the e-archive requires two elements. The less clearly defined at the
present is the technical manner of opening and reading the archive, for this will depend
on the state of technology at the point of need. The more clearly defined, however, will
be what we have chosen to call "trigger" events. In developing an archival arrangement
with a publisher or other rights holder, it will be necessary for the archive to specify the circumstances in which 1) the move of content to the archive will be authorized and 2) users may access the archive's content; the first is much easier to agree upon than the second. The
publisher or rights holder will naturally discourage too early or too easy authorization, for
then the archive would begin to attract traffic that should go by rights to the commercial
source. Many rights holders will also naturally resist thinking about the eventuality in
which they are involuntarily removed from the scene by corporate transformation or other
misadventure, but it is precisely such circumstances that need to be most carefully
defined.
Project participants worked and thought hard to identify conditions that could prompt a
transfer of access responsibilities from the publisher to the archival agent. These
conditions would be the key factors on which a business plan for a digital archive would
turn. The investigation began by trying to identify events that would trigger such a
transfer, but it concluded that most such events led back to questions about the
marketplace for and the life cycle of electronic information that were as yet impossible to
answer. Team members agreed that too little is known about the relatively young
business of electronic publishing to enable us now to identify definitively situations in
which it would be reasonable for publishers to transfer access responsibility to an
archival agent.
Possible Trigger Events
That said, some of the possible trigger events identified during numerous discussions by
the project team were:
Long-term physical damage to the primary source. Note that we have not imagined
the e-journal archive to serve as a temporary emergency service. We expect formal
publishers to make provision for such access. Nevertheless, in the case of a cataclysmic
event, the publisher could have an agreement with the archive that would allow the
publisher to recopy material for ongoing use.
Loss of access or abdication of responsibility for access by the rights holder or
his/her successor, or failure to identify any successor for the rights holder. In other words,
the content of the archive could be made widely available by the archive if the content is
no longer commercially available from the publisher or future owner of that content. We
should note that at this point in time, we were not easily able to imagine a situation in
which the owner or successor would not make provision precisely because in the event of
a sale or bankruptcy, content is a primary transactional asset. But that is not to say that
such situations will not occur or that the new owner might not choose to deal with the
archive as, in some way, the distributor of the previous owner's content.
Lapse of a specified period of time. That is, it could be negotiated in advance that the
archive would become the primary source after a negotiated period or "moving wall," of
the sort that JSTOR has introduced into the e-journal world's common parlance. It may be
that the "free science" movement embodied in PubMed Central or Public Library of
Science might set new norms in which scientific content is made widely available from
any and all sources after a period of time, to be chosen by the rights owner. This is a
variant on the "moving wall" model; a minimal sketch of such a rule appears after the last of these trigger events.
On-site visitors. Elsevier, our partner in this planning venture, has agreed that at the
least, its content could be made available to any onsite visitors at the archive site, and
possibly to other institutions licensing the content from the publisher. Another possibility
is provision of access to institutions that have previously licensed the content. This latter
option goes directly to development of financial and sustainability models that will be
key in Phase II.
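To make the "moving wall" variant described above concrete, the following sketch expresses it as a simple date rule. The five-year period and the function itself are hypothetical, offered only to show how mechanically such a trigger could be evaluated once a period has actually been negotiated with a publisher.

from datetime import date

MOVING_WALL_YEARS = 5  # hypothetical negotiated period, not an agreed term

def archive_access_open(publication_year, today=None):
    """Return True once an issue has passed behind the moving wall."""
    today = today or date.today()
    return today.year - publication_year >= MOVING_WALL_YEARS

print(archive_access_open(1996, today=date(2002, 6, 1)))  # True: behind the wall
print(archive_access_open(2001, today=date(2002, 6, 1)))  # False: still current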