Sound Directions
Best Practices For Audio Preservation
NAS. Currently, content is moved to the IU mass storage system as a stopgap measure
because the IU digital repository is not yet ready to ingest audio.
5.2.1.4 Local Storage Supplementing Preservation Storage at Indiana University ATM
There is another question related to local, interim storage that we addressed at the IU Archives
of Traditional Music during this project: at this point in the development of preservation
repositories, should copies be retained for a longer period of time by the unit whose holdings
are the target of preservation transfer? Typically, original and backup files are stored locally
until content is ingested into a preservation repository, at which point they are deleted. Is it
worthwhile to retain redundant copies of preserved content locally for five years, ten years,
or forever? A recent article in D-Lib Magazine provided some food for thought:
As libraries and other institutions embark on the digital preservation process, judgment
must be used to balance risk against the maturity of the process. Documents that are
extremely rare or whose loss might cause considerable financial, environmental,
or cultural disasters should not be entrusted to a relatively immature process. We
would like to say that we will preserve our cultural heritage materials in perpetuity;
however the unknown—and, furthermore, unknowable—digital landscape suggests
that any such guarantee would be inadvisable at this point.90
Digital files created from deteriorating analog sources recorded in the field will quickly
become the best—in some cases, only—copy available. Few digital repositories have been
tested by failure and there is little data available on dealing with the many different types
of threats to digital data residing in a preservation repository. These threats include media
faults, media/hardware obsolescence, software/format obsolescence, human error, loss of
metadata, malicious attack, natural disaster, and failure of organizations.91
Preservation files created for content held at the Archives of Traditional Music are destined
for long-term storage and access through a digital preservation repository currently under
development by the Indiana University Digital Library Program (DLP). This repository will
manage preserved assets from many sources at IU. Although on the same campus, the DLP
is a separate administrative unit from the ATM. These two units have worked closely on a
number of projects, establishing a solid, trusted, and mutually beneficial relationship in the
process. Even with the new preservation repository, the ATM has decided to maintain an on-site
copy of Preservation Master Files for an interim period because
• the IU digital preservation repository is still under development;
• the ATM does not yet have a service level agreement with the DLP;
• the DLP does not yet have a service level agreement with the IU Massive Data Storage
  System, which will provide underlying long-term storage;
• the preservation repository field itself is not yet mature;
• there is little experience anywhere in planning for, and recovering from, problems;
• failure modes are not yet well understood in the field due to lack of experience.
It must be noted that the IU digital preservation repository, like others around the country,
is undergoing careful development using current standards and best practices to ensure the
integrity and longevity of the data it is charged with preserving. There is much experience
90 Ronald Jantz and Michael J. Giarlo, “Architecture and Technology for Trusted Digital Repositories,” D-Lib
Magazine 11, no. 6 (June 2005), http://www.dlib.org/dlib/june05/jantz/06jantz.html.
91 Mary Baker, et al., “A Fresh Look at the Reliability of Long-term Digital Storage,” in EuroSys Proceedings:
Proceedings of the 2006 EuroSys Conference, Leuven, Belgium, April 18-21, 2006 (New York: ACM Press, 2006),
221-34. Also available online: http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf.
protecting digital data among corporate IT professionals that can be utilized. The ATM strongly
believes that a digital library repository represents the best strategy for preservation into the
future. However, we feel that there is a window of time—perhaps 5-10 years—during which
it is prudent to maintain under our direct control redundant copies of files. During this time
period we expect the digital library community to gain significant experience in developing
and managing preservation repositories so that a certified trusted digital repository is not only
possible, but has been demonstrated. It must also be noted that managing local storage to
supplement preservation requires a certain amount of both commitment and expertise to be
successful. If these are not present, the storage effort may not be usable.
The ATM considers this type of locally controlled, interim storage as last-ditch: in the face
of catastrophe, if all else fails, this one additional redundant copy of preservation files
stored outside of the preservation repository might save content. For this reason we are not
attempting to store all files from the preservation process, which would involve significantly
more time and expense, but Preservation Masters only, from which all other types of files can
be regenerated.
There are three formats often considered for temporary storage for a 5-10 year time period
that can handle the large files recorded at 24/96, and that are seen as feasible by some—
reasonably priced and technologically manageable—for archives like the ATM: DVD, hard
drives stored offline on shelves,92 and data tape. All must be managed, with a planned
migration necessary if this time window extends beyond 3-5 years, depending on the format.
We have chosen data tape using the LTO (Linear Tape–Open) format for this purpose. LTO
was introduced in 1998, developed jointly by HP, IBM, and Quantum as an “open format”
technology in the sense that users have multiple sources of tape drives and tape media. This
is due to intentional decisions by the developers to make the specification available to all
potential manufacturers for a reasonably priced IP licensing fee. We have chosen the LTO
format for the following reasons:
• Preservation problems with DVD are not yet well understood or tested; expensive
  testing software is necessary to manage any optical disc program; and more human
  intervention than we can manage is necessary to migrate from optical disc.
• Hard drives may fail sooner than data tape and may be less reliable; hard drives are
  intended to remain powered up; there are no tests on drives that are infrequently used;
  and there are reports of lost data from drives sitting on shelves for too long.
• LTO is the most “open” of the data tape formats, is supported by at least 30
  companies, and is a leader in the midrange tape drive segment of the market.
• There is a clear roadmap to the future for LTO that now extends to six generations,
  enabling informed decision-making and management. A drive can read tapes from its
  own generation and the two previous generations, and can write to tapes of its own
  generation and of the immediately prior generation (the latter in the prior
  generation's format).
• LTO supports Write Once, Read Many (WORM).
• Each LTO tape holds 400 GB, uncompressed, for LTO Generation 3.
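Two of the points above lend themselves to a quick calculation. The following Python sketch estimates how much 24/96 audio fits on one LTO Generation 3 tape and encodes the read/write compatibility rule as it is stated in the list. It is an illustration only: two-channel audio, decimal gigabytes, and all function names are assumptions that do not come from the project, and file header and metadata overhead is ignored.

```python
# Rough capacity and compatibility arithmetic for LTO tape (illustrative sketch).

# 400 GB uncompressed for LTO Generation 3, as stated in the text.
# Assumption: decimal gigabytes (10**9 bytes).
LTO3_NATIVE_BYTES = 400 * 10**9


def pcm_bytes_per_hour(sample_rate_hz=96_000, bit_depth=24, channels=2):
    """Bytes of raw PCM audio per hour, ignoring header and metadata overhead.

    Assumption: two channels (stereo) by default; the text only says "24/96".
    """
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return bytes_per_second * 3600


def hours_per_tape(tape_bytes=LTO3_NATIVE_BYTES, **audio_params):
    """Approximate hours of audio that fit on one tape."""
    return tape_bytes / pcm_bytes_per_hour(**audio_params)


def drive_can_read(drive_gen, tape_gen):
    """Read rule from the text: own generation and the two previous ones."""
    return tape_gen >= 1 and drive_gen - 2 <= tape_gen <= drive_gen


def drive_can_write(drive_gen, tape_gen):
    """Write rule from the text: own generation and the immediately prior one."""
    return tape_gen >= 1 and drive_gen - 1 <= tape_gen <= drive_gen


if __name__ == "__main__":
    print(f"Bytes per hour of stereo 24/96 audio: {pcm_bytes_per_hour():,}")
    print(f"Approximate hours per LTO-3 tape: {hours_per_tape():.1f}")
    print("Gen 3 drive reads Gen 1 tape:", drive_can_read(3, 1))
    print("Gen 3 drive writes Gen 1 tape:", drive_can_write(3, 1))
```

Under these assumptions, one hour of stereo 24/96 audio occupies about 2.07 GB, so a single Generation 3 tape holds roughly 190 hours of Preservation Master audio.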
The ATM suggests, if content is destined for a preservation repository, conducting an analysis
of the repository’s current stage of development as well as the experience of staff who manage
it. If the above study warrants, and if it is technically and economically feasible, maintain a
local, interim copy of all Preservation Master Files for a temporary period of time.
92 Very few people recommend this. A more robust solution would be a number of hard drives arranged in a
RAID array in, for example, a network attached storage device. This is considerably more expensive.
5.2.1.5 Local, Interim Storage at Harvard
At Harvard College Library’s Audio Preservation Services, all digitizing, editing, processing,
and metadata creation up to the point of transfer to the Digital Repository Service (DRS)
is done on twenty interim-storage logical volumes residing on our Storage Area Network
(SAN), which is a four-terabyte EMC CLARiiON storage appliance connected via Fibre
Channel to the DAWs. The volumes are accessed using the SAN client software.93 Half of the
volumes are formatted as NTFS for direct use by the Pyramix workstations, and the other half
are formatted as HFS+ for use by our Macintosh G4 and G5 computers, where we mirror the NTFS files,
using symbolic links, for testing deliverables and final creation of the deposit package. The
NTFS volumes on the SAN are backed up nightly over the network to SAIT™ tape by Harvard
College Library’s Information Technology Services. Those backups are retrievable going back
one month. The HFS+ volumes on the SAN are backed up each night, locally, using Dantz
Retrospect™ to VXA™ tape.
Even though we normally move or copy files from NTFS to HFS+ (if we move them at all), we
occasionally need to copy files from HFS+ to NTFS. This is accomplished through our NAS
(Network Attached Storage) appliance using Macintosh File Services. The NAS is not backed
up because it is very temporary storage, and used only as a bridge between file systems.
Our choice of a local workspace storage system was based upon the desire for cross-platform
compatibility to accommodate the various tools used in our workflow. In pursuit of that
compatibility, the SAN was originally configured for a single file-system with dual-platform
read/write access. In practice, we found that the storage client software’s reliance upon third
party file-system translation utilities (for cross-platform read and write to a single file-system)
made the storage system unreliable. Rather than jeopardize our content or delay the schedule
for Sound Directions, we abandoned the idea of cross-platform read/write to a single file-system
and adopted a split file-system approach. Windows XP reads and writes to NTFS only.
Mac OS X reads and writes to HFS+, and can read NTFS but cannot write to it. We admit that
this approach is less than ideal, but it is at least stable and reliable.
93 A “client” is a piece of software that lets one negotiate access to one’s storage. For instance, it manages access
permissions for each user such as the ability to read and write to a volume or read-only.
5.2.2 Long-Term Preservation Storage
5.2.2.1 Best Practices
Best Practice 34: Use mirroring techniques for redundancy of online preservation and access
storage, and migrate the storage environment as technology changes.
Best Practice 35: Use off-site data tape for near-line storage and tape clones with periodic
media refresh.
Best Practice 36: Regenerate message digest (checksum) values periodically, and when
accessing files, to verify that all copies are unchanged.
Best Practice 37: Implement systems that generate periodic reports about the condition of
stored objects, and allow for ad hoc reporting of those conditions such as preservation risk
factors and confidence levels.
Best Practice 38: Monitor digital audio formats, the technical environment in which they
are used, and the service requirements of the user community. Look for usability threats or
opportunities and implement an appropriate preservation action plan.
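Best Practice 36 can be illustrated with a short Python sketch that regenerates a message digest for each stored copy of a file and compares it with the value recorded at deposit. This is a minimal example rather than any repository's actual implementation: MD5 is used only as an example algorithm, and the function names and the idea of passing a list of copy locations are assumptions made for illustration.

```python
import hashlib


def md5_digest(path, chunk_size=1024 * 1024):
    """Regenerate the MD5 digest of a file, reading it in chunks so that
    large audio files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(recorded_digest, copy_paths):
    """Compare every copy against the digest recorded at deposit.

    Returns a dict mapping each path (as a string) to True if the
    regenerated digest equals the recorded one, and False otherwise.
    """
    expected = recorded_digest.lower()
    return {str(path): md5_digest(path) == expected for path in copy_paths}
```

A caller would pass the recorded digest together with the paths of the online, mirrored, and restored tape copies; any entry that comes back False identifies a copy that has changed since deposit.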
5.2.2.2 Rationale
Failures are a factor in all systems. A digital audio object’s usability is dependent upon
the reliability of the data and the systems that support that data. It is therefore vital that
both the data and system integrity be monitored for failures and potential failures, and it is
also vital that the systems have sufficient redundancy to sustain failures while maintaining
uninterrupted service and integrity of objects. These best practices are best supported by a
digital preservation repository system.
Digital audio file formats become obsolete. Software applications required for the use of
those formats also become obsolete. User requirements change and may demand the richer
feature sets of newer formats. It is vital that a preservation repository recognize both the threat
of obsolescence and the opportunities provided by feature-rich file formats, and consult with
collection owners to take appropriate action. Such actions might include either a format
migration or the commitment to preserve an obsolete format and supporting application.
5.2.2.3 Long-Term Preservation Storage at Harvard
5.2.2.3.1 Background
Harvard’s Digital Repository Service was developed as a vital part of the infrastructure for
the Library Digital Initiative—a comprehensive program to develop the University’s capacity
to manage digital information. The DRS is a preservation and access repository available
to any Harvard affiliate or administrative unit. The DRS is committed to preserving access
to eligible, library-like content—that which supports research, scholarship and pedagogy
—content that has inherent, persistent value and is intended to be stored indefinitely.
Figure 24: Library Digital Initiative at Harvard
The Digital Repository Service has been in production operation since the fourth quarter of
2000. It is used by 28 administrative units and 5 reformatting laboratories or depositing agents.
It holds over 6 million objects, totaling more than 24 terabytes of data in 12 formats.
The appropriate unit of curatorial management is the object—the digital expression of an
intellectual work—not the file. The digital content and its metadata comprise the object. In
the DRS, the following core metadata is required for each file within the object:
• DRS object ID: a unique numeric identifier assigned upon deposit
• MD5 checksum used for data integrity upon ingest and beyond
• Insertion date
• MIME type and format type
• Owner code (FHCL.MUSI)
• Billing code for project level management (FHCL.MUS_0001)
• Owner-supplied identifier: used to describe the managed object in a curatorially
  significant way (AWM_DAT_172_AM_01_01_{52A7EEB3-1ED4-4FA3-8385-C008F6F047F5})
• Access (public, Harvard-only, staff-only)
An audio preservation package in the DRS may contain Broadcast Wave Format files, RealAudio
files, and SMIL files to support complex delivery of streaming media. All relationships among
files are described using METS and AES31-3 ADLs.
Figure 25: Digital Repository Service diagram
The DRS is tightly integrated with the Streaming Delivery Service. The RealAudio server streams
the data, and SMIL provides playlist capabilities. Delivery and administration services are
Java web applications. The administrative metadata is structured as an Oracle 9i database,
and is stored on a Fibre Channel EMC CLARiiON SAN. Online content is stored on a Fibre
Channel RAID appliance. All servers and services are monitored 24/7 and are housed in
a University Information Systems data center. Near-line content is stored off-site in a Fibre
Channel tape jukebox with off-line tape clones.
5.2.2.4 Long-Term Preservation Storage at Indiana
5.2.2.4.1 Background
The Indiana University (IU) Digital Library Program is currently in the process of designing
and implementing a digital preservation repository to support the storage, preservation, and
delivery of digital objects. This repository project, along with the larger Digital Library Program,
is jointly funded by the IU Libraries and University Information Technology Services, and is
intended to serve the digital access and preservation needs of library and archive collections
from across the university. While work to date on the IU digital preservation repository has
largely been focused on access needs, our goal is to develop it into a preservation repository
capable of being certified as an OAIS-compliant trusted digital repository, through the
emerging trustworthy repositories audit and certification process.94
94 The Center for Research Libraries and Online Computer Library Center, Trustworthy Repositories Audit &
Certification: Criteria and Checklist, ver. 1.0 (Chicago, IL: CRL; Dublin, OH: OCLC, 2007). Also available online:
http://www.crl.edu/PDF/trac.pdf.
5.2.2.4.2 Fedora
The Fedora repository system is being used as the basis for the IU digital preservation repository.
Fedora (Flexible Extensible Digital Object and Repository Architecture)95 is an open source
digital object storage system developed jointly by Cornell University Information Science
and the University of Virginia Library. The Fedora architecture is implemented as a set of Web
services. Core repository functions are separated from utilities that act on the repository (e.g.,
submission tools, search systems, metadata harvesting providers, etc.), allowing external
utilities to be replaced or upgraded without changes to the digital objects. Fedora is extremely
flexible, both in terms of the data stored and how it may be accessed. Objects in Fedora can
contain an unlimited number of “datastreams,” which may contain digital media files or
metadata about those files. Datastreams may be stored in locally managed disk space or may
be distributed across the Web. Data in locally-managed space is stored in a straightforward
manner, making manipulation by traditional file system tools easy, including backup and
restore functions. Each datastream may be associated with one or more Web services to
provide “just-in-time” data transformations when users make requests.
Fedora’s flexible nature allows it to be used as the foundation for a wide variety of applications
for digital libraries, archives, institutional repositories, and learning object systems. Fedora
has a growing user community, with implementations at more than 30 organizations around
the world, including the University of Virginia, Tufts University, the Technical University of
Denmark, the University of Hull, the US National Science Digital Library, and the Australian
ARROW project. The community has become active in developing tools that work with
Fedora-based repositories. Some examples of applications that have been built upon Fedora
include library digital collections management, multimedia authoring systems, archival
repositories, institutional repositories, and digital libraries for education.
5.2.2.4.3 Massive Data Storage System
To support archival storage in the IU digital preservation repository, our intention is to make use
of IU’s Massive Data Storage System (MDSS).96 MDSS is a distributed storage service offered
by University Information Technology Services to faculty, staff, and graduate students who
need large scale nearline storage. The system is based on HPSS (High Performance Storage
System)97 hierarchical storage management software developed by the US Department of
Energy labs and IBM. As a system based on the principle of hierarchical storage management
(HSM), data transferred to MDSS initially resides on disk drives, but is quickly migrated to
storage in a robotic tape library. The HPSS software manages the disks and tapes as a single
logical file system which can be scaled up to store arbitrary amounts of data with minimal
expense. The current system at Indiana University is able to store roughly 4.2 petabytes.
Data is mirrored between the data centers at IU’s Bloomington and Indianapolis campuses,
approximately 60 miles apart from each other, through a dedicated fiber optic link to provide
fault tolerance and some level of disaster protection. As storage technologies evolve and
new disk or tape storage systems are deployed as part of MDSS, the HPSS software is able to
automatically migrate data files to newer storage media while the files remain a part of the
same logical file system from the user’s perspective.
95 Fedora. http://www.fedora.info/.
96 Indiana University, Distributed Storage Services, “The Indiana University Massive Data Storage System Service”
(January 2007), http://storage.iu.edu/mdss.shtml.
97 High Performance Storage System (HPSS). http://www.hpss-collaboration.org/.
5.2.2.4.4 Preservation Repository
To implement preservation storage in the IU digital preservation repository, we plan to configure
Fedora to make use of MDSS as an underlying storage mechanism for large datastreams
such as master audio files. We are currently evaluating technical options for implementing
this connection. With this integration, we will be able to take advantage of Fedora’s abilities
to manage access control, metadata, and the information about the relationships between
the components of a complex digital object (metadata records, audio files, etc.) along with
MDSS’ large scale storage and mirroring of data across multiple geographic locations.
Beyond the features provided by Fedora and MDSS, additional services will be necessary for
this to fully qualify as preservation. Specifically, we will need to implement a preservation
integrity service to routinely check files that have been deposited into the repository to make
sure that they can be retrieved from MDSS and match their checksums. This will ensure
that objects deposited in the repository for preservation have not been intentionally or
unintentionally altered or lost over time. We will also need to develop policies and operational
and financial plans for the repository to ensure its long-term sustainability and ability to be
certified as a trusted digital repository. This will require a concerted joint effort involving
the technologists who manage the MDSS, technologists and librarians in the Digital Library
Program, and librarians and archivists with preservation expertise from across IU.
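The kind of routine check described above, confirming that each deposited file can still be retrieved and still matches its checksum, might be sketched in Python as follows. This is a hedged illustration and not the planned IU service: retrieval from MDSS is stood in for by reading from a local directory, and the manifest structure (relative path mapped to expected digest), the function names, and the report categories are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Compute a file's digest in chunks; the algorithm name is passed to hashlib."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def audit(root, manifest, algorithm="md5"):
    """Check every manifest entry under `root` and sort it into a report.

    `manifest` maps a relative file path to the digest recorded at deposit
    (a hypothetical structure used only for this sketch).
    """
    report = {"ok": [], "missing": [], "altered": []}
    for relative_path, expected in sorted(manifest.items()):
        candidate = Path(root) / relative_path
        if not candidate.is_file():
            # The file could not be retrieved at all.
            report["missing"].append(relative_path)
        elif file_digest(candidate, algorithm) != expected.lower():
            # The file was retrieved but no longer matches its checksum.
            report["altered"].append(relative_path)
        else:
            report["ok"].append(relative_path)
    return report
```

Run on a schedule, a report of this shape separates files that cannot be retrieved from files that were retrieved but have changed, which are the two failure conditions named in the paragraph above.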
6.1 Preservation Overview
Simply put, if every institution’s buckets of bits are not only different in character but also
not understandable outside their own context, they are idiosyncratic—not interoperable—
and true preservation has not occurred. Real preservation depends on the usability and
readability of files over an extended period of time and by different technologists at different
institutions. Should one institution fail, this type of interchange guarantees preservation by
enabling any engineer to access preserved content.
Interoperable files depend upon appropriate metadata in order to ensure readability over
time. In particular, descriptive metadata, and administrative technical and digital provenance
metadata provide information necessary to identify digital objects, and migrate and preserve
them over time. Opaque digital objects are difficult if not impossible to preserve. The
development of compatible Submission Information Packages (SIPs) lays the groundwork for
defining what constitutes a preservation object.
The Role of Packages in a Preservation Repository
An institution committed to preservation of digital objects over the long term must actively
manage those objects in a repository designed with that purpose in mind. A preservation
“package” is a representation of the data to be preserved in some sort of managed unit. The
OAIS model defines an “information package” as made up of “content information” (the
original target of preservation) and “preservation descriptive information” (other information
needed to preserve the object, including provenance, context, reference, and fixity
information), which are held together with “packaging information” (that which “is used to
bind and identify the components of an Information Package”). The information package is
then discoverable by external “descriptive information.”98 The package by this definition is a
conceptual one, with all relevant data logically grouped together for action by a preservation
repository. It does not necessarily consist of one single file; in fact, in practice the information
needed to make up an information package may be present in a number of different files that
must be bound together conceptually for preservation activity.
A preservation repository may represent an information package in several forms at different
times in the lifecycle of the digital object. The OAIS model defines a Submission Information
Package (SIP), Archival Information Package (AIP), and a Dissemination Information Package
(DIP). These packages represent the form in which information to be preserved is delivered
to a preservation repository, stored by a repository, and exposed to another entity on request,
respectively. The “preservation packages” exchanged between institutions as part of the
Sound Directions project are DIPs from the originating repository’s point of view, and SIPs
from the receiving repository’s point of view.
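The package roles described above can be pictured with a small Python data structure. This is a conceptual sketch only, not a schema from OAIS or from the Sound Directions project: the class and field names are assumptions chosen to mirror the terms in the text, and the `receive` function simply models the observation that a package sent as a DIP is treated as a SIP by the repository that receives it.

```python
from dataclasses import dataclass, field, replace
from enum import Enum


class PackageRole(Enum):
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


@dataclass
class InformationPackage:
    """Conceptual grouping of the parts of an OAIS information package."""
    role: PackageRole
    # Content information: the original target of preservation.
    content_files: list = field(default_factory=list)
    # Preservation descriptive information named in the text.
    provenance: str = ""
    context: str = ""
    reference: str = ""
    fixity: dict = field(default_factory=dict)  # file name -> checksum
    # Packaging information that binds and identifies the components.
    packaging_info: str = ""


def receive(package):
    """Model interchange: the sender's DIP becomes the receiver's SIP.

    Returns a new package and leaves the sender's package unchanged.
    """
    if package.role is not PackageRole.DIP:
        raise ValueError("only a DIP can be handed to another repository")
    return replace(
        package,
        role=PackageRole.SIP,
        content_files=list(package.content_files),
        fixity=dict(package.fixity),
    )
```

As the text notes, the package is a logical grouping, so in practice these fields would usually point to several separate files rather than live in a single object.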
The exact format for an information package is not described in the OAIS model. Several
XML-based schemas may be used for this purpose: the Metadata Encoding and Transmission
98 CCSDS, OAIS.
6 Preservation Packages and Interchange