on-line data. The activities that SANs support include disk
mirroring, backup and restore, archival and retrieval of archived
data, data migration from one storage device to another, and the
sharing of data among different servers in a network.
The Storage Resource Broker (SRB) is a data grid application
developed by the San Diego Supercomputer Center aimed at federating
collections of distributed data and presenting them to the user as
a unified collection. Its features include management,
collaboration, controlled sharing, publication, replication,
transfer, and preservation of distributed data. The SRB system is
middleware in the sense that it is built on top of other major
software packages including repository systems such as Fedora
and DSpace.
iRODS™, the Integrated Rule-Oriented Data System, is an
open-source data grid software system developed by the Data
Intensive Cyber Environments group at the University of North Carolina
at Chapel Hill (developers of the SRB). The iRODS system is based
on applying SRB technologies in support of data grids, digital
libraries, persistent archives, and real-time data systems. The set
of assertions these communities make about their digital
collections is characterized in iRODS Rules, which are interpreted
by a Rule Engine to decide how the system is to respond to various
requests and conditions (Moore et al, 2007).
‘PRONOM is an on-line information system about data file formats and their supporting software
products. PRONOM holds information about software products, and the file formats which each
product can read and write. PRONOM is a resource for … impartial and definitive information
about the file formats, software products and other technical components required to support
long-term access to electronic records and other digital objects of cultural, historical or
Global Digital Format Registry (GDFR) – ‘Peer-to-peer network of independent, but cooperating
registries of format communicating over a common protocol… will provide sustainable
distributed services to store, discover, and deliver representation information about digital
Format validation: often used by repositories when adding digital objects
JHOVE (JSTOR/Harvard Object Validation Environment) – a tool used to recognize and validate a
limited number of ‘popular’ file formats. It is both a file type identification tool and a file
format validation tool. JHOVE reads through an entire file and determines the degree of
compliance to a format specification.
‘DROID (Digital Record Object Identification) provides automated file identification
information using the PRONOM registry. It ‘is designed to… be able to identify the precise format
of all stored digital objects, and to link that identification to a central registry of technical
information about that format and its dependencies. DROID is a platform-independent Java
application, and includes a documented, public API, for ease of integration with other systems.’
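One common way to identify a file format automatically, in the spirit of the identification tools such as DROID and JHOVE noted in this section, is to compare the first bytes of a file with known format signatures. The following is a minimal sketch only: the signature table, the labels and the function names are assumptions made for illustration. It is not drawn from the PRONOM registry and does not reproduce the DROID or JHOVE interfaces.

```python
# Hypothetical, hand-written signature table: leading bytes -> format label.
# A real registry holds far richer signature definitions than this.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(data: bytes) -> str:
    """Return a format label for the given leading bytes, or 'unknown'."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify its format."""
    with open(path, "rb") as fh:
        return identify_format(fh.read(16))
```

Identification of this kind only says what a file claims to be. Validation, as described for JHOVE, would additionally require reading the whole file against the format specification.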
Storage Area Networks and Repositories
At Edinburgh there are two versions of SAN available. One is for research purposes and is under the
direction of the Edinburgh Parallel Computing Centre (EPCC). The other is a general SAN run by the
Edinburgh Compute Data Facility (ECDF) available to anyone from the university who wishes to back
up large research data stores (see: http://www.ecdf.ed.ac.uk). An option for researchers wishing
to share large-scale datasets is to create a descriptive metadata record in the Edinburgh DataShare
repository pointing to a remote storage location, including the SAN, and providing contact details
for requesting access.
Storage Resource Broker and Repositories
The PLEDGE project (Smith et al. 2006), a collaboration between the MIT and University of
California, San Diego Libraries and the San Diego Supercomputer Center, successfully demonstrated
an SRB-managed and grid-enabled storage architecture for DSpace. This was achieved by
standardising archival policies for their digital objects, including research data, through the use of
rules engines to produce scalable, interoperable digital archives and preservation environments.
Distributed Data Collections
The Purdue e-Data repository (see Witt, M. 2008) has been built using a Fedora Web Services
framework to provide functionality for remote datasets in addition to datasets being stored locally.
For very large datasets that are not practical or possible to ingest into a
central repository, middleware such as OAISRB has been developed locally to provide an Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) interface to the SRB to enable the
harvesting of metadata from datasets residing on a storage grid so that they can be represented
alongside local data collections.
A number of decisions need to be made by the digital repository regarding
metadata of various types. ‘Administrative, descriptive, technical, structural
and preservation metadata, using appropriate standards, are used to ensure
adequate description and control over the long term’ (DCC, 2008).
2.a ACCESS TO METADATA
Anyone may access the metadata free of charge.
Access to some or all of the metadata is controlled.
2.b REUSE OF METADATA
May the metadata be reused in another medium without prior
permission provided there is a link to the original metadata and/or
the repository is mentioned?
Will it be permissible to reuse the metadata for commercial
purposes? Is formal permission required?
Will the repository system allow metadata harvesting of dataset
descriptions by other institutions following the OAI-PMH guidelines,
or other harvesting protocols?
What level of metadata is re-usable? Dataset descriptions? Full
descriptive metadata (e.g. DDI XML record)?
Are data providers required to allow reuse of metadata?
2.c METADATA TYPES AND SOURCES
It is important that researchers deposit additional files known as
documentation that describe their dataset in more detail, and
especially the processes used to create it. Examples of documentation
include: a codebook file for statistical data; code for a software
program; a format specification; a technical report explaining the
research protocol or methodology. Without such documentation, a
dataset may not be fit for re-use.
In addition, most repositories attach metadata fields to each deposited
item, which conform to some standard or schema. Dublin Core (DC)
elements (properties) include descriptive information such as data
creator(s), date produced, abstract and subject. The repository software
can be configured to expose DC metadata through OAI-PMH, an XML-based
standard exchange protocol, which allows the content
to be ‘harvested’ by web services and other repositories.
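To illustrate what harvesting involves, the sketch below builds an OAI-PMH `ListRecords` request URL and extracts Dublin Core titles and creators from a response. The `verb` and `metadataPrefix` parameters and the XML namespaces come from the OAI-PMH 2.0 specification. The base URL, the sample response and the function names are assumptions made up for this example, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL for an OAI-PMH ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles_and_creators(response_xml):
    """Return a (title, [creators]) pair for each oai_dc record found."""
    root = ET.fromstring(response_xml)
    records = []
    for dc in root.iterfind(".//oai_dc:dc", NS):
        title = dc.findtext("dc:title", default="", namespaces=NS)
        creators = [c.text for c in dc.findall("dc:creator", NS)]
        records.append((title, creators))
    return records


# Hand-written sample response (hypothetical repository and dataset).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example survey dataset</dc:title>
          <dc:creator>Researcher, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.org/oai"))
print(extract_titles_and_creators(SAMPLE_RESPONSE))
```

A real harvester would also need to handle resumption tokens for long result sets and the error responses defined by the protocol, both of which are omitted here.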
Open Archives Initiative Protocol for Metadata Harvesting,
Metadata can take several forms, some of which will be visible to the
user of a digital library system, while others operate behind the scenes.
The Digital Library Federation (DLF), a coalition of 15 major research
libraries in the USA, defines three types of metadata which can apply to
objects in a digital library:
descriptive metadata: information describing the intellectual content
of the object, such as MARC cataloguing records, finding aids or similar
administrative metadata: information necessary to allow a
repository to manage the object: this can include information on how it
was scanned, its storage format etc (often called technical metadata),
copyright and licensing information, and information necessary for the
long-term preservation of the digital objects (preservation metadata)
structural metadata: information that ties each object to others to
make up logical units (e.g. information that relates individual images of
pages from a book to the others that make up the book itself).
In general, only descriptive metadata is visible to the users of a system,
who search and browse it to find and assess the value of items in the
collection. Administrative metadata is usually only used by those who
maintain the collection, and structural metadata is generally used by the
interface which compiles individual digital objects into more meaningful
units (such as journal volumes) for the user (University of Oxford, 2005).
The repository must make choices about what kinds of metadata will be
required within the repository and from where each type will be
Bibliographic description/s, e.g. Dublin Core, MODS, MARC21,
A structured catalogue record, or study description, is created for
each dataset. Domain-specific descriptive metadata: DDI, SDMX,
FGDC, EAD, TEI etc. (For social science data, particularly the DDI,
see ICPSR, 2005, chapter 3 Best Practice in Creating Technical
Full information relating to the content, structure, context and
source of the data; information about the methods, instruments,
and techniques used in the creation or collection of the data
References to publications pertaining to the data.
Information on how the data have been processed prior to
submission to the repository.
Preservation metadata maintained over the lifecycle of the data,
documenting actions taken at submission, curation and
‘Event history’ information is stored and linked to the digital
Rights management metadata
Technical metadata (storage format etc.)
Representation Information: how data are internally coded,
necessary for rendering data in an understandable form.
(See section 1.e on File Formats)
Structural metadata indicates how different components of a set of
associated data relate to one another.
The most straightforward example is Relational Database metadata,
described in Wikipedia as follows:
Each relational database system has its own mechanisms for storing
metadata. Examples of relational database metadata include: Tables of
all tables in a database, their names, sizes and number of rows in each
table. Tables of columns in each database, what tables they are used in,
and the type of data stored in each column. In database terminology,
this set of metadata is referred to as the catalogue.
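As a concrete illustration of a database catalogue, the sketch below uses SQLite through Python's standard `sqlite3` module to list the tables in a database and the name and declared type of each column. SQLite is chosen here only because it needs no server. The table and column names are invented for the example, and other database systems expose their catalogues through different mechanisms.

```python
import sqlite3


def describe_catalogue(conn):
    """Return {table_name: [(column_name, declared_type), ...]}."""
    catalogue = {}
    # sqlite_master is SQLite's built-in table describing the schema.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields rows of
        # (cid, name, type, notnull, dflt_value, pk) for each column.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalogue[table] = [(col[1], col[2]) for col in columns]
    return catalogue


# Hypothetical two-table survey database held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, age INTEGER, region TEXT)"
)
conn.execute(
    "CREATE TABLE responses (respondent_id INTEGER, question TEXT, answer TEXT)"
)
catalogue = describe_catalogue(conn)
for table, columns in catalogue.items():
    print(table, columns)
```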
Other examples of structural metadata:
FOXML (Fedora Object XML) is used in the Fedora repository software, where
compound objects are treated as a single file.
OAI-ORE defines compound objects distributed on the Internet
through the creation of resource maps which use unique URLs for
Introduction to Fedora Object XML (FOXML)
Open Archives Initiative Object Reuse and Exchange (OAI-ORE).
A Dublin Core-based dataset profile is in use at the University of Edinburgh’s DataShare
repository (http://www.disc-uk.org/docs/Edinburgh_DataShare_DC-schema1.pdf) and the
University of Southampton’s ePrints repository
METS is used as a ‘wrapper’ for compound digital objects,
allowing them to be identified as such and acted upon, e.g.
importing and exporting in repositories.
RDF provides a simple way to make statements about Web
resources, often expressed in RDF/XML, in the form of “triples,”
i.e. subject-predicate-object expressions that relate objects to
one another. (For example, A is a version of B.)
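The subject-predicate-object idea can be shown without any RDF library. In the minimal sketch below each statement is a plain tuple, and a small helper finds statements matching a pattern. The predicate names follow Dublin Core terms (`isVersionOf`, `isPartOf`, `creator`), while the dataset and person identifiers and the `match` helper are made up for illustration. A real system would use full URIs and an RDF serialisation such as RDF/XML.

```python
# Each statement is a (subject, predicate, object) triple.
# The identifiers below are hypothetical.
triples = {
    ("dataset:A", "dcterms:isVersionOf", "dataset:B"),
    ("dataset:A", "dcterms:creator", "person:1"),
    ("dataset:C", "dcterms:isPartOf", "dataset:B"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Which resources are versions of something else?
print(match(triples, predicate="dcterms:isVersionOf"))
# Which statements point at dataset B?
print(match(triples, obj="dataset:B"))
```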
2.d METADATA SCHEMAS
Repositories may need to put in place additional metadata schemas to
support the ingest, management, and use of data in their collections.
Some repositories implement additional or extended metadata schemas
for domain-specific datasets. For example, when creating a new community or
collection (e.g. for Astronomy or Space Physics), the SPDML (Space
Physics Data Markup Language) schema could be ‘plugged in’ to DSpace.
This would mean that researchers wishing to deposit Space Physics data
would be presented with an ingestion interface based on the metadata
schema for their particular data type, thus capturing the richness of
that particular domain dataset (which otherwise could be lost by the
default DC-based schema).
The same could be applied to Learning and Teaching Objects (LOM –
Learning Object Metadata), biological species observational data
(Darwin Core), SPASE Data Model (standard metadata for space science
data description) etc.
Metadata Encoding and Transmission Standard (METS).
Resource Description Framework (RDF). http://www.w3.org/RDF/
‘Open Archives Initiative Object Reuse and Exchange (OAI-ORE) deﬁnes standards for the
description and exchange of aggregations of Web resources. These aggregations, sometimes called
compound digital objects, may combine distributed resources with multiple media types including
text, images, data and video. The goal of these standards is to expose the rich content in these
aggregations to applications that support authoring, deposit, exchange, visualization, reuse, and
preservation. Although a motivating use case for the work is the changing nature of scholarship and
scholarly communication, and the need for cyberinfrastructure to support that scholarship, the
intent of the effort is to develop standards that generalize across all web-based information
including the increasingly popular social networks of web 2.0.’ See
DDI, the Data Documentation Initiative, is an XML-based metadata schema that can describe not
just the dataset as a whole but also descriptive material drawn from the life-cycle of a data
resource. This can range from information about the source of funding and the methodology
used in collecting data to an entire set of survey questions and resultant variables. It was initially
created as a metadata schema for codebooks but has developed broader application for time series
data, complex hierarchical data files, and tabular data. The DISC-UK DataShare project explored
the potential of enhancing institutional data repositories through the use of DDI metadata in a
briefing paper (Martinez, 2008).
3. SUBMISSION OF DATA (INGEST)
3.a ELIGIBLE DEPOSITORS
Will eligibility be restricted by status?
Deposits may be made by, for example:
Accredited members, academic staff, registered students,
employees of the institution, department, subject community
or delegated agents
Data producers or their representatives (‘self deposit’)
Only repository staff.
Will eligibility be restricted by content?
For example, eligible depositors:
may only deposit their own work
must enter descriptive metadata for all their data
are limited to depositing complete datasets as defined by the
may only deposit data of a certain type or subject.
Will the repository provide a confirmation of receipt to the
depositor, including a request to resubmit a digital object in the
case of errors resulting from the submission?
3.b MODERATION BY REPOSITORY
Are submissions checked to ensure that data integrity has been
fully maintained during the transfer process? If so, spot checks, or
The repository checks metadata records for accuracy.
The repository adds Digital Object Identifiers (DOIs) or another
persistent identifier, such as the Handle system.
Does the repository’s administration review items for the
eligibility of authors/depositors?
relevance to the scope of the repository?
exclusion of spam?
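One way to check that data integrity has been maintained during transfer, as asked at the start of this subsection, is to compare a checksum supplied by the depositor with one computed on the file as received. The sketch below assumes the depositor supplies a SHA-256 checksum. That choice of algorithm and the function names are assumptions for illustration, not a requirement stated in this guide.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_intact(received_path, declared_checksum):
    """Return True if the received file matches the depositor's checksum."""
    return sha256_of_file(received_path) == declared_checksum.strip().lower()
```

A repository could run this check on every submission, or only on a sample of them if spot checks are preferred.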
3.c DATA QUALITY REQUIREMENTS
The repository should have clear and concise depositor agreements
written in plain language that are presented to depositors with each
acquisition. In most cases, data producers are responsible for the
quality of the digital research data. The repository is responsible for the
quality of storage and availability of the data. For example, the
following statement could be part of a submissions policy:
The validity and authenticity of the content of submissions (all
materials submitted by the depositor, including full data and metadata)
is the sole responsibility of the depositor, and is not checked by the
repository (SHERPA, 2007).
In these cases, the repository accepts no responsibility for mistakes,
omissions, or legal infringements within the deposited object. There
may be situations in which the depositor does not guarantee that the
dataset is accurate and the depositor indemnifies the repository’s
institution against any legal action arising from the content of the
dataset. One way to mitigate legal risks is to have a ‘take-down’
policy for removal of objectionable items. (See section 6, Withdrawal.)
In some cases, licenses are presented to depositors to cover the range
of requirements for reuse of the data (see section 4, Access and Reuse
3.c.2 QUALITY ASSESSMENT
If the repository evaluates data quality in order to make decisions about
whether to accept content or not, the repository can choose to
determine the quality by assessing the following:
Their intrinsic scientific quality through assessment by experts and
colleagues in the field. Considerations:
Are the research data based on work performed by the data
producer (researcher or institution that makes the research
available) and does the data producer have a record of
Was data collection or digitization carried out in accordance
with prevailing criteria in the research discipline?
Are the research data useful for certain types of research and
suitable for reuse? (DANS, 2008 p. 7)
The required contextual information (metadata) has been provided
by the data producer. Descriptive, structural and administrative
metadata must be provided, or created by the repository, in
accordance with the applicable guidelines of the repository.
Data files are inspected to ensure that variables and values are
accurate according to the documentation supplied and are
sufficiently labelled for secondary use.
Checks are made to verify that metadata in a data file matches
metadata in descriptive documentation (e.g. variable names in a
dataset match variable names in a codebook).
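The last check above can be partly automated. The sketch below compares the column names in the header row of a CSV data file with the variable names listed in a codebook, and reports any that appear in only one of the two. The CSV layout, the sample variable names and the function name are assumptions for illustration. Real codebooks come in many formats and would first need to be parsed into a list of names.

```python
import csv
import io


def compare_variables(data_csv_text, codebook_variables):
    """Report variables present in the data or the codebook but not both."""
    reader = csv.reader(io.StringIO(data_csv_text))
    header = next(reader)  # the first row is assumed to hold variable names
    in_data = set(header)
    in_codebook = set(codebook_variables)
    return {
        "missing_from_codebook": sorted(in_data - in_codebook),
        "missing_from_data": sorted(in_codebook - in_data),
    }


# Hypothetical data file and codebook variable list.
sample_data = "id,age,region\n1,34,North\n2,51,South\n"
report = compare_variables(sample_data, ["id", "age", "sex"])
print(report)
```

Two empty lists indicate that the data file and the codebook agree on variable names. It does not show that the values themselves are accurate.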
3.d CONFIDENTIALITY AND DISCLOSURE
Repositories will often require that data depositors ensure that data
meet requirements of confidentiality and non-disclosure for data
collected from human subjects. In some cases, the repository may alter
sensitive data to create anonymised data that can be distributed to its
In the case of one social science data archive, data collections acquired
by the repository “undergo stringent confidentiality reviews to
determine whether the data contain any information that could be used
- on its own or in combination with other publicly available information
- to identify respondents. Only after the completion of those reviews
are data made available from the repository. Should such information be
discovered, the repository alters the sensitive data after consultation
with the principal investigator to create public use files that limit the
risk of disclosure” (ICPSR, 2007c).
3.e EMBARGO STATUS
Some repository infrastructure systems have the technical capacity to
embargo or sequester access to data until the content has been
approved for release to the public. Agreements about the embargo – its
length and what triggers its ending – need to be made between the
repository and its contributors.
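Where the agreed trigger for ending an embargo is a fixed date, the access decision reduces to a date comparison. The sketch below assumes exactly that. The function name and the convention that `None` means "no embargo" are illustrative choices, and other triggers, such as publication of an associated article, would need different logic.

```python
from datetime import date


def is_accessible(embargo_until, today=None):
    """Return True if data may be released on the given day.

    embargo_until is the first day on which access is allowed;
    None means that no embargo applies.
    """
    if today is None:
        today = date.today()
    return embargo_until is None or today >= embargo_until


print(is_accessible(date(2030, 1, 1), today=date(2029, 12, 31)))
print(is_accessible(date(2030, 1, 1), today=date(2030, 1, 1)))
```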
Data Seal of Approval from DANS (Data Archiving and Networked Services, The Netherlands)
‘DANS was given the task ... to develop a seal of Approval in order to ensure that archived data
can also be found, recognized and used in future. Such a seal of approval can be applied for and
awarded to research which meets a number of recognizable criteria in the area of quality, durability
and accessibility of the data. The seal of approval can also be requested by and awarded to data
repositories that want to store research data permanently and make them accessible. The seal of
approval contains guidelines for applying and checking quality aspects of the creation, storage and
(re)use of digital research data in the social sciences and humanities. These guidelines serve as a
basis for granting a “data seal of approval.”’