to address the realities that small to mid-sized institutions face with limited funding and
technical staffing. During the project, various software applications (some free), metadata
formats, and guides were used and evaluated. The summary below includes product
information and the results of our trials. This report should not be considered an official
endorsement of any product, nor is it a comprehensive list of every applicable product.
Note: See glossary for format definitions.
Product
ABC Amber Outlook Converter
Description
ProcessText Group application that converts email into different formats
such as PDF, HTML, and TXT. Trial version available.
Vendor information “ABC Amber Outlook Converter is intended to help you keep your
important emails, newsletters, other important messages organized in
one file. It is a useful tool that converts your emails from MS Outlook to
any document format (PDF, DOC, HTML, CHM, RTF, HLP, TXT, DBF, CSV,
XML, MDB, etc.) easily and quickly. It generates the contents with
bookmarks (in PDF, DOC, RTF and HTML), keeping hyperlinks. Also you
can use this tool as MSG Converter. Currently our software supports
more than 50 languages.”
Intended CERP Use SIA tried it for some XML conversion of email before the XML parser-
schema work was started by the CERP technical consultant. It can
produce a report indicating number of unread items within the folders of
an email account. RAC converted PST files to RTF, HTML, and TXT.
Attachments displayed correctly and the to/from/date header data
displayed; however, the Internet Header metadata including computer
identification data that makes the message authentic and reliable did not
appear. Trial version tested.
Results
SIA and RAC discontinued testing this application due to time constraints
and success with other software.
URL
www.processtext.com/abcoutlk.HTML
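The distinction drawn above between the displayed to/from/date headers and the full Internet Header metadata can be illustrated with a minimal sketch using Python's standard email package. The sample message, addresses, and the choice of which fields count as "display" headers are assumptions made for illustration; they are not taken from the CERP testbed or from any product tested.

```python
from email import message_from_string

# Headers a mail client typically shows the reader (an assumed, illustrative set).
DISPLAY_HEADERS = {"from", "to", "date", "subject"}

# Hypothetical raw message; hosts and addresses are placeholders.
RAW = (
    "Received: from mail.example.org (mail.example.org [192.0.2.10])\n"
    "\tby mx.example.net with ESMTP id abc123; Mon, 6 Oct 2008 09:15:02 -0400\n"
    "Message-ID: <20081006.abc123@mail.example.org>\n"
    "From: donor@example.org\n"
    "To: archive@example.net\n"
    "Date: Mon, 6 Oct 2008 09:15:00 -0400\n"
    "Subject: Board minutes\n"
    "\n"
    "Minutes attached.\n"
)


def split_headers(raw):
    """Separate reader-facing header names from the remaining Internet headers."""
    msg = message_from_string(raw)
    display, transport = [], []
    for name, _value in msg.items():
        if name.lower() in DISPLAY_HEADERS:
            display.append(name)
        else:
            transport.append(name)
    return display, transport


display, transport = split_headers(RAW)
print("Display headers:", display)
print("Other Internet headers:", transport)
```

A conversion that keeps only the first list loses fields such as Received and Message-ID, which record the computers that handled the message.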
Product
ABC Amber PDF Converter
Description
ProcessText Group application that converts email into PDF. Trial version
available.
Vendor information “ABC Amber PDF Converter is a powerful tool which allows you to convert
PDF to any document format (HTML, CHM, RTF, HLP, TXT, DOC, DBF,
XML, CSV, XLS, MDB, DB, etc.) easily and quickly. You can export all
pages or just selected pages, as plain text or as preview pictures.
Currently our software supports more than 50 languages.”
Intended CERP Use RAC converted some PSTs to PDF using Amber Outlook Converter, but
was unable to get an XML display from Amber PDF Converter. SIA also
tried the PDF conversion of emails but CERP decided on XML for the
preservation format. Trial version tested.
Results
SIA and RAC discontinued testing this application due to time constraints
and success with other software.
URL
www.processtext.com/abcPDF.HTML
programs and processes messages quickly and faithfully, including those
with file attachments and embedded contents like pictures and
background images. Unlike other migration methods, Aid4Mail can also
export message status information such as ‘unread,’ ‘read,’ ‘replied,’ and
‘forwarded’ from most mail clients.”
Intended CERP Use SIA used it to convert MSG files into a PST file. SIA also used it to create
MBOX files from PST files, extract attachments and embedded contents,
and extract sender names. SIA used it to convert Eudora email (MBX) files
to MBOX and PST. RAC used it to convert generic MBOX files to MSG
format for ease of sorting non-business from business messages. After
sorting, the MSG files were converted back to MBOX format for parsing.
Results
SIA discontinued using it for PST to MBOX conversion because: 1) it failed
to capture email attachments from child messages and it created
attachments such as winmail.dat files out of email bodies; 2) after an
Aid4Mail upgrade, processed email was missing both its attachment and
its message body; and 3) the MBOX output was not as robust or legal as
output from other products. SIA continues to use it with email formats
other than PST. RAC found it very useful for the wide range of formats
encountered in testbed email. This range of formats is typical of RAC
depositors. RAC did not use it for PSTs. Some experimentation with
selecting different options was necessary before finding a combination
that worked and that retained all data. Generally “generic source” and
“convert to generic” worked.
URL
http://www.aid4mail.com
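For readers who want to check an MBOX conversion independently, the following is a minimal sketch of listing sender names and attachment file names from an MBOX file with Python's standard mailbox module. It is not how Aid4Mail works internally; the function names and the sample message are hypothetical.

```python
import mailbox
from email.message import EmailMessage


def list_senders_and_attachments(mbox_path):
    """Return (sender, [attachment file names]) for every message in an MBOX file."""
    results = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            # walk() visits every MIME part; parts with a file name are attachments.
            names = [part.get_filename() for part in msg.walk() if part.get_filename()]
            results.append((msg.get("From", ""), names))
    finally:
        box.close()
    return results


def build_sample_mbox(path):
    """Write a one-message MBOX file with a placeholder attachment (illustration only)."""
    msg = EmailMessage()
    msg["From"] = "donor@example.org"
    msg["To"] = "archive@example.org"
    msg["Subject"] = "Quarterly report"
    msg.set_content("Report attached.")
    msg.add_attachment(b"placeholder bytes", maintype="application",
                       subtype="octet-stream", filename="report.doc")
    box = mailbox.mbox(path)
    box.add(msg)
    box.flush()
    box.close()
```

Comparing such a listing before and after a conversion is one way to spot dropped attachments or unexpected winmail.dat files.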
Product
Archivists’ Toolkit (AT)
Description
Open-source software that offers archival data management, which was
produced through an Andrew W. Mellon Foundation-funded collaboration
among the University of California San Diego Libraries, the New York
University Libraries and the Five Colleges Inc. Libraries.
Vendor information “The Archivists’ Toolkit, or the AT, is the first open source archival data
management system to provide broad, integrated support for the
management of archives. It is intended for a wide range of archival
repositories. The main goals of the AT are to support archival processing
and production of access instruments, promote data standardization,
promote efficiency, and lower training costs. Currently, the application
supports accessioning and describing archival materials; establishing
names and subjects associated with archival materials, including the
names of donors; managing locations for the materials; and exporting
EAD finding aids, MARCXML records, and METS, MODS and Dublin Core
records. Future functionality will be built to support repository
user/resource use information, appraisal for archival materials,
expressing and managing rights information, and interoperability with
user authentication systems.”
MARC records. The capability of producing multiple schema from one
input is useful; with modification, it could be used for accession
metadata. RAC did not use it after discovering that it was not compatible
with email.
URL
http://www.archiviststoolkit.org
Product
Cooktop
Description
Freeware application for creating XML.
Vendor information “Cooktop is an editor and development environment for XML, DTD, and
XSLT documents. Cooktop is a Windows application. Color-coded XML,
DTD, and XSLT editing. Check well-formedness and validate
Stylesheet testing with almost any XSLT engine. XPATH testing.
Customizable ‘Code Bits’ library. XML formatting via Tidy. Small
download, small footprint”
Intended CERP Use SIA used it to edit some EAD (see below) stylesheets when working in
NoteTab Pro (see below). RAC did not use this tool.
Results
Tool was easy to use. SIA discontinued using it once stylesheets were set
and oXygen (see below) was implemented.
URL
www.XMLcooktop.com
Product
DROID (Digital Record Object Identification)
Description
National Archives (United Kingdom) software that identifies file formats.
Vendor information “DROID (Digital Record Object Identification) is a software tool developed
by The National Archives to perform automated batch identification of
file formats. Developed by its Digital Preservation Department as part of
its broader digital preservation activities, DROID is designed to meet the
fundamental requirement of any digital repository to be able to identify
the precise format of all stored digital objects, and to link that
identification to a central registry of technical information about that
format and its dependencies.
DROID uses internal and external signatures to identify and report the
specific file format versions of digital files. These signatures are stored in
an XML signature file, generated from information recorded in the
PRONOM technical registry. New and updated signatures are regularly
added to PRONOM, and DROID can be configured to automatically
download updated signature files from the PRONOM website via web
services. DROID is a platform-independent Java application, and includes
a documented, public API, for ease of integration with other systems. It
can be invoked from two interfaces: a Java Swing GUI and a command
line interface.”
Intended CERP Use SIA used it for format identification of native attachments extracted from
email messages for preservation assessments. RAC did not use it, as it did
not conduct such assessments for the pilot.
Results
Setup was easy, and the tool alerts the user to possible file extension mismatches.
Output available as XML. SIA adopted this tool in conjunction with JHOVE
(see below) to perform assessment tasks.
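The idea of identifying a format from internal signatures and flagging an extension mismatch can be sketched as follows. This is an illustration only: the small signature table is hand-written for the example and is not the PRONOM signature file, and the function is not DROID's API.

```python
import os

# Hypothetical, hand-written signature table (leading bytes, format name,
# expected extensions). Real registries such as PRONOM are far more detailed.
SIGNATURES = [
    (b"%PDF-", "PDF", {".pdf"}),
    (b"\x89PNG\r\n\x1a\n", "PNG", {".png"}),
    (b"GIF87a", "GIF", {".gif"}),
    (b"GIF89a", "GIF", {".gif"}),
    (b"\xff\xd8\xff", "JPEG", {".jpg", ".jpeg"}),
]


def identify(filename, data):
    """Return (format name or None, True if the extension does not fit the format)."""
    ext = os.path.splitext(filename)[1].lower()
    for magic, name, extensions in SIGNATURES:
        if data.startswith(magic):
            return name, ext not in extensions
    return None, False


# A PDF that was saved with a .doc extension is reported as a mismatch.
print(identify("memo.doc", b"%PDF-1.4\n"))
```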
entry, or forensic analysis and eDiscovery. Migrate: Emailchemy includes
an embedded IMAP mail server from which any IMAP-compatible email
application can import your converted email. Emailchemy also includes a
utility for uploading converted mail up to a Google Apps email account.
Manage: Emailchemy provides utilities for splitting, sorting and merging
email archives, and harvesting email addresses from email archives.”
Intended CERP Use RAC used it to convert a batch of MBOX testbed messages. SIA did not
test this application.
Results
Worked quickly and showed correct number of messages converted, but
output is MBOX only. Internet Header metadata was retained and
reasonably reader-friendly displays of the email messages could be
viewed in Quick View Plus, Open Office, Explorer, and XML Editor. Several
attachments were sampled and compared with ones that were viewable
in Aid4Mail conversions (Word and PowerPoint). None of the Emailchemy
attachments displayed in a viewable manner; all were simply character
strings, typical of MBOX displays. Even though some mail within the
captured MBOX was Eudora, Emailchemy produced a “does not match”
message when Eudora was selected as the type to be converted.
URL
http://www.weirdkid.com/products/emailchemy/
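The report does not say why the attachments appeared as character strings. One common cause, offered here only as an assumption, is that binary attachments are stored inside a message as base64-encoded text and must be decoded before a viewer can open them. The sketch below uses Python's standard email package; the sample message and function name are hypothetical.

```python
from email import message_from_bytes
from email.message import EmailMessage


def attachment_views(raw_message):
    """Return (encoded text as stored, decoded bytes) for the first attachment found."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_filename():
            # get_payload() gives the stored text; decode=True reverses the encoding.
            return part.get_payload(), part.get_payload(decode=True)
    return None, None


# Hypothetical message with a small binary attachment.
sample = EmailMessage()
sample["From"] = "a@example.org"
sample["Subject"] = "slides"
sample.set_content("See attachment.")
sample.add_attachment(b"\xd0\xcf\x11\xe0 binary", maintype="application",
                      subtype="vnd.ms-powerpoint", filename="slides.ppt")

encoded, decoded = attachment_views(sample.as_bytes())
print(encoded.strip())  # character string as it sits in the mailbox file
print(decoded)          # original bytes after decoding
```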
Product
EAD Cookbook
Description
Guidebook to use for implementing EAD (Encoded Archival Description).
Vendor information “The appearance of EAD Version 2002, the shift of the EAD community
from SGML to an XML environment, the appearance of new tools for
creating and distributing finding aids, and the emergence of community-
based encoding protocols necessitate a revision of that earlier work.
While the basic EAD recipe has not changed, some of the ingredients
have. As an update, this edition focuses on those aspects of
implementation that have changed since 2003, specifically changes in
the EAD element set, new tools for creating EAD-encoded documents,
and the need to provide additional XSLT stylesheets for transforming EAD
files into HTML.”
Intended CERP Use SIA used the Cookbook for learning about creating EAD. RAC did not use
EAD for its finding aids for the pilot.
Results
It provides assistance for EAD creation in NoteTab, XMetaL, and oXygen.
Other references for EAD included:
Archives of American Art finding aids
http://www.aaa.si.edu/collections/findingaids
Archives Hub, EAD 2002 Online Template
http://www.archiveshub.ac.uk/eadform2002.HTML
EAD Help Pages http://www.archivists.org/saagroups/ead
EAD Version 2002 Official Site http://www.loc.gov/ead
RLG Best Practice Guidelines for Encoded Archival Description
http://www.oclc.org/programs/ourwork/past/ead/bpg.PDF
Texas Archival Resources Online
http://www.lib.utexas.edu/taro/index.HTML
directory comparison tool for Windows 98/Me/NT/2000/XP/2003/Vista.
It features unique functionality that distinguishes ExamDiff Pro from
other comparison programs. If you've been frustrated with other
comparison utilities, you will find that ExamDiff Pro offers a much more
efficient and user-friendly way to compare files and folders.”
Intended CERP Use SIA used it to compare, side by side, email XML output produced on
different PCs and to view the differences (changed, added, and deleted
text). RAC did not test this tool.
Results
It delivered fast results as compared to a manual scan of documents. SIA
adopted it for document comparison.
URL
http://www.prestosoft.com/edp_examdiffpro.asp
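A comparable check of two XML outputs can be scripted with Python's standard difflib module. This is a minimal sketch rather than a substitute for a side-by-side viewer; the element names in the sample and the file labels pc1.xml and pc2.xml are hypothetical.

```python
import difflib


def changed_lines(text_a, text_b):
    """Return only the removed (-) and added (+) lines between two texts."""
    diff = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile="pc1.xml", tofile="pc2.xml", lineterm="", n=0))
    # The first two entries are the file header lines; hunk markers start with "@@".
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


output_pc1 = "<Message>\n<Subject>Budget</Subject>\n<Status>read</Status>\n</Message>"
output_pc2 = "<Message>\n<Subject>Budget</Subject>\n<Status>unread</Status>\n</Message>"

for line in changed_lines(output_pc1, output_pc2):
    print(line)
```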
Product
ExMerge
Description
Microsoft Exchange utility that captures selected emails from the
Exchange Email Server into a PST file.
Vendor information “Use the Mailbox Merge Program to extract data from mailboxes on a
Microsoft Exchange Server and then merge this data into mailboxes on
another Microsoft Exchange Server. The program copies data from the
source server into Personal Folders (PST files) and then merges the data,
in the Personal Folders, into mailboxes on the destination server. The
ability to merge data to and from an Exchange Server makes this
program an invaluable tool with a variety of uses - especially during
disaster recovery. The program can also replace existing data instead of
merging new data if specified by the Administrator. Mailbox Merge has
some limitations. Please read the tools documentation before using this
program.”
Intended CERP Use SIA used it to capture and transfer Outlook Exchange email accounts for
CERP. RAC did not use this tool as it did not have access to depositors’
MS Exchange email systems.
Results
SIA discontinued using it because the data captured was incomplete
(missing folders), too recent for the pilot, or both. Setup was time-consuming.
URL
http://www.microsoft.com/downloads/details.aspx?FamilyID=429163ec-dcdf-47dc-96da-1c12d67327d5&displaylang=en
Product
EZDetach
Description
TechHit software that extracts email attachments in native formats with
options of naming file with sender or message subject and zipping after
extraction. Works as plug-in within Outlook. Trial version available.
Vendor information “Improve productivity and save hours on managing attachments. Save
mailbox and PST space and comply with email policies. Speed up Outlook
by reducing your PST or mailbox size. Organize attachments in file
system folders for easy access and sharing. Automatically save
attachments with Outlook Rules Wizard. Automatically print
attachments.”
Product
Fentun
Description
Freeware that extracts attachments.
Vendor information “Fentun's program understands Microsoft's Transport Neutral
Encapsulation Format (TNEF). Register it as Netscape's helper application
for ‘application/ms-tnef,’ and you will be able to extract attachments
embedded in the TNEF. It should be easy enough to use Fentun with
other email programs as well.”
Intended CERP Use RAC used it to open attachments in unidentified formats. SIA did not test
this tool.
Results
It successfully opened .dat attachments from within current Outlook
Inboxes; however, it was not useful for testbed legacy attachments from
various email clients.
URL
http://www.fentun.com
Product
File Investigator
Description
Forensics Innovations Software that identifies file type. Trial version
available.
Vendor information “The File Investigator Engine is the core library that identifies a file by its
content rather than filename extension. You might assume that it has to
be slow if it opens every file, but it is almost as fast as any other program
that just reads the disk directory. MS Windows, and most applications,
only look at a file's extension when identifying or loading it. If the file has
the wrong extension or the application doesn't recognize the extension,
then you are out of luck. Unless you have an application that uses the
File Investigator Engine. This engine also extracts valuable information
out of many different types of files. Information like: image resolutions,
sound file sampling rates, document titles, and much more. It then adds
general information about that particular file type/format.”
Intended CERP Use SIA used it to determine formats of files after JHOVE indicated problems
with specific attachments. RAC did not test this tool. Trial version tested.
Results
File report included checksum, extensions for the format, and other
metadata. Works with more than 2,000 file formats. SIA discontinued
testing this application due to time constraints.
URL
http://www.forensicinnovations.com/fiengine.HTML
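The checksum mentioned in the file report can be reproduced with Python's standard hashlib module. The report does not state which algorithm File Investigator uses, so SHA-256 below is an assumption, and the function name is hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest when an attachment is extracted, and recomputing it later, shows whether the file has changed in the meantime.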
Product
File Merlin
Description
Advanced Computer Innovations software that converts files.
Vendor information “FileMerlin accurately converts word processing, spreadsheet,
presentation and data base files between a very wide range of file
formats. Widely regarded as the premier document conversion product, it
is suitable for straight-forward as well as complex documents, and is the
most accurate, complete and flexible such solution we know of.”
Intended CERP Use RAC tried converting folders and individual attachments in unknown
formats. SIA did not test this application on testbed email attachments.
Product
JHOVE
Description
JHOVE (JSTOR/Harvard Object Validation Environment) is an open-source
application that identifies open formats.
Vendor information “JHOVE provides functions to perform format-specific identification,
validation, and characterization of digital objects. Identification,
validation, and characterization actions are frequently necessary during
routine operation of digital repositories and for digital preservation
activities. These actions are performed by modules. The output from
JHOVE is controlled by output handlers. JHOVE uses an extensible plug-in
architecture; it can be configured at the time of its invocation to include
whatever specific format modules and output handlers that are desired.
The initial release of JHOVE includes modules for arbitrary byte streams,
ASCII and UTF-8 encoded text, GIF, JPEG2000, and JPEG, and TIFF
images, AIFF and WAVE audio, PDF, HTML, and XML; and text and XML
output handlers.”
Intended CERP Use SIA used it for format validation of native attachments extracted from
email messages for preservation assessments. RAC did not test this tool.
Results
Setup at SIA was complicated due to some Java issues on the workstation
but technical assistance from Harvard was very helpful. Only works with
open formats such as JPG, PDF, TIFF, etc. Output available as XML. SIA
adopted this tool in conjunction with DROID (see above) to perform
assessment tasks.
URL
http://hul.harvard.edu/jhove
Product
Lookout
Description
Unsupported application that searches words/strings in MS Outlook.
Vendor information “Lookout is lightning-fast search for your email, files, and desktop
integrated with Microsoft Outlook.
Built on top of a powerful search engine, Lookout is the only personal
search engine that can search all of your email from directly within
Outlook - in seconds...
You can use Lookout to search your:
Email messages. Contacts, calendar, notes, tasks, etc. Data from
exchange, POP, IMAP, PST files, Public Folders.
Files on your computer or other computers.
Just enter your search and press enter. Results are instant. Lookout will
find your search terms hiding nearly anywhere in your Outlook mailbox -
subjects, bodies, phone numbers, addresses, etc.”
Intended CERP Use RAC and SIA used it for keyword searching within email collections.
Results
Product produced more accurate results than Outlook’s search tool. For
example, when searching for “mission” the Outlook search would also
find “commission” and “submission” as opposed to only finding
“mission.” SIA adopted this tool for more refined searching when
applicable with email collections. RAC used it for testbed messages only.
URL
http://web.archive.org/web/20060831223528/http:/www.lookoutsoft.com/Lookout/download.HTML
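The "mission" example above comes down to the difference between substring matching and whole-word matching. The sketch below illustrates that difference with Python's standard re module; it makes no claim about how Lookout or Outlook implement search, and the sample subject lines are invented.

```python
import re


def substring_hits(term, texts):
    """Match the term anywhere, including inside longer words."""
    return [t for t in texts if term.lower() in t.lower()]


def whole_word_hits(term, texts):
    """Match the term only where it stands as a separate word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


subjects = ["Our mission statement", "Commission meeting", "Grant submission due"]
print(substring_hits("mission", subjects))   # all three subjects
print(whole_word_hits("mission", subjects))  # only the first subject
```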
organization. File sent messages.
Keep email message along with other related documents. Store messages
for legal compliance.
Keep audit trail of email messages. Standardize your organization's email
storage policy. Read messages in plain text without annoying images and
popups. Reduce mailbox size. Automatically save messages with rules
wizard.
Process Outlook messages with custom scripts. Easily view/save full
Internet (RFC 822) message headers.”
Intended CERP Use SIA used it to convert PST files to MBOX format within MS Outlook
application. Since most of RAC’s testbed messages were not PSTs, this
tool was not tested.
Results
SIA adopted this tool for PST conversion. Application only works with
Outlook but handles idiosyncrasies well by creating complete MBOX files
that are RFC2822-compliant. It does not appear to report on status of
email message (read, unread). It also produces a sender-subject log.
URL
http://www.techhit.com/messagesave
Product
Metadata Extractor
Description
National Library of New Zealand open-source tool that reads header
information of a file to extract preservation metadata.
Vendor information “The Metadata Extraction Tool was developed by the National Library of
New Zealand to programmatically extract preservation metadata from a
range of file formats like PDF documents, image files, sound files
Microsoft office documents, and many others. The tool was initially
developed in 2003 and released as open source software in 2007. The
current version can be downloaded from the SourceForge download
page. Purpose of the Metadata Extraction Tool: The Tool builds on the
Library's work on digital preservation, and its logical preservation
metadata schema. It is designed to automatically extract preservation-
related metadata from digital files; and to output that metadata in a
standard format (XML) for use in preservation activities. The Tool was
designed for preservation processes and activities, but can be used to for
other tasks, such as the extraction of metadata for resource discovery.”
Intended CERP Use SIA tested it on some extracted attachments and PSTs for metadata
reports. RAC did not test this tool.
Results
Results produced file name, extension, size, mimetype, and date and
time. It detected Microsoft Office formats. SIA discontinued testing this
application due to time constraints.
URL
http://meta-extractor.sourceforge.net
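A report with the same kinds of fields (file name, extension, size, MIME type, date and time) can be approximated with Python's standard library, as in the sketch below. It is not the Metadata Extraction Tool's method or schema: the XML element names are hypothetical, and the MIME type here is guessed from the extension alone rather than read from the file's header.

```python
import mimetypes
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def describe_file(path):
    """Build a small XML record of basic file metadata (element names are illustrative)."""
    info = os.stat(path)
    root = ET.Element("file")
    ET.SubElement(root, "name").text = os.path.basename(path)
    ET.SubElement(root, "extension").text = os.path.splitext(path)[1]
    ET.SubElement(root, "size").text = str(info.st_size)
    # guess_type() looks only at the extension, so a renamed file would be misreported.
    ET.SubElement(root, "mimetype").text = mimetypes.guess_type(path)[0] or "unknown"
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    ET.SubElement(root, "modified").text = modified.isoformat()
    return root
```

Calling ET.tostring(describe_file(path), encoding="unicode") returns the record as an XML string.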
bookmarks, convert text to HTML on-the-fly, and take charge of your
code. Use a simple, power-packed scripting language to create anything
from a text macro to a mini-application.”
Intended CERP Use SIA used it initially to create EAD in conjunction with the EAD Cookbook
2002. RAC did not test this tool.
Results
NoteTab served as a useful tool for learning about EAD at SIA. Setup was
manual and time-consuming. SIA discontinued using it for EAD after
working with oXygen (see below).
URL
http://www.notetab.com
Product
oXygen XML editor (trial available)
Description
SyncRO Soft Ltd., Romania, software is a multi-platform XML editor.
Vendor information “<oXygen/> is a complete cross platform XML editor providing the tools
for XML authoring, XML conversion, XML Schema, DTD, Relax NG and
Schematron development, XPath, XSLT, XQuery debugging, SOAP and
WSDL testing.
The integration with the XML document repositories is made through the
WebDAV, Subversion and S/FTP protocols. <oXygen/> has also support
to browse, manage and query native XML and relational databases.
The <oXygen/> XML editor is also available as an Eclipse IDE plugin,
bringing unique XML development features to this widely used Java IDE.”
Intended CERP Use RAC and SIA used it to create EAD and validate XML from email parser.
Results
SIA adopted this tool for XML work but uses other applications as well. It
offers a clean display and numerous functions such as validation while
typing. However, it had problems opening larger XML files (41 MB and larger)
for editing/validating. RAC had successful results for XML editing
and validating.
URL
http://www.oxygenXML.com
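When an XML file is too large to open comfortably in an editor, a basic well-formedness check can still be run as a stream. The sketch below uses Python's standard xml.etree.ElementTree module. It checks well-formedness only, not validity against a schema, and the element names in the usage lines are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def is_well_formed(source):
    """Stream through XML (file path or binary file object); True if it parses to the end."""
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            # Discard each element's content once its end tag is seen to keep memory use down.
            elem.clear()
        return True
    except ET.ParseError:
        return False


print(is_well_formed(io.BytesIO(b"<Account><Message/></Account>")))
print(is_well_formed(io.BytesIO(b"<Account><Message></Account>")))
```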
Product
Stylus Studio (trial available)
Description
Progress Software Corp software for XML editing.
Vendor information “Stylus Studio 2009 XML Enterprise Suite raises the bar for productivity in
XML development tools. Approximately 2 Million XML developers and
data integration specialists turn to Stylus Studio's comprehensive and
intuitive XML toolset to tackle today's advanced XML data transformation
and aggregation challenges. … Whether you are developing Web
applications that transform relational data to XML, leveraging legacy data
in EDI, HL7, or other flat file formats like CSV, or wrestling with complex
XSLT stylesheets, Stylus Studio helps you realize the promise of existing
and emerging XML technologies. From XQuery to XML Pipelines to Java or
C# for .NET code generation, Stylus Studio is the one XML IDE that does it
all.”
Intended CERP Use SIA used it to validate XML from email parser. RAC did not use this tool.
Vendor information “Got a file via email and don't know how to open it? Have a .DOC file that
Office won't open?
Downloaded a document and don't know what archiver was used to
compress it?
Check what TrID can tell about it!”
Intended CERP Use SIA used it to discover format of single files when extension was missing.
RAC did not test this tool.
Results
Identification results are fast and offer alternative formats as well. SIA
adopted this tool in cases of single files.
URL
http://mark0.net/onlinetrid.aspx