Recommendations for metadata and data formats for
online availability and long-term preservation, version
Succeed is supported by the European Union under FP7-ICT and coordinated by Universidad de Alicante.
It most commonly serves as an extension schema within the METS administrative
metadata section, in order to preserve heritage content. However, an ALTO instance
can also exist as a standalone document, used independently of METS.
Advantages and drawbacks
ALTO benefits from the XML world:
XML is readable and understandable, even by novices, and no more difficult to
code than HTML.
ALTO schema is quite simple, and therefore, ALTO contents are easily
XML is completely interoperable: any application that can process XML can use
your information, regardless of platform.
ALTO contents can be distributed between libraries, are interoperable, etc.
XML contents are transformable: ALTO contents can be transformed into
simple text files, HTML pages, etc.
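The transformation of ALTO contents into a simple text file can be sketched as follows. This is a minimal illustration in Python, not a reference implementation: the sample document is a simplified, hypothetical fragment (it is not claimed to be schema-valid), and the function name alto_to_text is an assumption made for this example. The sketch matches elements by local name so that it does not depend on a particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical ALTO fragment used only for illustration.
SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Hello"/>
            <SP/>
            <String CONTENT="ALTO"/>
          </TextLine>
          <TextLine>
            <String CONTENT="second"/>
            <SP/>
            <String CONTENT="line"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return the text of an ALTO document, one output line per TextLine."""
    root = ET.fromstring(alto_xml.encode("utf-8"))
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each word is stored in the CONTENT attribute of a String element.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    print(alto_to_text(SAMPLE_ALTO))
```

The sketch joins words with a single space and ignores hyphenation and reading-order issues, which a production conversion would need to handle.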
ALTO also inherits disadvantages of XML:
Each XML language needs adequate processing applications to display, transform
ALTO needs specific tools (e.g. an ALTO file can't be displayed in a web browser)
XML is extensible: the ALTO XML schema can be altered locally (e.g. ALTO BnF)
Besides, ALTO has shown some other limitations:
Physical description: the layout region types supported by ALTO are limited. One
may want finer distinctions: mathematical content, music scores, etc.
Logical description: the ALTO format captures the layout and the full text of OCRed
pages, but one may also want to mark the logical structure of documents. This can be
done with a container format like METS in association with ALTO (to capture the
intellectual structure of the document), and/or with logical labelling of structural
elements in ALTO (page numbers, margin notes, etc.)
These limitations will be addressed by the next version of the ALTO format, which is
planned to be published in January 2014.
Page Analysis and Ground-Truth Elements (PAGE)
Page Analysis and Ground-Truth Elements (PAGE) is a format framework related to
production and evaluation of Optical Character Recognition and Document Image
Analysis results. One of the main design goals was to enable “a highly detailed and
accurate description of any information which can be derived from a given document
image” (S. Pletschacher, 2010), overcoming limitations of existing formats (like ALTO)
and allowing its use in applications requiring a very precise content representation (such
as performance evaluation). PAGE is based on a number of XML Schemas which specify
a root structure and individual sub-formats. All schemas are maintained by the PRImA
Research Lab and are publicly available at http://schema.primaresearch.org/PAGE/.
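To give an idea of the level of detail a PAGE description can carry, the following Python sketch reads text regions, their types and their outline polygons from a small page-content fragment. It rests on several assumptions: the sample is a simplified, hypothetical fragment with the namespace declaration omitted, region outlines are assumed to be stored in a points attribute on the Coords element (other schema versions may encode coordinates differently), and the names read_regions and parse_points are invented for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical PAGE fragment (namespace omitted on purpose).
SAMPLE_PAGE = """<PcGts>
  <Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1" type="heading">
      <Coords points="100,100 900,100 900,200 100,200"/>
      <TextEquiv><Unicode>Chapter One</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <Coords points="100,250 900,250 900,800 500,820 100,800"/>
      <TextEquiv><Unicode>Body text.</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child_named(element, name):
    """Return the first direct child with the given local name, or None."""
    for child in element:
        if local_name(child.tag) == name:
            return child
    return None


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_regions(page_xml):
    """Collect id, type, outline polygon and text of every TextRegion."""
    root = ET.fromstring(page_xml)
    regions = []
    for element in root.iter():
        if local_name(element.tag) != "TextRegion":
            continue
        coords = child_named(element, "Coords")
        polygon = parse_points(coords.get("points", "")) if coords is not None else []
        # Only the region-level text is read; nested lines or words are ignored.
        text = ""
        text_equiv = child_named(element, "TextEquiv")
        if text_equiv is not None:
            unicode_el = child_named(text_equiv, "Unicode")
            if unicode_el is not None and unicode_el.text:
                text = unicode_el.text
        regions.append({
            "id": element.get("id"),
            "type": element.get("type"),
            "polygon": polygon,
            "text": text,
        })
    return regions


if __name__ == "__main__":
    for region in read_regions(SAMPLE_PAGE):
        print(region["id"], region["type"], len(region["polygon"]), "points")
```

The second region in the sample has five outline points, illustrating that a region is described by an arbitrary polygon rather than by a bounding rectangle alone.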