11.6 Describing Language Resources Using OLAC Metadata
Members of the NLP community have a common need for discovering language
resources with high precision and recall. The solution which has been developed by the
Digital Libraries community involves metadata aggregation.
What Is Metadata?
The simplest definition of metadata is “structured data about data.” Metadata is
descriptive information about an object or resource, whether it be physical or electronic.
Although the term “metadata” itself is relatively new, the underlying concepts behind
metadata have been in use for as long as collections of information have been organized.
Library catalogs represent a well-established type of metadata; they have served as
collection management and resource discovery tools for decades. Metadata can be
generated either “by hand” or automatically using software.
The Dublin Core Metadata Initiative began in 1995 to develop conventions for finding,
sharing, and managing information. The Dublin Core metadata elements represent a
broad, interdisciplinary consensus about the core set of elements that are likely to be
widely useful to support resource discovery. The Dublin Core consists of 15 metadata
elements, where each element is optional and repeatable: Title, Creator, Subject,
Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language,
Relation, Coverage, and Rights. This metadata set can be used to describe resources
that exist in digital or traditional formats.
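Because every Dublin Core element is optional and repeatable, one natural in-memory representation is a mapping from element name to a list of values. The following is a minimal sketch under that assumption; the function name make_record and the dictionary layout are illustrative choices, not part of the Dublin Core standard.

```python
# The 15 Dublin Core element names listed above, lowercased.
DUBLIN_CORE_ELEMENTS = frozenset([
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
])


def make_record(**fields):
    """Build a record as a dict mapping element name -> list of values.

    make_record is a hypothetical helper. Each element is optional
    (it may be absent) and repeatable (it may carry several values),
    so every value is stored in a list.
    """
    record = {}
    for name, value in fields.items():
        if name not in DUBLIN_CORE_ELEMENTS:
            raise ValueError("not a Dublin Core element: %r" % name)
        # Accept either a single string or a sequence of strings.
        record[name] = [value] if isinstance(value, str) else list(value)
    return record


record = make_record(
    creator="Evans, Nicholas D.",
    subject=["Kayardild"],
    language=["English"],
)
print(record["creator"])
```

Elements that are not supplied are simply absent from the dictionary, and a repeated element is expressed by passing a list with more than one value.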
The Open Archives Initiative (OAI) provides a common framework across digital
repositories of scholarly materials, regardless of their type, including documents, data,
software, recordings, physical artifacts, digital surrogates, and so forth. Each repository
consists of a network-accessible server offering public access to archived items. Each
item has a unique identifier, and is associated with a Dublin Core metadata record (and
possibly additional records in other formats). The OAI defines a protocol for metadata
search services to “harvest” the contents of repositories.
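The harvesting protocol (OAI-PMH, the OAI Protocol for Metadata Harvesting) works over HTTP: a harvester sends a request whose query string names a verb such as ListRecords, together with arguments such as metadataPrefix=oai_dc for Dublin Core records. The sketch below only builds such a request URL; the repository address is a hypothetical placeholder, and harvest_url is an illustrative helper rather than part of any library.

```python
from urllib.parse import urlencode


def harvest_url(base_url, verb="ListRecords", **arguments):
    """Return the URL for an OAI-PMH request against a repository.

    harvest_url is a hypothetical helper; base_url is the
    repository's OAI-PMH endpoint.
    """
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# Hypothetical endpoint, used only for illustration.
BASE_URL = "http://repository.example.org/oai"

# Ask for all records in the Dublin Core metadata format.
print(harvest_url(BASE_URL, metadataPrefix="oai_dc"))
```

Fetching the resulting URL (for instance with urllib.request.urlopen) would return an XML document containing the metadata records, which a search service can then index.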
OLAC: Open Language Archives Community
The Open Language Archives Community, or OLAC, is an international partnership
of institutions and individuals who are creating a worldwide virtual library of language
resources by: (i) developing consensus on best current practices for the digital archiving
of language resources, and (ii) developing a network of interoperating repositories and
services for housing and accessing such resources. OLAC’s home on the Web is at http:
OLAC Metadata is a standard for describing language resources. Uniform description
across repositories is ensured by limiting the values of certain metadata elements to the
use of terms from controlled vocabularies. OLAC metadata can be used to describe
data and tools, in both physical and digital formats. OLAC metadata extends the
Dublin Core Metadata Set, a widely accepted standard for describing resources of all
types. To this core set, OLAC adds descriptors to cover fundamental properties of
language resources, such as subject language and linguistic type. Here’s an example of
a complete OLAC record:
<?xml version="1.0" encoding="UTF-8"?>
<creator>Evans, Nicholas D.</creator>
<subject xsi:type="olac:language" olac:code="gyd">Kayardild</subject>
<language xsi:type="olac:language" olac:code="en">English</language>
<description>Kayardild Grammar (ISBN 3110127954)</description>
<publisher>Berlin - Mouton de Gruyter</publisher>
<format>hardcover, 837 pages</format>
<relation>related to ISBN 0646119966</relation>
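A record like this can be read with Python's standard xml.etree.ElementTree module. The sketch below is hedged in one important way: the elements shown above use the xsi: and olac: prefixes, so to parse them we assume a wrapping olac:olac root element that declares the OLAC, Dublin Core, and XML Schema instance namespaces. That wrapper is our assumption for illustration and is not taken from the listing.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Assumption: the <olac:olac> wrapper and its namespace declarations
# are added here so that the prefixed attributes can be resolved.
RECORD = """<?xml version="1.0" encoding="UTF-8"?>
<olac:olac xmlns:olac="%s" xmlns="%s" xmlns:xsi="%s">
  <creator>Evans, Nicholas D.</creator>
  <subject xsi:type="olac:language" olac:code="gyd">Kayardild</subject>
  <language xsi:type="olac:language" olac:code="en">English</language>
  <description>Kayardild Grammar (ISBN 3110127954)</description>
  <publisher>Berlin - Mouton de Gruyter</publisher>
  <format>hardcover, 837 pages</format>
  <relation>related to ISBN 0646119966</relation>
</olac:olac>
""" % (OLAC, DC, XSI)

root = ET.fromstring(RECORD.encode("utf-8"))
for element in root:
    name = element.tag.split("}")[1]        # strip the "{namespace}" prefix
    code = element.get("{%s}code" % OLAC)   # controlled-vocabulary code, or None
    print(name, "|", element.text, "|", code)
```

The olac:code attribute is where the controlled vocabulary appears: the subject language is identified by the code gyd and the language of the resource by en, regardless of how the language names are spelled in the element text.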
Participating language archives publish their catalogs in an XML format, and these
records are regularly “harvested” by OLAC services using the OAI protocol. In addition
to this software infrastructure, OLAC has documented a series of best practices for
describing language resources, through a process that involved extended consultation
with the language resources community (e.g., see http://www.language-archives.org/
OLAC repositories can be searched using a query engine on the OLAC website.
Searching for “German lexicon” finds the following resources, among others:
• CALLHOME German Lexicon, at http://www.language-archives.org/item/oai:
• MULTILEX multilingual lexicon, at http://www.language-archives.org/item/oai:el
• Slelex Siemens Phonetic lexicon, at http://www.language-archives.org/item/oai:elra
Searching for “Korean” finds a newswire corpus, a treebank, a lexicon, a
child-language corpus, and interlinear glossed texts. It also finds software, including a
syntactic analyzer and a morphological analyzer.
Observe that the previous URLs include a substring beginning with oai:. This is an
OAI identifier, using a URI scheme registered with ICANN (the Internet Corporation
for Assigned Names and Numbers). These identifiers have the format
oai:archive:local_id, where oai is the name of the URI scheme, archive is an archive
identifier, and local_id is the resource identifier assigned by the archive.
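The three-part structure of an OAI identifier can be sketched in Python as follows, assuming the colon-separated form oai:archive:local_id. The function name and the example identifier are hypothetical and chosen only for illustration.

```python
def parse_oai_identifier(identifier):
    """Split an OAI identifier into (archive, local_id).

    parse_oai_identifier is a hypothetical helper. It assumes the
    form "oai:archive:local_id"; the local identifier may itself
    contain colons, so we split at most twice.
    """
    scheme, archive, local_id = identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError("not an OAI identifier: %r" % identifier)
    return archive, local_id


# Hypothetical identifier, not taken from a real archive.
archive, local_id = parse_oai_identifier("oai:archive.example.org:item-42")
print(archive)
print(local_id)
```

An identifier with fewer than three parts, or with a scheme other than oai, raises a ValueError.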
Given an OAI identifier for an OLAC resource, it is possible to retrieve the complete
XML record for the resource using a URL of the following form:
11.7 Summary
• Fundamental data types, present in most corpora, are annotated texts and lexicons.
Texts have a temporal structure, whereas lexicons have a record structure.
• The life cycle of a corpus includes data collection, annotation, quality control, and
publication. The life cycle continues after publication as the corpus is modified
and enriched during the course of research.
• Corpus development involves a balance between capturing a representative sample
of language usage, and capturing enough material from any one source or genre to
be useful; multiplying out the dimensions of variability is usually not feasible
because of resource limitations.
• XML provides a useful format for the storage and interchange of linguistic data,
but provides no shortcuts for solving pervasive data modeling problems.
• Toolbox format is widely used in language documentation projects; we can write
programs to support the curation of Toolbox files, and to convert them to XML.
• The Open Language Archives Community (OLAC) provides an infrastructure for
documenting and discovering language resources.
11.8 Further Reading
Extra materials for this chapter are posted at http://www.nltk.org/, including links to
freely available resources on the Web.
The primary sources of linguistic corpora are the Linguistic Data Consortium and the
European Language Resources Agency, both with extensive online catalogs. More
details concerning the major corpora mentioned in the chapter are available: American
National Corpus (Reppen, Ide & Suderman, 2005), British National Corpus (BNC,
1999), Thesaurus Linguae Graecae (TLG, 1999), Child Language Data Exchange
System (CHILDES) (MacWhinney, 1995), and TIMIT (Garofolo et al., 1986).
Two special interest groups of the Association for Computational Linguistics that
organize regular workshops with published proceedings are SIGWAC, which promotes
the use of the Web as a corpus and has sponsored the CLEANEVAL task for removing
HTML markup, and SIGANN, which is encouraging efforts toward interoperability of
linguistic annotations. An extended discussion of web crawling is provided by (Croft,
Metzler & Strohman, 2009).
Full details of the Toolbox data format are provided with the distribution (Buseman,
Buseman & Early, 1996); the latest distribution is freely available from
http://www.sil.org/computing/toolbox/. For guidelines on the process of constructing a
Toolbox lexicon, see http://www.sil.org/computing/ddp/. More examples of our efforts with
the Toolbox are documented in (Bird, 1999) and (Robinson, Aumann & Bird, 2007).
Dozens of other tools for linguistic data management are available, some surveyed by
(Bird & Simons, 2003). See also the proceedings of the LaTeCH workshops on language
technology for cultural heritage data.
There are many excellent resources for XML (e.g., http://zvon.org/) and for writing
Python programs to work with XML (http://www.python.org/doc/lib/markup.html).
Many editors have XML modes. XML formats for lexical information include OLIF
(http://www.olif.net/) and LIFT (http://code.google.com/p/lift-standard/).
For a survey of linguistic annotation software, see the Linguistic Annotation Page at
http://www.ldc.upenn.edu/annotation/. The initial proposal for standoff annotation was
(Thompson & McKelvie, 1997). An abstract data model for linguistic annotations,
called “annotation graphs,” was proposed in (Bird & Liberman, 2001). A
general-purpose ontology for linguistic description (GOLD) is documented at http://www.lin
For guidance on planning and constructing a corpus, see (Meyer, 2002) and (Farghaly,
2003). More details of methods for scoring inter-annotator agreement are available in
(Artstein & Poesio, 2008) and (Pevzner & Hearst, 2002).
Rotokas data was provided by Stuart Robinson, and Iu Mien data was provided by Greg
For more information about the Open Language Archives Community, visit
http://www.language-archives.org/, or see (Simons & Bird, 2003).
11.9 Exercises
1.◑ In Example 11-2 the new field appeared at the bottom of the entry. Modify this
program so that it inserts the new subelement right after the
field. (Hint: create
, assign a text value to it, then use the
method of the parent element.)
2.◑ Write a function that deletes a specified field from a lexical entry. (We could use
this to sanitize our lexical data before giving it to others, e.g., by removing fields
containing irrelevant or uncertain content.)
3.◑ Write a program that scans an HTML dictionary file to find entries having an
illegal part-of-speech field, and then reports the headword for each entry.