Despite the complexities and idiosyncrasies of individual corpora, at base they are col-
lections of texts together with record-structured data. The contents of a corpus are
often biased toward one or the other of these types. For example, the Brown Corpus
contains 500 text files, but we still use a table to relate the files to 15 different genres.
At the other end of the spectrum, WordNet contains 117,659 synset records, yet it
incorporates many example sentences (mini-texts) to illustrate word usages. TIMIT is
an interesting midpoint on this spectrum, containing substantial free-standing material
of both the text and lexicon types.
11.2 The Life Cycle of a Corpus
Corpora are not born fully formed, but involve careful preparation and input from
many people over an extended period. Raw data needs to be collected, cleaned up,
documented, and stored in a systematic structure. Various layers of annotation might
be applied, some requiring specialized knowledge of the morphology or syntax of the
language. Success at this stage depends on creating an efficient workflow involving
appropriate tools and format converters. Quality control procedures can be put in place
to find inconsistencies in the annotations, and to ensure the highest possible level of
inter-annotator agreement. Because of the scale and complexity of the task, large cor-
pora may take years to prepare, and involve tens or hundreds of person-years of effort.
In this section, we briefly review the various stages in the life cycle of a corpus.
Three Corpus Creation Scenarios
In one type of corpus, the design unfolds in the course of the creator’s explorations.
This is the pattern typical of traditional “field linguistics,” in which material from elic-
itation sessions is analyzed as it is gathered, with tomorrow’s elicitation often based on
questions that arise in analyzing today’s. The resulting corpus is then used during sub-
sequent years of research, and may serve as an archival resource indefinitely. Comput-
erization is an obvious boon to work of this type, as exemplified by the popular program
Shoebox, now over two decades old and re-released as Toolbox (see Section 2.4). Other
software tools, even simple word processors and spreadsheets, are routinely used to
acquire the data. In the next section, we will look at how to extract data from these files.
Another corpus creation scenario is typical of experimental research where a body of
carefully designed material is collected from a range of human subjects, then analyzed
to evaluate a hypothesis or develop a technology. It has become common for such
databases to be shared and reused within a laboratory or company, and often to be
published more widely. Corpora of this type are the basis of the “common task” method
of research management, which over the past two decades has become the norm in
government-funded research programs in language technology. We have already en-
countered many such corpora in the earlier chapters; we will see how to write Python
programs to implement the kinds of curation tasks that are necessary before such cor-
pora are published.
Finally, there are efforts to gather a “reference corpus” for a particular language, such
as the American National Corpus (ANC) and the British National Corpus (BNC). Here
the goal has been to produce a comprehensive record of the many forms, styles, and
uses of a language. Apart from the sheer challenge of scale, there is a heavy reliance on
automatic annotation tools together with post-editing to fix any errors. However, we
can write programs to locate and repair the errors, and also to analyze the corpus for balance.
Quality Control
Good tools for automatic and manual preparation of data are essential. However, the
creation of a high-quality corpus depends just as much on such mundane things as
documentation, training, and workflow. Annotation guidelines define the task and
document the markup conventions. They may be regularly updated to cover difficult
cases, along with new rules that are devised to achieve more consistent annotations.
Annotators need to be trained in the procedures, including methods for resolving cases
not covered in the guidelines. A workflow needs to be established, possibly with sup-
porting software, to keep track of which files have been initialized, annotated, validated,
manually checked, and so on. There may be multiple layers of annotation, provided by
different specialists. Cases of uncertainty or disagreement may require adjudication.
Large annotation tasks require multiple annotators, which raises the problem of
achieving consistency. How consistently can a group of annotators perform? We can
easily measure consistency by having a portion of the source material independently
annotated by two people. This may reveal shortcomings in the guidelines or differing
abilities with the annotation task. In cases where quality is paramount, the entire corpus
can be annotated twice, and any inconsistencies adjudicated by an expert.
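To make this concrete, here is a minimal sketch of raw observed agreement, the proportion of items on which two annotators assign the same category (the function is our own illustration, not an NLTK interface):

def observed_agreement(labels1, labels2):
    # Proportion of items receiving identical labels from both annotators.
    matches = sum(1 for a, b in zip(labels1, labels2) if a == b)
    return matches / float(len(labels1))

For instance, observed_agreement(['n', 'v', 'n', 'n'], ['n', 'v', 'v', 'n']) returns 0.75. Raw agreement like this is only a starting point; the Kappa coefficient discussed below corrects it for chance.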
It is considered best practice to report the inter-annotator agreement that was achieved
for a corpus (e.g., by double-annotating 10% of the corpus). This score serves as a
helpful upper bound on the expected performance of any automatic system that is
trained on this corpus.
Care should be exercised when interpreting an inter-annotator agree-
ment score, since annotation tasks vary greatly in their difficulty. For
example, 90% agreement would be a terrible score for part-of-speech
tagging, but an exceptional score for semantic role labeling.
The Kappa coefficient κ measures agreement between two people making category
judgments, correcting for expected chance agreement. For example, suppose an item
is to be annotated, and four coding options are equally likely. In this case, two people
coding randomly would be expected to agree 25% of the time. Thus, an agreement of
25% will be assigned κ = 0, and better levels of agreement will be scaled accordingly.
For an agreement of 50%, we would get κ = 0.333, as 50 is a third of the way from 25
to 100. Many other agreement measures exist; see (Artstein & Poesio, 2008) for details.
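To make the arithmetic explicit, here is the chance correction as a small function (a sketch of the definition, not a function from NLTK):

def kappa(observed, expected):
    # Rescale observed agreement so that chance-level agreement maps
    # to 0 and perfect agreement maps to 1.
    return (observed - expected) / (1.0 - expected)

Here kappa(0.25, 0.25) gives 0.0, and kappa(0.5, 0.25) gives about 0.333, reproducing the figures above.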
We can also measure the agreement between two independent segmentations of lan-
guage input, e.g., for tokenization, sentence segmentation, and named entity recogni-
tion. In Figure 11-4 we see three possible segmentations of a sequence of items which
might have been produced by annotators (or programs).

Figure 11-4. Three segmentations of a sequence: The small rectangles represent characters, words, sentences, in short, any sequence which might be divided into linguistic units; S1 and S2 are in close agreement, but both differ significantly from S3.

Although none of them agree completely, S1 and S2 are in close agreement, and we would like a suitable measure. Windowdiff is a simple algorithm for evaluating the agreement of two segmentations by
running a sliding window over the data and awarding partial credit for near misses. If
we preprocess our tokens into a sequence of zeros and ones, to record when a token is
followed by a boundary, we can represent the segmentations as strings and apply the windowdiff scorer:
>>> s1 = "00000010000000001000000"
>>> s2 = "00000001000000010000000"
>>> s3 = "00010000000000000001000"
>>> nltk.windowdiff(s1, s1, 3)
0
>>> nltk.windowdiff(s1, s2, 3)
4
>>> nltk.windowdiff(s2, s3, 3)
16
In this example, the window had a size of 3. The windowdiff computation slides this
window across a pair of strings. At each position it totals up the number of boundaries
found inside this window, for both strings, then computes the difference. These dif-
ferences are then summed. We can increase or shrink the window size to control the
sensitivity of the measure.
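One way to realize this description is the following sketch, which reproduces the unnormalized scores in the transcript above (note that more recent releases of NLTK normalize the result by the number of window positions, returning a small fraction instead):

def windowdiff_sketch(seg1, seg2, k):
    # Slide a window over both strings, scoring 1 whenever the two
    # segmentations disagree on the number of boundaries it contains.
    assert len(seg1) == len(seg2)
    wd = 0
    for i in range(len(seg1) - k):
        if seg1[i:i+k+1].count('1') != seg2[i:i+k+1].count('1'):
            wd += 1
    return wd

Applied to the strings above, windowdiff_sketch(s1, s2, 3) returns 4, and windowdiff_sketch(s2, s3, 3) returns 16.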
Curation Versus Evolution
As large corpora are published, researchers are increasingly likely to base their inves-
tigations on balanced, focused subsets that were derived from corpora produced for
entirely different reasons. For instance, the Switchboard database, originally collected
for speaker identification research, has since been used as the basis for published studies
in speech recognition, word pronunciation, disfluency, syntax, intonation, and dis-
course structure. The motivations for recycling linguistic corpora include the desire to
save time and effort, the desire to work on material available to others for replication,
and sometimes a desire to study more naturalistic forms of linguistic behavior than
would be possible otherwise. The process of choosing a subset for such a study may
count as a non-trivial contribution in itself.
In addition to selecting an appropriate subset of a corpus, this new work could involve
reformatting a text file (e.g., converting to XML), renaming files, retokenizing the text,
selecting a subset of the data to enrich, and so forth. Multiple research groups might
do this work independently, as illustrated in Figure 11-5. At a later date, should some-
one want to combine sources of information from different versions, the task will
probably be extremely onerous.
Figure 11-5. Evolution of a corpus over time: After a corpus is published, research groups will use it
independently, selecting and enriching different pieces; later research that seeks to integrate separate
annotations confronts the difficult challenge of aligning the annotations.
The task of using derived corpora is made even more difficult by the lack of any record
about how the derived version was created, and which version is the most up-to-date.
An alternative to this chaotic situation is for a corpus to be centrally curated, and for
committees of experts to revise and extend it at periodic intervals, considering sub-
missions from third parties and publishing new releases from time to time. Print dic-
tionaries and national corpora may be centrally curated in this way. However, for most
corpora this model is simply impractical.
A middle course is for the original corpus publication to have a scheme for identifying
any sub-part. Each sentence, tree, or lexical entry could have a globally unique identi-
fier, and each token, node, or field (respectively) could have a relative offset. Annota-
tions, including segmentations, could reference the source using this identifier scheme
(a method which is known as standoff annotation). This way, new annotations could
be distributed independently of the source, and multiple independent annotations of
the same source could be compared and updated without touching the source.
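As a minimal sketch of the idea (the identifiers and field names are invented for illustration):

# The published source: a sentence with a hypothetical global identifier.
source = {'s23': ['The', 'cat', 'sat', 'on', 'the', 'mat']}

# A standoff annotation is distributed separately and never modifies the
# source; it refers to it by identifier and relative token offsets.
annotation = {'sent': 's23', 'start': 0, 'end': 2, 'label': 'NP'}

print source[annotation['sent']][annotation['start']:annotation['end']]
# prints ['The', 'cat']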
If the corpus publication is provided in multiple versions, the version number or date
could be part of the identification scheme. A table of correspondences between
identifiers across editions of the corpus would permit any standoff annotations to be updated.
Sometimes an updated corpus contains revisions of base material that
has been externally annotated. Tokens might be split or merged, and
constituents may have been rearranged. There may not be a one-to-one
correspondence between old and new identifiers. It is better to cause
standoff annotations to break on such components of the new version
than to silently allow their identifiers to refer to incorrect locations.
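A sketch of this defensive policy, using an invented correspondence table between two editions:

# Old identifiers map to new ones; tokens that were split or merged in
# the new edition are deliberately absent from the table.
id_map = {'s23.t4': 's23.t4', 's23.t5': 's23.t6'}

def migrate(old_id):
    # Raises KeyError for an unmappable identifier, rather than silently
    # letting an annotation point at the wrong location.
    return id_map[old_id]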
11.3 Acquiring Data
Obtaining Data from the Web
The Web is a rich source of data for language analysis purposes. We have already
discussed methods for accessing individual files, RSS feeds, and search engine results
(see Section 3.1). However, in some cases we want to obtain large quantities of web text.
The simplest approach is to obtain a published corpus of web text. The ACL Special
Interest Group on Web as Corpus (SIGWAC) maintains a list of resources at http://
www.sigwac.org.uk/. The advantage of using a well-defined web corpus is that it is
documented, stable, and permits reproducible experimentation.
If the desired content is localized to a particular website, there are many utilities for
capturing all the accessible contents of a site, such as GNU Wget (http://www.gnu.org/
software/wget/). For maximal flexibility and control, a web crawler can be used, such
as Heritrix (http://crawler.archive.org/). Crawlers permit fine-grained control over
where to look, which links to follow, and how to organize the results. For example, if
we want to compile a bilingual text collection having corresponding pairs of documents
in each language, the crawler needs to detect the structure of the site in order to extract
the correspondence between the documents, and it needs to organize the downloaded
pages in such a way that the correspondence is captured. It might be tempting to write
your own web crawler, but there are dozens of pitfalls having to do with detecting
MIME types, converting relative to absolute URLs, avoiding getting trapped in cyclic
link structures, dealing with network latencies, avoiding overloading the site or being
banned from accessing the site, and so on.
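One of these pitfalls, fetching pages a site has asked crawlers to avoid, can at least be handled with the standard library; here is a minimal sketch (the URLs are placeholders):

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()
# Check permission before downloading a page.
print rp.can_fetch('*', 'http://www.example.com/some/page.html')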
Obtaining Data from Word Processor Files
Word processing software is often used in the manual preparation of texts and lexicons
in projects that have limited computational infrastructure. Such projects often provide
templates for data entry, though the word processing software does not ensure that the
data is correctly structured. For example, each text may be required to have a title and
date. Similarly, each lexical entry may have certain obligatory fields. As the data grows
in size and complexity, a larger proportion of time may be spent maintaining its consistency.
How can we extract the content of such files so that we can manipulate it in external
programs? Moreover, how can we validate the content of these files to help authors
create well-structured data, so that the quality of the data can be maximized in the
context of the original authoring process?
Consider a dictionary in which each entry has a part-of-speech field, drawn from a set
of 20 possibilities, displayed after the pronunciation field, and rendered in 11-point
bold type. No conventional word processor has search or macro functions capable of
verifying that all part-of-speech fields have been correctly entered and displayed. This
task requires exhaustive manual checking. If the word processor permits the document
to be saved in a non-proprietary format, such as text, HTML, or XML, we can some-
times write programs to do this checking automatically.
Consider the following fragment of a lexical entry: “sleep [sli:p] v.i. condition of body
and mind...”. We can key in such text using MSWord, then “Save as Web Page,” then
inspect the resulting HTML file:
<p class=MsoNormal>sleep
  <span style='mso-spacerun:yes'> </span>
  [<span class=SpellE>sli:p</span>]
  <span style='mso-spacerun:yes'> </span>
  <b><span style='font-size:11.0pt'>v.i.</span></b>
  <span style='mso-spacerun:yes'> </span>
  <i>a condition of body and mind ...<o:p></o:p></i>
</p>
Observe that the entry is represented as an HTML paragraph, using the <p> element,
and that the part of speech appears inside a <span style='font-size:11.0pt'> element.
The following program defines the set of legal parts-of-speech, legal_pos. Then it
extracts all 11-point content from the dict.htm file and stores it in the set used_pos.
Observe that the search pattern contains a parenthesized sub-expression; only the
material that matches this sub-expression is returned by re.findall. Finally, the
program constructs the set of illegal parts-of-speech as the set difference between
used_pos and legal_pos:
>>> legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
>>> pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
>>> document = open("dict.htm").read()
>>> used_pos = set(re.findall(pattern, document))
>>> illegal_pos = used_pos.difference(legal_pos)
>>> print list(illegal_pos)
['v.i', 'intrans']
This simple program represents the tip of the iceberg. We can develop sophisticated
tools to check the consistency of word processor files, and report errors so that the
maintainer of the dictionary can correct the original file using the original word processor.
Once we know the data is correctly formatted, we can write other programs to convert
the data into a different format. The program in Example 11-1 strips out the HTML
markup using nltk.clean_html(), extracts the words and their pronunciations, and
generates output in “comma-separated value” (CSV) format.
Example 11-1. Converting HTML created by Microsoft Word into comma-separated values.
import re
import nltk

def lexical_data(html_file):
    SEP = '_ENTRY'
    html = open(html_file).read()
    html = re.sub(r'<p', SEP + '<p', html)   # mark the start of each entry
    text = nltk.clean_html(html)             # strip the HTML markup
    text = ' '.join(text.split())            # normalize whitespace
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)
>>> import csv
>>> writer = csv.writer(open("dict1.csv", "wb"))
>>> writer.writerows(lexical_data("dict.htm"))
Obtaining Data from Spreadsheets and Databases
Spreadsheets are often used for acquiring wordlists or paradigms. For example, a com-
parative wordlist may be created using a spreadsheet, with a row for each cognate set
and a column for each language (see nltk.corpus.swadesh and www.rosettaproject
.org). Most spreadsheet software can export data in CSV format. As we will see later,
it is easy for Python programs to access these using the csv module.
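For instance, here is a sketch of loading such an export (dict.csv is a hypothetical filename):

import csv

# Print each row of the spreadsheet export as a list of strings.
for row in csv.reader(open('dict.csv')):
    print row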
Sometimes lexicons are stored in a full-fledged relational database. When properly
normalized, these databases can ensure the validity of the data. For example, we can
require that all parts-of-speech come from a specified vocabulary by declaring that the
part-of-speech field is an enumerated type or a foreign key that references a separate
part-of-speech table. However, the relational model requires that the structure of the data
(the schema) be declared in advance, and this runs counter to the dominant approach
to structuring linguistic data, which is highly exploratory. Fields which were assumed
to be obligatory and unique often turn out to be optional and repeatable. A relational
database can accommodate this when it is fully known in advance; however, if it is not,
or if just about every property turns out to be optional or repeatable, the relational
approach is unworkable.
Nevertheless, when our goal is simply to extract the contents from a database, it is
enough to dump out the tables (or SQL query results) in CSV format and load them
into our program. Our program might perform a linguistically motivated query that
cannot easily be expressed in SQL, e.g., select all words that appear in example sentences
for which no dictionary entry is provided. For this task, we would need to extract enough
information from a record for it to be uniquely identified, along with the headwords
and example sentences. Let’s suppose this information was now available in a CSV file.
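As a sketch of such a query, assuming the dump has headword, pronunciation, part-of-speech, and example-sentence columns:

import csv

headwords = set()
example_words = set()
for headword, pron, pos, example in csv.reader(open('dict.csv')):
    headwords.add(headword)
    example_words.update(example.split())

# Words used in example sentences that have no entry of their own.
print sorted(example_words.difference(headwords))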