These and many other language resources have been documented using OLAC Metadata,
and can be searched via the OLAC home page at http://www.language-archives.org/.
Corpora List (see http://gandalf.aksis.uib.no/corpora/sub.html) is a mailing list for
discussions about corpora, and you can find resources by searching the list archives or
posting to the list. The most complete inventory of the world’s languages is Ethnologue,
http://www.ethnologue.com/. Of 7,000 languages, only a few dozen have substantial
digital resources suitable for use in NLP.
This chapter has touched on the field of Corpus Linguistics. Other useful books in
this area include (Biber, Conrad, & Reppen, 1998), (McEnery, 2006), (Meyer, 2002),
(Sampson & McCarthy, 2005), and (Scott & Tribble, 2006). Further readings in
quantitative data analysis in linguistics are: (Baayen, 2008), (Gries, 2009), and (Woods,
Fletcher, & Hughes, 1986).
The original description of WordNet is (Fellbaum, 1998). Although WordNet was
originally developed for research in psycholinguistics, it is now widely used in NLP and
Information Retrieval. WordNets are being developed for many other languages, as
documented at http://www.globalwordnet.org/. For a study of WordNet similarity
measures, see (Budanitsky & Hirst, 2006).
Other topics touched on in this chapter were phonetics and lexical semantics, and we
refer readers to Chapters 7 and 20 of (Jurafsky & Martin, 2008).
1.○ Create a variable containing a list of words. Experiment with the operations
described in this chapter, including addition, multiplication, indexing, slicing,
and sorting.
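As a starting point, the operations named in this exercise can be sketched as follows. The variable name and the words in it are our own choices for illustration, not ones prescribed by the exercise.

```python
# Hypothetical variable name and contents, chosen for illustration.
phrase = ["colorless", "green", "ideas", "sleep", "furiously"]

longer = phrase + ["tonight"]   # addition concatenates two lists
doubled = phrase * 2            # multiplication repeats the list
first = phrase[0]               # indexing starts at zero
middle = phrase[1:3]            # slicing excludes the end index
ordered = sorted(phrase)        # sorted() returns a new, alphabetized list

print(longer, doubled, first, middle, ordered, sep="\n")
```

Note that sorted() leaves the original list unchanged, so phrase can be reused in later experiments.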
2.○ Use the corpus module to explore
. How many word
tokens does this book have? How many word types?
3.○ Use the Brown Corpus reader or the Web Text Corpus reader to access some
sample text in two different genres.
4.○ Read in the texts of the State of the Union addresses, using the
reader. Count occurrences of
in each document. What has
happened to the usage of these words over time?
5.○ Investigate the holonym-meronym relations for some nouns. Remember that
there are three kinds of holonym-meronym relation, so you need to use
6.○ In the discussion of comparative wordlists, we created an object called
, which you could look up using words in both German and Italian in order
74 | | Chapter 2: Accessing Text Corpora and Lexical Resources
to get corresponding words in English. What problem might arise with this approach?
Can you suggest a way to avoid this problem?
7.○ According to Strunk and White’s Elements of Style, the word however, used at
the start of a sentence, means “in whatever way” or “to whatever extent,” and not
“nevertheless.” They give this example of correct usage: However you advise him,
he will probably do as he thinks best. (http://www.bartleby.com/141/strunk3.html)
Use the concordance tool to study actual usage of this word in the various texts we
have been considering. See also the LanguageLog posting “Fossilized prejudices
about ‘however’” at http://itre.cis.upenn.edu/~myl/languagelog/archives/001913
8.◑ Define a conditional frequency distribution over the Names Corpus that allows
you to see which initial letters are more frequent for males versus females (see
9.◑ Pick a pair of texts and study the differences between them, in terms of vocabulary,
vocabulary richness, genre, etc. Can you find pairs of words that have quite
different meanings across the two texts, such as monstrous in Moby Dick and in
Sense and Sensibility?
10.◑ Read the BBC News article: “UK’s Vicky Pollards ‘left behind’” at
http://news.bbc.co.uk/1/hi/education/6173441.stm. The article gives the following statistic
about teen language: “the top 20 words used, including yeah, no, but and like,
account for around a third of all words.” How many word types account for a third
of all word tokens, for a variety of text sources? What do you conclude about this
statistic? Read more about this on LanguageLog, at http://itre.cis.upenn.edu/~myl/
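One way to approach the counting step in this exercise is sketched below. It assumes the text is already available as a list of tokens. The function name types_covering and the sample sentence are hypothetical, and the standard library's Counter stands in for a frequency distribution.

```python
from collections import Counter

def types_covering(tokens, fraction=1/3):
    """Return how many of the most frequent word types are needed
    to cover at least `fraction` of all tokens."""
    counts = Counter(tokens)
    target = fraction * len(tokens)
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target:
            return n
    return 0  # only reached for an empty token list

# A toy sample; the exercise calls for a variety of real text sources.
sample = "the cat sat on the mat and the dog sat on the log".split()
print(types_covering(sample))
```

Running the same function over several corpora gives a basis for judging how unusual the quoted statistic really is.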
11.◑ Investigate the table of modal distributions and look for other patterns. Try to
explain them in terms of your own impressionistic understanding of the different
genres. Can you find other closed classes of words that exhibit significant differences
across different genres?
12.◑ The CMU Pronouncing Dictionary contains multiple pronunciations for certain
words. How many distinct words does it contain? What fraction of words in this
dictionary have more than one possible pronunciation?
13.◑ What percentage of noun synsets have no hyponyms? You can get all noun syn-
14.◑ Define a function
that takes a synset
as its argument and returns
a string consisting of the concatenation of the definition of
, and the definitions
of all the hypernyms and hyponyms of
15.◑ Write a program to find all words that occur at least three times in the Brown
16.◑ Write a program to generate a table of lexical diversity scores (i.e., token/type
ratios), as we saw in Table 1-1. Include the full set of Brown Corpus genres
2.8 Exercises s | | 75
). Which genre has the lowest diversity (greatest
number of tokens per type)? Is this what you would have expected?
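The table itself can be sketched as below, using two toy token lists in place of the Brown Corpus genres. The function name tokens_per_type and the sample sentences are our own. We assume the score is the number of tokens divided by the number of types, so that a higher score means lower diversity.

```python
def tokens_per_type(tokens):
    """Average number of times each word type is used."""
    return len(tokens) / len(set(tokens))

# Toy stand-ins for genres; the exercise calls for the Brown Corpus genres.
genres = {
    "news": "the mayor said the city will fund the new bridge".split(),
    "fiction": "she ran and ran and ran until she could run no more".split(),
}

# One row per genre: name, tokens, types, tokens per type.
for genre, tokens in sorted(genres.items()):
    print(f"{genre:<10}{len(tokens):>8}{len(set(tokens)):>8}"
          f"{tokens_per_type(tokens):>8.2f}")
```

To apply this to a real corpus, replace the dictionary with one that maps each genre name to its list of words.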
17.◑ Write a function that finds the 50 most frequently occurring words of a text that
are not stopwords.
18.◑ Write a program to print the 50 most frequent bigrams (pairs of adjacent words)
of a text, omitting bigrams that contain stopwords.
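A minimal sketch of the bigram exercise follows. It uses only the standard library, and the stopword set shown is a tiny hypothetical one. In practice you would substitute a full stopword list and a real text.

```python
from collections import Counter

# A tiny hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "on"}

def frequent_bigrams(tokens, n=50):
    """Return the n most frequent adjacent word pairs in which
    neither word is a stopword."""
    words = [w.lower() for w in tokens]
    pairs = zip(words, words[1:])  # each word paired with its successor
    counts = Counter(p for p in pairs
                     if p[0] not in STOPWORDS and p[1] not in STOPWORDS)
    return counts.most_common(n)

text = "the old man saw the old man on the hill and the old dog".split()
print(frequent_bigrams(text, n=3))
```

The same filtering idea, applied to single words rather than pairs, gives the most frequent non-stopwords asked for in the previous exercise.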
19.◑ Write a program to create a table of word frequencies by genre, like the one given
in Section 2.1 for modals. Choose your own words and try to find words whose
presence (or absence) is typical of a genre. Discuss your findings.
20.◑ Write a function
that takes a word and the name of a section of the
Brown Corpus as arguments, and computes the frequency of the word in that section
of the corpus.
21.◑ Write a program to guess the number of syllables contained in a text, making
use of the CMU Pronouncing Dictionary.
22.◑ Define a function
that processes a text and produces a new version
with the word
between every third word.
23.● Zipf’s Law: Let f(w) be the frequency of a word w in free text. Suppose that all
the words of a text are ranked according to their frequency, with the most frequent
word first. Zipf’s Law states that the frequency of a word type is inversely
proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th
most common word type should occur three times as frequently as the 150th most
common word type.
a.Write a function to process a large text and plot word frequency against word
. Do you confirm Zipf’s law? (Hint: it helps to use a
logarithmic scale.) What is going on at the extreme ends of the plotted line?
b.Generate random text, e.g., using
, taking care to
include the space character. You will need to
first. Use the string
concatenation operator to accumulate characters into a (very) long string.
Then tokenize this string, generate the Zipf plot as before, and compare the
two plots. What do you make of Zipf’s Law in the light of this?
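The ranking step behind both parts of this exercise can be sketched as follows. The function name rank_frequency, the alphabet, the text length, and the seed are all our own assumptions. For brevity the sketch builds the random string with join() rather than the concatenation operator, and it prints the figures instead of plotting them.

```python
import random
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word type first."""
    counts = Counter(tokens)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]

# Random "text" over a small alphabet that includes the space character.
random.seed(0)  # fixed seed so that runs are repeatable
chars = "".join(random.choice("abcdefg ") for _ in range(100000))
tokens = chars.split()

# Under Zipf's Law, rank * freq would stay roughly constant.
for rank, freq in rank_frequency(tokens)[:10]:
    print(rank, freq, rank * freq)
```

Passing the same pairs to a plotting library with logarithmic axes produces the plot the exercise asks for.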
24.● Modify the text generation program in Example 2-1 further, to do the following
a.Store the n most likely words in a list
, then randomly choose a word
from the list using
. (You will need to
b.Select a particular genre, such as a section of the Brown Corpus or a Genesis
translation, one of the Gutenberg texts, or one of the Web texts. Train the
model on this corpus and get it to generate random text. You may have to
experiment with different start words. How intelligible is the text? Discuss the
strengths and weaknesses of this method of generating random text.
c.Now train your system using two distinct genres and experiment with generating
text in the hybrid genre. Discuss your observations.
25.● Define a function
that takes a string as its argument and returns
a list of languages that have that string as a word. Use the
corpus and limit
your searches to files in the Latin-1 encoding.
26.● What is the branching factor of the noun hypernym hierarchy? I.e., for every
noun synset that has hyponyms—or children in the hypernym hierarchy—how
many do they have on average? You can get all noun synsets using
27.● The polysemy of a word is the number of senses it has. Using WordNet, we can
determine that the noun dog has seven senses with
Compute the average polysemy of nouns, verbs, adjectives, and adverbs according
28.● Use one of the predefined similarity measures to score the similarity of each of
the following pairs of words. Rank the pairs in order of decreasing similarity. How
close is your ranking to the order given here, an order that was established experimentally
by (Miller & Charles, 1998): car-automobile, gem-jewel, journey-voyage,
boy-lad, coast-shore, asylum-madhouse, magician-wizard, midday-noon,
furnace-stove, food-fruit, bird-cock, bird-crane, tool-implement, brother-monk,
lad-brother, crane-implement, journey-car, monk-oracle, cemetery-woodland,
food-rooster, coast-hill, forest-graveyard, shore-woodland, monk-slave, coast-forest,
lad-wizard, chord-smile, glass-magician, rooster-voyage, noon-string.