National Government; United Nations; public money
>>> text8.collocations()
Building collocations list
medium build; social drinker; quiet nights; long term; age open;
financially secure; fun times; similar interests; Age open; poss
rship; single mum; permanent relationship; slim build; seeks lady;
Late 30s; Photo pls; Vibrant personality; European background; ASIAN
LADY; country drives
The collocations that emerge are very specific to the genre of the texts. In order to find
red wine as a collocation, we would need to process a much larger body of text.
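The core idea can be sketched in plain Python 3: count adjacent word pairs and inspect the most frequent ones. This is only a rough stand-in for NLTK's collocations() method, which also scores pairs against chance rather than merely counting them; the toy corpus below is invented for illustration.

```python
from collections import Counter

# A tiny invented corpus; collocations() would run over a real text.
words = ("we drank the red wine and the red wine was good "
         "so we ordered the red wine again").split()

# Count adjacent word pairs (bigrams).
bigrams = Counter(zip(words, words[1:]))

# Frequent pairs are rough collocation candidates; a real scorer would
# also check that the pair occurs more often than chance predicts.
print(bigrams.most_common(3))
```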
Counting Other Things
Counting words is useful, but we can count other things too. For example, we can look
at the distribution of word lengths in a text, by creating a FreqDist out of a long list of
numbers, where each number is the length of the corresponding word in the text:
>>> [len(w) for w in text1]
>>> fdist = FreqDist([len(w) for w in text1])
>>> fdist
<FreqDist with 260819 outcomes>
>>> fdist.keys()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]
We start by deriving a list of the lengths of words in text1, and the FreqDist then
counts the number of times each of these occurs. The result is a distribution
containing a quarter of a million items, each of which is a number corresponding to a
word token in the text. But there are only 20 distinct items being counted, the numbers
1 through 20, because there are only 20 different word lengths. I.e., there are words
consisting of just 1 character, 2 characters, ..., 20 characters, but none with 21 or more
characters. One might wonder how frequent the different lengths of words are (e.g.,
how many words of length 4 appear in the text, are there more words of length 5 than
length 4, etc.). We can do this as follows:
>>> fdist.items()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
From this we see that the most frequent word length is 3, and that words of length 3
account for roughly 50,000 (or 20%) of the words making up the book. Although we
will not pursue it here, further analysis of word length might help us understand
1.3 Computing with Language: Simple Statistics | 21
differences between authors, genres, or languages. Table 1-2 summarizes the functions
defined in frequency distributions.
Table 1-2. Functions defined for NLTK’s frequency distributions

Example                      Description
fdist = FreqDist(samples)    Create a frequency distribution containing the given samples
fdist.inc(sample)            Increment the count for this sample
fdist['monstrous']           Count of the number of times a given sample occurred
fdist.freq('monstrous')      Frequency of a given sample
fdist.N()                    Total number of samples
fdist.keys()                 The samples sorted in order of decreasing frequency
for sample in fdist:         Iterate over the samples, in order of decreasing frequency
fdist.max()                  Sample with the greatest count
fdist.tabulate()             Tabulate the frequency distribution
fdist.plot()                 Graphical plot of the frequency distribution
fdist.plot(cumulative=True)  Cumulative plot of the frequency distribution
fdist1 < fdist2              Test if samples in fdist1 occur less frequently than in fdist2
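Much of the behavior in Table 1-2 can be mimicked with the standard library's collections.Counter when NLTK is not at hand. This is a Python 3 sketch under that assumption; fdist.inc() is NLTK's own method, approximated here by += 1, and the sample list is invented.

```python
from collections import Counter

samples = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']
fdist = Counter(samples)            # like FreqDist(samples)

fdist['purr'] += 1                  # like fdist.inc('purr')
total = sum(fdist.values())         # like fdist.N()

print(fdist['the'])                 # count of a given sample
print(fdist['the'] / total)         # like fdist.freq('the')
print(fdist.most_common(1))         # like fdist.max(), paired with its count
```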
Our discussion of frequency distributions has introduced some important Python
concepts, and we will look at them systematically in Section 1.4.
1.4 Back to Python: Making Decisions and Taking Control
So far, our little programs have had some interesting qualities: the ability to work with
language, and the potential to save human effort through automation. A key feature of
programming is the ability of machines to make decisions on our behalf, executing
instructions when certain conditions are met, or repeatedly looping through text data
until some condition is satisfied. This feature is known as control, and is the focus of
this section.

Python supports a wide range of operators, such as < and >=, for testing the relationship
between values. The full set of these relational operators is shown in Table 1-3.
Table 1-3. Numerical comparison operators
<     Less than
<=    Less than or equal to
==    Equal to (note this is two “=” signs, not one)
22 | Chapter 1: Language Processing and Python
!=    Not equal to
>     Greater than
>=    Greater than or equal to
We can use these to select different words from a sentence of news text. Here are some
examples—notice only the operator is changed from one line to the next. They all use
sent7, the first sentence from text7 (Wall Street Journal). As before, if you get an error
saying that sent7 is undefined, you need to first type:
from nltk.book import *
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board', 'as', 'a',
'nonexecutive', 'director', '29', '.']
There is a common pattern to all of these examples: [w for w in text if condition],
where condition is a Python “test” that yields either true or false. In the cases shown
in the previous code example, the condition is always a numerical comparison. However,
we can also test various properties of words, using the functions listed in Table 1-4.
Table 1-4. Some word comparison operators

Function          Meaning
s.startswith(t)   Test if s starts with t
s.endswith(t)     Test if s ends with t
t in s            Test if t is contained inside s
s.islower()       Test if all cased characters in s are lowercase
s.isupper()       Test if all cased characters in s are uppercase
s.isalpha()       Test if all characters in s are alphabetic
s.isalnum()       Test if all characters in s are alphanumeric
s.isdigit()       Test if all characters in s are digits
s.istitle()       Test if s is titlecased (all words in s have initial capitals)
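Since these are ordinary Python string methods, they can be tried on any list of strings without NLTK; a small self-contained check (the sample words are invented):

```python
words = ['Ishmael', 'whale', 'WHALE', '1851', 'Moby-Dick']

print([w for w in words if w.endswith('e')])  # words ending in 'e'
print([w for w in words if w.islower()])      # all cased characters lowercase
print([w for w in words if w.isupper()])      # all cased characters uppercase
print([w for w in words if w.isdigit()])      # all characters digits
print([w for w in words if w.isalpha()])      # all characters alphabetic
print([w for w in words if w.istitle()])      # titlecased words
```

Note that istitle() accepts 'Moby-Dick', since each alphabetic run after a non-letter starts with a capital.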
Here are some examples of these operators being used to select words from our texts:
words ending with -ableness; words containing gnt; words having an initial capital; and
words consisting entirely of digits.
>>> sorted([w for w in set(text1) if w.endswith('ableness')])
>>> sorted([term for term in set(text4) if 'gnt' in term])
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted([item for item in set(text6) if item.istitle()])
>>> sorted([item for item in set(sent7) if item.isdigit()])
['29', '61']
We can also create more complex conditions. If c is a condition, then not c is also a
condition. If we have two conditions c1 and c2, then we can combine them to form a
new condition using conjunction and disjunction: c1 and c2, c1 or c2.
Your Turn: Run the following examples and try to explain what is going
on in each one. Next, try to make up some conditions of your own.
>>> sorted([w for w in set(sent7) if not w.islower()])
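These combined conditions also work on a plain list, with no NLTK required; here they are applied to the words of sent1, the opening of Moby Dick:

```python
sent = ['Call', 'me', 'Ishmael', '.']

# Conjunction: both tests must hold.
print([w for w in sent if w.istitle() and len(w) > 4])
# Disjunction: either test may hold.
print([w for w in sent if w.islower() or not w.isalpha()])
# Negation: note '.' passes, since it has no cased characters at all.
print([w for w in sent if not w.islower()])
```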
Operating on Every Element
In Section 1.3, we saw some examples of counting items other than words. Let’s take
a closer look at the notation we used:
>>> [len(w) for w in text1]
>>> [w.upper() for w in text1]
These expressions have the form [f(w) for ...] or [w.f() for ...], where f is a
function that operates on a word to compute its length, or to convert it to uppercase.
For now, you don’t need to understand the difference between the notations f(w) and
w.f(). Instead, simply learn this Python idiom, which performs the same operation on
every element of a list. In the preceding examples, it goes through each word in
text1, assigning each one in turn to the variable w and performing the specified
operation on the variable.
The notation just described is called a “list comprehension.” This is our
first example of a Python idiom, a fixed notation that we use habitually
without bothering to analyze each time. Mastering such idioms is an
important part of becoming a fluent Python programmer.
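The two shapes of the idiom can be seen side by side on a small invented list:

```python
words = ['Call', 'me', 'Ishmael']

# [f(w) for ...]: apply an ordinary function to each element.
print([len(w) for w in words])      # word lengths

# [w.f() for ...]: call a method on each element.
print([w.upper() for w in words])   # uppercased words
```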
Let’s return to the question of vocabulary size, and apply the same idiom here:
>>> len(set([word.lower() for word in text1]))
17231
Now that we are not double-counting words like This and this, which differ only in
capitalization, we’ve wiped 2,000 off the vocabulary count! We can go a step further
and eliminate numbers and punctuation from the vocabulary count by filtering out any
non-alphabetic items:
>>> len(set([word.lower() for word in text1 if word.isalpha()]))
16948
This example is slightly complicated: it lowercases all the purely alphabetic items.
Perhaps it would have been simpler just to count the lowercase-only items, but this
gives the wrong answer (why?).
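To see why, compare the two approaches on a tiny invented word list: lowercasing first merges capitalization variants, whereas keeping only items that are already lowercase silently drops words that never appear in lowercase form.

```python
words = ['This', 'this', 'Ishmael', 'whale']

# Lowercase first, then deduplicate: the variants of 'this' are merged.
merged = set(w.lower() for w in words)
print(sorted(merged))

# Keep only items already lowercase: 'Ishmael' disappears entirely,
# because it only ever occurs capitalized.
only_lower = set(w for w in words if w.islower())
print(sorted(only_lower))
```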
Don’t worry if you don’t feel confident with list comprehensions yet, since you’ll see
many more examples along with explanations in the following chapters.
Nested Code Blocks
Most programming languages permit us to execute a block of code when a conditional
expression, or if statement, is satisfied. We already saw examples of conditional tests
in code like [w for w in sent7 if len(w) < 4]. In the following program, we have
created a variable called word containing the string value 'cat'. The if statement checks
whether the test len(word) < 5 is true. It is, so the body of the if statement is invoked
and the print statement is executed, displaying a message to the user. Remember to
indent the print statement by typing four spaces.
>>> word = 'cat'
>>> if len(word) < 5:
...     print 'word length is less than 5'
...
word length is less than 5
When we use the Python interpreter we have to add an extra blank line in
order for it to detect that the nested block is complete.
If we change the conditional test to len(word) >= 5, to check that the length of word is
greater than or equal to 5, then the test will no longer be true. This time, the body of
the if statement will not be executed, and no message is shown to the user:
>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
...
An if statement is known as a control structure because it controls whether the code
in the indented block will be run. Another control structure is the for loop. Try the
following, and remember to include the colon and the four spaces:
>>> for word in ['Call', 'me', 'Ishmael', '.']:
...     print word
...
Call
me
Ishmael
.
This is called a loop because Python executes the code in circular fashion. It starts by
performing the assignment word = 'Call', effectively using the word variable to name
the first item of the list. Then, it displays the value of word to the user. Next, it goes
back to the for statement, and performs the assignment word = 'me', and displays
this new value to the user, and so on. It continues in this fashion until every item of the
list has been processed.
Looping with Conditions
Now we can combine the if and for statements. We will loop over every item of the
list, and print the item only if it ends with the letter l. We’ll pick another name for the
variable to demonstrate that Python doesn’t try to make sense of variable names.
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print xyzzy
...
Call
Ishmael
You will notice that if and for statements have a colon at the end of the line, before
the indentation begins. In fact, all Python control structures end with a colon. The
colon indicates that the current statement relates to the indented block that follows.
We can also specify an action to be taken if the condition of the if statement is not
met. Here we see the elif (else if) statement, and the else statement. Notice that these
also have colons before the indented code.
>>> for token in sent1:
...     if token.islower():
...         print token, 'is a lowercase word'
...     elif token.istitle():
...         print token, 'is a titlecase word'
...     else:
...         print token, 'is punctuation'
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
As you can see, even with this small amount of Python knowledge, you can start to
build multiline Python programs. It’s important to develop such programs in pieces,
testing that each piece does what you expect before combining them into a program.
This is why the Python interactive interpreter is so invaluable, and why you should get
comfortable using it.
Finally, let’s combine the idioms we’ve been exploring. First, we create a list of cie and
cei words, then we loop over each item and print it. Notice the comma at the end of
the print statement, which tells Python to produce its output on a single line.
>>> tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])
>>> for word in tricky:
...     print word,
...
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
1.5 Automatic Natural Language Understanding
We have been exploring language bottom-up, with the help of texts and the Python
programming language. However, we’re also interested in exploiting our knowledge of
language and computation by building useful language technologies. We’ll take the
opportunity now to step back from the nitty-gritty of code in order to paint a bigger
picture of natural language processing.
At a purely practical level, we all need help to navigate the universe of information
locked up in text on the Web. Search engines have been crucial to the growth and
popularity of the Web, but have some shortcomings. It takes skill, knowledge, and
some luck, to extract answers to such questions as: What tourist sites can I visit between
Philadelphia and Pittsburgh on a limited budget? What do experts say about digital SLR
cameras? What predictions about the steel market were made by credible commentators
in the past week? Getting a computer to answer them automatically involves a range of
language processing tasks, including information extraction, inference, and
summarization, and would need to be carried out on a scale and with a level of
robustness that is still beyond our current capabilities.
On a more philosophical level, a long-standing challenge within artificial intelligence
has been to build intelligent machines, and a major part of intelligent behavior is
understanding language. For many years this goal has been seen as too difficult.
However, as NLP technologies become more mature, and robust methods for analyzing
unrestricted text become more widespread, the prospect of natural language
understanding has re-emerged as a plausible goal.
In this section we describe some language understanding technologies, to give you a
sense of the interesting challenges that are waiting for you.
Word Sense Disambiguation
In word sense disambiguation we want to work out which sense of a word was in-
tended in a given context. Consider the ambiguous words serve and dish:
(2) a. serve: help with food or drink; hold an office; put ball into play
    b. dish: plate; course of a meal; communications device
In a sentence containing the phrase: he served the dish, you can detect that both serve
and dish are being used with their food meanings. It’s unlikely that the topic of
discussion shifted from sports to crockery in the space of three words. This would force you
to invent bizarre images, like a tennis pro taking out his frustrations on a china tea-set
laid out beside the court. In other words, we automatically disambiguate words using
context, exploiting the simple fact that nearby words have closely related meanings. As
another example of this contextual effect, consider the word by, which has several
meanings, for example, the book by Chesterton (agentive—Chesterton was the author
of the book); the cup by the stove (locative—the stove is where the cup is); and submit
by Friday (temporal—Friday is the time of the submitting). Observe in (3) that the
meaning of the italicized word helps us interpret the meaning of by.
(3) a. The lost children were found by the searchers (agentive)
    b. The lost children were found by the mountain (locative)
    c. The lost children were found by the afternoon (temporal)
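The serve/dish intuition above can be sketched in a few lines of Python. This is an invented illustration, not the book's code or a real disambiguation system: each sense is represented by a handful of hypothetical signature words, and we pick the sense whose signature overlaps most with the sentence.

```python
# Invented signature words for two senses of 'serve'.
senses = {
    'food':  {'dish', 'meal', 'drink', 'plate', 'waiter'},
    'sport': {'ball', 'court', 'racquet', 'tennis', 'ace'},
}

def disambiguate(context_words, senses):
    # Choose the sense whose signature shares the most words with the context.
    return max(senses, key=lambda s: len(senses[s] & set(context_words)))

print(disambiguate(['he', 'served', 'the', 'dish'], senses))            # 'food'
print(disambiguate(['he', 'served', 'the', 'ball', 'on', 'court'], senses))  # 'sport'
```

Real systems use far richer context models, but the principle is the same: nearby words vote for a sense.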
Pronoun Resolution

A deeper kind of language understanding is to work out “who did what to whom,” i.e.,
to detect the subjects and objects of verbs. You learned to do this in elementary school,
but it’s harder than you might think. In the sentence the thieves stole the paintings, it is
easy to tell who performed the stealing action. Consider three possible following
sentences in (4), and try to determine what was sold, caught, and found (one case is
ambiguous).
(4) a. The thieves stole the paintings. They were subsequently sold.
    b. The thieves stole the paintings. They were subsequently caught.
    c. The thieves stole the paintings. They were subsequently found.
Answering this question involves finding the antecedent of the pronoun they, either
thieves or paintings. Computational techniques for tackling this problem include
anaphora resolution—identifying what a pronoun or noun phrase refers to—and
semantic role labeling—identifying how a noun phrase relates to the verb (as agent,
patient, instrument, and so on).