way, since unnamed parameters are defined by position. We can define a function that
takes an arbitrary number of unnamed and named parameters, and access them via an
in-place list of arguments *args
and an in-place dictionary of keyword arguments **kwargs:
>>> def generic(*args, **kwargs):
...     print args
...     print kwargs
...
>>> generic(1, "African swallow", monty="python")
(1, 'African swallow')
{'monty': 'python'}
When *args appears as a function parameter, it actually corresponds to all the unnamed
parameters of the function. As another illustration of this aspect of Python syntax,
consider the zip() function, which operates on a variable number of arguments. We’ll
use the variable name *song to demonstrate that there’s nothing special about the name
*args:
>>> song = [['four', 'calling', 'birds'],
...         ['three', 'French', 'hens'],
...         ['two', 'turtle', 'doves']]
>>> zip(song[0], song[1], song[2])
[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]
>>> zip(*song)
[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]
It should be clear from this example that typing *song is just a convenient shorthand,
and equivalent to typing out song[0], song[1], song[2].
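The two uses of the star syntax, collecting arguments in a definition and unpacking a list at a call site, can be sketched side by side. This is only an illustration, written in Python 3 syntax (an assumption, since the sessions in this chapter use Python 2), and the function name describe_call is hypothetical:

```python
def describe_call(*args, **kwargs):
    """Return the positional arguments as a tuple and the keyword arguments as a dict."""
    return args, kwargs

positional, named = describe_call(1, "African swallow", monty="python")
print(positional)
print(named)

song = [['four', 'calling', 'birds'],
        ['three', 'French', 'hens'],
        ['two', 'turtle', 'doves']]

# At a call site, *song unpacks the three inner lists into three separate arguments.
# In Python 3, zip() returns an iterator, so we wrap it in list() to see the tuples.
columns = list(zip(*song))
print(columns == list(zip(song[0], song[1], song[2])))
```

The star in a definition packs arguments together, while the star in a call spreads a sequence out. The name after the star is arbitrary in both cases.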
Here’s another example of the use of keyword arguments in a function definition, along
with three equivalent ways to call the function:
>>> def freq_words(file, min=1, num=10):
...     text = open(file).read()
...     tokens = nltk.word_tokenize(text)
...     freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
...     return freqdist.keys()[:num]
>>> fw = freq_words('ch01.rst', 4, 10)
>>> fw = freq_words('ch01.rst', min=4, num=10)
>>> fw = freq_words('ch01.rst', num=10, min=4)
A side effect of having named arguments is that they permit optionality. Thus we can
leave out any arguments where we are happy with the default value:
freq_words('ch01.rst', min=4), freq_words('ch01.rst', 4). Another common use of
optional arguments is to permit a flag. Here’s a revised version of the same function
that reports its progress if a verbose flag is set:
>>> def freq_words(file, min=1, num=10, verbose=False):
...     freqdist = nltk.FreqDist()
...     if verbose: print "Opening", file
...     text = open(file).read()
...     if verbose: print "Read in %d characters" % len(text)
...     for word in nltk.word_tokenize(text):
...         if len(word) >= min:
...             freqdist.inc(word)
...             if verbose and freqdist.N() % 100 == 0: print "."
...     if verbose: print
...     return freqdist.keys()[:num]
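To try out positional, keyword, and omitted arguments without NLTK or a file on disk, we can use a simplified stand-in. This is a sketch under stated assumptions, not the function above: the name frequent_words and the parameter min_len are hypothetical, it takes a string rather than a filename, it splits on whitespace instead of calling nltk.word_tokenize(), and it is written in Python 3 syntax:

```python
from collections import Counter

def frequent_words(text, min_len=1, num=10, verbose=False):
    """Return up to num of the most frequent whitespace-separated tokens
    whose length is at least min_len."""
    if verbose:
        print("Read in %d characters" % len(text))
    counts = Counter(t for t in text.split() if len(t) >= min_len)
    return [word for word, _ in counts.most_common(num)]

sample = "the cat sat on the mat and the dog sat too"

# Three equivalent calls: positional, keyword, and keyword in a different order.
a = frequent_words(sample, 3, 2)
b = frequent_words(sample, min_len=3, num=2)
c = frequent_words(sample, num=2, min_len=3)
print(a == b == c)

# Leaving out num and verbose falls back on their default values.
d = frequent_words(sample, 3)
print(d)
```

Because verbose defaults to False, existing calls keep working unchanged when the flag is added to the signature.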
Take care not to use a mutable object as the default value of a parameter.
A series of calls to the function will use the same object, sometimes with
bizarre results, as we will see in the discussion of debugging later.
4.6 Program Development
Programming is a skill that is acquired over several years of experience with a variety
of programming languages and tasks. Key high-level abilities are algorithm design and
its manifestation in structured programming. Key low-level abilities include familiarity
with the syntactic constructs of the language, and knowledge of a variety of diagnostic
methods for trouble-shooting a program which does not exhibit the expected behavior.
This section describes the internal structure of a program module and how to organize
a multi-module program. Then it describes various kinds of error that arise during
program development, what you can do to fix them and, better still, to avoid them in
the first place.
Structure of a Python Module
The purpose of a program module is to bring logically related definitions and functions
together in order to facilitate reuse and abstraction. Python modules are nothing more
than individual .py files. For example, if you were working with a particular corpus
format, the functions to read and write the format could be kept together. Constants
used by both functions, such as field separators or a filename extension like
EXTN = ".inf", could be shared. If the format was updated, you would know that only one file needed
to be changed. Similarly, a module could contain code for creating and manipulating
a particular data structure such as syntax trees, or code for performing a particular
processing task such as plotting corpus statistics.
When you start writing Python modules, it helps to have some examples to emulate.
You can locate the code for any NLTK module on your system using the __file__
variable, for example nltk.metrics.distance.__file__.
This returns the location of the compiled .pyc file for the module, which will differ
from machine to machine. The file that you will need to open is the
corresponding .py source file, and this will be in the same directory as the .pyc file.
154 | | Chapter 4: Writing Structured Programs
Alternatively, you can view the latest version of this module on the Web at http://code
Like every other NLTK module, distance.py begins with a group of comment lines giving
a one-line title of the module and identifying the authors. (Since the code is distributed,
it also includes the URL where the code is available, a copyright statement, and license
information.) Next is the module-level docstring, a triple-quoted multiline string
containing information about the module that will be printed when someone types
help(nltk.metrics.distance).
# Natural Language Toolkit: Distance Metrics
#
# Copyright (C) 2001-2009 NLTK Project
# Author: Edward Loper <email@example.com>
#         Steven Bird <firstname.lastname@example.org>
#         Tom Lippincott <email@example.com>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT

"""
Compute the distance between two items (usually strings).
As metrics, they must satisfy the following three requirements:

1. d(a, a) = 0
2. d(a, b) >= 0
3. d(a, c) <= d(a, b) + d(b, c)
"""
After this comes all the import statements required for the module, then any global
variables, followed by a series of function definitions that make up most of the module.
Other modules define “classes,” the main building blocks of object-oriented programming,
which falls outside the scope of this book. (Most NLTK modules also include a
demo() function, which can be used to see examples of the module in use.)
Some module variables and functions are only used within the module.
These should have names beginning with an underscore, e.g., _helper(),
since this will hide the name. If another module imports this
one, using the idiom from module import *, these names will not be
imported. You can optionally list the externally accessible names of a
module using the special variable __all__, like this:
__all__ = ['edit_dis
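The effect of the underscore convention and of __all__ can be checked without creating any files. The following is a minimal sketch in Python 3 syntax; the module name toy_metrics and all the function names in it are hypothetical, and the module is built in memory purely for illustration:

```python
import sys
import types

# Source text of a throwaway module; every name in it is hypothetical.
source = '''
__all__ = ['public_distance']

def _helper(a, b):
    return abs(a - b)

def public_distance(a, b):
    return _helper(a, b)

def unlisted(a, b):
    return _helper(b, a)
'''

toy = types.ModuleType("toy_metrics")
exec(source, toy.__dict__)
sys.modules["toy_metrics"] = toy   # register it so the import statement can find it

# Run "from toy_metrics import *" in a fresh namespace and see what arrived.
namespace = {}
exec("from toy_metrics import *", namespace)
imported = sorted(name for name in namespace if not name.startswith("__"))
print(imported)

# Hidden names are still reachable through the module object itself.
print(toy._helper(2, 5))
```

When __all__ is present, only the names it lists are imported by the star idiom, so unlisted() is left out along with _helper(). The underscore is a convention rather than access control, since toy._helper() remains callable.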
Some programs bring together a diverse range of tasks, such as loading data from a
corpus, performing some analysis tasks on the data, then visualizing it. We may already
have stable modules that take care of loading data and producing visualizations. Our
work might involve coding up the analysis task, and just invoking functions from the
existing modules. This scenario is depicted in Figure 4-2.
Figure 4-2. Structure of a multimodule program: The main program my_program.py imports
functions from two other modules; unique analysis tasks are localized to the main program, while
common loading and visualization tasks are kept apart to facilitate reuse and abstraction.
By dividing our work into several modules and using import statements to access
functions defined elsewhere, we can keep the individual modules simple and easy to
maintain. This approach will also result in a growing collection of modules, and make it
possible for us to build sophisticated systems involving a hierarchy of modules.
Designing such systems well is a complex software engineering task, and beyond the scope
of this book.
Sources of Error
Mastery of programming depends on having a variety of problem-solving skills to draw
upon when the program doesn’t work as expected. Something as trivial as a misplaced
symbol might cause the program to behave very differently. We call these “bugs”
because they are tiny in comparison to the damage they can cause. They creep into our
code unnoticed, and it’s only much later when we’re running the program on some
new data that their presence is detected. Sometimes, fixing one bug only reveals
another, and we get the distinct impression that the bug is on the move. The only
reassurance we have is that bugs are spontaneous and not the fault of the programmer.
Flippancy aside, debugging code is hard because there are so many ways for it to be
faulty. Our understanding of the input data, the algorithm, or even the programming
language, may be at fault. Let’s look at examples of each of these.
First, the input data may contain some unexpected characters. For example, WordNet
synset names have the form car.n.01, with three components separated using periods.
The NLTK WordNet module initially decomposed these names using split('.').
However, this method broke when someone tried to look up the word PhD, which has
the synset name ph.d..n.01, containing four periods instead of the expected two. The
solution was to use rsplit('.', 2) to do at most two splits, using the rightmost
instances of the period, and leaving the ph.d. string intact. Although several people had
tested the module before it was released, it was some weeks before someone detected
the problem (see http://code.google.com/p/nltk/issues/detail?id=297).
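The difference between the two ways of splitting is easy to see on the two names discussed above. This short sketch uses only built-in string methods; the variable names are our own:

```python
ordinary = "car.n.01"
awkward = "ph.d..n.01"

# Splitting on every period works when the word itself contains none.
print(ordinary.split("."))

# For the awkward name, the same call yields five pieces instead of three.
naive = awkward.split(".")
print(naive)

# rsplit with maxsplit=2 splits at most twice, working from the right,
# so everything before the last two periods stays together.
word, pos, sense = awkward.rsplit(".", 2)
print(word, pos, sense)
```

Both calls give the same three components for car.n.01, which is why the problem stayed hidden until a word containing periods was looked up.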
Second, a supplied function might not behave as expected. For example, while testing
NLTK’s interface to WordNet, one of the authors noticed that no synsets had any
antonyms defined, even though the underlying database provided a large quantity of
antonym information. What looked like a bug in the WordNet interface turned out to
be a misunderstanding about WordNet itself: antonyms are defined for lemmas, not
for synsets. The only “bug” was a misunderstanding of the interface (see http://code
Third, our understanding of Python’s semantics may be at fault. It is easy to make the
wrong assumption about the relative scope of two operators. For example,
"%s.%s.%02d" % "ph.d.", "n", 1 produces a runtime error, TypeError: not enough arguments
for format string. This is because the percent operator binds more tightly than
the comma, so only "ph.d." is supplied to the format string. The fix is to add parentheses
in order to force the required scope: "%s.%s.%02d" % ("ph.d.", "n", 1).
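We can confirm this reading of the precedence rules by catching the error and then retrying with parentheses. The sketch below is written in Python 3 syntax, and the variable names are our own:

```python
try:
    # Parsed as ("%s.%s.%02d" % "ph.d."), "n", 1
    # so the format string receives one value where it needs three.
    name = "%s.%s.%02d" % "ph.d.", "n", 1
except TypeError as err:
    message = str(err)
    print(message)

# Parentheses make the three values a single tuple operand of %.
name = "%s.%s.%02d" % ("ph.d.", "n", 1)
print(name)
```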
As another example, suppose we are defining a function to collect all tokens of a text
having a given length. The function has parameters for the text and the word length,
and an extra parameter that allows the initial value of the result to be given as a
parameter:
>>> def find_words(text, wordlength, result=[]):
...     for word in text:
...         if len(word) == wordlength:
...             result.append(word)
...     return result
>>> find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3)
['omg', 'teh', 'teh', 'mat']
The first time we call find_words(), we get all three-letter words as expected. The
second time we specify an initial value for the result, a one-element list, and as
expected, the result has this word along with the other two-letter word in our text.
Now, the next time we call find_words() we use the same parameters as in the first call,
but we get a different result! Each time we call find_words() with no third parameter, the
result will simply extend the result of the previous call, rather than start with the empty
result list as specified in the function definition. The program’s behavior is not as
expected because we incorrectly assumed that the default value was created at the time
the function was invoked. However, it is created just once, at the time the Python
interpreter loads the function definition. This one list object is used whenever no explicit
value is provided to the function.
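The shared default can be observed directly, and a common way to avoid it is to use None as the default and create the list inside the function. The following sketch is in Python 3 syntax; find_words_fixed is a hypothetical name for the repaired variant, which is one possible fix rather than the only one:

```python
def find_words(text, wordlength, result=[]):
    # The default list is created once, when the def statement runs.
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

def find_words_fixed(text, wordlength, result=None):
    # None acts as a sentinel: a fresh list is made on each call that omits result.
    if result is None:
        result = []
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

words = ['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat']

first = list(find_words(words, 3))    # copy, so later calls cannot change it
second = list(find_words(words, 3))   # the shared default has grown
print(len(first), len(second))

# The list returned by a default call is the very object stored on the function.
# (No word has length 5, so this call adds nothing to it.)
print(find_words(words, 5) is find_words.__defaults__[0])

# The repaired version gives the same answer every time.
print(find_words_fixed(words, 3) == find_words_fixed(words, 3))
```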
Since most code errors result from the programmer making incorrect assumptions, the
first thing to do when you detect a bug is to check your assumptions. Localize the
problem by adding print statements to the program, showing the value of important
variables, and showing how far the program has progressed.
If the program produced an “exception”—a runtime error—the interpreter will print
a stack trace, pinpointing the location of program execution at the time of the error.
If the program depends on input data, try to reduce this to the smallest size while still
producing the error.
Once you have localized the problem to a particular function or to a line of code, you
need to work out what is going wrong. It is often helpful to recreate the situation using
the interactive command line. Define some variables, and then copy-paste the offending
line of code into the session and see what happens. Check your understanding of the
code by reading some documentation and examining other code samples that purport
to do the same thing that you are trying to do. Try explaining your code to someone
else, in case she can see where things are going wrong.
Python provides a debugger which allows you to monitor the execution of your
program, specify line numbers where execution will stop (i.e., breakpoints), and step
through sections of code and inspect the value of variables. You can invoke the
debugger on your code as follows:
>>> import pdb
>>> import mymodule
>>> pdb.run('mymodule.myfunction()')
It will present you with a prompt (Pdb) where you can type instructions to the debugger.
Type help to see the full list of commands. Typing step (or just s) will execute the
current line and stop. If the current line calls a function, it will enter the function and
stop at the first line. Typing next (or just n) is similar, but it stops execution at the next
line in the current function. The break (or b) command can be used to create or list
breakpoints. Type continue (or c) to continue execution as far as the next breakpoint.
Type the name of any variable to inspect its value.
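A debugger session is normally interactive, but for illustration we can feed the commands from a string and capture what the debugger prints. This is a sketch under stated assumptions: it is written in Python 3 syntax, the stand-in function find_words is defined here only to give the debugger something to step into, and the exact text of the transcript depends on the Python version:

```python
import io
import pdb

def find_words(text, wordlength, result=[]):
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

# Commands that would normally be typed at the (Pdb) prompt, one per line:
# step into the call, step to the first line of the body, print a variable, resume.
commands = io.StringIO("step\nstep\np wordlength\ncontinue\n")
transcript = io.StringIO()

# Pdb accepts alternative input and output streams in place of the terminal.
debugger = pdb.Pdb(stdin=commands, stdout=transcript)
debugger.run("find_words(['cat'], 3)", globals())

print(transcript.getvalue())
```

In everyday use we would simply call pdb.run() and type the same commands by hand. Scripting them is useful mainly for showing a session in print.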
We can use the Python debugger to locate the problem in our find_words() function.
Remember that the problem arose the second time the function was called. We’ll start
by calling the function without using the debugger, using the smallest possible input.
The second time, we’ll call it with the debugger.