4.1 Introduction
Credit: Mark Lutz, author of Programming Python, co-author of Learning Python
Behold the file—one of the first things that any reasonably pragmatic programmer reaches for in a
programming language's toolbox. Because processing external files is a very real, tangible task,
the quality of file-processing interfaces is a good way to assess the practicality of a programming
tool.
As the examples in this chapter attest, Python shines here too. In fact, files in Python are supported
in a variety of layers: from the built-in open function's standard file object, to specialized tools in
standard library modules such as os, to third-party utilities available on the Web. All told,
Python's arsenal of file tools provides several powerful ways to access files in your scripts.
4.1.1 File Basics
In Python, a file object is an instance of a built-in type. The built-in function open creates and
returns a file object. The first argument, a string, specifies the file's path (i.e., the filename
preceded by an optional directory path). The second argument to open, also a string, specifies the
mode in which to open the file. For example:
input = open('data', 'r')
output = open('/tmp/spam', 'w')
open accepts a file path in which directories and files are separated by slash characters (/),
regardless of the proclivities of the underlying operating system. On systems that don't use slashes,
you can use a backslash character (\) instead, but there's no real reason to do so. Backslashes are
harder to fit nicely in string literals, since you have to double them up or use "raw" strings. If the
file path argument does not include the file's directory name, the file is assumed to reside in the
current working directory (which is a disjoint concept from the Python module search path).
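The path rules above can be sketched as follows. This is a minimal illustration written for a current Python 3 interpreter (an assumption; the surrounding text describes an earlier Python), and the scratch directory and the name spam.txt are hypothetical.

```python
import os
import shutil
import tempfile

# Hypothetical scratch directory, so the sketch does not touch real files.
base = tempfile.mkdtemp()

# A forward-slash path, which open accepts regardless of platform.
slash_path = base.replace(os.sep, '/') + '/spam.txt'
output = open(slash_path, 'w')
output.write('hello\n')
output.close()

# os.path.join builds the same path with the platform's own separator.
native_path = os.path.join(base, 'spam.txt')
same_file = os.path.samefile(slash_path, native_path)

# A bare filename is resolved against the current working directory,
# not against the module search path.
bare_name_resolves_to = os.path.join(os.getcwd(), 'spam.txt')

shutil.rmtree(base)
```

Both spellings name the same file, and a path with no directory part is always interpreted relative to wherever the script happens to be running.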
For the mode argument, use 'r' to read the file in text mode; this is the default value and is
commonly omitted, so that open is called with just one argument. Other common modes are 'rb' to
read the file in binary mode, 'w' to create and write to the file in text mode, and 'wb' to create
and write to the file in binary mode.
The distinction between text mode and binary mode is important on non-Unix-like platforms,
because of the line-termination characters used on these systems. When you open a file in binary
mode, Python knows that it doesn't need to worry about line-termination characters; it just moves
bytes between the file and in-memory strings without any kind of translation. When you open a
file in text mode on a non-Unix-like system, however, Python knows it must translate between the
'\n' line-termination characters used in strings and whatever the current platform uses in the file
itself. All of your Python code can always rely on '\n' as the line-termination character, as long
as you properly indicate text or binary mode when you open the file.
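As a rough illustration of the difference between the two modes, the sketch below writes Windows-style line endings in binary mode and reads them back both ways. It assumes Python 3, where binary mode deals in bytes objects and text-mode reading translates line terminators on every platform; the scratch file is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file that will hold Windows-style line endings.
fd, path = tempfile.mkstemp()
os.close(fd)

# Binary mode on writing: bytes go to disk exactly as given.
f = open(path, 'wb')
f.write(b'one\r\ntwo\r\n')
f.close()

# Binary mode on reading: the raw bytes come back untouched.
f = open(path, 'rb')
raw = f.read()
f.close()

# Text mode on reading: line terminators are translated to '\n'.
f = open(path, 'r')
text = f.read()
f.close()

os.remove(path)
```

Code that processes text can therefore test for '\n' alone, while code that processes raw data sees every byte that is actually stored in the file.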
Once you have a file object, you perform all file I/O by calling methods of this object, as we'll
discuss in a moment. When you're done with the file, you should finish by calling the close
method on the object, to close the connection to the file:
input.close( )
In short scripts, people often omit this step, as Python automatically closes the file when a file
object is reclaimed during garbage collection. However, it is good programming practice to close
your files, and it is especially a good idea in larger programs. Note that try/finally is
particularly well suited to ensuring that a file gets closed, even when the program terminates by
raising an uncaught exception.
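The try/finally arrangement mentioned above might look like the following sketch. The scratch file and the simulated failure are hypothetical, and the code is written for Python 3.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()   # hypothetical scratch file
os.close(fd)

output = open(path, 'w')
error_seen = False
try:
    try:
        output.write('partial data\n')
        raise ValueError('simulated failure while processing')
    finally:
        output.close()          # runs whether or not an exception occurred
except ValueError:
    error_seen = True

was_closed = output.closed

# The data written before the failure still reached the file.
reader = open(path)
saved = reader.read()
reader.close()
os.remove(path)
```

Later versions of Python also provide the with statement, which performs the same cleanup with less code.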
To write to a file, use the write method:
output.write(s)
where s is a string. Think of s as a string of characters if output is open for text-mode writing
and as a string of bytes if output is open for binary-mode writing. There are other writing-related
methods, such as flush, which sends any data that is being buffered, and writelines, which writes
a list of strings in a single call. However, these other methods appear far less often in the
recipes in this chapter, as write is by far the most commonly used method.
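To make the role of flush concrete, here is a small sketch, assuming Python 3 and an operating system that lets a second handle read a file that is still open for writing. The scratch file is hypothetical.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()   # hypothetical scratch file
os.close(fd)

output = open(path, 'w')
output.write('buffered line\n')

# A small write may sit in an in-memory buffer until it is flushed.
output.flush()                  # hand the buffered data to the operating system

reader = open(path)
seen_after_flush = reader.read()   # the flushed data can now be read
reader.close()

output.close()
os.remove(path)
```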
Reading from a file is more common than writing to a file, and there are more issues involved, so
file objects have more reading methods than writing ones. The readline method reads and
returns the next line from a text file:
while 1:
    line = input.readline( )
    if not line: break
    process(line)
This was idiomatic Python, but it is no longer the best way to read lines from a file. Another
alternative is to use the readlines method, which reads the whole file and returns a list of
lines:
for line in input.readlines( ):
    process(line)
However, this is useful only for files that fit comfortably in physical memory. If the file is truly
huge, readlines can fail or at least slow things down quite drastically as virtual memory fills
up and the operating system has to start copying parts of physical memory to disk. Python 2.1
introduced the xreadlines method, which works just like readlines in a for loop but
consumes a bounded amount of memory regardless of the size of the file:
for line in input.xreadlines( ):
    process(line)
Python 2.2 introduced the ideal solution, whereby you can loop on the file object itself, implicitly
getting a line at a time with the same memory and performance characteristics as xreadlines:
for line in input:
    process(line)
Of course, you don't always want to read a file line by line. You may instead want to read some or
all of the bytes in the file, particularly if you've opened the file for binary-mode reading, where
lines are unlikely to be an applicable concept. In this case, you can use the read method. When
called without arguments, read reads and returns all the remaining bytes from the file. When
read is called with an integer argument N, it reads and returns the next N bytes (or all the
remaining bytes, if fewer than N bytes remain). Other methods worth mentioning are seek and
tell, which support random access to files. These are normally used with binary files made up
of fixed-length records.
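Random access to fixed-length records can be sketched with seek and tell as follows. The record layout (a 4-byte integer followed by an 8-byte name) and the sample data are hypothetical, the standard struct module is used here only to pack and unpack the records, and the code assumes Python 3.

```python
import os
import struct
import tempfile

# Hypothetical record layout: little-endian 4-byte int plus an 8-byte name.
RECORD = struct.Struct('<i8s')

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'wb')
for number, name in [(1, b'spam'), (2, b'eggs'), (3, b'ham')]:
    f.write(RECORD.pack(number, name))
f.close()

f = open(path, 'rb')
f.seek(2 * RECORD.size)         # jump straight to the third record
position_before = f.tell()      # offset of the third record
number, name = RECORD.unpack(f.read(RECORD.size))
position_after = f.tell()       # offset just past the third record
f.close()
os.remove(path)

name = name.rstrip(b'\x00')     # struct pads short byte strings with NULs
```

Because every record has the same size, the offset of record i is simply i times the record size, so no earlier record needs to be read to reach it.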
4.1.2 Portability and Flexibility
On the surface, Python's file support is straightforward. However, there are two aspects of
Python's file support that I want to underscore up-front, before you peruse the code in this chapter:
script portability and interface flexibility.
Keep in mind that most file interfaces in Python are fully portable across platform boundaries. It
would be difficult to overstate the importance of this feature. A Python script that searches all files
in a directory tree for a bit of text, for example, can be freely moved from platform to platform
without source-code changes: just copy the script's source file to the new target machine. I do it all
the time—so much so that I can happily stay out of operating-system wars. With Python's
portability, the underlying platform is largely irrelevant.
Also, it has always struck me that Python's file-processing interfaces are not restricted to real,
physical files. In fact, most file tools work with any kind of object that exposes the same interface
as a real file object. Thus, a file reader cares only about read methods, and a file writer cares only
about write methods. As long as the target object implements the expected protocol, all goes well.
For example, suppose you have written a general file-processing function such as the following,
intending to apply a passed-in function to each line in an input file:
def scanner(fileobject, linehandler):
    for line in fileobject.readlines( ):
        linehandler(line)
If you code this function in a module file and drop that file in a directory listed on your Python
search path, you can use it anytime you need to scan a text file, now or in the future. To illustrate,
here is a client script that simply prints the first word of each line:
from myutils import scanner
def firstword(line): print line.split( )[0]
file = open('data')
scanner(file, firstword)
So far, so good; we've just coded a reusable software component. But notice that there are no type
declarations in the scanner function, only an interface constraint: any object with a
readlines method will do. For instance, suppose you later want to provide canned test input
from a string object, instead of from a real, physical file. The standard StringIO module, and
the equivalent but faster cStringIO, provide the appropriate wrapping and interface forgery:
from cStringIO import StringIO
from myutils import scanner
def firstword(line): print line.split( )[0]
string = StringIO('one\ntwo xxx\nthree\n')
scanner(string, firstword)
Here, StringIO objects are plug-and-play compatible with file objects, so scanner takes its
three lines of text from an in-memory string object, rather than a true external file. You don't need
to change the scanner to make this work; just send it the right kind of object. For more generality,
use a class to implement the expected interface instead:
class MyStream:
    def readlines(self):
        # Grab and return text from a source here
        return ['a\n', 'b c d\n']

from myutils import scanner
def firstword(line): print line.split( )[0]
object = MyStream( )
scanner(object, firstword)
This time, as scanner attempts to read the file, it really calls out to the readlines method
you've coded in your class. In practice, such a method might use other Python standard tools to
grab text from a variety of sources: an interactive user, a pop-up GUI input box, a shelve
object, an SQL database, an XML or HTML page, a network socket, and so on. The point is that
scanner doesn't know or care what kind of object is implementing the interface it expects, or
what that interface actually does.
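For readers on Python 3, where StringIO lives in the io module, the same idea can be sketched in self-contained form. The scanner function is repeated here so the example runs on its own; ListStream and the sample data are hypothetical, and the handler collects words in a list instead of printing them.

```python
import io

def scanner(fileobject, linehandler):
    # Same shape as the scanner function above: only readlines is required.
    for line in fileobject.readlines():
        linehandler(line)

class ListStream:
    """Hypothetical stand-in that serves lines from an in-memory list."""
    def __init__(self, lines):
        self.lines = lines

    def readlines(self):
        return [line + '\n' for line in self.lines]

first_words = []

def firstword(line):
    first_words.append(line.split()[0])

# Two unrelated sources, one unchanged scanner.
scanner(io.StringIO('one\ntwo xxx\nthree\n'), firstword)
scanner(ListStream(['a', 'b c d']), firstword)
```

Neither source is a real file, yet scanner handles both, because the only thing it asks of its argument is a readlines method.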
Object-oriented programmers know this deliberate naiveté as polymorphism. The type of the
object being processed determines what an operation, such as the readlines method call in
scanner, actually does. Everywhere in Python, object interfaces, rather than specific data types,
are the unit of coupling. The practical effect is that functions are often applicable to a much
broader range of problems than you might expect. This is especially true if you have a background
in statically typed languages such as C or C++. It is almost as if we get C++ templates for free in
Python. Code has an innate flexibility that is a byproduct of Python's dynamic typing.
Of course, code portability and flexibility run rampant in Python development and are not really
confined to file interfaces. Both are features of the language that are simply inherited by file-
processing scripts. Other Python benefits, such as its easy scriptability and code readability, are
also key assets when it comes time to change file-processing programs. But, rather than extolling
all of Python's virtues here, I'll simply defer to the wonderful example programs in this chapter
and this text at large for more details. Enjoy!
4.2 Reading from a File
Credit: Luther Blissett
4.2.1 Problem
You want to read text or data from a file.
4.2.2 Solution
Here's the most convenient way to read all of the file's contents at once into one big string:
all_the_text = open('thefile.txt').read( )      # all text from a text file
all_the_data = open('abinfile', 'rb').read( )   # all data from a binary file
However, it is better to bind the file object to a variable so that you can call close on it as soon
as you're done. For example, for a text file:
file_object = open('thefile.txt')
all_the_text = file_object.read( )
file_object.close( )
There are four ways to read a text file's contents at once as a list of strings, one per line:
list_of_all_the_lines = file_object.readlines( )
list_of_all_the_lines = file_object.read( ).splitlines(1)
list_of_all_the_lines = file_object.read().splitlines( )
list_of_all_the_lines = file_object.read( ).split('\n')
The first two ways leave a '\n' at the end of each line (i.e., in each string item in the result list),
while the other two ways remove all trailing '\n' characters. The first of these four ways is the
fastest and most Pythonic. In Python 2.2 and later, there is a fifth way that is equivalent to the first
one:
list_of_all_the_lines = list(file_object)
4.2.3 Discussion
Unless the file you're reading is truly huge, slurping it all into memory in one gulp is fastest and
generally most convenient for any further processing. The built-in function open creates a
Python file object. With that object, you call the read method to get all of the contents (whether
text or binary) as a single large string. If the contents are text, you may choose to immediately
split that string into a list of lines, with the split method or with the specialized splitlines
method. Since such splitting is a frequent need, you may also call readlines directly on the
file object, for slightly faster and more convenient operation. In Python 2.2 and later, you
can also pass the file object directly as the only argument to the built-in type list.
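The ways of splitting a file into lines can be compared side by side in a short sketch. It assumes Python 3 and uses io.StringIO as a stand-in for an open text file, so that no file on disk is needed; the sample text is hypothetical and ends with a newline.

```python
import io

# Sample text ending with a newline, as most text files do.
text = 'one\ntwo\n'

way1 = io.StringIO(text).readlines()               # keeps each '\n'
way2 = io.StringIO(text).read().splitlines(True)   # keeps each '\n'
way3 = io.StringIO(text).read().splitlines()       # strips each '\n'
way4 = io.StringIO(text).read().split('\n')        # strips, plus a final ''
way5 = list(io.StringIO(text))                     # same result as readlines
```

One difference is worth keeping in mind: when the text ends with a newline, split('\n') returns an extra empty string as the last item, while splitlines does not.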
On Unix and Unix-like systems, such as Linux and BSD variants, there is no real distinction
between text files and binary data files. On Windows and Macintosh systems, however, line
terminators in text files are encoded not with the standard '\n' separator, but with '\r\n' and
'\r', respectively. Python translates the line-termination characters into '\n' on your behalf,
but this means that you need to tell Python when you open a binary file, so that it won't perform
the translation. To do that, use 'rb' as the second argument to open. This is innocuous even on
Unix-like platforms, and it's a good habit to distinguish binary files from text files even there,
although it's not mandatory in that case. Such a good habit will make your programs more directly
understandable, as well as letting you move them between platforms more easily.
You can call methods such as read directly on the file object produced by the open function, as
shown in the first snippet of the solution. When you do this, as soon as the reading operation
finishes, you no longer have a reference to the file object. In practice, Python notices the lack of a
reference at once and immediately closes the file. However, it is better to bind a name to the result
of open, so that you can call close yourself explicitly when you are done with the file. This
ensures that the file stays open for as short a time as possible, even on platforms such as Jython
and hypothetical future versions of Python on which more advanced garbage-collection
mechanisms might delay the automatic closing that Python performs.
If you choose to read the file a little at a time, rather than all at once, the idioms are different.
Here's how to read a binary file 100 bytes at a time, until you reach the end of the file:
file_object = open('abinfile', 'rb')
while 1:
    chunk = file_object.read(100)
    if not chunk: break
    do_something_with(chunk)
file_object.close( )
Passing an argument N to the read method ensures that read will read only the next N bytes
(or fewer, if the file is closer to the end). read returns the empty string when it reaches the end of
the file.
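The chunked-reading loop can also be packaged as a reusable generator. The sketch below assumes Python 3; the name read_in_chunks is hypothetical, and io.BytesIO stands in for a binary file of 250 bytes.

```python
import io

def read_in_chunks(file_object, chunk_size=100):
    """Yield successive chunks until read returns an empty result."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Stand-in for a 250-byte binary file.
data = io.BytesIO(b'x' * 250)
sizes = [len(chunk) for chunk in read_in_chunks(data)]
```

The last chunk is shorter than the others because fewer than 100 bytes remained, and the loop ends on the following read, which returns an empty result.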
Reading a text file one line at a time is a frequent task. In Python 2.2 and later, this is the easiest,
clearest, and fastest approach:
for line in open('thefile.txt'):
    do_something_with(line)
Several idioms were common in older versions of Python. The one idiom you can be sure will
work even on extremely old versions of Python, such as 1.5.2, is quite similar to the idiom for
reading a binary file a chunk at a time:
file_object = open('thefile.txt')
while 1:
    line = file_object.readline( )
    if not line: break
    do_something_with(line)
file_object.close( )
readline, like read, returns the empty string when it reaches the end of the file. Note that
the end of the file is easily distinguished from an empty line because the latter is returned by
readline as '\n', which is not an empty string but rather a string with a length of 1.
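The distinction between an empty line and the end of the file is easy to check in a short sketch. It assumes Python 3, and io.StringIO stands in for a text file whose second line is empty and whose last line has no terminator.

```python
import io

source = io.StringIO('first\n\nlast')

results = []
while True:
    line = source.readline()
    if not line:            # only the end of the file yields ''
        break
    results.append(line)

at_eof = source.readline()  # further calls keep returning ''
```

The empty second line comes back as '\n', which is a true value, so the loop does not stop early.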
4.2.4 See Also
Recipe 4.3; documentation for the open built-in function and file objects in the Library
Reference.
4.3 Writing to a File
Credit: Luther Blissett
4.3.1 Problem
You want to write text or data to a file.
4.3.2 Solution
Here is the most convenient way to write one big string to a file:
open('thefile.txt', 'w').write(all_the_text)   # text to a text file
open('abinfile', 'wb').write(all_the_data)     # data to a binary file
However, it is better to bind the file object to a variable so that you can call close on it as soon
as you're done. For example, for a text file:
file_object = open('thefile.txt', 'w')
file_object.write(all_the_text)
file_object.close( )
More often, the data you want to write is not in one big string but in a list (or other sequence) of
strings. In this case, you should use the writelines method (which, despite its name, is not
limited to lines and works just as well with binary data as with text files):
file_object.writelines(list_of_text_strings)
open('abinfile', 'wb').writelines(list_of_data_strings)
Calling writelines is much faster than either joining the strings into one big string (e.g., with
''.join) and then calling write, or calling write repeatedly in a loop.
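The three approaches just compared produce identical output, as the following sketch shows. It assumes Python 3 and uses io.StringIO objects in place of real files; the sample strings are hypothetical, and the sketch makes no attempt to measure speed.

```python
import io

pieces = ['alpha\n', 'beta\n', 'gamma\n']

# One writelines call for the whole sequence.
a = io.StringIO()
a.writelines(pieces)

# Join first, then a single write.
b = io.StringIO()
b.write(''.join(pieces))

# One write call per string.
c = io.StringIO()
for piece in pieces:
    c.write(piece)

contents = [a.getvalue(), b.getvalue(), c.getvalue()]
```

Note that writelines adds no separators of its own, so each string must already carry its line terminator if one is wanted.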
4.3.3 Discussion
To create a file object for writing, you must always pass a second argument to open: either
'w' to write textual data, or 'wb' to write binary data. The same considerations illustrated in
Recipe 4.2 also apply here, except that calling close explicitly is even more advisable when
you're writing to a file rather than reading from it. Only by closing the file can you be reasonably
sure that the data is actually on the disk and not in some temporary buffer in memory.
Writing a file a little at a time is more common and less of a problem than reading a file a little at
a time. You can just call write and/or writelines repeatedly, as each string or sequence of
strings to write becomes ready. Each write operation appends data at the end of the file, after all
the previously written data. When you're done, call the close method on the file object. If you
have all the data available at once, a single writelines call is faster and simpler. However, if
the data becomes available a little at a time, it's at least as easy and fast to call write as it comes
as it would be to build up a temporary list of pieces (e.g., with append) to be able to write it all
at once in the end with writelines. Reading and writing are quite different from each other,
with respect to the performance implications of operating in bulk versus operating a little at a time.