These functions need a specific file format. The read.xport() function
works with the XPORT format of SAS. For read.mtp(), the file must be in the
Minitab portable worksheet (.mtp) format.
Note that some of these functions are rather old. The newest versions of the
statistical packages mentioned here may have different specifications for the
format, so the functions aren’t always guaranteed to work.
Also note that some of these functions require the statistical package itself
to be installed on your computer. The read.ssd() function, for example, can work
only if you have SAS installed.
The bottom line: If you can transfer data using CSV files, you’ll save yourself
a lot of trouble.
Finally, if you have a need to connect R to a database, then the odds are that
a package exists that can connect to your database of choice. See the nearby
sidebar, “Working with databases in R,” for some pointers.
Working with databases in R
Data analysts increasingly make use of databases to store large quantities of
data or to share data with other people. R has good support to work with a
variety of databases, but the exact details of how you do that will vary from
system to system.
If you need to connect R to your database, a really good place to start
looking for information is Chapter 4 of the R manual “R Data
Import/Export.” The RODBC package allows you to connect to Open Database
Connectivity (ODBC) data sources. You can find this package on CRAN.
In addition, you can download and install packages to connect R to many
other database systems.
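As a rough illustration of what working with a database from R can look like, here is a minimal sketch. It assumes the DBI and RSQLite packages are installed; neither the package choice nor the object names (con, db.result) come from this sidebar, and the details will differ for other database systems.

```r
# Assumption: DBI and RSQLite are installed. They are one possible choice,
# not the only way to reach a database from R.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Open a throwaway in-memory SQLite database
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Copy a data frame into the database as a table called "iris"
  DBI::dbWriteTable(con, "iris", iris)

  # Run a query and get the result back as a data frame
  db.result <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM iris")

  # Always close the connection when you are done
  DBI::dbDisconnect(con)
  print(db.result)
} else {
  db.result <- NULL
  message("DBI and/or RSQLite not installed; skipping the sketch")
}
```

The query result arrives as an ordinary data frame, so everything you know about data frames applies to it.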
Getting Your Data out of R
For the same reason that it’s convenient to import data into R using CSV files,
it’s also convenient to export results from R to other applications in CSV format. To
create a CSV file, use the write.csv() function. In the same way that read.csv() is
a special case of read.table(), write.csv() is a special case of write.table().
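The relationship between the two functions is easy to check for yourself. The following is a minimal sketch; the variable names (csv.path, tsv.path, and so on) are made up for the illustration, and tempfile() is used only so that nothing is written to your working directory.

```r
# Write the same data twice: once with write.csv(), once with write.table()
csv.path <- tempfile(fileext = ".csv")
tsv.path <- tempfile(fileext = ".txt")

# row.names = FALSE leaves out the column of row names
write.csv(head(iris), file = csv.path, row.names = FALSE)
write.table(head(iris), file = tsv.path, sep = "\t", row.names = FALSE)

# Look at the raw text that ended up on disk
csv.lines <- readLines(csv.path)
tsv.lines <- readLines(tsv.path)
csv.lines[1]   # header fields separated by commas
tsv.lines[1]   # header fields separated by tabs

# Tidy up the temporary files
unlink(c(csv.path, tsv.path))
```

With write.table() you choose the separator yourself; write.csv() simply fixes it to a comma for you.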
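To interactively export data from R for pasting into other applications, you can
use writeClipboard() or write.table(). The writeClipboard() function (available
in R for Windows) is useful for exporting vector data. For example, to export
the names of the built-in dataset iris, try the following:
> writeClipboard(names(iris))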
This function doesn’t produce any output to the R console, but you can now
paste the vector into any application. For example, if you paste this into Excel,
you’ll have a column of five entries that contains the names of the iris data, as
shown in Figure 12-3.
Figure 12-3: A spreadsheet after first using writeClipboard() and then pasting.
To write tabular data to the Clipboard, you need to use write.table() with
file="clipboard", a tab separator, and no row names:
> write.table(head(iris), file="clipboard", sep="\t", row.names=FALSE)
Again, this doesn’t produce output to the R console, but you can paste the
data into a spreadsheet. The results look like Figure 12-4.
Figure 12-4: The first six lines of iris after pasting into a spreadsheet.
Working with Files and Folders
You know how to import your data into R and export your data from R. Now all
you need is an idea of where R stores its files and how to manipulate them.
Understanding the working directory
Every R session has a default location on your operating system’s file structure
called the working directory.
You need to keep track of your working directory and deliberately set it in each
R session. Unless you give a full path, R reads and writes files in the working
directory. If you don’t set the working directory to your desired location, you
could easily write files to an undesirable file location.
The getwd() function tells you what the current working directory is.
To change the working directory, use the setwd() function. Be sure to enter
the working directory as a character string (enclose it in quotes).
This example shows how to change your working directory to a folder called
Notice that the separator between folders is the forward slash (/), as it is on
Linux and Mac systems. If you use the Windows operating system, the forward
slash will look odd, because you’re familiar with the backslash (\) of Windows
folders. When working in Windows, you need to either use the forward slash or
escape your backslashes using a double backslash (\\).
R will always print the results using /, but you’re free to use either / or \\
as you please.
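To see the two spellings side by side, here is a minimal sketch. The folder F:/git/roxygen2 is only a hypothetical example path and does not need to exist; the part that actually changes the working directory uses tempdir(), which exists on every system. The names forward, escaped, old.wd, and new.wd are made up for the illustration.

```r
# Two ways to type the same (hypothetical) Windows folder
forward <- "F:/git/roxygen2"
escaped <- "F:\\git\\roxygen2"   # each \\ you type is ONE backslash in the string

# Both strings have the same number of characters
nchar(forward) == nchar(escaped)

# Replacing the backslashes with forward slashes gives the first spelling
gsub("\\", "/", escaped, fixed = TRUE)

# A getwd()/setwd() round trip with a folder that is sure to exist
old.wd <- getwd()      # remember where you started
setwd(tempdir())       # the path is passed as a character string
new.wd <- getwd()      # now points at the temporary folder
setwd(old.wd)          # go back to where you started
```

Saving the old working directory before calling setwd() makes it easy to put things back when you’re done.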
To avoid having to deal with escaping backslashes in file paths, you can use
the file.path() function to construct file paths that are correct, independent
of the operating system you work on. This function is a little bit similar to
paste() in the sense that it will append character strings, except that the
separator is always correct, regardless of the settings in your operating
system:
> file.path("f:", "git", "surveyor")
It’s often convenient to use file.path() in setting the working directory. This
allows you to specify a cascade of drive letters and folder names, and file.path()
then assembles these into a proper file path, with the correct separator character:
> setwd(file.path("F:", "git", "roxygen2"))
You also can use file.path() to specify file paths that include the filename
at the end. Simply add the filename to the path argument. For example, here’s
the file path to the README.md file in the roxygen2 package installed in a local
folder:
> file.path("F:", "git", "roxygen2", "roxygen2", "README.md")
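If you want to experiment with file.path() without touching any real folders, the following sketch may help. The paths are the hypothetical ones from the examples above, and repo.dir and readme are names chosen for this illustration. The functions basename() and dirname(), which aren’t discussed in this section, are base R helpers that take a path apart again.

```r
# Build a folder path, then extend it with a filename (nothing is created on disk)
repo.dir <- file.path("F:", "git", "roxygen2")
readme   <- file.path(repo.dir, "roxygen2", "README.md")
readme

# Take the path apart again
basename(readme)   # the filename at the end
dirname(readme)    # everything before the filename

# paste() gives the same result only if you supply the separator yourself
identical(readme,
          paste("F:", "git", "roxygen2", "roxygen2", "README.md", sep = "/"))
```

Because file.path() only glues strings together, it works even when the folders in question don’t exist yet.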
Occasionally, you may want to write a script that will traverse a given folder
and perform actions on all the files or a subset of files in that folder.
To get a list of files in a specific folder, use list.files() or dir(). These two
functions do exactly the same thing, but for backward-compatibility reasons, the
same function has two names:
> list.files(file.path("F:", "git", "roxygen2"))
[1] "roxygen2" "roxygen2.Rcheck"
[3] "roxygen2_2.0.tar.gz" "roxygen2_2.1.tar.gz"
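Here is one way the “traverse a folder and act on a subset of files” idea could look in practice. This is a sketch only: the folder traverse-demo and the files in it are created just for the illustration, and the pattern and full.names arguments of list.files() are used to pick out the CSV files and return their complete paths.

```r
# Set up a small demo folder with two CSV files and one other file
demo.dir <- file.path(tempdir(), "traverse-demo")
dir.create(demo.dir, showWarnings = FALSE)
write.csv(head(iris), file.path(demo.dir, "a.csv"), row.names = FALSE)
write.csv(tail(iris), file.path(demo.dir, "b.csv"), row.names = FALSE)
writeLines("not a csv", file.path(demo.dir, "notes.txt"))

# Select only the files whose names end in .csv
csv.files <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)

# Perform an action on each selected file: here, count the rows
row.counts <- sapply(csv.files, function(f) nrow(read.csv(f)))
names(row.counts) <- basename(csv.files)
row.counts

# list.files() and dir() return the same thing
identical(list.files(demo.dir), dir(demo.dir))

# Remove the demo folder again
unlink(demo.dir, recursive = TRUE)
```

The same pattern works for any action: swap the row count for whatever processing your files need.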
Table 12-2 lists some other useful functions for working with files.
Table 12-2 Useful Functions for Manipulating Files
Function        Description
list.files()    Lists files in a directory.
list.dirs()     Lists subdirectories of a directory.
file.exists()   Tests whether a specific file exists in a location.
file.create()   Creates a file.
file.remove()   Deletes files (and directories in Unix operating systems).
tempfile()      Returns a name for a temporary file. If you create a file (for
                example, with file.create()) using this returned name, R will
                create the file in a temporary folder.
tempdir()       Returns the file path of a temporary folder on your file system.
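The following sketch strings several of these functions together so you can watch a file appear and disappear. The variable names are invented for the illustration, and everything happens in R’s temporary folder.

```r
# Ask for a temporary filename; nothing exists on disk yet
scratch <- tempfile(fileext = ".txt")
exists.before <- file.exists(scratch)

# Create an empty file with that name and check again
created      <- file.create(scratch)
exists.after <- file.exists(scratch)

# The new file shows up in the listing of the temporary folder
in.tempdir <- basename(scratch) %in% list.files(tempdir())

# Delete the file and confirm that it is gone
removed      <- file.remove(scratch)
exists.final <- file.exists(scratch)

c(before = exists.before, after = exists.after, final = exists.final)
```

Both file.create() and file.remove() return TRUE or FALSE, so you can test whether the operation succeeded.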
Next, you get to exercise all your knowledge about working with files. In the
next example, you first create a temporary file, then save a copy of the iris data
frame to this file. To test that the file is on disk, you then read the newly created
file to a new variable and inspect this variable. Finally, you delete the temporary
file from disk.
Start by using the tempfile() function to return a character string
with the name of a file in a temporary folder on your system:
> my.file <- tempfile()
Notice that the result is purely a character string, not a file. This file doesn’t
yet exist anywhere. Next, you save a copy of the data frame iris to my.file using
the write.csv() function. Then use list.files() to see if R created the file:
> write.csv(iris, file=my.file)
> list.files(tempdir())
As you can see, R created the file. Now you can use read.csv() to import the
data to a new variable called file.iris:
> file.iris <- read.csv(my.file)
Use str() to investigate the structure of file.iris. As expected, file.iris is a
data.frame of 150 observations and six variables. Six variables, you say? Yes, six,
although the original iris only has five columns. What happened here was that the
default value of the argument row.names of write.csv() is TRUE. (You
can confirm this by taking a close look at the Help for ?write.csv.) So, R saved
the original row names of iris to a new column called X:
> str(file.iris)
'data.frame': 150 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
To leave your file system in its original order, you can use file.remove() to
delete the temporary file:
> file.remove(my.file)
[1] TRUE
> list.files(tempdir())
character(0)
As you can see, the result of list.files() is an empty character vector,
because the file no longer exists in that folder.
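If you’d rather not pick up the extra column of row names, one option is to switch it off when writing. The sketch below repeats the round trip with row.names=FALSE. The stringsAsFactors=TRUE argument is an addition for this illustration: from R 4.0.0 onward, read.csv() reads text columns as character vectors by default, so the argument is needed if you want Species to come back as a factor.

```r
# Same round trip as above, but without writing the row names
my.file <- tempfile()
write.csv(iris, file = my.file, row.names = FALSE)

# Read the file back; ask explicitly for factors (not the default in R >= 4.0.0)
file.iris <- read.csv(my.file, stringsAsFactors = TRUE)

# Now the structure matches the original: 150 observations of 5 variables
str(file.iris)
dim(file.iris)

# Clean up and confirm that the file is gone
file.remove(my.file)
file.exists(my.file)
```

Using file.exists() on the one file you care about is a handy alternative to scanning the whole listing of the temporary folder.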
Manipulating and Processing Data
In This Chapter
Creating subsets of data
Adding calculated fields
Merging data from different sources
Meeting more members of the apply family
Getting your data into shape
Now it’s time to put together all the tools that you encountered in earlier
chapters. You know how to get data into R, you know how to work with lists and
data frames, and you know how to write functions. Combined, these tools form the
basic toolkit to be able to do data manipulation and processing in R.
In this chapter, you get to use some tricks and design idioms for working with
data. This includes methods for selecting and ordering data, such as working with
lookup tables. You also get to use some techniques for reshaping data — for
example, changing the shape of data from wide format to long format.
Deciding on the Most Appropriate Data Structure
The first decision you have to make before analyzing your data is how to
represent that data inside R. In Chapters 4, 5, and 7, you see that the basic data
structures in R are vectors, matrices, lists, and data frames.
If your data has only one dimension, then you already know that vectors
represent this type of data very well. However, if your data has more than one
dimension, you have the choice of using matrices, lists, or data frames. So, the
question is: When do you use which?
Matrices and higher-dimensional arrays are useful when all your data are of a
single class — in other words, all your data are numeric or all your data are
characters. If you’re a mathematician or statistician, you’re familiar with matrices
and likely use this type of object very frequently.
But in many practical situations, you’ll have data that have many different
classes — in other words, you’ll have a mixture of numeric and character data. In
this case, you need to use either lists or data frames.
If you imagine your data as a single spreadsheet, a data frame is probably a
good choice. Remember that a data frame is simply a list of named vectors of the
same length, which is conceptually very similar to a spreadsheet with columns and
a column heading for each. If you’re familiar with databases, you can think of a
data frame as similar to a single table in a database. Data frames are
tremendously useful and, in many cases, will be your first choice of objects for
storing your data.
If your data consists of a collection of objects but you can’t represent that as
an array or a data frame, then a list is your ideal choice. Because lists can contain
all kinds of other objects, including other lists or data frames, they’re tremendously
flexible. Consequently, R has a wide variety of tools to process lists.
Table 13-1 contains a summary of these choices.
You may find that a data frame is a very suitable choice for most analysis
and data-processing tasks. It’s a very convenient way of representing your
data, and it’s similar to working with database tables. When you read data
from a comma-separated value (CSV) file with the function read.csv(), R puts
the results in a data frame.
Table 13-1 Useful Objects for Data Analysis
Object: Vector
Description: The basic data object in R, consisting of one or more values of a
single type (for example, character or number).
Comments: Think of this as a single column or row in a spreadsheet, or a column
in a database table.
Object: Array or matrix
Description: A multidimensional object of a single type (known as atomic). A
matrix is an array of two dimensions.
Comments: When you have to store numbers in many dimensions, use arrays.
Object: List
Description: Lists can contain objects of any type.
Comments: Lists are very useful for storing collections of data that belong
together. Because lists can contain lists, this type of object is very flexible.
Object: Data frame
Description: Data frames are a special kind of named list where all the elements
have the same length.
Comments: Data frames are similar to a single spreadsheet or to a table in a
database.
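To make the choices a bit more concrete, here is a minimal sketch that creates one object of each kind. The object names (v, m, a, l, df) and the values in them are invented for the illustration.

```r
# One dimension, one type: a vector
v <- c(5.1, 4.9, 4.7)

# Several dimensions, one type: a matrix (two dimensions) or an array (more)
m <- matrix(1:6, nrow = 2)
a <- array(1:24, dim = c(2, 3, 4))

# A collection of anything, including other lists and data frames: a list
l <- list(name = "iris", dims = dim(iris), preview = head(iris, 3))

# Columns of equal length but possibly different types: a data frame
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Ask each object what it is
sapply(list(vector = v, matrix = m, array = a, list = l, data.frame = df),
       function(x) class(x)[1])

# A data frame really is a list under the hood
is.list(df)
```

Notice that the list l happily holds a character string, an integer vector, and a data frame side by side, which is exactly the flexibility described in the table.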
Creating Subsets of Your Data
Often the first task in data processing is to create subsets of your data for
further analysis. In Chapters 3 and 4, we show you ways of subsetting vectors. In
Chapter 7, we outline methods for creating subsets of arrays, data frames, and lists.
Because this is such a fundamental task in data analysis, we review and
summarize the different methods for creating subsets of data.
Understanding the three subset operators
You’re already familiar with the three subset operators:
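As a quick reminder, R’s three subset operators are $, [[, and [. The sketch below shows each of them at work on the built-in iris data frame; the variable names are made up for the illustration.

```r
# $ extracts a single named element (here, one column as a vector)
by.dollar <- iris$Species

# [[ also extracts a single element, by name or by position
by.double <- iris[["Species"]]

# [ returns a subset of the same kind as the original object
by.single <- iris["Species"]                            # still a data frame
first.rows <- iris[1:3, c("Sepal.Length", "Species")]   # rows and columns

identical(by.dollar, by.double)   # both give the same factor
class(by.single)                  # "data.frame"
dim(first.rows)                   # 3 rows, 2 columns
```

The key difference to remember is that $ and [[ pull one element out of the object, whereas [ gives you back a smaller object of the same kind.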