following:
> frost <- state.x77[, "Frost"]
> head(frost, 5)
Alabama Alaska Arizona Arkansas California
20 152 15 65 20
You now have a new object, frost, a named numeric vector. Now use cut() to create three bins in your data:
> cut(frost, 3, include.lowest=TRUE)
[1] [-0.188,62.6] (125,188] [-0.188,62.6] (62.6,125]
[5] [-0.188,62.6] (125,188] (125,188] (62.6,125]
....
[45] (125,188] (62.6,125] [-0.188,62.6] (62.6,125]
[49] (125,188] (125,188]
Levels: [-0.188,62.6] (62.6,125] (125,188]
The result is a factor with three levels. The names of the levels seem a bit complicated, but they tell you in mathematical set notation what the boundaries of your bins are. For example, the first bin contains those states that have frost between –0.188 and 62.6 days. In reality, of course, none of the states will have a negative number of frost days; R is being mathematically conservative and adds a bit of padding.
Note the argument include.lowest=TRUE to cut(). The default value for this argument is include.lowest=FALSE, which can sometimes cause R to ignore the lowest value in your data.
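The effect of include.lowest is easiest to see when you supply explicit break points instead of a number of bins. The following is a minimal sketch using a made-up vector (days is a hypothetical name, not part of state.x77) whose smallest value sits exactly on the first break:

```r
# Hypothetical data: the smallest value sits exactly on the first break
days <- c(0, 5, 10)

# Default (include.lowest = FALSE): the intervals are (0,5] and (5,10],
# so the value 0 falls outside every bin and becomes NA
strict <- cut(days, breaks = c(0, 5, 10))

# With include.lowest = TRUE the first interval is closed on the left: [0,5]
closed <- cut(days, breaks = c(0, 5, 10), include.lowest = TRUE)

print(strict)
print(closed)
```

In the first result the lowest value is NA; in the second it lands in the first bin.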
Adding labels to cut
The level names aren’t very user friendly, so specify some better names with the labels argument:
> cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
[1] Low High Low Med Low High High Med Low Low Low
....
[45] High Med Low Med High High
Levels: Low Med High
Now you have a factor that classifies states into low, medium, and high, depending on the number of days of frost they get.
Using table to count the number of observations
One interesting piece of analysis is to count how many states are in each bracket. You can do this with the table() function, which simply counts the number of observations in each level of your factor.
> x <- cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
> table(x)
x
Low Med High
11 19 20
You encounter the table() function again in Chapter 15.
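If you want shares as well as counts, one possible follow-up is to pass the result of table() to prop.table(). This is a small sketch on made-up data (ratings is a hypothetical factor, not derived from state.x77):

```r
# Hypothetical ratings, not taken from state.x77
ratings <- factor(c("Low", "High", "Low", "Med", "Low"),
                  levels = c("Low", "Med", "High"))

counts <- table(ratings)      # number of observations per level
shares <- prop.table(counts)  # the same counts expressed as proportions

print(counts)
print(shares)
```

The levels appear in the order in which the factor defines them, not in order of frequency.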
Combining and Merging Data Sets
Now you have a grasp of how to subset your data and how to perform
calculations on it. The next thing you may want to do is combine data from
different sources. Generally speaking, you can combine different sets of data in
three ways:
By adding columns: If the two sets of data have an equal set of rows, and the order of the rows is identical, then adding columns makes sense. Your options for doing this are data.frame() or cbind() (see Chapter 7).
By adding rows: If both sets of data have the same columns and you want to add rows to the bottom, use rbind() (see Chapter 7).
By combining data with different shapes: The merge() function combines data based on common columns, as well as common rows. In database language, this is usually called joining data.
Figure 13-1 shows these three options schematically.
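The three options can be sketched on a few toy data frames. All names and values below (a, b, more, lookup) are made up for illustration:

```r
# Hypothetical toy data frames (names and values are made up)
a      <- data.frame(id = 1:3, x = c(10, 20, 30))
b      <- data.frame(y = c("p", "q", "r"))        # same rows, same order as a
more   <- data.frame(id = 4:5, x = c(40, 50))     # same columns as a
lookup <- data.frame(id = c(2, 3, 9), z = c(TRUE, FALSE, TRUE))

wide   <- cbind(a, b)        # adding columns
long   <- rbind(a, more)     # adding rows
joined <- merge(a, lookup)   # joining on the common column id

print(wide)
print(long)
print(joined)
```

Only ids 2 and 3 occur in both a and lookup, so joined has two rows.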
In this section, we look at some of the possibilities of combining data with merge(). More specifically, you use merge() to find the intersection, as well as the union, of different data sets. You also look at other ways of working with lookup tables, using the functions match() and %in%.
Figure 13-1: Different ways of combining data.
Sometimes you want to combine data where it isn’t as straightforward to simply add columns or rows. It could be that you want to combine data based on the values of preexisting keys in the data. This is where the merge() function is useful. You can use merge() to combine data only when certain matching conditions are satisfied.
Say, for example, you have information about states in a country. If one dataset contains information about population and another contains information about regions, and both have information about the state name, you can use merge() to combine your results.
Creating sample data to illustrate merging
To illustrate the different ways of using merge(), have a look at the built-in dataset state.x77. This is a matrix (a two-dimensional array), so start by converting it into a data frame. Then add a new column with the names of the states. Finally, remove the old row names. (Because you explicitly add a column with the names of each state, you don’t need to have that information duplicated in the row names.)
> all.states <- as.data.frame(state.x77)
> all.states$Name <- rownames(state.x77)
> rownames(all.states) <- NULL
Now you should have a data frame all.states with 50 observations of nine variables:
> str(all.states)
'data.frame': 50 obs. of 9 variables:
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life Exp : num 69 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
$ Name : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
Creating a subset of cold states
Next, create a subset called cold.states consisting of those states with more than 150 days of frost each year, keeping the columns Name and Frost:
> cold.states <- all.states[all.states$Frost>150, c("Name", "Frost")]
> cold.states
Name Frost
2 Alaska 152
6 Colorado 166
....
45 Vermont 168
50 Wyoming 173
Creating a subset of large states
Finally, create a subset called large.states consisting of those states with a land area of at least 100,000 square miles, keeping the columns Name and Area:
> large.states <- all.states[all.states$Area>=100000, c("Name", "Area")]
> large.states
Name Area
2 Alaska 566432
3 Arizona 113417
....
31 New Mexico 121412
43 Texas 262134
Now you’re ready to explore the different types of merge.
Using the merge() function
In R you use the merge() function to combine data frames. This powerful function tries to identify columns or rows that are common between the two different data frames.
Using merge to find the intersection of data
The simplest form of merge() finds the intersection between two different sets of data. In other words, to create a data frame that consists of those states that are cold as well as large, use the default version of merge():
> merge(cold.states, large.states)
Name Frost Area
1 Alaska 152 566432
2 Colorado 166 103766
3 Montana 155 145587
4 Nevada 188 109889
If you’re familiar with a database language such as SQL, you may have guessed that merge() is very similar to a database join. This is, indeed, the case, and the different arguments to merge() allow you to perform natural joins, as well as left, right, and full outer joins.
The merge() function takes quite a large number of arguments. These arguments can look quite intimidating until you realize that they fall into a smaller number of related groups:
x: A data frame.
y: A data frame.
by, by.x, by.y: The names of the columns that are common to both x and y. The default is to use the columns with common names between the two data frames.
all, all.x, all.y: Logical values that specify the type of merge. The default value is all=FALSE (meaning that only the matching rows are returned).
That last group of arguments (all, all.x, and all.y) deserves some explanation. These arguments determine the type of merge that will happen (see the next section).
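The by.x and by.y arguments matter when the key column has a different name in each data frame. The following is a minimal sketch; the data frames pop and size and the column name State are hypothetical and built by hand for illustration:

```r
# Hypothetical data: the key column is called State in pop and Name in size
pop  <- data.frame(State = c("Alabama", "Alaska", "Arizona"),
                   Population = c(3615, 365, 2212),
                   stringsAsFactors = FALSE)
size <- data.frame(Name = c("Arizona", "Alaska"),
                   Area = c(113417, 566432),
                   stringsAsFactors = FALSE)

# Tell merge() which column plays the role of the key on each side
both <- merge(pop, size, by.x = "State", by.y = "Name")
print(both)
```

The key column in the result takes its name from x (here, State), and only the states present in both data frames are kept.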
Understanding the different types of merge
The merge() function allows four ways of combining data:
Natural join: To keep only rows that match from the data frames, specify the argument all=FALSE.
Full outer join: To keep all rows from both data frames, specify all=TRUE.
Left outer join: To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE.
Right outer join: To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE.
You can see a visual depiction of all these different options in Figure 13-2.
Figure 13-2: Different types of merge() and their database join equivalents.
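To see the four join types side by side, here is a small sketch on two made-up data frames (x and y below are toy data, chosen so that ids 2 and 3 appear in both):

```r
# Hypothetical toy data: ids 2 and 3 appear in both data frames
x <- data.frame(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

inner <- merge(x, y)                 # natural join: ids 2, 3
full  <- merge(x, y, all = TRUE)     # full outer join: ids 1, 2, 3, 4
left  <- merge(x, y, all.x = TRUE)   # left outer join: ids 1, 2, 3
right <- merge(x, y, all.y = TRUE)   # right outer join: ids 2, 3, 4

print(inner)
print(full)
print(left)
print(right)
```

Wherever a row has no partner in the other data frame, the columns that come from that data frame are filled with NA.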
Finding the union (full outer join)
Returning to the examples of U.S. states, to perform a complete merge of cold and large states, use merge() and specify all=TRUE:
> merge(cold.states, large.states, all=TRUE)
Name Frost Area
1 Alaska 152 566432
2 Arizona NA 113417
3 California NA 156361
....
13 Texas NA 262134
14 Vermont 168 NA
15 Wyoming 173 NA
Both data frames have a variable Name, so R matches the cases based on the names of the states. The variable Frost comes from the data frame cold.states, and the variable Area comes from the data frame large.states.
Note that this performs the complete merge and fills the columns with NA values where there is no matching data.
Working with lookup tables
Sometimes doing a full merge of the data isn’t exactly what you want. In these cases, it may be more appropriate to match values in a lookup table. To do this, you can use the match() or %in% function.
Finding a match
The match() function returns the matching positions of two vectors or, more specifically, the positions of first matches of one vector in the second vector. For example, to find which large states also occur in the data frame cold.states, you can do the following:
> index <- match(cold.states$Name, large.states$Name)
> index
[1] 1 4 NA NA 5 6 NA NA NA NA NA
As you see, the result is a vector that indicates matches were found at positions one, four, five, and six. You can use this result as an index to find all the large states that are also cold states.
Keep in mind that you need to remove the NA values first, using na.omit():
> large.states[na.omit(index), ]
Name Area
2 Alaska 566432
6 Colorado 103766
26 Montana 145587
28 Nevada 109889
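The same idea works for a classic lookup table, where you translate codes into labels. This is a minimal sketch; the data frame lookup, its column names, and the vector codes are all hypothetical:

```r
# Hypothetical lookup table mapping state abbreviations to full names
lookup <- data.frame(Abbr = c("AK", "CO", "MT"),
                     Name = c("Alaska", "Colorado", "Montana"),
                     stringsAsFactors = FALSE)

codes <- c("MT", "AK", "XX", "CO")   # "XX" has no entry in the lookup table

pos <- match(codes, lookup$Abbr)     # position of each code in lookup$Abbr
full.names <- lookup$Name[pos]       # codes without a match become NA

print(pos)
print(full.names)
```

Here the NA values are left in place on purpose, because they flag the codes that the lookup table doesn't know about.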
Making sense of %in%
A very convenient alternative to match() is the function %in%, which returns a logical vector indicating whether there is a match.
The %in% function is a special type of function called a binary operator. This means you use it by placing it between two vectors, unlike most other functions where the arguments are in parentheses:
> index <- cold.states$Name %in% large.states$Name
> index
[1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
If you compare this to the result of match(), you see that you have a TRUE value for every non-missing value in the result of match(). Or, to put it in R code, the operator %in% does the same as the following code:
> !is.na(match(cold.states$Name,large.states$Name))
[1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
The match() function returns the indices of the matches in the second argument for the values in the first argument. On the other hand, %in% returns TRUE for every value in the first argument that matches a value in the second argument. The order of the arguments is important here.
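A short sketch shows why the order matters. The vectors a and b below are made up for illustration:

```r
# Hypothetical vectors for illustration
a <- c("Alaska", "Texas", "Maine")
b <- c("Maine", "Alaska")

a.in.b <- a %in% b      # one value per element of a: TRUE FALSE TRUE
b.in.a <- b %in% a      # one value per element of b: TRUE TRUE
pos.ab <- match(a, b)   # positions in b: 2 NA 1
pos.ba <- match(b, a)   # positions in a: 3 1

print(a.in.b)
print(b.in.a)
print(pos.ab)
print(pos.ba)
```

In each case the result has the same length as the first argument.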
Because %in% returns a logical vector, you can use it directly to index values in a vector or, as here, rows in a data frame:
> cold.states[index, ]
Name Frost
2 Alaska 152
6 Colorado 166
26 Montana 155
28 Nevada 188
As we mentioned earlier, the %in% function is an example of a binary operator in R. This means that the function is used by putting it between two values, as you would for other operators, such as + (plus) and - (minus). At the same time, %in% is an infix operator. An infix operator of this kind is identifiable by the percent signs around the function name. If you want to know how %in% is defined, look at the details section of its Help page. But note that you have to place quotation marks around the function name to get the Help page, like this: ?"%in%".
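Because an operator written between percent signs is an ordinary function underneath, you can call it by name and define your own in the same style. In the sketch below, %notin% is a hypothetical operator defined only for illustration; it is not part of base R:

```r
# An infix operator is an ordinary function, so it can also be called by name
by.name <- "%in%"(2, 1:3)

# Hypothetical operator %notin% (not part of base R), defined for illustration
`%notin%` <- function(x, table) !(x %in% table)

missing.from <- c("a", "z") %notin% c("a", "b")

print(by.name)
print(missing.from)
```

Note the backticks around the name when you define the operator; without them, R would try to parse the percent signs as part of an expression.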
Sorting and Ordering Data
One very common task in data analysis and reporting is sorting information.
You can answer many everyday questions with league tables — sorted tables of
data that tell you the best or worst of specific things. For example, parents want to
know which school in their area is the best, and businesses need to know the most
productive factories or the most lucrative sales areas. When you have the data, you
can answer all these questions simply by sorting it.
As an example, look again at the built-in data about the states in the U.S. First, create a data frame called some.states that combines the information in the built-in variables state.region and state.x77:
> some.states <- data.frame(
+ Region = state.region,
+ state.x77)
To keep the example manageable, create a subset of only the first ten rows
and the first three columns:
> some.states <- some.states[1:10, 1:3]
> some.states
Region Population Income
Alabama South 3615 3624
Alaska West 365 6315
Arizona West 2212 4530
....
Delaware South 579 4809
Florida South 8277 4815
Georgia South 4931 4091
You now have a variable called some.states that is a data frame consisting of ten rows and three columns (Region, Population, and Income).
Sorting vectors
R makes it easy to sort vectors in either ascending or descending order.
Because each column of a data frame is a vector, you may find that you perform
this operation quite frequently.
Sorting a vector in ascending order
To sort a vector, you use the sort() function. For example, to sort Population in ascending order, try this:
> sort(some.states$Population)
[1] 365 579 2110 2212 2541 3100 3615 4931 8277
[10] 21198
Sorting a vector in decreasing order
You also can tell sort() to go about its business in decreasing order. To do this, specify the argument decreasing=TRUE:
> sort(some.states$Population, decreasing=TRUE)
[1] 21198 8277 4931 3615 3100 2541 2212 2110 579
[10] 365
You can access the Help documentation for the sort() function by typing ?sort into the R console.
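One detail from that Help page is worth trying out: by default, sort() drops missing values. The sketch below uses a made-up vector v to show the default behavior and the na.last argument:

```r
# Hypothetical vector with a missing value
v <- c(3, NA, 1, 2)

asc  <- sort(v)                      # NA is dropped by default: 1 2 3
kept <- sort(v, na.last = TRUE)      # NA is kept and placed last: 1 2 3 NA
desc <- sort(v, decreasing = TRUE)   # decreasing order, NA dropped: 3 2 1

print(asc)
print(kept)
print(desc)
```

The columns of some.states contain no missing values, so the results shown earlier are unaffected.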
Sorting data frames
Another way of sorting data is to determine the order that elements should be
in, if you were to sort. This sounds long winded, but as you’ll see, having this
flexibility means you can write statements that are very natural.
Getting the order
First, determine the element order to sort some.states$Population in ascending