that benefits from being converted to a factor and data that needs to stay
numeric. If you can view your data as categorical, converting it to a factor
helps with analyzing it.
Counting unique values
Let’s take another look at the dataset mtcars. This built-in dataset describes
fuel consumption and ten different design aspects of 32 cars from the 1970s. It
contains, in total, 11 variables, but all of them are numeric. Although you can work
with the data frame as is, some variables could be converted to a factor because
they have a limited number of values.
If you don’t know how many different values a variable has, you can get this
information in two simple steps:
1. Get the unique values of the variable, using unique().
2. Get the length of the resulting vector, using length().
Using the sapply() function from Chapter 9, you can do this for the whole data
frame at once. You apply an anonymous function combining both mentioned steps
on the whole data frame, like this:
> sapply(mtcars, function(x) length(unique(x)))
mpg cyl disp hp drat wt qsec vs am gear carb
25 3 27 22 22 29 30 2 2 3 6
So, it looks like the variables cyl, vs, am, gear, and carb can benefit from a
conversion to factor. Remember: You have 32 different observations in that
dataset, so none of the variables has unique values only.
When to treat a variable as a factor depends a bit on the situation, but, as
a general rule, avoid more than ten different levels in a factor and try to have
at least five values per level.
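As a quick sketch of that rule of thumb, you can check the level counts before converting, here using the cyl variable of mtcars as an example:

```r
# Check whether cyl is a reasonable factor candidate:
# few distinct levels, and enough observations per level.
counts <- table(mtcars$cyl)
n_levels <- length(counts)   # 3 levels: 4, 6, and 8 cylinders
min_count <- min(counts)     # the smallest level still holds 7 cars
cyl_factor <- factor(mtcars$cyl)
nlevels(cyl_factor)          # 3
```

With only 3 levels and at least 7 cars per level, cyl passes both parts of the rule of thumb.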
Preparing the data
In many real-life cases, you get heaps of data in a big file, and preferably in a
format you can’t use at all. That must be the golden rule of data gathering: Make
sure your statistician sweats his pants off just by looking at the data. But no
worries! With R at your fingertips, you can quickly shape your data exactly as you
want it. Selecting only the variables you need and transforming them to the right
format becomes pretty easy with the tricks you see in the previous chapters.
Let’s prepare the data frame mtcars a bit using some simple tricks. First,
create a data frame cars like this:
> cars <- mtcars[c(1,2,9,10)]
> cars$gear <- ordered(cars$gear)
> cars$am <- factor(cars$am, labels=c('auto', 'manual'))
With this code, you do the following:
Select four variables from the data frame mtcars and save them in a
data frame called cars. Note that you use the index system for lists to select
the variables (see Chapter 7).
Make the variable gear in this data frame an ordered factor.
Give the variable am the label auto if its original value is 0 and the label
manual if its original value is 1, transforming the variable am
to a factor in the process.
In the conversion of am, you also could have used the ifelse() function, as in
factor(ifelse(cars$am, 'manual', 'auto')). Notice that the first argument of that
ifelse() statement isn't a logical expression: the original variable has 0 and 1
as values, and R reads a 0 as FALSE and everything else as TRUE. You can use
this property in your own code.
After running this code, you should have a dataset cars in your workspace
with the following structure:
'data.frame': 32 obs. of 4 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 ...
 $ cyl : num 6 6 4 6 8 ...
 $ am  : Factor w/ 2 levels "auto","manual": 2 2 2 1 1 ...
 $ gear: Ord.factor w/ 3 levels "3"<"4"<"5": 2 2 2 1 1 ...
With this dataset in your workspace, you're ready to tackle the rest of this
chapter.
In order to avoid too much clutter on the screen, we set the vec.len argument
of the str() function when creating the output. This argument
defines the default number of values that are displayed for each variable. If
you use str() with its default settings, your output may look a bit different
from the one shown here. See the Help page ?str for more information. Or just
forget about it; you'll never use it unless you start writing a book about R.
Describing Continuous Variables
You have the dataset and you’ve formatted it to fit your needs, so now you’re
ready for the real work. Analyzing your data always starts with describing it. This
way you can detect errors in the data, and you can decide which models are
appropriate to get the information you need from the data you have. Which
descriptive statistics you use depends on the nature of your data, of course. Let’s
first take a look at some things you want to do with continuous data.
Talking about the center of your data
Sometimes you’re more interested in the general picture of your data than you
are in the individual values. You may be interested not in the mileage of every car,
but in the average mileage of all cars from that dataset. For this, you calculate the
mean using the
function, like this:
You also could calculate the average number of cylinders those cars have, but
this doesn't really make sense. The average would be 6.1875 cylinders, and we
have yet to see a car driving with an incomplete cylinder. In this case, the median
— the most central value in your data — makes more sense. You get the median
by using the median() function, like this:
> median(cars$cyl)
[1] 6
There are numerous other reasons for calculating the median instead of the
mean, or even both together. Both statistics describe a different property of
your data, and even the combination can tell you something. If you don’t know
how to interpret these statistics, Statistics For Dummies, 2nd Edition, by
Deborah J. Rumsey, PhD (Wiley) is a great resource.
Describing the variation
A single number doesn’t tell you that much about your data. Often it’s at least
as important to have an idea about the spread of your data. You can look at this
spread using a number of different approaches.
First, you can calculate either the variance or the standard deviation to
summarize the spread in a single number. For that, you have the convenient
functions var() for the variance and sd() for the standard deviation. For example,
you calculate the standard deviation of the variable mpg in the data frame
cars like this:
> sd(cars$mpg)
[1] 6.026948
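As a small check on how these two statistics relate: the standard deviation is simply the square root of the variance, so sd() and sqrt(var()) always agree.

```r
# sd() returns the square root of var(); verify this on the mpg variable.
v <- var(mtcars$mpg)   # variance of the mileage
s <- sd(mtcars$mpg)    # standard deviation of the mileage
isTRUE(all.equal(s, sqrt(v)))   # TRUE
```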
Checking the quantiles
Next to the mean and variation, you also can take a look at the quantiles. A
quantile, or percentile, tells you how much of your data lies below a certain value.
The 50 percent quantile, for example, is nothing but the median. Again, R has
some convenient functions to help you with looking at the quantiles.
Calculating the range
The most-used quantiles are actually the 0 percent and 100 percent quantiles.
You could just as easily call them the minimum and maximum, because that’s what
they are. We introduce the min() and max() functions in Chapter 4. You can get
both together using the range() function. This function conveniently gives you the
range of the data. So, to know between which two values all the mileages are
situated, you simply do the following:
> range(cars$mpg)
[1] 10.4 33.9
Calculating the quartiles
The range still gives you only limited information. Often statisticians report the
first and the third quartile next to the range and the median. These quartiles are,
respectively, the 25 percent and 75 percent quantiles, which are the numbers for
which one-fourth and three-fourths of the data is smaller. You get these numbers
using the quantile() function, like this:
> quantile(cars$mpg)
0% 25% 50% 75% 100%
10.400 15.425 19.200 22.800 33.900
The quartiles are not the same as the lower and upper hinge calculated in
the five-number summary. The latter two are, respectively, the median of the
lower and upper half of your data, and they differ slightly from the first and
third quartiles. To get the five-number statistics, you use the fivenum() function.
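To see that difference for yourself, you can compare the output of fivenum() with the quartiles from quantile() on the mileage data; for mpg the lower hinge and the first quartile differ slightly:

```r
# fivenum() gives: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(mtcars$mpg)
quarts <- quantile(mtcars$mpg)
five[2]      # lower hinge: 15.35
quarts[2]    # first quartile: 15.425
```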
Getting on speed with the quantile function
The quantile() function can give you any quantile you want. For that, you use
the probs argument. You give the quantiles (or probabilities) as a fractional number.
For the 20 percent quantile, for example, you use 0.20 as the
value. This argument also takes a vector as a value, so you can, for example, get
the 5 percent and 95 percent quantiles like this:
> quantile(cars$mpg, probs=c(0.05, 0.95))
    5%    95%
11.995 31.300
The default value for the probs argument is a vector representing the
minimum (0), the first quartile (0.25), the median (0.5), the third quartile (0.75),
and the maximum (1).
All functions from the previous sections have an argument na.rm that allows
you to remove all NA values before calculating the respective statistic. If you
don't do this, any vector containing NA values will have NA as a result. This works
identically to the na.rm argument of the sum() function (see Chapter 4).
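A minimal illustration of the na.rm argument on a small vector with a missing value:

```r
# Without na.rm, a single NA makes the whole result NA.
x <- c(1, 2, NA, 4)
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 2.333333: the NA is dropped first
```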
A first step in every analysis consists of calculating the descriptive statistics for
your dataset. You have to get to know the data you received before you can
accurately decide what models you try out on them. You need to know something
about the range of the values in your data, how these values are distributed in the
range, and how values in different variables relate to each other. Much of what you
do and how you do it depends on the type of data.
Whenever you have a limited number of different values, you can get a quick
summary of the data by calculating a frequency table. A frequency table is a table
that represents the number of occurrences of every unique value in the variable. In
R, you use the table() function for that.
Creating a table
You can tabulate, for example, the number of cars with a manual and an
automatic gearbox using the following commands:
> amtable <- table(cars$am)
> amtable

  auto manual
    19     13
This outcome tells you that, in your data, there are 19 cars with an automatic
gearbox and 13 with a manual gearbox.
Working with tables
As with most functions, you can save the output of table() in a new object (in
this case, called amtable). At first sight, the output of table() looks like a named
vector, but is it?
The table() function generates an object of the class table. These objects
have the same structure as an array. Arrays can have an arbitrary number of
dimensions and dimension names (see Chapter 7). Tables can be treated as
arrays to select values or dimension names.
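For example, you can index a one-dimensional table by name or by position, just like an array (the counts below assume the gearbox factor built earlier in this chapter):

```r
# Rebuild the gearbox table and treat it as an array.
am <- factor(mtcars$am, labels = c('auto', 'manual'))
tbl <- table(am)
tbl['auto']    # select a count by dimension name: 19
names(tbl)     # the dimension names: "auto" "manual"
length(tbl)    # a one-dimensional array of length 2
```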
In the “Describing Multiple Variables” section, later in this chapter, you use
multidimensional tables and calculate margins and proportions based on those
tables.
After you have the table with the counts, you can easily calculate the
proportion of each count to the total simply by dividing the table by the total
counts. To calculate the proportion of manual and automatic gearboxes in the
dataset cars, you can use the following code:
> amtable / sum(amtable)

   auto  manual
0.59375 0.40625
Yet, R also provides the prop.table() function to do the same. You can get
the exact same result as the previous line of code by doing the following:
> prop.table(amtable)
You may wonder why you would use an extra function for something that’s as
easy as dividing by the sum. The prop.table() function also can calculate marginal
proportions (see the “Describing Multiple Variables” section, later in this chapter).
Finding the center
In statistics, the mode of a categorical variable is the value that occurs most
frequently. It isn't exactly the center of your data, but if there's no order in your
data — if you look at a nominal variable — you can't really talk about a center
at all.
Although there isn’t a specific function to calculate the mode, you can get it by
combining a few tricks:
1. To get the counts for each value, use table().
2. To find the location of the maximum number of counts, use a
comparison with max().
3. To find the mode of your variable, select the name corresponding
with the location in Step 2 from the table in Step 1.
So, to find the mode for the variable am in the dataset cars, you can use the
following code:
> id <- amtable == max(amtable)
> names(amtable)[id]
[1] "auto"
The object id contains a logical vector that has the value TRUE for every
value in the table amtable that is equal to the maximum in that table. You select
the name from the values in amtable using this logical vector as an index.
You also can use the which.max() function to find the location of the
maximum in a vector. This function has one important disadvantage, though:
If you have multiple maximums, which.max() will return the position of the
first maximum only. If you're interested in all maximums, you should use the
construct in the previous example.
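You can wrap the three steps in a small helper function; the name modes() here is made up for this sketch, not a built-in R function. Because it keeps every value whose count reaches the maximum, it also handles ties:

```r
# Return all modes of a vector: every value whose count
# equals the maximum count in the frequency table.
modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}
modes(mtcars$cyl)                   # "8": 14 of the 32 cars have 8 cylinders
modes(c('a', 'a', 'b', 'b', 'c'))   # both "a" and "b" are modes
```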
Sometimes the information about the center of the data just isn’t enough. You
get some information about your data from the variance or the quantiles, but still
you may miss important features of your data. Instead of calculating yet more
numbers, R offers you some graphical tools to inspect your data further. And in the
meantime, you can impress people with some fancy plots.
To get a clearer idea about how your data is distributed within the range, you
can plot a histogram. In Chapter 16, you fancy up your plots, but for now let’s just
check the most-used tool for describing your data graphically.
Making the plot
To make a histogram for the mileage data, you simply use the hist()
function, like this:
> hist(cars$mpg, col='grey')
The result of this function is shown on the left of Figure 14-1. There you see
that the hist() function first cuts the range of the data in a number of even
intervals, and then counts the number of observations in each interval. The
height of the bars is proportional to those frequencies. On the y-axis, you find the counts.
With the argument col, you give the bars in the histogram a bit of color. In
Chapter 16, we give you some more tricks for customizing the histogram (for
example, by adding a title).
Playing with breaks
R chooses the number of intervals it considers most useful to represent the
data, but you can disagree with what R does and choose the breaks yourself. For
this, you use the breaks argument of the hist() function.
Figure 14-1: Creating a histogram for your data.
You can specify the breaks in a couple different ways:
You can tell R the number of bars you want in the histogram by giving
a single number as the value for the breaks argument. Just keep in mind that R
will still decide whether that's actually reasonable, and it tries to cut up the
range using nice rounded numbers.
You can tell R exactly where to put the breaks by giving a vector with
the break points as a value to the breaks argument.
So, if you don’t agree with R and you want to have bars representing the
intervals 5 to 15, 15 to 25, and 25 to 35, you can do this with the following code:
> hist(cars$mpg, breaks=c(5,15,25,35))
The resulting plot is on the right side of Figure 14-1.
You also can give the name of the algorithm R has to use to determine the
number of breaks as the value for the breaks argument. You can find more
information on those algorithms on the Help page ?hist. Try to experiment
with those algorithms a bit to check which one works best.
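One way to experiment without drawing a plot each time is to call hist() with plot=FALSE, which returns the computed breaks and counts so you can compare algorithms:

```r
# Compute histogram breaks with the Freedman-Diaconis rule ('FD'),
# without drawing the plot.
h <- hist(mtcars$mpg, breaks = 'FD', plot = FALSE)
h$breaks                                   # the interval boundaries R chose
length(h$counts) == length(h$breaks) - 1   # one count per interval: TRUE
```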
Using frequencies or densities
By breaking up your data in intervals, you still lose some information, albeit a
lot less than when just looking at the descriptives you calculate in the previous
sections. Still, the most complete way of describing your data is by estimating the
probability density function (PDF) or density of your variable.