one line. The capitalized words in that string are the SQL com-
mands. It is beyond the scope of this chapter to give an SQL tuto-
rial, but, briefly, SELECT chooses a subset of the table, and the
fields named after SELECT are the ones that will appear in the result.
The FROM command chooses the table(s) where the data should
come from. The WHERE command specifies a condition, in this
case that we only wanted rows where the July 2011 population was
less than one million. SQL is a powerful and ﬂexible language and
this just scratches the surface.
In this case we did not assign the results of dbGetQuery() to an-
other data object, so the results were just echoed to the R console.
But it would be easy to assign the results to a dataframe and then
use that dataframe for subsequent calculations or visualizations.
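For example, a query along these lines stores its result in a dataframe that we can then summarize, plot, or merge with other data. (This is just a sketch: con stands for the connection object returned by dbConnect(), and the table and column names are placeholders that would need to match the table we created earlier.)
> # connection object, table, and column names are hypothetical placeholders
> smallTowns <- dbGetQuery(con,
+     "SELECT name, july11pop FROM census WHERE july11pop < 1000000")
> str(smallTowns)   # the result is an ordinary dataframe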
To emphasize a point made above, the normal motivation for ac-
cessing data through MySQL or another database system is that a
large database exists on a remote server. Rather than having our
own complete copy of those data, we can use dbConnect(),
dbGetQuery() and other database functions to access the remote
data through SQL. We can also use SQL to specify subsets of the
data, to preprocess the data with sorts and other operations, and to
create summaries of the data. SQL is also particularly well suited
to “joining” data from multiple tables to make new combinations.
In the present example we used only one table, it was a very small
table, and we had created it ourselves in R from an Excel source, so
none of these motivations for using MySQL really applied; this was
only a demonstration.
The next step beyond remote databases is toward distributed com-
puting across a “cluster” of computers. This combines the remote
access to data that we just demonstrated with additional computa-
tional capabilities. At this writing, one of the most popular systems
for large scale distributed storage and computing is “Hadoop”
(named after the toy elephant of the young son of the developer).
Hadoop is not a single program, but rather a collection of pieces of
software that together form a framework or library. Hadoop is
developed and maintained by the Apache Software Foundation, the
same organization that maintains the well-known Apache open
source web server. There are about a dozen different parts of the
Hadoop framework, but the Hadoop Distributed File System (HDFS)
and the Hadoop MapReduce framework are two of the most important.
HDFS is easy to explain. Imagine your computer and several other
computers at your home or workplace. If we could get them all to
work together, we could call them a “cluster” and we could theo-
retically get more use out of them by taking advantage of all of the
storage and computing power they have as a group. Running
HDFS, we can treat this cluster of computers as one big hard drive.
If we have a really large ﬁle - too big to ﬁt on any one of the com-
puters - HDFS can divide up the ﬁle and store its different parts in
different storage areas without us having to worry about the de-
tails. With a proper conﬁguration of computer hardware, such as
an IT department could supply, HDFS can provide an enormous
amount of “throughput” (i.e., a very fast capability for reading and
writing data) as well as redundancy and failure tolerance.
MapReduce is a bit more complicated, but it follows the same logic
of trying to divide up work across multiple computers. The term
MapReduce is used because there are two big processes involved:
map and reduce. For the map operation, a big job is broken up into
lots of separate parts. For example, if we wanted to create a search
index for all of the ﬁles on a company’s intranet servers, we could
break up the whole indexing task into a bunch of separate jobs.
Each job might take care of indexing the ﬁles on one server.
In the end, though, we don’t want dozens or hundreds of different
search indices. We want one big one that covers all of the ﬁles our
company owns. This is where the reduce operation comes in. As all
of the individual indexing jobs ﬁnish up, a reduce operation com-
bines them into one big job. This combining process works on the
basis of a so-called “key.” In the search indexing example, some of
the small jobs might have found files that contained the word
“fish.” As each small job finishes, it reports whether or not fish
appeared in a document and perhaps how many times fish ap-
peared. The reduce operation uses fish as a key to match up the re-
sults from all of the different jobs, thus creating an aggregated sum-
mary listing all of the documents that contained ﬁsh. Later, if any-
one searched on the word ﬁsh, this list could be used to direct
them to documents that contained the word.
In short, “map” takes a process that the user speciﬁes and an indi-
cation of which data it applies to, and divides the processing into
as many separate chunks as possible. As the results of each chunk
become available, “reduce” combines them and eventually creates
and returns one aggregated result.
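Although Hadoop works at a vastly larger scale, base R includes Map() and Reduce() functions that can illustrate the same pattern in miniature. In this small sketch, three made-up “documents” are mapped to per-document counts of the word fish, and the reduce step combines those counts into one total:
> # three tiny made-up "documents"
> docs <- list(doc1 = "one fish two fish",
+              doc2 = "red fish blue fish",
+              doc3 = "no match here")
> # map step: count occurrences of "fish" in each document separately
> counts <- Map(function(d) sum(strsplit(d, " ")[[1]] == "fish"), docs)
> # reduce step: combine the per-document counts into a single total
> Reduce(`+`, counts)
[1] 4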
Recently, an organization called RevolutionAnalytics has devel-
oped an R interface or “wrapper” for Hadoop that they call RHa-
doop. This package is still a work in progress in the sense that it
does not appear in the standard CRAN package archive, not be-
cause there is anything wrong with it, but rather because Revolu-
tionAnalytics wants to continue to develop it without having to
provide stable versions for the R community. There is a nice tutorial provided by the developers that walks through the basics of using RHadoop.
We will break open the ﬁrst example presented in the tutorial just
to provide further illustration of the use of MapReduce. As with
our MySQL example, this is a rather trivial activity that would not
normally require the use of a large cluster of computers, but it does
show how MapReduce can be put to use.
The tutorial example ﬁrst demonstrates how a repetitive operation
is accomplished in R without the use of MapReduce. In prior chap-
ters we have used several variations of the apply() function. The
lapply() or list-apply is one of the simplest. You provide an input
vector of values and a function to apply to each element, and the
lapply() function does the heavy lifting. The example in the RHa-
doop tutorial squares each integer from one to 10. This ﬁrst com-
mand ﬁlls a vector with the input data:
> small.ints <- 1:10
> small.ints
 [1]  1  2  3  4  5  6  7  8  9 10
Next we can apply the “squaring function” (basically just using the
^ operator) to each element of the list:
> out <- lapply(small.ints, function(x) x^2)
... (shows all of the values up to [[10]] 100)
In the second command above, we used lapply() to apply a
function to the input vector small.ints. We defined the func-
tion as taking the value x and returning the value x^2. The result is
a list of ten vectors (each with just one element) containing the
squares of the input values. Because this is such a small problem, R
was able to accomplish it in a tiny fraction of a second.
After installing both Hadoop and RHadoop - which, again, is not
an ofﬁcial package, and therefore has to be installed manually - we
can perform this same operation with two commands:
> small.ints <- to.dfs(1:10)
> out <- mapreduce(input = small.ints,
+     map = function(k, v) keyval(v, v^2))
In the ﬁrst command, we again create a list of integers from one to
ten. But rather than simply storing them in a vector, we are using
the “distributed ﬁle system” or dfs class that is provided by RHa-
doop. Note that in most cases we would not need to create this our-
selves because our large dataset would already exist on the HDFS
(Hadoop Distributed File System). We would have connected to
HDFS and selected the necessary data much as we did earlier in
this chapter with dbConnect().
In the second command, we are doing essentially the same thing as
we did with lapply(). We provide the input data structure (which,
again is a dfs class data object, a kind of pointer to the data stored
by Hadoop in the cluster). We also provide a “map function”
which is the process that we want to apply to each element in our
data set. Notice that the function takes two arguments, k and v.
The k refers to the “key” that we mentioned earlier in the chapter.
We actually don’t need the key in this example because we are not
supplying a reduce function. There is, in fact, no aggregation or
combining activity that needs to occur because our input list (the
integers) and the output list (the squares of those integers) are lists
of the same size. If we had needed to aggregate the results of the
map function, say by creating a mean or a sum, we would have
had to provide a “reduce function” that would do the job.
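To make that concrete, here is a hedged sketch of such a job: summing the ten squares. Every map output is assigned the same key so that the reduce function receives all of the values together and can add them up. The keyval() helper is the same one used above; the exact argument names may differ between versions of RHadoop.
> out <- mapreduce(input = small.ints,
+     map    = function(k, v) keyval("total", v^2),
+     reduce = function(k, vv) keyval(k, sum(vv)))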
The keyval() function, for which there is no documentation at this
writing, is characterized as a “helper” function in the tutorial. In
this case it is clear that the first argument to keyval(), v, is the
integer to which the process is applied, and the second argument,
v^2, is the resulting value: the square of that integer. The data
returned by mapreduce() is functionally equivalent
to that returned by lapply(), i.e., a list of the squares of the integers
from 1 to 10.
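One practical note: in many versions of RHadoop the object returned by mapreduce() is a reference to results that still live on the distributed file system. The companion from.dfs() function pulls those results back into the R session as an ordinary object (a sketch; the exact structure of what comes back varies across versions):
> results <- from.dfs(out)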
Obviously there is no point in harnessing the power of a cluster of
computers to calculate something that could be done with pencil
and paper in a few seconds. If, however, the operation were more
complex and the list of input data had millions of elements, the use
of lapply() would be impractical, as it would take your computer
quite a long time to ﬁnish the job. On the other hand, the second
strategy of using mapreduce() could run the job in a fraction of a
second, given a sufﬁcient supply of computers and storage.
On a related note, Amazon, the giant online retailer, provides vir-
tual computer clusters that can be used for exactly this kind of
work. Amazon’s product is called the Elastic Compute Cloud or
EC2, and at this writing it is possible to create a small cluster of
Linux computers for as little as eight cents per hour.
To summarize this chapter, although there are many analytical
problems that require only a small amount of data, the wide avail-
ability of larger data sets has added new challenges to data science.
As a single user program running on a local computer, R is well
suited for work by a single analyst on a data set that is small
enough to ﬁt into the computer’s memory. We can retrieve these
small datasets from individual files stored in human-readable (e.g.,
CSV) or binary (e.g., XLS) formats.
To be able to tackle the larger data sets, however, we need to be
able to connect R with either remote databases or remote computa-
tional resources or both. A variety of packages is available to con-
nect R to mature database technologies such as MySQL. In fact, we
demonstrated the use of MySQL by installing it on a local machine
and then using the RMySQL package to create a table and query it.
The more cutting edge technology of Hadoop is just becoming
available for R users. This technology, which provides the potential
for both massive storage and parallel computational power, prom-
ises to make very large datasets available for processing and analy-
sis in R.
Hadoop is an Apache software framework that is first and foremost
a Linux server application. Yet there are development versions of
Hadoop available for Windows and Mac as well. These are what are
called single-node instances; that is, they use a single computer to
simulate the existence of a large cluster of computers. See if you can
install the appropriate version of
Hadoop for your computer’s operating system.
As a bonus activity, if you are successful in installing Hadoop, then
get a copy of the RHadoop package from RevolutionAnalytics and
install that. If you are successful with both, you should be able to
run the MapReduce code presented in this chapter.
R Functions Used in this Chapter
as.numeric() - Convert another data type to a number
dbConnect() - Connect to an SQL database
dbGetQuery() - Run an SQL query and return the results
dbListTables() - Show the tables available in a connection
dbWriteTable() - Send a data table to an SQL system
install.packages() - Get the code for an R package
lapply() - Apply a function to elements of a list
library() - Make an R package available for use
Numberize() - A custom function created in this chapter
read.xls() - Import data from an Excel file; part of the gdata package
return() - Used in functions to designate the data returned by the function
str_replace() - Replace the first instance of a character string with another
str_replace_all() - Replace all instances of a character string with another
Much of what we have accomplished so far has focused on the standard rectangular dataset: one neat
table with rows and columns well deﬁned. Yet much of the power of data science comes from
bringing together different sources of data in complementary ways. In this chapter we combine
different sources of data to make a unique product that transcends any one source.
Mashup is a term that originated in the music business decades
ago related to the practice of overlaying one music recording on
top of another one. The term has entered general usage to mean
anything that brings together disparate inﬂuences or elements. In
the application development area, mashup often refers to bringing
together various sources of data to create a new product with
unique value. There’s even a non-proﬁt group called the Open
Mashup Alliance that develops standards and methods for creating mashups.
One example of a mashup is http://www.housingmaps.com/, a
web application that grabs apartment rental listings from the classi-
ﬁed advertising service Craigslist and plots them on an interactive
map that shows the location of each listing. If you have ever used
Craigslist you know that it provides a very basic text-based inter-
face and that the capability to find listings on a map would be welcome.
In this chapter we tackle a similar problem. Using some address
data from government records, we call the Google geocoding API
over the web to ﬁnd the latitude and longitude of each address.
Then we plot these latitudes and longitudes on a map of the U.S.
This activity reuses skills we learned in the previous chapter for
reading in data ﬁles, adds some new skills related to calling web
APIs, and introduces us to a new type of data, namely the shape-
ﬁles that provide the basis for electronic maps.
Let’s look at the new stuff ﬁrst. The Internet is full of shapeﬁles
that contain mapping data. Shapeﬁles are a partially proprietary,
partially open format supported by a California software company
called ESRI. Shapeﬁle is actually an umbrella term that covers sev-
eral different ﬁle types. Because the R community has provided
some packages to help deal with shapefiles, we don't need much
information about the details. The most important thing to know is
that shapefiles contain points, polygons, and “polylines.” Everyone
knows what a point and a polygon are, but a polyline is a term
used by computer scientists to refer to a multi-segment line. In
many graphics applications it is much easier to approximate a
curved line with many tiny connected straight lines than it is to
draw a truly curved line. If you think of a road or a river on a map,
you will have a good idea of a polyline.
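As a quick illustration of the idea (unrelated to shapefiles themselves), a handful of straight segments drawn in base R already looks like a smooth curve:
> # a 20-segment polyline that approximates one period of a sine curve
> x <- seq(0, 2 * pi, length.out = 21)
> plot(x, sin(x), type = "l")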
The U.S. Census Bureau publishes shapefiles at various levels of de-
tail for every part of the country. Search for the term “shapeﬁle” at
“site:census.gov” and you will ﬁnd several pages with listings of
different shapeﬁles. For this exercise, we are using a relatively low
detail map of the whole U.S. We downloaded a “zip” ﬁle. Unzip
this (usually just by double-clicking on it) and you will ﬁnd several
ﬁles inside it. The ﬁle ending in “shp” is the main shapeﬁle. An-
other ﬁle that will be useful to us ends in “dbf” - this contains la-
bels and other information.
To get started, we will need two new R packages called PBSmap-
ping and maptools. PBSmapping refers not to public broadcasting,
but rather to the Pacific Biological Station, whose researchers and
technologists kindly bundled up a wide range of the R processing
tools that they use to manage map data. The maptools package
was developed by Nicholas J. Lewin-Koh (University of Singapore)
and others to provide additional tools and some “wrappers” to
make PBSmapping functions easier to use. In this chapter we only
scratch the surface of the available tools: there could be a whole
book just dedicated to R mapping functions alone.
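Both packages can be installed from CRAN and then loaded in the usual way:
> install.packages("PBSmapping")
> install.packages("maptools")
> library(PBSmapping)
> library(maptools)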
Before we read in the data we grabbed from the Census Bureau,
let’s set the working directory in R-Studio so that we don’t have to
type it out on the command line. Click on the Tools menu and then
choose “Set Working Directory.” Use the dialog box to designate
the folder where you have unzipped the shape data. After that,
these commands will load the shape data into R and show us what it contains. (The file name passed to importShapefile() is a placeholder here; use the name of the .shp file that you unzipped.)
> usShape <- importShapefile("census_states.shp")  # placeholder file name
> summary(usShape)
Records : 90696
Contours : 574
Holes : 0
Events : NA
On boundary : NA
X : [-179.14734, 179.77847]
Y : [17.884813, 71.3525606439998]
Projection : LL
Zone : NULL
Extra columns :
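With the PolySet loaded, a single call to PBSmapping's plotPolys() function draws every shape. (This is a minimal sketch; the original command may have included additional arguments.)
> plotPolys(usShape)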
This last command gives us a simple plot of the 90,696 shapes that
our shapefile contains.
This is funny looking! The ranges output by the summary() com-
mand give us a hint as to why. The longitudes of the elements in
our map range from -179 to nearly 180: this covers pretty much the
whole of the planet. The reason is that the map contains shapes for
Hawaii and Alaska. Both states have far western longitudes, but
the Aleutian peninsula in Alaska extends so far that it crosses over
the longitude line where -180 and 180 meet in the Paciﬁc Ocean. As
a result, the continental U.S. is super squashed. We can specify a
more limited area of the map to consider by using the xlim and
ylim parameters. The following command:
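> # the xlim/ylim values here are approximate guesses that frame the
> # lower 48 states; the exact numbers are not critical
> plotPolys(usShape, xlim = c(-130, -60), ylim = c(23, 50))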
...gives a plot that shows the continental U.S. in its more typical proportions.
So now we have some map data stored away and ready to use. The
PBSmapping package gives us the capability of adding points to an
existing map. For now, let’s demonstrate this with a made up point
somewhere in Texas:
> X <- -100
> Y <- 30
> EID <- 1
> pointData <- data.frame(EID,X,Y)
> eventData <- as.EventData(pointData, projection = NA)
> addPoints(eventData, col = "red")
You have to look carefully, but in southern Texas there is now a lit-
tle red dot. We began by manually creating a single point - speci-
ﬁed by X (longitude), Y (latitude), and EID (an identiﬁcation num-
ber) - and sticking it into a dataframe. Then we converted the data
in that dataframe into an EventData object. This is a custom class
of object speciﬁed by the PBSmapping package. The ﬁnal com-
mand above adds the EventData to the plot.
The idea of EventData is a little confusing, but if you remember
that this package was developed by biologists at the Pacific Biological
Station to map sightings of fish and other natural phenomena,
it makes more sense. In their lingo, an event was some observation
of interest that occurred at a particular day, time, and location. The
“event id” or EID <- 1 that we stuck in the data frame was just say-
ing that this was the ﬁrst point in our list that we wanted to plot.
For us it is not an event so much as a location of something we
wanted to see on the map.
Also note that the “projection=NA” parameter in the
as.EventData() coercion is just letting the mapping software know
that we don’t want our point to be transformed according to a map-
ping projection. If you remember from your Geography class, a pro-
jection is a mapping technique to make a curved object like the
Earth seem sensible on a ﬂat map. In this example, we’ve already
ﬂattened out the U.S., so there is no need to transform the points.
Next, we need a source of points to add to our map. This could be
anything that we’re interested in: the locations of restaurants,
crime scenes, colleges, etc. In Google, a search for filetype:xls or
filetype:csv with appropriate additional search terms can provide in-
teresting data sources. You may also have mailing lists of custom-
ers or clients. The most important thing is that we will need street
address, city, and state in order to geocode the addresses. For this
example, we searched for “housing street address list ﬁletype:csv”
and this turned up a data set of small businesses that have con-
tracts with the U.S. Department of Health and Human Services.
Let’s read this in using read.csv():
> dhhsAddrs <- read.csv("DHHS_Contracts.csv")
> str(dhhsAddrs)
'data.frame':   599 obs. of 10 variables:
 $ Contract.Number           : Factor w/ 285 levels "26301D0054","500000049",..: 125 125 125 279 164 247 19 242 275 70 ...
 $ Contractor.Name           : Factor w/ 245 levels "2020 COMPANY LIMITED LIABILITY COMPANY",..: 1 1 1 2 2 3 4 6 5 7 ...
 $ Contractor.Address        : Factor w/ 244 levels "1 CHASE SQUARE 10TH FLR, ROCHESTER, NY ",..: 116 116 116 117 117 136 230 194 64 164 ...
 $ Description.Of.Requirement: Factor w/ 468 levels "9TH SOW - DEFINTIZE THE LETTER CONTRACT",..: 55 55 55 292 172 354 308 157 221 340 ...
 $ Dollars.Obligated         : Factor w/ 586 levels " $1,000,000.00 ",..: 342 561 335 314 294 2 250 275 421 21 ...
 $ NAICS.Code                : Factor w/ 55 levels "323119","334310",..: 26 26 26 25 10 38 33 29 27 35 ...
 $ Ultimate.Completion.Date  : Factor w/ 206 levels "1-Aug-2011","1-Feb-2013",..: 149 149 149 10 175 161 124 37 150 91 ...