81
Data Science with R
Hands-On
Text Mining
14.2 Letter Frequency
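[Figure: horizontal bar chart of letter proportions. Vertical axis: Letter, ordered by increasing frequency (Z, J, Q, X, W, Y, K, V, F, B, G, H, P, D, M, C, U, L, S, N, O, T, R, A, I, E). Horizontal axis: Proportion, 0% to 12%.]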
Next we review the frequency of letters across all of the words in the discourse. Some
data preparation transforms the vector of words into a list of letters, for which we then
construct a frequency count and pass it on to be plotted.
We again use a pipeline to string together the operations on the data. Starting from the
vector of words stored in words, we split the words into characters using str_split() from
stringr (Wickham, 2015), removing the first string (an empty string) from each of the
results (using sapply()). Reducing the result to a simple vector, using unlist(), we then
generate a data frame recording the letter frequencies, using dist_tab() from qdap. We can
then plot the letter proportions.
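library(dplyr)
library(stringr)
library(qdap)     # provides dist_tab()
library(ggplot2)  # provides ggplot(), geom_bar(), coord_flip(), ...

words %>%
  str_split("") %>%                    # split each word into characters
  sapply(function(x) x[-1]) %>%        # drop the leading empty string
  unlist %>%                           # flatten to one vector of letters
  dist_tab %>%                         # frequency table: interval, freq, percent
  mutate(Letter=factor(toupper(interval),
                       levels=toupper(interval[order(freq)]))) %>%
  ggplot(aes(Letter, weight=percent)) +
  geom_bar() +
  coord_flip() +
  labs(y="Proportion") +
  scale_y_continuous(breaks=seq(0, 12, 2),
                     label=function(x) paste0(x, "%"),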
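
For comparison, the counting step of the pipeline can be sketched in base R alone, without stringr or qdap. This is only a minimal illustration under stated assumptions: sample_words is a hypothetical stand-in for the words vector, and the data frame columns are named freq and percent by choice here, to mirror the columns used in the pipeline above. Base strsplit() does not return a leading empty string, so no element needs to be dropped from each result.

```r
# Hypothetical sample data (an assumption; not the words vector from the text).
sample_words <- c("data", "science", "text", "mining")

# Split each word into single characters and flatten into one vector.
letters_vec <- unlist(strsplit(sample_words, ""))

# Count each letter, then express the counts as percentages of all letters.
letter_counts <- table(toupper(letters_vec))
letter_pct    <- 100 * prop.table(letter_counts)

# Assemble a data frame of letter frequencies.
letter_freq <- data.frame(
  Letter  = names(letter_counts),
  freq    = as.vector(letter_counts),
  percent = as.vector(letter_pct),
  stringsAsFactors = FALSE
)

# Order the factor levels by increasing frequency, so that a flipped
# bar chart would place the most frequent letter at the top.
letter_freq$Letter <- factor(letter_freq$Letter,
                             levels = letter_freq$Letter[order(letter_freq$freq)])

# Show the table with the most frequent letters first.
print(letter_freq[order(-letter_freq$freq), ])
```

The resulting letter_freq data frame could be passed to the same ggplot() call as in the pipeline, since it carries the Letter and percent columns that the plot relies on.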
Copyright 2013-2015 Graham@togaware.com
Generated 2016-01-10 10:00:58+11:00