Hands-On Data Science with R
Text Mining
Graham.Williams@togaware.com
10th January 2016
Visit http://HandsOnDataScience.com/ for more Chapters.
Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data,
like social media, books, newspapers, emails, etc. The goal can be considered to be similar to
humans learning by reading such material. However, using automated algorithms we can learn
from massive amounts of text, very much more than a human can. The material could consist of
millions of newspaper articles to perhaps summarise the main themes and to identify those that
are of most interest to particular people. Or we might be monitoring Twitter feeds to identify
emerging topics that we might need to act upon, as they emerge.
The required packages for this chapter include:
library(tm)               # Framework for text mining.
library(qdap)             # Quantitative discourse analysis of transcripts.
library(qdapDictionaries)
library(dplyr)            # Data wrangling, pipe operator %>%().
library(RColorBrewer)     # Generate palette of colours for plots.
library(ggplot2)          # Plot word frequencies.
library(scales)           # Include commas in numbers.
library(Rgraphviz)        # Correlation plots.
As we work through this chapter, new R commands will be introduced. Be sure to review the
command's documentation and understand what the command does. You can ask for help using
the ? command as in:
?read.csv
We can obtain documentation on a particular package using the help= option of library():
library(help=rattle)
This chapter is intended to be hands on. To learn effectively, you are encouraged to have R
running (e.g., RStudio) and to run all the commands as they appear here. Check that you get
the same output, and you understand the output. Try some variations. Explore.
Copyright 2013-2015 Graham Williams. You can freely copy, distribute,
or adapt this material, as long as the attribution is retained and derivative
work is provided under the same license.
1 Getting Started: The Corpus
The primary package for text mining, tm (Feinerer and Hornik, 2015), provides a framework
within which we perform our text mining. A collection of other standard R packages add value
to the data processing and visualizations for text mining.
The basic concept is that of a corpus. This is a collection of texts, usually stored electronically,
and from which we perform our analysis. A corpus might be a collection of news articles from
Reuters or the published works of Shakespeare. Within each corpus we will have separate
documents, which might be articles, stories, or book volumes. Each document is treated as a
separate entity or record.
Documents which we wish to analyse come in many different formats. Quite a few formats are
supported by tm (Feinerer and Hornik, 2015), the package we will illustrate text mining with in
this module. The supported formats include text, PDF, Microsoft Word, and XML.
A number of open source tools are also available to convert most document formats to text files.
For our corpus used initially in this module, a collection of PDF documents were converted to text
using pdftotext from the xpdf application which is available for GNU/Linux and MS/Windows
and others. On GNU/Linux we can convert a folder of PDF documents to text with:
system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done")
The -enc ASCII7 ensures the text is converted to ASCII since otherwise we may end up with
binary characters in our text documents.
We can also convert Word documents to text using antiword, which is another application
available for GNU/Linux.
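# antiword writes to standard output, so redirect each result to a .txt file.
system("for f in *.doc; do antiword \"$f\" > \"${f%.doc}.txt\"; done")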
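The same kind of batch conversion can also be driven from R itself rather than from a shell loop. The following is a minimal sketch and not part of the original module: the function name pdf_to_text() is hypothetical, and it assumes the pdftotext executable is on the PATH. It uses shQuote() so that file names containing spaces are passed through safely.

```r
# Sketch: convert every PDF in a folder to text from within R.
# Assumption: the pdftotext executable is installed and on the PATH.
pdf_to_text <- function(dir = ".") {
  pdfs <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  if (Sys.which("pdftotext") == "") {
    message("pdftotext not found; nothing converted")
    return(invisible(character(0)))
  }
  for (f in pdfs) {
    # shQuote() protects file names that contain spaces.
    system2("pdftotext", c("-enc", "ASCII7", "-nopgbrk", shQuote(f)))
  }
  # pdftotext writes each result next to its PDF with a .txt extension.
  invisible(sub("\\.pdf$", ".txt", pdfs))
}
```

Calling pdf_to_text() on a folder of PDF documents would then leave a .txt file alongside each PDF, ready to be used as a corpus.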
Copyright 2013-2015 Graham@togaware.com
Module: TextMiningO
Page: 1 of 46
Draft Only
Generated 2016-01-10 10:00:58+11:00
1.1 Corpus Sources and Readers
There are a variety of sources supported by tm. We can use getSources() to list them.
getSources()
## [1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"
## [5] "XMLSource"       "ZipSource"
In addition to different kinds of sources of documents, our documents for text analysis will come
in many different formats. A variety are supported by tm:
getReaders()
## [1] "readDOC"                 "readPDF"
## [3] "readPlain"               "readRCV1"
## [5] "readRCV1asPlain"         "readReut21578XML"
## [7] "readReut21578XMLasPlain" "readTabular"
## [9] "readTagged"              "readXML"
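To see how a source feeds a corpus without needing any files on disk, the sketch below builds a tiny corpus from a character vector using VectorSource(), one of the sources listed by getSources(). The two sentences and the variable names texts and mini are made up for illustration.

```r
library(tm)

# Two made-up documents held in a character vector.
texts <- c("Text mining learns from collections of text.",
           "A corpus is a collection of documents.")

# VectorSource() treats each element of the vector as one document.
mini <- VCorpus(VectorSource(texts))

length(mini)              # Number of documents in the corpus.
as.character(mini[[2]])   # Text of the second document.
```

The other sources work the same way: the source describes where the documents come from and the corpus constructor loads them.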
1.2 Text Documents
We load a sample corpus of text documents. Our corpus consists of a collection of research
papers all stored in the folder we identify below. To work along with us in this module, you
can create your own folder called corpus/txt and place into that folder a collection of text
documents. It does not need to be as many as we use here but a reasonable number makes it
more interesting.
cname <- file.path(".", "corpus", "txt")
cname
## [1] "./corpus/txt"
We can list some of the file names.
length(dir(cname))
## [1] 46
dir(cname)
## [1] "acnn96.txt"
## [2] "adm02.txt"
## [3] "ai02.txt"
## [4] "ai03.txt"
## [5] "ai97.txt"
## [6] "atobmars.txt"
....
There are 46 documents in this particular corpus.
After loading the tm (Feinerer and Hornik, 2015) package into the R library we are ready to load
the files from the directory as the source of the files making up the corpus, using DirSource().
The source object is passed on to Corpus() which loads the documents. We save the resulting
collection of documents in memory, stored in a variable called docs.
library(tm)
docs <- Corpus(DirSource(cname))
docs
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 46
class(docs)
## [1] "VCorpus" "Corpus"
class(docs[[1]])
## [1] "PlainTextDocument" "TextDocument"
summary(docs)
##              Length Class             Mode
## acnn96.txt   2      PlainTextDocument list
## adm02.txt    2      PlainTextDocument list
## ai02.txt     2      PlainTextDocument list
## ai03.txt     2      PlainTextDocument list
## ai97.txt     2      PlainTextDocument list
....
1.3 PDF Documents
If instead of text documents we have a corpus of PDF documents then we can use the readPDF()
reader function to convert PDF into text and have that loaded as our Corpus.
docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
This will use, by default, the pdftotext command from xpdf to convert the PDF into text
format. The xpdf application needs to be installed for readPDF() to work.
1.4 Word Documents
A simple open source tool to convert Microsoft Word documents into text is antiword. The
separate antiword application needs to be installed, but once it is available it is used by tm to
convert Word documents into text for loading into R.
To load a corpus of Word documents we use the readDOC() reader function:
docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC))
Once we have loaded our corpus the remainder of the processing of the corpus within R is then
as follows.
The antiword program takes some useful command line arguments. We can pass these through
to the program from readDOC() by specifying them as the character string argument:
docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC(AntiwordOptions="-r -s")))
Here, -r requests that removed text be included in the output, and -s requests that text hidden
by Word be included.
2 Exploring the Corpus
We can (and should) inspect the documents using inspect(). This will assure us that data has
been loaded properly and as we expect.
inspect(docs[16])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 44776
library(magrittr)  # Provides extract2().
viewDocs <- function(d, n) {d %>% extract2(n) %>% as.character() %>% writeLines()}
viewDocs(docs, 16)
## Hybrid weighted random forests for
## classifying very high-dimensional data
## Baoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 and
## Yunming Ye1
## 1
##
....
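The same effect can be had without the pipe. As a sketch, using a small made-up corpus (the names mini and view_doc are ours and are not part of tm), the function below extracts one document with [[, converts it to a character vector, and prints it:

```r
library(tm)

# A small made-up corpus; the first document spans two lines.
mini <- VCorpus(VectorSource(c("First line\nSecond line", "Another document")))

# Pipe-free equivalent of viewDocs().
view_doc <- function(d, n) writeLines(as.character(d[[n]]))

view_doc(mini, 1)
```

Either form is fine; the piped version reads left to right, while this one avoids the dependency on extract2().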
3 Preparing the Corpus
We generally need to perform some pre-processing of the text data to prepare for the text
analysis. Example transformations include converting the text to lower case, removing numbers and
punctuation, removing stop words, stemming and identifying synonyms. The basic transforms
are all available within tm.
getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"
## [4] "stemDocument"      "stripWhitespace"
The function tm_map() is used to apply one of these transformations across all documents within
a corpus. Other transformations can be implemented using R functions and wrapped within
content_transformer() to create a function that can be passed through to tm_map(). We will
see an example of that in the next section.
In the following sections we will apply each of the transformations, one-by-one, to remove
unwanted characters from the text.
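As a small illustration of how tm_map() chains transformations, the sketch below applies a wrapped base R function (tolower()) and one of tm's own transformations (removeNumbers()) to a made-up two-document corpus. The corpus contents and the variable names mini and result are assumptions for this example only.

```r
library(tm)

# A made-up corpus of two short documents.
mini <- VCorpus(VectorSource(c("Data Mining in 2016", "R and TM")))

# tolower() is a plain R function, so wrap it with content_transformer().
mini <- tm_map(mini, content_transformer(tolower))

# removeNumbers() is one of tm's own transformations and is used directly.
mini <- tm_map(mini, removeNumbers)

# Collect the transformed text of each document into a character vector.
result <- vapply(seq_len(length(mini)),
                 function(i) as.character(mini[[i]]),
                 character(1))
result
```

Each call to tm_map() returns a new corpus, so the result is assigned back to the same variable before the next transformation is applied.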
3.1 Simple Transforms
We start with some manual special transforms we may want to do. For example, we might want
to replace "/", used sometimes to separate alternative words, with a space. This will avoid the
two words being run into one string of characters through the transformations. We might also
replace "@" and "|" with a space, for the same reason.
To create a custom transformation we make use of content_transformer() to create a function
to achieve the transformation, and then apply it to the corpus using tm_map().
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
This can be done with a single call:
docs <- tm_map(docs, toSpace, "/|@|\\|")
Check the email address in the following.
inspect(docs[16])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 44776
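To see exactly what the pattern does, it can be tried on an ordinary character string outside of tm. This is a sketch using a made-up sentence; to_space() is simply the inner function that toSpace wraps. Note that | has to be escaped as \\| to be matched literally, since an unescaped | means "or" in a regular expression.

```r
# The function wrapped by content_transformer(), applied directly to a string.
to_space <- function(x, pattern) gsub(pattern, " ", x)

# A made-up sentence containing all three characters of interest.
x <- "Email Graham.Williams@togaware.com about train/test or a|b."

# Replace "/", "@" and a literal "|" with a space in a single pass.
y <- to_space(x, "/|@|\\|")
y
```

Because each matched character is replaced by exactly one space, the length of the text is unchanged, which is consistent with the character count reported by inspect() staying the same.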