Chapter 4
String manipulations with stringr
As we saw in the previous chapters, R provides a useful range of functions for basic string
processing and manipulations of "character" data. Most of the time these functions are
enough and they will allow us to get our job done. However, they have some drawbacks. For
instance, consider the following example:
# some text vector
text = c("one", "two", "three", NA, "five")
# how many characters in each string?
nchar(text)
## [1] 3 3 5 2 4
As you can see, nchar() gives NA a value of 2, as if it were a string formed by two characters.
Perhaps this may be acceptable in some cases, but taking into account all the operations in
R, it would be better to leave NA as is, instead of treating it as a string of two characters.
Another awkward example can be found with paste(). The default separator is a blank
space, which more often than not is what we want to use. But that’s secondary. The really
annoying thing is when we want to paste things that include zero length arguments. How
does paste() behave in those cases? See below:
# this works fine
paste("University", "of", "California", "Berkeley")
## [1] "University of California Berkeley"
# this works fine too
paste("University", "of", "California", "Berkeley")
## [1] "University of California Berkeley"
# this is weird
paste("University", "of", "California", "Berkeley", NULL)
## [1] "University of California Berkeley "
# this is ugly
paste("University", "of", "California", "Berkeley", NULL, character(0),
      "Go Bears!")
## [1] "University of California Berkeley   Go Bears!"
Notice the output from the last example (the ugly one). The objects NULL and character(0)
have zero length, yet when included inside paste() they are treated as an empty string "".
Wouldn’t it be good if paste() removed zero length arguments? Sadly, there’s nothing we can
do to change nchar() and paste(). But fear not. There is a very nice package that solves
these problems and provides several functions for carrying out consistent string processing.
4.1 Package stringr
Thanks to Hadley Wickham, we have the package stringr that adds more functionality to
the base functions for handling strings in R. According to the description of the package (see
http://cran.r-project.org/web/packages/stringr/index.html) stringr
"is a set of simple wrappers that make R’s string functions more consistent, simpler
and easier to use. It does this by ensuring that: function and argument names (and
positions) are consistent, all functions deal with NA’s and zero length character
appropriately, and the output data structures from each function matches the input data
structures of other functions."
To install stringr use the function install.packages(). Once installed, load it to your
current session with library():
# installing  stringr
install.packages("stringr")
# load  stringr
library(stringr)
CC BY-NC-SA 3.0 Gaston Sanchez
Handling and Processing Strings in R
4.2 Basic String Operations
stringr provides functions for both 1) basic manipulations and 2) for regular expression
operations. In this chapter we cover those functions that have to do with basic manipulations.
In turn, regular expression functions with stringr are discussed in chapter 6.
The following table contains the stringr functions for basic string operations:
Function       Description                               Similar to
str_c()        string concatenation                      paste()
str_length()   number of characters                      nchar()
str_sub()      extracts substrings                       substring()
str_dup()      duplicates characters                     none
str_trim()     removes leading and trailing whitespace   none
str_pad()      pads a string                             none
str_wrap()     wraps a string paragraph                  strwrap()
As you can see, all functions in stringr start with "str_" followed by a term associated
to the task they perform. For example, str_length() gives us the number (i.e. length) of
characters in a string. In addition, some functions are designed to provide a better alternative
to already existing functions. This is the case of str_length() which is intended to be a
substitute of nchar(). Other functions, however, don’t have a corresponding alternative
such as str_dup() which allows us to duplicate characters.
4.2.1 Concatenating with str_c()
Let’s begin with str_c(). This function is equivalent to paste() but instead of using the
white space as the default separator, str_c() uses the empty string "".
# default usage
str_c("May", "The", "Force", "Be", "With", "You")
## [1] "MayTheForceBeWithYou"
# removing zero length objects
str_c("May", "The", "Force", NULL, "Be", "With", "You", character(0))
## [1] "MayTheForceBeWithYou"
Notice another major difference between str_c() and paste(): zero length arguments like
NULL and character(0) are silently removed by str_c().
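One practical consequence of dropping zero length arguments is that optional pieces can be passed as NULL without leaving stray separators. The following is a minimal sketch, assuming stringr is installed; the helper full_name() and its arguments are hypothetical names used only for illustration.

```r
library(stringr)

# Hypothetical helper: builds a display name with an optional title.
# When title is NULL, str_c() drops it, so no leading separator appears.
full_name <- function(first, last, title = NULL) {
  str_c(title, first, last, sep = " ")
}

with_title    <- full_name("Ada", "Lovelace", title = "Dr.")
without_title <- full_name("Ada", "Lovelace")

print(with_title)
print(without_title)
```

Only NULL is used here; how other zero length inputs such as character(0) are handled may depend on the installed stringr version.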
If we want to change the default separator, we can do that as usual by specifying the argument
sep:
# changing separator
str_c("May", "The", "Force", "Be", "With", "You", sep = "_")
## [1] "May_The_Force_Be_With_You"
# synonym function str_join
str_join("May", "The", "Force", "Be", "With", "You", sep = "-")
## [1] "May-The-Force-Be-With-You"
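As you can see from the previous examples, a synonym for str_c() is str_join(). Note that
str_join() was deprecated in later releases of stringr, so str_c() is the safer choice.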
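The sep argument also applies when str_c() receives vectors, in which case the strings are combined element by element. The sketch below, assuming stringr is loaded, additionally uses the collapse argument of str_c() (not covered in the text above) to merge the resulting vector into a single string. The variable names are illustrative.

```r
library(stringr)

# sep is inserted between the arguments, element by element
ids <- str_c(c("x", "y", "z"), as.character(1:3), sep = "_")
print(ids)

# collapse joins the elements of the result into one string
one_line <- str_c(ids, collapse = ", ")
print(one_line)
```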
4.2.2 Number of characters with str_length()
As we’ve mentioned before, the function str_length() is equivalent to nchar(). Both
functions return the number of characters in a string, that is, the length of a string (do
not confuse it with the length() of a vector). Compared to nchar(), str_length() has a
more consistent behavior when dealing with NA values. Instead of giving NA a length of 2,
str_length() preserves missing values just as NAs.
# some text (NA included)
some_text = c("one", "two", "three", NA, "five")
# compare str_length with nchar
nchar(some_text)
## [1] 3 3 5 2 4
str_length(some_text)
## [1] 3 3 5 NA 4
In addition, str_length() has the nice feature that it converts factors to characters,
something that nchar() is not able to handle:
# some factor
some_factor = factor(c(1, 1, 1, 2, 2, 2), labels = c("good", "bad"))
some_factor
## [1] good good good bad bad bad
## Levels: good bad
# try  nchar  on a factor
nchar(some_factor)
## Error: ’nchar()’ requires a character vector
# now compare it with  str_length
str_length(some_factor)
## [1] 4 4 4 3 3 3
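To make the distinction between the length of a string and the length() of a vector concrete, here is a small sketch, assuming stringr is loaded. The vectors fruits and sizes are made-up examples.

```r
library(stringr)

fruits <- c("apple", NA, "kiwi")

# length() counts the elements of the vector
n_elements <- length(fruits)

# str_length() counts the characters of each element, keeping NA as NA
n_chars <- str_length(fruits)

print(n_elements)
print(n_chars)

# a factor is handled through its labels
sizes <- factor(c("small", "large", "small"))
size_chars <- str_length(sizes)
print(size_chars)
```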
4.2.3 Substring with str_sub()
To extract substrings from a character vector stringr provides str_sub() which is equivalent
to substring(). The function str_sub() has the following usage form:
str_sub(string, start = 1L, end = -1L)
The three arguments in the function are: a string vector, a start value indicating the
position of the first character in substring, and an end value indicating the position of the
last character. Here’s a simple example with a single string in which characters from 1 to 5
are extracted:
# some text
lorem = "Lorem Ipsum"
# apply str_sub
str_sub(lorem, start = 1, end = 5)
## [1] "Lorem"
# equivalent to substring
substring(lorem, first = 1, last = 5)
## [1] "Lorem"
# another example
str_sub("adios", 1:3)
## [1] "adios" "dios" "ios"
An interesting feature of str_sub() is its ability to work with negative indices in the start
and end positions. When we use a negative position, str_sub() counts backwards from the last
character:
# some strings
resto = c("brasserie", "bistrot", "creperie", "bouchon")
# str_sub with negative positions
str_sub(resto, start = -4, end = -1)
## [1] "erie" "trot" "erie" "chon"
# compared to substring (useless)
substring(resto, first = -4, last = -1)
## [1] "" "" "" ""
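Negative positions are handy when the interesting part of a string sits at a fixed distance from its end. The sketch below, assuming stringr is loaded, uses made-up invoice codes to pull out a trailing serial number and the prefix that precedes it.

```r
library(stringr)

codes <- c("INV-2013-001", "INV-2013-042", "INV-2014-107")

# last three characters: count backwards from the end
serial <- str_sub(codes, start = -3, end = -1)

# everything except the last four characters ("-001", "-042", ...)
prefix <- str_sub(codes, start = 1, end = -5)

print(serial)
print(prefix)
```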
Similar to substring(), we can also give str_sub() a set of positions which will be recycled
over the string. But even better, we can give str_sub() a negative sequence, something that
substring() ignores:
# extracting sequentially
str_sub(lorem, seq_len(nchar(lorem)))
## [1] "Lorem Ipsum" "orem Ipsum"  "rem Ipsum"   "em Ipsum"    "m Ipsum"
## [6] " Ipsum"      "Ipsum"       "psum"        "sum"         "um"
## [11] "m"
substring(lorem, seq_len(nchar(lorem)))
## [1] "Lorem Ipsum" "orem Ipsum"  "rem Ipsum"   "em Ipsum"    "m Ipsum"
## [6] " Ipsum"      "Ipsum"       "psum"        "sum"         "um"
## [11] "m"
# reverse substrings with negative positions
str_sub(lorem, -seq_len(nchar(lorem)))
## [1] "m"           "um"          "sum"         "psum"        "Ipsum"
## [6] " Ipsum"      "m Ipsum"     "em Ipsum"    "rem Ipsum"   "orem Ipsum"
## [11] "Lorem Ipsum"
substring(lorem, -seq_len(nchar(lorem)))
## [1] "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum"
## [6] "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum"
## [11] "Lorem Ipsum"
We can use str_sub() not only for extracting substrings but also for replacing substrings:
# replacing Lorem with Nullam
lorem = "Lorem Ipsum"
str_sub(lorem, 1, 5) <- "Nullam"
lorem
## [1] "Nullam Ipsum"
# replacing with negative positions
lorem = "Lorem Ipsum"
str_sub(lorem, -1) <- "Nullam"
lorem
## [1] "Lorem IpsuNullam"
# multiple replacements
lorem = "Lorem Ipsum"
str_sub(lorem, c(1, 7), c(5, 8)) <- c("Nullam", "Enim")
lorem
## [1] "Nullam Ipsum" "Lorem Enimsum"
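The replacement form also works over a whole vector of strings. As a minimal sketch, assuming stringr is loaded and using made-up date strings, the code below overwrites a fixed-position field in every element, first with a single value and then with one value per element.

```r
library(stringr)

dates <- c("2013-01-15", "2013-02-20")

# overwrite the first four characters (the year) in every element
str_sub(dates, 1, 4) <- "2014"
print(dates)

# overwrite the last two characters (the day) with element-specific values
str_sub(dates, -2, -1) <- c("01", "28")
print(dates)
```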
4.2.4 Duplication with str_dup()
A common operation when handling characters is duplication. The problem is that R doesn’t
have a specific function for that purpose. But stringr does: str_dup() duplicates and
concatenates strings within a character vector. Its usage requires two arguments:
str_dup(string, times)
The first input is the string that we want to repeat. The second input, times, is the number
of times to duplicate each string:
# default usage
str_dup("hola", 3)
## [1] "holaholahola"
# use with different times
str_dup("adios", 1:3)
## [1] "adios"           "adiosadios"      "adiosadiosadios"
# use with a string vector
words = c("lorem", "ipsum", "dolor", "sit", "amet")
str_dup(words, 2)
## [1] "loremlorem" "ipsumipsum" "dolordolor" "sitsit"     "ametamet"
str_dup(words, 1:5)
## [1] "lorem"                "ipsumipsum"           "dolordolordolor"
## [4] "sitsitsitsit"         "ametametametametamet"
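A typical use of duplication is building simple text decorations whose size depends on other strings. The sketch below, assuming stringr is loaded, underlines a made-up title by repeating "=" once per character of that title.

```r
library(stringr)

title <- "Summary"

# underline made of "=" repeated once per character of the title
underline <- str_dup("=", str_length(title))

cat(title, underline, sep = "\n")
```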
4.2.5 Padding with str_pad()
Another handy function that we can find in stringr is str_pad() for padding a string. Its
default usage has the following form:
str_pad(string, width, side = "left", pad = " ")
The idea of str_pad() is to take a string and pad it with leading or trailing characters
to a specified total width. The default padding character is a space (pad = " "), and
consequently the returned string will appear to be either right-aligned (padding added on
the left, side = "left"), left-aligned (padding added on the right, side = "right"), or
centered (side = "both").
Let’s see some examples:
# default usage
str_pad("hola", width = 7)
## [1] "   hola"
# pad both sides
str_pad("adios", width = 7, side = "both")
## [1] " adios "
# left padding with #
str_pad("hashtag", width = 8, pad = "#")
## [1] "#hashtag"
# pad both sides with -
str_pad("hashtag", width = 9, side = "both", pad = "-")
## [1] "-hashtag-"
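A common application of padding is giving identifiers a fixed width. The following sketch, assuming stringr is loaded and using made-up ids, left-pads with zeros and also shows that a string already longer than the requested width is returned unchanged.

```r
library(stringr)

ids <- c("7", "42", "123")

# left-pad with zeros so every id has width 3
padded <- str_pad(ids, width = 3, side = "left", pad = "0")
print(padded)

# strings already at (or beyond) the requested width are returned unchanged
too_long <- str_pad("12345", width = 3, pad = "0")
print(too_long)
```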
4.2.6 Wrapping with str_wrap()
The function str_wrap() is equivalent to strwrap() which can be used to wrap a string to
format paragraphs. The idea of wrapping a (long) string is to first split it into paragraphs
according to the given width, and then add the specified indentation in each line (first line
with indent, following lines with exdent). Its default usage has the following form:
str_wrap(string, width = 80, indent = 0, exdent = 0)
For instance, consider the following quote (from Douglas Adams) converted into a paragraph:
# quote (by Douglas Adams)
some_quote = c(
"I may not have gone",
"where I intended to go,",
"but I think I have ended up",
"where I needed to be")
# some_quote in a single paragraph
some_quote = paste(some_quote, collapse = " ")
Now, say we want to display the text of some_quote within some pre-specified column width
(e.g. width of 30). We can achieve this by applying str_wrap() and setting the argument
width = 30
# display paragraph with width=30
cat(str_wrap(some_quote, width = 30))
## I may not have gone where I
## intended to go, but I think I
## have ended up where I needed
## to be
Besides displaying a (long) paragraph into several lines, we may also wish to add some
indentation. Here’s how we can indent the first line, as well as the following lines:
# display paragraph with first line indentation of 2
cat(str_wrap(some_quote, width = 30, indent = 2), "\n")
##   I may not have gone where I
## intended to go, but I think I
## have ended up where I needed
## to be
# display paragraph with following lines indentation of 3
cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
## I may not have gone where I
##    intended to go, but I
##    think I have ended up
##    where I needed to be
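It may help to see that str_wrap() returns a single string in which the line breaks are stored as "\n" characters, which is why the result is usually displayed with cat(). The sketch below, assuming stringr is loaded, wraps the same quote and then splits the result to recover the individual lines; the variable names are illustrative.

```r
library(stringr)

quote_text <- "I may not have gone where I intended to go, but I think I have ended up where I needed to be"

wrapped <- str_wrap(quote_text, width = 30)

# the result is still one string; the line breaks are "\n" characters
print(length(wrapped))

# split on the line breaks to recover the individual lines
wrapped_lines <- strsplit(wrapped, "\n", fixed = TRUE)[[1]]
print(wrapped_lines)
```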
4.2.7 Trimming with str_trim()
One of the typical tasks of string processing is that of parsing a text into individual words.
Usually, we end up with words that have blank spaces, called whitespaces, on either end of
the word. In this situation, we can use the str_trim() function to remove any number of
whitespaces at the ends of a string. Its usage requires only two arguments:
str_trim(string, side = "both")
The first input is the string to be trimmed, and the second input indicates the side on
which the whitespace will be removed.
Consider the following vector of strings, some of which have whitespaces either on the left,
on the right, or on both sides. Here’s what str_trim() would do to them under different
settings of side:
# text with whitespaces
bad_text = c("This", " example ", "has several   ", "   whitespaces ")
# remove whitespaces on the left side
str_trim(bad_text, side = "left")
## [1] "This"           "example "       "has several   " "whitespaces "
# remove whitespaces on the right side
str_trim(bad_text, side = "right")
## [1] "This"           " example"       "has several"    "   whitespaces"
# remove whitespaces on both sides
str_trim(bad_text, side = "both")
## [1] "This"        "example"     "has several" "whitespaces"
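The parsing scenario described at the start of this section can be sketched as follows, assuming stringr is loaded. A made-up comma-separated string is split with base R's strsplit(), which leaves whitespace around some pieces, and str_trim() then cleans each piece.

```r
library(stringr)

raw <- "apples , oranges,  bananas ,kiwis"

# splitting on commas leaves whitespace around some pieces
pieces <- strsplit(raw, ",", fixed = TRUE)[[1]]
print(pieces)

# trimming both sides gives clean tokens
clean <- str_trim(pieces, side = "both")
print(clean)
```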
4.2.8 Word extraction with word()
We end this chapter describing the word() function that is designed to extract words from
a sentence:
word(string, start = 1L, end = start, sep = fixed(" "))
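As a brief illustration of the usage form above, the following sketch (assuming stringr is loaded; the sentence is an arbitrary example) extracts single words and a range of words. Negative positions count from the last word, in the same spirit as str_sub().

```r
library(stringr)

sentence <- "The quick brown fox"

first_word <- word(sentence, 1)      # start = 1, end defaults to start
two_words  <- word(sentence, 2, 3)   # words 2 through 3
last_word  <- word(sentence, -1)     # negative positions count from the end

print(first_word)
print(two_words)
print(last_word)
```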