Text Mining Handbook
Casualty Actuarial Society E-Forum, Spring 2010
Create a hash of all words in the database.
Compute the TF-IDF statistic for each term on each record of the database.
Read the search string.
Compute the TF-IDF for the search string.
Compute the cosine correlation between the TF-IDF of the search string and each record in
the database.
Determine which record is the closest match and print it out.
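The steps above can be sketched as follows. This is a minimal illustration written in Python rather than the Perl used in this handbook, and it rests on several assumptions that do not come from the original program: the sample records are invented, the tokenizer simply keeps runs of letters, the weighting is a raw term count multiplied by log(N/df), and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def build_idf(records):
    """Hash every word in the database to log(N / number of records containing it)."""
    n = len(records)
    doc_freq = Counter()
    for rec in records:
        doc_freq.update(set(tokenize(rec)))
    return {term: math.log(n / df) for term, df in doc_freq.items()}

def tfidf(text, idf):
    """TF-IDF weights as a dict; words absent from the database are ignored."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(query, records):
    """Return the record closest to the search string, with its score."""
    idf = build_idf(records)
    query_vec = tfidf(query, idf)
    scores = [cosine(query_vec, tfidf(rec, idf)) for rec in records]
    best = max(range(len(records)), key=lambda i: scores[i])
    return records[best], scores[best]

# Invented records, for illustration only.
records = [
    "claimant slipped and fell on wet floor",
    "insured vehicle struck parked car",
    "water damage from burst pipe in kitchen",
]

match, score = best_match("vehicle car accident", records)
print(match, round(score, 3))
```

A word that appears in every record receives a weight of zero under this weighting, so it does not influence the match.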
2.7 Next Steps: Statistical Analysis to Derive Content
To derive content from text, techniques referred to as unsupervised learning are used. With
unsupervised learning techniques, the statistical analysis procedure has no dependent variable to fit a
model to. Instead, the values of the variables in the data are used to group like records together. For
instance, in the GL claims data, all records with words indicating a vehicle accident might be
grouped together. In the survey response data, all records invoking the word “credibility” might be
grouped together. All responses using “ERM” or “Enterprise Risk Management” might be grouped
together, but in a separate group from those with the word “credibility”. A procedure referred to as
clustering is used to perform the statistical analysis of the term-document matrix to group similar
records. Bilisoly (2008) illustrates using the data from the Perl preprocessing (such as the
term-document matrix) within an R program to perform the required statistical analysis. Bilisoly (2008),
Francis (2006), and Weiss et al. (2005) provide more detailed descriptions of how to apply clustering
and other unsupervised techniques to text mining applications. The next section of this paper
introduces R and its use in text mining. Though Perl can be used to preprocess the data and perform
simple text analytics, we will introduce the R functions that read in and preprocess data as well as the
functions that perform the statistical analysis required for content analysis.
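To make the idea of grouping like records concrete, the sketch below applies k-means, one common clustering procedure, to a tiny term-document matrix. It is written in Python for illustration only; the handbook itself performs this analysis in R. The matrix, the terms, the choice of k-means, and the seeding of the centroids with the first k rows are all assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(rows):
    """Column-wise mean of a non-empty list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def kmeans(matrix, k, max_iter=100):
    """Assign each row to one of k clusters; returns a list of cluster labels."""
    # Assumption: seed the centroids with the first k rows so the result is deterministic.
    centroids = [list(row) for row in matrix[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            for row in matrix
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(matrix, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels

# Invented term-document matrix: one row per record, one column per term.
terms = ["vehicle", "collision", "slip", "fall"]
matrix = [
    [1, 1, 0, 0],  # record mentioning a vehicle collision
    [0, 0, 1, 1],  # record mentioning a slip and fall
    [1, 0, 0, 0],  # record mentioning a vehicle
    [0, 0, 1, 0],  # record mentioning a slip
]

labels = kmeans(matrix, k=2)
print(labels)
```

Records that share a label have been grouped together. No dependent variable is involved; the grouping is driven entirely by which terms appear in each record.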
3 The Software – R and the GL Database
3.1 Introduction to R
One of the most popular open source analytical languages is R. This language is widely used to
perform statistical and data mining analyses. R can be downloaded and installed on one’s computer
from the CRAN Web Site: http://cran.r-project.org/. Figure 3.1 displays the screen for the home
page of the Web site. By clicking on your operating system: Linux, Mac OS X, or Windows, under
“Download and Install R” on the top center, you can download and install R. While free
documentation for using R is available on the R Web Site, there are many books that the new user
might enjoy and benefit from. We recommend Introductory Statistics with R by Peter Dalgaard, Modern
Applied Statistics with S by Venables and Ripley, Data Analysis and Graphics Using R by Maindonald and