
“Without the Clutter of Unimportant Words”:
Descriptive Keyphrases for Text Visualization
JASON CHUANG, CHRISTOPHER D. MANNING, and JEFFREY HEER, Stanford University
Keyphrases aid the exploration of text collections by communicating salient aspects of documents and are often used to create effective visualizations of text. While prior work in HCI and visualization has proposed a variety of ways of presenting keyphrases, less attention has been paid to selecting the best descriptive terms. In this article, we investigate the statistical and linguistic properties of keyphrases chosen by human judges and determine which features are most predictive of high-quality descriptive phrases. Based on 5,611 responses from 69 graduate students describing a corpus of dissertation abstracts, we analyze characteristics of human-generated keyphrases, including phrase length, commonness, position, and part of speech. Next, we systematically assess the contribution of each feature within statistical models of keyphrase quality. We then introduce a method for grouping similar terms and varying the specificity of displayed phrases so that applications can select phrases dynamically based on the available screen space and current context of interaction. Precision-recall measures find that our technique generates keyphrases that match those selected by human judges. Crowdsourced ratings of tag cloud visualizations rank our approach above other automatic techniques. Finally, we discuss the role of HCI methods in developing new algorithmic techniques suitable for user-facing applications.
Categories and Subject Descriptors: H.1.2 [Models and Principles]: User/Machine Systems
General Terms: Human Factors
Additional Key Words and Phrases: Keyphrases, visualization, interaction, text summarization
ACM Reference Format:
Chuang, J., Manning, C. D., and Heer, J. 2012. "Without the clutter of unimportant words": Descriptive keyphrases for text visualization. ACM Trans. Comput.-Hum. Interact. 19, 3, Article 19 (October 2012), 29 pages.
DOI = 10.1145/2362364.2362367 http://doi.acm.org/10.1145/2362364.2362367
1. INTRODUCTION
Document collections, from academic publications to blog posts, provide rich sources
of information. People explore these collections to understand their contents, uncover patterns, or find documents matching an information need. Keywords (or keyphrases) aid exploration by providing summary information intended to communicate salient aspects of one or more documents. Keyphrase selection is critical to effective visualization and interaction, including automatically labeling documents, clusters, or themes [Havre et al. 2000; Hearst 2009]; choosing salient terms for tag clouds or other text visualization techniques [Collins et al. 2009; Viégas et al. 2006, 2009]; or summarizing text to support small display devices [Yang and Wang 2003; Buyukkokten et al. 2000, 2002].
This work is part of the Mimir Project conducted at Stanford University by Daniel McFarland, Dan Jurafsky, Christopher Manning, and Walter Powell. This project is supported by the Office of the President at Stanford University, the National Science Foundation under Grant No. 0835614, and the Boeing Company.
Authors' addresses: J. Chuang, C. D. Manning, and J. Heer, 353 Serra Mall, Stanford, CA 94305; emails: {jcchuang, manning, jheer}@cs.stanford.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2012 ACM 1073-0516/2012/10-ART19 $15.00
DOI 10.1145/2362364.2362367 http://doi.acm.org/10.1145/2362364.2362367
While terms hand-selected by people are considered the gold standard, manually assigning keyphrases to thousands of documents simply does not scale.
To aid document understanding, keyphrase extraction algorithms select descriptive phrases from text. A common method is bag-of-words frequency statistics [Laver et al. 2003; Monroe et al. 2008; Rayson and Garside 2000; Robertson et al. 1981; Salton and Buckley 1988]. However, such measures may not be suitable for short texts [Boguraev and Kennedy 1999] and typically return single words, rather than more meaningful longer phrases [Turney 2000]. While others have proposed methods for extracting longer phrases [Barker and Cornacchia 2000; Dunning 1993; Evans et al. 2000; Hulth 2003; Kim et al. 2010; Medelyan and Witten 2006], researchers have yet to systematically evaluate the contribution of individual features predictive of keyphrase quality and often rely on assumptions—such as the presence of a reference corpus or knowledge of document structure—that are not universally applicable.
In this article, we characterize the statistical and linguistic properties of human-generated keyphrases. Our analysis is based on 5,611 responses from 69 students describing Ph.D. dissertation abstracts. We use our results to develop a two-stage method for automatic keyphrase extraction. We first apply a regression model to score candidate keyphrases independently; we then group similar terms to reduce redundancy and control the specificity of selected phrases. Through this research, we investigate the following concerns.
Reference Corpora. HCI researchers work with text from various sources, including data whose domain is unspecified or in which a domain-specific reference corpus is unavailable. We examine several frequency statistics and assess the trade-offs of selecting keyphrases with and without a reference corpus. While models trained on a specific domain can generate higher-quality phrases, models incorporating language-level statistics in lieu of a domain-specific reference corpus produce competitive results.

Document Diversity. Interactive systems may need to show keyphrases for a collection of documents. We compare descriptions of single documents and of multiple documents with varying levels of topical diversity. We find that increasing the size or diversity of a collection reduces the length and specificity of selected phrases.
Feature Complexity. Many existing tools select keyphrases solely using raw term counts or tf.idf scores [Salton and Buckley 1988], while recent work [Collins et al. 2009; Monroe et al. 2008] advocates more advanced measures, such as G^2 statistics [Dunning 1993; Rayson and Garside 2000]. We find that raw counts or tf.idf alone provide poor summaries but that a simple combination of raw counts and a term's language-level commonness matches the improved accuracy of more sophisticated statistics. We also examine the impact of features such as grammar and position information; for example, we find that part-of-speech tagging provides significant benefits, while more costly statistical parsing provides little improvement.
Term Similarity and Specificity. Multiword phrases identified by an extraction algorithm may contain overlapping terms or reference the same entity (person, place, etc.). We present a method for grouping related terms and reducing redundancy. The resulting organization enables users to vary the specificity of displayed terms and allows applications to dynamically select terms in response to available screen space. For example, a keyphrase label might grow longer and more specific through semantic zooming.

We assess our resulting extraction approach by comparing automatically and manually selected phrases and via crowdsourced ratings. We find that the precision and recall of candidate keyphrases chosen by our model can match that of phrases hand-selected
by human readers. We also apply our approach to tag clouds as an example of real-world presentation of keyphrases. We asked human judges to rate the quality of tag clouds using phrases selected by our technique and unigrams selected using G^2. We find that raters prefer the tag clouds generated by our method and identify other factors such as layout and prominent errors that affect judgments of keyphrase quality. Finally, we conclude the article by discussing the implications of our research for human-computer interaction, information visualization, and natural language processing.
2. RELATED WORK
Our research is informed by prior work in two surprisingly disjoint domains: (1) text
visualization and interaction and (2) automatic keyphrase extraction.
2.1. Text Visualization and Interaction
Many text visualization systems use descriptive keyphrases to summarize text or label abstract representations of documents [Cao et al. 2010; Collins et al. 2009; Cui et al. 2010; Havre et al. 2000; Hearst 2009; Shi et al. 2010; Viégas et al. 2006, 2009]. One popular way of representing a document is as a tag cloud, that is, a list of descriptive words typically sized by raw term frequency. Various interaction techniques summarize documents as descriptive headers for efficient browsing on mobile devices [Buyukkokten et al. 2000, 2002; Yang and Wang 2003]. While HCI researchers have developed methods to improve the layout of terms [Cui et al. 2010; Viégas et al. 2009], they have paid less attention to methods for selecting the best descriptive terms.
Visualizations including Themail [Viégas et al. 2006] and TIARA [Shi et al. 2010] display terms selected using variants of tf.idf (term frequency by inverse document frequency [Salton and Buckley 1988])—a weighting scheme for information retrieval. Rarely are more sophisticated methods from computational linguistics used. One exception is Parallel Tag Clouds [Collins et al. 2009], which weight terms using G^2 [Dunning 1993], a probabilistic measure of the significance of a document term with respect to a reference corpus.
Other systems, including Jigsaw [Stasko et al. 2008] and FacetAtlas [Cao et al. 2010], identify salient terms by extracting named entities, such as people, places, and dates [Finkel et al. 2005]. These systems extract specific types of structured data but may miss other descriptive phrases. In this article, we first score phrases independent of their status as entities but later apply entity recognition to group similar terms and reduce redundancy.
2.2. Automatic Keyphrase Extraction
As previously indicated, the most common means of selecting descriptive terms is via bag-of-words frequency statistics of single words (unigrams). Researchers in natural language processing have developed various techniques to improve upon raw term counts, including removal of frequent "stop words," weighting by inverse document frequency as in tf.idf [Salton and Buckley 1988] and BM25 [Robertson et al. 1981], heuristics such as WordScore [Laver et al. 2003], or probabilistic measures [Kit and Liu 2008; Rayson and Garside 2000] and the variance-weighted log-odds ratio [Monroe et al. 2008]. While unigram statistics are popular in practice, there are two causes for concern.

First, statistics designed for document retrieval weight terms in a manner that improves search effectiveness, and it is unclear whether the same terms provide good summaries for document understanding [Boguraev and Kennedy 1999; Collins et al. 2009]. For decades, researchers have anecdotally noted that the best descriptive terms are often neither the most frequent nor infrequent terms, but rather mid-frequency terms [Luhn 1958]. In addition, frequency statistics often require a large reference
corpus and may not work well for short texts [Boguraev and Kennedy 1999]. As a result, it is unclear which existing frequency statistics are best suited for keyphrase extraction.

Second, the set of good descriptive terms usually includes multiword phrases as well as single words. In a survey of journals, Turney [2000] found that unigrams account for only a small fraction of human-assigned index terms. To allow for longer phrases, Dunning proposed modeling words as binomial distributions using G^2 statistics to identify domain-specific bigrams (two-word phrases) [Dunning 1993]. Systems such as KEA++ or Maui use pseudo-phrases (phrases that remove stop words and ignore word ordering) for extracting longer phrases [Medelyan and Witten 2006]. Hulth considered all trigrams (phrases up to length of three words) in her algorithm [2003]. While the inclusion of longer phrases may allow for more expressive keyphrases, systems that permit longer phrases can suffer from poor precision and meaningless terms. The inclusion of longer phrases may also result in redundant terms of varied specificity [Evans et al. 2000], such as "visualization," "data visualization," and "interactive data visualization."
Researchers have taken several approaches to ensure that longer keyphrases are meaningful and that phrases of the appropriate specificity are chosen. Many approaches [Barker and Cornacchia 2000; Daille et al. 1994; Evans et al. 2000; Hulth 2003] filter candidate keyphrases by identifying noun phrases using a part-of-speech tagger or a parser. Of note is the use of so-called technical terms [Justeson and Katz 1995] that match regular expression patterns over part-of-speech tags. To reduce redundancy, Barker and Cornacchia [2000] choose the most specific keyphrase by eliminating any phrases that are a subphrase of another. Medelyan and Witten's KEA++ system [2006] trains a naïve Bayes classifier to match keyphrases produced by professional indexers. However, all existing methods produce a static list of keyphrases and do not account for task- or application-specific requirements.
Recently, the Semantic Evaluation (SemEval) workshop [Kim et al. 2010] held a contest comparing the performance of 21 keyphrase extraction algorithms over a corpus of ACM Digital Library articles. The winning entry, named HUMB [Lopez and Romary 2010], ranks terms using bagged decision trees learned from a combination of features, including frequency statistics, position in a document, and the presence of terms in ontologies (e.g., MeSH, WordNet) or in anchor text in Wikipedia. Moreover, HUMB explicitly models the structure of the document to preferentially weight the abstract, introduction, conclusion, and section titles. The system is designed for scientific articles and intended to provide keyphrases for indexing digital libraries.
The aims of our current research are different. Unlike prior work, we seek to systematically evaluate the contributions of individual features to keyphrase quality, allowing system designers to make informed decisions about the trade-offs of adding potentially costly or domain-limiting features. We have a particular interest in developing methods that are easy to implement, computationally efficient, and make minimal assumptions about input documents.

Second, our primary goal is to improve the design of text visualization and interaction techniques, not the indexing of digital libraries. This orientation has led us to develop techniques for improving the quality of extracted keyphrases as a whole, rather than just scoring terms in isolation (cf., [Barker and Cornacchia 2000; Turney 2000]). We propose methods for grouping related phrases that reduce redundancy and enable applications to dynamically tailor the specificity of keyphrases. We also evaluate our approach in the context of text visualization.
3. CHARACTERIZING HUMAN-GENERATED KEYPHRASES
To better understand how people choose descriptive keyphrases, we compiled a corpus of phrases manually chosen by expert and non-expert readers. We analyzed this corpus to assess how various statistical and linguistic features contribute to keyphrase quality.
3.1. User Study Design
We asked graduate students to provide descriptive phrases for a collection of Ph.D. dissertation abstracts. We selected 144 documents from a corpus of 9,068 Ph.D. dissertations published at Stanford University from 1993 to 2008. These abstracts constitute a meaningful and diverse corpus well suited to the interests of our study participants. To ensure coverage over a variety of disciplines, we selected 24 abstracts each from the following six departments: Computer Science, Mechanical Engineering, Chemistry, Biology, Education, and History. We recruited graduate students from two universities via student email lists. Students came from departments matching the topic areas of selected abstracts.
3.1.1. Study Protocol.
We selected 24 dissertations (as eight groups of three documents) from each of the six departments in the following manner. We randomly selected eight faculty members from among all faculty who have graduated at least ten Ph.D. students. For four of the faculty members, we selected the three most topically diverse dissertations. For the other four members, we selected the three most topically similar dissertations.

Subjects participated in the study over the Internet. They were presented with a series of webpages and asked to read and summarize text. Subjects received three groups of documents in sequence (nine in total); they were required to complete one group of documents before moving on to the next group. For each group of documents, subjects first summarized three individual documents in a sequence of three webpages and then summarized the three as a whole on a fourth page. Participants were instructed to summarize the content using five or more keyphrases, using any vocabulary they deemed appropriate. Subjects were not constrained to only words from the documents. They would then repeat this process for two more groups. The document groups were randomly selected such that they varied between familiar and unfamiliar topics.
We received 69 completed studies, comprising a total of 5,611 free-form responses: 4,399 keyphrases describing single documents and 1,212 keyphrases describing multiple documents. Note that while we use the terminology keyphrase in this article for brevity, the longer description "keywords and keyphrases" was used throughout the study to avoid biasing responses. The online study was titled and publicized as an investigation of "keyword usage."
3.1.2. Independent Factors.
We varied the following three independent factors in the user study.

Familiarity. We considered a subject familiar with a topic if they had conducted research in the same discipline as the presented text. We relied on self-reports to determine subjects' familiarity.

Document count. Participants were asked to summarize the content of either a single document or three documents as a group. In the case of multiple documents, we used three dissertations supervised by the same primary advisor.

Topic diversity. We measured the similarity between two documents using the cosine of the angle between tf.idf term vectors. Our experimental setup provided sets of three documents with either low or high topical similarity.
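For illustration only, the following minimal sketch (not the implementation used in this study) computes this cosine similarity over tf.idf vectors with scikit-learn; the three abstracts are placeholder strings.

# Sketch: topical similarity as the cosine of the angle between tf.idf vectors.
# Assumes scikit-learn is installed; the abstracts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "We present an interactive visualization of dissertation topics.",
    "This dissertation studies protein folding with molecular dynamics.",
    "We examine classroom discourse in secondary school mathematics.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
similarity = cosine_similarity(tfidf)   # similarity[i, j] near 1: topically similar

print(similarity.round(2))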
3.1.3. Dependent Statistical and Linguistic Features.
To analyze responses, we computed the following features for the documents and subject-authored keyphrases. We use "term" and "phrase" interchangeably. Term length refers to the number of words in a phrase; an n-gram is a phrase consisting of n words.
Documents are the texts we showed to subjects, while responses are the provided summary keyphrases. We tokenize text based on the Penn Treebank standard [Marcus et al. 1993] and extract all terms of up to length five. We record the position of each phrase in the document as well as whether or not a phrase occurs in the first sentence. Stems are the roots of words with inflectional suffixes removed. We apply light stemming [Minnen et al. 2001], which removes only noun and verb inflections (such as plural s) according to a word's part of speech. Stemming allows us to group variants of a term when counting frequencies.
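As a rough sketch of this candidate-extraction bookkeeping (simplified here with whitespace tokenization and no stemming, unlike the Penn Treebank tokenization and light stemming described above):

# Sketch: enumerate candidate n-grams (n <= 5) with simple positional features.
# Whitespace tokenization and lowercasing stand in for the Penn Treebank
# tokenizer and light stemming; for brevity, n-grams may cross sentence boundaries.
from collections import defaultdict

def candidate_phrases(text, max_n=5):
    sentences = [s.split() for s in text.lower().split(".") if s.strip()]
    tokens = [w for s in sentences for w in s]
    first_len = len(sentences[0]) if sentences else 0
    stats = defaultdict(lambda: {"tf": 0, "first_pos": None, "in_first_sentence": False})
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            entry = stats[phrase]
            entry["tf"] += 1
            if entry["first_pos"] is None:
                # Normalized term position: 0 = first word, 1 = last word.
                entry["first_pos"] = i / max(len(tokens) - 1, 1)
            if i + n <= first_len:
                entry["in_first_sentence"] = True
    return stats

stats = candidate_phrases("Interactive data visualization aids exploration. Data visualization scales poorly.")
print(stats["data visualization"])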
Term frequency (tf) is the number of times a phrase occurs in the document (document term frequency), in the full dissertation corpus (corpus term frequency), or in all English webpages (Web term frequency), as indicated by the Google Web n-gram corpus [Brants and Franz 2006]. We define term commonness as the normalized term frequency relative to the most frequent n-gram, either in the dissertation corpus or on the Web. For example, the commonness of a unigram equals log(tf) / log(tf_the), where tf_the is the frequency of "the"—the most frequent unigram. When distinctions are needed, we refer to the former as corpus commonness and the latter as Web commonness.
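In code, this normalization is straightforward; the sketch below assumes a lookup table of n-gram frequencies (the counts shown are placeholders, not actual Google Web n-gram values).

# Sketch: term commonness = log(tf) / log(tf_the), i.e., term frequency
# normalized against "the", the most frequent unigram.
# The frequency table is a placeholder for Web or corpus n-gram counts.
import math

web_tf = {
    "the": 23_000_000_000,          # placeholder count for the most frequent unigram
    "visualization": 8_000_000,     # placeholder
    "data visualization": 500_000,  # placeholder
}

def commonness(term, freq_table, top_term="the"):
    tf = freq_table.get(term, 0)
    return math.log(tf) / math.log(freq_table[top_term]) if tf > 0 else 0.0

print(commonness("data visualization", web_tf))   # mid-range commonness
print(commonness("the", web_tf))                  # 1.0 by construction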
Term position is a normalized measure of a term's location in a document; 0 corresponds to the first word and 1 to the last. The absolute first occurrence is the minimum position of a term (cf., [Medelyan and Witten 2006]). However, frequent terms are more likely to appear earlier due to higher rates of occurrence. We introduce a new feature—the relative first occurrence—to factor out the correlation between position and frequency. Relative first occurrence (formally defined in Section 4.3.1) is the probability that a term's first occurrence is lower than that of a randomly sampled term with the same frequency. This measure makes a simplistic assumption—that term positions are uniformly distributed—but allows us to assess term position as an independent feature.
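The closed form appears in Section 4.3.1; purely as an illustration of the idea, the sampling sketch below estimates relative first occurrence under the stated uniform-position assumption.

# Sketch: Monte Carlo estimate of relative first occurrence. A term occurring
# k times has an observed normalized first position; we compare it against the
# first position of simulated terms whose k occurrences fall uniformly in [0, 1].
import random

def relative_first_occurrence(first_pos, k, samples=10_000):
    earlier = sum(
        first_pos < min(random.random() for _ in range(k))
        for _ in range(samples)
    )
    return earlier / samples

# A term appearing 4 times whose first mention is 10% into the document
# occurs earlier than most same-frequency terms would by chance:
print(relative_first_occurrence(0.10, 4))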
We annotate terms that are noun phrases, verb phrases, or match technical term patterns [Justeson and Katz 1995] (see Table I). Part-of-speech information is determined using the Stanford POS Tagger [Toutanova et al. 2003]. We additionally determine grammatical information using the Stanford Parser [Klein and Manning 2003] and annotate the corresponding words in each sentence.
3.2. Exploratory Analysis of Human-Generated Phrases
Using these features, we characterized the collected human-generated keyphrases in an exploratory analysis. Our results confirm observations from prior work—the prevalence of multiword phrases [Turney 2000], preference for mid-frequency terms [Luhn 1958], and pronounced use of noun phrases [Barker and Cornacchia 2000; Daille et al. 1994; Evans et al. 2000; Hulth 2003]—and provide additional insights, including the effects of document count and diversity.
For single documents, the number of responses varies between 5 and 16 keyphrases (see Figure 1). We required subjects to enter a minimum of five responses; the peak at five in Figure 1 suggests that subjects might respond with fewer without this requirement. However, it is unclear whether this reflects a lack of appropriate choices or a desire to minimize effort. For tasks with multiple documents, participants assigned fewer keyphrases despite the increase in the amount of text and topics. Subject familiarity with the readings did not have a discernible effect on the number of keyphrases.

Assessing the prevalence of words versus phrases, Figure 2 shows that bigrams are the most common response, accounting for 43% of all free-form keyphrase responses, followed by unigrams (25%) and trigrams (19%). For multiple documents or documents with diverse topics, we observe an increase in the use of unigrams and a corresponding
[Figure 1: histograms of the number of keyphrases (5 to 16) per response, shown separately for single documents, multiple documents, and diverse documents.]
Fig. 1. How many keyphrases do people use? Participants use fewer keyphrases to describe multiple documents or documents with diverse topics, despite the increase in the amount of text and topics.
[Figure 2: histograms of phrase length (1 to 10 words) for responses, shown separately for single documents, multiple documents, and diverse documents.]
Fig. 2. Do people use words or phrases? Bigrams are the most common. For single documents, 75% of responses contain multiple words. Unigram use increases with the number and diversity of documents.
decrease in the use of trigrams and longer terms. The prevalence of bigrams confirms prior work [Turney 2000]. By permitting users to enter any response, our results provide additional data on the tail end of the distribution: there is minimal gain when assessing the quality of phrases longer than five words, which account for <5% of responses.

Figure 3 shows the distribution of responses as a function of Web commonness. We observe a bell-shaped distribution centered around mid-frequency, consistent with the distribution of significant words posited by Luhn [1958]. As the number of documents and topic diversity increases, the distribution shifts toward more common terms. We found similar correlations for corpus commonness.
[Figure 3: histograms of term Web commonness (0 to 1) for responses, shown separately for single documents, multiple documents, and diverse documents.]
Fig. 3. Do people use generic or specific terms? Term commonness increases with the number and diversity of documents.
Table I. Technical Terms
Technical Term: T = (A | N)+ (N | C) | N
Compound Technical Term: X = (A | N)* N of (T | C) | T
Note: Technical terms are defined by part-of-speech regular expressions. N is a noun, A an adjective, and C a cardinal number. We modify the definition of technical terms [Justeson and Katz 1995] by permitting cardinal numbers as the trailing word. Examples of technical terms include the following: hardware, interactive visualization, performing arts, Windows 95. Examples of compound technical terms include the following: gulf of execution, War of 1812.
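One possible way to apply these patterns (a sketch only, not the paper's pipeline, which uses the Stanford POS Tagger) is to map part-of-speech tags onto the symbols A, N, and C and run a regular expression over the resulting tag string; the example below uses NLTK's default tagger and covers only the simple technical-term pattern.

# Sketch: match the technical-term pattern T = (A|N)+ (N|C) | N over POS tags.
# Uses NLTK (requires the 'averaged_perceptron_tagger' data) as a stand-in for
# the Stanford POS Tagger; the compound pattern is omitted for brevity.
import re
import nltk

def pos_symbols(tokens):
    symbols = []
    for _, tag in nltk.pos_tag(tokens):
        if tag.startswith("JJ"):
            symbols.append("A")   # adjective
        elif tag.startswith("NN"):
            symbols.append("N")   # noun
        elif tag == "CD":
            symbols.append("C")   # cardinal number
        else:
            symbols.append("x")   # anything else
    return "".join(symbols)

TECH_TERM = re.compile(r"^(?:[AN]+[NC]|N)$")

def is_technical_term(phrase):
    return bool(TECH_TERM.match(pos_symbols(phrase.split())))

for phrase in ["interactive visualization", "Windows 95", "quickly running"]:
    print(phrase, "->", is_technical_term(phrase))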
Table II. Positional and Grammatical Statistics
Feature                       % of Keyphrases    % of All Phrases
First sentence                22.09%             8.68%
Relative first occurrence     56.28%             50.02%
Noun phrase                   64.95%             13.19%
Verb phrase                   7.02%              3.08%
Technical term                82.33%             8.16%
Compound tech term            85.18%             9.04%
Note: Position and grammar features of keyphrases present in a document (65% of total). Keyphrases occur earlier in a document; two-thirds are noun phrases, over four-fifths are technical terms.
For each user-generated keyphrase, we find matching text in the reading and note that 65% of the responses are present in the document. Considering for the rest of this paragraph just the two-thirds of keyphrases present in the readings, the associated positional and grammatical properties of this subset are summarized in Table II. 22% of keyphrases occur in the first sentence, even though first sentences contain only 9% of all terms. Comparing the first occurrence of keyphrases with that of randomly sampled phrases of the same frequency, we find that keyphrases occur earlier 56% of the time—a statistically significant result (χ^2(1) = 88, p < 0.001). Nearly two-thirds of keyphrases found in the document are part of a noun phrase (i.e., continuous
subsequence fully contained in the phrase). Only 7% are part of a verb phrase, though this is still statistically significant (χ^2(1) = 147,000, p < 0.001). Most strikingly, over 80% of the keyphrases are part of a technical term.
In summary, our exploratory analysis shows that subjects primarily choose multiword phrases, prefer terms with medium commonness, and largely use phrases already present in a document. Moreover, these features shift as the number and diversity of documents increases. Keyphrase selection also correlates with term position, suggesting we should treat documents as more than just "bags of words." Finally, human-selected keyphrases show recurring grammatical patterns, indicating the utility of linguistic features.
4. STATISTICAL MODELING OF KEYPHRASE QUALITY
Informed by our exploratory analysis, we systematically assessed the contribution of statistical and linguistic features to keyphrase quality. Our final result is a pair of regression models (one corpus-dependent, the other independent) that incorporate term frequency, commonness, position, and grammatical features.
We modeled keyphrase quality using logistic regression. We chose this model because its results are readily interpretable: contributions from each feature can be statistically assessed, and the regression value can be used to rank candidate phrases. We initially used a mixed model [Faraway 2006], which extends generalized linear models to let one assess random effects, to include variation due to subjects and documents. We found that the random effects were not significant and so reverted to a standard logistic regression model.

We constructed the models over 2,882 responses. We excluded user-generated keyphrases longer than five words (for which we are unable to determine term commonness; our data on Web commonness contains only n-grams up to length five) or not present in the documents (for which we are unable to determine grammatical and positional information). We randomly selected another set of 28,820 phrases from the corpus as negative examples, with a weight of 0.1 (so that total weights for positive examples and negative examples are equal during model fitting). Coefficients generated by logistic regression represent the best linear combination of features that differentiate user-generated responses from the random phrases.
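A minimal sketch of this fitting setup, assuming scikit-learn (the paper does not name an implementation) and placeholder feature matrices in place of the real frequency, commonness, position, and grammatical features:

# Sketch: rank candidate phrases with a weighted logistic regression.
# X_pos holds features for user-selected phrases (positive examples) and
# X_neg for randomly sampled phrases (negative examples); both are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(2882, 6))    # placeholder features
X_neg = rng.normal(0.0, 1.0, size=(28820, 6))   # placeholder features

X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
# Each negative example gets weight 0.1 so the two classes carry equal total weight.
w = np.concatenate([np.ones(len(X_pos)), np.full(len(X_neg), 0.1)])

model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

# The predicted probability of the positive class serves as the ranking score.
scores = model.predict_proba(X)[:, 1]
print(model.coef_.round(2), scores[:3].round(3))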
We examine three classes of features—frequency statistics, grammar, and position—visited in order of their predictive accuracy as determined by a preliminary analysis. Unless otherwise stated, all features are added to the regression model as independent factors without interaction terms.

We present only modeling results for keyphrases describing single documents. We did fit models for phrases describing multiple documents, and they reflect observations from the previous section, for example, weights shifted toward higher commonness scores. However, the coefficients for grammatical features exhibit large standard errors, suggesting that the smaller data set of multi-document phrases (641 phrases vs. 2,882 for single docs) is insufficient. As a result, we leave further modeling of multi-document descriptions to future work.
We evaluate features using precision-recall curves. Precision and recall measure the accuracy of an algorithm by comparing its output to a known "correct" set of phrases; in this case, the list of user-generated keyphrases up to length five. Precision measures the percentage of correct phrases in the output. Recall measures the total percentage of the correct phrases captured by the output. As more phrases are included, recall increases but precision decreases. The precision-recall curve measures the performance of an algorithm over an increasing number of output phrases. Higher precision is desirable with fewer phrases, and a larger area under the curve indicates better performance.
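The computation itself is simple; the following sketch (with made-up phrases) sweeps over the ranked output and records precision and recall at each cutoff.

# Sketch: precision and recall over an increasing number of output phrases.
# 'ranked' is a model-ordered candidate list; 'gold' is the set of
# user-generated keyphrases (up to length five) for the same document.
def precision_recall_points(ranked, gold):
    points, hits = [], 0
    for k, phrase in enumerate(ranked, start=1):
        hits += phrase in gold
        points.append((k, hits / k, hits / len(gold)))   # (cutoff, precision, recall)
    return points

ranked = ["data visualization", "the results", "tag clouds", "keyphrase extraction"]
gold = {"data visualization", "tag clouds", "keyphrase extraction"}

for k, precision, recall in precision_recall_points(ranked, gold):
    print(f"top-{k}: precision={precision:.2f} recall={recall:.2f}")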
Table III. Frequency Statistics
log(tf): log(t_Doc)
tf.idf: (t_Doc / t_Ref) · log(N / D)
G^2: 2 [ t_Doc · log( (t_Doc · T_Ref) / (T_Doc · t_Ref) ) + t̄_Doc · log( (t̄_Doc · T_Ref) / (T̄_Doc · t_Ref) ) ]
BM25: ( 3 · t_Doc / ( t_Doc + 2 · (0.25 + 0.75 · T_Doc / r) ) ) · log(N / D)
WordScore: (t_Doc − t_Ref) / (T_Doc − T_Ref)
log-odds ratio (weighted): ( log(t_Doc / t̄_Doc) − log(T_Doc / T̄_Doc) ) / ( 1/t_Doc + 1/t̄_Doc )
Note: Given a document from a reference corpus with N documents, the score for a term is given by these formulas. t_Doc and t_Ref denote term frequency in the document and reference corpus; T_Doc and T_Ref are the number of words in the document and reference corpus; D is the number of documents in which the term appears; r is the average word count per document; t̂ and T̂ indicate measures for which we increment term frequencies in each document by 0.01; terms present in the corpus but not in the document are defined as t̄_Doc = t_Ref − t_Doc and T̄_Doc = T_Ref − T_Doc. Among the family of tf.idf measures, we selected a reference-relative form as shown. For BM25, the parameters k_1 = 2 and b = 0.75 are suggested by Manning et al. [2008]. A term is any analyzed phrase (n-gram). When frequency statistics are applied to n-grams with n = 1, the terms are all the individual words in the corpus. When n = 2, scoring is applied to all unigrams and bigrams in the corpus, and so on.
We also assessed each model using model selection criteria (i.e., AIC, BIC). As these scores coincide with the rankings from precision-recall measures, we omit them.
4.1. Frequency Statistics
We computed seven different frequency statistics. Our simplest measure was log term frequency: log(tf). We also computed tf.idf, BM25, G^2, variance-weighted log-odds ratio, and WordScore. Each requires a reference corpus, for which we use the full dissertation collection. We also created a set of hierarchical tf.idf scores (e.g., as used by Viégas et al. in Themail [2006]) by computing tf.idf with five nested reference corpora: all terms on the Web, all dissertations in the Stanford dissertation corpus, dissertations from the same school, dissertations in the same department, and dissertations supervised by the same advisor. Due to its poor performance on 5-grams, we assessed four variants of standard tf.idf scores: tf.idf on unigrams, and all phrases up to bigrams, trigrams, and 5-grams. Formulas for frequency measures are shown in Table III.
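For illustration, the sketch below computes one of these measures, G^2, for a single term against a reference corpus, following the standard Rayson and Garside log-likelihood form; the smoothing noted in Table III (incrementing term frequencies by 0.01) is omitted, and the counts are made up.

# Sketch: G^2 (log-likelihood ratio) of a term in a document vs. a reference corpus.
# t_doc, t_ref: term counts in the document and reference corpus;
# T_doc, T_ref: total word counts in the document and reference corpus.
import math

def g_squared(t_doc, T_doc, t_ref, T_ref):
    t_rest, T_rest = t_ref - t_doc, T_ref - T_doc   # counts outside the document
    e_doc = T_doc * t_ref / T_ref                   # expected count in the document
    e_rest = T_rest * t_ref / T_ref                 # expected count elsewhere
    g2 = 0.0
    for observed, expected in ((t_doc, e_doc), (t_rest, e_rest)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# A term occurring 12 times in a 2,000-word abstract and 150 times in a
# 5,000,000-word reference corpus (illustrative numbers only):
print(round(g_squared(12, 2000, 150, 5_000_000), 1))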
Figure 4(a) shows the performance of these frequency statistics. Probabilistic measures—namely G^2, BM25, and weighted log-odds ratio—perform better than count-based approaches (e.g., tf.idf) and heuristics such as WordScore. Count-based approaches suffer with longer phrases due to an excessive number of ties (many 4- and 5-grams occur only once in the corpus). However, tf.idf on unigrams still performs much worse than probabilistic approaches.
4.1.1. Adding Term Commonness.
During keyphrase characterization, we observed a bell-shaped distribution of keyphrases as a function of commonness. We quantiled commonness features into Web commonness bins and corpus commonness bins in order to capture this nonlinear relationship. We examined the effects of different bin counts up to 20 bins.
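A sketch of this binning step, assuming commonness scores are available for each candidate phrase (pandas' quantile binning is our choice here, not necessarily the original tooling):

# Sketch: quantile-bin commonness scores so the regression can capture the
# bell-shaped, nonlinear relationship between commonness and keyphrase quality.
import numpy as np
import pandas as pd

# Placeholder commonness scores for 1,000 candidate phrases.
commonness = pd.Series(np.random.default_rng(1).beta(2, 2, size=1000))

# Ten equal-population bins; duplicates="drop" guards against ties at bin edges.
bins = pd.qcut(commonness, q=10, labels=False, duplicates="drop")

# One indicator column per bin, ready to join the feature matrix as factors.
indicators = pd.get_dummies(bins, prefix="web_commonness_bin")
print(indicators.sum())   # roughly 100 phrases per bin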