
Enhanced Word Clustering for Hierarchical Text Classification

Inderjit S. Dhillon
Dept. of Computer Sciences
Univ. of Texas, Austin
inderjit@cs.utexas.edu

Subramanyam Mallela
Dept. of Computer Sciences
Univ. of Texas, Austin
manyam@cs.utexas.edu

Rahul Kumar
Dept. of Computer Sciences
Univ. of Texas, Austin
rahul@cs.utexas.edu
ABSTRACT
In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower numbers of features [2, 28]. However the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies our divisive algorithm achieves higher classification accuracy, especially at lower numbers of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Dmoz Open Directory.
1. INTRODUCTION
Given a set of document vectors $\{d_1, d_2, \ldots, d_n\}$ and their associated class labels $c(d_i) \in \{c_1, c_2, \ldots, c_l\}$, text classification is the problem of estimating the true class label of a new document $d$. There exist a wide variety of algorithms for text classification, ranging from the simple but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines [24, 10, 29].
A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector-space or bag-of-words model [26]. Even a moderately sized document collection
can lead to a dimensionality in the thousands; for example, one of our test data sets contains 5,000 web pages from www.dmoz.org and has a dimensionality (vocabulary size) of 14,538. This high dimensionality can be a severe obstacle for classification algorithms based on Support Vector Machines, Linear Discriminant Analysis, k-nearest neighbor, etc. The problem is compounded when the documents are arranged in a hierarchy of classes and a full-feature classifier is applied at each node of the hierarchy.
A way to reduce dimensionality is by the distributional clustering of words/features [25, 2, 28]. Each word cluster can then be treated as a single feature and thus dimensionality can be drastically reduced. As shown by [2, 28], such feature clustering is more effective than feature selection [30], especially at lower numbers of features. Also, feature clustering appears to preserve classification accuracy as compared to a full-feature classifier. Indeed in some cases of small training sets and noisy features, word clustering can actually increase classification accuracy. However the algorithms given in both [2] and [28] are agglomerative in nature, yielding sub-optimal word clusters at a high computational cost.
In this paper, we first derive a global criterion that captures the optimality of word clustering in an information-theoretic framework. This leads to an objective function for clustering that is based on the generalized Jensen-Shannon divergence [20] among an arbitrary number of probability distributions. In order to find the best word clustering, i.e., the clustering that minimizes this objective function, we present a new divisive algorithm for clustering words. This algorithm is reminiscent of the k-means algorithm but uses Kullback-Leibler divergences [19] instead of squared Euclidean distances. We prove that our divisive algorithm monotonically decreases the objective function value, thus converging to a local minimum. We also show that our algorithm minimizes "within-cluster divergence" and simultaneously maximizes "between-cluster divergence". Thus we find word clusters that are markedly better than the agglomerative algorithms of [2, 28]. The increased quality of our word clusters translates to higher classification accuracies, especially at small feature sizes and small training sets. We provide empirical evidence of all the above claims using Naive Bayes and Support Vector Machines on (a) the 20 Newsgroups data set, and (b) an HTML data set comprising 5,000 web pages arranged in a 3-level hierarchy from the Open Directory Project (www.dmoz.org).
We now give a brief outline of the paper. In Section 2, we discuss related work and contrast it with our work. In Section 3 we briefly review some useful concepts from
information theory such as Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence, while in Section 4 we review text classifiers based on Naive Bayes and Support Vector Machines. Section 5 poses the question of finding optimal word clusters in terms of preserving mutual information between two random variables. Section 5.1 gives the algorithm that directly minimizes the resulting objective function, which is based on KL-divergences, and presents some pleasing results about the algorithm, such as convergence and simultaneous maximization of "between-cluster JS-divergence". In Section 6 we present experimental results that show the superiority of our word clustering, and the resulting increase in classification accuracy. Finally, we present our conclusions in Section 7.
A word about notation: upper-case letters such as $X$, $Y$, $C$, $W$ will denote random variables, while script upper-case letters such as $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{C}$, $\mathcal{W}$ denote sets. Individual set elements will often be denoted by lower-case letters such as $x$, $w$ or $x_i$, $w_t$. Probability distributions will be denoted by $p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious, or by $p(X)$, $p(C|w_t)$, etc. to make the random variable explicit.
2. RELATED WORK
Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag-of-words model for text [26]. A simple but effective algorithm is the Naive Bayes method [24]. For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic hierarchies of Yahoo! (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied in [18, 5]. For more details, see Section 4.
To counter high dimensionality, various methods of feature selection have been proposed in [30, 18, 5]. Distributional clustering of words was first proposed by Pereira, Tishby & Lee in [25], where they used "soft" distributional clustering to cluster nouns according to their conditional verb distributions. Note that since our main goal is to reduce the number of features and the model size, we are only interested in "hard clustering" where each word can be represented by its (unique) word cluster. For text classification, Baker & McCallum used such hard clustering in [2], while more recently, Slonim & Tishby have used the so-called Information Bottleneck method for clustering words in [28]. Both [2] & [28] use similar agglomerative clustering strategies that make a greedy move at every agglomeration, and show that feature size can be aggressively reduced by such clustering without much loss in classification accuracy using Naive Bayes. Similar results have been reported for SVMs [3].
Two other dimensionality/feature reduction schemes are used in latent semantic indexing (LSI) [7] and its probabilistic version [16]. Typically these methods have been applied in the unsupervised setting and, as shown in [2], LSI results in lower classification accuracies than feature clustering.
We now list the main contributions of this paper and contrast them with earlier work. As our first contribution, we derive a global criterion that explicitly captures the optimality of word clusters in an information-theoretic framework. This leads to an objective function in terms of the generalized Jensen-Shannon divergence between an arbitrary number of probability distributions. As our second contribution, we present a divisive algorithm that uses Kullback-Leibler divergence as the distance measure, and explicitly minimizes the global objective function. This is in contrast to [28], which considered the merging of just two word clusters at every step and derived a local criterion based on the Jensen-Shannon divergence of two probability distributions. Their agglomerative algorithm, which is similar to Baker and McCallum's algorithm [2], greedily optimizes this merging criterion. Thus, their resulting algorithm can yield sub-optimal clusters and is computationally expensive (the algorithm in [28] is $O(m^3 l)$ in complexity, where $m$ is the total number of words and $l$ is the number of classes). In contrast our divisive algorithm is $O(mkl)$ where $k$ is the number of word clusters required (typically $k \ll m$). Note that our hard clustering leads to a model size of $O(k)$, whereas "soft" clustering in methods such as probabilistic LSI [16] leads to a model size of $O(mk)$. Finally, we show that our enhanced word clustering leads to higher classification accuracy, especially when the training set is small and in hierarchical classification of HTML data.
3. INFORMATION THEORY
In this section, we quickly review some concepts from information theory which will be used heavily in this paper. For more details on some of this material see the authoritative treatment in the book by Cover & Thomas [6].
Let $X$ be a discrete random variable that takes on values from the set $\mathcal{X}$ with probability distribution $p(x)$. The (Shannon) entropy of $X$ [27] is defined as
$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
The relative entropy or Kullback-Leibler (KL) divergence [19] between two distributions $p_1(x)$ and $p_2(x)$ is defined as
$$KL(p_1, p_2) = \sum_{x \in \mathcal{X}} p_1(x) \log \frac{p_1(x)}{p_2(x)}.$$
KL-divergence is a measure of the "distance" between two probability distributions; however it is not a true metric since it is not symmetric and does not obey the triangle inequality [6, p. 18]. KL-divergence is always non-negative but can be unbounded; in particular, when $p_1(x) \neq 0$ and $p_2(x) = 0$, $KL(p_1, p_2) = \infty$. In contrast, the Jensen-Shannon (JS) divergence between $p_1$ and $p_2$, defined by
$$JS_\pi(p_1, p_2) = \pi_1 KL(p_1, \pi_1 p_1 + \pi_2 p_2) + \pi_2 KL(p_2, \pi_1 p_1 + \pi_2 p_2) = H(\pi_1 p_1 + \pi_2 p_2) - \pi_1 H(p_1) - \pi_2 H(p_2),$$
where $\pi_1 + \pi_2 = 1$, $\pi_i \geq 0$, is clearly a measure that is symmetric in $\{\pi_1, p_1\}$ and $\{\pi_2, p_2\}$, and is bounded [20].
The JS-divergence can be generalized to measure the distance between any finite number of probability distributions as:
$$JS_\pi(\{p_i : 1 \le i \le n\}) = H\!\left(\sum_{i=1}^n \pi_i p_i\right) - \sum_{i=1}^n \pi_i H(p_i), \qquad (1)$$
which is symmetric in the $\{\pi_i, p_i\}$'s ($\sum_i \pi_i = 1$, $\pi_i \geq 0$).
Let $Y$ be another random variable with probability distribution $p(y)$. The mutual information between $X$ and $Y$, $I(X;Y)$, is defined as the KL-divergence between the joint distribution $p(x,y)$ and the product distribution $p(x)p(y)$:
$$I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}. \qquad (2)$$
Intuitively, mutual information is a measure of the amount of information that one random variable contains about the other. The higher its value, the less is the uncertainty of one random variable due to knowledge about the other. Formally, it can be shown that $I(X;Y)$ is the reduction in entropy of one variable knowing the other: $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$ [6].
4. TEXT CLASSIFICATION
Two contrasting classifiers that perform well on text classification are (i) the simple Naive Bayes method and (ii) the more complex Support Vector Machines. We now give details on these classifiers.
4.1 NaiveBayesClassier
Let C = fc
1
;c
2
;:::;c
l
g be the set of l classes, and let
W=fw
1
;:::;w
m
gbethesetofwords/features contained
in theseclasses. Given anew documentd, the probability
thatdbelongstoclassc
i
isgivenbyBayesrule,
p(c
i
jd)=
p(djc
i
)p(c
i
)
p(d)
:
Assuming a generative multinomial model [21] and further assuming class-conditional independence of words yields the well-known Naive Bayes classifier [24], which computes the most probable class for $d$ as
$$c^*(d) = \arg\max_{c_i} p(c_i \mid d) = \arg\max_{c_i}\; p(c_i) \prod_{t=1}^m p(w_t \mid c_i)^{n(w_t, d)} \qquad (3)$$
where $n(w_t, d)$ is the number of occurrences of word $w_t$ in document $d$, and the quantities $p(w_t \mid c_i)$ are usually estimated using Laplace's rule of succession:
$$p(w_t \mid c_i) = \frac{1 + \sum_{d_j \in c_i} n(w_t, d_j)}{m + \sum_{t=1}^m \sum_{d_j \in c_i} n(w_t, d_j)}. \qquad (4)$$
The class priors $p(c_i)$ are estimated by the maximum likelihood estimate $p(c_i) = \frac{|c_i|}{\sum_j |c_j|}$. We now manipulate the Naive Bayes rule in order to interpret it in an information-theoretic framework. Rewrite formula (3) by taking logarithms and dividing by the length of the document $|d|$ to get
$$c^*(d) = \arg\max_{c_i} \left( \log p(c_i) + \sum_{t=1}^m p(w_t \mid d) \log p(w_t \mid c_i) \right), \qquad (5)$$
where the document $d$ may be viewed as a probability distribution over words: $p(w_t \mid d) = n(w_t, d)/|d|$. Adding the entropy of $p(W \mid d)$, i.e., $-\sum_{t=1}^m p(w_t \mid d) \log p(w_t \mid d)$, to (5) and negating, we get
$$c^*(d) = \arg\min_{c_i} \left( \sum_{t=1}^m p(w_t \mid d) \log \frac{p(w_t \mid d)}{p(w_t \mid c_i)} - \log p(c_i) \right) = \arg\min_{c_i} \left( KL(p(W|d), p(W|c_i)) - \log p(c_i) \right), \qquad (6)$$
where $KL(p, q)$ denotes the KL-divergence between $p$ and $q$ as defined in Section 3. Note that here we have used $W$ to denote the random variable that takes values from the set $\mathcal{W}$. Thus, assuming equal class priors, we see that Naive Bayes may be interpreted as finding the class which has minimum KL-divergence from the given document. As we shall see again later, KL-divergence seems to appear "naturally" in our setting.
By (5), we can clearly see that Naive Bayes is a linear classifier. Despite its crude assumption about the class-conditional independence of words, Naive Bayes has been found to yield surprisingly good classification performance, especially on text data. Plausible reasons for the success of Naive Bayes have been explored in [9, 12].
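As a minimal sketch of the classifier described above (ours, not the paper's implementation), the following code estimates the class priors and the Laplace-smoothed word probabilities of (4) from a word-count matrix, and classifies a document using the log form of (3); the matrix layout, names, and toy corpus are illustrative assumptions.

```python
import numpy as np

def train_nb(counts, labels, n_classes):
    """counts: (n_docs, m) word-count matrix; labels: class index per document.
    Returns log class priors and log p(w_t | c_i) with Laplace smoothing (eq. 4)."""
    n_docs, m = counts.shape
    log_prior = np.zeros(n_classes)
    log_pwc = np.zeros((n_classes, m))
    for c in range(n_classes):
        in_c = counts[labels == c]
        log_prior[c] = np.log(len(in_c) / n_docs)            # p(c_i) = |c_i| / sum_j |c_j|
        word_counts = in_c.sum(axis=0)
        log_pwc[c] = np.log((1.0 + word_counts) / (m + word_counts.sum()))
    return log_prior, log_pwc

def classify_nb(doc_counts, log_prior, log_pwc):
    """Most probable class: argmax_i log p(c_i) + sum_t n(w_t,d) log p(w_t|c_i), eq. (3)."""
    return int(np.argmax(log_prior + log_pwc @ doc_counts))

if __name__ == "__main__":
    counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]])  # toy corpus, m = 3 words
    labels = np.array([0, 0, 1, 1])
    lp, lpwc = train_nb(counts, labels, n_classes=2)
    print(classify_nb(np.array([1, 0, 0]), lp, lpwc))  # -> 0
```

Dividing by $|d|$ as in (5) does not change the argmax, so the sketch works directly with raw counts.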
4.2 Support Vector Machines
Support Vector Machines (SVMs) [29] are inductive learning schemes for solving the two-class pattern recognition problem. Recently SVMs have been shown to give good results for text categorization [17]. The method is defined over a vector space where the classification problem is to find the decision surface that "best" separates the data points of the two classes. In the case of linearly separable data, the decision surface is a hyperplane that maximizes the "margin" between the two classes. This hyperplane can be written as $\vec{w} \cdot \vec{x} - b = 0$, where $\vec{x}$ is a data point and the vector $\vec{w}$ and constant $b$ are learned from the training set. Let $y_i \in \{+1, -1\}$ ($+1$ for the positive class and $-1$ for the negative class) be the classification label for input vector $\vec{x}_i$. Finding the hyperplane can be translated into the following optimization problem:
$$\text{Minimize } \|\vec{w}\| \quad \text{subject to} \quad \vec{w} \cdot \vec{x}_i - b \geq +1 \text{ for } y_i = +1, \qquad \vec{w} \cdot \vec{x}_i - b \leq -1 \text{ for } y_i = -1.$$
This minimization problem can be solved using quadratic programming techniques [29]. The algorithms for solving the linearly separable case can be extended to the case of data that is not linearly separable by either introducing soft-margin hyperplanes or by using a non-linear mapping of the original data vectors to a higher dimensional space where the data points are linearly separable [29]. Even though SVM classifiers are described as binary classifiers, they can be easily combined to handle the multiclass case. A simple, effective combination is to train $N$ one-versus-rest classifiers for the $N$-class case and then classify the test point to the class corresponding to the largest positive distance to the separating hyperplane. In all our experiments we used linear SVMs because they are fast to learn and classify new instances compared to non-linear SVMs. Further, linear SVMs have been shown to do well on text classification [17].
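The one-versus-rest combination just described can be sketched as below, assuming the $N$ weight vectors and offsets have already been learned by some SVM training routine (not shown); the function name, the array layout, and the normalization of the margin by $\|\vec{w}_i\|$ are our own choices, not details from the paper.

```python
import numpy as np

def one_vs_rest_predict(x, W, b):
    """Assign x to the class whose hyperplane w_i . x - b_i lies farthest on the
    positive side.  W: (N, d) weight vectors; b: (N,) offsets; one binary
    one-versus-rest classifier per class."""
    margins = W @ x - b                            # signed values before scaling
    distances = margins / np.linalg.norm(W, axis=1)
    return int(np.argmax(distances))

if __name__ == "__main__":
    W = np.array([[1.0, 0.0], [0.0, 1.0]])         # toy hyperplanes for 2 classes
    b = np.array([0.5, 0.5])
    print(one_vs_rest_predict(np.array([2.0, 0.2]), W, b))   # -> 0
```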
4.3 HierarchicalClassication
Hierarchicalclassiflcationutilizesahierarchicaltopicstruc-
turesuchasYahoo! todecomposetheclassiflcationtaskinto
a set of simpler problems, oneat each nodeinthehierar-
chy. We can simplyextend any classifler toperformhier-
archical classiflcation byconstructinga(distinct) classifler
at each internalnodeof the tree using all the documents
in its child nodes as the training data. Thus the tree is
assumedtobe\is-a"hierarchy,i.e., thetraininginstances
are inherited bythe parents. Then classiflcation is just a
greedydescentdownthetreeuntiltheleafnodeisreached.
Thiswayofclassiflcationhasbeenshowntobeequivalentto
thestandardnon-hierarchicalclassiflcationovera°atsetof
leaf classesifmaximumlikelihoodestimates ofall features
areused[23]. However,hierarchicalclassiflcationalongwith
feature selection has been shown toachieve better classi-
flcation results than a °at classifler[18]. This is because
each classiflercannowutilizeadifierentsubsetoffeatures
thataremostrelevanttotheclassiflcationsub-taskathand.
Furthermore, each node classifier requires only a small number of features since it needs to distinguish between a fewer number of classes. Our proposed feature clustering strategy allows us to aggressively reduce the number of features associated with each node classifier in the hierarchy. Detailed experiments on the Dmoz Science hierarchy are presented in Section 6.
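A sketch (ours) of the greedy top-down descent described above: each internal node holds its own classifier over its children, and a document is routed downward until a leaf is reached. The Node structure, the classifier interface, and the toy keyword routing are illustrative assumptions, not the paper's implementation.

```python
class Node:
    """Internal nodes hold a classifier over their children; leaves hold a class label."""
    def __init__(self, label=None, children=None, classifier=None):
        self.label = label                # class label if this node is a leaf
        self.children = children or []    # list of child Node objects
        self.classifier = classifier      # callable: document -> index of the chosen child

def classify_hierarchical(doc, node):
    """Greedy descent: at each internal node pick the best child until a leaf is reached."""
    while node.children:
        node = node.children[node.classifier(doc)]
    return node.label

# Example with a trivial 'classifier' that routes on a single keyword.
leaves = [Node(label="physics"), Node(label="biology")]
root = Node(children=leaves, classifier=lambda d: 0 if "quark" in d else 1)
print(classify_hierarchical({"quark", "lab"}, root))   # -> "physics"
```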
5. DISTRIBUTIONAL WORD CLUSTERING
Let $C$ be a discrete random variable that takes on values from the set of classes $\mathcal{C} = \{c_1, \ldots, c_l\}$, and let $W$ be the random variable that ranges over the set of words $\mathcal{W} = \{w_1, \ldots, w_m\}$. The joint distribution $p(C, W)$ can be estimated from the training set. Now suppose we cluster the words into $k$ clusters $\mathcal{W}_1, \ldots, \mathcal{W}_k$. Since we are interested in reducing the number of features and the model size, we only look at "hard" clustering where each word belongs to exactly one word cluster, i.e.,
$$\mathcal{W} = \bigcup_{i=1}^k \mathcal{W}_i, \quad \text{and} \quad \mathcal{W}_i \cap \mathcal{W}_j = \emptyset, \; i \neq j.$$
Let the random variable $W_C$ range over the word clusters. In order to judge the quality of the word clusters we now introduce an information-theoretic measure.
The information about $C$ captured by $W$ can be measured by the mutual information $I(C;W)$. Ideally, in forming word clusters we would like to exactly preserve the mutual information; however clustering usually lowers mutual information. Thus we would like to find a clustering that minimizes the decrease in mutual information, $I(C;W) - I(C;W_C)$. The following theorem states that this change in mutual information can be expressed in terms of the generalized Jensen-Shannon divergence of each word cluster.
Theorem 1. The change in mutual information due to word clustering is given by
$$I(C;W) - I(C;W_C) = \sum_{j=1}^k \pi(\mathcal{W}_j)\, JS_{\pi'}(\{p(C|w_t) : w_t \in \mathcal{W}_j\})$$
where $\pi(\mathcal{W}_j) = \sum_{w_t \in \mathcal{W}_j} \pi_t$, $\pi_t = p(w_t)$, $\pi'_t = \pi_t / \pi(\mathcal{W}_j)$ for $w_t \in \mathcal{W}_j$, and $JS$ denotes the generalized Jensen-Shannon divergence as defined in (1).
Proof. By the definition of mutual information (see (2)), and using $p(c_i, w_t) = \pi_t\, p(c_i|w_t)$, we get
$$I(C;W) = \sum_i \sum_t \pi_t\, p(c_i|w_t) \log \frac{p(c_i|w_t)}{p(c_i)} \quad \text{and} \quad I(C;W_C) = \sum_i \sum_j \pi(\mathcal{W}_j)\, p(c_i|\mathcal{W}_j) \log \frac{p(c_i|\mathcal{W}_j)}{p(c_i)}.$$
We are interested in hard clustering, so $\pi(\mathcal{W}_j) = \sum_{w_t \in \mathcal{W}_j} \pi_t$ and $p(c_i|\mathcal{W}_j) = \sum_{w_t \in \mathcal{W}_j} (\pi_t / \pi(\mathcal{W}_j))\, p(c_i|w_t)$, thus implying that for all clusters $\mathcal{W}_j$,
$$\pi(\mathcal{W}_j)\, p(c_i|\mathcal{W}_j) = \sum_{w_t \in \mathcal{W}_j} \pi_t\, p(c_i|w_t), \qquad (7)$$
$$p(C|\mathcal{W}_j) = \sum_{w_t \in \mathcal{W}_j} \frac{\pi_t}{\pi(\mathcal{W}_j)}\, p(C|w_t). \qquad (8)$$
Note that the distribution $p(C|\mathcal{W}_j)$ is the (weighted) mean distribution of the constituent distributions $p(C|w_t)$. Thus,
$$I(C;W) - I(C;W_C) = \sum_i \sum_t \pi_t\, p(c_i|w_t) \log p(c_i|w_t) - \sum_i \sum_j \pi(\mathcal{W}_j)\, p(c_i|\mathcal{W}_j) \log p(c_i|\mathcal{W}_j) \qquad (9)$$
since the extra $\log p(c_i)$ terms cancel due to (7). The first term in (9), after rearranging the sum, may be written as
$$\sum_j \sum_{w_t \in \mathcal{W}_j} \pi_t \left( \sum_i p(c_i|w_t) \log p(c_i|w_t) \right) = -\sum_j \sum_{w_t \in \mathcal{W}_j} \pi_t\, H(p(C|w_t)) = -\sum_j \pi(\mathcal{W}_j) \sum_{w_t \in \mathcal{W}_j} \frac{\pi_t}{\pi(\mathcal{W}_j)}\, H(p(C|w_t)). \qquad (10)$$
Similarly, the second term in (9) may be written as
$$\sum_j \pi(\mathcal{W}_j) \left( \sum_i p(c_i|\mathcal{W}_j) \log p(c_i|\mathcal{W}_j) \right) = -\sum_j \pi(\mathcal{W}_j)\, H(p(C|\mathcal{W}_j)) = -\sum_j \pi(\mathcal{W}_j)\, H\!\left( \sum_{w_t \in \mathcal{W}_j} \frac{\pi_t}{\pi(\mathcal{W}_j)}\, p(C|w_t) \right) \qquad (11)$$
where (11) is obtained by substituting the value of $p(C|\mathcal{W}_j)$ from (8). Substituting (10) and (11) in (9) and using the definition of Jensen-Shannon divergence from (1) gives us the desired result. $\square$
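Theorem 1 can also be checked numerically: for a small joint distribution $p(C, W)$ and any hard partition of the words, the drop in mutual information equals the sum of cluster-prior-weighted generalized JS-divergences. The sketch below is our own illustration (the toy joint distribution and the partition are arbitrary), not part of the paper's experiments.

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz]))

# Toy joint distribution p(C, W): rows are 2 classes, columns are 4 words.
pCW = np.array([[0.20, 0.15, 0.05, 0.10],
                [0.05, 0.05, 0.25, 0.15]])
pi = pCW.sum(axis=0)          # word priors pi_t = p(w_t)
pCw = pCW / pi                # column t is p(C | w_t)
clusters = [[0, 1], [2, 3]]   # a hard partition of the 4 words

# Left-hand side: I(C;W) - I(C;W_C), where W_C merges the words of each cluster.
pCWc = np.stack([pCW[:, c].sum(axis=1) for c in clusters], axis=1)
lhs = mutual_info(pCW) - mutual_info(pCWc)

# Right-hand side: sum_j pi(W_j) * JS_{pi'}({p(C|w_t) : w_t in W_j}).
rhs = 0.0
for c in clusters:
    piWj = pi[c].sum()
    w = pi[c] / piWj
    mean = pCw[:, c] @ w                                     # p(C | W_j), eq. (8)
    rhs += piWj * (H(mean) - np.sum(w * np.array([H(pCw[:, t]) for t in c])))

print(lhs, rhs)   # the two sides agree up to floating-point error
```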
Theorem 1 gives a global measure of the goodness of word clusters, which may be informally interpreted as follows:
1. The quality of word cluster $\mathcal{W}_j$ is measured by the Jensen-Shannon divergence between the individual word distributions $p(C|w_t)$ (weighted by the word priors, $\pi_t = p(w_t)$). The smaller the Jensen-Shannon divergence, the more "compact" is the word cluster, i.e., the smaller is the increase in entropy due to clustering (see (1)).
2. The overall goodness of the word clustering is measured by the sum of the qualities of individual word clusters (weighted by the cluster priors $\pi(\mathcal{W}_j) = p(\mathcal{W}_j)$).
Given the global criterion of Theorem 1, we would now like to find an algorithm that searches for the optimal word clustering that minimizes this criterion. We now rewrite this criterion in a way that will suggest a "natural" algorithm.
Lemma 1. The generalized Jensen-Shannon divergence of a finite set of probability distributions can be expressed as the (weighted) sum of Kullback-Leibler divergences to the (weighted) mean, i.e.,
$$JS_\pi(\{p_i : 1 \le i \le n\}) = \sum_{i=1}^n \pi_i\, KL(p_i, m) \qquad (12)$$
where $\pi_i \geq 0$, $\sum_i \pi_i = 1$ and $m$ is the (weighted) mean probability distribution, $m = \sum_i \pi_i p_i$.
Algorithm Divisive_KL_Clustering($\mathcal{P}$, $\Pi$, $l$, $k$, $\mathcal{W}$)
Input: $\mathcal{P}$ is the set of distributions $\{p(C|w_t) : 1 \le t \le m\}$,
$\Pi$ is the set of all word priors $\{\pi_t = p(w_t) : 1 \le t \le m\}$,
$l$ is the number of document classes,
$k$ is the number of desired clusters.
Output: $\mathcal{W}$ is the set of word clusters $\{\mathcal{W}_1, \mathcal{W}_2, \ldots, \mathcal{W}_k\}$.
1. Initialization: for every word $w_t$, assign $w_t$ to $\mathcal{W}_j$ such that $p(c_j|w_t) = \max_i p(c_i|w_t)$. This gives $l$ initial word clusters; if $k \geq l$ split each cluster into approximately $k/l$ clusters, otherwise merge the $l$ clusters to get $k$ word clusters.
2. For each cluster $\mathcal{W}_j$, compute
$$\pi(\mathcal{W}_j) = \sum_{w_t \in \mathcal{W}_j} \pi_t, \qquad p(C|\mathcal{W}_j) = \sum_{w_t \in \mathcal{W}_j} \frac{\pi_t}{\pi(\mathcal{W}_j)}\, p(C|w_t).$$
3. Re-compute all clusters: for each word $w_t$, find its new cluster index as
$$j^*(w_t) = \arg\min_i KL(p(C|w_t), p(C|\mathcal{W}_i)),$$
resolving ties arbitrarily. Thus compute the new word clusters $\mathcal{W}_j$, $1 \le j \le k$, as $\mathcal{W}_j = \{w_t : j^*(w_t) = j\}$.
4. Stop if the change in objective function value given by (13) is "small" (say $10^{-3}$); else go to step 2.

Figure 1: Divisive algorithm for word clustering based on KL-divergences.

Proof (of Lemma 1). Use the definition of entropy to expand the expression for JS-divergence given in (1). The result follows by appropriately grouping terms and using the definition of KL-divergence. $\square$
5.1 The Algorithm
By Theorem 1 and Lemma 1, the decrease in mutual information due to word clustering may be written as
$$\sum_{j=1}^k \pi(\mathcal{W}_j) \sum_{w_t \in \mathcal{W}_j} \frac{\pi_t}{\pi(\mathcal{W}_j)}\, KL(p(C|w_t), p(C|\mathcal{W}_j)).$$
As a result, the quality of word clustering can be measured by the objective function
$$Q(\{\mathcal{W}_j\}_{j=1}^k) = I(C;W) - I(C;W_C) = \sum_{j=1}^k \sum_{w_t \in \mathcal{W}_j} \pi_t\, KL(p(C|w_t), p(C|\mathcal{W}_j)). \qquad (13)$$
Note that it is natural that the KL-divergence emerges as the distance measure in the above objective function, since mutual information is just the KL-divergence between the joint distribution and the product distribution (see Section 3). Writing the objective function in the above manner suggests an iterative algorithm that repeatedly (i) re-partitions the distributions $p(C|w_t)$ by their closeness in KL-divergence to the cluster distributions $p(C|\mathcal{W}_j)$, and (ii) subsequently, given the new word clusters, re-computes these cluster distributions using (8). Figure 1 describes the algorithm in detail. Note that this divisive algorithm bears some resemblance to the k-means or Lloyd-Max algorithm, which usually uses squared Euclidean distances [11, 10, 15, 4].
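Below is a compact Python rendering (ours) of the algorithm in Figure 1. The update and reassignment steps follow steps 2-4 directly; the initialization is a simplified variant of step 1 in which words are first grouped by their most probable class and then split or merged round-robin to reach $k$ clusters. The function names, the round-robin splitting, and the stopping tolerance are our choices, so this is a sketch rather than the authors' implementation.

```python
import numpy as np

def kl(p, q):
    """KL(p, q); +inf when q is zero somewhere p is positive (IEEE inf, no 'if' needed)."""
    nz = p > 0
    with np.errstate(divide="ignore"):
        return np.sum(p[nz] * (np.log2(p[nz]) - np.log2(q[nz])))

def divisive_kl_clustering(pCw, pi, k, tol=1e-3, max_iter=100):
    """pCw: (m, l) array whose row t is p(C | w_t); pi: (m,) word priors; k: #clusters.
    Returns an (m,) array giving the cluster index of every word (a hard clustering)."""
    m, l = pCw.shape
    best_class = np.argmax(pCw, axis=1)
    if k <= l:                         # merge the l class-based groups down to k clusters
        assign = best_class % k
    else:                              # split each class-based group into roughly k/l pieces
        split = max(k // l, 1)
        assign = (best_class * split + np.arange(m) % split) % k
    prev_obj = np.inf
    for _ in range(max_iter):
        # Step 2: cluster priors pi(W_j) and cluster distributions p(C | W_j), eq. (8).
        centroids = np.empty((k, l))
        for j in range(k):
            members = assign == j
            if members.any():
                w = pi[members] / pi[members].sum()
                centroids[j] = w @ pCw[members]
            else:                                   # re-seed an empty cluster
                centroids[j] = pCw[np.random.randint(m)]
        # Step 3: reassign every word to the closest cluster distribution in KL-divergence.
        dists = np.array([[kl(pCw[t], centroids[j]) for j in range(k)] for t in range(m)])
        assign = np.argmin(dists, axis=1)
        # Step 4: stop when the (monotonically decreasing) objective value barely changes.
        obj = np.sum(pi * dists[np.arange(m), assign])
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return assign

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pCw = rng.dirichlet(np.ones(4), size=50)   # 50 synthetic words over 4 classes
    pi = np.full(50, 1.0 / 50)
    print(divisive_kl_clustering(pCw, pi, k=8))
```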
Note that our initialization strategy is crucial to our algorithm, see step 1 in Figure 1 (also see [8, Section 5.1]), since it guarantees absolute continuity of each $p(C|w_t)$ with at least one cluster distribution $p(C|\mathcal{W}_j)$, i.e., it guarantees that at least one KL-divergence is finite. This is because our initialization strategy ensures that every word $w_t$ is part of some cluster $\mathcal{W}_j$. Thus, by the formula for $p(C|\mathcal{W}_j)$ in step 2, it cannot happen that $p(c_i|w_t) \neq 0$ and $p(c_i|\mathcal{W}_j) = 0$. Note that we can still get some infinite KL-divergence values, but these do not lead to any difficulty (indeed in an implementation we can handle such "infinity problems" without an extra "if" condition thanks to the handling of "infinity" in the IEEE floating point standard [14, 1]).
We now discuss the computational complexity of our algorithm. Step 3 of each iteration requires the KL-divergence to be computed for every pair, $p(C|w_t)$ and $p(C|\mathcal{W}_j)$. This is the most computationally demanding task and costs a total of $O(mkl)$ operations. Generally, we have found that the algorithm converges in 10-15 iterations independent of the size of the data set. Thus the total complexity is $O(mkl)$, which grows linearly with $m$ (note that $k \ll m$). In contrast, the agglomerative algorithm of [28] costs $O(m^3 l)$ operations.
The algorithm in Figure 1 has certain pleasing properties. As we will prove in Theorems 2 and 3, our algorithm decreases the objective function value at every step and thus is guaranteed to converge to a local minimum in a finite number of steps (note that finding the global minimum is NP-complete [13]). Also, by Theorem 1 and (13) we see that our algorithm minimizes the "within-cluster" Jensen-Shannon divergence. It turns out (see Theorem 4) that our algorithm simultaneously maximizes the "between-cluster" Jensen-Shannon divergence. Thus the different word clusters produced by our algorithm are "maximally" far apart. We now give formal statements of our results with proofs.
Lemma 2. Given probability distributions $p_1, \ldots, p_n$, the distribution that is closest (on average) in KL-divergence is the mean probability distribution $m$, i.e., given any probability distribution $q$,
$$\sum_i \pi_i\, KL(p_i, q) \;\geq\; \sum_i \pi_i\, KL(p_i, m), \qquad (14)$$
where $\pi_i \geq 0$, $\sum_i \pi_i = 1$ and $m = \sum_i \pi_i p_i$.
Proof. Use the definition of KL-divergence to expand the left-hand side (LHS) of (14) to get
$$\sum_i \pi_i \sum_x p_i(x) \left( \log p_i(x) - \log q(x) \right).$$
Similarly the RHS of (14) equals
$$\sum_i \pi_i \sum_x p_i(x) \left( \log p_i(x) - \log m(x) \right).$$
Subtracting the RHS from the LHS leads to
$$\sum_i \pi_i \sum_x p_i(x) \left( \log m(x) - \log q(x) \right) = \sum_x m(x) \log \frac{m(x)}{q(x)} = KL(m, q).$$
The result follows since the KL-divergence is always non-negative [6, Theorem 2.6.3]. $\square$
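Lemma 2 is also easy to verify numerically: for random distributions and weights, the weighted average KL-divergence to the weighted mean never exceeds the weighted average KL-divergence to any other candidate distribution $q$. A small self-contained check of ours, with arbitrary random data:

```python
import numpy as np

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

rng = np.random.default_rng(1)
ps = rng.dirichlet(np.ones(5), size=10)   # ten distributions over 5 symbols
pis = rng.dirichlet(np.ones(10))          # weights pi_i >= 0 summing to 1
m = pis @ ps                              # the weighted mean distribution

for _ in range(1000):
    q = rng.dirichlet(np.ones(5))         # an arbitrary competitor distribution
    lhs = np.sum(pis * np.array([kl(p, q) for p in ps]))
    rhs = np.sum(pis * np.array([kl(p, m) for p in ps]))
    assert lhs >= rhs - 1e-12             # inequality (14)
print("Lemma 2 held for all 1000 sampled q")
```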
Theorem 2. The algorithm in Figure 1 monotonically decreases the value of the objective function given in (13).
Proof. Let $\mathcal{W}_1^{(i)}, \ldots, \mathcal{W}_k^{(i)}$ be the word clusters at iteration $i$, and let $p(C|\mathcal{W}_1^{(i)}), \ldots, p(C|\mathcal{W}_k^{(i)})$ be the corresponding cluster distributions. Then
$$Q(\{\mathcal{W}_j^{(i)}\}_{j=1}^k) = \sum_{j=1}^k \sum_{w_t \in \mathcal{W}_j^{(i)}} \pi_t\, KL(p(C|w_t), p(C|\mathcal{W}_j^{(i)})) \;\geq\; \sum_{j=1}^k \sum_{w_t \in \mathcal{W}_j^{(i)}} \pi_t\, KL(p(C|w_t), p(C|\mathcal{W}_{j^*(w_t)}^{(i)})) \;\geq\; \sum_{j=1}^k \sum_{w_t \in \mathcal{W}_j^{(i+1)}} \pi_t\, KL(p(C|w_t), p(C|\mathcal{W}_j^{(i+1)})) = Q(\{\mathcal{W}_j^{(i+1)}\}_{j=1}^k)$$
where the first inequality is due to step 3 of the algorithm, and the second inequality follows from step 2 and Lemma 2. Note that if equality holds, i.e., if the objective function value is equal at consecutive iterations, then step 4 terminates the algorithm. $\square$
Theorem 3. The algorithm in Figure 1 always converges to a local minimum in a finite number of iterations.
Proof. The result follows since the algorithm monotonically decreases the objective function value, which is bounded from below (by zero). $\square$
We now show that the total Jensen-Shannon divergence (which is constant for a given set of probability distributions) can be written as the sum of two terms, one of which is the objective function (13) that our algorithm minimizes.
Theorem 4. Let $p_1, \ldots, p_n$ be a set of probability distributions and let $\pi_1, \ldots, \pi_n$ be corresponding scalars such that $\pi_i \geq 0$, $\sum_i \pi_i = 1$. Suppose $p_1, \ldots, p_n$ are clustered into $k$ clusters $\mathcal{P}_1, \ldots, \mathcal{P}_k$, and let $m_j$ be the (weighted) mean distribution of $\mathcal{P}_j$, i.e.,
$$m_j = \sum_{p_t \in \mathcal{P}_j} \frac{\pi_t}{\pi(\mathcal{P}_j)}\, p_t, \quad \text{where} \quad \pi(\mathcal{P}_j) = \sum_{p_t \in \mathcal{P}_j} \pi_t. \qquad (15)$$
Then the total JS-divergence between $p_1, \ldots, p_n$ can be expressed as the sum of "within-cluster JS-divergence" and "between-cluster JS-divergence", i.e.,
$$JS_\pi(\{p_i : 1 \le i \le n\}) = \sum_{j=1}^k \pi(\mathcal{P}_j)\, JS_{\pi'}(\{p_t : p_t \in \mathcal{P}_j\}) + JS_{\pi''}(\{m_j : 1 \le j \le k\}),$$
where $\pi'_t = \pi_t / \pi(\mathcal{P}_j)$ and we use $\pi''$ as the subscript in the last term to denote $\pi''_j = \pi(\mathcal{P}_j)$.
Proof. By Lemma 1, the total JS-divergence may be written as
$$JS_\pi(\{p_i : 1 \le i \le n\}) = \sum_{i=1}^n \pi_i\, KL(p_i, m) = \sum_{i=1}^n \sum_x \pi_i\, p_i(x) \log \frac{p_i(x)}{m(x)} \qquad (16)$$
where $m = \sum_i \pi_i p_i$. With $m_j$ as in (15), and rewriting (16) in order of the clusters $\mathcal{P}_j$, we get
$$\sum_{j=1}^k \sum_{p_t \in \mathcal{P}_j} \sum_x \pi_t\, p_t(x) \left( \log \frac{p_t(x)}{m_j(x)} + \log \frac{m_j(x)}{m(x)} \right) = \sum_{j=1}^k \pi(\mathcal{P}_j) \sum_{p_t \in \mathcal{P}_j} \frac{\pi_t}{\pi(\mathcal{P}_j)}\, KL(p_t, m_j) + \sum_{j=1}^k \pi(\mathcal{P}_j)\, KL(m_j, m) = \sum_{j=1}^k \pi(\mathcal{P}_j)\, JS_{\pi'}(\{p_t : p_t \in \mathcal{P}_j\}) + JS_{\pi''}(\{m_j : 1 \le j \le k\}),$$
where $\pi''_j = \pi(\mathcal{P}_j)$, which proves the result. $\square$
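As with Lemma 2, the within/between decomposition of Theorem 4 can be checked directly on random data; the sketch below (ours, with an arbitrary partition of random distributions) compares the total JS-divergence against the sum of the two terms.

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def js(ps, weights):
    """Generalized JS-divergence of equation (1)."""
    return H(weights @ ps) - np.sum([w * H(p) for w, p in zip(weights, ps)])

rng = np.random.default_rng(2)
ps = rng.dirichlet(np.ones(4), size=9)          # nine distributions over 4 symbols
pis = rng.dirichlet(np.ones(9))                 # their weights
clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]    # an arbitrary hard partition

total = js(ps, pis)
within = sum(pis[c].sum() * js(ps[c], pis[c] / pis[c].sum()) for c in clusters)
means = np.array([(pis[c] / pis[c].sum()) @ ps[c] for c in clusters])
between = js(means, np.array([pis[c].sum() for c in clusters]))
print(total, within + between)                  # the two values agree
```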
This concludes our formal treatment. We now see how to use word clusters in our text classifiers.
5.2 ClassicationusingWordClusters
TheNaiveBayesmethodcanbesimplytranslatedintous-
ingwordclustersinsteadofwords. Thisisdonebyestimat-
ingthe new parametersp(W
s
jc
i
) for wordclusters similar
tothewordparametersp(w
t
jc
i
)in(4)as
p(W
s
jc
i
)=
P
d
j
2c
i
n(W
s
;d
j
)
P
k
s=1
P
d
j
2c
i
n(W
s
;d
j
)
wheren(W
s
;d
j
)=
P
w
t
2
W
s
n(w
t
;d
j
).
Note that when estimates of $p(w_t | c_i)$ for individual words are relatively poor, the corresponding word cluster parameters $p(\mathcal{W}_s | c_i)$ provide more robust estimates, resulting in higher classification scores.
The Naive Bayes rule (5) for classifying a test document $d$ can be rewritten as
$$c^*(d) = \arg\max_{c_i} \left( \log p(c_i) + \sum_{s=1}^k p(\mathcal{W}_s | d) \log p(\mathcal{W}_s | c_i) \right),$$
where $p(\mathcal{W}_s | d) = n(\mathcal{W}_s, d)/|d|$. SVMs can be similarly used with word clusters as features.
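A sketch (ours) of the bookkeeping this entails: word counts are summed within each cluster to obtain $n(\mathcal{W}_s, d)$, the cluster parameters $p(\mathcal{W}_s | c_i)$ are estimated from those sums as above, and classification proceeds exactly as before but over the $k$ cluster features. The array layout and function names are illustrative assumptions; the estimate is left unsmoothed to mirror the formula above, so empty clusters would need smoothing in practice.

```python
import numpy as np

def cluster_counts(word_counts, assign, k):
    """n(W_s, d) = sum of n(w_t, d) over the words w_t assigned to cluster W_s.
    word_counts: (n_docs, m) count matrix; assign: (m,) cluster index per word."""
    out = np.zeros((word_counts.shape[0], k))
    for s in range(k):
        out[:, s] = word_counts[:, assign == s].sum(axis=1)
    return out

def train_cluster_nb(word_counts, labels, assign, k, n_classes):
    """Estimate log p(c_i) and log p(W_s | c_i) from cluster counts (unsmoothed, as above)."""
    cc = cluster_counts(word_counts, assign, k)
    log_prior = np.zeros(n_classes)
    log_pWc = np.zeros((n_classes, k))
    with np.errstate(divide="ignore"):            # a never-seen cluster gives log 0 = -inf
        for c in range(n_classes):
            in_c = cc[labels == c]
            log_prior[c] = np.log(len(in_c) / len(cc))
            totals = in_c.sum(axis=0)
            log_pWc[c] = np.log(totals / totals.sum())
    return log_prior, log_pWc

def classify_cluster_nb(doc_word_counts, assign, k, log_prior, log_pWc):
    """argmax_i  log p(c_i) + sum_s n(W_s, d) log p(W_s | c_i)."""
    doc_cc = cluster_counts(doc_word_counts[None, :], assign, k)[0]
    contrib = np.where(doc_cc > 0, log_pWc * doc_cc, 0.0)   # clusters absent from d add 0
    return int(np.argmax(log_prior + contrib.sum(axis=1)))
```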
6. EXPERIMENTAL RESULTS
This section provides empirical evidence that our divisive clustering algorithm of Figure 1 outperforms agglomerative clustering and various feature selection methods. We compare our results with feature selection by Information Gain and Mutual Information [30], and feature clustering using the agglomerative algorithm in [2]. We call the latter Agglomerative Clustering in this section for the purpose of comparison with our algorithm, which we call Divisive Clustering. We show that Divisive Clustering achieves higher classification accuracy than the best performing feature selection method when the training data is sparse, and show improvements over similar results reported in [28].
Figure 2: Fraction of Mutual Information lost while clustering words with Divisive Clustering is significantly lower compared to Agglomerative Clustering at all numbers of features (on 20 Newsgroups data). [Plot: fraction of mutual information lost vs. number of word clusters, for Agglomerative and Divisive Clustering.]
6.1 Data Sets and Implementation Details
The 20 Newsgroups (20Ng) data set, collected by Ken Lang, contains about 20,000 articles evenly divided among 20 UseNet discussion groups. Each newsgroup represents one class in the classification task. This data set has been used for testing several text classification methods [2, 28, 21]. During indexing we skipped headers, pruned words occurring in fewer than 3 documents and used a stoplist, but did not use stemming. The resulting vocabulary had 35,077 words.
We collected the Dmoz data from the Open Directory Project (www.dmoz.org). The Dmoz hierarchy contains about 3 million documents and 300,000 classes. We chose the top Science category and crawled some of the heavily populated internal nodes beneath it, resulting in a 3-deep hierarchy with 49 leaf-level nodes and about 5,000 total documents. For our experimental results we ignored documents at internal nodes. The list of categories and URLs we used is available at www.cs.utexas.edu/users/manyam/dmoz.txt. While indexing we skipped text between HTML tags, pruned words occurring in fewer than five documents, and used a stoplist but did not use stemming. The resulting vocabulary had 14,538 words.
Bow [22] is a library of C code useful for writing text analysis, language modeling and information retrieval programs. We extended Bow to index BdB (www.sleepycat.com) flat-file databases where we stored the text documents for efficient retrieval and storage. We implemented Agglomerative and Divisive Clustering within Bow, and used Bow's SVM implementation in our experiments.
6.2 Results
We first give evidence of the improved quality of word clusters obtained by our algorithm as compared to the agglomerative approach. We define the fraction of mutual information lost due to clustering words as:
$$\frac{I(C;W) - I(C;W_C)}{I(C;W)}.$$
Intuitively, the lower the loss in mutual information, the better is the clustering.
Figure 3: Fraction of Mutual Information lost while clustering words with Divisive Clustering is significantly lower compared to Agglomerative Clustering at all numbers of features (on Dmoz data). [Plot: fraction of mutual information lost vs. number of word clusters, for Agglomerative and Divisive Clustering.]
The term $I(C;W) - I(C;W_C)$ in the numerator of the above equation is precisely the global objective function that Divisive Clustering attempts to minimize (see Theorem 1). Figures 2 and 3 plot the fraction of mutual information lost against the number of clusters for both the divisive and agglomerative algorithms on the 20Ng and Dmoz data sets. Notice that less mutual information is lost with Divisive Clustering compared to Agglomerative Clustering at all numbers of clusters, though the difference is more pronounced at lower numbers of clusters.
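For reference, the quantity plotted in Figures 2 and 3 can be computed directly from the empirical joint distribution $p(C, W)$ and a hard cluster assignment; a small sketch of ours (the function and argument names are illustrative):

```python
import numpy as np

def mutual_info(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz]))

def fraction_mi_lost(pCW, assign, k):
    """(I(C;W) - I(C;W_C)) / I(C;W) for the hard word clustering `assign`."""
    pCWc = np.stack([pCW[:, assign == j].sum(axis=1) for j in range(k)], axis=1)
    full = mutual_info(pCW)
    return (full - mutual_info(pCWc)) / full
```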
Next we provide anecdotal evidence that our word clusters are better at preserving class information as compared to the agglomerative approach. Figure 4 shows three word clusters, Cluster 9 and Cluster 10 from Divisive Clustering and Cluster 12 from Agglomerative Clustering. These clusters were obtained while forming 20 word clusters with a 1/3-2/3 test-train split. While the clusters obtained by our algorithm could successfully distinguish between rec.sport.hockey and rec.sport.baseball, Agglomerative Clustering combined words from both classes in a single cluster. This resulted in lower classification accuracy for both classes with Agglomerative Clustering compared to Divisive Clustering. While Divisive Clustering achieved 93.33% and 94.07% on rec.sport.hockey and rec.sport.baseball respectively, Agglomerative Clustering could only achieve 76.97% and 52.42%.
6.2.1 ClassicationResultson20Newsgroupsdata
Figure5shows classiflcation accuracies on the20News-
groups data set for the algorithms considered. The hori-
zontalaxisindicatesthenumberoffeatures/clustersusedin
theclassiflcationmodelwhiletheverticalaxisindicatesthe
percentageoftestdocuments thatwereclassifled correctly.
Theresultsareaveragesof5-10trialsofrandomized1=3-2=3
test-trainsplitsofthetotaldata. Notethatweclusteronly
thewordsbelongingtothedocumentsinthetrainingset. We
usedtwoclassiflcation techniques, SVMs and NaiveBayes
(NB)forthepurposeofcomparison. ObservethatDivisive
Clustering(SVMas wellasNB) achieves signiflcantlybet-
terresultsatlowernumberoffeaturesthanfeatureselection
usingInformationGainandMutualInformation.Withonly
50clusters,DivisiveClustering(NB)achieves78.05%accu-
Figure 4 (table): Top few words, sorted by Mutual Information, in clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data. Columns: Cluster 10, Divisive Clustering (Hockey); Cluster 9, Divisive Clustering (Baseball); Cluster 12, Agglomerative Clustering (Hockey and Baseball). Words shown include team, game, games, play, players, player, season, hockey, detroit, rangers, nyi, boston, chicago, blues, pit, nhl, shots, van, vancouver, buffalo, ens, hit, runs, pitching, baseball, hitter, base, ball, greg, league, morris, ted, pitcher, hitting.
Figure 5: Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split. [Plot: % accuracy vs. number of features, for Divisive Clustering (Naive Bayes), Divisive Clustering (SVM), Information Gain (Naive Bayes), Information Gain (SVM), and Mutual Information (Naive Bayes).]
With only 50 clusters, Divisive Clustering (NB) achieves 78.05% accuracy, just 4.1% short of the accuracy achieved by a full-feature NB classifier. We also observed that the largest gain occurs when the number of clusters equals the number of classes (for 20Ng data this occurs at 20 clusters). When we closely observed these word clusters we found that many of them contained words representing a single class in the data set; for example, see Figure 4. We attribute this observation to our effective initialization strategy.
In Figure 6, we plot the classification accuracy on 20Ng data using Naive Bayes when the training data is sparse. We took 2% of the available data, that is 20 documents per class, for training and tested on the remaining 98% of the documents. The results are averages of 5-10 trials. We again observe that Divisive Clustering obtains better results than Information Gain at all numbers of features. It also achieves a significant 12% increase over the maximum possible accuracy achieved by Information Gain. This is in contrast to Figure 5, where Information Gain eventually catches up as we increase the number of features. When the training data is small the word-by-class frequency matrix contains many zero entries. By clustering words we obtain more robust estimates of word class probabilities which lead to higher classification accuracies.
Figure 6: Classification accuracy on 20 Newsgroups data with 2% training data (using Naive Bayes). [Plot: % accuracy vs. number of features, for Divisive Clustering and Information Gain.]
Figure 7: Divisive Clustering leads to higher accuracy than Agglomerative Clustering on 20Ng data (1/3-2/3 test-train split with Naive Bayes). [Plot: % accuracy vs. number of word clusters.]
Figure 8: Classification accuracy on Dmoz data with a 1/3-2/3 test-train split. [Plot: % accuracy vs. number of features, for Divisive Clustering (Naive Bayes), Divisive Clustering (SVM), Information Gain (Naive Bayes), Information Gain (SVM), and Mutual Information (Naive Bayes).]
Figure 9: Classification accuracy on Dmoz data with 2% training data (using Naive Bayes). [Plot: % accuracy vs. number of features, for Divisive Clustering and Information Gain.]
Figure 10: Divisive Clustering achieves higher accuracy than Agglomerative Clustering on Dmoz data (1/3-2/3 test-train split with Naive Bayes). [Plot: % accuracy vs. number of word clusters.]
Figure 7 compares the classification accuracies of Divisive Clustering and Agglomerative Clustering on the 20 Newsgroups data using Naive Bayes. Note that Divisive Clustering achieves better classification results than Agglomerative Clustering at all numbers of features, though again the improvements are more significant at lower numbers of features.
6.2.2 ClassicationResultsonDmozdataset
Figure8showsclassiflcationresultsfortheDmozdataset
when webuild a °at classifler over the leaf set of classes.
Unlikethe previous plots, feature selection sometimes im-
provesclassiflcationaccuracysinceHTMLdataappearsto
beinherentlynoisy. Weobserveresultssimilartothoseob-
tained on 20 Newsgroups data,but notethat Information
Gain(NB) hereachievesaslightly higher maximum,about
1.5%higherthanthemaximumaccuracyobservedwithDi-
visiveClustering(NB).Toovercomethis,BakerandMcCal-
lum[2]triedacombinationoffeature-clusteringandfeature-
selectionmethods. Morerigorous approaches tothis prob-
lemareatopicoffuturework. FurthernotethatSVMsfare
Figure 11: Classification results on the Dmoz hierarchy using Naive Bayes. Observe that the Hierarchical Classifier achieves significant improvements over the flat classifiers with a very small number of features. [Plot: % accuracy vs. number of features, for Information Gain (Flat), Divisive (Flat), and Divisive (Hierarchical).]
Further note that SVMs fare worse than NB at low dimensionality but better at higher dimensionality, which is consistent with known SVM behavior [29]. In future work we will use non-linear SVMs at lower dimensions to alleviate this problem.
Figure 9 plots the classification accuracy on Dmoz data using Naive Bayes when the training set is just 2%. Note again that we achieve a 13% increase in classification accuracy with Divisive Clustering over the maximum possible with Information Gain. This reiterates the observation that feature clustering is an attractive option when training data is limited. Figure 10 compares Divisive Clustering with Agglomerative Clustering on Dmoz data, where we observe similar improvements as with 20 Newsgroups data.
6.2.3 Hierarchical Classification on Dmoz Hierarchy
Figure 11 shows classification accuracies obtained by 3 different classifiers on Dmoz data (Naive Bayes was the underlying classifier). By Flat, we mean a classifier built over the leaf set of classes in the tree. In contrast, Hierarchical denotes a hierarchical scheme that builds a classifier at each internal node of the topic hierarchy (see Section 4.3). Further, we apply Divisive Clustering at each internal node to reduce the number of features in the classification model at that node. The number of word clusters is the same at each internal node.
Figure 11 compares the Hierarchical Classifier with two flat classifiers, one that employs Information Gain for feature selection while the other uses Divisive Clustering. Note that Divisive Clustering performs remarkably well for hierarchical classification even at a very low number of features. With just 10 features, the Hierarchical Classifier achieves 64.54% accuracy, which is slightly better than the maximum obtained by the two flat classifiers at any number of features. At 50 features, the Hierarchical Classifier achieves 68.42%, a significant 6% higher than the maximum obtained by the flat classifiers. Thus Divisive Clustering appears to be a natural choice for feature reduction in the case of hierarchical classification as it allows us to maintain high classification accuracies using a very small number of features.
7. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented an information-theoretic approach to "hard" word clustering for text classification. First, we derived a global objective function that captures the decrease in mutual information due to clustering. Then we presented a divisive algorithm that directly minimizes this objective function, converging to a local minimum. Our algorithm minimizes the within-cluster Jensen-Shannon divergence, and simultaneously maximizes the between-cluster Jensen-Shannon divergence.
Finally, we provided an empirical validation of the effectiveness of our word clustering. We have shown that our divisive clustering algorithm obtains word clusters superior to those of the agglomerative strategies proposed previously [2, 28]. We have presented detailed experiments using the Naive Bayes and SVM classifiers on the 20 Newsgroups and Dmoz data sets. Our enhanced word clustering results in significant improvements in classification accuracies, especially at lower numbers of features. When the training data is sparse, feature clustering achieves higher classification accuracy than the maximum accuracy achieved by feature selection methods such as information gain and mutual information. Our divisive clustering method is an effective technique for reducing the model complexity of a hierarchical classifier.
In future work we intend to conduct experiments at a larger scale on hierarchical web data to evaluate the effectiveness of the resulting hierarchical classifier. Reducing the number of features makes it feasible to run computationally expensive classifiers such as SVMs on large collections. While soft clustering increases the model size, it is not clear how it affects classification accuracy. In future work, we would like to experimentally evaluate the tradeoff between soft and hard clustering.
Acknowledgements. We are grateful to Andrew McCallum and Byron Dom for helpful discussions. For this research, ISD was supported by an NSF CAREER Grant (No. ACI-0093404) while Mallela was supported by a UT Austin MCD Fellowship.
8. REFERENCES
[1] ANSI/IEEE, New York. IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985 edition, 1985.
[2] L. D. Baker and A. McCallum. Distributional clustering of words for text classification. In ACM SIGIR, pages 96-103, 1998.
[3] R. Bekkerman, R. El-Yaniv, Y. Winter, and N. Tishby. On feature distributional clustering for text categorization. In ACM SIGIR, pages 146-153, 2001.
[4] P. Berkhin and J. D. Becher. Learning simple relations: Theory and applications. In Second SIAM Data Mining Conference, pages 420-436, 2002.
[5] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, 1997.
[6] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[8] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, 2001.
[9] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[10] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2nd edition, 2000.
[11] E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768, 1965.
[12] J. H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55-77, 1997.
[13] M. R. Garey, D. S. Johnson, and H. S. Witsenhausen. The complexity of the generalized Lloyd-Max problem. IEEE Trans. Inform. Theory, 28(2):255-256, 1982.
[14] D. Goldberg. What every computer scientist should know about floating point arithmetic. ACM Computing Surveys, 23(1), 1991.
[15] R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):1-63, 1998.
[16] T. Hofmann. Probabilistic latent semantic indexing. In Proc. ACM SIGIR. ACM Press, August 1999.
[17] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, pages 137-142, 1998.
[18] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In ICML, 1997.
[19] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat., 22:79-86, 1951.
[20] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inform. Theory, 37(1), 1991.
[21] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[22] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/mccallum/bow, 1996.
[23] T. Mitchell. Conditions for the equivalence of hierarchical and non-hierarchical Bayesian classifiers. Technical report, CALD, CMU, 1998.
[24] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[25] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 31st Annual Meeting of the ACL, pages 183-190, 1993.
[26] G. Salton and M. J. McGill. Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983.
[27] C. E. Shannon. A mathematical theory of communication. Bell System Technical J., 27, 1948.
[28] N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
[29] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[30] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In ICML, 1997.