
Connections: Using Context to Enhance File Search
Craig A. N. Soules, Gregory R. Ganger
Carnegie Mellon University
ABSTRACT
Connections is a file system search tool that combines traditional content-based search with context information gathered from user activity. By tracing file system calls, Connections can identify temporal relationships between files and use them to expand and reorder traditional content search results. Doing so improves both recall (reducing false negatives) and precision (reducing false positives). For example, Connections improves the average recall (from 13% to 22%) and precision (from 23% to 29%) on the first ten results. When averaged across all recall levels, Connections improves precision from 17% to 28%. Connections provides these benefits with only modest increases in average query time (2 seconds), indexing time (23 seconds daily), and index size (under 1% of the user's data set).
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; D.4.3 [Operating Systems]: File Systems Management - File organization
General Terms
Algorithms, Design, Human Factors, Management
Keywords
context, file system search, successor models
1. INTRODUCTION
Users need more effective ways of organizing and searching their data. Over the last ten years, the amount of data storage available to individual users has increased by nearly two orders of magnitude [16], allowing today's users to store practically unbounded amounts of data. This shifts the challenge for individual users from deciding what to keep to finding particular files when needed.
Most personal computer systems today provide hierarchical, directory-based naming that allows users to place each file along a single, unique path. Although useful on a small scale, having only one classification for each file is unwieldy for large data sets. When traversing large hierarchies, users may not remember the exact location of each file. Or, users may think of a file in a different manner than when they filed it, sending them down an incorrect path in the hierarchy.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'05, October 23-26, 2005, Brighton, United Kingdom.
Copyright 2005 ACM 1-59593-079-5/05/0010 ... $5.00.
Attribute-based naming allows users to classify each file with multiple attributes [9, 12, 37]. Once in place, these attributes provide additional paths to each file, helping users locate their files. However, it is unrealistic and inappropriate to require users to proactively provide accurate and useful classifications. To make these systems viable, they must automatically classify the user's files, and, in fact, this requirement has led most systems to employ search tools over hierarchical file systems rather than change their underlying methods of organization.
The most prevalent automated classification method today is content analysis: examining the contents and pathnames of files to determine attributes that describe them. Systems using attribute-based naming, such as the Semantic file system [9], use content analysis to automate attribute assignment. Search tools, such as Google Desktop [11], use content analysis to map user queries to a ranked list of files.
Although clearly useful, there are two limitations to content analysis. First, only files with understandable contents can be analyzed (e.g., it is difficult to identify attributes for movie clips or music files). Second, examining only a file's contents overlooks a key way that users think about and organize their data: context.
Context is "the interrelated conditions in which something exists or occurs" [43]. Examples of a file's context include other concurrently accessed files, the user's current task, even the user's physical location - any actions or data that the user associates with the file's use. A recent study [38] showed that most users organize and search their data using context. For example, a user may group files related to a particular task into a single directory, or search for a file by remembering what other files they were accessing at the time. These contextual relationships may be impossible to gather from a file's contents.
The focus of our work is to increase the utility of file system search using context. In this paper, we specifically examine temporal locality, one of the clearest forms of context and one that has been successfully exploited in other areas of file systems. Temporal locality captures a file's setting, connecting files through the actions that make up user tasks. Connections is a new search tool that identifies temporal contextual relationships between files at the time they are being accessed using traces of file system activity. When a user performs a search, Connections first locates files using traditional content-based search and then extends these results with contextually related files.
User studies with Connections show that combining content analysis with context analysis improves both recall (increasing the number of relevant hits) and precision (returning fewer false positives) over content analysis alone. When compared to Indri, a state-of-the-art content-only search tool [25], Connections increases average precision at each recall level, increasing the overall average from 17% to 28%. When considering just the top 30 results, Connections increases average recall from 18% to 34%, and average precision from 17% to 23%. With no cutoff, Connections increases average recall from 34% to 74% and average precision from 15% to 16%.
The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 outlines the design and implementation of Connections. Section 4 analyzes the utility of Connections's context-based search. Section 5 discusses interesting considerations for building context-based search systems like Connections.
2. BACKGROUND AND RELATED WORK
Context is one of the key ways that users remember and locate data [38]. For example, a user may not remember where they stored a downloaded email attachment, but may remember several contextually related items: the sender of the email, approximately when the email arrived, where they were when they read the email, or what they were doing at the time they downloaded the attachment. Automatically identifying the file's context can assist the user in locating it later using any of this related contextual information, rather than only the file name or contents of the attachment.
This section begins by describing background work from file systems and web search that motivated our exploration of context-based search. It then describes work that successfully utilizes Connections's specific form of context, temporal locality, in other domains. The section ends with a description of search and organizational systems that utilize other forms of context, and how our work on identifying contextual relationships with temporal locality could enhance them.
2.1 Semantic file systems
Many researchers have identified organizational problems with strict hierarchical naming. Several proposed attribute-based naming as a solution: attaching multiple attributes (or keywords) to a file to improve organizational structure and search capabilities. The Semantic file system [9] was one of the first to explore attribute-based naming, providing a searchable mapping of ⟨category, value⟩ pairings to files. Other research [12, 37] and commercial [8] systems merge hierarchical namespaces with flat attribute-based namespaces in various ways, providing new organizational structures.
Although attribute-based naming has the potential to improve file system search, these systems focused on the mechanism to store additional attributes, not the sources of attributes. Traditionally, attributes come from two sources: users and content analysis. Understandably, most users are unwilling to perform the difficult and time-consuming task of manually assigning attributes to their files. As a result, rather than changing the underlying file system structure, recent focus has been on using content analysis to provide file system search on existing hierarchical systems.
2.2 Content-based search
The simplest content-based search tools (e.g., UNIX tools find and grep) scan the contents of a set of files for a given term (or terms), returning all hits. To speed this process, tools such as locate and Glimpse [29] use indices to reduce the amount of data accessed during a search. These tools work on the premise that the provided keywords can narrow the resulting list of files to a more human-searchable size.
Commercial systems such as Google Desktop [11] and X1 desktop search [44] leverage the work done in text-based information retrieval to provide more accurate, ranked search results. Although the exact methods of commercial systems are unpublished, it is likely they use techniques similar to cutting-edge information retrieval systems such as Terrier [40] and Indri [25]. These modern tools use probabilistic models to map documents to terms [41], providing ranked results that include both full and partial matches. Probabilities are generated using methods such as term frequency within a document, inverse term frequency across a collection, and other more complex language models [31].
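As a rough illustration of the term-frequency and inverse-frequency weighting mentioned above, the following Python sketch ranks a few in-memory documents. It is not how Indri or any commercial tool scores documents; the function name `tf_idf_rank`, the `+ 1.0` smoothing term, and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_rank(query_terms, documents):
    """Rank documents (name -> text) by a simple TF-IDF score for the query."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query_terms:
            term = term.lower()
            if counts[term] == 0:
                continue  # partial matches simply skip the missing term
            tf = counts[term] / len(words)
            # Assumed smoothing: +1 keeps a term found in every document
            # from contributing nothing.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            score += tf * idf
        if score > 0:
            scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "notes.txt": "context search uses file context",
    "paper.txt": "content search and content analysis",
    "todo.txt": "buy milk",
}
ranking = tf_idf_rank(["context", "search"], docs)
```

Here `notes.txt` matches both query terms and `paper.txt` only one, so both appear in the ranking (a full match ahead of a partial match), while `todo.txt` is omitted.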
2.3 Context in web search
Despite these tools, one could argue that it is easier to find things on the web than in one's own file space, even with the order of magnitude larger search space and the non-personalized organization. This seems strange, but it is largely enabled by the availability of user-provided context attributes in the form of hyperlinks.
Just as in current file system search, early web search engines [26, 30, 45] relied on user input (user-submitted web page classifications) and content analysis (word counts, word proximity, etc.). More recently, two techniques have emerged that use the inherent link structure of the web to identify contextual links. The HITS algorithm [20] defines a sub-graph of the web using content search results, and then uses the link structure of the graph to identify the authority and hub nodes. The popular Google search engine [10] uses the link structure in two primary ways. First, it uses the text associated with a hyperlink to guide content classifications for the linked site. Second, it uses PageRank [5], an algorithm that uses the link structure of the web to calculate the "importance" of individual sites.
Although successful for the web, these techniques face challenges in personal file systems because tagged, contextual links do not inherently exist in the file system. Our work aims to add untagged contextual links between files using temporal locality. An evaluation of combining our approach with both HITS and PageRank is discussed in Section 4.3.3.
Another context-based approach seen in web search is personalization, using the user's current context to target search results. WebGlimpse [28] took a first step toward personalized search using the concept of a "neighborhood," the set of web pages within a certain hyperlink distance of a given page. Users could choose to search only within the current page's neighborhood, creating a directed search of potentially related pages based on the user's current context. More recent work has focused on targeting results to particular topics or interests, sometimes gathered from a user's recent activity [15, 39]. These systems remove results that do not relate to the user's current context, improving precision. Conversely, Connections extends results using context information gathered from previous activity. We believe these techniques complement each other well: removing unrelated content-based results prevents Connections from gathering files from an unrelated context.
2.4 Identifying context: temporal locality
In file systems, temporal locality can provide some of the contextual clues that are so readily available on the web. By observing how users access their files, the system can determine contextual relationships between files. Our work uses these contextual relationships to enhance existing content-based search tools.
To identify temporal relationships, we borrow from other work that uses temporal locality to model how users access their data. One example of this is using temporal locality to predict access patterns for file prefetching.
Almost all file systems use prefetching to hide storage latencies by predicting what the user will access next and reading it into the cache before they request it. Some prefetching schemes [13, 21] use temporal locality to correlate common user access patterns with individual user contexts. These systems keep a history of file access patterns using successor models: directed graphs that predict the next access based on the most recent accesses. If a sequence of accesses matches one of the stored successor models, the system assumes that the user's context matches this model, and prefetches the specified data.
The success of these schemes has led to a variety of algorithms for building successor models [2, 24, 27]. Similarly, many cache hoarding schemes use successor models to predict which files a user is likely to need if they become disconnected from the network [19, 22]. Connections uses successor models to identify relationships between files, as they have successfully identified related files in other domains.
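To make the idea of a successor model concrete, here is a minimal Python sketch of a first-order model that counts which file tends to follow which. It is a simplified illustration, not any of the cited prefetching algorithms; the class name `SuccessorModel` and the sample access sequence are assumptions.

```python
from collections import Counter, defaultdict


class SuccessorModel:
    """First-order successor model: a directed graph of 'file A was followed by file B'."""

    def __init__(self):
        self.successors = defaultdict(Counter)  # file -> Counter of next files
        self.last = None

    def observe(self, path):
        """Record one access, adding an edge from the previous access."""
        if self.last is not None and self.last != path:
            self.successors[self.last][path] += 1
        self.last = path

    def predict(self, path):
        """Return the most frequently observed successor, or None if unseen."""
        if path not in self.successors:
            return None
        return self.successors[path].most_common(1)[0][0]


model = SuccessorModel()
for access in ["Makefile", "main.c", "util.h", "Makefile", "main.c", "notes.txt"]:
    model.observe(access)

prediction = model.predict("Makefile")  # "main.c" followed "Makefile" twice
```

A prefetcher would use `predict` to load the likely next file; Connections instead keeps the accumulated edge counts as evidence that two files are related.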
2.5 Existing uses of context for file search
Several systems leverage other forms of context as a guide for file organization and search. Gathering context from temporal locality, as Connections does, could enhance these systems by providing additional contextual clues for classification.
The Haystack [34] and MyLifeBits [7] projects use context-based data organization at the core of their interface design, allowing users to group and assign classifications to objects more quickly. If these interfaces motivate users to provide additional classifications, they will result in improved search facilities. Our work could assist users of such a system by adding automated groupings based on temporal locality.
A few systems attempt to determine the user's current context to predict and prefetch potentially desired data. The Lumiere project provides users with help data in the Microsoft Office application suite, attempting to predict user problems by predicting their current context from recent actions [17]. The Remembrance Agent [35, 36] continually provides a list of related files (based on content similarity) to the user while they are working. By feeding recently accessed file data into a content-based search system, it locates files with similar contents that may reflect the user's current context. Our work could enhance such systems by providing additional contextually related files: those based on temporal locality rather than content similarity.
Most content-based search tools organize their search results, allowing the user to hone in on the set of files that is most likely to contain what they are searching for. For example, the Lifestreams project [6] orders search results using the latest access time of the resulting files. The Grokker search tool [14] clusters search results, grouping together files with similar contents. Our work could be used to cluster results using contextual relationships rather than, or in addition to, content similarity or access time.
3. CONNECTIONS
Connections combines traditional content analysis with contextual relationships identified from temporal locality of file accesses. This section describes its architecture, relationship tracking and result ranking algorithms, and our prototype implementation.
[Figure 1 diagram. Labeled components: Applications, Tracer, File system, User, Keywords, Content-based Search, Relation Graph, Results, Context-enhanced Search.]
Figure 1: Architecture of Connections. Both applications and the file system remain unchanged, as the only information required by Connections can be gathered either by a transparent tracing module or directly from existing file system interfaces.
3.1 Architecture
In traditional content-only search systems, the user submits keywords to the search tool, which returns rank-ordered results directly to the user. Often, the search tool is separate from the file system, using a background process to read and index file data.
Figure 1 illustrates the architecture of Connections. From a user's perspective, Connections's context-enhanced search is identical to existing content-only search: a tool separate from the file system that takes in keywords and returns a ranked list of results. Internally, when Connections receives keywords from the user, it begins with a content search, retrieving the same results as a content-only search tool. It feeds these results into the relation-graph, which locates additional hits through contextual relationships. The combined results are then ranked and passed back to the user.
To identify and store these relationships, Connections adds two new components: the tracer and the relation-graph. The tracer sits between applications and the file system, monitoring all file system activity. Connections uses these traces to identify contextual relationships between files.
The relation-graph stores the contextual relationships between files. Each file in the system maps to a node in the graph. Edges between nodes represent contextual relationships between files, with the weight of the edge indicating the strength of the relationship. Because different users may have different contexts for a particular file, Connections maintains a separate relation-graph for each user based on their file accesses alone. This also separates user tasks from background system activity in single-user systems.
Three algorithms drive the context-based portions of Connections. The first algorithm takes the captured file traces and identifies the contextual relationships, creating the relation-graph. The second algorithm takes content-based search results and locates contextually related files within the relation-graph, creating a smaller result-graph. The third algorithm uses the result-graph to rank the combined set of content-based and context-related results.
System Call     Description
open(S)         Opens file S for reading or writing
read(S)         Reads data from file S
write(D)        Writes data to file D
mmap(S)         Maps file S into a memory region
stat(S)         Reads the inode of file S
dup(S, D)       Duplicates file handle S to D
link(S, D)      Adds directory entry D for file S
rename(S, D)    Changes the name of file S to D
Table 1: File system calls. This table lists the file system calls considered by Connections when identifying relationships from traces. Each parameter is identified as either a source (S) or destination (D).
Each of the algorithms in Connections is specifically designed for flexibility, and as such has several tunable parameters. This flexibility allows us to study a range of options for each algorithm. An evaluation of sensitivity within each algorithm is provided in Section 4.3.
3.2 Identifying relationships
Connections identifies temporal relationships by constructing a successor model from file traces. Files accessed within a given window of time are connected in the relation-graph. Over time, a user's access patterns form probabilistic mappings between files that are the basis of Connections's contextual relationships. The specific algorithm for generating the relation-graph is described by three parameters: relation window, edge style, and operation filter.
Relation window: The relation window maintains a list of input files accessed within the last N seconds (see footnote 1). Conceptually, this captures the period of time during which a user is focused on a particular task. Too short a window will miss key relationships, while too large a window will connect files from unrelated tasks.
When the window sees an output file, it creates an edge in the relation-graph with weight 1 from each of the input files to the output file. If such an edge already exists, its weight is incremented. To avoid creating heavy weightings during long sequences of output operations to a single file (e.g., large file writes), any two files are only connected once while the input file stays in the relation window.
Edge style: The edge style specifies whether edges of the relation-graph are directed or undirected. Direction indicates how the edges of the relation-graph may be followed during a search. Directed edges point from input files to output files, while undirected edges may be followed in either direction. Conceptually, undirected links reverse the causal nature of the relation-graph, allowing searches to locate the input of a given file.
Operation filter: Each entry in the trace corresponds to a file system operation. An operation filter specifies which system calls to consider from the trace and classifies the source and/or destination files accessed by each system call (shown in Table 1) as an input or an output.
In this paper, we consider three different operation filters: open, read/write, and all-ops. The open filter classifies the source file of an open call as both input and output. Conceptually, this captures strict temporal locality: files accessed nearby in time become related.
The read/write filter classifies the source file of a read call as input and the destination file of a write call as output. Conceptually, this captures causal data relationships: the data read from one file may affect the data later written to another file, relating the files.
The all-ops filter classifies the source file of a mmap, read, stat, dup, link, or rename as input. It classifies the destination file of a write, dup, link, or rename as output. This filter extends the causal relationships of the read/write filter, adding in other access-to-modification relationships.
Footnote 1: We also considered using a fixed number of files as the window, but in practice we found that the burstiness of file accesses made this approach perform poorly. Related files during bursts were not connected, and unrelated files were connected over long periods of idleness.
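The relation window and operation filters described in this section can be sketched in Python as follows. This is an illustrative reading of the text, not the Connections source: the trace tuple layout, the filter tables, the function name `build_relation_graph`, and the choice to add a call's input to the window before linking its outputs are all assumptions.

```python
from collections import defaultdict

# Each filter maps a system call to three flags:
# (source is input, source is output, destination is output).
OPEN_FILTER = {"open": (True, True, False)}
READ_WRITE_FILTER = {
    "read": (True, False, False),
    "write": (False, False, True),
}
ALL_OPS_FILTER = {
    "mmap": (True, False, False),
    "read": (True, False, False),
    "stat": (True, False, False),
    "dup": (True, False, True),
    "link": (True, False, True),
    "rename": (True, False, True),
    "write": (False, False, True),
}


def build_relation_graph(trace, window_seconds, op_filter):
    """Build graph[input][output] = weight from a time-ordered trace.

    Each trace entry is (timestamp, syscall, source, destination); a
    parameter that the call does not take is None.
    """
    graph = defaultdict(lambda: defaultdict(int))
    # Relation window: input file -> [last access time, outputs already linked].
    window = {}

    for timestamp, call, source, destination in trace:
        if call not in op_filter:
            continue
        source_in, source_out, dest_out = op_filter[call]

        # Drop inputs that were last accessed more than N seconds ago.
        expired = [name for name, (seen, _) in window.items()
                   if timestamp - seen > window_seconds]
        for name in expired:
            del window[name]

        # Assumption: inputs enter the window before outputs are linked.
        if source_in and source is not None:
            if source in window:
                window[source][0] = timestamp
            else:
                window[source] = [timestamp, set()]

        outputs = []
        if source_out and source is not None:
            outputs.append(source)
        if dest_out and destination is not None:
            outputs.append(destination)

        for output in outputs:
            for name, (_, linked) in window.items():
                # Connect each pair only once while the input stays in the window.
                if name != output and output not in linked:
                    graph[name][output] += 1
                    linked.add(output)
    return graph


trace = [
    (0, "read", "data.csv", None),
    (5, "write", None, "report.txt"),
    (6, "write", None, "report.txt"),   # repeated write adds no weight
    (10, "read", "notes.txt", None),
    (12, "write", None, "report.txt"),
    (100, "read", "data.csv", None),    # earlier window entries have expired
    (101, "write", None, "report.txt"),
]
graph = build_relation_graph(trace, window_seconds=30, op_filter=READ_WRITE_FILTER)
```

In this trace the edge from `data.csv` to `report.txt` ends with weight 2: the burst of writes counts once, and the pair is linked again only after `data.csv` has left and re-entered the window. The graph is stored directed; an undirected edge style could be modeled by also following edges in reverse at search time.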
3.3 Searching relationships
The context-based portion of a search in Connections starts with the results of a content-based search. For each file in these results, Connections performs a breadth-first graph traversal starting at the node for that file. Files touched during the traversal are added to the result-graph, a sub-graph with the results for a specific search.
In the process of building the relation-graph, incorrect edges can form. For example, when a user switches context between disparate tasks (e.g., from writing personal email to examining a spreadsheet), the edges formed during the transition could be misleading. Our algorithm attempts to reduce the number of such paths introduced to the result-graph using two tunable parameters: path length and weight cutoff.
Path length: Path length is the maximum number of steps taken from any starting node in the graph. As the system follows edges further and further from an initial file, the strength of the relationship grows weaker and weaker. By limiting the path length, one reduces the number of false positives created by a long chain of edges that leads to unrelated files.
Weight cutoff: The weight cutoff specifies that an edge's weight must make up at least a given minimum percentage of either the source's outgoing weight or the sink's incoming weight. In this manner, lightly weighted edges coming from or to files with few total accesses are still followed, but only the most heavily weighted edges are followed for frequently accessed files. This limits the effects of context switches, removing links between oft-accessed files that are rarely accessed together.
To see how the weight cutoff and the path length work together, consider the example relation-graph in Figure 2. Assume that D was the starting point for a search using a path length of 2 and a 30% weight cutoff. Connections starts by examining D and sees that the edge DB makes up only 20% of D's outgoing weight and 20% of B's incoming weight, thus it is not followed. Following edge DE, Connections repeats the procedure. In this case, although both EB and EF make up less than 30% of E's outgoing weight, they make up more than 30% of the incoming weight of nodes B and F respectively, and both are followed. Thus the result-graph would contain only edges DE, EF, EG and EB. If the path length were increased to 3, the result-graph would also contain edges BA and BC. Similarly, if the cutoff were reduced to 15%, edge DB would be followed, and the result-graph would contain all of the presented edges.
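The traversal in this example can be reproduced with a short Python sketch. The edge weights below are one reading of Figure 2 inferred from the percentages quoted in the text; the weights on B→A and B→C (6 and 2) are an assumption, since the text does not say which is which. The function name `build_result_graph` is also ours.

```python
from collections import defaultdict, deque

# (source, sink) -> weight. D and E edges follow the percentages in the text;
# the B->A and B->C weights are assumed.
FIGURE_2 = {
    ("D", "B"): 2, ("D", "E"): 8,
    ("E", "B"): 8, ("E", "F"): 5, ("E", "G"): 100,
    ("B", "A"): 6, ("B", "C"): 2,
}


def build_result_graph(edges, start_nodes, path_length, weight_cutoff):
    """Breadth-first traversal limited by path length and weight cutoff."""
    out_total = defaultdict(float)
    in_total = defaultdict(float)
    neighbors = defaultdict(list)
    for (source, sink), weight in edges.items():
        out_total[source] += weight
        in_total[sink] += weight
        neighbors[source].append((sink, weight))

    result = set()
    depth = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if depth[node] >= path_length:
            continue  # already the maximum number of steps from a start node
        for sink, weight in neighbors[node]:
            share_of_source = weight / out_total[node]
            share_of_sink = weight / in_total[sink]
            # Follow the edge if it is significant for either endpoint.
            if share_of_source >= weight_cutoff or share_of_sink >= weight_cutoff:
                result.add((node, sink))
                if sink not in depth:
                    depth[sink] = depth[node] + 1
                    queue.append(sink)
    return result


short = build_result_graph(FIGURE_2, ["D"], path_length=2, weight_cutoff=0.30)
longer = build_result_graph(FIGURE_2, ["D"], path_length=3, weight_cutoff=0.30)
loose = build_result_graph(FIGURE_2, ["D"], path_length=2, weight_cutoff=0.15)
```

With these inputs, `short` holds edges DE, EF, EG, and EB; `longer` adds BA and BC; and `loose` contains every edge in the figure, matching the three cases described above.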
[Figure 2 diagram: a graph over nodes A, B, C, D, E, F, and G; the edge weights shown are 8, 2, 6, 5, 100, 8, and 2.]
Figure 2: Relation-graph example. Each of the nodes in the relation-graph maps to a file in a user's system. Edges indicate related files with weights specifying the strength of the relationship. Note that the edge weights in the figure are specifically chosen for the algorithm behavior example in the text.
3.4 Ranking results
Most modern search tools rank order results to provide their best guesses first. Connections implements three ranking algorithms: Basic-BFS, an algorithm that pushes weights down edges in a breadth-first manner, and two extensions to Basic-BFS based on the popular web-search algorithms HITS [20] and PageRank [5].
3.4.1 Basic-BFS
Basic-BFS uses the rankings provided by content-search to guide the rankings of contextually related items. Close relations, and relations with multiple paths to them, will receive more weight than distant relations with few incoming paths. Intuitively, this should match the user's activity: if a file is rarely used in association with content-matched files, it will receive a low rank, and vice-versa.
Let $N$ be the set of all nodes in the result-graph, and $P$ be the path length used to generate the result-graph. Each $n \in N$ is assigned a weight $w_{n_0}$ by the content-based search scheme. If a file is not ranked by content analysis, then $w_{n_0} = 0$. Connections then runs the following algorithm for $P$ iterations.
Let $E_m$ be the set of all incoming edges to node $m$. Let $e_{nm} \in E_m$ be the percentage of the outgoing edge weight at $n$ for a given edge from $n$ to $m$. Assuming that this is the $i$th iteration of the algorithm, then let:
$$w_{m_i} = \sum_{e_{nm} \in E_m} w_{n_{(i-1)}} \cdot \left[ e_{nm} \cdot \alpha + (1 - \alpha) \right]$$
The value $w_{m_i}$ represents all of the weight pushed to node $m$ during iteration $i$ of the algorithm, and $\alpha$ dictates how much to trust the specific weighting of an edge. After all runs of the algorithm, the total weight of each node is then:
$$w_n = \sum_{i=0}^{P} w_{n_i}$$
This sum, $w_n$, represents the contributions of all contextual relationship paths to node $n$ plus the contribution of its original content ranking. The final ranking of results sorts the files from highest weight to lowest.
As an example of how the algorithm works, assume that the graph in Figure 2 is the result-graph, the path length is 2, $\alpha = 0.25$, and the content-based search returns $w_{D_0} = 4$ and $w_{B_0} = 2$. Consider $w_B$. On the first pass of the algorithm, $w_{B_1}$ is updated based on $w_{D_0}$ and $w_{E_0}$. In this case, $w_{E_0} = 0$, so only $w_{D_0}$ affects its value:
$$w_{B_1} = 4 \cdot \left[ (2/10) \cdot 0.25 + 0.75 \right] = 3.2$$
On the second pass, $w_{E_1} = 3.8$ (using the formula from above) and $w_{D_1} = 0$, thus:
$$w_{B_2} = 3.8 \cdot \left[ (8/113) \cdot 0.25 + 0.75 \right] = 2.92$$
The final weight $w_B$ is then the sum of these weights, $2 + 3.2 + 2.92 = 8.12$.
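The worked example above can be checked with a short Python sketch of Basic-BFS. The edge list is one reading of Figure 2 inferred from the fractions in the text; the weights on B→A and B→C (6 and 2) are assumptions and do not affect the weight of B. The function name `basic_bfs` and the dictionary-based graph layout are ours, not taken from the Connections implementation.

```python
# (source, sink) -> weight; B->A and B->C weights are assumed.
FIGURE_2 = {
    ("D", "B"): 2, ("D", "E"): 8,
    ("E", "B"): 8, ("E", "F"): 5, ("E", "G"): 100,
    ("B", "A"): 6, ("B", "C"): 2,
}


def basic_bfs(edges, content_weights, path_length, alpha):
    """Push content-search weights along result-graph edges for P iterations."""
    nodes = {node for edge in edges for node in edge} | set(content_weights)
    out_total = {}
    for (source, _), weight in edges.items():
        out_total[source] = out_total.get(source, 0) + weight

    # Iteration 0: weights assigned by content search (0 if unranked).
    previous = {node: float(content_weights.get(node, 0)) for node in nodes}
    total = dict(previous)

    for _ in range(path_length):
        current = {node: 0.0 for node in nodes}
        for (source, sink), weight in edges.items():
            e_nm = weight / out_total[source]  # share of the source's outgoing weight
            current[sink] += previous[source] * (e_nm * alpha + (1 - alpha))
        for node in nodes:
            total[node] += current[node]
        previous = current
    return total


weights = basic_bfs(FIGURE_2, {"D": 4, "B": 2}, path_length=2, alpha=0.25)
ranking = sorted(weights, key=weights.get, reverse=True)
```

Rounded to two decimals, `weights["B"]` comes to 8.12, in agreement with the hand calculation; `ranking` lists the files from highest total weight to lowest.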
3.4.2 HITS
The HITS algorithm attempts to locate authority and hub nodes within a graph, given a specific set of starting nodes. Authority nodes are those with incoming links from many hub nodes, while hub nodes are those with outgoing links to many authority nodes. In the web, authorities are analogous to pages linked many times for a particular topic (e.g., the official SOSP web site), while hubs are analogous to pages with lists of links to authorities (e.g., a page with links to all ACM conference web sites).
HITS identifies authorities and hubs using three steps. First, it runs a content-based search to locate an initial set of nodes. Second, it creates a sub-graph of the relation-graph by locating all nodes with incoming/outgoing links from/to the starting nodes. Third, it runs a recursive algorithm to locate the "principal eigenvectors of a pair of matrices $M_{auth}$ and $M_{hub}$ derived from the link structure" [20]. These eigenvectors indicate the authority and hub probabilities for each node.
Connections implements HITS in two ways. The first implementation, HITS-original, runs an unmodified version of HITS. The second implementation, HITS-new, begins with the result-graph derived in Section 3.3, and then runs only the third part of the HITS algorithm. Our evaluation in Section 4.3.3 examines both the hub and authority rankings of each implementation.
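The third step of HITS is commonly computed by alternating hub and authority updates with normalization, which converges toward the principal eigenvectors. The Python sketch below shows that iteration on an unweighted edge list. It is a generic illustration rather than the Connections code; the function name `hits`, the iteration count, and the example edges (the link structure of Figure 2 as read from the text) are assumptions.

```python
import math


def hits(edges, iterations=50):
    """Iteratively estimate authority and hub scores for a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of nodes linking in.
        auth = {node: 0.0 for node in nodes}
        for source, sink in edges:
            auth[sink] += hub[source]
        # Hub score: sum of the authority scores of nodes linked to.
        hub = {node: 0.0 for node in nodes}
        for source, sink in edges:
            hub[source] += auth[sink]
        # Normalize so the scores do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {node: v / auth_norm for node, v in auth.items()}
        hub = {node: v / hub_norm for node, v in hub.items()}
    return auth, hub


edges = [("D", "B"), ("D", "E"), ("E", "B"), ("E", "F"), ("E", "G"),
         ("B", "A"), ("B", "C")]
auth, hub = hits(edges)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

On this small graph, B emerges as the strongest authority (it is pointed to by both D and E) and E as the strongest hub. Running the iteration directly on a result-graph corresponds to the HITS-new variant described above.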
3.4.3 PageRank
PageRank is the ranking algorithm used by the Google web search engine [5]. It takes the graph of hyperlinks in the web and calculates the principal eigenvector of a stochastic transition matrix describing this graph. This eigenvector describes the probabilities for reaching a particular node on a random walk of the graph. This probability is referred to as a page's PageRank.
Within Connections, we use the Power Method [18] to calculate the PageRank of each file in the relation-graph. Unfortunately, Google's method of merging content search results with PageRank is not documented, thus we implemented three possible uses for a file's PageRank within Connections.
The first implementation, PR-before, applies a file's PageRank to its content-based ranking (i.e., take the product of the original ranking and the PageRank as the new ranking), and then runs the Basic-BFS algorithm. The second implementation, PR-after, runs the Basic-BFS algorithm, and then applies the PageRank to the final results. The third implementation, PR-only, ignores the content rankings, and uses only PageRank to rank the files within the result-graph.
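A power-method PageRank over a weighted graph can be sketched in a few lines of Python. The text does not state the damping factor or how edge weights enter the transition matrix, so the value 0.85, the weight-proportional transitions, the uniform handling of nodes without outgoing edges, and the function name `pagerank` are all assumptions of this example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-method PageRank; edges maps (source, sink) -> weight."""
    nodes = sorted({node for edge in edges for node in edge})
    count = len(nodes)
    out_total = {node: 0.0 for node in nodes}
    for (source, _), weight in edges.items():
        out_total[source] += weight

    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_total[node] == 0)
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for (source, sink), weight in edges.items():
            new_rank[sink] += damping * rank[source] * weight / out_total[source]
        rank = new_rank
    return rank


# One reading of Figure 2; the B->A and B->C weights are assumed.
edges = {
    ("D", "B"): 2, ("D", "E"): 8,
    ("E", "B"): 8, ("E", "F"): 5, ("E", "G"): 100,
    ("B", "A"): 6, ("B", "C"): 2,
}
rank = pagerank(edges)

content = {"D": 4.0, "B": 2.0}
# PR-before style seed: product of content ranking and PageRank.
pr_before_seed = {name: score * rank[name] for name, score in content.items()}
# PR-only style ordering: PageRank alone.
pr_only_order = sorted(rank, key=rank.get, reverse=True)
```

The ranks sum to 1, and D, which has no incoming edges, receives the lowest rank. The `pr_before_seed` values would replace the content weights fed into Basic-BFS, while `pr_only_order` ignores content rankings entirely.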
3.5 Implementation
Our prototype implementation of Connections has the three components shown in Figure 1: a tracer, a content-based search, and a relation-graph. To minimize foreground impact, only the tracer runs on the system continuously, while indexing required by content-based search and the relation-graph runs either during idle time or as background processes. The delay in indexing only affects users if they search for files created between indexing periods, a scenario that already exists with today’s content-only search tools.
The tracing component sits at the system call layer in the kernel and watches user activity, tracing all file system and process management calls. Process management calls allow proper reconstruction of file descriptor activity. The tracing component is operating system specific, and Connections currently runs exclusively under Linux 2.4 kernels. Similar system call tracing infrastructure exists in other systems (e.g., Windows XP), and porting Connections should not be difficult. The performance impact of the tracing component is minimal, as with other file system tracing tools [3].
The content-based search component uses Indri [25], a state-of-the-art content analysis tool. We chose Indri because of its consistently high performance (and that of its predecessors) in several tracks of the Text REtrieval Conference (TREC) over the last few years [1, 23, 32]. TREC is an annual, competitive ranking of content-only information retrieval systems with different tracks using distinct corpora of data and queries geared toward particular retrieval tasks [42].
Connections creates the relation-graph using the algorithm described in Section 3.2 and stores it using BerkeleyDB 4.2 [33]. Connections searches the relation-graph using the algorithm described in Section 3.3 and ranks the results of the search using the algorithms described in Section 3.4.
Users specify queries in Connections as a set of keywords and (optionally) one or more file types. If file types are specified, final query results are filtered to remove other types. For example, a user searching for a copy of this paper might input the keywords content, context with the types .ps, .pdf. Connections would perform its search using content and context, and then filter the final results, showing only .ps and .pdf files.
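The final type-filtering step can be illustrated with a minimal sketch. This is not the Connections code: the function name filter_by_type and the example paths are hypothetical, and we assume that a file's type is identified by its extension and that results arrive already ranked.

```python
from pathlib import PurePosixPath

def filter_by_type(ranked_paths, file_types):
    """Keep only results whose extension is in file_types, preserving rank order.

    An empty file_types list means the user gave no type, so nothing is removed.
    """
    if not file_types:
        return list(ranked_paths)
    wanted = {t.lower() for t in file_types}
    return [p for p in ranked_paths if PurePosixPath(p).suffix.lower() in wanted]

# Hypothetical ranked results for the keywords "content, context".
results = [
    "/home/u/paper/connections.pdf",
    "/home/u/paper/notes.txt",
    "/home/u/paper/draft.ps",
]
filtered = filter_by_type(results, [".ps", ".pdf"])
```

Because filtering happens last, files of other types can still contribute to the content and context search even though they never appear in the final list.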
4. EVALUATION
Our evaluation of Connections has three parts. First, we evaluate the utility of Connections’s context-enhanced search, comparing its precision and recall against Indri, a state-of-the-art content-only search tool; as hoped, the addition of context makes the search tool more effective. Second, we evaluate the sensitivity of the various parameters within Connections, showing that, while the settings of parameters affect search quality, using "reasonable" settings that are close to optimal is sufficient to see benefits from context. Third, we evaluate the performance of indexing and querying in Connections, finding that both space and time overheads from adding context analysis are minimal.
4.1 Experimental approach
Our evaluation compares Indri’s (version 3.1.1) content-only search [25] to Connections’s context-enhanced search. To compare the utility of these two systems, we borrow and adapt techniques from information retrieval [4]. Traditionally, content-only search tools are evaluated using large public corpora of data, such as archived library data or collections of publicly accessible websites. Queries are generated by experts and evaluated by individuals familiar with the material. These "oracle" results are then compared to the results generated by the system under evaluation.
Unfortunately, two subtle differences make file system search, and especially context-enhanced search, more difficult to evaluate. First, the nature of the queries (searching for old data) demands that traces exist over a long period of time; Connections cannot provide context-enhanced results if it has no trace data for the desired data. Second, because the data is personal, only its owner can create meaningful queries and act as oracle for evaluating query results, especially when queries must be formed with the period of tracing in mind. Doing otherwise would render the experiment useless, since tracing would be present over the lifetime of a production system.
4.1.1 Gathering data
To gather context data, we traced the desktop computers of six computer science researchers for a period of six months. Using the traces, we generated a relation-graph for each user using the following default parameters: a 30 second relation window, a directed edge style, and a read/write operation filter (footnote 2).
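To make the three default parameters concrete, the following is a simplified sketch of how a relation-graph might be built from a trace. It is an illustration under stated assumptions, not the algorithm of Section 3.2: we assume the trace is a time-sorted list of (timestamp, operation, path) tuples, that a directed edge runs from each file read within the window to the file being written, and that each such co-occurrence adds 1 to the edge weight. The name build_relation_graph and the event format are hypothetical, and per-process tracking is omitted.

```python
from collections import defaultdict

def build_relation_graph(events, window=30.0):
    """Build {source: {destination: weight}} from time-sorted trace events.

    events: iterable of (timestamp_seconds, op, path).
    Only "read" and "write" events are considered (a read/write filter);
    edges point from files read in the last `window` seconds to the written file.
    """
    graph = defaultdict(lambda: defaultdict(int))
    recent_reads = []  # (timestamp, path) pairs still inside the window
    for ts, op, path in events:
        # Drop reads that have fallen out of the relation window.
        recent_reads = [(t, p) for t, p in recent_reads if ts - t <= window]
        if op == "read":
            recent_reads.append((ts, path))
        elif op == "write":
            for src in {p for _, p in recent_reads}:
                if src != path:
                    graph[src][path] += 1
        # Any other operation (e.g. "open") is ignored by this filter.
    return {src: dict(dsts) for src, dsts in graph.items()}

trace = [
    (0, "read", "a"), (10, "write", "b"),   # a -> b (10s apart)
    (50, "write", "c"),                      # read of a is 50s old: no edge
    (55, "open", "d"),                       # filtered out
    (60, "read", "b"), (70, "read", "a"), (80, "write", "c"),
]
graph = build_relation_graph(trace, window=30.0)
```

An undirected edge style would, under the same assumptions, also add the reverse edge for each pair.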
To gather content-based search results (both for the content-only system and Connections’s internal use), we ran Indri over the set of all parsable document types on the users’ computers (any files appearing to contain text, as well as PDF and Postscript files).
Each user submitted 3-5 queries. Table 2 lists three submitted queries as representative examples; they cannot all be listed for both space and privacy reasons. We ran queries in Indri (both alone and internally to Connections) using the "#combine()" operator. We ran Connections’s relation-graph search algorithm using the default parameters of a path length of 3 and a weight cutoff of 0.1%, and used the Basic-BFS ranking algorithm with an α parameter of 0.75.
4.1.2 Evaluation
Recall and precision measure the effectiveness of the search system in matching the "oracle" results. A system’s recall is the number of relevant documents retrieved over the total number specified by the oracle. A system’s precision is the number of relevant documents retrieved over the total number of documents retrieved.
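The two definitions translate directly into a few lines of code. The sketch below is illustrative only; the function name is hypothetical. The example counts are chosen to mirror the Indri row for query 3 in Table 4 (40 results, 8 correct), and the oracle size of 13 is our inference from the Connections row for the same query (13 correct at 100% recall).

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved files against an oracle set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Synthetic file names: 13 oracle files, of which the search retrieves 8,
# plus 32 retrieved files that are not in the oracle.
oracle = {f"relevant_{i}" for i in range(13)}
retrieved = [f"relevant_{i}" for i in range(8)] + [f"other_{i}" for i in range(32)]
recall, precision = recall_precision(retrieved, oracle)
print(f"recall {recall:.0%}, precision {precision:.0%}")
```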
Unfortunately, only the user of the system knows the data well enough to act as oracle for its queries, and our users were not willing to examine every file in their systems for each query. To account for this, we use a technique known as pooling [4] that combines the results from a number of different search techniques, generating a set of results with good coverage of relevant files. In our case, we pooled several context-enhanced searches using both broader parameter settings and the default settings being evaluated, and presented them to users. Users then chose the relevant documents from this pooled set of files to create the oracle.
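Pooling can be sketched as a de-duplicating union over several ranked result lists, followed by a manual relevance judgment. The code below is a minimal illustration, not the tooling used in the study; the function names, the optional depth limit, and the example file names are assumptions.

```python
def pool_results(result_lists, depth=None):
    """Union the top `depth` results of each list, in order of first appearance.

    depth=None pools every result from every list.
    """
    pooled, seen = [], set()
    for results in result_lists:
        for path in results[:depth]:
            if path not in seen:
                seen.add(path)
                pooled.append(path)
    return pooled

def build_oracle(pooled, judged_relevant):
    """Keep the pooled files that the data's owner marked as relevant."""
    return [path for path in pooled if path in judged_relevant]

# Three hypothetical searches (e.g. default and broader parameter settings).
runs = [["a", "b", "c"], ["b", "d"], ["e", "a"]]
pooled = pool_results(runs)
oracle = build_oracle(pooled, judged_relevant={"b", "e"})
```

One consequence of this design is that a relevant file that none of the pooled searches returns can never enter the oracle, so recall is measured relative to the pool.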
Footnote 2: We chose these settings after performing the sensitivity analysis described in Section 4.3.
Query Num  Query                     File Types  Description
1          osdi, background          .ps, .pdf   Papers that made up the related work of a particular paper submission
3          content, context, figure  .eps        Figures relating to content-based or context-based search
14         mozilla, obzerver, log    N/A         Mozilla web browsing logs generated by the obzerver tracing tool
Table 2: Selected search queries. This table shows three specific user-submitted queries. Each query’s search terms and file type are listed, along with an English description of the search submitted by the user.
We compare the recall and precision of different systems using two techniques. The first technique is to examine the recall/precision curve of each system. This curve plots the precision of the two systems at each of 11 standard recall levels (0%-100% in 10% increments) [4]. Examining this curve shows how well a system ranks the results that it generates. At each recall level n, the curve plots the highest precision seen between n and n+1. To calculate the average recall/precision values over a set of queries, the precision of each query at a given recall level is calculated, and then averaged.
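As an illustration of how such a curve can be computed, the sketch below uses the textbook 11-point interpolation from information retrieval, in which the precision at a recall level is the highest precision observed at that recall or any higher recall. This is a swapped-in standard variant and may differ in detail from the per-interval rule described above; the function names are hypothetical.

```python
def precision_at_recall_levels(ranked, relevant, levels=11):
    """Interpolated precision at evenly spaced recall levels (0.0 to 1.0)."""
    relevant = set(relevant)
    points = []  # (recall, precision) measured at each relevant result
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Highest precision at any recall >= this level; 0 if never reached.
        reachable = [p for r, p in points if r >= level - 1e-9]
        curve.append(max(reachable, default=0.0))
    return curve

def average_curves(curves):
    """Average the per-query curves level by level."""
    return [sum(column) / len(column) for column in zip(*curves)]

# Two relevant files, found at ranks 1 and 3.
curve = precision_at_recall_levels(["a", "x", "b", "y"], {"a", "b"})
```

For this toy query, precision is 1.0 up to the 50% recall level and 2/3 from 60% to 100%.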
The second technique is to examine the recall and precision of each system with fixed numbers of results. Most search systems present only a few results to the user at a time (e.g., a page with the first 10 results), requiring prompting from the user for more results. For example, result cutoffs of 10, 20, and 30 may map to 1, 2, or 3 pages of results, after which many users may give up or try a different query. Examining the recall and precision at low result cutoffs shows how quickly a user could locate relevant data with the system. Examining the recall and precision with an infinite result cutoff shows how many relevant results can be located using the system.
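Evaluating at a result cutoff amounts to truncating the ranked list before counting. A minimal sketch follows, with hypothetical names and synthetic data; cutoff=None stands in for the infinite cutoff.

```python
def recall_precision_at_cutoff(ranked, relevant, cutoff=None):
    """Recall and precision over the first `cutoff` ranked results.

    cutoff=None models an infinite cutoff (every result is shown).
    """
    relevant = set(relevant)
    shown = ranked if cutoff is None else ranked[:cutoff]
    hits = sum(1 for doc in shown if doc in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(shown) if shown else 0.0
    return recall, precision

# 30 ranked results; 5 relevant files, one of which ("zz") is never retrieved.
ranked = [f"d{i}" for i in range(30)]
relevant = {"d0", "d4", "d12", "d25", "zz"}
by_cutoff = {k: recall_precision_at_cutoff(ranked, relevant, k) for k in (10, 20, None)}
```

In this example recall rises from 40% at a cutoff of 10 to 80% with no cutoff, while precision falls from 20% to about 13%, the same qualitative trade-off the cutoffs are meant to expose.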
4.2 The utility of context
This section compares the recall and precision of Indri to that of Connections. First, we compare the rankings of the two systems using their recall/precision graphs. Second, we examine the interactive performance of the two systems, comparing their recall and precision at various result cutoffs. Third, we examine each of the queries in detail to get an understanding of Connections’s specific strengths and weaknesses. Fourth, we present anecdotal evidence about the advantages of context-enhanced search from a user’s perspective. Fifth, we discuss using another popular content-only search tool, Glimpse, in place of Indri, and the effect on search utility. Sixth, we compare automated context relationships to the relationships inherent in the existing user organization using Indri-Dir, a system that uses directories as contextual clusters.
4.2.1 Ranking performance
Figure 3 shows both the raw recall/precision data in table form, as well as a plot of the data. The most noticeable feature of this data is that Connections outperforms Indri at every recall level (as shown by its line being higher on the figure at each point). This indicates that Connections finds more relevant data (as evidenced by its high precision at high recall levels) and ranks it higher (as evidenced by its higher precision at lower recall levels) than content-only search.
         Recall %              Precision %
Cutoff   Indri   Connections   Indri   Connections
10       13      22            23      29
20       16      29            20      25
30       18      34            17      23
50       25      40            17      21
100      28      45            17      20
inf.     34      74            15      16
Table 3: Recall and precision at varying cutoffs averaged over 25 queries. This table lists the recall and precision levels of Indri and Connections at six different cutoff points. Low cutoffs show how the system performs in an interactive situation, where users request pages of results. Higher cutoffs show how the system performs when the user is trying to locate all available information on a topic.
4.2.2 Interactive query performance
Table 3 shows the recall and precision levels of the two systems at various cutoff points. Again, the key feature of this data is that by combining content and context, Connections outperforms content-only search at every cutoff point, increasing both recall and precision.
Connections also significantly increases the total number of results found by the system. With no result cutoff (infinite), Connections increases average recall across the 25 queries by 40 percentage points (from 34% to 74%). These results indicate that not only will users be more likely to find their data quickly, but that they will have a better chance of finding their data at all.
           Precision %
Recall %   Indri   Connections
0          36      41
10         33      41
20         26      35
30         18      31
40         17      29
50         17      28
60         14      26
70         12      24
80         5       24
90         4       14
100        4       11
average    17      28
[Plot: precision (%) against recall (%), both axes 0-100, with one line for Indri and one for Connections.]
Figure 3: Precision at 11 recall points averaged over 25 queries. The table on the left lists the precision of Indri and Connections at 11 different recall levels, as well as the average over all levels. This indicates how accurate the results of a system are; higher precision levels mean that more data is found more quickly by the system. The plot on the right is a graphical representation of the data in the table. A perfect system would have a line across the top at 100 for each recall point.
4.2.3 Individual query performance
Table 4 shows the performance of these two schemes for each of the 25 queries using a result cutoff of 1000. The most noticeable result is that, for most queries, Connections provides more correct results than content-only search with similar or better precision. To assist with interpretation, the horizontal lines partition the queries into 3 categories.
For queries 1-18, the user specified a file type. This filter reduces the number of retrieved results, improving average precision. In queries 11-14, the user specified that the file was an image (e.g., .jpg or .eps), making it much more difficult for Indri to locate relevant files. In queries 11, 12, and 13, Connections was able to leverage its contextual relationships to locate relevant images.
For queries 19-24, the user did not specify a file type. In three cases, Connections improved recall and precision. For queries 21 and 22, Connections ranked the relevant content-only results lower than Indri, resulting in lower precision. Improvements in the ranking algorithm could help Connections match, if not improve upon, these queries by pushing the relevant results higher in the rankings.
For query 25, Indri’s results did not exist in the relation-graph. This is a side-effect of the experimental setup; the relation-graph only contains data on files accessed during the period of tracing. If Connections were in place throughout a system’s lifetime, some contextual data would exist.
Examining some of the queries where Connections was unable to improve search effectiveness provides interesting insights. For queries 22 and 24, the most relevant files located by Indri were mailbox files. Such "meta-files" are composed of several smaller sub-units of data. Because the trace data cannot distinguish among relationships for individual sub-units, these files often have misleading edges, making it difficult for Connections to provide accurate results. This problem indicates a need for some level of application assistance (e.g., storing individual emails in separate files).
For queries 14 and 23, the search terms had multiple meanings within the user’s data. For example, in query 23, one of the words the user specified was "training" to refer to their workout schedule, but they also happened to be working on a project related to machine learning that often contained the word "training." These disjoint uses of a single word indicate that some level of result clustering could be useful in presenting results to users. By clustering contextually related results together, the ranks of disjoint sets could be adjusted to include some results from each cluster.
4.2.4 User satisfaction
Another important consideration for any search system is the satisfaction of the user, both with the ease of use of the system and with the provided results. Although not easily measurable, we have anecdotal evidence that indicates context-enhanced search can improve the user’s satisfaction with file system search.
One improvement noted by users is that queries can be more "intuitive." For example, in query 1 (see Table 2), the keywords intuitively describe what the user is searching for, but content-based search tools are unlikely to ever provide accurate results for such a query. Often, users appear to be searching based on their context, but are instead forced to come up with content-friendly search terms.
Another improvement noted by users is the kind of results found by the system. Several users mentioned that Connections located relevant files that they hadn’t remembered were on their machine. Rather than looking for a specific file that the user remembers, a user’s search terms can be less directed, relying more on the search system to provide the desired data.
Although far from a scientific study of user satisfaction, such anecdotal evidence lends weight to the argument for context-based search.
4.2.5 Other content analysis tools
In exploring the utility of combining content analysis with context analysis, we also implemented a version of Connections using Glimpse [29] as the content analysis tool (Glimpse-Connections). Because Glimpse does not rank its search results (and thus neither does Glimpse-Connections), it is impossible to use any comparisons that rely on ranking, such as recall/precision curves or specific result cutoffs. Thus, recall and precision can only be compared with an infinite result cutoff.
Comparing Glimpse to Glimpse-Connections with infinite cutoff, we see results similar to those in Table 3: Glimpse has a 20% recall and 29% precision, while Glimpse-Connections has a 62% recall and 48% precision. The reduced recall and increased precision of these two systems over the Indri-based systems is due to Glimpse’s strict boolean AND of all query terms, which results in fewer hits than Indri for most queries.
4.2.6 Directories as context
Traditionally, users organize their files into a directory hierarchy, grouping related files together. As such, it might seem that using these groupings as contextual relationships could provide many of the same benefits as Connections; however, in practice it does not. To explore this possibility, we built Indri-Dir, a tool that uses the directory structure to enhance search results. Specifically, Indri-Dir looks in
Category           Query  Indri                                 Connections
                   Num    Total   Correct  Recall%  Precision%  Total   Correct  Recall%  Precision%
---------------------------------------------------------------------------------------------------
Typed queries      1      14      0        0        0           40      11       100      28
                   2      8       0        0        0           30      2        100      7
                   3      40      8        62       20          59      13       100      22
                   4      39      3        50       8           58      4        67       7
                   5      40      3        30       8           59      8        80       14
                   6      116     53       71       46          138     64       85       46
                   7      111     76       71       68          134     87       81       65
                   8      165     55       72       33          187     65       86       35
                   9      345     0        0        0           380     13       87       3
                   10     2       2        25       100         31      5        63       16
                   11     (1000)  0        0        0           18      16       100      89
                   12     (1000)  0        0        0           27      9        100      33
                   13     (445)   0        0        0           58      1        100      2
                   14     (1000)  0        0        0           15      0        0        0
                   15     (1000)  0        0        0           1       0        0        0
                   16     11      0        0        0           1000    0        0        0
                   17     47      13       81       28          1000    13       81       1
                   18     23      7        100      30          36      7        100      19
---------------------------------------------------------------------------------------------------
Untyped queries    19     956     2        1        0           1000    42       13       4
                   20     934     26       41       3           1000    28       44       3
                   21     786     327      37       42          1000    354      40       35
                   22     756     14       100      2           1000    14       100      1
                   23     231     1        100      0           1000    0        0        0
                   24     65      0        0        0           1000    0        0        0
---------------------------------------------------------------------------------------------------
No data available  25     (6)     0        0        0           (6)     0        0        0
Table 4: Query result details at 1000 result cutoff. For the two search systems, this table shows: (1) the total number of results presented to the user, (2) from those, the total number of correct results, (3) the recall of the system, and (4) the precision of the system. When no files of the requested type are found, the number of files located before filtering is listed in parentheses.
the directories of content-based results for files of the requested type, assigning these files the combined weight of any content matches in the directory. If no type is specified, Indri-Dir adds all files in the directory, assigning them the weight of the highest ranked content result in that directory.
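The two Indri-Dir rules can be sketched as follows. This is a hedged reconstruction, not the authors' tool: we assume that "combined weight" means the sum of the content-match weights in a directory, that "highest ranked" corresponds to the largest weight, and that in the untyped case content matches keep their own weights. The function name, the input formats, and the example paths are hypothetical.

```python
import posixpath
from collections import defaultdict

def indri_dir_expand(content_hits, all_files, file_types=None):
    """Expand content results using directory membership.

    content_hits: {path: weight} from the content-only search.
    all_files: every file path known to the system.
    file_types: optional list of extensions such as [".eps"].
    """
    weights_by_dir = defaultdict(list)
    for path, weight in content_hits.items():
        weights_by_dir[posixpath.dirname(path)].append(weight)

    wanted = {t.lower() for t in file_types} if file_types else None
    # Untyped queries start from the content results themselves (assumption).
    scores = {} if wanted is not None else dict(content_hits)
    for path in all_files:
        weights = weights_by_dir.get(posixpath.dirname(path))
        if not weights:
            continue  # no content match in this file's directory
        if wanted is not None:
            if posixpath.splitext(path)[1].lower() in wanted:
                scores[path] = sum(weights)   # combined weight (assumed sum)
        elif path not in scores:
            scores[path] = max(weights)       # weight of the best match
    return scores

hits = {"/p/a.txt": 0.5, "/p/b.txt": 0.25, "/q/c.txt": 0.125}
files = ["/p/a.txt", "/p/b.txt", "/p/fig.eps", "/q/c.txt", "/q/d.eps", "/r/e.eps"]
typed = indri_dir_expand(hits, files, [".eps"])
untyped = indri_dir_expand(hits, files)
```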
Indri-Dir significantly underperforms both Connections and Indri on all metrics. The reason for this is two-fold. First, Indri-Dir relies on users organizing their files into directories in contextually meaningful ways, but many users have too many files to do this effectively (e.g., cluttered home directories, download folders, "paper" directories, etc.). Second, Indri-Dir relies on a directory’s organization to match the context of the user’s search, but users often organize files in one way and then use them in another. For example, a user might download all of the proceedings for a particular conference into a single directory, but later find that one particular paper is of use in their project. Rather than finding other papers related to the project, Indri-Dir will find other papers from that conference.
4.3 Sensitivity analysis
To understand the sensitivity of different parameter settings, we examined a wide variety of parameter configurations for each of the three phases of context search. For space considerations, we present a subset of the results using three queries (those listed in Table 2) that represent the space. For each query, we present a recall/precision graph, like that shown in Figure 3. In each set of graphs, we examine the sensitivity of a single parameter, using the default settings for all other parameters.
4.3.1 Identifying relationships
Relation window: Figure 4 presents the recall/precision curves for Connections configured to use each of five different relation window sizes: 10, 30 (our default), 60, 120, and 300 seconds. These graphs illustrate that a larger window size tends to reduce precision. The increase in links at each node results in the weight cutoff removing some accurate links. However, as shown in query 1, too small of a window can result in missing some relationships due to edges not being formed.
Edge style: Figure 5 presents the recall/precision curves for Connections configured to use either directed (our default) or undirected edge styles. In almost every case, the directed edge style outperforms the undirected edge style. The reason for this is nuanced. Within the traces, there are many misleading input files that are related to many files (e.g., .bashrc or .emacs). Adding these as output files significantly increases the number of outgoing edges at each node, causing the weight cutoff to remove some relevant edges. Although these misleading edges may not be followed, cutting the additional edges removes paths that would have otherwise located relevant files.
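The interaction between extra outgoing edges and the weight cutoff can be illustrated with a small sketch. We assume here that the cutoff is relative, meaning an edge is kept only if its weight is at least the cutoff fraction of the source node's total outgoing weight; this is our reading of the behavior described above, not a restatement of the search algorithm in Section 3.3. The function name, file names, and edge weights are invented for the example, and a 2% cutoff is used instead of the 0.1% default to keep the numbers small.

```python
def prune_edges(out_edges, cutoff):
    """Keep edges whose share of the node's total outgoing weight is >= cutoff."""
    total = sum(out_edges.values())
    if total <= 0:
        return {}
    return {dst: w for dst, w in out_edges.items() if w / total >= cutoff}

# Outgoing edges of one node under the directed style.
directed = {"paper.ps": 3, "fig.eps": 1}
# The undirected style adds heavy edges back to frequently read input files.
undirected = dict(directed, **{".bashrc": 60, ".emacs": 40})

kept_directed = prune_edges(directed, cutoff=0.02)
kept_undirected = prune_edges(undirected, cutoff=0.02)
```

With only two edges, fig.eps holds 25% of the outgoing weight and survives; once the two heavy edges are added, its share drops below 1% and it is cut, even though the edge itself did not change.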
Operation filter: Figure 6 presents the recall/precision curves for Connections configured to use each of three operation filters: read/write (our default), open, and all-ops. Across all queries, the open filter performs poorly; its increased number of edges results in many incorrect relationships being followed. In cases where the user specified a type (such as queries 1 and 3), the all-ops and read/write filters perform similarly. However, in untyped queries (such as query 14), the all-ops filter provides lower precision. The
[Figure 4 plots: three recall/precision graphs (Query 1, Query 3, Query 14), both axes 0-100, with one curve per relation window size: 10s, 30s, 60s, 120s, 300s.]
Figure 4: Sensitivity analysis of Connections using five different relation window sizes.
[Figure 5 plots: three recall/precision graphs (Query 1, Query 3, Query 14), both axes 0-100, with one curve per edge style: Directed, Undirected.]
Figure 5: Sensitivity analysis of Connections using the two different edge styles.
[Figure 6 plots: three recall/precision graphs (Query 1, Query 3, Query 14), both axes 0-100, with one curve per operation filter: Read/write, Open, All-ops.]
Figure 6: Sensitivity analysis of Connections using three different operation filters.
[Figure 7 plots: three recall/precision graphs (Query 1, Query 3, Query 14), both axes 0-100, with one curve per path length: 1, 2, 3, 4.]
Figure 7: Sensitivity analysis of Connections using four different path lengths.
[Figure 8 plots: three recall/precision graphs (Query 1, Query 3, Query 14), both axes 0-100, with one curve per weight cutoff: 0%, 0.1%, 1%, 2%, 5%.]
Figure 8: Sensitivity analysis of Connections using five different weight cutoffs.