Automatic Extraction of Reference Linking
Information from Online Documents
Donna Bergmark
Cornell Digital Library Research Group
CSTR 2000-1821
Abstract
The Web, with its explosive growth, is becoming an efficient resource for
up-to-date information for the scientific researcher. Informal online archives
are repositories for technical reports. Proceedings are more and more commonly
published on the Web. The collection of online journals is growing. Indeed, a
good number of online journals are “born digital”. Many researchers simply put
their papers up on their own web site. The large volume of online material
makes it quite desirable to be able to access cited documents immediately from
the citing paper. Implementing this direct access is called “reference
linking”.

Some reference linking services exist today. A number of commercial
publishers, recognizing the significant value-added nature of reference
linking, have banded together to form the CrossRef organization. The CrossRef
publishers share their metadata, which enables them to interlink their
journals. This metadata is not, however, available without a fee to
organizations or individuals outside of CrossRef.

The vast majority of online scholarly literature is accompanied by little or
no metadata. Since it is desirable to link up this literature as well, the
problem of automatically reference linking online scholarly literature in the
absence of metadata and author intervention is a problem very much worth
considering.

This paper explores this problem in detail, and presents some algorithms for
extracting metadata from online texts and linking full-text documents
together. The extent to which reference linking of the online literature can
be done automatically is therefore the main topic of this paper.

CNRI/Darpa Grant #2057/57-02 and NSF Grant #IIS-9907892
1 Introduction
Linking documents together seems to be a natural proclivity of scholars. From
following the World Book Encyclopedia’s “see also’s” to the immediate success
of Eugene Garfield’s Citation Index in the 60’s and 70’s to the Web today,
following links to related information is irresistible. HTML would not be half
so popular a language as it is today were it not for its support for anchors
and links and the HTTP protocol.

Reference linking lies somewhere between Citation Index’s static link
discovery and the Web’s author-inserted links. It finds references in online
technical material to other online material and then turns these into anchors
by embedding them in a link to an online copy of the cited work, or to a
service which can provide a copy of the cited work. A consortium of scholarly
publishers, CrossRef, is already doing this for their own journals [2].
According to their promotional literature, CrossRef plans to do a substantial
amount of reference linking:

    At the outset, more than three million articles across thousands of
    journals will be linked through CrossRef, and more than half a million
    more articles will be linked each year thereafter. Such linking will
    enhance the efficiency of browsing and reading the primary scientific and
    scholarly literature. It will enable readers to gain access to logically
    related articles with one or two clicks – an objective widely accepted
    among researchers as a natural and necessary part of scientific and
    scholarly publishing in the digital age. – www.crossref.org

By sharing their metadata with each other, the publishers interlink their
journals. But this covers only a tiny portion of the online literature. What
about scholarly papers that exist online in repositories, archives, people’s
home pages, vendor sites, and so on? Can these be interlinked as well, with
minimal human intervention?

Our project in reference linking is directed towards exploring the extent to
which reference linking information can be extracted automatically from online
documents, without author or editorial intervention.

This work is part of a larger project, OpCit, which is discovering techniques
for linking very-large-scale preprint archives [11]. While OpCit at
Southampton focuses on large and fairly regular collections of literature, we
at Cornell are focusing on fewer but irregularly formatted papers, with a
variety of reference styles. The project as a whole thus is moving toward
bringing reference linking value to the scholarly side of the Web [5].
2 Reference Linking Tasks
Reference linking means turning references within an online document into
“live references” so that while viewing a scholarly paper or document on your
screen, you can follow references in that paper to other network accessible
papers and view those as well (in separate windows on your terminal). It is
especially attractive to develop this technology because of the increasing
number of technical journals and magazines available online [15].

Most commonly, references are found in a late section of an article; this
section is often labeled References, Bibliography, or List of References, making
it relatively easy for a computer program to locate them. Within the reference
section, individual references increasingly include URLs pointing directly to
online information. In many cases, references that do not include URLs mention
published journal articles that can also be resolved to an online copy. For
example, ACM journals can be found in print as well as online, assuming that
the reader or the reader’s institution is a subscriber to the ACM Digital
Library. SFX [23] is a leading contender for resolving references while taking
the user’s context into consideration. However, it is often the case that the
URLs must be “discovered”. That is also a task of reference linking.

We split reference linking applications into two parts [3]: full text analysis
and full text presentation. To date, we have been working mostly on the
analysis part. The presentation of the text, including live links, is another
project.

Analysis considers a paper to be composed of three parts: the front matter or
header material (title, authors, etc.), the body (containing reference anchors
and their contexts), and the reference section (containing the reference
strings) (footnote 1). Based on this decomposition, we have defined four
smaller analysis tasks for each item (footnote 2) analyzed:
• Header Material -
    – Determining general data about the analyzed item (authors, title, year
      of publication) by parsing the header material
• The Body -
    – Scanning the body of the text for reference anchors (e.g. [10]) and
      collecting the contexts of these reference anchors
• The Reference Section -
    – Analyzing the reference strings in the reference section
    – Matching the reference anchors to the tags on the reference strings
These tasks implicitly create two other tasks: determining where header
material ends and the body begins; and determining where the body ends and the
Reference Section begins. Section 5 of this paper discusses these six (total)
tasks in detail, posing them as programming problems along with the
algorithmic solutions we use.
3 Document Names
In talking about extracting reference linking information from full text, it
is important to be clear about the difference between analyzed items and the
works cited by those items. Both are writings (or creations), but with an
important difference: the analyzed item actually exists online, and we know
its current location.

References, on the other hand, are to works which may or may not exist online.
They may no longer exist anywhere. The problem is to identify the work just by
parsing the reference string. If the item is part of an online journal or is
in some repository, then it has a Digital Object Identifier [19] consisting of
the repository name, a separator, and the unique name of that object within
the repository. For example, the DOI of a D-Lib paper looks like
10.1045/december99-miller. But many online references, e.g. on a researcher’s
home page, do not have a DOI. What do we use as a name then? The problem of
getting unique DOIs for random archives or repositories is an open question.

Footnote 1: We have been deliberately casual about the term “reference” up to
this point, but now we distinguish between a tag placed in the paper’s text
and the actual citation at the end of the paper. The former will be called the
reference anchor and the latter will be the reference string.

Footnote 2: The word item is defined by the library community as an actual
copy of a work, while work is the abstract notion for a publication of which
zero or more copies might actually exist [22]. In our work, we are concerned
only with online items.

In CrossRef, a publisher can use metadata (title, author, etc.) to look up a
DOI from the CrossRef database. But this service is not (yet) generally
available to reference linking projects such as those at Cornell and
University of Southampton. While plans are underway to add SFX to CrossRef,
which would make it available to general users who have been granted access to
some of these online collections, other works are best represented by a URN
synthesized from the work’s bibliographic information.

We could have used a single, central, unique integer for document
identification, but in the interest of promoting distributed object oriented
approaches we preferred to avoid this serial bottleneck. We chose to construct
our own URNs by concatenating three strings: the first author’s last name (or
“*” if unknown), the 4-digit year of publication (or “*” if unknown), and the
first 20 characters of the lower-cased title. This becomes a sufficiently
precise hash key for looking up a work to see if it has been previously
analyzed as an item, or seen before as a reference in another item. Similarly,
the project at Southampton uses the year, month, and article number of an
arXiv item as its URN. Manufacturing a URN out of bibliographic data gives one
a distributable way of making a good key, because with bibliographic data in
hand one can compute the key directly, without doing table lookups.
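
The key construction described above can be sketched in a few lines of Python.
This is a minimal illustration, not the project's actual code: the function
name synthesize_urn and the ":" separator between the three parts are
assumptions, since the text only says that the three strings are concatenated.

```python
def synthesize_urn(first_author_last_name, year, title):
    """Build a lookup key from bibliographic data.

    The three parts follow the description in the text; the ":" separator
    is an assumption made here for readability.
    """
    # First author's last name, or "*" if unknown.
    author = first_author_last_name.strip() if first_author_last_name else "*"
    # Four-digit year of publication, or "*" if unknown.
    year_part = f"{int(year):04d}" if year is not None else "*"
    # First 20 characters of the lower-cased title.
    title_part = title.lower()[:20]
    return ":".join([author, year_part, title_part])


if __name__ == "__main__":
    print(synthesize_urn("Bergmark", 2000,
                         "Automatic Extraction of Reference Linking Information"))
    print(synthesize_urn(None, None, "Metadata"))
```

Because the key is computed purely from the bibliographic strings, two
analyzers working on different machines arrive at the same key without
consulting a shared table.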

Incomplete URNs can be completed as more items are analyzed. For example, the
year might be missing in one reference to a work, but supplied in another.

We synthesize URNs for both items and references, though where DOIs or URLs
are available, we keep them as well. Thus a single work will have one
synthetic URN, zero or more DOIs, and zero or more URLs. Having a DOI means
that the work was analyzed as an item in a repository and its references are
available.
4 Preprocessing Online Documents
One problem with automatic reference linking is that not all formats (bitmaps,
TeX, PDF, PostScript, etc.) are equally easy to parse. For this reason, before
an online document is analyzed, the first step is usually to transform the
document into a format more susceptible to analysis. The two most common
target formats are ASCII and XHTML (the XML version of HTML).

ResearchIndex (formerly CiteSeer) [13] uses a version of pstotext that inserts
font tags into the document as the document is converted from PostScript/PDF
into ASCII (footnote 3). Similar approaches are used in analyzing OCR
conversions from bit maps. Summers [21] derives paper segments (such as title
and authors) from inspecting the geometric layout of a scanned document. More
recently, Caton [7] pointed out that presentation directives can be used to
generate tags that help navigate a document. In general, these various formats
with their font notations can be converted to HTML tags, and then analyzed by
our XHTML Analyzer software.

Footnote 3: The pstotext program comes with the GhostScript package [14].

The software from Southampton [10], which reference-links PDF files found in
the arXiv repository at Los Alamos, uses Acrobat tools to convert PDF into
ASCII text prior to analysis. Likewise, [9] discusses the preprocessing of
Word documents into a form that can be analyzed. In general, the conversion
tools listed in Table 1 are recommended for preprocessing full-text documents
into a form that can be analyzed.

Full-text format   Conversion algorithm        Analyzable format   Layout info
ASCII              no conversion               ASCII               none
HTML               Tidy/JTidy                  XHTML               HTML tags
DVI                dvips, pstoascii            ASCII               fonts
PostScript         pstoascii                   ASCII               fonts
PDF                pdfps, pstoascii            ASCII               fonts
bitmaps            OCR                         ASCII               ? depends
Word               save as HTML, Tidy/JTidy    XHTML               HTML tags

Table 1: Conversion tools to prepare for parsing. See also TOM Conversion
Service, http://tom.cs.cmu.edu/.

The reference linking project at Cornell has so far dealt only with HTML
documents. It turns out that most HTML documents are not well-formed and are
therefore difficult to parse. Although HTML parsers exist (see the javax Swing
package for example), they are not well documented and do not have much
functionality. For that reason, we first convert the HTML into well-formed XML
(i.e. XHTML) using the excellent JTidy [20] package. It cleans up the tags,
lower-cases them, and tries to resolve problems. Only in ambiguous cases does
Tidy give up and output nothing. If XHTML can be obtained, it can be analyzed
by an XML parser, of which several good ones exist, including JAXP from Sun
and Xerces from the Apache project.

JTidy cannot unambiguously make every HTML document into a parseable XHTML
document. Using the April 2000 version of JTidy on D-Lib (footnote 4) papers,
for example, we found that 220 out of the 280 papers, or 79%, could be
converted into XHTML with no fatal errors (albeit numerous warnings). For the
interested reader, Figure 1 shows example snippets of HTML which could not be
tidied. The first one is missing the a element tag in front of the href
attribute; the second one uses an unknown tag, <it>, undoubtedly for italic
but that is not what HTML uses; and the third has a malformed <TD> element.
The fourth line only gets a warning because Tidy can discard the unexpected
</a>.

We do not store the preprocessed documents. We keep only the information
collected during analysis. There may be problems down the line if the online
paper is changed and deviates too much from the collected information. But see
annotation work at Berkeley [18] for possible solutions to this problem.

Footnote 4: D-Lib is an online journal which has been appearing 11 times a
year since July 1995.

<href="http://www.minitel.fr">http://www.minitel.fr</a>
in&nbsp;<it>Proceedings of the 20th Annual International ACM SIGIR...
<TABLE>... <TD WIDTH=2<BR></TD> ... </table>
<center><img src="images/book.gif" border"0"></a></center>
Figure 1: HTML snippets whose tags cannot be converted into XML by Tidy,
usually because the tag is not legal. The last line is repaired by Tidy.
5 Extracting Reference Information from Online Documents
We now turn to a detailed discussion of the reference linking tasks listed in
Section 2. Each task is described and problems in carrying out the task are
delineated. Where appropriate, some solutions and working algorithms to attack
the problems are presented. We assume that the document being processed has
been converted into XHTML, as discussed in the previous section.
5.1 Extracting an Item’s Metadata
Why is it important to have an item’s bibliographic data when analyzing it for
reference linking applications? The main reason is that since we are analyzing
this item, we have the online location of this item. It is a linkable copy of
a work. In order to know what work that is, we need to know the item’s
bibliographic data, such as title and authors and year of publication. Once we
have determined the work of which this item is a copy, then if we come across
this work in a reference list in the future, we already know that we have a
linkable reference, and we know its location.

Either the item is accompanied by metadata, as is the case with Open Archive
items [26] and more recent D-Lib papers, or else it has to be extracted from
the text of the paper. We do the latter.

To extract the metadata for an analyzed item, layout clues are necessary.
Usually presentation information (such as font changes) is used to determine
what the title is. Titles usually occur in a large font, near the beginning of
a paper.

Metadata Extraction Algorithm (for XHTML)
  set title1 = value of <title> element if there is one
  Scan for any of the following:
    <H1>text</H1>, <H2>text</H2>, <font size="+3">text</font>,
    <font size="+2">text</font>, <font size="5">text</font>
  set title2 = “text”
  if title2 is shorter than title1,
    then scan for subtitle and append to title2

In HTML, one can assume the title is contained in an <H1> or <H2> element,
although it happens sometimes that the title is simply set off by a <FONT>
element that increases the text size.

Multiline titles can be extracted from HTML documents by reconciling the
parsed title with the <title> element if one exists. It is very helpful if the
<title> element contains a ‘:’ separating the main title from the subtitle. If
there does appear to be more title to be scanned, then look for <h3> or
<font size="+2">. In the general case, the title must be assumed to be set off
in its own paragraph and/or be terminated by the font reverting to normal
size.
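
The title steps described above might look roughly like the following Python
sketch. It is an illustration under stated assumptions rather than the
project's implementation: the input is taken to be well-formed XHTML with
lower-cased tags and no XML namespace or undefined entities, the helper names
are hypothetical, and joining the main title and subtitle with ": " is a
choice made here.

```python
import xml.etree.ElementTree as ET

LARGE_FONT_SIZES = {"+3", "+2", "5"}  # font sizes named in the algorithm


def _text(elem):
    """Collapse all text inside an element into one normalized string."""
    return " ".join("".join(elem.itertext()).split())


def _is_title_candidate(elem):
    return elem.tag in ("h1", "h2") or (
        elem.tag == "font" and elem.get("size") in LARGE_FONT_SIZES)


def _is_subtitle_candidate(elem):
    return elem.tag == "h3" or (
        elem.tag == "font" and elem.get("size") == "+2")


def extract_title(xhtml):
    root = ET.fromstring(xhtml)
    title_elem = root.find(".//title")
    title1 = _text(title_elem) if title_elem is not None else ""

    elems = list(root.iter())  # document order
    for i, elem in enumerate(elems):
        if not _is_title_candidate(elem):
            continue
        title2 = _text(elem)
        # A parsed title shorter than <title> suggests a subtitle follows.
        if len(title2) < len(title1):
            nested = set(elem.iter())  # skip markup inside the title itself
            for later in elems[i + 1:]:
                if later not in nested and _is_subtitle_candidate(later):
                    title2 = f"{title2}: {_text(later)}"  # ": " is assumed
                    break
        return title2
    return title1  # no large-font candidate found; fall back to <title>


if __name__ == "__main__":
    page = ("<html><head><title>Reference Linking: A Case Study</title></head>"
            "<body><h1>Reference Linking</h1><h3>A Case Study</h3>"
            "<p>Jane Doe</p></body></html>")
    print(extract_title(page))
```

Documents produced by Tidy usually carry the XHTML namespace, so a real
analyzer would need to strip or match the namespace prefix on each tag.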

Once the title has been located, the authors come next. Determining the
authors of an unmarked-up document is particularly difficult. Although it is
relatively easy to determine where the author section is, parsing that text
for author names is problematic because it is difficult to separate author
names from institution names.

However, markup tags do help, plus the presence of commas in the text is a
clue. Any tag denotes the end of an author’s name; since last names don’t come
first in the front matter, commas usually denote the end of one author’s name.
Here is the algorithm we use to parse out the author name strings:

Metadata Extraction Algorithm (cont.)
  Rule 1. Always use the first line after the title as an author string.
  Rule 2. Always use text set off by any of the following as a string of
          author names:
            <p> text </p>
            <center> text </center>
            <strong> text </strong>
  Rule 3. Individual author names are terminated by any tag, such as
          <br>, or by a comma.
  Rule 4. The author name section is terminated by the first header, such
          as <h3>.
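
Rules 3 and 4 lend themselves to a compact sketch. The Python below is a
simplified illustration, not the project's code: the function name
parse_author_names is hypothetical, the input is assumed to be the markup that
follows the title, and all text before the first header is treated as author
text, which glosses over Rules 1 and 2 and does nothing to separate
institutions from names.

```python
import re

HEADER_TAG = re.compile(r"<h[1-6]\b", re.IGNORECASE)
TAG_OR_COMMA = re.compile(r"<[^>]*>|,")


def parse_author_names(markup_after_title):
    """Split the author section into individual name strings."""
    # Rule 4: the author section ends at the first header tag.
    header = HEADER_TAG.search(markup_after_title)
    section = (markup_after_title[:header.start()]
               if header else markup_after_title)
    # Rule 3: any tag or comma terminates a name.
    pieces = TAG_OR_COMMA.split(section)
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]


if __name__ == "__main__":
    markup = ("<center><strong>Jane Doe, John Smith<br/>Ann Lee</strong>"
              "</center><h3>Abstract</h3><p>Text, more</p>")
    print(parse_author_names(markup))
```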

Finally the date of publication can sometimes be determined from the item’s
URL or DOI. It is rarely contained in the text of the document itself. Finding
the publication year can be very difficult, and warrants further research.
5.2 Locating the Body of the Text
The body of the text starts when there are no more authors listed in the
header material. The most effective algorithm is to check scanned text for
something that looks like a section heading (but not as big as the main title)
and contains the word Abstract, Introduction, or Contents. When found, the
format of the header should be remembered for later use, when looking for the
Reference Section.
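
One way to picture this check is the short Python sketch below. It rests on
assumptions not stated in the text: the main title is taken to be an <h1>, so
"not as big as the main title" is approximated by <h2> through <h6>, and the
function name find_body_start is hypothetical.

```python
import re

BODY_START_WORDS = ("abstract", "introduction", "contents")
# Headings smaller than the (assumed) <h1> main title.
HEADING = re.compile(r"<(h[2-6])\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def find_body_start(xhtml):
    """Return (offset, heading_tag) for the heading that opens the body.

    The tag is returned so that the same heading format can be looked for
    later when locating the Reference Section. Returns None if no such
    heading is found.
    """
    for match in HEADING.finditer(xhtml):
        text = re.sub(r"<[^>]*>", "", match.group(2)).lower()
        if any(word in text for word in BODY_START_WORDS):
            return match.start(), match.group(1).lower()
    return None


if __name__ == "__main__":
    doc = "<h1>Title</h1><p>Jane Doe</p><h3>1. Introduction</h3><p>Text.</p>"
    print(find_body_start(doc))
```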
5.3 Finding Reference Anchors in the Text
Once into the body of the text, one needs to locate the reference anchors.
Figure 2 lists some of the formats found in D-Lib.

    [1] or [1,3] or [8-10]          See Hakkala (1996)
    [Bruce and Wayne ]              Bruce and others (1997)
    [Bruce et al.]                  Bruce and Wayne (1998)
    (Bruce & Wayne, 1998)
    (Bruce,1998)
    (Bruce, 1998, Wayne, 1999)
    (Bruce et al., 1998)
    (CNRI, 1997)
    {Digital Library Initiative}
    [Bruce, 1996; Wayne, 1999]

Figure 2: Some reference anchor formats, as found in D-Lib papers.

Scanning for strings as shown in the left column of Figure 2 is
straightforward. Initially each sentence in the text is searched for a ‘(’,
‘[’, or ‘{’. Differences among them are:

• Only the ‘[’ can be followed by digits (unless it is a “lone year”)
• Parenthesized references must have a year included in them, in order to
  distinguish them from parenthesized expressions.

We handle numerical ranges by replacing [1-3] with [1][2][3]. Comma-ed and
semi-colon-ed lists are similarly broken up into individual references;
(Smith, 1998; Jones, 1999) for example is normalized to
[Smith, 1998][Jones, 1999].
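
The normalization just described can be sketched as two small Python helpers.
The function names are hypothetical, and pairing each name with the four-digit
year that follows it is an assumption about how comma-ed lists such as
(Bruce, 1998, Wayne, 1999) are meant to be split.

```python
import re


def expand_numeric_anchor(anchor):
    """Turn '[1,3]' or '[8-10]' into a list of single-number anchors."""
    numbers = []
    for part in anchor.strip("[]").split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(n) for n in part.split("-"))
            numbers.extend(range(low, high + 1))
        else:
            numbers.append(int(part))
    return [f"[{n}]" for n in numbers]


def split_name_year_anchor(anchor):
    """Break a list of name/year references into individual anchors."""
    inner = anchor.strip("()[]")
    if ";" in inner:
        items = inner.split(";")
    else:
        # Comma-ed list: pair each name with the year that follows it.
        items = re.findall(r"[^,]+,\s*\d{4}", inner) or [inner]
    return [f"[{' '.join(item.split())}]" for item in items]


if __name__ == "__main__":
    print(expand_numeric_anchor("[1-3]"))
    print(split_name_year_anchor("(Smith, 1998; Jones, 1999)"))
    print(split_name_year_anchor("(Bruce, 1998, Wayne, 1999)"))
```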

This all works quite reliably. One problem is how to handle some authors’
penchant for using the reference as a part of speech, e.g. Caplan and Guenther
(1996) explored the difficulties ... This needs to be parsed into
[Caplan and Guenther, 1996] in order to match its reference string which could
look like:

    Caplan, Priscilla, and Rebecca Guenther. (1996). Metadata for
    Internet Resources: The Dublin Core Metadata Elements Set and ...

Here is our algorithm for handling this problem:

Algorithm for References used as Parts of Speech:
  if a lone year is seen, e.g. (1996)
  then do
    Scan backwards from the year to the beginning of the sentence or to an
    uncapitalized word or strange punctuation, then
    do
      accept
        NameList = { Name [, Name]+ "and" Name | Name Etal |
                     Name "and others" }
    end do
    output ‘[’ NameList ‘, ’ year ‘]’
  where:
    Name is a capital letter followed by small letters, -, or ’ and can end
    in a comma
    Etal = "et al." | "et. al." | "et. al"
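
A word-by-word backward scan in the spirit of this algorithm could be written
as follows. This Python sketch is an interpretation, not the project's code:
the function name is hypothetical, and it assumes that a capitalized word
directly before another name only counts as a name if it ends in a comma,
which is what keeps "See" out of "See Hakkala (1996)". A capitalized sentence
opener followed by a comma would still be mistaken for a name.

```python
import re

LONE_YEAR = re.compile(r"\((\d{4})\)")
NAME = re.compile(r"[A-Z][a-z'’\-]+,?")  # capital, then small letters, -, '
ET_AL = re.compile(r"et\.? al\.?")


def anchors_from_lone_years(sentence):
    """Rewrite 'Name ... (year)' phrases as '[NameList, year]' anchors."""
    anchors = []
    for match in LONE_YEAR.finditer(sentence):
        words = sentence[:match.start()].split()
        names = []
        # "et al." and "and others" can only close a name list.
        tail = " ".join(words[-2:])
        if ET_AL.fullmatch(tail) or tail == "and others":
            names, words = words[-2:], words[:-2]
        # Scan backwards, one word at a time.
        while words:
            word = words[-1]
            next_is_name = bool(names) and NAME.fullmatch(names[0]) is not None
            if word == "and" and next_is_name:
                pass
            elif NAME.fullmatch(word) and (not next_is_name
                                           or word.endswith(",")):
                pass
            else:
                break  # uncapitalized word or unexpected punctuation
            names.insert(0, words.pop())
        if names and names[0] == "and":
            names.pop(0)
        if names and NAME.fullmatch(names[0]):
            name_list = " ".join(names).rstrip(",")
            anchors.append(f"[{name_list}, {match.group(1)}]")
    return anchors


if __name__ == "__main__":
    print(anchors_from_lone_years(
        "Caplan and Guenther (1996) explored the difficulties"))
```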

This illustrates the problems involved with pulling reference anchors out of
the text. For many online journals, there is no “house” reference style, and
certainly not for author deposited papers in archives. The good thing is,
though, that once the format of the references is determined, it holds
throughout the paper. ResearchIndex uses an interesting technique to determine
which format is used: it first counts the number of ‘(’ and ‘[’ in the text.
Whichever is more frequent is taken as an important hint for finding reference
anchors.

We find reference anchors by examining each portion of the body, sentence by
sentence. Initially three parsers are run on each sentence. When one returns
more hits than the others for the same sentence, that parser, or a variant of
it, becomes the only text parser used for the remainder of the paper. For
example, if the first reference found is “(Brown and Allen, 1999)” further
references will be assumed to match the parenthesized list of authors and year
pattern. We have found that 6 grammars are sufficient to find the references
in D-Lib articles:
SQUARE_BRACKETS_AROUND_NUMERALS
PARENTHESES_AROUND_NAMES_AND_YEAR
SQUARE_BRACKETS_AROUND_ACRONYMS
PARENTHESES_AROUND_COMMAED_NAMES_AND_YEARS
BRACKETS_AROUND_COMMAED_NAMES_AND_YEARS
CURLY_BRACKETS_AROUND_ACRONYMS
Once we get to a sentence that contains one or more references, it becomes a
context. A given reference might appear in one or more contexts. These
references and containing contexts should be saved for later, when they can be
matched up with reference strings (see Section 5.6).

Find Contexts Algorithm (see below)
  Given: a sentence, S
  If there is a designated parser P do:
    Let refsInText = set of references in S using P
    If refsInText ≠ ∅, save S as a context
  end
  Else do:
    Let u = set of references in S of form [...]
    Let v = set of references in S of form (... year)
    Let w = set of references in S of form {...}
    refsInText = max of u, v, w
    If refsInText ≠ ∅:
      Save S as a context
      Set P = the parser that produced refsInText
  end
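
The control flow of the Find Contexts Algorithm can be sketched in Python as
below. The three regular expressions are deliberately crude stand-ins for the
grammars listed earlier, and the class and attribute names are hypothetical;
the point is only to show how the first parser that produces hits becomes the
designated parser for the rest of the paper.

```python
import re

# Simplified stand-ins for the three initial parsers (assumptions).
PARSERS = {
    "square": re.compile(r"\[[^\]]+\]"),                # form [...]
    "paren_year": re.compile(r"\([^()]*\d{4}[^()]*\)"), # form (... year)
    "curly": re.compile(r"\{[^}]+\}"),                  # form {...}
}


class ContextFinder:
    def __init__(self):
        self.parser = None   # designated parser P, fixed once one wins
        self.contexts = []   # list of (sentence, references) pairs

    def feed(self, sentence):
        """Process one sentence S and return the references found in it."""
        if self.parser is not None:
            refs = PARSERS[self.parser].findall(sentence)
        else:
            hits = {name: rx.findall(sentence)
                    for name, rx in PARSERS.items()}
            best = max(hits, key=lambda name: len(hits[name]))
            refs = hits[best]
            if refs:
                self.parser = best  # P = parser that produced the hits
        if refs:
            self.contexts.append((sentence, refs))
        return refs


if __name__ == "__main__":
    finder = ContextFinder()
    finder.feed("This is plain text (with an aside).")
    finder.feed("Linking is popular (Brown and Allen, 1999).")
    finder.feed("See also [3] for details.")
    print(finder.parser, len(finder.contexts))
```

In the usage above, the third sentence is not saved as a context because the
parenthesized name-and-year parser has already been chosen.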

Problems with finding contexts include:

Premature sentence termination - Assuming “.” ends sentences, then false
endings due to “et al.” or “etc.” or “44.4” could cause truncated contexts, if
not treated as special cases. Fortunately, it is relatively easy to check for
these.

Context and Reference Anchor Disassociation - A few authors will put their
reference anchors outside of the sentence to which they relate, e.g.

    In the past, this has been a big deal.[8] However, no more.

The anchor clearly belongs to the first sentence, but will be analyzed as part
of the second. The best algorithm here is to note that the terminating period
of the first sentence is followed by a typical anchor delimiter (here, “[”)
and save the sentence just in case. Then if analysis of the second sentence
reveals the presence of an anchor at the beginning, and it is followed by a
capital letter, then the anchor can be put with the first sentence instead of
the second.

False Contexts - These are where a text fragment, for example [1907], is
spuriously parsed to be a reference anchor. No harm is done unless it happens
to match one of the reference strings, in which case a false context will
appear for that reference.
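
The first two problems above can be illustrated with a short Python sketch.
Both functions are hypothetical simplifications: split_sentences treats
“et al.” and “etc.” as never ending a sentence (so a sentence that genuinely
ends in “etc.” is merged with the next one), and reassociate_leading_anchor
recognizes only the three delimiter styles discussed in this section.

```python
import re

ABBREVIATIONS = ("et al.", "etc.")  # the special cases named in the text
LEADING_ANCHOR = re.compile(
    r"^\s*(\[[^\]]+\]|\([^)]*\d{4}\)|\{[^}]+\})\s*(?=[A-Z])")


def split_sentences(text):
    """Split on '.', skipping abbreviations and decimal points."""
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        end = match.end()
        if text[end:end + 1].isdigit():          # decimal such as 44.4
            continue
        if text[start:end].endswith(ABBREVIATIONS):
            continue
        sentences.append(text[start:end].strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


def reassociate_leading_anchor(previous, current):
    """Move an anchor that opens `current` back onto `previous`."""
    match = LEADING_ANCHOR.match(current)
    if match and previous.rstrip().endswith("."):
        return previous.rstrip() + match.group(1), current[match.end():]
    return previous, current


if __name__ == "__main__":
    first, second = split_sentences(
        "In the past, this has been a big deal.[8] However, no more.")
    print(reassociate_leading_anchor(first, second))
```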
5.4 Locating the List of References
Current reference linking data extraction tools all seem to use the same
technique for locating the references section: look for a heading like
References. The Southampton software that uses deciter (we’ll call it DLS
Version 1999) scans references from the end of the paper forward and stops
when the reference section heading is found. This is very efficient in terms
of scan time. We would do the same, but we wanted to scan the paper from the
top in order to pick up the contexts. Otherwise, the algorithm is the same:
recognize a section header that says something similar to “References”. Here
is the list of headings found while analyzing D-Lib papers:
References
Bibliography
Notes and References
Note and References
<section#.> References
Note that it is important to examine only section headings for the
bibliography keyword; finding References in a table of contents doesn’t count.
Additional complications involved with locating the Reference Section are:

No Reference Section - The references appear as footnotes rather than being
collected at the end (a format popular in some circles). In this case, the
reference strings have to be located and collected while scanning the body of
the text. When the end of the paper is reached, the collected reference
strings are treated just as though they had all come at the end of the paper
in a bibliography. References, in either case, are assigned an ordinal value
depending on the relative order in which the reference string was encountered,
vis-a-vis other reference strings.

References are in a Different File - An unsolved problem is what to do with
references that are in a separate file from the repository item. For example,
some HTML documents contain a link to a separate page that holds the reference
strings. Determining the actual set of files that comprise a document is an
open, important problem.

Reference Section Loses its Markup - Since the reference section is located by
examining section headers, it is crucial that the section header be there. For
example, JTidy could remove the <H3> markup due to other syntax problems. A
related problem is when the first section header after the front matter is
inappropriately tagged.
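
The heading test used in this section to locate the reference list can be
sketched as follows. The Python below is illustrative only: the function name
is hypothetical, section headings are assumed to be marked up as <h1> to <h6>
elements, and the pattern covers just the heading variants listed above,
optionally preceded by a section number.

```python
import re

REFERENCE_HEADING = re.compile(
    r"^\s*(?:\d+\.?\s*)?"
    r"(?:References|Bibliography|Notes? and References)\s*$",
    re.IGNORECASE)
HEADING = re.compile(r"<(h[1-6])\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def find_reference_section(xhtml):
    """Return the offset just past the reference section heading, or None.

    Only section headings are examined, so the word "References" inside a
    table of contents or a paragraph does not count.
    """
    for match in HEADING.finditer(xhtml):
        text = re.sub(r"<[^>]*>", "", match.group(2))
        if REFERENCE_HEADING.match(text):
            return match.end()
    return None


if __name__ == "__main__":
    doc = ('<h2>Contents</h2><p><a href="#refs">References</a></p>'
           '<h2>1. Introduction</h2><p>Body [1].</p>'
           '<h2>5. References</h2><p>[1] A. Author.</p>')
    start = find_reference_section(doc)
    print(doc[start:])
```

As the last complication above notes, this approach fails when the heading
markup has been lost, since the keyword is then indistinguishable from body
text.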