Efficient search in hidden text
of large DjVu documents
Janusz S. Bień
September 12, 2011
Abstract
The paper describes an open-source tool which makes it possible to present
end users with the results of advanced language technologies. It relies on
the DjVu format, which for some applications is still superior to other
modern formats, including PDF/A. The GPLed DjVu tools are not limited to the
DjVuLibre library, but are being supplemented by various new programs, such
as pdf2djvu developed by Jakub Wilk. In particular, pdf2djvu makes it
possible to convert to DjVu the PDF output of popular OCR programs such as
FineReader, preserving the hidden text layer and some other features.
The tool in question was conceived by the present author and consists of a
modification of the Poliqarp corpus query tool, used for the National Corpus
of Polish; his ideas have been very successfully implemented by Jakub Wilk.
The new system, called here simply Poliqarp for DjVu, inherits from its
origin not only the powerful search facilities based on two-level regular
expressions, but also the ability to represent low-level ambiguities and
other linguistic phenomena. Although at present the tool is used mainly to
facilitate access to the results of dirty OCR, it is also ready to handle
more sophisticated output of linguistic technologies.
1 DjVu technology and DjVuLibre
The DjVu technology, described by its authors as an image compression
technique, a document format, and a software platform for delivering
documents images over the Internet [LeCun et al., 2001, p. 2], was originally
developed by Yann LeCun, Léon Bottou, Patrick Haffner, and Paul G. Howard at
AT&T Laboratories in 1996. AT&T Laboratories acquired several patents for
some aspects of the technology, but didn’t offer any product using or
supporting DjVu¹. The broad rights to the patents were purchased by
LizardTech (it later became a part of Celartem Technology Inc., which in 2009
appointed Caminova Inc. “to develop, distribute and manage its DjVu document
imaging technology”, cf. http://www.caminova.jp/en/), which in 2001 allowed
the patented techniques to be used in software distributed under the GNU
General Public License; as the wording of the statement was considered
imprecise, in 2002 it was supplemented by an additional clarification. The
implementation of the DjVu technology available under the GNU GPL licence is
called DjVuLibre. It is worth recalling that the GNU GPL provides the user
with 4 freedoms (http://www.gnu.org/philosophy/free-sw.html):
1. The freedom to run the program, for any purpose.
2. The freedom to study how the program works, and adapt it to your needs.
3. The freedom to redistribute copies so you can help your neighbor.
4. The freedom to improve the program, and release your improvements to
the public, so that the whole community benefits.
In consequence, it is most appropriate for academic research.

This is an updated version of the paper which appeared in Bernardi, Raffaella
and Chambers, Sally and Gottfried, Björn and Segond, Frédérique and
Zaihrayeu, Ilya (eds.), Advanced Language Technologies for Digital Libraries,
Lecture Notes in Computer Science 6999, Springer Berlin / Heidelberg,
pp. 1-14, 2011, DOI 10.1007/978-3-642-23160-5_1
(http://dx.doi.org/10.1007/978-3-642-23160-5_1).
Formal Linguistics department, University of Warsaw, Browarna 8/10, 00-927
Warszawa, Poland, jsbien@uw.edu.pl, http://www.klf.uw.edu.pl.
¹ Although the patents in question are valid only in the USA, they definitely
delayed the practical applications of the format (fortunately software
patents are not allowed at all in the European Union and a lot of other
countries).
DjVu has several features. First of all, it provides very efficient
algorithms for image compression; the best of them are still available only
in the form of commercial and quite expensive products. Secondly, it provides
an efficient way to transfer the compressed images over the Internet, even on
relatively slow lines. Moreover, it also provides an efficient way to display
the image on the end user’s computer, using such tricks as progressive
decoding (which decompresses only that part of the image which is to be
displayed), downloading the next page in the background, etc.
DjVu makes it possible to store every page in a separate file and download
only the pages which are really needed, which is of crucial importance
especially for large dictionaries, which are not read in a sequential way.
Another feature of crucial importance is the possibility of accompanying the
scans with a hidden text layer, which can be searched, copied, etc.
From a user’s point of view it is the DjVu viewer which is important. There
exist several of them, both commercial and free, for various platforms,
palmtops and cellular phones included. All the viewers profit from the DjVu
design features allowing the viewer to simulate the operations on a paper
document in comparable time, as illustrated by table 4 in
[LeCun et al., 2001, p. 6]:
Action                Real-world equivalent   Acceptable delay
Zooming/Panning       Moving the eyes         Immediate
Next/Previous Page    Turning a page          < 1 second
Random Page access    Finding a page          < 3 seconds
From the very beginning, DjVu viewers made it possible to highlight specified
fragments of a remote text. For example, the address
http://www.leoyan.com/century-dictionary.com/04/index04.djvu?djvuopts=&page=p2719.djvu&zoom=100&showposition=0.48,0.34&highlight=1084,3451,1004,344
points to the entry hardware in the online edition of the famous The Century
Dictionary and Cyclopedia (published from 1888 to 1891), also referenced
later in the paper. The main part of the address describes the primary
document
file, which in this case is just an index to the files containing individual
pages of the 4th volume of the dictionary. The parameter page describes the
page using its name, which happens to coincide with the name of the file
containing it. The highlight parameter specifies the pixel coordinates of the
rectangle to be highlighted, and the showposition part guarantees that the
visible area of the page will contain the highlight.
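
The structure of such an address can be made concrete with a short Python
sketch. It is only an illustration: the function name djvu_highlight_url and
its argument names are hypothetical, and the order of the parameters simply
copies the example address quoted above; no claim is made that viewers
require this order.

```python
from urllib.parse import quote


def djvu_highlight_url(index_url, page, highlight, showposition=None, zoom=None):
    """Build an address of the kind quoted in the text (hypothetical helper).

    index_url    -- address of the primary (index) document file
    page         -- page name, here identical to the page file name
    highlight    -- four pixel values, in the order used in the example address
    showposition -- optional pair of fractions locating the visible area
    zoom         -- optional zoom factor
    """
    parts = ["djvuopts=", "page=" + quote(page)]
    if zoom is not None:
        parts.append(f"zoom={zoom}")
    if showposition is not None:
        parts.append("showposition=" + ",".join(str(v) for v in showposition))
    parts.append("highlight=" + ",".join(str(v) for v in highlight))
    return index_url + "?" + "&".join(parts)


url = djvu_highlight_url(
    "http://www.leoyan.com/century-dictionary.com/04/index04.djvu",
    "p2719.djvu",
    (1084, 3451, 1004, 344),
    showposition=(0.48, 0.34),
    zoom=100,
)
print(url)
```

With the values taken from the example, the sketch reproduces the address of
the entry hardware character for character.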
This very useful feature was, however, very little used because there was no
easy way to identify the coordinates of the area to be highlighted. Therefore
in 2008 I asked Jakub Wilk (then a student of mine) to extend djview4 so that
such URLs could be created conveniently after marking a region with a mouse.
The patch was submitted to the SourceForge tracking system on 9th February,
and by 29th February it had been reimplemented more efficiently by Léon
Bottou, the author of the program, who included it in the official
distribution. I think this feature is extremely important for academic
research, as it makes it possible to quote a specific fragment of a digitized
work when including its image is technically difficult or not desirable.
When accessing a document with a highlighted fragment, the page is displayed
in the default resolution and in the default position, so it could happen
that the highlighted fragment is not immediately visible. The free but
closed-source LizardTech viewer for MS Windows had a solution to the problem
in the form of the ShowPosition parameter. In May 2008 I asked for an
identical feature in djview4, and shortly afterwards (in June 2008) Léon
Bottou implemented it. So if you send a URL referring to a highlighted
fragment of text, the recipient will see it exactly as the sender does (with
some minor exceptions).
2 DjVu and Portable Document Format
Portable Document Format (PDF) is an open standard (formally since July 1,
2008) for document exchange introduced by Adobe Systems in 1993. A subset of
the specification is known as PDF/A and described in the international
standard ISO 19005-1:2005 Document management – Electronic document file
format for long-term preservation – Part 1: Use of PDF 1.4 (PDF/A-1).
Reportedly, already version 1.0 of the specification made it possible to
create “sandwich PDF” files containing both the scans and hidden text layers,
predating DjVu in this respect; DjVu, however, for years provided better
compression (at present the compression ratio is comparable) and is still in
many respects more convenient.
Thanks to the open character of the PDF standard it became very popular, both
as the output of scanning programs and stand-alone scanners, and as an input
for printing, ranging from personal printers to professional devices.
Moreover, “sandwich PDF” is also used as the output format of many OCR
programs, including the widely used ABBYY FineReader.
To have the best of both worlds, in 2008 Jakub Wilk created the first version
of the pdf2djvu program, which he has since then actively maintained and
developed; the software is hosted at http://code.google.com/p/pdf2djvu/. It
is released under the terms of the GNU General Public License and available
in package form in major free operating system (GNU/Linux and FreeBSD)
distributions, such as Debian, Ubuntu and openSUSE; it can also be compiled
for MS Windows. The current version of the program is 0.7.10 (released on
20th August 2011) and supports such features as
• compressing the scans the DjVu way, trying to split them into foreground
and background;
• optionally preserving hidden text;
• optionally preserving the document outline;
• optionally preserving hyperlinks (with some limitations intrinsic to the
DjVu format);
• optionally preserving and updating the document metadata.
The program is able in particular to preserve and update the metadata in the
XMP format; XMP stands for Extensible Metadata Platform
(http://www.adobe.com/products/xmp/), which is becoming more and more
popular.
The expensive commercial DjVu document creators provide better compression
than pdf2djvu, but are available only for MS Windows and include built-in OCR
programs which cannot be controlled by the user. In consequence, pdf2djvu
used alone or with an OCR program of choice is a viable competitor in many
circumstances.
3 Searching the hidden text layer
Every DjVu viewer allows for searching the hidden text layer, but for large
remote documents this is inefficient, as it defeats the purpose of splitting
the document into separate pages: to access the hidden text, all the pages
have to be loaded, and if the search is repeated, they are reloaded multiple
times. On the other hand, if the document is available locally, djview4
offers a very efficient and convenient incremental search which seems to be
absent in other viewers.
Hence, the optimal solution is to use some kind of index and a search engine.
Yann LeCun, one of the creators of the DjVu format, implemented JSSindex
(JavaScript Search Engine, http://sourceforge.net/projects/jssindex/), an
interesting search tool for collections of documents in HTML, PS, PDF, and
DjVu, but unfortunately oriented only towards English-language texts and very
difficult to modify and extend. A simple search engine has been provided for
the Century Dictionary Online (http://www.global-language.com/CENTURY/)
mentioned earlier. Although it looks like special-purpose software written
for this specific task, this electronic edition created by Jeffery A. Triggs
sets standards for efficient and convenient access to DjVu documents. Another
electronic edition prepared by Triggs is Jamieson’s Etymological Dictionary
of the Scottish Language Online (http://www.scotsdictionary.com/); it offers
a choice between two search engines: Hunter and Amberfish. Hunter is
commercial software developed by Alternative Output Inc.
(http://www.alternativeoutput.com/), used by a few customers, one of them
being Oxford University Press, which reportedly uses it for the online
version of the Oxford English Dictionary. Amberfish is an open-source text
retrieval system developed by Etymon Systems; the company seems to no longer
exist, but the software is still available at
http://sourceforge.net/projects/amberfish/ and
https://github.com/nassar/amberfish.
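
Why an index helps can be shown with a toy Python sketch. It does not
describe how JSSindex, Hunter or Amberfish work internally; the page names
and texts are invented, and the tokenization is deliberately naive. The point
is only that, once the hidden text has been indexed page by page, a query
identifies the few page files worth downloading.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased word to the sorted list of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            # Naive normalization: drop surrounding punctuation only.
            index[word.strip(".,;:()")].add(name)
    return {word: sorted(names) for word, names in index.items() if word}


# Hypothetical hidden text, one entry per page file of a multi-file document.
pages = {
    "p0001.djvu": "Hardware, the wares of a hardware dealer.",
    "p0002.djvu": "Hard words and soft words.",
}
index = build_index(pages)
print(index["hardware"])  # the only page that has to be fetched for this query
```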
Although general-purpose search engines are quite useful, there is a whole
family of interesting software which treats texts as linguistic objects,
namely corpus management software. One of the most sophisticated systems of
this type is Poliqarp (Polyinterpretation Indexing Query and Retrieval
Processor), an open-source tool developed in the Institute of Computer
Science of the Polish Academy of Sciences (http://poliqarp.sourceforge.net/).
It has been in use for several years, now also for the National Corpus of
Polish (http://nkjp.pl/); this should guarantee its continuous maintenance.
Another important factor is a user community familiar with its query
language. Until recently the maintainer of Poliqarp and the implementor of
the extensions designed primarily by Adam Przepiórkowski
(cf. [Przepiórkowski, 2009]) was Jakub Wilk.
The Poliqarp query language was inspired by Corpus Query Processor, a
component of Corpus Workbench developed at the University of Stuttgart (now
an open-source system, cf. http://cwb.sourceforge.net/, but it was not so
when the development of Poliqarp started). The basic principle is to use two
levels of regular expressions. One level is applied to strings representing
the values of linguistic features of a word, the actual spelling of the word
being one of them. The second level of regular expressions is applied to
words or their sets defined with the first-level expressions. In consequence
the query language is very powerful (it seems that practically all queries
available in e.g. Hunter and Amberfish mentioned above can be expressed in
Poliqarp), but less user-friendly than in simpler systems.
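
The two-level principle can be sketched in a few lines of Python. This is not
Poliqarp's implementation or syntax: the attribute names orth and pos, the
sample tokens and both function names are assumptions made for illustration,
and the second level is reduced to plain concatenation, without alternatives
or quantifiers over tokens.

```python
import re


def token_matches(token, constraint):
    """First level: each named attribute must fully match its regular expression."""
    return all(
        re.fullmatch(pattern, token.get(attribute, "")) is not None
        for attribute, pattern in constraint.items()
    )


def find(tokens, query):
    """Second level, simplified: start positions where consecutive tokens
    satisfy the consecutive constraints of the query."""
    n = len(query)
    return [
        start
        for start in range(len(tokens) - n + 1)
        if all(token_matches(tokens[start + k], query[k]) for k in range(n))
    ]


# Hypothetical annotated tokens: spelling (orth) and part of speech (pos).
tokens = [
    {"orth": "old", "pos": "adj"},
    {"orth": "dictionaries", "pos": "noun"},
    {"orth": "and", "pos": "conj"},
    {"orth": "new", "pos": "adj"},
    {"orth": "dictionary", "pos": "noun"},
]
# An adjective followed by a noun spelled "dictionary" or "dictionaries".
query = [{"pos": "adj"}, {"orth": "dictionar(y|ies)", "pos": "noun"}]
hits = find(tokens, query)
print(hits)
```

Each constraint is a first-level expression over feature values; the list of
constraints plays the role of the second-level expression over words.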
The idea of using Poliqarp for searching the hidden text of DjVu documents
was conceived by the present author in 2008 and first formulated as a term
project for Computer Science students. The background and the results of this
preliminary attempt were presented in [Bień, 2009a]. A research grant later
made it possible to implement the more efficient and elegant solution
described below, and to support the development of some other tools mentioned
in the paper.
A search in the hidden text layer can be successful only if the text really
represents the content of the scan. Usually this is not the case, as the
hidden text layer is created by ‘dirty OCR’, i.e. an unattended OCR process.
Hence it is important to be able to estimate the quality of the hidden text
easily. Upon my request of May 2008, Léon Bottou within a few days included
in djview4 the possibility of displaying the hidden text for the scan
fragment under the cursor; another added feature is the possibility of
displaying the whole hidden layer at once. This makes it possible, e.g., to
spot the OCR errors which are to blame if the search misses a target (such
errors can now be corrected with the help of Jakub Wilk’s program djvusmooth,
available in several Linux distributions including Debian Squeeze; the
program is still under development, so it should become more convenient to
use in the near future). On the other hand, the same purpose can be served by
the graphical concordances mentioned below.
4 Poliqarp for DjVu
Poliqarp for DjVu, also known under the code name marasca-wbl, is an
extension of Poliqarp which makes it possible, at least in principle, to use
the full power of the program to search the hidden text in DjVu documents.
Its development is one of the tasks supported by the Polish Ministry of
Science and Higher Education’s grant entitled Text digitalization tools for
philological research. The source of the system is
available under the terms of the GNU GPL license at
https://bitbucket.org/jwilk/marasca-wbl. It is worth noting that although at
first the system was just a modification of Poliqarp, we contribute in return
to the original project. Since March 2010 the National Corpus of Polish has
used our version of the WWW Poliqarp client
(https://bitbucket.org/jwilk/marasca).
Poliqarp for DjVu was implemented by Jakub Wilk according to the design of
the present author. It has been available for testing since December 2009 at
http://poliqarp.wbl.klf.uw.edu.pl. It operates by augmenting a standard
Poliqarp corpus with information about the bounding box coordinates of the
text tokens. The text and the coordinates are provided in the hOCR format
[Breuel, 2007] generated with the djvu2hocr program bundled with Jakub Wilk’s
ocrodjvu software (http://jwilk.net/software/ocrodjvu). Thanks to pdf2djvu,
Poliqarp for DjVu can be applied to the results produced by practically all
important OCR programs. Moreover, a converter from the PAGE (Page Analysis
and Ground-truth Elements) format [Pletschacher and Antonacopoulos, 2010] to
hOCR has recently been developed, which allows Poliqarp to handle, at least
in principle, numerous texts prepared in this very format by the so-called
library partners in the framework of the IMPACT project (IMProving ACcess to
Text, www.impact-project.eu).
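
How tokens and their bounding boxes can be read from hOCR is sketched below
in Python. It relies on two hOCR conventions: words are elements of class
ocrx_word, and the box is stored in the title attribute as
"bbox x0 y0 x1 y1". The sample markup is invented, and that real djvu2hocr
output looks exactly like it is an assumption; the class WordBoxes is not
part of ocrodjvu or Poliqarp.

```python
import re
from html.parser import HTMLParser


class WordBoxes(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs for elements of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            self._bbox = tuple(int(v) for v in match.groups()) if match else None

    def handle_data(self, data):
        # Text directly following a word start tag is taken as the token.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


# Invented hOCR fragment: one line containing two word tokens.
sample = (
    "<span class='ocr_line' title='bbox 100 200 900 240'>"
    "<span class='ocrx_word' title='bbox 100 200 180 240'>Syr</span> "
    "<span class='ocrx_word' title='bbox 185 200 195 240'>.</span>"
    "</span>"
)
parser = WordBoxes()
parser.feed(sample)
parser.close()
print(parser.words)
```

Pairs of this kind are what is needed to link a hit in the corpus to a
rectangle on the scan.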
As of September 2011, four important Polish dictionaries are available for
testing Poliqarp for DjVu:
• “Warsaw dictionary”, more precisely Słownik języka polskiego (Dictionary
of the Polish Language) by J. Karłowicz, A. Kryński and W. Niedźwiecki,
published in Warsaw in 8 volumes in 1900–1927. It has been scanned by
the library of the University of Warsaw, which used ABBYY FineReader 8
for OCR; the results contain many mistakes but seem to be usable.
• Słownik polszczyzny XVI wieku (Dictionary of the 16th century Polish).
The work started in 1949 and is still in progress. Its digitization has a
complex history, which has been described elsewhere (cf. [Piotrowski,
2005] and [Bień, 2009b]). Since December 2010 all the 34 already published
volumes have been available. Most of them are scanned and the OCR is,
unfortunately, of rather low quality. Thanks to the sponsor of the
dictionary, the Foundation for Polish Science, which recently made
publication on the Internet a formal requirement for further funding, the
last two volumes are digitally born; the same files that were used for
printing were converted by Jakub Wilk with his pdf2djvu program, so the
physical and electronic versions have the same appearance and content. Two
earlier volumes were preserved in the internal format of the typesetting
system used; when typeset again, the resulting PDF files have a slightly
different appearance due to some minor changes in the system and fonts. As
the content remained identical, these volumes are also available as
digitally born.
• Second edition of Linde’s dictionary. Słownik języka polskiego (Dictionary
of the Polish language) by Samuel Bogumił Linde was published in 4
volumes (two of them are split into two parts, so it actually makes 6
volumes) in 1807-1814; the second edition was published in 1854-1861.
This is one of the most important historical dictionaries, and not only
from the Polish point of view, as all definitions are also given in German
and there are a lot of quotations from other languages (including Old
Slavonic, Greek and even Hebrew) and dialects, some of them already
extinct. The mixture of languages and scripts makes OCR extremely
difficult; at present the hidden text layer has been prepared with ABBYY
FineReader 10 set to the Polish language. In consequence the fragments in
Polish are of quite good quality, while the remaining parts are completely
unusable; this is however already a sufficient help for readers trying
e.g. to locate an entry, as the entries are ordered according to rules
which differ from contemporary ones. We have some plans to improve the
quality of the hidden text, but this is outside the scope of the present
paper.
• Słownik geograficzny Królestwa Polskiego i innych krajów słowiańskich
(The Geographical Dictionary of the Polish Kingdom and other Slavic
Countries), a gazetteer in 15 volumes of almost 1000 pages each, published
in 1880-1914, extremely useful for genealogical research. The gazetteer
covers Poland in its borders before the partitions between Russia, Germany
and Austria, but due to censorship it was impossible to state this
explicitly in the title.
From a user’s point of view, Poliqarp for DjVu enhances Poliqarp proper with
functionalities already present in The Century Dictionary Online and
Jamieson’s Etymological Dictionary of the Scottish Language Online, namely
with linking hits (keywords in the KWIC index) to the scans with highlighted
hits. To quickly sort out false positives caused by the low quality of “dirty
OCR”, Poliqarp for DjVu additionally provides so-called graphical
concordances, i.e. a KWIC index with the scan snippets created on the fly.
Figure 1 shows a graphical concordance for a non-trivial query in Linde’s
dictionary. The purpose of the query is to find the occurrences of the
abbreviation Syr. meaning Syryjski (i.e. Syriac [language]). The problem is
that the same abbreviation refers also to Syreniusza zielnik (i.e. Syreniusz’
herbarium), but in such a case it is followed by a page reference in the form
of a number. Hence the regular expression
Syr "\." "[^[:digit:]].*"
specifies 3 tokens:
1. the character string Syr,
2. a full stop,
3. a token that does not start with a digit.
Before going into the details of the regular expression syntax let us note
that most of the hits are obviously correct. Hit number 2 is a false positive
due to an OCR error: the digit has been misinterpreted as a letter. Hit
number 4 may seem incorrect, but actually this is a result of the size
limitation of the displayed snippet.
Figure 1: Graphical concordances in Poliqarp for DjVu

Let us now have a look at an example illustrating how the power of regular
expressions can be used to circumvent the OCR errors. The following
expression
("[CĆOGU]ze[sś]" | "[CO][z/]o[sa]") "\."
seems to match all the occurrences of the abbreviation Czes. (meaning Czech
language) in the Warsaw dictionary, which has been recognized as Cześ, Gzes,
Czos, Ozos etc., as illustrated in figures 2 and 3.
Let us analyze the structure of the query. The top level of the query
consists of three second-level regular expressions and has the structure
(RE1 | RE2) RE3
which means that we are searching for RE3 immediately preceded either by RE1
or by RE2.
Expression "\." denotes simply a full stop ending the abbreviation. Because
the full stop in regular expressions means “any character except newline” (in
this meaning it occurs close to the end in the first example), it has to be
escaped with a backslash to recover its standard meaning. Quotes are needed
to distinguish the levels of regular expressions.
Expression "[CĆOGU]ze[sś]" matches words consisting of 4 characters. The
second and third ones must be respectively z and e; the first and last may be
any character from the respective bracketed list. If such a list starts with
^, it means that the list specifies characters which are not allowed, as in
our first example.
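
The behaviour of the two bracketed alternatives can be checked outside
Poliqarp with Python's re module. This is a sketch under an assumption: it
mimics the query by requiring the whole token to match and by testing the
following token separately, and the helper is_czes is hypothetical. The word
Czas is added as a token that should not match.

```python
import re

# First alternative: [CĆOGU] z e [sś]; second alternative: [CO] [z/] o [sa].
ABBREVIATION = re.compile(r"[CĆOGU]ze[sś]|[CO][z/]o[sa]")


def is_czes(token, following):
    """True if the token looks like an OCR variant of Czes and a full stop follows."""
    return ABBREVIATION.fullmatch(token) is not None and following == "."


for word in ["Czes", "Cześ", "Gzes", "Czos", "Ozos", "Czas"]:
    print(word, is_czes(word, "."))
```

The four misrecognized forms quoted above are accepted together with the
correct spelling, while Czas is rejected by both alternatives.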
Figure 2: Graphical concordances for dirty OCR

The bracketed list may also contain predefined names of character classes, as
exemplified by [:digit:] in the first example. Another use of this construct
is demonstrated by a query usefully applicable to the dictionary of the 16th
century Polish:
"[[:upper:]]{3,}" within body meta orig=pdf
It makes it possible to search for headwords, always spelled in capitals. The
query also matches the Roman numerals referring to centuries, but this
doesn’t do much harm and avoiding it would make the query much more complex.
The results are presented in figure 4.
The top-level regular expression is simple and consists of only one
component; it is however supplemented by two clauses. The first clause limits
the search to the section named body; sections are defined during the corpus
building, and in our case this section refers to the part of the dictionary
containing the entries. The second clause refers to metadata assigned to the
publications included in the corpus. In our case this is non-standard
metadata which makes it possible to limit our search to digitally born
volumes.
The second-level expression consists of two parts: the character
specification [[:upper:]] and the quantifier {3,}. The character
specification is just a single-element bracketed list, and the element is the
name of a character class (also written in brackets); the class [:upper:]
denotes, as expected, all upper-case characters; the meaning of “all” depends
on an operating system property called locale, but can be safely assumed to
mean at least all characters present in the Basic Multilingual Plane of the
Unicode standard (www.unicode.org).
Figure 3: Standard concordances for dirty OCR

The quantifier {3,} means that the preceding element has to occur in a word
at least three times; in the case of our dictionary it means that we skip the
initials of authors (in the dictionary every entry is signed by its author)
but match the heads of entries longer than two letters. Other popular
quantifiers are: * (the preceding element occurs any number of times or does
not occur at all; the construct was used in the first of our examples), +
(the preceding element occurs at least once), ? (the preceding element occurs
at most once).
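
The effect of the quantifier can be imitated in Python. Since the standard re
module does not support the [:upper:] class name, the sketch below replaces
the class by the string method isupper; this substitution, the function name
and the sample tokens are assumptions, and the within and meta clauses are
not modelled.

```python
def is_headword_candidate(token, minimum=3):
    """Imitate "[[:upper:]]{3,}": the whole token consists of upper-case
    letters and there are at least `minimum` of them."""
    return len(token) >= minimum and all(ch.isupper() for ch in token)


# Hypothetical tokens: two headwords, author initials, a Roman numeral, a plain word.
for token in ["ABECADŁO", "ŻYĆ", "AB", "XVI", "Abc"]:
    print(token, is_headword_candidate(token))
```

As noted in the text, a Roman numeral such as XVI passes the test, while
two-letter initials do not.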
Regular expressions are far from being user-friendly; they may be confusing
even for an experienced programmer. Their use is however so ubiquitous that
learning them is a good investment. On the other hand, there already exist
various tools for editing and debugging regular expressions and we hope to
adapt one of them to Poliqarp in the future. For the time being the best
approach is to start with a simple general query and to refine the search by
adding additional restrictions.
5 Lemmatization, morphosyntactic tagging and
polyinterpretations
The standard linguistic corpus workflow includes two important steps:
morphosyntactic analysis and disambiguation (cf. e.g. [Przepiórkowski, 2004,
p. 14]). Morphosyntactic analysis assigns all possible interpretations to a
word, in par-