Module 2
Image acquisition & preprocessing
Uwe Springmann
Centrum für Informations- und Sprachverarbeitung (CIS)
Ludwig-Maximilians-Universität München (LMU)
2015-09-14
Motivation
remember: the complete OCR workflow consists of several steps:
1. image acquisition
2. preprocessing
3. (ground truth production, model training)
4. recognition
5. evaluation
6. postprocessing: annotation, error correction, tagging, …
“a chain is only as strong as its weakest link”:
bad images/preprocessing will severely limit the quality of your end result
trade-off: fast result against quality result (requires some manual processing)
make an informed decision based on your objectives
Image acquisition
Where to look for digitized books
look for scans at HathiTrust, archive.org, Europeana, The European Library, DDB, Wikisource, BSB, or Google Books
try to find the best scan (Google Books are often the worst); larger file sizes point to higher resolution
especially good scans can be found in DFG-funded projects (VD16, VD17, VD18)
if you cannot find a scan:
  have it scanned by an institution (can be expensive)
  your local research library may be able to help you
  or do-it-yourself:
    procure your own copy, take the pages apart and scan them
    scan either in color or (at least) grayscale
    resolution: preferably 300-400 dpi; higher resolution may not be better (connected components in letter shapes may fall apart)
    the DFG digitisation guidelines may be helpful
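
If a scan carries no resolution metadata, the resolution can be estimated from the pixel width of the image and the physical width of the page. The following is a minimal sketch of that arithmetic; the function names and the example page size (170 mm, 2008 pixels) are assumptions made up for illustration.

```python
def mm_to_inches(millimetres: float) -> float:
    """Convert millimetres to inches (1 inch = 25.4 mm)."""
    return millimetres / 25.4


def estimated_dpi(pixels: int, physical_inches: float) -> float:
    """Resolution implied by a pixel extent and the matching physical extent."""
    if physical_inches <= 0:
        raise ValueError("physical size must be positive")
    return pixels / physical_inches


# Hypothetical example: a page 170 mm wide, scanned to an image 2008 pixels wide.
dpi = estimated_dpi(2008, mm_to_inches(170.0))
print(f"estimated resolution: {dpi:.0f} dpi")
print("within the recommended 300-400 dpi:", 300 <= dpi <= 400)
```

The same calculation run in reverse tells you how many pixels to expect for a given page size at 300 dpi, which helps to spot downsized copies.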
Some tips for image acquisition
often books found at Google are also available at a higher resolution at BSB (search BSB first)
use the BSB OPACplus catalog to search for volumes (results can be filtered for online resources)
at archive.org, download the “single page processed JP2 zip” file rather than pdf or djvu files (the latter are downgraded in resolution)
avoid binarized images, do your own binarization later on
publicly available images tend to be downsized 150 dpi “service copies” (pdf or jpg); you can ask for higher-resolution original png or tiff images
you can still OCR 150 dpi material, but if the results are not good enough for you, get 300 dpi scans before you do heavy postcorrection
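
Once a zip archive of single-page images has been downloaded, the page images can be listed and unpacked with Python's standard `zipfile` module. This is only a sketch: the function names are hypothetical, and it assumes that the archive stores one image file per page with a `.jp2` extension.

```python
import zipfile
from pathlib import Path


def page_images(zip_file, extensions=(".jp2",)):
    """Return the sorted names of page images stored in a zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(extensions)
        )


def extract_pages(zip_file, target_dir, extensions=(".jp2",)):
    """Unpack only the page images and return their paths in page order."""
    names = page_images(zip_file, extensions)
    with zipfile.ZipFile(zip_file) as archive:
        archive.extractall(target_dir, members=names)
    return [Path(target_dir) / name for name in names]
```

Calling `extract_pages("book_jp2.zip", "pages")` (both names are placeholders) would leave the JP2 files in `pages/`, ready for conversion and preprocessing.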
Effect of image quality on recognition
the same scan with lower (Google) and higher (BSB) resolution
after model training, the accuracy on test pages is 94% (Google) and 97% (BSB)
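
The slides do not say how accuracy is computed. A common definition, assumed here, is character accuracy: one minus the edit distance between OCR output and ground truth, divided by the length of the ground truth. A minimal sketch with hypothetical function names:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Share of ground-truth characters recognized correctly (can be negative)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - edit_distance(ground_truth, ocr_output) / len(ground_truth)


# One wrong character in a ten-character word (long s misread as f).
print(character_accuracy("Gesundheit", "Gefundheit"))
```

On this measure, going from 94% to 97% halves the number of errors left for postcorrection, from 6 to 3 per 100 characters.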
Preprocessing
Preprocessing tasks
preprocessing consists of (some of) the following tasks:
  splitting: split double-page images into single pages, or several columns into single-column images
  cropping: get rid of (black) boundaries
  deskewing: bring image to horizontal orientation
  dewarping: “flatten” image, if scanned from warped pages
  despeckling: noise reduction, suppress black spots (“speckles”)
  binarization: separate signal (characters, black) from noise (background, white)
  zoning: separate text zones from non-text (images, graphs etc.); separate semantically different text zones (running heads, page numbers, footnotes, columns, …)
  line segmentation: cut text zones into single text lines
all OCR engines have some kind of built-in preprocessing facility
however, for optimal results it is often better to do some manual tool-assisted preprocessing
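
Two of the tasks above, binarization and line segmentation, can be illustrated with a minimal sketch. It assumes a grayscale page given as rows of integers from 0 (black) to 255 (white), and it uses Otsu's global threshold and a horizontal projection profile. Both are textbook techniques chosen for illustration, not the method of any particular OCR engine, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Gray level that maximizes the between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # all pixels are already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (total_sum - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray_rows):
    """Map each pixel to 1 (ink) if it is at or below the threshold, else 0."""
    threshold = otsu_threshold(v for row in gray_rows for v in row)
    return [[1 if v <= threshold else 0 for v in row] for row in gray_rows]


def segment_lines(binary_rows, min_ink=1):
    """Return (top, bottom) row ranges that contain ink; bottom is exclusive."""
    lines, start = [], None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(binary_rows)))
    return lines


# Tiny made-up "page": two dark text lines separated by a white row.
page = [
    [250, 250, 250, 250],
    [250,  20,  30, 250],
    [250,  25, 250, 250],
    [250, 250, 250, 250],
    [250,  15,  15, 250],
    [250, 250, 250, 250],
]
print(segment_lines(binarize(page)))
```

A projection profile only works on deskewed, single-column text, which is one reason the tasks are usually performed in roughly the order listed.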
Example: Gart der Gesundheit (printing of 1487)
Johann Wonnecke von Kaub (Johannes von Cuba), Gart der Gesundheit (1487)
Effect of preprocessing on recognition (Bodenstein 1557)

OCR engine                        char. acc. (orig.)   char. acc. (prepr.)
Tesseract (Fraktur)               35%                  71%
Abbyy (Fraktur + hist. lexicon)   78%                  79%
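
The effect is easier to compare across engines when expressed as the share of errors that preprocessing removes. The sketch below applies that arithmetic to the accuracies reported above; the function name is hypothetical.

```python
def error_reduction(accuracy_before: float, accuracy_after: float) -> float:
    """Fraction of the original errors removed, given accuracies in percent."""
    errors_before = 100.0 - accuracy_before
    errors_after = 100.0 - accuracy_after
    if errors_before <= 0:
        raise ValueError("no errors to reduce")
    return (errors_before - errors_after) / errors_before


# Character accuracies (original, preprocessed) as reported for Bodenstein 1557.
results = {
    "Tesseract (Fraktur)": (35.0, 71.0),
    "Abbyy (Fraktur + hist. lexicon)": (78.0, 79.0),
}
for engine, (original, preprocessed) in results.items():
    reduction = error_reduction(original, preprocessed)
    print(f"{engine}: {reduction:.0%} fewer errors")
```

For Tesseract, preprocessing removes more than half of the errors (36 of 65 per 100 characters), while for Abbyy it removes about 1 in 22.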