A positive point about this tool is that it marks the pages with information. At the top of each page,
it places an anchor whereby the name given is the page number. It also shows the page number at
the end of each page. This page numbering and naming makes the recognition of page breaks very
easy, and is a nice feature.
During the testing of this program, I did not find many PDF files that the program was unable to
convert. The only such files were PDF files of a particularly bad quality, which I suspect were
documents that may possibly have contained unknown fonts, so improvisation had to be carried out
by the tool that created the PDF. The tool had absolutely no problems converting PDF files that
were created from Microsoft Word documents, and the HTML source for PDF files converted from
such documents was exactly the same as those converted from non-Microsoft documents.
Although the tool does place bold tags and italic tags into the HTML source, this is not 100%
reliable, as the tagging information is a little erratic. Take for example the following extract:
<i>Klim</i><i>a-</i>
<i>u</i><i>n</i><i>d</i>
<i>K</i><i>ält</i><i>eanlagen</i><br>
<i>du</i><i>rc</i><i>h</i>
<i>E</i><i>ins</i><i>a</i><i>tz</i>
<i>v</i><i>o</i><i>n</i>
<i>e</i><i>lek</i><i>t</i><i>ronis</i><i>c</i><i>h</i>
<i>geregelt</i><i>e</i><i>r</i>
<i>P</i><i>um</i><i>pen.</i>
<i>KI</i>
<i>Luf</i><i>t-</i>
<i>und</i>
<i>Kältet</i><i>ec</i><i>hni</i><i>k</i><i>,</i>
<i>page</i>
<i>20-</i><br>
<i>23</i><i>,</i>
<i>J</i><i>anuary</i>
Figure 9. HTML source produced by pdftohtml. Note the pointless repetition of HTML tagging information.
Notice from the above extract that the tags are closed and re-opened pointlessly for parts of a
word, where it should really be 1 set of tags to enclose the entire word. In fact, often, it should
really be 1 set of tags to enclose the entire section of italicised text. Of course this does not make a
difference to the browser, but if we are interested in the source, then there is a lot of cleaning to be
done upon it.
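The kind of cleaning meant here can be sketched in a few lines of Python. This is only an illustration, not part of pdftohtml: the function name merge_adjacent_tags is hypothetical, and the sketch assumes that a closing tag followed directly (or across whitespace) by an opening tag of the same kind is always redundant.

```python
import re

# A closing </i> or </b> followed, possibly across whitespace, by an
# opening tag of the same kind, e.g. "</i><i>" or "</i>\n<i>".
_REDUNDANT = re.compile(r"</(i|b)>(\s*)<\1>", re.IGNORECASE)


def merge_adjacent_tags(html: str) -> str:
    """Drop redundant close/re-open tag pairs, keeping the whitespace between them."""
    previous = None
    # Repeat until nothing changes, in case a removal exposes a new pair.
    while previous != html:
        previous = html
        html = _REDUNDANT.sub(r"\2", html)
    return html
```

Applied to the first two lines of Figure 9, this leaves a single pair of italic tags around "Klima-" and "und". Tags separated by a break tag are left alone, so the line structure of the extract survives.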
The ‘pdftohtml’ Tool: Testing Conclusion
It is a shame that the HTML source produced by this tool is so broken up. It would have been
preferable if it was only broken where the relevant HTML break tags were. However, despite this
fact, the HTML output is displayed very well in a browser, and so if this is the goal, then this tool
would be perfect. It even has the capability to include pictures, make frames, etc.
Unfortunately, although the tool does place bold and italicising mark-up tags into the HTML
source, it is not reliable enough for my purposes. I had wanted the tool to mark up all headers,
preferably with "<H>" tags, so that I could recognise section headers etc. However, the tool does
not use "<H>" tags, and it would sometimes mark up a header, but other times it would not. There is
nothing to be gained from checking for marked-up headers in the source produced, as it uses
"<B>" tags to mark headers up, but also marks up normal text with these bold tags, hence
destroying any way of uniquely identifying the headers by mark-up.
I would say that as far as PDF-to-HTML converters go, this tool is very good for creating HTML
that is intended for display in a browser, and I would strongly recommend the tool if this is the
desired operation. However, it must be realised that if the source is wanted for any further
examination, it will be necessary to do considerable cleaning on it before it is ready.
Verdict:
Recommend for PDF-to-HTML conversions to be viewed in a browser.
Not reliable if mark-up of various sections is required.
The ‘pstotext’ Tool
Source: <http://research.compaq.com/SRC/virtualpaper/pstotext.html>
The pstotext tool is available freely from the above URL. It was written by Andrew Birrell, as a
spin-off from a project known as "Virtual Paper".
The pstotext tool is primarily a PostScript to text converter, but can also be used for converting
PDF documents to text (although the documentation for the tool claims that this is slightly less
reliable).
The pstotext tool requires version 3.33 or later of Ghostscript in order to work.
Installing pstotext
The pstotext tool is downloaded from the above URL as a zipped ".tar" file. Its installation is really
very simple, it only being necessary to build the tool with the make command.
Using pstotext
The pstotext tool has the following usage information:
Usage: pstotext [option|file]...
Options:
-cork
assume Cork encoding for dvips output
-landscape
rotate 270 degrees
-landscapeOther
rotate 90 degrees
-portrait
don't rotate (default)
-bboxes
output one word per line with bounding box
-debug
show Ghostscript output and error messages
-gs "command"
Ghostscript command
-
read from stdin (default if no files specified)
-output file
output results to "file" (default is stdout)
Essentially, I used the tool with none of the options, in the form:
«
pstotext [filename]
»
This streamed the output to STDOUT, which I found to be a good feature, as it allows the output to
be directly captured and manipulated instead of having to be written to a file, which would take up
extra time.
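To illustrate what capturing the output directly might look like, here is a minimal Python sketch. The helper name capture_stdout and the file name in the usage comment are hypothetical, and decoding the output as Latin-1 is an assumption on my part rather than something tested in this report.

```python
import subprocess


def capture_stdout(command: list[str]) -> str:
    """Run a converter and return everything it wrote to STDOUT as text."""
    # check=True raises CalledProcessError if the tool exits with an error code.
    result = subprocess.run(command, capture_output=True, check=True)
    # Assumption: the output is Latin-1; this decoding cannot fail on any byte.
    return result.stdout.decode("latin-1")


# Hypothetical usage, assuming pstotext is on the PATH:
#   text = capture_stdout(["pstotext", "paper.ps"])
```

No intermediate file is involved, so the calling program can begin parsing the text as soon as the converter finishes.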
I found that the pstotext tool worked fairly quickly. For a PostScript file of around 322 KB, it took
approximately 50 seconds to make the conversion to text, including writing it to a file.
On the whole, the quality of the output produced by pstotext was very good. There were rarely
many words incorrectly broken with spaces. Lines took the same lengths as they did in the original
file (i.e. they are wrapped in the same places as they were in the PostScript file). This wrapping
however was done by inserting a newline character into the line, thus breaking it into separate
lines. This is perhaps a little unfortunate, as it would be nice in the interests of parsing the results if
the tool were to only insert newline characters where there was really supposed to be a new line
(not where lines were wrapped for formatting purposes), as it would eliminate the need to rebuild
lines at a later stage.
One problem that I did encounter in the output was with the word "different". In the output text
from a document, it kept outputting "di#erent" where the word different was supposed to be. I
believe that this must be something to do with 2 'f' characters being next to each other - perhaps
some sort of character encoding problem. This could cause big problems if the text was to be
searched for certain keywords, as strange characters in the middle of words in the text could
prevent their matching.
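One possible workaround can be sketched as follows, assuming (as suspected above, but not confirmed) that the stray '#' stands in for a ligature such as "ff". The function name, the list of candidate ligatures and the vocabulary argument are all hypothetical. The idea is to try each candidate in place of the marker and accept the first result that is a known word.

```python
# Candidate expansions, longest first so that "ffi" is tried before "ff".
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")


def repair_ligature(token: str, vocabulary: set[str], marker: str = "#") -> str:
    """Replace a stray marker with the ligature that yields a known word."""
    if marker not in token:
        return token
    for ligature in LIGATURES:
        candidate = token.replace(marker, ligature)
        if candidate.lower() in vocabulary:
            return candidate
    # No candidate is a known word, so leave the token untouched.
    return token
```

With a vocabulary containing "different", the token "di#erent" is repaired, while a token such as "C#" is left alone because no substitution produces a known word.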
Another drawback to the tool's output (albeit a small one) was that when a word was split and
hyphenated (due to word wrapping in the PostScript), this tool made no effort to remove the
hyphenation and place the word back together as one word. This is unfortunate, as it would have
helped to improve the quality of the output.
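Both the line rebuilding and the de-hyphenation could be done in a post-processing step along the following lines. This is a rough sketch only: the function name is hypothetical, and it assumes that a single newline is always a wrap while a blank line marks a genuine break, which has not been verified against pstotext output. It will also wrongly join a genuine compound word whose hyphen happens to fall at the end of a line.

```python
import re

# A hyphen at the end of a line, with word characters on both sides of the break.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")


def rejoin_wrapped_text(text: str) -> str:
    """Undo end-of-line hyphenation, then turn single newlines into spaces."""
    text = _HYPHEN_BREAK.sub(r"\1\2", text)
    # A lone newline is treated as a wrap; two or more in a row are kept.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
```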
The pstotext tool does not insert any kind of page break information such as a line of hyphens as in
the output of the prescript tool. All it does is print the page number (if the document had one),
along with a '\f' character, which is a form-feed character, and serves as a page break. This is
perhaps unfortunate, as it would aid the clarity of the output to have a line of hyphens or other such
characters instead of a '\f' character.
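Even so, the form-feed character is enough for a parser to find the page boundaries. The sketch below is a hypothetical helper, not something provided by pstotext. It splits the text on '\f' and, as a heuristic, treats a final line consisting only of digits as the printed page number, so that it can be kept apart from the body text.

```python
from typing import Optional


def split_pages(text: str) -> list[tuple[Optional[str], str]]:
    """Split converter output on form feeds into (page_number, body) pairs."""
    pages = []
    for raw in text.split("\f"):
        lines = raw.strip("\n").split("\n")
        number = None
        # Heuristic: a last line made up only of digits is the page number.
        if lines[-1].strip().isdigit():
            number = lines.pop().strip()
        body = "\n".join(lines)
        if body or number:
            pages.append((number, body))
    return pages
```

The digit test is deliberately strict: a line such as "1999." ends in a full stop and so is kept as part of the body.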
The following extract from a converted PostScript document shows the quality of text produced
by the pstotext tool for PostScript documents:
[9] Sandra Payette and Carl Lagoze. Value-added surrogates for dis-
tributed content. D-Lib Magazine: The Magazine of Digital Li-
brary Research, 6(6), June 2000.
18
[10] Andy Quick. Java HTML tidy.
<http://www3.sympatico.ca/ac.quick/jtidy.html>
[11] K. G. Saur. Functional requirements for bibliographic records,
1998. UBCIM Publications - New Series Vol. 19.
[12] Karen Sollins and Larry Masinter. Functional require-
ments for uniform resource names, December 1994.
http://www.ietf.org/rfc/rfc1737.txt.
[13] Elaine Svenonius. The Intellectual Foundation of Information
Organization. M.I.T. Press, 2000.
[14] Herbert Van de Sompel and Carl Lagoze. The Santa Fe Con-
vention of the Open Archives Initiative. D-Lib Magazine: The
Magazine of Digital Library Research, 6(2), February 2000.
19
Figure 10. Text produced by the ‘pstotext’ tool. The quality is fairly good.
Notice in the above figure the page number (18) that comes after the 9th reference. This could
easily be mistaken for another part of that reference during parsing.
When the tool was tried using PDF files as input, the results of the conversion were fair, but
certainly a lack of quality was visible. In fact, in some PDF conversions, the document was cut
short. This suggests to me that the pstotext tool possibly has some difficulties in reading the
internal references to the objects that make up the PDF file. This does not mean to say that the tool
was useless for converting PDF documents to text, just certainly not perfect.
The following extract shows part of a conversion of a PDF document:
[37] ATLAS homepage http://atlasinfo.cern.ch:80/Atlas/Welcome.html
[38] ATLAS Trigger/DAQ Prototype-1 homepage http://atddoc.cern.ch/Atlas/
[39] Applications of Corba in the Atlas prototype DAQ, S. Kolos, R. Jones. L. Mapelli, Y. Ryabov,
11th
IEEE NPSS Real Time Conference Proceedings, 1999, pp 469-474
[40] Textor-94 experiment homepage is http://www.fz-juelich.de/ipp
[41] Objectivity/Corba distributed database performance on gigabit SUN-Ultra-10 cluster,
L. Gommans and others, 11th IEEE NPSS Real Time Conference Proceedings, 1999, 442-445
[42] Overview of PHENIX Online System, C. Witzig, 10th IEEE Real Time Conference
Proceedings,
1998, pp 541-543
[43] Use of CORBA in the PHENIX Distributed Online Computing System, E. Desmond and
others,
11th IEEE NPSS Real Time Conference Proceedings, 1999, pp 487-491
[44] BaBar homepage http://www.slac.stanford.edu/BFROOT/
[45] Ambient and Configuration Databases for the BaBar Online System, G. Zioulas and others,
11th
IEEE NPSS Real Time Conference Proceedings, 1999, pp 548-550
Figure 11. Text produced by pstotext (converted from a PDF document).
As can be seen from this extract, the quality is fairly high. It is a shame that this can't be
guaranteed every time.
Unfortunately, during testing, I discovered that the pstotext tool was very unreliable when
attempting to convert a PostScript file that was created from a Microsoft document. I attempted to
convert several PostScript files that had been created from Microsoft Word and PowerPoint files by
the CERN Conversion Service, and obtained "garbage" output similar to the following:
-
--
--
--
--
--
--
-
--
-
--
.
.
.
-
Figure 12. Garbage output obtained when an attempt is made by pstotext to convert a PostScript produced from a
Microsoft document.
It can be seen from the above extract that the output for this Microsoft created PostScript that has
been converted to text is completely useless. What is worse is that the tool does not appear to
output any sort of error messages saying that it cannot properly understand the file that has been
passed to it as input.
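Since the tool gives no warning of its own, a calling program would have to judge the output for itself. One simple heuristic, sketched here with a hypothetical function name and an arbitrarily chosen threshold, is to reject output in which too few of the visible characters are letters.

```python
def looks_like_garbage(text: str, min_alpha_ratio: float = 0.5) -> bool:
    """Guess whether converter output is unusable, by its share of letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Empty or whitespace-only output is of no use either.
        return True
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) < min_alpha_ratio
```

Output like that in Figure 12 contains no letters at all and is rejected, whereas an ordinary reference line passes easily. The threshold of 0.5 is a guess and would need tuning against real documents.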
The ‘pstotext’ Tool: Testing Conclusion
I have found that the pstotext tool is capable of producing nice output that is fairly easy to parse.
There have been a few downfalls with this output, such as the lack of page break information etc,
but this is not too crucial, as it still outputs the form-feed character.
However, on the downside of the tool, although often very nice output was obtained from a PDF
file conversion, sometimes part of the file would be lost. This makes it partly unreliable for PDF
conversions.
The biggest letdown of all for the pstotext tool is that it seems to be very unreliable at converting
Microsoft PostScript documents to text. This, in my opinion, makes the tool unsuitable for use in a
production environment where we cannot determine the sources and creators of the PostScript/PDF
files that we want to convert. In fact, it is quite likely that many of the files that we would expect to
convert would have been created from Microsoft Word documents, so clearly, this tool is
unsuitable.
Verdict: Unsuitable.
The ‘Prescript’ Tool
Source: <http://www.nzdl.org/html/prescript.html>
Prescript is available freely from the above URL. It was written by people at the New Zealand
Digital Library organisation as a translator to change PostScript documents into text. It also
however offers support for a simple HTML output of the document. Unfortunately, because
prescript is a PostScript to text translator, it does not offer PDF to text translation capabilities.
When prescript is used to produce HTML, it has the capability only to introduce certain HTML
tags. These are the "<P>", "<BR>", "<HR>" and "<I>...</I>" tags. It also of course inserts the
"<HTML>", "<HEAD>", etc tags into the document. According to the NZDL site, prescript also
attempts to support paragraph boundary detection by using the line spacing and indentation in the
document in order to determine paragraph boundaries.
According to the NZDL site, prescript also attempts to de-hyphenate words that have been
hyphenated by the PostScript, which could be a fairly useful feature. It also attempts some ligature
translation for TeX documents.
Prescript requires version 4.01 or higher of the GhostScript utility in order to work. It is also
written in the Python language, and so requires the Python interpreter.
There are 2 main versions of prescript available. These are "PreScript 0.1", which is the stable
version of the program, and is recommended by the authors as the version that should be used for
any serious work that is to be undertaken. There is also however, "PreScript 2.2", which is the
latest version of the tool. The authors claim that this version of the tool is a lot faster, and generally
better, including better prediction of line, page and paragraph breaks.
Installing prescript
Having downloaded the prescript tool, which came as a "tar" package, a makefile was used in order
to install it. However, many things needed to be done manually, such as making the various
directories that it needed, as it could not successfully do this during the attempted builds. It was
also necessary for me to change the pointer to the python interpreter, etc. It was a little awkward,
but no major problems were encountered.
Using prescript:
The tool could be invoked as follows:
«
prescript <plain|html|arff> <input> [output]
»
It was necessary to specify for the tool which format the output should take, the name of the input
file (the PostScript file to be converted), and the name of the output file, to which the output was to
be written. Unfortunately, it is not possible to tell the utility to simply write its output to the
STDOUT stream - it appears to need to actually write it to a file. This is a definite downside to the
tool, as for my purposes, I simply want to call the tool from within another program, feeding its
output directly to STDOUT, and retrieving it for use by my own program. An intermediate stage of
writing a file would be a performance drawback.
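The intermediate file can at least be hidden from the calling program. The following is a minimal sketch, assuming the invocation form shown above (format, input file, output file). The helper name and the file name in the usage comment are hypothetical, and the Latin-1 decoding is an assumption.

```python
import os
import subprocess
import tempfile


def convert_via_temp_file(command_prefix: list[str], input_path: str) -> str:
    """Run a converter that insists on an output file and return its text."""
    fd, output_path = tempfile.mkstemp(suffix=".out")
    os.close(fd)  # The converter opens the file itself, by name.
    try:
        subprocess.run(command_prefix + [input_path, output_path], check=True)
        # Assumption: the converter's output can be read as Latin-1.
        with open(output_path, encoding="latin-1") as handle:
            return handle.read()
    finally:
        # Remove the temporary file whether or not the conversion succeeded.
        os.remove(output_path)


# Hypothetical usage, assuming prescript is on the PATH:
#   html = convert_via_temp_file(["prescript", "html"], "paper.ps")
```

This does not remove the cost of writing the file; it only keeps the bookkeeping in one place and makes sure the temporary file is cleaned up.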
Having learned how to use the tool, I tried it with several PostScript files, testing both the HTML
output and the plain-text ASCII output. I found that on the whole, the tool gave some very clean
and encouraging results.
First of all, the HTML results shall be discussed. I shall also include some short extracts from the
converted output of a file so that they can be appreciated within this document.
When viewed within a browser, the HTML results are very nice. Although the text is in fairly short
lines, it is well broken up into paragraphs. It is very clear to read in this manner. One problem that
was encountered however, was that when there is an image in the PostScript document and this
image contains words, the words are unfortunately translated and placed within the output text.
Often, this can be nonsense because they mean nothing without the rest of the image, and it would
in my opinion have been better to simply leave them out of the translation. Unfortunately, this is a
problem common to all of the conversion tools tested for the purposes of this report. It is my
feeling that the tools cannot distinguish between text written on top of an image, and text written on
the rest of the PS/PDF canvas. Presumably, the image is represented as binary information, but its
text remains as a series of characters written on the canvas with the usual operators.
As my goal is the extraction of reference information from the reference section however, it should
usually cause no problems. However, because the references section can sometimes contain
images and figures, any text from these could pollute the references.
When during the translation process the tool discovers a new page in the PostScript document, it is
marked in the HTML output with a "<HR>" tag. This could be very useful, as it would allow any
parsing tool to easily and unambiguously identify any new pages in the output. The tool also marks
each page with the page number just before the "<HR>" tag.
A downside to the HTML produced is that the prescript tool does not make any effort to mark up
title sections with "<H>" tags or "<B>" tags. This is certainly unfortunate, as it would be very
useful for a parser attempting to extract references to have title sections marked up. The authors'
information about the tool does say that it uses "<I>" tags, but only for marking up header and
footer sections. Very disappointing.
Regarding the source of the HTML itself, I can only say that it is very good. With other PostScript
to text conversion tools that I have seen, the quality of the output is often messy, with hyphenated
words frequently occurring, and with words not properly recognised and ending up with spaces in
the middle of a word, hence making further parsing difficulties for any tools that use the output to
attempt to recognise words and information. With the HTML source produced by the prescript tool
however, this is not the case. I did not once see a word that had been erroneously split.
Lines were broken at the end of the line as it appears in the PostScript document. The only reason
that these lines were broken at these points in the PostScript document is that the formatting of the
text in the PostScript requires lines to be wrapped in order that they fit the page. In our output
HTML however, it would be preferable if the lines did not keep this format of line breaks unless
there is really supposed to be one. However, with the HTML output, it was not a large problem
because with the HTML markup, it would be easy to replace all carriage returns in the text with
spaces, unless there was a "<BR>" or "<P>" tag present, in which case the carriage return in the
text would be justified.
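A rough Python sketch of this replacement follows. The function name is hypothetical, and the sketch assumes that a justified line break is one where the preceding line ends with a "<BR>" or "<P>" tag, or the following line begins with one. Every other line break is replaced with a space.

```python
import re

_STARTS_WITH_BREAK = re.compile(r"\s*<(p|br)\b", re.IGNORECASE)
_ENDS_WITH_BREAK = re.compile(r"<(p|br)\b[^>]*>\s*$", re.IGNORECASE)


def unwrap_lines(html: str) -> str:
    """Join wrapped lines, keeping only the breaks backed by a <P> or <BR> tag."""
    merged: list[str] = []
    for line in html.splitlines():
        if (merged
                and not _ENDS_WITH_BREAK.search(merged[-1])
                and not _STARTS_WITH_BREAK.match(line)):
            # An unjustified wrap: glue this line onto the previous one.
            merged[-1] = merged[-1].rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return "\n".join(merged)
```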
The following extract shows some HTML source created by the prescript program for a document.
<p>[1] Donna Bergmark. Automatic extraction of reference linking information
from online documents. Technical Report TR2000-1821,
Cornell Computer Science Department, October 2000.
<p>[2] Priscilla Caplan and William Arms. Reference linking
for journal articles. D-Lib Magazine: The Magazine
of Digital Library Research, 5(7/8), July/August 1999.
&lt;http://www.dlib.org/dlib/july99/caplan/07caplan.html&gt;
<p>[3] James Davis and Carl Lagoze. NCSTRL: design and deployment
of a globally distributed digital library. IEEE Computer, February
1999.
<p>[4] Steve Hitchcock, Les Carr, Wendy Hall, Stephen Harris, S. Probets,
D. Evans, and D. Brailsford. Linking electronic journals:
Lessons from the Open Journal project. D-Lib Magazine: The
Magazine of Digital Library Research, December 1998.
<p>[5] C. Lagoze and J. Davis. Dienst: An architecture for distributed
document libraries. Communications of the ACM, 38(4):47, April
1995.
<p>[6] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries
and autonomous citation indexing. IEEE Computer,
32(6):67{71, 1999. &lt;http://www.researchindex.com&gt;
<p>[7] Norman Paskin. E-citations: actionable identifiers and scholarly
referencing, 1999. &lt;http://www.doi.org/citations.pdf&gt;
<p>[8] S. Payette and C. Lagoze. Flexible and extensible digital object
and repository architecture (FEDORA). In Second European
Conference on Research and Advanced Technology for Digital Libraries,
Heraklion, Crete, 1998.
<p>[9] Sandra Payette and Carl Lagoze. Value-added surrogates for distributed
content. D-Lib Magazine: The Magazine of Digital Library
Research, 6(6), June 2000.
<p><!--PageNo--><p><b><center>18</center></b><p>
<!--EndOfPage--><p><hr><p>
<p>[10] Andy Quick.
Java HTML
tidy.
&lt;http://www3.sympatico.ca/ac.quick/jtidy.html&gt;
<p>[11] K. G. Saur. Functional requirements for bibliographic records,
1998. UBCIM Publications - New Series Vol. 19.
<p>[12] Karen Sollins and Larry Masinter. Functional requirements
for uniform resource names, December 1994.
http://www.ietf.org/rfc/rfc1737.txt.
<p>[13] Elaine Svenonius. The Intellectual Foundation of Information
Organization. M.I.T. Press, 2000.
<p>[14] Herbert Van de Sompel and Carl Lagoze. The Santa Fe Convention
of the Open Archives Initiative. D-Lib Magazine: The
Magazine of Digital Library Research, 6(2), February 2000.
<p><!--PageNo--><p><b><center>19</center></b><p>
Figure 13. A sample of the HTML source created by the prescript tool. It is of a very high quality.
Notice from the above HTML source that although each reference line is split into several lines of
text (wrapped), each reference is separated from the previous by a "<P>" tag. This would make it
very easy for a parser to rebuild the complete reference line, and indeed to separate several
reference lines from each other. Notice also the way that the start of a page is marked with the
page number (there is also a comment to let us know that this is the page number):
"<p><!--Page No--><p><b><center>19</center></b><p>"
This would make it very easy for a parser to recognise that the page has reached its end, and
therefore to recognise the pattern of new lines etc that come with the end of the page, and thus
remove them appropriately.
There is also the following end of page comment after the page number has been displayed:
"<!--EndOfPage--><p><hr><p>"
Overall, the HTML source produced by prescript is of a high quality. The figure below shows a
screenshot of its appearance in a browser:
Figure 14. HTML produced by the prescript tool as it appears in a browser.
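The paragraph tags and page comments described above suggest a simple way for a parser to pull the references out of this HTML. The following Python sketch is an illustration only: the function name is hypothetical, and the marker patterns are inferred from the single sample in Figure 13, so they may not cover every document.

```python
import html
import re

# Page-number marker as seen in Figure 13, e.g.
# <!--PageNo--><p><b><center>18</center></b>
_PAGE_NUMBER = re.compile(
    r"<!--\s*Page\s*No\s*-->\s*<p>\s*<b>\s*<center>\s*\d+\s*</center>\s*</b>",
    re.IGNORECASE)
_END_OF_PAGE = re.compile(r"<!--\s*EndOfPage\s*-->", re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")


def extract_paragraphs(source: str) -> list[str]:
    """Return the text of each <p> paragraph, with page markers removed."""
    source = _PAGE_NUMBER.sub("", source)
    source = _END_OF_PAGE.sub("", source)
    paragraphs = []
    for chunk in re.split(r"<p>", source, flags=re.IGNORECASE):
        text = _TAG.sub(" ", chunk)   # drop remaining tags such as <hr>
        text = " ".join(text.split())  # join wrapped lines, squeeze spaces
        if text:
            # Turn entities such as &lt; back into characters last of all.
            paragraphs.append(html.unescape(text))
    return paragraphs
```

Each returned string is one complete reference, rebuilt from its wrapped lines, and the page number between references [9] and [10] no longer appears in the result.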
Text Output
Now that the HTML created by prescript from the PostScript document has been discussed, it is
necessary to discuss the text output of the tool. The discussion of the text output of the prescript
tool provided here will be fairly short, because the text output is very similar to the HTML output.
Essentially, it is the same as the HTML output, but without any markup tags.