
Preservation of word processing documents
Ian Barnes
The Australian National University
Friday, 14 July 2006, 12:50:10 PM
Table of Contents
1. Introduction
2. Previous work
3. File formats
  3.1. Preservation vs. access formats
  3.2. Criteria for sustainability
  3.3. Word processing formats
    3.3.1. Microsoft Word
    3.3.2. OpenDocument Format
    3.3.3. Other word processing formats
  3.4. PDF
  3.5. RTF
  3.6. XML
    3.6.1. DocBook XML
    3.6.2. TEI
    3.6.3. XHTML+CSS
    3.6.4. Custom schemata
4. Converting documents into DocBook (or TEI)
5. Case studies
  5.1. ACSePub, xPub and predecessors
  5.2. First
  5.3. ANU ePress
  5.4. USQ ICE project
  5.5. National Archives Xena project
6. The Digital Scholar’s Workbench
7. Conclusion
  7.1. Recommendations
  7.2. Proposed preservation strategy for word processing documents
8. Acknowledgements
References
1. Introduction
Word processing documents are a major problem for digital repositories. As I will explain below, they are not suitable for long-term storage, so they need to be converted into an archival format for preservation. In this report I will address the following questions:
• What file formats are suitable for long-term storage of word processed text documents?
• How can we convert documents into a suitable archival format?
I also address the related non-technical question:
• How can we get authors to convert and deposit their work?
While the vast majority of material generated by universities is text, most research on digital preservation concentrates on images, sound recordings, video and multimedia. You could be forgiven for thinking that this is because text is simple, but unfortunately that’s not so. Even relatively short text documents (like this one) have complex structure consisting of sections (parts, chapters, subsections etc) and also of indented structures like lists and block quotes. A significant part of the meaning is lost if that structure is ignored (for example by saving as plain text).
Most text documents created today are created in a word processor. (The other major text-processing method, used by mathematicians, computer scientists and physicists, is TeX/LaTeX. I will address sustainability of TeX/LaTeX documents in a separate report [1].) For the reasons set out in Section 3.3 below, the file formats generated by word processors are generally not sustainable, so we need to consider converting documents to better formats. Most of the text we’re interested in archiving is in one of the various Microsoft Word formats. A small amount is in other word processing formats, notably OpenDocument Format, which is created by OpenOffice.org Writer and a few other minor word processors.
Since word processing formats are not suitable for preservation, the next question is: “What format should we convert documents to?” Most archives seem to have chosen PDF, but this has serious problems as set out in Section 3.4. XML is a better answer, but it’s not a complete answer. XML is not a file format, but a meta-format, a framework for creating file formats. We have to choose a suitable XML file format for storing documents. I discuss this question in Section 3.6.
There are various methods available for converting word processing documents into a suitable XML format. I discuss these briefly in Section 4.
Other people have been thinking about this problem too. In Section 2 I give a very brief literature review, and in Section 5 I list a few case studies of previous practical conversion work, biased towards work done by people I know, here in Australia.
In Section 6 I give a description of my own current work in progress, the Digital Scholar’s Workbench, a web application designed to solve some of the problems with preservation and interoperability of word processing documents.
In Section 7 I sum up and make some recommendations.
2. Previous work
There is a lot of published research on digital preservation, but not much of it that I found deals in any detail with preservation of text.
There is good work done by the DiVA people in Uppsala University Library, who are archiving documents in XML [2]. They use a custom format which is basically DocBook XML for describing the document itself (content and structure), with a wrapper around the outside allowing for collections of related documents and for comprehensive metadata.
Slats [3] discusses requirements for preservation of text documents, and the relative merits of XML and PDF. Like several other authors with similar publications, she recommends storing documents in XML, but fails to specify what XML format to choose.
Anderson et al [4] from Stanford recommend ensuring that documents are created in a sustainable format rather than attempting conversion and preservation later, as I will recommend below. This leaves open the question of what to do with existing documents.
The National Library of Australia [5] recommends converting word processing documents to Rich Text Format (RTF) for preservation. (I disagree. See Section 3.5 below.)
3. File formats
This section is about choice of file formats. First we need to make an important distinction between preservation formats and access or viewing formats.
3.1. Preservation vs. access formats
A preservation format is one suitable for storing a document in an electronic archive for a long period. An access format is one suitable for viewing a document or doing something with it.
Note that it may well be the case that no-one ever views the document in its preservation format. Instead, the archive provides on-the-fly conversion into one or more access formats when someone asks for it. For example, the strategy I recommend is to store DocBook XML or TEI, but serve the document up as either HTML for online viewing or PDF for printing.
Some file formats may be suitable for both purposes. XHTML has been suggested, with CSS for display formatting. As XHTML is XML (and particularly if the markup is made rich with use of the div element to indicate structure), it may be an adequate preservation format, at least for simple documents. As it can be viewed directly in a web browser, it is eminently suitable as an access format. It does have some shortcomings however, as set out in Section 3.6.3 below.
3.2. Criteria for sustainability
What features does a good preservation format have? How do we judge?
Michael Lesk [6] gives a list of required features for preservation formats. (The points in italics are his, the comments that follow are mine.)
1. Content-level, not presentation-level descriptions. In other words, structural markup, not formatting.
2. Ample comment space. Formats that allow rich metadata probably satisfy this.
3. Open availability. In other words, no proprietary formats. To get a scare, remember what happened to GIF images when Unisys claimed that they were owed royalties because they own the file format [7]. What would happen if Adobe decided to do the same with PDF or Microsoft with Word?
4. Interpretability. In other words, the formats should not be binary. It should be possible for a human to read the data, and also for small errors in storage or transmission to remain localised. A small error in a compressed binary file can render the entire file useless.
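The difference between localised and catastrophic damage in point 4 can be tried out directly. The following is a minimal sketch, assuming zlib compression as a stand-in for whatever compression a real file format uses; the sample text and the helper names flip_byte and try_decompress are illustrative only.

```python
import zlib


def flip_byte(data: bytes, index: int) -> bytes:
    """Return a copy of data with every bit of one byte inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 0xFF
    return bytes(damaged)


def try_decompress(blob: bytes):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


text = ("Word processing documents are a major problem for digital "
        "repositories. " * 20).encode("utf-8")

# Plain text: count how many bytes differ after damaging a single byte.
damaged_text = flip_byte(text, len(text) // 2)
changed = sum(a != b for a, b in zip(text, damaged_text))

# Compressed copy: damage one byte in the middle, then try to read it back.
packed = zlib.compress(text)
recovered = try_decompress(flip_byte(packed, len(packed) // 2))

print("bytes changed in plain text:", changed)
print("compressed copy recovered intact:", recovered == text)
```

In the plain-text copy the damage stays confined to the byte that was hit, while the compressed copy can no longer be read back as the original text.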
Stanescu [8] looks at this topic from a risk management point of view. Slats [3] discusses criteria for choosing file formats, coming to very similar conclusions.
3.3. Word processing formats
3.3.1. Microsoft Word
The vast majority of all text documents created today are created in Microsoft Word using its native .doc format (in one of its many variations depending on the version of Word being used). It would be great if we could just deposit Microsoft Word documents into repositories and be done with it, but unfortunately that won’t do, for a few good reasons:
• Word format is proprietary. It is owned by Microsoft Corporation. Even the recent Microsoft Word XML-based formats suffer from this. So why are proprietary formats a bad thing?
  • The owner could choose to change the format at any time, possibly forcing repositories to convert all their documents.
  • The owner could change the licensing at any time, perhaps insisting that documents may only be opened using their software, or that users pay a fee for reading or editing existing documents.
• Except for the recent XML-based versions, Word is a binary format. There is no obvious way to extract the content from a Word document. If the document is corrupted even a little, the content can be lost. Even the most recent version, Microsoft Open XML format, is a compressed Zip archive of XML files. Compressed files are particularly prone to major loss if corrupted.
• Word is not just one format but many. One could argue that Microsoft’s success has been partly built on making incompatible changes to their format so as to encourage users to pay for new versions of the software. Leaving documents in Word format forces repositories to support not one but several file formats, or alternatively to engage, every few years, in a process of opening every stored document in the latest version of the software, and saving it using the most recent incarnation of the format. When the number of documents becomes large, this becomes an unacceptable cost.
• Even the new XML-based format has some technical problems. For example, some of the data in a bibliography entry is stored as strings that need parsing [9], rather than using XML elements or attributes to separate the different items. This makes automated processing of these files much more difficult.
Microsoft has released their latest XML-based file format, known as Open XML [10], publicly, along with assurances that it is and will always be free [11]. Despite the mistrust of many in the open source community, who remember the GIF/Unisys controversy [12], this appears to be genuine. Nevertheless, there do not appear to be any significant advantages of Open XML over OpenDocument Format.
3.3.2. OpenDocument Format
OpenDocument Format [13] is the native file format of the latest versions of OpenOffice.org Writer [14], the word processor component of the OpenOffice.org open source office suite. OpenOffice.org is the open source version of StarOffice, which was originally developed by StarDivision in Germany. StarDivision were bought by Sun Microsystems, who still support the continuing development. OpenOffice.org is the world’s largest open source software project. Most development seems to be done by Sun engineers, but there is also a very active community.
OpenDocument Format grew out of OpenOffice.org’s earlier OpenOffice XML format. It is now an OASIS and ISO standard and a European Commission recommendation. It is supported by the open source word processors KOffice and AbiWord, with more to come.
An ODF file is a Zip archive containing several XML files, plus images and other objects. The Zip archiving and compression tool is freely available on all major platforms, so there should never be a problem getting at the content of an ODF document. Using a Zip archive does mean that the files are prone to catastrophic loss of content with even minor data corruption, in the same way as the Microsoft Word formats discussed above.
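Getting at the content of such a package needs nothing more than a Zip reader and an XML parser. The sketch below is illustrative only: it builds a toy two-member package in memory (not a complete or valid ODF file) and pulls the headings and paragraphs out of its content.xml. The helper names make_toy_odt and read_paragraphs are hypothetical; the namespace URIs and the content.xml member name are those used by ODF text documents.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# A tiny stand-in for the content.xml member of an ODF text document.
CONTENT = f"""<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <office:body>
    <office:text>
      <text:h text:outline-level="1">Introduction</text:h>
      <text:p>Word processing documents are a major problem.</text:p>
    </office:text>
  </office:body>
</office:document-content>
"""


def make_toy_odt() -> bytes:
    """Build a toy Zip package in memory (illustration only, not valid ODF)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", CONTENT)
    return buf.getvalue()


def read_paragraphs(odt_bytes: bytes) -> list:
    """Return the text of every heading and paragraph, in document order."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    wanted = (f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p")
    return ["".join(el.itertext()) for el in root.iter() if el.tag in wanted]


print(read_paragraphs(make_toy_odt()))
```

Note that the result is a flat list of headings and paragraphs; nothing in it records which paragraph belongs to which section.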
If we are going to archive word processing documents, I believe that ODF is a better option than Microsoft Word format in any of its variations. Even the new XML-based Word formats will still suffer from being owned by a for-profit corporation.
One possible preservation strategy would be to convert all word processing documents to ODF for storage. This can be done easily using OpenOffice.org itself as a converter. The conversion could be set up as part of the repository ingest process so that it would be almost totally painless for users. Conversion to ODF gets all the formatting of most Word documents, with only minor differences in layout. For complex documents that use lots of floating text boxes, these minor differences can make a mess of the appearance of the document. For documents that use embedded active content (chunks from live spreadsheets etc), the embedding will probably fail. For most “normal” documents, even complex ones, the conversion is good.
The main disadvantage of this strategy is that OpenDocument Format is still a word processing format, not a structured document format. What does this mean, and why is it a problem?
• Word processing formats are at heart about describing the appearance of the document, not its structure. For serious processing it’s the structure we want. In 20, 50 or 100 years, most readers will probably not care about the size of the paper, the margins, the fonts used and so on. Even today, if we’re going to serve up a document as a web page, those details are irrelevant. Sometimes these details can even be a disadvantage, for example if the document insists on fonts that are unavailable on your computer. On the other hand, the division of the document into sections will always be relevant, useful and important, and must be preserved.
• Word processing formats are flat. That is, the document is a sequence of paragraphs and headings. What we’d really like is a deep structure with sections, subsections and so on, nested inside each other (as in DocBook or TEI). We want this deep structure because it makes structured searches and queries possible, and makes conversion with XSLT much easier.
  It is possible to do automated conversion from flat to deep structure [15] (and see Section 6 below), but this is only possible at the moment with documents that conform to a well-designed template. In the future heuristic methods might extend this to less carefully prepared documents, but the results are likely to be inconsistent.
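To make the flat versus deep distinction concrete, here is a minimal sketch of one way such a conversion could work. It is not the method of [15] or of the Digital Scholar’s Workbench; it assumes the input has already been reduced to a list of (level, text) pairs, where level 0 marks a paragraph and level n marks a heading of depth n, and it emits DocBook-like section, title and para elements. The function name nest and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def nest(flat):
    """Turn [(level, text), ...] into nested <section> elements.

    level 0 is a paragraph; level n >= 1 is a heading of that depth.
    """
    root = ET.Element("article")
    stack = [(0, root)]  # (heading level, element) for each open container
    for level, text in flat:
        if level == 0:
            # A paragraph belongs to the innermost open section.
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        # A heading closes every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section")
        ET.SubElement(section, "title").text = text
        stack.append((level, section))
    return root


flat = [
    (1, "File formats"),
    (0, "This section is about choice of file formats."),
    (2, "Preservation vs. access formats"),
    (0, "A preservation format is one suitable for storing a document."),
    (2, "Criteria for sustainability"),
    (0, "What features does a good preservation format have?"),
    (1, "Case studies"),
    (0, "A few case studies of practical conversion work."),
]

tree = nest(flat)
print(ET.tostring(tree, encoding="unicode"))
```

The sketch only works because every heading carries a reliable level, which is exactly what a well-designed template guarantees. A document whose headings are just bold paragraphs gives the algorithm nothing to work with.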
The other disadvantage of OpenDocument Format is that even for simple documents it is extremely complex. For example, unzipping a one-page document of about 120 words results in a collection of files totalling 300K in size. This makes it relatively difficult to locate the meaningful content and structure and transform it into other formats for viewing or other uses. Instead of leaving documents in this complex format and having a hard job writing converters (XSLT stylesheets) for all possible future uses, it would be better to store documents in a simple, clear, well-structured format that makes converters easier to write.
3.3.3. Other word processing formats
There are several, but none of them has much market share, nor do any of them have any particularly conspicuous advantages. Probably the best strategy with these is to convert them into Word or OpenDocument Format, then treat them in the same way as the majority of documents. OpenOffice.org will open many file formats, so it can be used as a generic first stage in any process of converting documents into useful formats. Use OpenOffice.org in server mode to open all documents and save them in OpenDocument Format, then process them into something better.
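As a rough illustration of such a first stage, the sketch below drives a headless office process from Python. It rests on an assumption that goes beyond the text: it uses the command-line conversion options of LibreOffice, a later descendant of OpenOffice.org, whose soffice binary accepts --headless, --convert-to and --outdir. The function names are hypothetical, and a repository would need error handling and queuing around this.

```python
import shutil
import subprocess
from pathlib import Path


def conversion_command(source: Path, outdir: Path, binary: str = "soffice") -> list:
    """Build the argument list for a headless conversion to ODF text (.odt)."""
    return [binary, "--headless", "--convert-to", "odt",
            "--outdir", str(outdir), str(source)]


def convert_to_odf(source: Path, outdir: Path) -> Path:
    """Convert one document and return the path where the .odt is expected."""
    binary = shutil.which("soffice")
    if binary is None:
        # Assumption: the office suite is installed and on the PATH.
        raise RuntimeError("no soffice binary found on PATH")
    subprocess.run(conversion_command(source, outdir, binary), check=True)
    return outdir / (source.stem + ".odt")


print(conversion_command(Path("thesis.doc"), Path("converted")))
```

Keeping the command construction separate from the call that runs it makes the ingest step easy to inspect and log without an office suite installed.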
3.4. PDF
Many repositories seem to have adopted PDF as their main format for text documents, both for storage and for access. PDF has some good points:
• It is easy to create, either using Adobe Acrobat software or using the PDF Export feature available in both Microsoft Word and OpenOffice.org Writer.
• It can be viewed on all platforms using the free Adobe Acrobat Reader software (with some caveats, see below).
• It is extremely effective at preserving the formatting of a document. For some applications (for example in legal contexts) this may be of vital importance.
However, there are some serious problems with using PDF as a storage format [16]:
• The format is owned by Adobe. While it is currently open, the company could decide to keep future versions secret, charge for use, mandate the use of their software etc. Remember the controversy over Unisys and the GIF image file format [12].
• There are some compatibility problems between different versions.
• Documents may rely on system fonts. There is an option in PDF to embed all fonts in the document, but not all software uses this, and some PDF viewing software either cannot locate the correct fonts or doesn’t know how to substitute suitable alternatives. Failing to embed all fonts can result in a serious degradation of the on-screen appearance of a document, or in a complete failure to display the content.
  For example, I recently asked my second-year software engineering students to send me their reports as one-page PDF documents. Most were fine, but a small number of students who prepared their reports using Microsoft Word sent me documents that I could view but not print from Adobe Acrobat Reader on Linux. However the Evince Document Viewer program that came bundled with the Fedora Core 4 Linux distribution printed them perfectly.
• PDF includes extra features like encryption, compression, digital rights management and embedding of objects from other software packages. These all present difficulties, particularly the last.
PDF is an excellent access format for printing to paper. Any good preservation system should be able to generate PDF renditions of documents for this purpose. PDF is not so good for viewing on screen, as it ties document content to a fixed page size. This means that for large page sizes or small screens (e.g. on handheld devices like PDAs or mobile phones) text will either be too small to read or the user will have to scroll back and forth along the lines, which is highly inconvenient. Looking ahead, who knows what viewing formats we will use. We need to be able to reformat content to fit the viewing device.
PDF is not a good preservation format. It is true that some of the problems listed above can be avoided by taking care when creating the PDF. For example, a workflow that ensures all fonts are embedded in documents, prevents proprietary embedded objects, forbids the use of encryption and DRM and ensures that compression is turned off would go a long way toward solving the problems above. It won’t solve the issue of Adobe owning the format. It will also be difficult to enforce such a policy when people send in PDFs for preservation and don’t want to (or don’t know how to) recreate them according to the rules.
3.5. RTF
RTF stands for Rich Text Format. It is a Microsoft specification [17], but they have published it, so one could argue that it is an open standard. It is certainly widely interoperable, with most word processors capable of reading and writing RTF. There are problems with using RTF as a preservation format:
• It is still defined by a corporation, with all the risks that entails.
• There seem to be parts of the specification that are not in the publicly available specification document, and which have changed over the years.
• The specification is not complete and precise, leaving many little quirks.
The National Library of Australia has chosen RTF as its main preservation format [5]. I think a well-chosen XML file format has significant advantages over RTF, but it might well be worth retaining RTF as an access format, since it has good interoperability.
3.6. XML
XML [18] is widely accepted as a desirable format for document preservation. See for example the assessment of XML on the US Library of Congress digital formats web site [19] and the related conference paper by Arms & Fleischauer [20]. The reasons are simple:
• XML is a free, open standard.
• XML uses standard character encodings, including full support for Unicode. This makes it capable of describing almost anything in any language.
• XML is based on plain text. This gives it the best possible chance of being readable far into the future. Even if XML and XSLT are no longer available, the raw document content and markup will still be human-readable. (This will be true even if the meaning of the markup has been lost, although formats designed with preservation in mind should make the meaning more or less apparent from the carefully chosen element and attribute names.)
• XML can easily be transformed into other formats using XSLT [21].
This last point is very important. It means that documents which are stored in XML can be viewed in multiple formats. A minimal solution would generate HTML for on-screen viewing and PDF for printing.
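The idea of generating an access format from a stored structured document can be sketched in a few lines. In the example below plain Python stands in for XSLT, purely to keep the sketch self-contained; a production system would more likely use XSLT stylesheets. The sample document uses a tiny DocBook-like subset (article, section, title, para), and the function name to_html is hypothetical.

```python
import xml.etree.ElementTree as ET

# A stored document in a tiny DocBook-like subset (illustrative only).
DOCBOOK = """<article>
  <title>Preservation of word processing documents</title>
  <section>
    <title>Introduction</title>
    <para>Word processing documents are a major problem.</para>
  </section>
</article>"""


def to_html(node, depth=1):
    """Map article/section/title/para onto div/h1..h6/p elements."""
    if node.tag == "title":
        # Heading level follows the nesting depth of the enclosing section.
        out = ET.Element(f"h{min(depth, 6)}")
        out.text = node.text
    elif node.tag == "para":
        out = ET.Element("p")
        out.text = node.text
    else:
        # article or section: a container whose children are converted in turn.
        out = ET.Element("div", {"class": node.tag})
        for child in node:
            child_depth = depth + 1 if child.tag == "section" else depth
            out.append(to_html(child, child_depth))
    return out


html = to_html(ET.fromstring(DOCBOOK))
print(ET.tostring(html, encoding="unicode"))
```

Because the nesting of sections is explicit in the stored document, the converter can derive heading levels from depth alone. A second function over the same tree could target a print format without touching the stored file.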
However, just saying “XML is the answer” isn’t enough. Unfortunately this seems to be as far as most of the literature I’ve seen goes. XML on its own is little better than plain text. What makes it useful is when documents conform to a standard DTD or schema. Having an XML-based preservation strategy means choosing one or more (but preferably very few) XML document formats. It also means having a workable method for converting documents into that format.
To summarise, if we decide to convert word processing documents into XML for archiving, this raises two issues:
• What XML format(s) to target, and
• How to do the conversion.
I address the first issue in sections 3.6.1–3.6.4 below, and the second issue in section 4.
3.6.1. DocBook XML
DocBook [22] is a rich and mature format that has been in use for about 15 years. It was originally an SGML format designed for marking up computer documentation (like the O’Reilly books), but its application is wider, although it still seems a bit awkward and ill-matched to non-technical writing. DocBook is an OASIS [23] standard.
DocBook is huge, with over 300 elements. This makes it quite hard to learn, and cumbersome to use directly. Very few people create DocBook documents by hand. Of course that’s of no concern to the ordinary author if the transformation from word processor formats to DocBook is done automatically. It may be a concern for the unlucky person who has to write stylesheets for converting documents to and from DocBook. Fortunately though, Norm Walsh (the guiding force behind DocBook) and others have written a comprehensive set of XSLT stylesheets [24] for converting from DocBook XML into numerous formats including XSL-FO (and hence PDF), XHTML, HTML, HTML Help, JavaHelp, Eclipse Help and so on. This is a huge head start.
For converting word processor files to DocBook, the complexity and number of elements doesn’t matter, since the conversion process will probably target only a small subset of DocBook. This is the approach I have adopted with the Digital Scholar’s Workbench (see Section 6 below).
3.6.2. TEI
TEI stands for the Text Encoding Initiative [25]. Its guidelines are aimed mostly at the preservation of literary and linguistic texts (so a very different slant to DocBook). Like DocBook, TEI is huge. Furthermore, it’s not exactly a format, but a set of guidelines for building more specialised formats. One such is TEI-Lite, which has proved very popular, and is used by several serious repositories.
TEI may be better-matched than DocBook to some scholarly work, particularly in the humanities. It does have some serious shortcomings however:
• It uses abbreviated element names like <p> for paragraph (where DocBook uses <para>). This is presumably to make it easier to key in by hand, but it is a problem for sustainability since it may make it more difficult to recover the meaning of the markup in the distant future.
• It has a set of customisable XSLT stylesheets written by Sebastian Rahtz [26]. I have no experience with using them but the impression I get is that they are less mature and less comprehensive than the Norm Walsh DocBook XSL stylesheets [24]. This is definitely worth further investigation.
Whether or not the TEI XSLT stylesheets are up to the job, TEI needs to be considered as a serious candidate for a preservation format for some scholarly writing. Ideally a full solution to the preservation problem would support both DocBook and TEI, allowing authors or curators/archivists to choose the most suitable format for preserving each work (or collection of works).
3.6.3. XHTML+CSS
Since XHTML [27] is both a valid XML document format and can be displayed by web browsers without transformation, with the formatting controlled by a CSS [28] stylesheet (embedded or external or a combination of both), this has been suggested as a possible archival format.
I don’t recommend it, except perhaps for low-value documents that archivists cannot afford the time needed to get into DocBook or TEI. In these cases a reasonable strategy might be to store the document in OpenDocument Format and add an automatically generated, perhaps poor quality, XHTML+CSS version for easy viewing and searching. This could either be stored in the repository alongside the ODF version, or could be generated on the fly by a front-end like Cocoon [29].
Why not use XHTML+CSS for all documents?
• Firstly, it’s essentially a flat format which means it’s harder to do useful conversion into other formats in the future. It’s possible to use the <div> element creatively to add lots of structure, but if you’re going to do that, you’re much better off using a well-defined structured format like DocBook or TEI. (Why? Because in those formats the structural elements are rigorously defined, while in XHTML you can use divs however you like, making it hard for processing applications to know what to do. See Section 3.6.4 on Custom schemata below.)
• CSS relies on consistent use of the “class” attribute in the XHTML. There is no standard for doing this. Same problem as above.
• CSS is not XML, so parsing it to convert it into some new format in the future is much harder than with XML formats.
3.6.4. Custom schemata
One of the biggest traps in the XML world is the idea that you create your own document schema that perfectly matches your particular needs. A university could create specialised document types for lectures, lab exercises, reading lists, research papers, internal memos, minutes of meetings, rules, policies, agendas, monographs... The list goes on. There are serious problems with this approach, as Tim Bray points out [30]. For sustainability, the problems basically narrow down to maintenance and interoperability.
The first problem is maintenance. Each of these formats will require stylesheets for rendering into whatever viewing formats are needed. A reasonable short list would be HTML, PDF and plain text. That’s three stylesheets for each document type. What happens next is that someone wants to add an element to one of the document types: a mathematician wants to use maths in a meeting agenda, for example. Every time this happens, you have to modify all the stylesheets for that document type. With some care in the design, there will be elements that are common to the different document types, and it may be possible to do some sharing of templates, but in general the workload is large and ongoing.