
Beyond TIFF and JPEG2000:
PDF/A as an OAIS submission
information package container
Yan Han
The University of Arizona Libraries,
The University of Arizona, Tucson, Arizona, USA
Abstract
Purpose – The purpose of this paper is to introduce PDF/A to replace TIFF as the preferred file format for digitization of textual documents. In addition, PDF/A can be used as an open archival information system (OAIS) submission information package (SIP) container to reduce digitization and digital preservation costs.
Design/methodology/approach – The author first reviews the current digitization guidelines and the OAIS model, and provides an overview of the development of PDF and PDF/A as international standards. A literature review of the uses of PDF/A is then presented. The author analyzes the pitfalls of TIFF as the preferred format for digitization, and shows how to use PDF/A to code a digitization SIP.
Findings – The TIFF file format has been the preferred master file format in the Federal Agency Digitization Guidelines Initiative digitization guidelines for the past 20 years. However, the TIFF format has drawbacks. Literature reviews show that PDF/A has been the preferred standard for coding born-digital documents in the court, government and business sectors. PDF/A-2 and PDF/A-3 are relatively new standards released after 2010. However, few have understood the standards and utilized their full potential in digitization. The author shows that PDF/A can be used as an OAIS SIP container.
Practical implications – In order to deliver OAIS SIPs, current practices require a combination of files, directories and various types of metadata. The author shows that PDF/A (PDF/A-2 and/or PDF/A-3) can be a better file format for textual document digitization, with various types of metadata coded in the extensible metadata platform (XMP), and that arbitrary files/data can be coded in PDF/A-3. These features of PDF/A provide much better ways to deliver SIPs in a cost-efficient manner.
Originality/value – PDF/A has been recognized as the preferred standard for born-digital documents, but it has not been used as the preferred file format for digitized materials. The author recommends: PDF/A with lossless JPX compression as the preferred file format; and PDF/A with lossless JPX compression along with metadata/data as the preferred OAIS SIP container. As a result, these uses reduce costs in digitization and digital preservation and also increase productivity. The author recommends updating the national and international digitization practices using PDF/A.
Keywords Digital documents, Digitization, Standards, Digital preservation, PDF/A
Paper type Research paper
Library Hi Tech
Vol. 33 No. 3, 2015
pp. 409-423
© Emerald Group Publishing Limited
0737-8831
DOI 10.1108/LHT-06-2015-0068
Received 26 June 2015
Revised 13 July 2015
Accepted 15 July 2015
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0737-8831.htm
The author would like to thank Leonard Rosenthol, Project leader for ISO PDF/A and Adobe PDF Architect for his comments on TIFF, PDF and PDF/A file formats.
1. Background
1.1 Overview of current digitization guidelines
Libraries, museums and archives have been digitizing materials for preservation and access since the 1990s. Over the past 20 years, Federal agencies such as the National Archives and the Digital Library Federation (DLF) have published several critical digitization guidelines and best practices, which have been the de facto standards for digitization projects in libraries, archives and museums. These guidelines were written by experts and specify in great detail every aspect of digitization, including file format and various metadata considerations. These guidelines greatly influence almost all institutions’ digitization projects, create standardized digitization practices in the communities and contribute to access and preservation of scholarship and cultural heritage resources in reaching various audiences. These guidelines are:
2002: The DLF published Benchmark for Faithful Digital Reproductions of Monographs and Serials (Version 1), available at: http://purl.oclc.org/DLF/benchrepro0212
US National Archives and Records Administration (2004): Technical Guidelines for Digitizing Archival Records for Electronic Access: Creation of Production Master Files – Raster Images.
2008: Bibliographic Center for Research (BCR) published its BCR’s CDP Digital Imaging Best Practices Version 2.0, available at: http://books.google.com/books/about/BCR_s_CDP_Digital_Imaging_Best_Practices.html?id=vjeEXwAACAAJ
Federal Agency Digitization Guidelines Initiative (2010) released digitization guidelines related to audio, video and digital imaging. These federal agencies include the National Archives and Records Administration, Library of Congress, the Government Printing Office and other federal libraries. The set of guidelines includes Technical Guidelines for Digitizing Cultural Heritage Materials and Embedded Metadata in TIFF Images. The Technical Guidelines for Digitizing Cultural Heritage Materials draws substantially on the National Archives’ Technical Guidelines for Digitizing Archival Records listed above.
2013: National Archives of Australia has scanning specifications, available at: www.naa.gov.au/Images/ScanSpecsAmended22082013_tcm16-70095.pdf. All the requirements are similar to or the same as the US National Archives guidelines, except that it also recommends using PDF/A as a preferred file format.
1.2 Analysis of file formats in digitization guidelines
Currently all US digitization guidelines, including the Federal agencies’ technical guidelines for digitization, favor TIFF 6.0 as “Preferred format for production master file” and JPEG2000 as “Increasingly considered as a viable format for master image files, but not yet widely adopted.” PDF is listed as “Not recommended for production master files,” while the PDF/A section is empty as it was considered along with PDF (Federal Agency Digitization Guidelines Initiative (FADGI), 2010). FADGI adopted quite a lot from the US National Archives and Records Administration (2004) guidelines. Several PDF standards have been published since 2010, which provide new and better ways to digitize documents. Moreover, in the Federal Agency Digitization Guidelines Initiative (FADGI) format document there are a few technical discrepancies referring to PDF and PDF/A. For example, PDF 1.4 and earlier do not support JPEG2000 compression, and PDF/A has multiple versions (Appendix).
The scanning guidelines by the National Archives of Australia recommend using PDF/A in addition to TIFF and JPEG2000 as the preferred formats. However, several important things are missing in this guideline. First, there is no specific guidance on PDF/A standards: PDF/A has three standards, each with its own features. There are also multiple ways of coding raster images in PDF/A: with no compression, lossless compression or lossy compression. To ensure high-quality images, an institution shall adopt a policy regarding how to handle raster
images in PDF/A. Second, there are no metadata coding instructions: the guideline does not have recommendations on how to code various metadata information.
The current preferred file formats (the latest FADGI, 2010 guideline and the latest US National Archives and Records Administration’s, 2004 guideline) result in more management overhead and higher costs in operation, file management and long-term preservation. In the past several years, there have been new developments in the international standardization of file formats and metadata. In this paper, the author proposes a different file format, PDF/A, over the current preferred TIFF 6.0 or JPEG2000 for textual document digitization. In addition, PDF/A can be used for digitization of other materials such as graphic illustrations, maps and aerial photographs. Furthermore, beyond simply serving as a file format, PDF/A can be used as an open archival information system (OAIS) submission information package (SIP) container. The international open standards of PDF/A simplify the digitization process, reduce digitization cost, improve production substantially and build more confidence for preservation and access. The next section will discuss why and how.
1.3 Overview of OAIS model and digitization
OAIS, ISO 14721:2012 (previously ISO 14721:2003), is well known in library and archive communities as it is the de facto standard for producing, processing, archiving and delivering information from producers to consumers. An OAIS is “an archive, consisting of an organization, which may be part of a larger organization, of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community” (CCSDS, 2012). It provides frameworks to ingest, preserve and provide access to facilitate information flow from producers to consumers. In addition, the OAIS defines an information model in which an OAIS information package (IP) shall include: content information; preservation description information; packaging information; and descriptive information. Three types of IPs are defined: SIP, archival information package (AIP) and dissemination information package (DIP). Figure 1, adopted from the OAIS Magenta book, illustrates the OAIS IPs and data flow.
A producer produces SIPs through the digitization process. After that, one or more SIPs are transformed into AIPs during internal management and finally one or more DIPs will be delivered to the consumer. Section 3 of this paper discusses a real example of SIPs in digitization and how PDF/A can be a SIP container.
2. PDF standards and uses of PDF/A
2.1 PDF/A standard
First, let’s review the development of PDF as international open standards. The PDF 1.7 specification, “the full-function PDF,” was released under ISO 32000-1 in 2008. Other subset standards were released as ISO standards for “more specialized uses” (Adobe Systems, 2008). Examples are PDF/X (ISO 15930) for electronic printing and PDF/A (ISO 19005) for archiving of digital documents. The three primary goals of PDF/A are to:
“provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files.”
“provide a framework for recording the context and history of electronic documents in metadata within conforming files.”
“define a framework for representing the logical structure and other semantic information of electronic documents within conforming files” (ISO, 2005, 2011, 2012).
To achieve the above goals, PDF/A has additional requirements and also prohibits some PDF features such as encryption. For more details, please consult the PDF Association (www.pdfa.org/) publications on PDF/A topics. Currently the PDF/A standards have three parts: ISO 19005-1 (PDF/A-1), ISO 19005-2 (PDF/A-2) and ISO 19005-3 (PDF/A-3). Any of them can be used for long-term archival purposes. PDF/A-1 is based on PDF 1.4, while PDF/A-2 and PDF/A-3 are based on ISO 32000 (PDF 1.7). The naming of PDF/A may confuse people, as it does not mean that PDF/A-2 is better than PDF/A-1 or that PDF/A-1 is obsolete. Specifically for digitization purposes, one of the most important features in PDF/A-2 and PDF/A-3 is that JPEG2000 compression is supported. This means a 40-60 percent space saving for raster images using lossless JPEG2000 compression with PDF/A-2 and/or PDF/A-3 compared to PDF/A-1. The three standards have different features and they co-exist. PDF/A-1 was the first PDF/A standard, published in 2005, while PDF/A-2 and PDF/A-3 were published in 2011 and in 2012, respectively. The major difference between PDF/A-2 and PDF/A-3 is that PDF/A-3 can embed any arbitrary file or data, while PDF/A-2 can only embed PDF/A files. This feature of PDF/A-3 makes it a universal file container like BagIt. The British Digital Preservation Coalition reported that PDF/A is one of the best file formats to preserve electronic documents and suggested using PDF/A as the standard format for archiving electronic documents (Fanning, 2008).
2.2 Uses of PDF/A in born-digital documents
PDF/A has been widely accepted as a preferred master file format for born-digital documents as a SIP container. In some cases, PDF/A is even used as an AIP container. Governments and courts such as the European Union and the US Federal Courts (2015) (Borstein, 2010), the National Information Standards Organization (NISO, 2007), national libraries, private sectors such as banks and hospitals, and library centers like California Digital Library (2011) and Florida Center for Library Automation (Chou, 2006) have endorsed PDF/A as the required or preferred file format for born-digital documents over other formats such as Word and PDF. Some case studies have been published regarding the file format. For example, Florida Digital Archive conducted a study to evaluate software conversion from PDF to PDF/A-1 (Koo and Chou, 2013). Archaeology Data Service uses PDF/A as AIP and analyzed the benefits and drawbacks (Evans and Moore, 2014).
Figure 1. OAIS information packages and data flow (diagram labels: Producer; Submission Information Packages; OAIS; Archive Information Packages; Consumer; queries; query responses; orders; Dissemination Information Packages; legend: Entity, Information Package Data Object, Data Flow). Source: Adopted from CCSDS (2012)
2.3 Use of PDF/A in digitization
PDF/A has been acknowledged to be the preferred master file format for born-digital materials, but it has not been recognized as a preferred master file format for digitization of textual documents. This paper is intended to present the benefits of using PDF/A for digitization.
Only a few articles have been published about using PDF/A in digitization since the release of PDF/A-1 in 2005. It appears that all the cases used PDF/A-1 as the file format, and none have discussed using PDF/A-2 or PDF/A-3. The Ohio State University Libraries mentioned investigating the use of PDF/A-1 for digitization (Noonan et al., 2010). They did not report whether there is a policy on how to handle the raster images. Most likely, the raster images inside the PDF/A-1 files were saved in lossy compression mode. If this is the case, the quality of the digitized files is lower. In comparison, PDF/A-2 provides a better way to handle raster images due to its JPEG2000 compression feature. In addition, the paper did not report how to code metadata such as technical and descriptive metadata. Optical character recognition (OCR) can be handled differently to achieve better results. Following common digitization practice, South New Hampshire University digitized papers as TIFFs, and made PDFs for access files. Then the PDFs were saved as PDF/A to improve the access files’ longevity and accessibility (Platt, 2010). The use of PDF/A in this case is questionable, as raster images in PDFs do not rely on embedded fonts and an access file has no need for longevity. India National Agricultural Research System digitized documents using PDF/A. Unfortunately the paper did not discuss which PDF/A standard was used, and did not mention its choice of compression policy and metadata information (Veeranjaneyulu, 2014). In summary, none of the reported case studies utilized the full potential of PDF/A.
3. Digitization and SIP
In the OAIS model the producer creates SIPs, which may be in any format that the producer and the archives agree to. An OAIS IP shall include: content information; preservation description information; packaging information; and descriptive information. In the current digitization process, a typical SIP consists of a corresponding directory containing the following information:
Content: preservation master files – raster image files saved in a preferred file format (TIFFs/JPEG2000s), as each page of a textual document is scanned as a raster image. Access files – a compressed PDF and/or the same number of JPEGs/JPEG2000s. Other content such as OCR data.
Preservation description: preservation metadata saved in the TIFF header and other metadata such as structural and technical metadata; checksum files.
Packaging information: directory and file naming, structural metadata.
Descriptive information: descriptive metadata saved in a digitization management system, catalog or textual/XML files.
For example, a book of 100 pages typically consists of the following files: 100 TIFFs/JPEG2000s as master files, one PDF file consisting of compressed images for access, one checksum file consisting of all files’ checksums, one structural metadata file consisting of structural metadata information for this book, and OCR data saved either in the PDF file, in a separate text file or in an ALTO XML file (technical metadata for OCR).
The SIP information is spread across different parts of a file or across multiple files. A TIFF image contains the content (the visual appearance of a physical page from a book); it also contains some of the preservation information in its header tags. However, other content information such as OCRed text from this TIFF image is most likely saved in a separate file. As a result, to gather all the SIP information, ingestion must interact with multiple files and even a database. Dealing with a mix of multiple files in a variety of file formats is error-prone. Here are the examples.
(1) Directory listing
Directory Identifier    Page #s
azu_acku                00000001.tif
                        00000002.tif
                        00000003.tif
                        00000004.tif
                        00000005.tif
                        azu_acku_1.pdf
                        stru_meta.txt
                        checkmd5.txt
(2) Checksum file
The checksum file consists of checksums of all the master image files; its main purpose is to ensure data integrity by detecting errors during data transmission and/or storage.
(3) Structural metadata
Structural metadata describes the logical structure and components of content. This type of metadata can be used for both page-to-page and semantic navigation in delivering digitized materials to enhance users’ access experience. Yale University Library (2008) published a detailed policy on how to use structural metadata at multiple levels. A simple example is a table of contents for a book. More examples of structural metadata can be found in Yale University Library’s best practices for structural metadata. The University of Arizona Libraries (UAL) has a similar practice with the following options:
No structural metadata.
Structural metadata defining file sequence: Yale University Library, Cornell and University of Michigan have used this approach. The UAL uses a modified file naming convention to code file sequence as part of the file name. TIFF files are specifically named with leading zeros to facilitate sorting.
Structural metadata defining logical components: This level is typically for books and manuscripts. Some examples include title pages, chapters and indices. The UAL has been using coding of this type for some of the digitized materials. The information can be saved in a text/XML file, in a database or in METS.
(4) Technical metadata
Technical metadata describes technical attributes of digital objects during digital capture and other processes. It typically comes from digitization equipment such as scanners and digital cameras. Examples are the hardware and software used to produce the digital object, resolutions, file formats and color profiles. Common library practice is to code this metadata in a separate text/XML file or in METS.
(5) Descriptive metadata
Descriptive metadata is the type most commonly used to describe a resource for identification and discovery via search engines and local search functions. Typical elements are title, author and keywords. Typically the descriptive metadata is saved as a separate MARC/MODS/METS/Dublin Core file.
(6) Other metadata such as preservation metadata
Other types of metadata such as preservation and rights metadata might be added in the digitization process, or might be updated at a later stage.
(7) Other data
Other data can be embedded in PDF/A files. For example, OCR data are delivered separately in another file such as a text file or ALTO technical metadata for OCR, or within a PDF.
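The checksum file of example (2) can be illustrated with a short sketch. The paper does not specify the layout of checkmd5.txt, so the "digest, two spaces, file name" line format below is an assumption borrowed from the common md5sum convention, and the function names are hypothetical. The sketch also relies on the zero-padded file names of example (1), which sort correctly as plain strings.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(sip_dir: Path, pattern: str = "*.tif") -> list[str]:
    """Return one 'digest  filename' line per master image, in page order."""
    # Zero-padded names (00000001.tif, ...) sort correctly as plain strings.
    return [f"{md5_of_file(p)}  {p.name}" for p in sorted(sip_dir.glob(pattern))]


def verify_manifest(sip_dir: Path, lines: list[str]) -> list[str]:
    """Return the names of files whose current digest differs from the manifest."""
    mismatched = []
    for line in lines:
        expected, name = line.split("  ", 1)
        if md5_of_file(sip_dir / name) != expected:
            mismatched.append(name)
    return mismatched


if __name__ == "__main__":
    # Stand-in SIP directory with three dummy "page images" (not real TIFFs).
    with tempfile.TemporaryDirectory() as tmp:
        sip = Path(tmp)
        for page in range(1, 4):
            (sip / f"{page:08d}.tif").write_bytes(f"page {page}".encode())
        manifest = build_manifest(sip)
        (sip / "checkmd5.txt").write_text("\n".join(manifest) + "\n")
        print(verify_manifest(sip, manifest))  # empty list: nothing changed
```

A receiving archive would run the verification step at ingest and again after each transfer or storage migration; any file name returned signals that the master image no longer matches the digest recorded by the producer.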
4. PDF/A as an OAIS SIP container
The key requirement of PDF/A is that it is self-described and self-contained so that it can be reproduced exactly the same way with different software on various platforms. All of the information necessary for displaying the document is embedded in the PDF/A file. This includes any content such as text, raster images and vector graphics, fonts and color profiles. For digitization, PDF/A can be used as a structured, self-contained and self-described data container, which codes raster images in uncompressed, lossless and/or lossy compressed mode depending on the users’ preference. The PDF/A file format can achieve structured, self-contained and self-described status by doing the following:
(1) tagged PDF: embed structural metadata via pre-defined PDF tags or create your own tags;
(2) self-contained: embed required color profiles, fonts and other related information; and
(3) self-described using extensible metadata platform (XMP) metadata: PDF/A can code all the required information from an OAIS SIP through the standard and XMP.
All the files and data from the above digitization SIP can be coded in PDF/A as follows:
(1) All the raster images (page 1, page 2, etc.) from a book can be compressed and saved in the desired sequence in one PDF/A file with JPEG2000 lossless compression.
(2) All the metadata can be coded with another ISO metadata standard, XMP (ISO 16684):
Structural metadata can be coded with XMP inside the PDF/A file metadata stream with pre-defined tags. If users need to have their own customized tags, they can do so.
Descriptive metadata can be coded with XMP standardized Dublin Core elements such as dc:contributor and dc:title. If users need to code other descriptive metadata elements such as MODS, they can achieve this via XMP extension.
Rights and media management metadata can use XMP standard namespaces. XMP also has a camera raw metadata namespace if needed.
OCR data in ALTO format can be coded within XMP using extension, or OCR text can be stored in the PDF.
Other metadata such as METS and MODS can be coded using XMP extension.
(3) Any arbitrary files can be saved within a PDF/A-3 file.
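As an illustration of item (2), the sketch below builds a minimal XMP metadata element carrying the two Dublin Core properties named above, dc:title and dc:contributor, using only the Python standard library. It is a sketch under stated assumptions rather than a complete PDF/A workflow: the function name and sample values are hypothetical, the xpacket wrapper is omitted, and embedding the result in a PDF/A metadata stream (and declaring any extension schema that PDF/A requires for custom properties) is left to a PDF library.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP for its wrapper, RDF and Dublin Core.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix: str, local: str) -> str:
    """Return the qualified tag name ElementTree expects: {uri}local."""
    return f"{{{NS[prefix]}}}{local}"


def build_xmp(title: str, contributors: list[str]) -> str:
    """Build an x:xmpmeta element holding dc:title and dc:contributor."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative: rdf:Alt with an x-default entry.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:contributor is an unordered list of names: rdf:Bag.
    bag = ET.SubElement(ET.SubElement(desc, q("dc", "contributor")), q("rdf", "Bag"))
    for name in contributors:
        ET.SubElement(bag, q("rdf", "li")).text = name

    return ET.tostring(xmpmeta, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical descriptive metadata for one digitized book.
    print(build_xmp("Sample digitized book", ["Scanning unit", "Metadata unit"]))
```

Because XMP is RDF/XML, the same pattern extends to other namespaces: a further rdf:Description block with its own namespace URI could carry structural or technical properties alongside the Dublin Core ones.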
4.1 Criteria for master file formats
When choosing a file format for digitization, future viability of the master file format is the primary factor. Therefore, common considerations include non-proprietary, open and documented international standards, common use, no encryption, and no compression or lossless compression. Digital preservation runs on a stack of hardware and software. Rendering any file relies on appropriate hardware, operating systems, libraries and application software. While some experts may argue that no compression is the preferred choice, the author believes that a lossless compressed file using non-proprietary and/or open documented algorithms is as good as an uncompressed file.
4.2 TIFF issues as the preferred master file format
For the past 20 years TIFF 6.0 has been the preferred master file format for digitization due to a few factors such as availability of the technical specification and an easy-to-understand file structure. TIFF does have certain advantages as a file format for raster-image-only materials, as baseline TIFF is very simple and easy to repair and migrate, since it cannot include layers and JPEG or LZW compressions. However, there are several significant issues with the TIFF 6.0 file format and implementation when dealing with textual materials:
Just a raster image format: in the OAIS model, the SIP is generated by information producers, and the IP is ready for ingestion for management. Due to limitations of the TIFF file format, other ways of managing and handling SIPs are required. Although TIFF supports coding multiple images and XMP metadata, current practices limit the uses. One common way is to have all the related data (e.g. image files, structural information) saved in a directory. Another way is to maintain a digitization management system which captures the SIP information. Unfortunately all these ways result in a huge increase in costs along with inefficiency.
Proprietary standard: many people perceive that TIFF (aka TIFF 6.0) is an open standard. They are partially correct. Adobe still holds the copyright on the TIFF 6.0
specification, although Adobe does not require a license to implement TIFF software. A license was required at one time to implement the LZW compression algorithm, but all patents on it have now expired. TIFF/EP (electronic photography), as a subset of TIFF 6.0, is an ISO standard. Although there are no major differences from TIFF 6.0, many of the TIFF 6.0 tags are ignored in TIFF/EP.
Big file size: many institutions choose uncompressed TIFF 6.0 as the master file format. Due to the nature of TIFF, the file size is huge compared to a lossless compressed one. For example, a TIFF compressed with lossless LZW is 30 percent+ smaller in file size. In comparison, a lossless compressed JPEG2000 file is 40-60 percent+ smaller in file size.
Inflexible for web and mobile delivery: in the web or mobile environment, access is critical, while TIFF cannot be viewed directly in browsers (except Safari) or on mobile phones without a plug-in. Appropriate software is required to open the file. Given the huge size as well, digitized textual documents in TIFF have to be converted to some other file format such as PDF and JPEG for access and faster download.
Indexing is difficult: the index of the content in a TIFF file generally cannot be saved with the file itself, and has to be achieved via other ways. For example, OCR data are saved in a textual file, a PDF or an XML format such as ALTO.
Structural metadata not allowed: TIFF does not provide a way to capture structural metadata, which is critical for providing access to digitized manuscripts and journals. To achieve this feature, libraries have been using a database or an additional metadata wrapper such as METS to save structural metadata information.
Inconsistent TIFF tag data: one of the oldest digitization projects from the Library of Congress, American Memory, has used TIFF 5.0 and TIFF 6.0 as the master file format. It is also noted that “the Library’s use of TIFF formats and headers has not always gone smoothly, perhaps the inevitable result of using a “multi-flavor” set of industry conventions rather than a true standard” (Fleischhauer, 1998). Indeed, we have seen various metatags in TIFF headers from different digitization vendors. In addition, in the textual document digitization process, almost all of the TIFF tags such as scanner and dimension are the same. The set of these TIFF tags is stored in each TIFF file header, resulting in many duplicates.
TIFF tags are difficult to work with: FADGI also points out that the proliferation of tags and tag sets complicates TIFF metadata extraction. Most TIFF programs only display TIFF’s baseline, extension and a few private tags. In addition, the extracted data are difficult to use and store because of the different data types for the various tagged fields, and the lack of any systematic data structures and formats (Federal Agencies Digitization Guidelines Initiative (FADGI), 2009).
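The file-size figures quoted in the list above can be turned into a rough storage estimate. The sketch below is illustrative arithmetic only: the scan settings (letter-size page, 400 dpi, 24-bit color) are assumptions not taken from the paper, the function names are hypothetical, and the savings percentages (30 percent for LZW, 40-60 percent for lossless JPEG2000) are the lower-bound figures stated in the text, not measurements.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bits_per_pixel: int) -> int:
    """Size in bytes of one uncompressed raster page (pixel data only)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def after_saving(size_bytes: int, percent_saved: int) -> int:
    """Size that remains after a compression saving `percent_saved` percent."""
    return size_bytes * (100 - percent_saved) // 100


if __name__ == "__main__":
    # Assumed scan settings: 8.5 x 11 inch page, 400 dpi, 24-bit color.
    page = uncompressed_bytes(8.5, 11, 400, 24)
    book = 100 * page  # the 100-page book used as the example in Section 3
    gb = 1_000_000_000
    print(f"uncompressed TIFF:        {book / gb:.2f} GB")
    print(f"LZW TIFF (30% saved):     {after_saving(book, 30) / gb:.2f} GB")
    print(f"lossless JPEG2000 (40-60% saved): "
          f"{after_saving(book, 60) / gb:.2f} to {after_saving(book, 40) / gb:.2f} GB")
```

Under these assumptions a single 100-page book occupies about 4.5 GB uncompressed, so the difference between the compression options becomes significant once a collection reaches thousands of volumes.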
The JPEG2000 file format has many advantages over TIFF 6.0, but also has a few drawbacks. A study conducted by Mr Knijff at the National Library of the Netherlands found limitations of JP2 in handling ICC profiles and differences in the handling of headers among major JPEG2000 software (van der Knijff, 2011). This finding led to a request to amend the JPEG2000 standard. For raster images, JPEG2000 and/or TIFF will still play an important role as a preferred master format. However, their roles as a preferred master format for textual documents are not justified.
4.3 PDF/A as an OAIS SIP container
The author recommends using PDF/A as the preferred file format for production master files for textual document digitization. The author further recommends using:
(1) PDF/A with lossless compression as the preferred file format for high-quality digitization. In addition, PDF/A can be used as an OAIS SIP container beyond merely being used as a file format.
(2) PDF/A with lossless compression along with metadata/data as the preferred OAIS SIP container.
A PDF/A file can be structured, self-contained and self-described for digitization by coding various metadata and raster images within the file. PDF/A is a better file format than TIFF/JPEG2000 for online access and delivery, as it requires no browser plug-in in web and/or mobile environments and can be a tagged PDF with structural information. PDF/A offers the following advantages.
4.3.1 Open International Standards. The PDF standards are now truly open documented international standards. ISO has been releasing multiple PDF-related standards since 2008. It was true that PDF was a proprietary format controlled by Adobe before 2008. ISO published PDF 1.7 as ISO 32000-1:2008, and since then the control of the PDF specification has passed to an ISO committee of volunteer industry experts. In 2008, Adobe published a Public Patent License to ISO 32000-1 granting royalty-free rights for all patents owned by Adobe that are necessary to make, use, sell and distribute PDF compliant implementations. PDF/A-1, PDF/A-2 and PDF/A-3 are all ISO standards under ISO 19005. While many users are used to using Adobe products to handle PDFs, there are a few open source software projects and private companies providing ways to generate and update PDFs. In the worst case scenario, you can write your own software to handle PDFs based on ISO technical specifications.
4.3.2 Self-contained and self-described. PDF/A is suitable as an OAIS SIP container to code all the required digital objects, data and/or metadata. It can package all image objects along with ICC profiles into one PDF/A file instead of a directory of TIFFs/JPEG2000s. Limitations of TIFF 6.0 and JPEG2000 were identified above. For example, the PDF 1.7 specification sections 4.5 “color spaces,” 4.8 “Images” and 10.7 “Tagged PDF” explain how to handle ICC profiles, raster images and structural metadata. In comparison, in the current practice a digitization producer has to use a package of TIFFs/JPEG2000s and associated files to deliver data. As a result, several benefits will be achieved for both producer and receiving institutions. Using PDF/A simplifies the ingestion and delivery process. Reduction in the number of files for AIPs results in less management overhead. The flexibility of XMP makes it easier to code standardized metadata such as Dublin Core and specialized metadata for AIPs and DIPs. As a result, the use of PDF/A files can eliminate the current digitization inventory and/or management system.
4.3.3 Flexible. PDF/A offers options to encode raster images either in uncompressed mode or in a lossless or lossy way with royalty-free compression algorithms. It can use all the compression methods offered in TIFF, and provides the improved compression offered by JPEG2000 for PDF 1.5 and above. These options are very flexible and enable users to handle master or access files in their preferred ways. The options are:
uncompressed: one can code uncompressed images in PDF/A-1;