9.5 PDF/A for Archiving 213
initialize the current color to black at the beginning of each page. Depending on whether
or not an ICC output intent has been specified, it will use the DeviceGray or Lab color
space for selecting black. Use the following call to manually set Lab black color:
p.setcolor("fillstroke", "lab", 0, 0, 0, 0);
In addition to the color spaces listed in Table 9.15, spot colors can be used subject to the
corresponding alternate color space. Since PDFlib uses CIELab as the alternate color
space for the built-in HKS and PANTONE spot colors, these can always be used with
PDF/A. For custom spot colors the alternate color space must be chosen so that it is
compatible with the PDF/A output intent.
Note More information on PDF/A and color space can be found in Technical Note 0002 of the PDF/A
Competence Center at www.pdfa.org.
9.5.5 XMP Document Metadata for PDF/A
PDF/A-1 heavily relies on the XMP format for embedding metadata in PDF documents.
ISO 19005-1 refers to the XMP 2004 specification (see footnote 1); older or newer versions
of the XMP specification are not supported. PDF/A-1 supports two kinds of document-level
metadata: a set of well-known metadata schemas called predefined schemas, and custom
extension schemas. PDFlib will automatically create the required PDF/A conformance
entries in the XMP as well as several common entries (e.g. CreationDate).
User-generated document metadata can be supplied with the metadata option of
PDF_begin/end_document( ). In PDF/A mode PDFlib verifies whether user-supplied XMP
document metadata conforms to the PDF/A requirements. There are no PDF/A
requirements for component-level metadata (e.g. page or image).
XMP metadata from imported PDF documents can be fetched from the input PDF via
the pCOS path /Root/Metadata.
Cookbook A full code sample can be found in the Cookbook topic interchange/import_xmp_from_pdf.
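As a rough illustration of user-supplied document metadata, the following Java sketch assembles a small XMP document which uses only the Dublin Core schema. The class XmpSketch and its methods are hypothetical helpers and not part of the PDFlib API; the option list mentioned in the comment is an assumption which should be checked against the PDFlib Reference.

```java
/** Hypothetical helper (not part of PDFlib): builds a minimal XMP document. */
class XmpSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Escapes the XML special characters which may occur in element content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Returns an XMP document with a single dc:title entry (a language alternative). */
    static String build(String title) {
        return "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
            + " <rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">\n"
            + "  <rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">\n"
            + "   <dc:title>\n"
            + "    <rdf:Alt>\n"
            + "     <rdf:li xml:lang=\"x-default\">" + escape(title) + "</rdf:li>\n"
            + "    </rdf:Alt>\n"
            + "   </dc:title>\n"
            + "  </rdf:Description>\n"
            + " </rdf:RDF>\n"
            + "</x:xmpmeta>\n";
        // Assumption: the result is written to a file such as meta.xmp, which is then
        // named in the metadata option, e.g. "metadata={filename=meta.xmp}".
    }
}
```

The string returned by XmpSketch.build("Annual Report") could be stored in a file and supplied to the metadata option. Whether additional entries are needed depends on the application; PDFlib adds the PDF/A conformance entries itself as described above.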
Predefined XMP schemas. PDF/A-1 supports all schemas in XMP 2004. These are called
predefined schemas, and are listed in Table 9.16 along with their namespace URI and the
preferred namespace prefix. Only those properties of the predefined schemas which are
listed in XMP 2004 may be used. A full list of all properties in the predefined XMP schemas
for PDF/A-1 is available in a TechNote published by the PDF/A Competence Center.
1. See www.aiim.org/documents/standards/xmpspecification.pdf
Table 9.16 Predefined XMP schemas for PDF/A-1
Schema name and description                namespace URI                         preferred
(see XMP 2004 for details)                                                       namespace prefix
Adobe PDF schema                           http://ns.adobe.com/pdf/1.3/          pdf
Dublin Core schema                         http://purl.org/dc/elements/1.1/      dc
EXIF schema for EXIF-specific properties   http://ns.adobe.com/exif/1.0/         exif
EXIF schema for TIFF properties            http://ns.adobe.com/tiff/1.0/         tiff
Photoshop schema                           http://ns.adobe.com/photoshop/1.0/    photoshop
XMP Basic Job Ticket schema                http://ns.adobe.com/xap/1.0/bj/       xmpBJ
XMP Basic schema                           http://ns.adobe.com/xap/1.0/          xmp
XMP Media Management schema                http://ns.adobe.com/xap/1.0/mm/       xmpMM
XMP Paged-Text schema                      http://ns.adobe.com/xap/1.0/t/pg/     xmpTPg
XMP Rights Management schema               http://ns.adobe.com/xap/1.0/rights/   xmpRights
Chapter 9: Generating various PDF Flavors
XMP extension schemas. If your metadata requirements are not covered by the
predefined schemas you can define an XMP extension schema. PDF/A-1 describes an
extension mechanism which must be used when custom schemas are to be embedded in a
PDF/A document. Table 9.17 summarizes the schemas which must be used for describing
one or more extension schemas, along with their namespace URI and the required
namespace prefix. Note that the namespace prefixes are required (unlike the preferred
namespace prefixes for predefined schemas).
The details of constructing an XMP extension schema for PDF/A-1 are beyond the
scope of this manual. Detailed instructions are available from the PDF/A Competence
Center.
XMP document metadata packages can be supplied to the metadata options of
PDF_begin_document( ), PDF_end_document( ), or both.
Cookbook Full code and XMP samples can be found in the Cookbook topics
pdf_flavors/pdfa_extension_schema and pdf_flavors/pdfa_extension_schema_with_type.
Table 9.17 PDF/A-1 extension schema container schema and auxiliary schemas
Schema name and description                       namespace URI (1)                        required
                                                                                           namespace prefix
PDF/A extension schema container schema:          http://www.aiim.org/pdfa/ns/extension/   pdfaExtension
container for all embedded extension schema
descriptions
PDF/A schema value type: describes a single       http://www.aiim.org/pdfa/ns/schema#      pdfaSchema
extension schema with an arbitrary number of
properties
PDF/A property value type: describes a single     http://www.aiim.org/pdfa/ns/property#    pdfaProperty
property
PDF/A ValueType value type: describes a custom    http://www.aiim.org/pdfa/ns/type#        pdfaType
value type used in extension schema properties;
only required if types beyond the XMP 2004 list
of types are used.
PDF/A field type schema: describes a field in a   http://www.aiim.org/pdfa/ns/field#       pdfaField
structured type
1. Note that the namespace URIs are incorrectly listed in ISO 19005-1, and have been corrected in Technical Corrigendum 1.
9.5.6 PDF/A Validation
Bavaria report on PDF/A validation. PDFlib GmbH conducted comprehensive testing
of current PDF/A validation tools. A variety of non-conforming and conforming PDF/A
documents have been processed with validators and the results checked against the
standard. The validation report and associated test documents are available at the
following location:
www.pdflib.com/developer/pdfa/validation-report
Acrobat 9.1. The Preflight tool in Acrobat 9.1 is based on the ISO 19005-1 standard,
Technical Corrigendum 1 and the relevant TechNotes published by the PDF/A Competence
Center. It fixes several problems in Acrobat 9.0. If you want to validate PDF/A
documents with Acrobat we strongly recommend Acrobat 9.1 or above.
Acrobat 9.0. The Preflight tool in Acrobat 9.0 includes validation profiles for PDF/A-1a
and PDF/A-1b, and validates according to the ISO standard including Technical
Corrigendum 1. Although we identified a few areas where Preflight issues inappropriate
warnings, PDFlib documents are not affected by the majority of these problems.
Acrobat 7 and 8. These Acrobat versions should not be used for PDF/A validation.
Acrobat 7 does not implement important clarifications in Technical Corrigendum 1 of
ISO 19005-1. Acrobat 8 does not fully check all relevant areas (e.g. XMP extension
schemas for PDF/A and some font-related aspects).
9.5.7 Viewing PDF/A Documents in Acrobat
Acrobat 8/9 offers a special PDF/A viewing mode which can be configured in Edit,
Preferences, General..., Documents, PDF/A View Mode. Note that some features behave
differently in the PDF/A view mode of Acrobat 9:
- In Acrobat 9.0 bookmarks can no longer be activated. This is a bug and has been fixed
  in Acrobat 9.1.
- Links can no longer be activated. Instead, Acrobat displays the link target (URI for
  Web links) when the mouse pointer is located in the link area. The links are inactive
  by design; they will work again when the document is displayed without PDF/A view
  mode.
9.6 Tagged PDF
Tagged PDF is a certain kind of enhanced PDF which enables additional features in PDF
viewers, such as accessibility support, text reflow, reliable text extraction and
conversion to other document formats such as RTF or XML.
PDFlib supports Tagged PDF generation. However, Tagged PDF can only be created if
the client provides information about the document’s internal structure, and obeys
certain rules when generating PDF output.
Note PDFlib doesn’t support custom structure element types (i.e. only standard structure types as
defined by PDF can be used), role maps, and structure element attributes.
Cookbook Code samples regarding Tagged PDF issues can be found in the document_interchange category
of the PDFlib Cookbook.
9.6.1 Generating Tagged PDF with PDFlib
Cookbook A full code sample can be found in the Cookbook topic document_interchange/starter_tagged.
Required operations. Table 9.18 lists all operations required to generate Tagged PDF
output. Not calling one of the required functions while in Tagged PDF mode will trigger
an exception.
Table 9.18 Operations which must be applied for generating Tagged PDF
item                    PDFlib function and option requirements for Tagged PDF compatibility
Tagged PDF output       The tagged option in PDF_begin_document( ) must be set to true.
document language       The lang option in PDF_begin_document( ) should be set to specify the natural
                        language of the document. It should initially be set for the document as a whole,
                        but can later be overridden for individual items on an arbitrary structure level.
structure information   Structure information and artifacts must be identified as such. All
                        content-generating API functions should be enclosed by
                        PDF_begin_item( ) / PDF_end_item( ) pairs.
Unicode-compatible text output. When generating Tagged PDF, all text output must
use fonts which are Unicode-compatible as detailed in Section 5.4.4, »Unicode-compatible
Fonts«, page 111. This means that all used fonts must provide a mapping to Unicode.
Non Unicode-compatible fonts are only allowed if alternate text is provided for the
content via the ActualText or Alt options in PDF_begin_item( ). PDFlib will throw an exception
if text without proper Unicode mapping is used while generating Tagged PDF.
Note In some cases PDFlib will not be able to detect problems with wrongly encoded fonts, for
example symbol fonts encoded as text fonts. Also, due to historical problems PostScript fonts with
certain typographical variations (e.g., expert fonts) are likely to result in inaccessible output.
Page content ordering. The ordering of text, graphics, and image operators which
define the contents of the page is referred to as the content stream ordering; the content
ordering defined by the logical structure tree is referred to as logical ordering. Tagged
PDF generation requires that the client obeys certain rules regarding content ordering.
The natural and recommended method is to sequentially generate all constituent
parts of a structure element, and then move on to the next element. In technical terms,
the structure tree should be created during a single depth-first traversal.
A different method which should be avoided is to output parts of the first element,
switch to parts of the next element, return to the first, etc. In this method the structure
tree is created in multiple traversals, where each traversal generates only parts of an
element.
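The recommended single depth-first traversal can be pictured with the following Java sketch. It does not call PDFlib at all: StructNode and DepthFirstSketch are hypothetical classes which merely record the order in which PDF_begin_item( ) and PDF_end_item( ) would be called if every element were completed before its next sibling is started.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a structure element; not a PDFlib class. */
class StructNode {
    final String tag;
    final List<StructNode> children = new ArrayList<>();

    StructNode(String tag) { this.tag = tag; }

    /** Appends a child element and returns this node to allow chaining. */
    StructNode add(StructNode child) { children.add(child); return this; }
}

class DepthFirstSketch {
    /** Records begin/end events in the order of a single depth-first traversal. */
    static void emit(StructNode node, List<String> log) {
        log.add("begin " + node.tag);   // position of the PDF_begin_item( ) call
        for (StructNode child : node.children) {
            emit(child, log);           // complete each child before its next sibling
        }
        log.add("end " + node.tag);     // position of the PDF_end_item( ) call
    }

    /** Small example tree: an article with one section (one paragraph) and a table. */
    static List<String> demo() {
        StructNode art = new StructNode("Art")
            .add(new StructNode("Sect").add(new StructNode("P")))
            .add(new StructNode("Table"));
        List<String> log = new ArrayList<>();
        emit(art, log);
        return log;
    }
}
```

In this ordering no element is reopened after it has been closed, which is the property that distinguishes the recommended method from the multiple-traversal method described above.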
Importing Pages with PDI. Pages from Tagged PDF documents or other PDF documents
containing structure information cannot be imported in Tagged PDF mode since
the imported document structure would interfere with the generated structure.
Pages from unstructured documents can be imported, however. Note that they will
be treated »as is« by Acrobat’s accessibility features unless they are tagged with
appropriate ActualText.
Artifacts. Graphic or text objects which are not part of the author’s original content
are called artifacts. Artifacts should be identified as such using the Artifact pseudo tag,
and classified according to one of the following categories:
- Pagination: features such as running heads and page numbers
- Layout: typographic or design elements such as rules and table shadings
- Page: production aids, such as trim marks and color bars.
Although artifact identification is not strictly required, it is strongly recommended to
aid text reflow and accessibility.
Inline items. PDF defines block-level structure elements (BLSE) and inline-level
structure elements (ILSE) (see the PDFlib Reference for a precise definition). BLSEs may contain
other BLSEs or actual content, while ILSEs always directly contain content. In addition,
PDFlib makes the following distinction:
Table 9.19 Regular and inline items
                                        regular items            inline items
affected items                          all grouping elements    all ILSEs and non-structural
                                        and BLSEs                tags (pseudo tags)
regular/inline status can be changed    no                       only for ASpan items
part of the document’s structure tree   yes                      no
can cross page boundaries               yes                      no
can be interrupted by other items       yes                      no
can be suspended and activated          yes                      no
can be nested to an arbitrary depth     yes                      only with other inline items
The regular vs. inline decision for ASpan items is under client control via the inline
option of PDF_begin_item( ). Forcing an accessibility span to be regular (inline=false) is
recommended, for example, when a paragraph which is split across several pages contains
multiple languages. Alternatively, the item could be closed, and a new item started on
the next page. Inline items must be closed on the page where they have been opened.
Recommended operations. Table 9.20 lists all operations which are optional, but
recommended when generating Tagged PDF output. These features are not strictly
required, but will enhance the quality of the generated Tagged PDF output and are
therefore recommended.
Table 9.20 Operations which are recommended for generating Tagged PDF
item                     Recommended PDFlib functions and options for Tagged PDF compatibility
hyphenation              Word breaks (separating words in two parts at the end of a line) should be
                         presented using a soft hyphen character (U+00AD) as opposed to a hard hyphen
                         (U+002D).
word boundaries          Words should be separated by space characters (U+0020) even if this would not
                         strictly be required for positioning. The autospace parameter can be used for
                         automatically generating space characters after each call to one of the show
                         functions.
artifacts                In order to distinguish real content from page artifacts, artifacts should be
                         identified as such using PDF_begin_item( ) with tag=Artifact.
Type 3 font properties   The familyname, stretch, and weight options of PDF_begin_font( ) should be
                         supplied with reasonable values for all Type 3 fonts used in a Tagged PDF
                         document.
interactive elements     Interactive elements, e.g. links, should be included in the document structure
                         and made accessible if required, e.g. by supplying alternate text. The tab order
                         for interactive elements can be specified with the taborder option of
                         PDF_begin/end_document( ) (this is not necessary if the interactive elements are
                         properly included in the document structure).
Prohibited operations. Table 9.21 lists all operations which are prohibited when
generating Tagged PDF output. Calling one of the prohibited functions while in Tagged PDF
mode will trigger an exception.
Table 9.21 Operations which must be avoided when generating Tagged PDF
item                     PDFlib operations to be avoided for Tagged PDF compatibility
non-Unicode compatible   Fonts which are not Unicode-compatible according to Section 5.4.4,
fonts                    »Unicode-compatible Fonts«, page 111, must be avoided.
PDF import               Pages from PDF documents which contain structure information (in particular:
                         Tagged PDF documents) must not be imported.
9.6.2 Creating Tagged PDF with direct Text Output and Textflows
Minimal Tagged PDF sample. The following sample code creates a very simplistic
Tagged PDF document. Its structure tree contains only a single P element. The code uses
the autospace feature to automatically generate space characters between fragments of
text:
if (p.begin_document("hello-tagged.pdf", "tagged=true") == -1)
    throw new Exception("Error: " + p.get_errmsg());
/* automatically create spaces between chunks of text */
p.set_parameter("autospace", "true");
/* open the first structure element as a child of the document structure root (=0) */
int id = p.begin_item("P", "Title={Simple Paragraph}");
p.begin_page_ext(0, 0, "width=a4.width height=a4.height");
int font = p.load_font("Helvetica-Bold", "unicode", "");
p.setfont(font, 24);
p.show_xy("Hello, Tagged PDF!", 50, 700);
p.continue_text("This PDF has a very simple");
p.continue_text("document structure.");
p.end_page_ext("");
p.end_item(id);
p.end_document("");
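The hyphenation and word boundary recommendations of Table 9.20 concern the text strings which the client passes to the show functions. The following Java sketch shows one way of preparing such strings before output. TaggedTextSketch is a hypothetical helper, not part of PDFlib, and it assumes that the client already knows where a word is to be broken.

```java
/** Hypothetical helper (not part of PDFlib) for preparing text for Tagged PDF output. */
class TaggedTextSketch {
    static final char SOFT_HYPHEN = '\u00AD';

    /**
     * Splits a word at index pos for a line break. The first part ends with a
     * soft hyphen (U+00AD) instead of a hard hyphen (U+002D).
     */
    static String[] breakWord(String word, int pos) {
        return new String[] { word.substring(0, pos) + SOFT_HYPHEN, word.substring(pos) };
    }

    /** Joins text chunks with explicit space characters (U+0020). */
    static String joinWords(String... chunks) {
        return String.join(" ", chunks);
    }
}
```

For example, breakWord("structure", 5) yields the two parts which would be placed at the end of one line and the start of the next. Whether the soft hyphen is displayed as expected depends on the font and encoding in use, which is not covered by this sketch. If the autospace parameter is enabled as in the sample above, explicit joining of chunks should not be needed.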
Generating Tagged PDF with Textflow. The Textflow feature (see Section 7.2, »Multi-Line
Textflows«, page 140) offers powerful features for text formatting. Since individual
text fragments are no longer under client control, but will be formatted automatically
by PDFlib, special care must be taken when generating Tagged PDF with textflows:
- Textflows cannot contain individual structure elements, but the complete contents
  of a single Textflow fitbox can be contained in a structure element.
- All parts of a Textflow (all calls to PDF_fit_textflow( ) with a specific Textflow handle)
  should be contained in a single structure element.
- Since the parts of a Textflow could be spread over several pages which could contain
  other structure items, attention should be paid to choosing the proper parent item
  (rather than using a parent parameter of -1, which may point to the wrong parent
  element).
- If you use the matchbox feature for creating links or other annotations in a Textflow
  it is difficult to maintain control over the annotation’s position in the structure tree.
9.6.3 Activating Items for complex Layouts
In order to facilitate the creation of structure information with complex non-linear
page layouts PDFlib supports a feature called item activation. It can be used to activate a
previously created structure element in situations where the developer must keep track
of multiple structure branches, where each branch could span one or more pages.
Typical situations which will benefit from this technique are the following:
- multiple columns on a page
- insertions which interrupt the main text, such as summaries or inserts
- tables and illustrations which are placed between columns.
The activation feature allows an improved method of generating page content in such
situations by switching back and forth between logical branches. This is much more
efficient than completing each branch one after the other. Let’s illustrate the activation
feature using the page layout shown in Figure 9.1. It contains two main text columns,
interrupted by a table and an inserted annotation in a box (with dark background) as
well as header and footer.
Generating page contents in logical order. From the logical structure point of view the
page content should be created in the following order: left column, right column (on the
lower right part of the page), table, insert, header and footer. The following pseudo code
implements this ordering:
/* create page layout in logical structure order */
id_art = p.begin_item("Art", "Title=Article");
id_sect1 = p.begin_item("Sect", "Title={First Section}");
/* 1 create top part of left column */
p.set_text_pos(x1_left, y1_left_top);
...
/* 2 create bottom part of left column */
p.set_text_pos(x1_left, y1_left_bottom);
...
/* 3 create top part of right column */
p.set_text_pos(x1_right, y1_right_top);
...
p.end_item(id_sect1);
id_sect2 = p.begin_item("Sect", "Title={Second Section}");
/* 4 create bottom part of right column */
p.set_text_pos(x2_right, y2_right);
...
/* second section may be continued on next page(s) */
p.end_item(id_sect2);
String optlist = "Title=Table parent=" + id_art;
id_table = p.begin_item("Table", optlist);
/* 5 create table structure and content */
p.set_text_pos(x_start_table, y_start_table);
...
p.end_item(id_table);
optlist = "Title=Insert parent=" + id_art;
id_insert = p.begin_item("P", optlist);
/* 6 create insert structure and content */
p.set_text_pos(x_start_insert, y_start_insert);
...
p.end_item(id_insert);
id_artifact = p.begin_item("Artifact", "");
/* 7+8 create header and footer */
p.set_text_pos(x_header, y_header);
...
p.set_text_pos(x_footer, y_footer);
...
p.end_item(id_artifact);
/* article may be continued on next page(s) */
...
p.end_item(id_art);