PDF 32000-1:2008
© Adobe Systems Incorporated 2008 – All rights reserved
• Annex G outlines strategies for accessing Linearized PDF over a network, which in turn determine the
optimal way to organize the PDF file.
The reader is assumed to be familiar with the basic architecture of the Web, including terms such as URL,
HTTP, and MIME.
F.2 Background and Assumptions
NOTE 1
The principal problem addressed by the Linearized PDF design is the access of PDF documents through the
Web. This environment has the following important properties:
• The access protocol (HTTP) is a transaction consisting of a request and a response. The conforming
reader presents a request in the form of a URL, and the server sends a response consisting of one or
more MIME-tagged data blocks.
• After a transaction has completed, obtaining more data requires a new request-response transaction. The
connection between conforming reader and server does not ordinarily persist beyond the end of a
transaction, although some implementations may attempt to cache the open connection to expedite
subsequent transactions with the same server.
• Round-trip delay can be significant. A request-response transaction can take up to several seconds,
independent of the amount of data requested.
• The data rate may be limited. A typical bottleneck is a slow link between the conforming reader and the
Internet service provider.
These properties are generally shared by other wide-area network architectures besides the Web.
Also, CD-ROMs share some of these properties, since they have relatively slow seek times and limited
data rates compared to magnetic media. The remainder of this annex focuses on the Web.
Some additional properties of the HTTP protocol are relevant to the problem of accessing PDF files
efficiently. These properties may not all be shared by other protocols or network environments.
• When a PDF file is initially accessed (such as by following a URL hyperlink from some other document),
the file type is not known to the conforming reader. Therefore, the conforming reader initiates a transaction
to retrieve the entire document and then inspects the MIME tag of the response as it arrives. Only at that
point is the document known to be PDF. Additionally, with a properly configured server environment, the
length of the document becomes known at that time.
• The conforming reader may abort a response while the transaction is still in progress if it decides that the
remainder of the data is not of immediate interest. In HTTP, aborting the transaction requires closing the
connection, which interferes with the strategy of caching the open connection between transactions.
• The conforming reader may request retrieval of portions of a document by specifying one or more byte
ranges (by offset and count) in the HTTP request headers. Each range can be relative to either the
beginning or the end of the file. The conforming reader may specify as many ranges as it wants in the
request, and the response consists of multiple blocks, each properly tagged.
• The conforming reader may initiate multiple concurrent transactions in an attempt to obtain multiple
responses in parallel. This is commonly done, for instance, to retrieve inline images referenced from an
HTML document. This strategy is not always reliable and may backfire if the transactions interfere with
each other by competing for scarce resources in the server or the communication channel.
NOTE 2
Extensive experimentation has determined that having multiple concurrent transactions does not work very
well for PDF in some important environments. Therefore, Linearized PDF is designed to enable good
performance to be achieved using only one transaction at a time. In particular, this means that the conforming
reader needs to have sufficient information to determine the byte ranges for all the objects required to display
a given page of the PDF file so that it can specify all those byte ranges in a single request.
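The single-request strategy described in NOTE 2 can be sketched as follows. This is a minimal illustration, not part of this standard: the function name `build_range_header` and its `suffix_length` parameter are hypothetical, and the sketch assumes the usual HTTP `Range: bytes=first-last` syntax with inclusive byte positions, where a spec of the form `-n` requests the last n bytes of the file.

```python
def build_range_header(ranges, suffix_length=None):
    """Build one HTTP Range header value from (offset, count) pairs.

    ranges        -- iterable of (offset, count) measured from the start of the file
    suffix_length -- optional number of bytes to fetch from the end of the file
    """
    specs = []
    for offset, count in ranges:
        if offset < 0 or count <= 0:
            raise ValueError("offset must be >= 0 and count must be > 0")
        # HTTP byte ranges are inclusive, so the last byte is offset + count - 1.
        specs.append(f"{offset}-{offset + count - 1}")
    if suffix_length is not None:
        # A range relative to the end of the file.
        specs.append(f"-{suffix_length}")
    return "bytes=" + ",".join(specs)


# All byte ranges needed for one page go into a single request header.
header_value = build_range_header([(0, 1024), (5437, 200)])
```

A reader would send the returned string as the value of the `Range` request header, so that every object needed for a page arrives in one transaction.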
NOTE 3
The following additional assumptions are made about the conforming reader and its local environment:
• The conforming reader has plenty of local temporary storage available. It should rarely need to retrieve a
given portion of a PDF document more than once from the server.
• The conforming reader is able to display PDF data quickly once it has been received. The performance
bottleneck is assumed to be in the transport system (throughput or round-trip delay), not in the processing
of data after it arrives.
The consequence of these assumptions is that it may be advantageous for the conforming reader to
do considerable extra work to minimize delays due to communications.
Such work includes maintaining local caches and reordering actions according to when the needed data
becomes available.
F.3 Linearized PDF Document Structure
F.3.1 General
Except as noted below, all elements of a Linearized PDF file shall be as specified in 7.5, "File Structure", and all
indirect objects in the file shall be divided into two groups.
• The first group shall consist of the document catalogue, other document-level objects, and all objects
belonging to the first page of the document. These objects shall be numbered sequentially, starting at the
first object number after the last number of the second group. (The stream containing the hint tables, called
a hint stream, may be numbered out of sequence; see F.3.6, "Hint Streams (Parts 5 and 10)".)
• The second group shall consist of all remaining objects in the document, including all pages after the first,
all shared objects (objects referenced from more than one page, not counting objects referenced from the
first page), and so forth. These objects shall be numbered sequentially starting at 1.
These groups of objects shall be indexed by exactly two cross-reference table sections. For pedagogical
reasons the linearized PDF is considered to be composed of 11 parts, in order, and the composition of these
groups is discussed in more detail in the sections that follow. All objects shall have a generation number of 0.
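The numbering rule for the two groups can be sketched as below. This is an illustration under stated assumptions, not normative text: `number_groups` is a hypothetical helper, and `first_group_count` is assumed to count every object indexed by the first-page cross-reference table, including the linearization parameter dictionary and the primary hint stream.

```python
def number_groups(first_group_count, second_group_count):
    """Assign object numbers to the two groups of a linearized file."""
    # Second group: numbered sequentially starting at 1.
    second = range(1, second_group_count + 1)
    # First group: starts at the first number after the last of the second group.
    first = range(second_group_count + 1,
                  second_group_count + 1 + first_group_count)
    return {
        "second_group": second,
        "first_group": first,
        # (first object number, entry count) as written in an xref subsection
        # header; the main table also holds the entry for object 0.
        "main_xref_subsection": (0, second_group_count + 1),
        "first_page_xref_subsection": (first.start, first_group_count),
        # Combined number of entries in both tables.
        "size": second_group_count + 1 + first_group_count,
    }


layout = number_groups(first_group_count=14, second_group_count=42)
```

With 14 first-group objects and 42 second-group objects, the subsection headers come out as `43 14` and `0 43` and the combined size as 57, which agrees with the figures used in the examples of this sub-clause.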
Beginning with PDF 1.5, PDF files may contain object streams (see 7.5.7, "Object Streams"). In linearized files
containing object streams, the following conditions shall apply:
• These additional objects may not be contained in an object stream: the linearization dictionary, the
document catalogue, and page objects.
• Objects stored within object streams shall be given the highest range of object numbers within the main
and first-page cross-reference sections.
• For files containing object streams, hint data may specify the location and size of the object streams only
(or uncompressed objects), not the individual compressed objects. Similarly, shared object references
shall be made to the object stream containing a compressed object, not to the compressed object itself.
• Cross-reference streams (7.5.8, "Cross-Reference Streams") may be used in place of traditional
cross-reference tables. The logic described in this sub-clause shall still apply, with the appropriate syntactic
changes.
EXAMPLE 1
Part 1: Header
%PDF-1.1
%… Binary characters …
EXAMPLE 2
Part 2: Linearization parameter dictionary
43 0 obj
<< /Linearized 1.0    % Version
   /L 54567           % File length
   /H [ 475 598 ]     % Primary hint stream offset and length (part 5)
   /O 45              % Object number of first page’s page object (part 6)
   /E 5437            % Offset of end of first page
   /N 11              % Number of pages in document
   /T 52786           % Offset of first entry in main cross-reference table (part 11)
>>
endobj
EXAMPLE 3
Part 3: First-page cross-reference table and trailer
xref
43 14
0000000052 00000 n
0000000392 00000 n
0000001073 00000 n
… Cross-reference entries for remaining objects in the first page …
0000000475 00000 n
trailer
<< /Size 57           % Total number of cross-reference table entries in document
   /Prev 52776        % Offset of main cross-reference table (part 11)
   /Root 44 0 R       % Indirect reference to catalogue (part 4)
   … Any other entries, such as Info and Encrypt …    % (part 9)
>>
% Dummy cross-reference table offset
startxref
0
%%EOF
EXAMPLE 4
Part 4: Document catalogue and other required document-level objects
44 0 obj
<< /Type /Catalog
/Pages 42 0 R
>>
endobj
… Other objects …
EXAMPLE 5
Part 5: Primary hint stream (may precede or follow part 6)
56 0 obj
<< /Length 457
   … Possibly other stream attributes, such as Filter …
   /S 221             % Position of shared object hint table
   … Possibly entries for other hint tables …
>>
stream
… Page offset hint table …
… Shared object hint table …
… Possibly other hint tables …
endstream
endobj
EXAMPLE 6
Part 6: First-page section (may precede or follow part 5)
45 0 obj
<< /Type /Page
…
>>
endobj
… Outline hierarchy (if the PageMode value in the document catalog is UseOutlines) …
… Objects for first page, including both shared and nonshared objects …
EXAMPLE 7
Part 7: Remaining pages
1 0 obj
<< /Type /Page
… Other page attributes, such as MediaBox, Parent, and Contents …
>>
endobj
… Nonshared objects for this page …
… Each successive page followed by its nonshared objects …
… Last page followed by its nonshared objects …
EXAMPLE 8
Part 8: Shared objects for all pages except the first
… Shared objects …
EXAMPLE 9
Part 9: Objects not associated with pages, if any
… Other objects …
EXAMPLE 10
Part 10: Overflow hint stream (optional)
… Overflow hint stream …
EXAMPLE 11
Part 11: Main cross-reference table and trailer
xref
0 43
0000000000 65535 f
… Cross-reference entries for all except first page’s objects …
trailer
<< /Size 43 >>        % Trailer need not contain other entries; in particular,
                      % it should not have a Prev entry
% Offset of first-page cross-reference table (part 3)
startxref
257
%%EOF
F.3.2 Header (Part 1)
The Linearized PDF file shall begin with the standard header line (see 7.5.2, "File Header"). Linearization is
independent of PDF version number and may be applied to any PDF file of version 1.1 or greater.
The binary characters following the PERCENT SIGN (25h) on the second line are characters with codes 128 or
greater, as recommended in 7.5.2, "File Header".
F.3.3 Linearization Parameter Dictionary (Part 2)
Following the header, the first object in the body of the file (part 2) shall be an indirect dictionary object, the
linearization parameter dictionary, which shall contain the parameters listed in Table F.1. All values in this
dictionary shall be direct objects. There shall be no references to this dictionary anywhere in the document;
however, the first-page cross-reference table (part 3) shall contain a normal entry for it.
The linearization parameter dictionary shall be entirely contained within the first 1024 bytes of the PDF file. This
limits the amount of data a conforming reader must read before deciding whether the file is linearized.
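A reader's linearization check might look like the sketch below. It is only an illustration: the function names are hypothetical, the regular expressions assume a dictionary without embedded comments or nested dictionaries, and a real reader would use a proper PDF tokenizer instead.

```python
import re


def parse_linearization_dict(head: bytes):
    """Return linearization parameters found in the first 1024 bytes, or None."""
    window = head[:1024]  # the whole dictionary has to fit in this window
    match = re.search(rb"\d+\s+\d+\s+obj\s*<<(.*?)>>\s*endobj", window, re.DOTALL)
    if match is None or b"/Linearized" not in match.group(1):
        return None
    body = match.group(1)
    params = {}
    # Numeric entries; all values are direct objects, so no references to resolve.
    for key, value in re.findall(rb"/(Linearized|L|O|E|N|T|P)\s+([0-9.]+)", body):
        text = value.decode("ascii")
        params[key.decode("ascii")] = float(text) if "." in text else int(text)
    # The H entry is an array of two or four integers.
    h_match = re.search(rb"/H\s*\[([0-9\s]+)\]", body)
    if h_match:
        params["H"] = [int(token) for token in h_match.group(1).split()]
    return params


def is_linearized(head: bytes, actual_file_length: int) -> bool:
    """Treat the file as linearized only if L equals the actual file length."""
    params = parse_linearization_dict(head)
    return params is not None and params.get("L") == actual_file_length


sample = (b"%PDF-1.1\n%\xe2\xe3\xcf\xd3\n43 0 obj\n"
          b"<< /Linearized 1.0 /L 54567 /H [ 475 598 ] /O 45 /E 5437 /N 11 /T 52786 >>\n"
          b"endobj\n")
params = parse_linearization_dict(sample)
```

If the length check fails, the file would be handled as ordinary PDF and the linearization information ignored.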
F.3.4 First-Page Cross-Reference Table and Trailer (Part 3)
Part 3 shall contain the cross-reference table for objects belonging to the first page (discussed in F.3.7,
"First-Page Section (Part 6)") as well as for the document catalogue and document-level
objects appearing before the first page (discussed in F.3.5, "Document Catalogue and Document-Level Objects
(Part 4)"). Additionally, this cross-reference table shall contain entries for the linearization parameter dictionary
(at the beginning) and the primary hint stream (at the end). This table shall be a valid cross-reference table as
defined in 7.5.4, "Cross-Reference Table", although its position in the file shall not be at the end of the file. It
shall consist of a single cross-reference subsection that has no free entries.
In PDF 1.5 and later, cross-reference streams (see 7.5.8, "Cross-Reference Streams") may be used in
linearized files in place of traditional cross-reference tables. The logic described in this section, along with the
appropriate syntactic changes for cross-reference streams, shall still apply.
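The constraints on a traditional first-page cross-reference table (a single subsection, no free entries, generation number 0 throughout) can be checked with a sketch like the following. The function names and the sample table are hypothetical, and the parser assumes the plain-text table form of 7.5.4, not cross-reference streams.

```python
def parse_xref_section(text: str):
    """Parse 'xref' subsections up to the 'trailer' keyword.

    Returns a list of (first_object_number, entries), where each entry is
    (offset, generation, kind) and kind is 'n' (in use) or 'f' (free).
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("missing xref keyword")
    subsections = []
    index = 1
    while index < len(lines) and lines[index] != "trailer":
        first, count = (int(token) for token in lines[index].split())
        entries = []
        for line in lines[index + 1:index + 1 + count]:
            offset, generation, kind = line.split()
            entries.append((int(offset), int(generation), kind))
        subsections.append((first, entries))
        index += 1 + count
    return subsections


def is_valid_first_page_xref(text: str) -> bool:
    """Check: exactly one subsection, no free entries, all generations 0."""
    subsections = parse_xref_section(text)
    if len(subsections) != 1:
        return False
    _, entries = subsections[0]
    return all(kind == "n" and generation == 0 for _, generation, kind in entries)


# Shortened, made-up table used only to exercise the check.
sample_table = """xref
43 3
0000000052 00000 n
0000000392 00000 n
0000000475 00000 n
trailer
<< /Size 46 >>
"""
ok = is_valid_first_page_xref(sample_table)
```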
Table F.1 – Entries in the linearization parameter dictionary

Parameter   Type     Value

Linearized  number   (Required) A version identification for the linearized format.

L           integer  (Required) The length of the entire file in bytes. It shall be exactly
                     equal to the actual length of the PDF file. A mismatch indicates that
                     the file is not linearized and shall be treated as ordinary PDF, ignoring
                     linearization information. (If the mismatch resulted from appending an
                     update, the linearization information may still be correct but requires
                     validation; see G.7, "Accessing an Updated File" for details.)

H           array    (Required) An array of two or four integers, [ offset1 length1 ] or
                     [ offset1 length1 offset2 length2 ]. offset1 shall be the offset of the
                     primary hint stream from the beginning of the file. (This is the
                     beginning of the stream object, not the beginning of the stream data.)
                     length1 shall be the length of this stream, including stream object
                     overhead.
                     If the value of the primary hint stream dictionary’s Length entry is an
                     indirect reference, the object it refers to shall immediately follow the
                     stream object, and length1 also shall include the length of the indirect
                     length object, including object overhead.
                     If there is an overflow hint stream, offset2 and length2 shall specify its
                     offset and length.

O           integer  (Required) The object number of the first page’s page object.

E           integer  (Required) The offset of the end of the first page (the end of EXAMPLE
                     6 in F.3.1, "General"), relative to the beginning of the file.

N           integer  (Required) The number of pages in the document.

T           integer  (Required) In documents that use standard main cross-reference
                     tables (including hybrid-reference files; see 7.5.8.4, "Compatibility with
                     Applications That Do Not Support Compressed Reference Streams"),
                     this entry shall represent the offset of the white-space character
                     preceding the first entry of the main cross-reference table (the entry for
                     object number 0), relative to the beginning of the file. Note that this
                     differs from the Prev entry in the first-page trailer, which gives the
                     location of the xref line that precedes the table.
                     (PDF 1.5) In documents that use cross-reference streams exclusively
                     (see 7.5.8, "Cross-Reference Streams"), this entry shall represent the
                     offset of the main cross-reference stream object.

P           integer  (Optional) The page number of the first page; see F.3.7, "First-Page
                     Section (Part 6)". Default value: 0.
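The H entry of Table F.1 can be unpacked into byte ranges as in the sketch below. The helper name `hint_stream_ranges` is hypothetical; it only validates the shape of the array and pairs each offset with its length.

```python
def hint_stream_ranges(h):
    """Split the H array into (offset, length) pairs.

    A two-element array describes the primary hint stream only; a
    four-element array adds the overflow hint stream as the second pair.
    """
    if len(h) not in (2, 4):
        raise ValueError("H shall contain two or four integers")
    if not all(isinstance(value, int) and value >= 0 for value in h):
        raise ValueError("H entries shall be non-negative integers")
    return [(h[i], h[i + 1]) for i in range(0, len(h), 2)]


primary_only = hint_stream_ranges([475, 598])
```

Each pair covers the whole stream object, not only the stream data, so the pairs can be used directly as the byte ranges to request.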
Below the table shall be the first-page trailer. The trailer’s Prev entry shall give the offset of the main cross-
reference table near the end of the file. A conforming reader that does not support the linearized feature shall
process this correctly even though the trailers are linked in an unusual order. It interprets the first-page cross-
reference table as an update to an original document that is indexed by the main cross-reference table.
The first-page trailer shall contain valid Size and Root entries, as well as any other entries needed to display
the document. The Size value shall be the combined number of entries in both the first-page cross-reference
table and the main cross-reference table.
The first-page trailer may optionally end with startxref, an integer, and %%EOF, just as in an ordinary trailer.
This information shall be ignored.
F.3.5 Document Catalogue and Document-Level Objects (Part 4)
Following the first-page cross-reference table and trailer are the catalogue dictionary and other objects that are
required to be present when the document is opened. These additional objects (constituting part 4) shall include the
values of the following entries if they are present and are indirect objects:
• The ViewerPreferences entry in the catalogue.
• The PageMode entry in the catalogue. Note that if the value of PageMode is UseOutlines, the outline
hierarchy shall be located in part 6; otherwise, the outline hierarchy, if any, shall be located in part 9. See
F.3.10, "Other Objects (Part 9)" for details.
• The Threads entry in the catalogue, along with all thread dictionaries it refers to. This does not include the
threads’ information dictionaries or the individual bead dictionaries belonging to the threads.
• The OpenAction entry in the catalogue.
• The AcroForm entry in the catalogue. Only the top-level interactive form dictionary shall be present, not
the objects that it refers to.
• The Encrypt entry in the first-page trailer dictionary. All values in the encryption dictionary shall also be
located here.
All other objects shall not be located here but instead shall be at the end of the file; see F.3.10, "Other Objects
(Part 9)". This includes objects such as page tree nodes, the document information dictionary, and the
definitions for named destinations.
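The placement rules above can be summarised in a short sketch. It is deliberately simplified and rests on assumptions: the catalogue is modelled as a plain Python dict keyed by entry name, both function names are hypothetical, and the sketch does not test whether a value is an indirect object, which the rule above also requires.

```python
# Catalogue entries whose values belong in part 4 when present.
PART4_CATALOG_KEYS = ("ViewerPreferences", "PageMode", "Threads",
                      "OpenAction", "AcroForm")


def part4_catalog_entries(catalog: dict) -> list:
    """Return the catalogue keys whose values would be placed in part 4."""
    return [key for key in PART4_CATALOG_KEYS if key in catalog]


def outline_part(catalog: dict) -> int:
    """Return the part that holds the outline hierarchy, if the document has one."""
    # Part 6 when the outlines are shown on opening, part 9 otherwise.
    return 6 if catalog.get("PageMode") == "UseOutlines" else 9


example_catalog = {"Type": "Catalog", "PageMode": "UseOutlines", "OpenAction": "..."}
selected = part4_catalog_entries(example_catalog)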
NOTE
The objects located here are indexed by the first-page cross-reference table, even though they are not logically
part of the first page.
F.3.6 Hint Streams (Parts 5 and 10)
The core of the linearization information shall be stored in data structures known as hint tables, whose format is
described in F.4, "Hint Tables." They shall provide indexing information that enables the conforming reader to
construct a single request for all the objects that are needed to display any page of the document or to retrieve
other information efficiently. The hint tables may contain additional information to optimize access by
conforming writer extensions to application-specific data.
The hint tables shall not be logically part of the information content of the document; they shall be derived from
the document. Any action that changes the document—for instance, appending an incremental
update—invalidates the hint tables. The document remains a valid PDF file but is no longer linearized; see G.7,
"Accessing an Updated File" for details.
The hint tables are binary data structures that shall be enclosed in a stream object. Syntactically, this stream
shall be a PDF indirect object. However, there shall be no references to the stream anywhere in the document.
Therefore, it is not logically part of the document, and an operation that regenerates the document may remove
the stream.
Usually, all the hint tables shall be contained in a single stream, known as the primary hint stream. Optionally,
there may be an additional stream containing more hints, known as the overflow hint stream. The contents of
the two hint streams shall be concatenated and treated as if they were a single unbroken stream.
The primary hint stream, which shall be required, is shown as part 5 in Example 5. The order of this part and
the first-page section, shown as part 6, may be reversed; see Annex G for considerations on the choice of
placement. The overflow hint stream, part 10, is optional.
The location and length of the primary hint stream, and of the overflow hint stream if present, shall be given in
the linearization parameter dictionary at the beginning of the file.
The hint streams shall be assigned the last object numbers in the file—that is, after the object number for the
last object in the first page. Their cross-reference table entries shall be at the end of the first-page cross-
reference table. This object number assignment shall be independent of the physical locations of the hint
streams in the file.
NOTE
This convention keeps their object numbers from conflicting with the numbering of the linearized objects.
With one exception, the values of all entries in the hint streams’ dictionaries shall be direct objects and may
contain no indirect object references. The exception is the stream dictionary’s Length entry (see the discussion
of the H entry in Table F.1).
In addition to the standard stream attributes, the dictionary of the primary hint stream shall contain entries
giving the position of the beginning of each hint table in the stream. These positions shall be counted in bytes
relative to the beginning of the stream data (after decoding filters, if any, are applied) and with the overflow hint
stream concatenated if present. The dictionary of the overflow hint stream shall not contain these entries. The
keys designating the standard hint tables in the primary hint stream’s dictionary are listed in Table F.2; F.4, "Hint
Tables," documents the format of these hint tables. Additionally, there is a required page offset hint table, which
shall be the first table in the stream and shall start at offset 0.
Table F.2 – Standard hint tables

Key  Hint table

S    (Required) Shared object hint table (see F.4.2, "Shared Object Hint Table")

T    (Present only if thumbnail images exist) Thumbnail hint table (see F.4.3,
     "Thumbnail Hint Table")

O    (Present only if a document outline exists) Outline hint table (see F.4.4,
     "Generic Hint Tables")

A    (Present only if article threads exist) Thread information hint table (see
     F.4.4, "Generic Hint Tables")

E    (Present only if named destinations exist) Named destination hint table
     (see F.4.4, "Generic Hint Tables")

V    (Present only if an interactive form dictionary exists) Interactive form hint
     table (see F.4.5, "Extended Generic Hint Tables")

I    (Present only if a document information dictionary exists) Information
     dictionary hint table (see F.4.4, "Generic Hint Tables")

C    (Present only if a logical structure hierarchy exists; PDF 1.3) Logical
     structure hint table (see F.4.5, "Extended Generic Hint Tables")

L    (PDF 1.3) Page label hint table (see F.4.4, "Generic Hint Tables")
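Locating the individual hint tables from the keys of Table F.2 can be sketched as follows. Several things here are assumptions rather than statements of this standard: the function name `split_hint_tables`, the label `page_offset` for the table at offset 0, and in particular the simplification that each table extends up to the start of the next one. Both inputs are assumed to be stream data after any decoding filters have been applied.

```python
def split_hint_tables(primary_data: bytes, positions: dict,
                      overflow_data: bytes = b"") -> dict:
    """Slice the concatenated hint stream data into named hint tables.

    positions maps keys from the primary hint stream dictionary (for
    example "S") to byte positions within the concatenated data.
    """
    # The two streams are treated as a single unbroken stream.
    data = primary_data + overflow_data
    # The page offset hint table is always first and starts at offset 0.
    starts = {"page_offset": 0}
    starts.update(positions)
    ordered = sorted(starts.items(), key=lambda item: item[1])
    tables = {}
    for index, (name, start) in enumerate(ordered):
        # Assumption: a table ends where the next one begins.
        end = ordered[index + 1][1] if index + 1 < len(ordered) else len(data)
        if not 0 <= start <= end <= len(data):
            raise ValueError(f"hint table {name!r} lies outside the stream data")
        tables[name] = data[start:end]
    return tables


tables = split_hint_tables(b"A" * 221 + b"B" * 100, {"S": 221}, b"C" * 50)
```

In this made-up input the shared object hint table begins at position 221 and continues into the overflow data, which shows why positions are counted with the overflow hint stream concatenated.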