10.11 TET Markup Language (TETML) Functions
C++ int process_page(int doc, int pagenumber, wstring optlist)
C# Java int process_page(int doc, int pagenumber, String optlist)
Perl PHP long process_page(long doc, long pagenumber, string optlist)
VB RB Function process_page(doc As Long, pagenumber As Long, optlist As String) As Int
C int TET_process_page(TET *tet, int doc, int pagenumber, const char *optlist)
Process a page and create TETML output.
doc A valid document handle obtained with TET_open_document*( ).
pagenumber The physical number of the page to be processed. The first page has page
number 1. The total number of pages can be retrieved with TET_pcos_get_number( ) and
the pCOS path length:pages. The pagenumber parameter may be 0 if trailer=true.
optlist An option list specifying options from the following groups:
>General page-related options according to Table 10.10 (these will be ignored if
pagenumber=0): clippingarea, contentanalysis, excludebox, fontsizerange, granularity,
ignoreinvisibletext, imageanalysis, includebox, layoutanalysis, skipengines
>Option specifying processing details according to Table 10.18: tetml
Returns -1 on error, or 1 otherwise. However, in TETML mode this function will always succeed
since problems will be reported in a TETML Exception element.
Table 10.18 Additional options for TET_process_page( )
option      description
tetml       (Option list) Controls details of TETML. The following options are available:
            elements      (Option list) Specify optional TETML elements:
                          line    (Only for granularity=word) If true, TETML output includes Line
                                  elements between Para and Word levels. Default: false
            glyphdetails  (Option list; only for granularity=glyph and word) Specify which glyph
                          attributes will be reported for each Glyph element (default for all
                          suboptions: false):
                          all            (Boolean) Enable all attribute suboptions
                          dehyphenation  (Boolean) Emit attribute dehyphenation to indicate
                                         hyphenated words.
                          dropcap        (Boolean) Emit attribute dropcap to indicate large initial
                                         characters at the start of a paragraph.
                          geometry       (Boolean) Emit attributes x, y, width, alpha, beta.
                          font           (Boolean) Emit attributes font, fontsize, textrendering,
                                         unknown.
                          sub            (Boolean) Emit attribute sub to indicate subscripts.
                          sup            (Boolean) Emit attribute sup to indicate superscripts.
            trailer       (Boolean) If true, document trailer data, i.e. data after the last page,
                          will be emitted (it must be appended to the page-specific data emitted
                          earlier). This option is required in the last call to this function in
                          order to emit trailer data. If pagenumber=0 only trailer data (without any
                          page-specific data) will be emitted. Once trailer=true has been supplied,
                          no more calls to TET_process_page( ) are allowed for the same document.
                          Default: false
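The nesting of the tetml option list can be easier to see when the string is assembled programmatically. The following Python sketch renders a nested dictionary as an option list; build_optlist is a hypothetical helper and not part of the TET API, and the key={...} nesting with space-separated key=value pairs is assumed to follow the usual option list syntax. Only option names from Table 10.18 and the granularity option are used.

```python
def build_optlist(options):
    """Render a nested dict as an option list string.

    Hypothetical helper for illustration; it is not part of the TET API.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, dict):
            # Nested option lists are enclosed in braces.
            parts.append("%s={%s}" % (key, build_optlist(value)))
        elif isinstance(value, bool):
            parts.append("%s=%s" % (key, "true" if value else "false"))
        else:
            parts.append("%s=%s" % (key, value))
    return " ".join(parts)


# Word granularity with Line elements plus geometry and font glyph attributes.
page_optlist = build_optlist({
    "granularity": "word",
    "tetml": {
        "elements": {"line": True},
        "glyphdetails": {"geometry": True, "font": True},
    },
})
```

The resulting string could then be passed as the optlist parameter of process_page( ).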
Chapter 10: TET Library API Reference
Details This function will open a page, create output according to the format-related options
supplied to TET_open_document*( ), and close the page. The generated data can be retrieved
with TET_get_xml_data( ).
This function must only be called if the option tetml has been supplied in the corresponding
call to TET_open_document*( ). Header data, i.e. document-specific data before the first page,
will be created by TET_open_document*( ) before the first page data. It can be retrieved
separately by calling TET_get_xml_data( ) before the first call to TET_process_page( ), or in
combination with page-related data.
Trailer data, i.e. document-specific data after the last page, must be requested with the
trailer suboption when this function is called for the last time for a document. Trailer data
can be created with a separate call after the last page (pagenumber=0), or together with the
last page (pagenumber is different from 0). Pages can be retrieved in any order, and any
subset of the document’s pages can be retrieved.
It is an error to call TET_close_document( ) without retrieving the trailer, or to call
TET_process_page( ) again after retrieving the trailer.
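The call sequence described above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: create_tetml is a hypothetical helper name, tet is assumed to be a TET object from the Python binding, and doc is assumed to be a handle for a document which has already been opened with the tetml option. The sketch processes every page in order and then requests the trailer in a separate call with pagenumber=0.

```python
def create_tetml(tet, doc, page_optlist=""):
    """Process all pages of an open document and return the TETML data.

    Assumptions: `tet` offers the methods pcos_get_number(), process_page()
    and get_xml_data() described in this chapter, and `doc` is a valid
    document handle opened in TETML mode.
    """
    # pcos_get_number() returns a floating point value.
    page_count = int(tet.pcos_get_number(doc, "length:pages"))

    for pageno in range(1, page_count + 1):
        tet.process_page(doc, pageno, page_optlist)

    # pagenumber=0 with trailer=true emits only the trailer data. After this
    # call process_page() must not be called again for the same document.
    tet.process_page(doc, 0, "tetml={trailer=true}")

    # A single retrieval after the last process_page() call yields the
    # header, all pages and the trailer in one piece.
    return tet.get_xml_data(doc, "")
```

The caller remains responsible for opening the document beforehand and for closing it afterwards.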
C++ const char *get_xml_data(int doc, size_t *length, wstring optlist)
C# Java final byte[ ] get_xml_data(int doc, String optlist)
Perl PHP string get_xml_data(long doc, string optlist)
VB RB Function get_xml_data(doc As Long, optlist As String)
C const char * TET_get_xml_data(TET *tet, int doc, size_t *length, const char *optlist)
Retrieve TETML data from memory.
doc A valid document handle obtained with TET_open_document*( ).
length (C and C++ language binding only) A pointer to a variable which will hold the
length of the returned string in bytes. length does not count the terminating null byte.
optlist (Currently there are no supported options.)
Returns A byte array containing the next chunk of data according to the specified options. If the
buffer is empty an empty string will be returned (in C: a NULL pointer and *length=0).
Details This function retrieves TETML data which has been created by TET_open_document*( )
and one or more calls to TET_process_page( ). The TETML data will always be encoded in
UTF-8, regardless of the outputformat option. The internal buffer will be cleared by this
call. It is not required to call TET_get_xml_data( ) after each call to TET_process_page( ).
The client may accumulate the data for one or more pages or for the whole document in
the buffer.
In TETML mode this function must be called at least once before TET_close_document( )
since otherwise the data would no longer be accessible. If TET_get_xml_data( ) is called
exactly once (such a single call must happen between the last call to TET_process_page( )
and TET_close_document( )) the buffer is guaranteed to contain well-formed TETML output
for the whole document. This function must not be called if the filename suboption has been
supplied to the tetml option of TET_open_document*( ).
Bindings C and C++ language bindings: the result will be provided as null-terminated UTF-8. On
i5/iSeries and zSeries EBCDIC-encoded UTF-8 will be returned. The returned data buffer
can be used until the next call to TET_get_xml_data( ).
Java and .NET language bindings: the result will be provided as a byte array containing
UTF-8 data.
COM: Most client programs will use the Variant type to hold the UTF-8 data.
REALbasic: The result will be returned as REALbasic String with encoding UTF-8.
PHP language binding: the result will be provided as UTF-8 string.
Python: the result will be returned as 8-bit string (Python 3: bytes).
RPG language binding: the result will be returned as null-terminated EBCDIC UTF-8.
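Since the internal buffer is cleared by each retrieval, the data can also be fetched in chunks instead of accumulating the whole document in memory. The following Python sketch illustrates this under stated assumptions: iter_tetml_chunks is a hypothetical helper name, tet and doc are an existing TET object and document handle, the document has at least one page, and page_optlist does not itself contain a tetml option. The trailer is requested together with the last page.

```python
def iter_tetml_chunks(tet, doc, page_count, page_optlist=""):
    """Yield TETML data chunk by chunk: header first, then one chunk per page.

    Assumptions: `doc` was opened in TETML mode without the filename
    suboption, and `page_optlist` contains no tetml option of its own.
    """
    # Header data is available before the first call to process_page().
    yield tet.get_xml_data(doc, "")

    for pageno in range(1, page_count + 1):
        optlist = page_optlist
        if pageno == page_count:
            # Request the trailer together with the last page.
            optlist = (page_optlist + " tetml={trailer=true}").strip()
        tet.process_page(doc, pageno, optlist)
        # Each retrieval clears the internal buffer.
        yield tet.get_xml_data(doc, "")
```

Concatenating the chunks in the order in which they are produced gives the same data as a single retrieval after the last page.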
10.12 pCOS Functions
The full pCOS syntax for retrieving object data from a PDF is supported. For a detailed
description please refer to the pCOS Path Reference which is available as a separate document.
C++ double pcos_get_number(int doc, wstring path)
C# Java double pcos_get_number(int doc, String path)
Perl PHP float pcos_get_number(int doc, string path)
VB RB Function pcos_get_number(doc as Long, path As String) As Double
C double TET_pcos_get_number(TET *tet, int doc, const char *path, ...)
Get the value of a pCOS path with type number or boolean.
doc A valid document handle obtained with TET_open_document*( ).
path A full pCOS path for a numerical or boolean object.
Additional parameters (C language binding only) A variable number of additional parameters
can be supplied if the path parameter contains corresponding placeholders (%s for strings
or %d for integers; use %% for a single percent sign). Using these parameters will save you
from explicitly formatting complex paths containing variable numerical or string values. The
client is responsible for making sure that the number and type of the placeholders match
the supplied additional parameters.
Returns The numerical value of the object identified by the pCOS path. For Boolean values 1 will
be returned if they are true, and 0 otherwise.
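The return value conventions can be wrapped in small helpers, sketched here in Python. The helper names are hypothetical, and tet and doc are assumed to be an existing TET object and document handle. In bindings without the variable argument list of the C function the path is formatted before the call. The path length:pages[...]/annots is an assumption based on the pCOS Path Reference, where page indices start at 0.

```python
def page_count(tet, doc):
    # pcos_get_number() returns a floating point value; convert it for use
    # as a loop bound.
    return int(tet.pcos_get_number(doc, "length:pages"))


def pcos_flag(tet, doc, path):
    # Boolean objects are reported as 1 (true) or 0 (false).
    return tet.pcos_get_number(doc, path) != 0


def annotation_count(tet, doc, page_index):
    # Counterpart of the C placeholder form "length:pages[%d]/annots":
    # the path is formatted explicitly before the call.
    path = "length:pages[%d]/annots" % page_index
    return int(tet.pcos_get_number(doc, path))
```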
C++ wstring pcos_get_string(int doc, wstring path)
C# Java String pcos_get_string(int doc, String path)
Perl PHP string pcos_get_string(int doc, string path)
VB RB Function pcos_get_string(doc as Long, path As String) As String
C const char *TET_pcos_get_string(TET *tet, int doc, const char *path, ...)
Get the value of a pCOS path with type name, number, string, or boolean.
doc A valid document handle obtained with TET_open_document*( ).
path A full pCOS path for a string, name, or boolean object.
Additional parameters (C language binding only) A variable number of additional parameters
can be supplied if the path parameter contains corresponding placeholders (%s for strings
or %d for integers; use %% for a single percent sign). Using these parameters will save you
from explicitly formatting complex paths containing variable numerical or string values. The
client is responsible for making sure that the number and type of the placeholders match
the supplied additional parameters.
Returns A string with the value of the object identified by the pCOS path. For Boolean values the
strings true or false will be returned.
Details This function raises an exception if pCOS does not run in full mode and the type of the
object is string. However, the objects /Info/* (document info keys) can also be retrieved in
restricted pCOS mode if nocopy=false or plainmetadata=true, and bookmarks[...]/Title as
well as all paths starting with pages[...]/annots[...]/ can be retrieved in restricted pCOS
mode if nocopy=false.
This function assumes that strings retrieved from the PDF document are text strings.
String objects which contain binary data should be retrieved with TET_pcos_get_stream( )
instead which does not modify the data in any way.
Bindings C language binding: The string will be returned in UTF-8 format (on zSeries and i5/
iSeries: EBCDIC-UTF-8) without BOM. The returned strings will be stored in a ring buffer
with up to 10 entries. If more than 10 strings are queried, buffers will be reused, which
means that clients must copy the strings if they want to access more than 10 strings in
parallel. For example, up to 10 calls to this function can be used as parameters for a
printf( ) statement since the return strings are guaranteed to be independent if no more
than 10 strings are used at the same time.
C++ language binding: The string will be returned as wstring in the default wstring
configuration of the C++ wrapper. In string compatibility mode on zSeries and i5/iSeries the
result will be returned in EBCDIC-UTF-8 without BOM.
Java and .NET bindings: the result will be provided as Unicode string. If no more text is
available a null object will be returned.
Perl, PHP and Python language bindings: the result will be provided as UTF-8 string. If
no more text is available a null object will be returned.
RPG language binding: the result will be provided as EBCDIC-UTF-8 string.
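The reuse behavior of the ring buffer in the C language binding can be illustrated with a toy model. The following Python sketch is not TET code and does not reflect the library's internals; StringRing is a hypothetical class which only models a fixed number of result slots that are reused in round-robin order. It shows why a client must copy a string before querying more than 10 further strings.

```python
class StringRing:
    """Toy model of a fixed-size ring of result slots (not TET code)."""

    def __init__(self, size=10):
        self.slots = [None] * size
        self.next = 0

    def store(self, value):
        index = self.next
        self.slots[index] = value
        # Advance to the following slot and wrap around after the last one.
        self.next = (index + 1) % len(self.slots)
        # The index stands in for the pointer returned to a C client.
        return index


ring = StringRing()
handles = [ring.store("string %d" % i) for i in range(11)]

# The 11th string reused the slot of the 1st one, so the first handle now
# refers to different contents.
first_now = ring.slots[handles[0]]
```

In this model the first ten results stay independent of each other, which corresponds to the printf( ) example above; the eleventh query overwrites the oldest result.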
C++ const unsigned char *pcos_get_stream(int doc, int *length, string optlist, wstring path)
C# Java final byte[ ] pcos_get_stream(int doc, String optlist, String path)
Perl PHP string pcos_get_stream(int doc, string optlist, string path)
VB RB Function pcos_get_stream(doc as Long, optlist As String, path As String)
C const unsigned char *TET_pcos_get_stream(TET *tet, int doc, int *length, const char *optlist,
const char *path, ...)
Get the contents of a pCOS path with type stream, fstream, or string.
doc A valid document handle obtained with TET_open_document*( ).
length (C and C++ language bindings only) A pointer to a variable which will receive
the length of the returned stream data in bytes.
optlist An option list specifying stream retrieval options according to Table 10.19.
path A full pCOS path for a stream or string object.
Additional parameters (C language binding only) A variable number of additional parameters
can be supplied if the path parameter contains corresponding placeholders (%s for strings
or %d for integers; use %% for a single percent sign). Using these parameters will save you
from explicitly formatting complex paths containing variable numerical or string values. The
client is responsible for making sure that the number and type of the placeholders match
the supplied additional parameters.
Returns The unencrypted data contained in the stream or string. The returned data will be empty
(in C and C++: NULL) if the stream or string is empty, or if the contents of encrypted
attachments in an unencrypted document are queried and the attachment password
has not been supplied.
If the object has type stream all filters will be removed from the stream contents (i.e.
the actual raw data will be returned) unless keepfilter=true. If the object has type fstream
or string the data will be delivered exactly as found in the PDF file, with the exception of
ASCII85 and ASCIIHex filters which will be removed.
In addition to decompressing the data and removing ASCII filters, text conversion
may be applied according to the convert option.
Details This function will throw an exception if pCOS does not run in full mode (see the pCOS
Path Reference). As an exception, the object /Root/Metadata can also be retrieved in
restricted pCOS mode if nocopy=false or plainmetadata=true. An exception will also be
thrown if path does not point to an object of type stream, fstream, or string.
Despite its name this function can also be used to retrieve objects of type string. Unlike
TET_pcos_get_string( ), which treats the object as a text string, this function will not
modify the returned data in any way. Binary string data is rarely used in PDF, and cannot
be reliably detected automatically. The user is therefore responsible for selecting
the appropriate function for retrieving string objects as binary data or text.
Bindings COM: Most client programs will use the Variant type to hold the stream contents.
JavaScript with COM does not allow retrieving the length of the returned variant array (but
this does work with other languages and COM).
C and C++ language bindings: The returned data buffer can be used until the next call to
this function.
Python: the result will be returned as 8-bit string (Python 3: bytes).
Note This function can be used to retrieve embedded font data from a PDF. Users are reminded of
the fact that fonts are subject to the respective font vendor’s license agreement, and must not
be reused without the explicit permission of the respective intellectual property owners. Please
contact your font vendor to discuss the relevant license agreement.
Table 10.19 Options for TET_pcos_get_stream( )
option      description
convert     (Keyword; ignored for streams which are compressed with unsupported filters) Controls
            whether or not the string or stream contents will be converted (default: none):
            none      Treat the contents as binary data without any conversion.
            unicode   Treat the contents as textual data (i.e. exactly as in
                      TET_pcos_get_string( )), and normalize it to Unicode. In non-Unicode-aware
                      language bindings this means the data will be converted to UTF-8 format
                      without BOM.
                      This option is required for the data type "text stream" in PDF which is
                      rarely used (e.g. it can be used for JavaScript, although most JavaScript
                      is contained in string objects, not stream objects).
keepfilter  (Boolean; recommended only for image data streams; will be ignored for streams which
            are compressed with unsupported filters) If true, the stream data will be compressed
            with the filter which is specified in the image’s filterinfo pseudo object (see the
            pCOS Path Reference). If false, the stream data will be uncompressed. Default: true
            for all unsupported filters, false otherwise
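The two options of Table 10.19 can be combined into an option list as in the following Python sketch. The helper names stream_optlist and get_stream are hypothetical, tet and doc are assumed to be an existing TET object and document handle, and the parameter order of pcos_get_stream( ) follows the signatures shown above (optlist before path). Options which are not specified are omitted so that the documented defaults apply.

```python
def stream_optlist(convert=None, keepfilter=None):
    """Build an option list for pcos_get_stream() (hypothetical helper)."""
    parts = []
    if convert is not None:
        if convert not in ("none", "unicode"):
            raise ValueError("convert must be 'none' or 'unicode'")
        parts.append("convert=" + convert)
    if keepfilter is not None:
        parts.append("keepfilter=" + ("true" if keepfilter else "false"))
    return " ".join(parts)


def get_stream(tet, doc, path, convert=None, keepfilter=None):
    # The option list precedes the path in the parameter list.
    return tet.pcos_get_stream(doc, stream_optlist(convert, keepfilter), path)
```

For example, get_stream(tet, doc, "/Root/Metadata") would request the metadata stream without any conversion, provided the document contains such an object and the pCOS mode permits access.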