5.3 Recommendations for common Scenarios
TET offers a variety of options which you can use to control various aspects of opera-
tion. In this section we provide some recommendations for typical TET application sce-
narios. Please refer to Chapter 10, »TET Library API Reference«, page 141, for details on
the functions and options mentioned below.
Optimizing performance. In some situations, particularly when indexing PDF for
search engines, text extraction speed is crucial, and may play a more important role
than optimal output. The default settings of TET have been selected to achieve the best
possible output, but can be adjusted to speed up processing. Some tips for choosing op-
tions in TET_open_page( ) to maximize text extraction throughput:
>docstyle=searchengine
Several internal parameters will be set to speed up operation by reducing the output
quality in a way which does not affect the indexing process for search engines.
>skipengines={image}
If image extraction is not required internal image processing can be skipped in order
to speed up operation.
>contentanalysis={merge=0}
This will disable the expensive strip and zone merging step, and reduce processing
times for typical files to about 60% of those with default settings. However, in documents
where the contents are scattered across the pages in arbitrary order, some text may
not be extracted in logical order.
>contentanalysis={shadowdetect=false}
This will disable detection of redundant shadow and fake bold text, which can also
reduce processing times.
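The options above can be combined into one page option list. The following Python sketch only assembles the option string and does not call TET. The helper name fast_page_optlist and its parameters are hypothetical, and placing both contentanalysis suboptions in a single sublist is an assumption about option list syntax. The resulting string would be supplied to TET_open_page( ).

```python
def fast_page_optlist(need_images=False, content_in_order=True):
    """Assemble a throughput-oriented page option list (hypothetical helper)."""
    opts = ["docstyle=searchengine"]
    if not need_images:
        # Skip internal image processing when images are not extracted.
        opts.append("skipengines={image}")
    analysis = []
    if content_in_order:
        # merge=0 is risky when contents are scattered across the page,
        # so it is only added when the caller expects ordered content.
        analysis.append("merge=0")
    analysis.append("shadowdetect=false")
    opts.append("contentanalysis={" + " ".join(analysis) + "}")
    return " ".join(opts)


print(fast_page_optlist())
```

Keeping merge=0 behind a flag reflects the trade-off described above: it saves the most time, but it is the only one of these options that can change the order of the extracted text.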
Words vs. line layout vs. reflowable text. Different applications will prefer different
kinds of output (hyphenated words will always be dehyphenated with these settings):
>Individual words (ignore layout): a search engine may not be interested in any lay-
out-related aspects, but only the words comprising the text. In this situation use
granularity=word in TET_open_page( ) to retrieve one word per call to TET_get_text( ).
>Keep line layout: use granularity=page in TET_open_page( ) for extracting the full text
contents of a page in a single call to TET_get_text( ). Text lines will be separated with a
linefeed character to retain the existing line structure.
>Reflowable text: in order to avoid line breaks and facilitate reflowing of the extracted
text use contentanalysis={lineseparator=U+0020} and granularity=page in TET_open_
page( ). The full page contents can be fetched with a single call to TET_get_text( ).
Zones will be separated with a linefeed character, and a space character will be insert-
ed between the lines in a zone.
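The three output styles can be summarized as page option lists. In the sketch below the dictionary keys and the function split_reflowed_page are hypothetical names. The function assumes that the page text has already been fetched with the reflow settings, so that each linefeed marks a zone boundary.

```python
# Page option lists for the three output styles described above.
PAGE_OPTLISTS = {
    "words": "granularity=word",
    "lines": "granularity=page",
    "reflow": "granularity=page contentanalysis={lineseparator=U+0020}",
}


def split_reflowed_page(text):
    """Split page text fetched with the "reflow" settings into zones.

    Zones are separated by a linefeed; lines within a zone are already
    joined with a space, so each zone can be reflowed as one paragraph.
    """
    return [zone for zone in text.split("\n") if zone]


print(split_reflowed_page("First zone text\nSecond zone\n"))
```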
Writing a search engine or indexer. Indexers are usually not interested in the position
of text on the page (unless they provide search term highlighting). In many cases they
will tolerate errors which occur in Unicode mapping, and process whatever text con-
tents they can get. Recommendations:
>Use granularity=word in TET_open_page( ).
>If the application knows how to process punctuation characters you can keep them
with the adjacent text by setting the following page option:
contentanalysis={punctuationbreaks=false}
Chapter 5: Configuration
Geometry. The geometry features may be useful for some applications:
>The TET_get_char_info( ) interface is only required if you need the position of text on
the page, the respective font name, or other details. If you are not interested in text
coordinates calling TET_get_text( ) will be sufficient.
>If you have advance information about the layout of pages you can use the include-
box and/or excludebox options in TET_open_page( ) to get rid of headers, footers, or
similar items which are not part of the main text.
Unknown characters. If TET is unable to determine the appropriate Unicode mapping
for a character it will represent that character with the Unicode replacement character
U+FFFD. If your application is not concerned about unmappable characters you can
simply discard all occurrences of this character. Applications which require more fine-
grain results could take the corresponding font into account, and use it to decide on
processing of unmappable characters. Use the following document option to replace all
unmapped characters with a question mark:
unknownchar=?
Use the following document option to remove all unmapped characters from the out-
put:
fold={{[:Private_Use:] remove} {[U+FFFD] remove} default}
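If the fold option is not set, an application can discard unmapped characters itself after extraction. The following Python sketch mirrors the effect of the fold list above on an already extracted string. The function name drop_unmapped is hypothetical, and the sketch works on the string only, without calling TET.

```python
import unicodedata

REPLACEMENT = "\uFFFD"  # Unicode replacement character


def drop_unmapped(text):
    """Remove U+FFFD and Private Use characters from extracted text."""
    return "".join(
        ch
        for ch in text
        # General category "Co" covers all Private Use code points.
        if ch != REPLACEMENT and unicodedata.category(ch) != "Co"
    )


print(drop_unmapped("ab\ufffdc\ue000d"))
```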
Complex layouts. Some classes of documents often use very elaborate page layouts.
For example, with magazines and periodicals TET may not be able to properly deter-
mine the relationship of columns on the page. In such situations it is possible to en-
hance the extracted text at the expense of processing time. Suitable options for this
purpose are summarized in Section 6.6, »Layout Analysis«, page 88. See Table 10.12, page
174, for more details on relevant options.
Legal documents. When dealing with legal documents there is usually zero tolerance
for wrong Unicode mappings since they might alter the content or interpretation of a
document. In many cases the text position is not required, and the text must be extract-
ed word by word. Recommendations:
>Use the granularity=word option in TET_open_page( ).
>Use the password option with the appropriate document password in TET_open_
document( ) if you must process documents which require a password for opening, or
the shrug option if content extraction is not allowed in the permission settings and
you are in a legal position to extract text from the document (see »The »shrug« fea-
ture for protected documents«, page 60).
>For absolute text fidelity: stop processing as soon as the unknown field in the charac-
ter info structure returned by TET_get_char_info( ) is 1, or if the Unicode replacement
character U+FFFD is part of the string returned by TET_get_text( ). In TETML with one
of the text modes glyph or wordplus you can identify this situation by the following
attribute in the Glyph element:
unknown="true"
Do not set the unknownchar option to any common character since you may be un-
able to distinguish it from correctly mapped characters without checking the
unknown field.
>Also to ensure text fidelity you may want to disable text extraction for text which is
not visible on the page:
ignoreinvisibletext=true
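The fidelity checks above can be sketched in Python as follows. All names are hypothetical. require_mapped works on a string returned by TET_get_text( ), and count_unknown_glyphs scans TETML text for Glyph elements carrying unknown="true". The sketch ignores XML namespaces, and it assumes that the unknownchar option has been left at its default.

```python
import xml.etree.ElementTree as ET

REPLACEMENT = "\uFFFD"  # Unicode replacement character


class UnmappedTextError(ValueError):
    """Raised when extracted text contains an unmapped character."""


def require_mapped(text):
    """Return text unchanged, or stop processing if U+FFFD is present."""
    pos = text.find(REPLACEMENT)
    if pos != -1:
        raise UnmappedTextError(f"unmapped character at offset {pos}")
    return text


def count_unknown_glyphs(tetml):
    """Count Glyph elements with unknown="true" in a TETML string."""
    root = ET.fromstring(tetml)
    return sum(
        1
        for elem in root.iter()
        # Compare local names only, so any namespace is accepted.
        if elem.tag.rsplit("}", 1)[-1] == "Glyph"
        and elem.get("unknown") == "true"
    )
```

Raising an exception instead of silently filtering matches the zero-tolerance requirement: the caller has to decide explicitly what happens to a document with unmapped text.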
Processing documents with PDFlib+PDI. When using PDFlib+PDI to process PDF docu-
ments on a per-page basis you can integrate TET for controlling the splitting or merging
process. For example, you could split a PDF document based on the contents of a page. If
you have control over the creation process you can insert separator pages with suitable
processing instructions in the text. The TET Cookbook contains examples for analyzing
documents with TET and then processing them with PDFlib+PDI.
Legacy PDF documents with missing Unicode values. In some situations PDF docu-
ments created by legacy applications must be processed where the PDF may not contain
enough information for proper Unicode mapping. Using the default settings TET may
be unable to extract some or all of the text contents. Recommendations:
>Start by extracting the text with default settings, and analyze the results. Identify
the fonts which do not provide enough information for proper Unicode mapping.
>Write custom encoding tables and glyph name lists to fix problematic fonts. Use the
PDFlib FontReporter plugin for analyzing the fonts and preparing Unicode mapping
tables.
>Configure the custom mapping tables and extract the text again, using a larger num-
ber of documents. If there are still unmappable glyphs or fonts adjust the mapping
tables as appropriate.
>If you have a large number of documents with unmappable fonts PDFlib GmbH may
be able to assist you in creating the required mapping tables.
Convert PDF documents to another format. If you want to import the page contents of
PDF documents into your application while retaining as much information as possible,
you will need precise character metrics. Recommendations:
>Use TET_get_char_info( ) to retrieve precise character metrics and font names. Even if
you use the uv field to retrieve the Unicode values of individual characters, you must
also call TET_get_text( ) since it fills the char_info structure.
>Use granularity=glyph or word in TET_open_page( ), depending on what is better suited
for your application. Working with granularity=glyph may result in conflicts between
the visual layout of text and the processed logical text created by TET (e.g. the two
characters created by a ligature glyph may not fit into the same space as the liga-
ture).
Corporate fonts with custom-encoded logos. In many cases corporate fonts contain-
ing custom logos have missing or wrong Unicode mapping information for the logos. If
you have a large number of PDF documents containing such fonts it is recommended to
create a custom mapping table with proper Unicode values.
Start by creating a font report (see »Analyzing PDF documents with the PDFlib Font-
Reporter Plugin«, page 108) for a PDF containing the font, and locate mismapped glyphs
in the font report. Depending on the font type you can use any of the available configu-
ration tables to provide the missing Unicode mappings. See »Code list resources for all
font types«, page 109, for a detailed example of a code list for a logotype font.
TeX documents. PDF documents produced with TeX often contain numerical
glyph names, Type 3 fonts and other features which prevent other products
from successfully extracting the text. TET contains many heuristics and workarounds
for dealing with such documents. However, a particular flavor of TeX documents can
only be processed with a workaround that requires more processing time, and is dis-
abled by default. You can enable more CPU-intensive font processing for these docu-
ments with the following document option:
checkglyphlists=true
6 Text Extraction
6.1 PDF Document Domains
PDF documents may contain text in many places other than the page contents.
While most applications deal with the page contents only, in many situations other
document domains may be relevant as well.
While the page contents can be retrieved with the workhorse functions TET_get_
text( ) and TET_get_image( ), the integrated pCOS interface plays a crucial role for retriev-
ing text from other document domains.
In the remainder of this section we provide information on domain searching with the TET
library and TETML. In addition, we summarize how to search these document domains
with Acrobat X/XI. This is important to locate search hits in Acrobat.
Text on the page. Page contents are the main source of text in PDF. Text on a page is
rendered with fonts and encoded using one of the many encoding techniques available
in PDF.
>How to display with Acrobat: page contents are always visible
>How to search a single PDF with Acrobat X/XI: Edit, Find or Edit, [Advanced] Search. TET
may be able to process the text in documents where Acrobat does not correctly map
glyphs to Unicode values. In this situation you can use the TET Plugin which is based
on TET (see Section 4.1, »Free TET Plugin for Adobe Acrobat«, page 43). The TET Plugin
offers its own search dialog via Plug-Ins, PDFlib TET Plugin... TET Find. However, it is not
intended as a full-blown search facility.
>How to search multiple PDFs with Acrobat X/XI: Edit, [Advanced] Search and in Where
would you like to search? select All PDF Documents in, and browse to a folder with PDF
documents.
>Sample code for the TET library: extractor mini sample
>TETML element: /TET/Document/Pages/Page
Predefined document info entries. Traditional document info entries are key/value
pairs.
>How to display with Acrobat X/XI: File, Properties...
>How to search a single PDF with Acrobat X/XI: not available
>How to search multiple PDFs with Acrobat X/XI: click Edit, [Advanced] Search and Show
More Options near the bottom of the dialog. In the Look In: pull-down select a folder of
PDF documents and in the pull-down menu Use these additional criteria select one of
Date Created, Date Modified, Author, Title, Subject, Keywords.
>Sample code for the TET library: dumper mini sample
>TETML element: /TET/Document/DocInfo
Custom document info entries. Custom document info entries can be defined in addi-
tion to the standard entries.
>How to display with Acrobat X/XI: File, Properties..., Custom (not available in the free
Adobe Reader)
>How to search with Acrobat X/XI: not available
>Sample code for the TET library: dumper mini sample
>TETML element: /TET/Document/DocInfo/Custom
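When working with TETML instead of the library, the document info entries can be read from the element paths listed above. The following Python sketch follows such a path while ignoring XML namespaces. The function names are hypothetical, and the SAMPLE string is a made-up fragment: the child elements shown under DocInfo are an assumption for illustration, not the exact TETML structure.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; real TETML contains more structure.
SAMPLE = (
    '<TET xmlns="urn:example:tetml"><Document><DocInfo>'
    "<Author>Jane Doe</Author><Title>Report</Title><Custom>QA</Custom>"
    "</DocInfo></Document></TET>"
)


def local(tag):
    """Strip a namespace prefix of the form {uri} from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_path(root, path):
    """Follow a path such as /TET/Document/DocInfo, ignoring namespaces."""
    names = path.strip("/").split("/")
    if local(root.tag) != names[0]:
        return None
    node = root
    for name in names[1:]:
        node = next((c for c in node if local(c.tag) == name), None)
        if node is None:
            return None
    return node


def docinfo_entries(tetml):
    """Return predefined document info entries, skipping Custom children."""
    node = find_path(ET.fromstring(tetml), "/TET/Document/DocInfo")
    if node is None:
        return {}
    return {
        local(c.tag): (c.text or "").strip()
        for c in node
        if local(c.tag) != "Custom"
    }


print(docinfo_entries(SAMPLE))
```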
XMP metadata on document level. XMP metadata consists of an XML stream contain-
ing extended metadata.
>How to display with Acrobat X/XI: File, Properties..., Additional Metadata... (not available
in the free Adobe Reader)
>How to search a single PDF with Acrobat X/XI: not available
>How to search multiple PDFs with Acrobat X/XI: click Edit, [Advanced] Search and Show
More Options. In the Look In: pull-down select a folder of PDF documents and in the
pull-down menu Use these additional criteria select XMP Metadata (not available in
the free Adobe Reader).
>Sample code for the TET library: dumper mini sample
>TETML element: /TET/Document/Metadata
XMP metadata on image level. XMP metadata can be attached to document compo-
nents, such as images, pages, fonts, etc. However, XMP is commonly only found on the
image level (in addition to document level).
Fig. 6.1: Acrobat’s advanced search dialog