Which choice of TIFF? – An explanation
The TIFF output format you choose affects the quality of the TIFF files produced.
Different compression models decrease the file size by grouping same-colored pixels, but this may at times
have an adverse effect on quality.
There are two basic classes of compression:
Lossy – this approximates the original image by averaging out small differences, so some detail is lost
Lossless – this keeps all detail of the original image upon conversion
LZW (Lempel-Ziv-Welch) - Lossless - is most effective when compressing solid indexed colors (graphics), but less
effective for 24-bit continuous-tone photo images. Because LZW is lossless, there is no quality loss due to
compression.
G3 (Group 3) - Lossless - uses the CCITT T.4 standard, which is a line-by-line method of coding.
G4 (Group 4) - Lossless - uses the CCITT T.6 standard, which allows for better compression ratios by
comparing each scanned line with the previous one.
JPG - Lossy - compression is efficient, but you will lose some quality.
PackBits compression - Lossless - uses run-length compression, which is very effective for reducing the size of
bitmap files which contain large areas of solid color.
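To illustrate why run-length compression works well on large areas of solid color, the following is a minimal Python sketch of the PackBits scheme. It is a textbook encoder and decoder written for this guide, not the code pdfDocs uses, and the function names are hypothetical.

```python
def packbits_encode(data: bytes) -> bytes:
    """Encode bytes with PackBits run-length compression (illustrative sketch)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run of identical bytes starting at i (at most 128 long).
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            # A run is stored as a header byte (257 - run) plus the repeated byte.
            out.append(257 - run)
            out.append(data[i])
            i += run
        else:
            # Collect literal bytes until a run of two or more begins.
            start = i
            i += 1
            while i < n and i - start < 128 and not (i + 1 < n and data[i] == data[i + 1]):
                i += 1
            # Literals are stored as a header byte (count - 1) plus the bytes.
            out.append(i - start - 1)
            out.extend(data[start:i])
    return bytes(out)


def packbits_decode(data: bytes) -> bytes:
    """Reverse packbits_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        header = data[i]
        i += 1
        if header < 128:
            count = header + 1          # copy the next count bytes literally
            out.extend(data[i:i + count])
            i += count
        elif header > 128:
            count = 257 - header        # repeat the next byte count times
            out.extend(data[i:i + 1] * count)
            i += 1
        # header == 128 is a no-op in PackBits
    return bytes(out)
```

Ten identical bytes shrink to two bytes, while three different bytes grow by one header byte, which is why this method suits solid-color bitmaps rather than photographs.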
There are two color TIFF formats that produce uncompressed outputs:
TIFF 12bit RGB Color (No Compression) = It is Color, with 12bit RGB output (4 bits per component)
TIFF 24bit RGB Color (No Compression) = It is True Color, with 24bit RGB output (8 bits per component)
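As a rough guide to what the two uncompressed formats mean for file size, the sketch below estimates the raw pixel data for a page. It assumes each row is padded to a whole byte and ignores TIFF headers and tags; the page dimensions (approximately A4 at 300 dpi) and the function name are illustrative only.

```python
def uncompressed_size_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Estimate raw pixel data size, with each row rounded up to a whole byte."""
    bits_per_row = width * bits_per_pixel
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * height


# Approximately an A4 page scanned at 300 dpi (illustrative dimensions).
width, height = 2480, 3508
size_12bit = uncompressed_size_bytes(width, height, 12)  # 4 bits per component
size_24bit = uncompressed_size_bytes(width, height, 24)  # 8 bits per component
print(size_12bit, size_24bit)
```

Under these assumptions the 24-bit output is twice the size of the 12-bit output for the same page.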
The remaining TIFF formats all produce black-and-white output with different compression modes:
TIFF b/w, CCITT = G3 Fax Encoding, with no EOLs (End of Line). It uses CCITT Group 3 compression.
TIFF b/w, Fax Group 3 = G3 Fax Encoding, with EOLs (End of Line). It uses CCITT Group 3 compression.
TIFF b/w, Fax Group 3 2-D = 2-D G3 Fax Encoding. It uses CCITT Group 3 compression.
TIFF b/w, Fax Group 4 = G4 Fax Encoding. It uses CCITT Group 4 compression.
TIFF b/w, LZW (Compression tag 5) = It uses LZW-compatible compression.
TIFF b/w, PackBits (Compression tag 32773) = It uses PackBits compression.
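The compression tag numbers quoted above (5 and 32773) are values of the TIFF Compression tag. The following Python sketch lists the common values as defined in the TIFF 6.0 specification and its later JPEG technical note, and marks each as lossless or lossy. The lookup table and function name are illustrative and are not part of pdfDocs.

```python
# Values of the TIFF Compression tag: (name, is_lossless).
# 7 is the JPEG value from the later TIFF technical note.
TIFF_COMPRESSION = {
    1: ("None", True),
    2: ("CCITT modified Huffman RLE", True),
    3: ("CCITT Group 3 (T.4)", True),
    4: ("CCITT Group 4 (T.6)", True),
    5: ("LZW", True),
    7: ("JPEG", False),
    32773: ("PackBits", True),
}


def describe_compression(tag_value: int) -> str:
    """Return a short description of a TIFF Compression tag value."""
    if tag_value not in TIFF_COMPRESSION:
        return f"{tag_value}: unknown compression scheme"
    name, lossless = TIFF_COMPRESSION[tag_value]
    kind = "lossless" if lossless else "lossy"
    return f"{tag_value}: {name} ({kind})"


print(describe_compression(5))
print(describe_compression(32773))
```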
Appendix E - pdfDocs OCRDesktop
Overview
The following guide provides a quick reference to configuring and using pdfDocs OCRDesktop.
Introduction
pdfDocs provides the ability to convert PDF documents to multiple formats regardless of whether the PDF
document contains only graphics, a mixture of text and graphics, or text only. There is no server software or
server component required for this to occur – all processing of documents is performed with your pdfDocs or
compareDocs software. This means that you can OCR documents when you are working remotely from your office
with your laptop or have no server access.
Using pdfDocs, you can convert PDF documents into the following output formats: PDF Image with invisible text
layer (a text-searchable PDF), Word, Excel or PDF/A.
What does OCR mean?
OCR (optical character recognition) is the recognition of printed or written text characters by a computer. This
involves photo scanning of the text character by character, analysis of the scanned-in image, and then
translation of each character image into a character code, together with formatting information such as the
font style and underlining.
In OCR processing, the scanned-in image or bitmap is analyzed for light and dark areas in order to identify each
alphabetic letter or numeric digit. When a character is recognized, it is converted into the relevant letter or
number. OCR technology will also check the words found against a dictionary of words for the language that the
document has been typed in to improve accuracy and to enable special fonts only used in certain languages to
be recognized and differentiated from other graphics or marks on the page.
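The dictionary check described above can be pictured with a small Python sketch. This is only an illustration using the standard `difflib` module; it is not the method used by the pdfDocs OCR engine, and the word list, cutoff value and function name are assumptions made for the example.

```python
import difflib

# A tiny stand-in for a language dictionary (illustrative only).
DICTIONARY = ["document", "scanner", "template"]


def check_against_dictionary(word: str, dictionary=DICTIONARY, cutoff: float = 0.75) -> str:
    """Return the dictionary word closest to a recognized word, if any is close enough."""
    lowered = word.lower()
    if lowered in dictionary:
        return lowered
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    # Keep the recognized text unchanged when nothing in the dictionary is similar.
    return matches[0] if matches else word


# "rn" is a common misreading of "m" in scanned text.
print(check_against_dictionary("docurnent"))
```

A word that resembles nothing in the dictionary is returned unchanged, which mirrors the idea that marks on the page should not be forced into words.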
OCR produces a high degree of accuracy on typed documents with font sizes ranging from 10 point and
upwards. It produces a lower degree of accuracy for documents with font sizes of 9 point or less, handwritten
documents, or poor quality document images.
Converting PDF Documents to Word, Excel or Text Format
pdfDocs has two methods of converting PDF documents into text formats such as Microsoft® Word, Excel or
Text format and it is important to understand the benefits of each as this can significantly improve results and
the speed of processing.
PDF Documents without Text
If your PDF document contains only graphic images, or is an image PDF that has previously been OCR’d, then
pdfDocs will convert the entire document to a graphic and then interpret what each character is, using Optical
Character Recognition technology to create the Microsoft® Word document.
PDF Documents containing text
If your PDF document contains mostly text and possibly some images (such as logos, pictures, etc.), pdfDocs OCR
will convert the document to Word format without using OCR technology – it will directly interpret the text and
font information to create your Word document. This will provide the best quality output.
Set pdfDocs to use OCRDesktop
pdfDocs can be configured to use either pdfDocs OCRDesktop or OCR Server as its OCR engine. You can switch
from one to the other on a per-user basis, so you may wish to change this setting yourself.
To do this, go to ‘File > Options > OCR’ and choose the radio button for OCRDesktop.
OCR Desktop is shipped with a number of system-defined Recognition Templates. These cannot be edited, but
they can be copied, at which point the copies can be opened, edited and renamed if required. These templates can
either be kept locally for an individual user or deployed.
Server vs Local Templates
OCR Templates can either be Server based or Local. When you first install pdfDocs Desktop, the server-based
templates will be available to each user automatically. For information about how to deploy server based
templates, go to the section xx page.
The Publishing templates window shows all server and local templates. The columns with checks in them (PDF,
PDF/A, Word, Excel, Text) indicate which output document formats the template may be used for.
Server Templates
Server templates are templates that are stored on your server and are shared by all users. In the Publishing
Templates window, these appear grayed out and have ‘Read-only’ at the end of their name. They
cannot be modified by a local user. They are either the original templates created by DocsCorp, or your own
local templates that have been converted to Server templates. Each user has their own local copy of the server
templates (so that mobile users can continue to access them). pdfDocs updates these copies automatically on a
regular basis from the location where the pdfDocs installer was originally run on the server or
network location (for example “…\config\recognition templates\server”), with each template being a separate
.XML file.
Local Templates
Local templates are templates created by a user by adding a new template or copying an existing template.
The local template can then be configured to either leave each setting as defined in the system defaults for OCR
Desktop, or override to suit user requirements.
Creating Local Templates
Copy
To create a local template, click on the Server template that most closely matches your requirements and click
on the ‘Copy’ button. You may then rename the newly created template by either clicking on the ‘Rename’ button
or by just double-clicking on the name of the template.
Add
The Add button differs from Copy in that it creates a local template based on the OCR Desktop system
defaults. Initially all settings are set to ‘Default’ (the setting is managed by pdfDocs), but the user can
manually change them in the template to force a particular action regardless of the system defaults.
Configuring Local Templates
To configure a Local template, select the template in the list and click on the Open button. Note that if you
open a Server based template, you will be able to see all the settings, but not modify any – all options will be
grayed out.
To find out more about each individual setting, just hold your mouse over the checkbox or setting and a balloon
help window will describe the option. This guide does not repeat all the information contained in these balloon
help messages.
There are four separate tabs containing output settings for OCR Desktop. The Legend at the bottom is
important.
A shaded text box means that this setting is managed by OCR Desktop and is based on the underlying OCR Desktop
system defaults. It is generally recommended that you leave settings untouched unless you really need to change
them to something of your choice – this choice then overrides whatever was set in the system defaults.
A checked option means that this setting is switched on regardless of the setting in the system defaults.
An unchecked option means that this setting is switched off regardless of the setting in the system defaults.
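The three states in the legend amount to a simple override rule, which the following Python sketch illustrates. The setting names, the default values and the use of `None` to represent a shaded (managed) setting are all assumptions made for the example; they do not reflect how pdfDocs stores its templates.

```python
from typing import Dict, Optional

# Hypothetical system defaults for two illustrative settings.
SYSTEM_DEFAULTS: Dict[str, bool] = {"deskew": True, "despeckle": False}


def resolve_settings(template: Dict[str, Optional[bool]],
                     defaults: Dict[str, bool] = SYSTEM_DEFAULTS) -> Dict[str, bool]:
    """Apply template overrides on top of the system defaults.

    None    = shaded: the setting follows the system default
    True    = checked: switched on regardless of the default
    False   = unchecked: switched off regardless of the default
    """
    resolved = {}
    for name, default in defaults.items():
        override = template.get(name)
        resolved[name] = default if override is None else override
    return resolved


# A local template that leaves 'deskew' managed and forces 'despeckle' on.
print(resolve_settings({"deskew": None, "despeckle": True}))
```

A template created with Add corresponds to every setting being `None` in this sketch, so it simply follows the system defaults until the user changes something.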
PDF Options
These options control image resolution and compression for the PDF or PDF/A document created once the OCR
recognition process (as defined on the Language and Recognition tabs) has completed. These are document output
settings.
To allow this template to be used when creating PDF or PDF/A documents, select the ‘Enable PDF’ or
‘Enable PDF/A’ checkboxes. For more information on exactly what each setting means, hold your mouse pointer
over the description field for a balloon help field to display.
Word/Excel/Text Options
This tab defines output options when creating Word, Excel or Text documents once the OCR recognition
process (as defined on the Language and Recognition tabs) has completed.
Use the ‘Enable Word’, ‘Enable Excel’ and ‘Enable Text’ checkboxes to allow this template to support output in
these document formats. For more information on exactly what each setting means, hold your mouse
pointer over the description field for a balloon help field to display.
Languages
This tab defines what languages the OCR process will look for in the document. These settings are only used if
you use the ‘Perform OCR’ checkbox when publishing to Word, Excel and Text, and are always used when
OCR’ing to PDF.
You should check (select) the languages in this list that you expect to see in your documents. You will get the
best results if you select as few languages as possible. This does not mean that the occasional word in another
language will not be recognized correctly – this will still occur when required. However, the OCR process will
attempt to guess between random imperfections on the page to determine if they are a valid character in your
selected language(s) or really are just marks and unreadable areas on the page and so should be ignored.
Ideally, you should create a template for each document language you are OCR’ing for optimum results.
Recognition Options
The Recognition options tab controls how the source PDF document is to be interpreted, regardless of the
output format of the document.
Here you can adjust the recognition methods to suit the source of your documents or the types of documents
you are OCR’ing. For example, you may have different settings for OCR’ing faxes than for documents coming from
your scanner, or you can switch on the ‘Skew’ settings to automatically adjust documents that have
been scanned on an angle. For more information on exactly what each setting means, hold your mouse pointer
over the description field for a balloon help field to display.
Recognition Options – Blank Page Handling
The ability to split pages through the detection of blank page separators can be enabled as part of an OCR
Template.
By default, Blank Page Handling is disabled for all current OCR Templates.
To enable, you will need to copy a default template and create a local template.
Once you have created the local template copy, open the template and select the Recognition Options tab, on
which you will be able to locate the Blank Page Handling options.
By default, these settings are set to Default, the default action being disabled.
Enabling these settings is only applicable for documents that are passed through
a WatchFolder and when the target location is an Organizer Project.
Blank Page Handling – Separate
Values: Default, Disabled, Enable, Enabled if First Page Blank
Default = The document will not be split. The OCR’d document will have the same number of pages as the
original document.
Disabled = The document will not be split. The OCR’d document will have the same number of pages as the
original document.
Enable = The document will be split for each blank page found. The blank page will prefix the next output
document. A single output document will have a suffix of “_OCR”. Multiple output documents will have a suffix
of “_OCR_n”, where n is an integer starting with 1.
Enabled if First Page Blank = If the first page of the original document is blank, follow the Enabled
requirements. If the first page of the original document is not blank, follow the Disabled requirements.
Blank Page Handling – Behavior
Values: Default, Keep, Discard
This option is only applicable if Blank Page Handling Separate = Enabled OR Blank Page Handling Separate =
Enabled if first page blank.
Default = The document will not be split. The OCR’d document will have the same number of pages as the
original document.
Keep = All blank pages will be kept. The total pages of all the output document(s) will equal the number of
pages of the original document(s).
Discard = All blank pages will be deleted. The total pages of all the output document(s) will be reduced by
the number of blank pages.
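The interaction between the Separate and Behavior options described above can be sketched in Python as follows. This is an illustration of the documented behavior only, not pdfDocs code. Pages are modeled as strings with an empty string standing for a blank page, the option values are written as hypothetical lowercase strings, and only the Keep and Discard behaviors are modeled. Dropping an output document that would contain no pages is an assumption.

```python
from typing import List


def split_on_blank_pages(pages: List[str], separate: str = "default",
                         behavior: str = "keep") -> List[List[str]]:
    """Split a document at blank pages according to the template options."""
    def is_blank(page: str) -> bool:
        return page == ""

    if separate == "enabled_if_first_page_blank":
        # Follow the enabled rules only when the first page is blank.
        separate = "enable" if pages and is_blank(pages[0]) else "disabled"
    if separate != "enable":
        return [list(pages)]  # Default and Disabled: the document is not split.

    documents: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        if is_blank(page):
            if current:
                documents.append(current)
            # The blank page prefixes the next output document, or is dropped.
            current = [page] if behavior == "keep" else []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


def output_names(base: str, count: int) -> List[str]:
    """Apply the documented suffixes: _OCR for one output, _OCR_n for several."""
    if count == 1:
        return [f"{base}_OCR"]
    return [f"{base}_OCR_{n}" for n in range(1, count + 1)]


pages = ["A", "B", "", "C", "D"]
parts = split_on_blank_pages(pages, separate="enable", behavior="discard")
print(parts, output_names("scan", len(parts)))
```

With Keep, the page counts of the outputs add up to the original page count; with Discard, the total drops by the number of blank pages, as described above.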