Final Report PDF
3 Project decisions
After four existing solutions to the problem were investigated, it was decided to change
the approach of the conversion from attempting to reproduce the page layout to
intelligent text extraction. The reasons for this decision, and a plan of the work to be
carried out, are shown below.
3.1 Change of aims and objectives
The original aims of the project were to perform an accurate conversion, maintaining
the page layout, fonts, graphics and other elements as closely as possible. After
investigating the existing solutions to the problem, it was found that this approach had
already been successfully implemented in three different pieces of software.
Although visually accurate, the results from these converters were not very practical.
Most of the advantages of the HTML format were lost with this type of conversion:
the text was too small and could not be re-flowed, and the output could not easily be
converted into a web-publishable document. It was therefore decided to change the aims
of the project to intelligent text extraction, that is, detecting elements such as
paragraphs and headings and using HTML’s features to represent them in the converted file.
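
To make the idea of intelligent text extraction concrete, the following is a minimal
sketch, not the implementation described in this report. It assumes that each text line
has already been extracted from the PDF together with its font size, and it uses a
simple font-size threshold to separate headings from paragraph text. The function name
lines_to_html and the parameters body_size and heading_ratio are hypothetical, chosen
for illustration only.

```python
from html import escape


def lines_to_html(lines, body_size=10.0, heading_ratio=1.2):
    """Turn (text, font_size) pairs into <h2> and <p> elements.

    Assumption: a line whose font size is at least heading_ratio times
    body_size is a heading; everything else is paragraph text. A blank
    line ends the current paragraph.
    """
    out = []
    paragraph = []

    def flush():
        # Emit the lines collected so far as one re-flowable paragraph.
        if paragraph:
            out.append("<p>" + escape(" ".join(paragraph)) + "</p>")
            paragraph.clear()

    for text, size in lines:
        text = text.strip()
        if not text:
            flush()
        elif size >= body_size * heading_ratio:
            flush()
            out.append("<h2>" + escape(text) + "</h2>")
        else:
            paragraph.append(text)
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    sample = [
        ("3.1 Change of aims", 14.0),
        ("The original aims", 10.0),
        ("were to convert.", 10.0),
    ]
    print(lines_to_html(sample))
```

Because consecutive body lines are joined into a single paragraph element, a browser can
re-flow the text freely, which is the property lost when the page layout is reproduced
exactly.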
This approach is far more challenging than simply maintaining the page layout, as it
involves programming a computer to understand the elements of a page in the way
that a human would. This fact had not been fully realized at the time of writing the
Progress Report, and it was therefore necessary to further modify the objectives to
simplify the implementation so that it could be completed in the time allocated for the
project. As a result, the implementation looks solely at the text elements of the PDF.
3.2 Implementation decisions
A huge variety of documents is stored in PDF format, ranging from
simple layouts such as manuals and research papers to complex layouts such as
newsletters, catalogues, tables and forms. All the existing solutions took a
“one-step”, layout-independent approach that performed an accurate conversion
reproducing the original layout, thus providing an acceptable result regardless of the
type of document.
This one-step approach is not possible with intelligent text extraction, as the program
itself must understand the particular page layout. Many features in complex layouts