PDF TO HTML CONVERSION
Final Report
5 Performance evaluation
This section evaluates the results given by the conversion software. Annotated
examples of the program’s output from each of the four PDF files are given, together
with an analysis of the results. These are then compared with the output of
the existing solutions covered in section 2.
5.1 Analysis of converted output
Overall, the conversion software was found to produce good results with simple and
moderately complex documents, including the four PDF files that comprise the
conversion material chosen in section 3.3.
In most cases, simple multi-column layouts were correctly detected and the columns
were output in the correct order. More complex layouts, such as the US Military Force
in Columbia article in section 5.5, where an article spans two columns, gave less
successful results. Occasionally, page elements such as headers, footers and
captions appeared between the columns instead of being recognized as
miscellaneous and moved to the bottom of the page. One major improvement,
given as a suggestion for further work, would be to examine graphical elements of the
page such as lines and rectangles, which often indicate to the reader where
different articles start and end.
Most features of the page layout, as described in section 4.5, were detected and
handled correctly. In particular, the line spacing detection algorithm worked very
successfully and did not result in any unwanted new paragraphs. Some of the feature
detection methods, such as hyphenation detection, corrected some errors but
caused others. Where the program met a double-barrelled word that was wrapped
to the next line (such as horse-pistols, described in section 5.3.2), it would still remove
the hyphen and merge the two parts of the word into one. From the information
available to the program it is impossible to tell whether the hyphen should remain in
place or be removed. Only by understanding the text itself can this decision be made.
One solution here would be to use a dictionary of hyphenated words, although this was
beyond the scope of the project.
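
The dictionary approach suggested above could be sketched as follows. This is an illustration only and not part of the conversion software: the function name join_wrapped_word and the small HYPHENATED_WORDS set are hypothetical, and a practical version would load a much larger word list. The sketch assumes the program has already isolated the fragment ending one line (with its trailing hyphen) and the fragment starting the next.

```python
import string

# Hypothetical stand-in for a real dictionary of hyphenated words.
HYPHENATED_WORDS = {"horse-pistols", "well-known", "mother-in-law"}


def join_wrapped_word(head, tail, hyphenated=HYPHENATED_WORDS):
    """Join a word that was split across a line break.

    `head` is the fragment ending the first line, including its trailing
    hyphen; `tail` is the fragment starting the next line.
    """
    if not head.endswith("-"):
        raise ValueError("head must end with a hyphen")
    candidate = head + tail
    # Ignore case and surrounding punctuation when consulting the dictionary.
    key = candidate.lower().strip(string.punctuation)
    if key in hyphenated:
        # Known hyphenated word: the hyphen is part of the word, so keep it.
        return candidate
    # Otherwise treat the hyphen as a line-wrap artifact and remove it.
    return head[:-1] + tail


print(join_wrapped_word("horse-", "pistols"))
print(join_wrapped_word("conver-", "sion"))
```

With this sketch, a word found in the dictionary keeps its hyphen, while any other wrapped word is merged as before. Words missing from the dictionary would still be merged incorrectly, so the quality of the result depends on the coverage of the word list.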
Similarly, the detection of forced carriage returns depended on the ratio of the line
width to the text width. Therefore, forced carriage returns were not recognized in
longer lines and were merged into a complete line of text. Again, without understanding