6. Go through all lines in the multi-line block.
7. Compare the text elements on one line with the text elements on the next line.
8. If two text elements on consecutive lines have almost the same left value, the same
format, and top values that do not differ much, these text elements are
assumed to belong together. The difference of their top values has to be less than the
average line distance of this multi-line block. If two text elements fit these requirements, they
are merged. To merge two lines, all the text elements on the second line must be merged
with the text elements on the previous line.
9. // Once the lines that belong together are merged, we have to find a title, if there is one.
10. Go back at most ten lines and check whether each of these lines contains a title.
11. For each of these lines, check whether the text element on this line lies approximately
midway between the multi-line block's side boundaries or whether the text contains the string “table”.
12. // Whether or not a title is found, the next step is to determine the header part of the table.
13. The first line whose number of elements equals the maximum number of elements in
this multi-line block is the last line of the header. Thus, the next line is the first data row.
14. The lines are then explored backward, starting from the last line of the header.
15. For each text element on the header lines there are two possibilities: either this element is itself
a super-header (i.e. a header at the top of the hierarchy), or this element has a super-header
that lies above it. In the latter case, the super-header for this element must be
determined. There are five possibilities:
a. The text element lies directly under a text element on the previous line.
b. The left-point of the text element lies under a text element on the previous line.
c. The right-point of the text element lies under a text element on the previous line.
d. The text element lies completely on the left of a text element on the previous line.
e. The text element lies completely on the right of a text element on the previous line.
16. If case d or e occurs, the text element is assigned to the element that is nearest to this text element.
17. // After constructing the header part, the data-rows are explored to assign each text
// element to a header element.
18. For each line, starting with the first data row, assign the elements on this line to a header
element on the last header line using the rules listed above.
Figure 19: Pseudo-code of the second classification step
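Steps 8, 15 and 16 of Figure 19 can be sketched in Python as follows. This is only an illustrative sketch, not the code of the tool: the TextElement class, its field names (with `font` standing in for the "format"), the `left_tolerance` value, and the choice to measure "nearest" by the distance between horizontal centres are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextElement:
    # Hypothetical stand-in for a text element extracted from the PDF.
    text: str
    left: float
    top: float
    width: float
    font: str = ""  # assumed representation of the element's "format"

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def centre(self) -> float:
        return self.left + self.width / 2.0


def belong_together(upper: TextElement, lower: TextElement,
                    avg_line_distance: float,
                    left_tolerance: float = 2.0) -> bool:
    """Step 8: same format, almost the same left value, and a top
    difference below the average line distance of the block."""
    return (
        upper.font == lower.font
        and abs(upper.left - lower.left) <= left_tolerance
        and abs(lower.top - upper.top) < avg_line_distance
    )


def find_super_header(element: TextElement,
                      previous_line: List[TextElement]) -> Optional[TextElement]:
    """Steps 15 and 16: choose the element on the previous line that
    acts as the super-header of `element`."""
    if not previous_line:
        return None  # the element is itself a super-header

    # Case a: the element lies directly (completely) under a candidate.
    for cand in previous_line:
        if cand.left <= element.left and element.right <= cand.right:
            return cand
    # Case b: the left point of the element lies under a candidate.
    for cand in previous_line:
        if cand.left <= element.left <= cand.right:
            return cand
    # Case c: the right point of the element lies under a candidate.
    for cand in previous_line:
        if cand.left <= element.right <= cand.right:
            return cand
    # Cases d and e: no candidate lies above the element, so take the
    # nearest one (assumption: nearest by horizontal centre distance).
    return min(previous_line, key=lambda c: abs(c.centre - element.centre))
```

The same `find_super_header` function can be reused for step 18 by passing the last header line as `previous_line` and a data-row element as `element`.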
5.2.3 Limitations of the Approach
Because of the complexity of the task, the bugs in the pdf2html tool, and the heuristic-based
approach, which cannot cover every table structure, you cannot assume that this tool returns a
perfect conversion for each table. You should expect that post-processing, in the form of changes
made in the user interface, is needed in almost every case. Therefore, you should always check the
output of the tool, and you should use the “with interaction” option of the interface, because
correcting errors as early as possible is the best solution.
All the steps and rules explained in 5.2.2 aim to reconstruct the table as well as
possible and as close to its original as possible. This aim is often not reached, for several reasons:
The tool that I used to extract the text elements from the PDF file contains a number of bugs
that affect the result of my heuristic-based approach. This sensitivity is a direct result of the
nature of the approach: all the rules are based on assumptions. The basic and most important
assumption is that the XML code returned by the pdf2html tool contains correctly extracted
data. If that is not the case, the result of my implementation suffers.