Final Report PDF
Text in a PDF is held as a series of text fragments. These fragments may be written to
the PDF file (and hence extracted by JPedal) in any order. Each text fragment usually
contains one full line of text, although changes in formatting and the inclusion of
certain symbols require the line to be split into several fragments. Some PDF file
creators place each word or character in a separate fragment.
Each text fragment is held as an XML Element whose attributes hold information about
the fragment. The only attribute used at this stage was content, a string containing
the text wrapped in XML/HTML formatting information and enclosed by START and END
tags, as shown below:
<~START><FONT face="Minion-Regular" style="font-size:12pt">The quick
brown fox jumps over the lazy dog</FONT><~END>
Fig 4.3: Example of output from JPedal
The string must therefore be processed to separate the text from the formatting
information; the textOf method does this. Other data, such as co-ordinates and
font size information, were accessed directly from the arrays by the PdfGrouping class.
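The separation of text from markup can be sketched as follows. This is only an illustration of the idea, not the textOf implementation used in the project: the class name TextOfSketch, the regular expression and the set of decoded entities are all assumptions made for the example.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch only; not the project's actual textOf method. */
final class TextOfSketch {

    // Matches any markup tag, including the <~START> and <~END> markers.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the plain text held in a fragment's content string. */
    static String textOf(String content) {
        if (content == null) {
            return "";
        }
        // Remove every tag, leaving only the text between them.
        String text = TAG.matcher(content).replaceAll("");
        // Decode a small, assumed set of entities; "&amp;" goes last so that
        // a sequence such as "&amp;lt;" is not decoded twice.
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }
}
```

Applied to the content string of Fig 4.3 (written on a single line), this sketch returns "The quick brown fox jumps over the lazy dog".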
4.3 Text merging principles
The processPageFragments method in the PdfGrouping class performs the text merging
procedure, calling other methods in the PdfGrouping class and interfacing between the
front end and the library.
In the PdfGrouping class, text fragment data is held in a number of arrays, each being
the same size as the total number of text fragments. The contents of these arrays are
updated when the copyToFragmentArrays method is called at the beginning of the
processPageFragments method. Each array holds information about one particular
attribute, and the data for a particular text fragment is held at the same index in
every array. Hence, for a given index i, heights[i], f_start_font_size[i], contents[i]
and text_length[i] are all attributes of the same text fragment.
The attributes that were used in the grouping and merging algorithms include:
• contents: the actual text embedded in XML/HTML formatting information
• f_x1, f_x2, f_y1, f_y2: co-ordinates of the bounding box of the text fragment
• heights: the height of the text fragment
• f_start_font_size and f_end_font_size: start and end font sizes respectively
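The parallel-array layout described above can be sketched as follows. The array names are taken from the list; the class name FragmentArraysSketch, the element types (float co-ordinates, int font sizes) and the store and fontSizeChanges helpers are assumptions made for the example and are not part of PdfGrouping.

```java
/** Hypothetical sketch of the parallel arrays; only the array names come from the report. */
final class FragmentArraysSketch {

    // One array per attribute, each sized to the number of fragments.
    final String[] contents;
    final float[] f_x1, f_y1, f_x2, f_y2; // element types are assumed
    final float[] heights;
    final int[] f_start_font_size, f_end_font_size;

    FragmentArraysSketch(int fragmentCount) {
        contents = new String[fragmentCount];
        f_x1 = new float[fragmentCount];
        f_y1 = new float[fragmentCount];
        f_x2 = new float[fragmentCount];
        f_y2 = new float[fragmentCount];
        heights = new float[fragmentCount];
        f_start_font_size = new int[fragmentCount];
        f_end_font_size = new int[fragmentCount];
    }

    /** Stores every attribute of one fragment at the same index i. */
    void store(int i, String content, float x1, float y1, float x2, float y2,
               float height, int startSize, int endSize) {
        contents[i] = content;
        f_x1[i] = x1;
        f_y1[i] = y1;
        f_x2[i] = x2;
        f_y2[i] = y2;
        heights[i] = height;
        f_start_font_size[i] = startSize;
        f_end_font_size[i] = endSize;
    }

    /** True if fragment i starts and ends in different font sizes. */
    boolean fontSizeChanges(int i) {
        return f_start_font_size[i] != f_end_font_size[i];
    }
}
```

Because index i identifies the fragment in every array, any attribute can be read without building a separate object for each fragment; the cost is that all arrays must be kept the same length and updated together.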