which is at the lower left of the PostScript page. The text within the parentheses in this case is the
string text, and the integer after this string refers to the width of the string, which is measured in
units of 1/720". It can be seen that there is also a row that consists of a single character "P". This
row signifies a page break.
The tool can also be operated in the COMPLEX mode of operation. In this case, it also adds rows
for colour changes and indicates the presence of an image.
If either of the two modes described above is used, it is necessary to write a program to interpret
this data, and rebuild the information into meaningful text. There is however a third mode of
operation to the tool, which is the SIMPLE mode. This mode simply extracts the text from the
PostScript, and outputs it in a plain text form. It is this form that I shall now discuss.
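To give an idea of what such an interpreting program involves, the following is a minimal sketch only. It assumes that string rows have the shape `S <x> <y> (<text>) <width>` and that a page break is a lone `P`; the exact row layout is not shown in this section, so the pattern, the function name `rebuild_text` and the sample rows are all illustrative assumptions. Escaped characters inside the parentheses are not handled.

```python
import re

# Assumed row shape (hypothetical, not confirmed by this section):
#   S <x> <y> (<text>) <width>    a string placed at position x, y
#   P                             a page break
STRING_ROW = re.compile(r"^S\s+(-?\d+)\s+(-?\d+)\s+\((.*)\)\s+(-?\d+)\s*$")


def rebuild_text(rows):
    """Rebuild one plain-text string per page from position/string rows."""
    pages = []
    current = {}  # y coordinate -> list of (x, text) found on that line
    for row in rows:
        row = row.strip()
        if row == "P":
            pages.append(current)
            current = {}
            continue
        match = STRING_ROW.match(row)
        if match is None:
            continue  # any other row type is ignored in this sketch
        x, y, text, _width = match.groups()
        current.setdefault(int(y), []).append((int(x), text))
    if current:
        pages.append(current)

    result = []
    for page in pages:
        lines = []
        # The origin is at the lower left, so a larger y is higher on the page.
        for y in sorted(page, reverse=True):
            lines.append(" ".join(text for _x, text in sorted(page[y])))
        result.append("\n".join(lines))
    return result
```

Strings that share a y coordinate are joined left to right into one line, and lines are ordered from the top of the page downwards. A real interpreter would also need to use the widths to decide where spaces belong.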
The ps2ascii.ps tool is invoked in SIMPLE mode as follows:
«
gs -q -dNODISPLAY -dNOBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps
[filename] -c quit
»
This will cause the plain text output to be streamed to STDOUT. The tool can convert both
PostScript and PDF documents to plain text format.
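Since the output arrives on STDOUT, the conversion is easy to drive from another program. The sketch below wraps the SIMPLE mode invocation in Python; the function names `ps2ascii_command` and `extract_text` are hypothetical, and it assumes that `gs` is on the PATH and that Ghostscript can locate `ps2ascii.ps`.

```python
import subprocess


def ps2ascii_command(filename, gs="gs", script="ps2ascii.ps"):
    """Build the argument list for a SIMPLE mode run of ps2ascii.ps."""
    return [
        gs, "-q", "-dNODISPLAY", "-dNOBIND", "-dWRITESYSTEMDICT", "-dSIMPLE",
        "-c", "save", "-f", script, filename, "-c", "quit",
    ]


def extract_text(filename):
    """Run the conversion and return whatever the tool wrote to STDOUT."""
    result = subprocess.run(
        ps2ascii_command(filename),
        capture_output=True,  # collect STDOUT instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Passing the arguments as a list avoids any shell quoting problems with unusual file names.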
Unfortunately, while testing the tool, I found it to be fairly slow. For example, a PDF document of
approximately 182 KB took 1 minute and 40 seconds to convert. This is quite a slow conversion for
the file, certainly slower than certain other converters were for the same task. Although some of this
lack of speed can be attributed to the poor performance of the machine on which I conducted the
tests, this tool still operated more slowly than some of the other tools tested.
The plain text output of the ps2ascii.ps tool is of a fairly high quality. Lines are rather long. In
fact, the lines produced by this tool are considerably longer than those produced by any other tool
that I have tested. This seems to be due to the attempts made by the tool to guess when a line has
actually been broken by an intentional new line, as opposed to when it has simply been wrapped
due to its length. This line break guessing can be both a positive factor and a negative factor. On
the positive side, it is better if the lines do not have "\n" characters in them when it is really just a
case of the line being wrapped due to the PostScript page size, and this tool helps to prevent this. On
the negative side however, the attempts made by the tool to guess where lines are truly supposed to
be broken can be erroneous. It can break lines incorrectly, and in particular, it seems to have
difficulties separating a section title from the line that follows it.
Take, for example, the following title followed by some example text:
6 Architecture Recap
Up to this point, we have discussed the API: its methods and outputs. However, recall that the
application shown in Figure 3 requires solving two very difficult problems: analysis, to find the
li......
Figure 5. Sample text: a section title followed by the section itself.
It can clearly be seen from the above text that the title line reads "6 Architecture Recap", and is
located on its own line, with a blank line separating it from the passage of text that follows it.
However, when the ps2ascii.ps tool translated it to plain text, the following was produced:
6 Architecture Recap Up to this point, we have discussed the API: its methods
and outputs. However, recall that the application shown in Figure 3 requires
solving two very difficult problems: analysis, to find the li......
Figure 6. The plain text produced by ps2ascii for the text shown in Figure 5.
In the above example, it can clearly be seen that the line separating the title and the body of text has
been completely lost, and the title line is now part of the body of text. The tool seems rather
unpredictable in its behaviour when splitting titles from the text that follows them. Sometimes, it has
no trouble in recognising that the title is not part of the body of text, and separates the two entities
in the same way that they are separated in the PostScript/PDF. Other times however, it is unable to
distinguish between the two.
Unfortunately, this problem is a strong argument against using this tool in the extraction of
references. This is because the reference extraction tool must search for a title to the reference
section (a title generally being a short line of text on its own line, e.g. "The References Section"). If
the title is placed within the body of the references section itself, it will look to the parser as though
it is simply part of the text, and the parser therefore won't pick up on the fact that it is the title to the
references section.
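The effect on reference extraction can be illustrated with a small sketch. The pattern, the 40 character limit and the function name `find_references_heading` are assumptions made for illustration; they are not taken from the reference extraction tool itself.

```python
import re

# A heading is taken to be a short line that contains nothing except an
# optional section number and a title such as "References".
HEADING = re.compile(
    r"^\s*(\d+\.?\s*)?(the\s+)?(references|bibliography)(\s+section)?\s*$",
    re.IGNORECASE,
)


def find_references_heading(lines, max_length=40):
    """Return the index of the references title line, or None if not found."""
    for index, line in enumerate(lines):
        if len(line.strip()) <= max_length and HEADING.match(line):
            return index
    return None


separated = ["...end of the conclusion.", "", "References", "[1] A. Author. A title."]
merged = ["...end of the conclusion.", "References [1] A. Author. A title."]

print(find_references_heading(separated))  # the title is on its own line
print(find_references_heading(merged))     # the title was merged into the body
```

When the title keeps its own line it is found at index 2. When it has been merged into the first reference the function returns None, which is exactly the failure described above.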
The ‘ps2ascii.ps’ Tool: Testing Conclusion
The ps2ascii.ps program has some good points as well as bad points. On its good side, it provides
long lines of text, mostly only inserting new line characters when they are encountered within the
PostScript/PDF. This is a very desirable feature, as it eliminates all need to rebuild lines at a later
stage. The tool is also able to convert both PostScript and PDF documents. The readable quality of
the text produced by ps2ascii.ps is also very good. If the user simply wanted to create text versions
of PostScript documents, then this tool would be a good choice.
On the down side of the tool however, it is fairly slow, and it has problems differentiating between
titles and bodies of text. Unfortunately, this is a big enough disadvantage to make the tool
unsuitable for the process of reference extraction, as it is necessary to be able to identify the
reference section by title, and be sure that we are not simply identifying the word "reference"
within the text.
Another disadvantage of this tool is that it is unable to reliably convert PostScript files created from
Microsoft products such as Word.
Verdict: Good, but unsuitable for the reference extraction process.
The ‘pdftohtml’ Tool
Source: <http://www.ra.informatik.uni-stuttgart.de/~gosho/pdftohtml/>
The pdftohtml tool is based on Derek B. Noonburg's XPDF package. According to the authors, the
goal of the pdftohtml project is to create a freely available program that enables a PDF file to be
converted into an HTML file.
Fuller documentation details are available from the above URL, but in a nutshell, the authors claim
that the tool attempts to preserve all links within the PDF; extract the text from the PDF document
and place it within an HTML file; attempt to recognise bold and italic areas of text; and display the
content of the pages in HTML from first page to last.
The above-mentioned Web site for the pdftohtml tool is quite useful, as it provides much
information about the tool, including a list of known problems with the tool.
Unfortunately, the pdftohtml tool is not capable of transforming a PostScript file into HTML. It
only works with PDF files.
There are two main distributions of pdftohtml. These are pdftohtml version 0.22, which is the
current stable version of the tool, and pdftohtml version 0.31, which is the latest test release. I have
tried both, and found them to be fairly similar in their results. In fact, I found the results of
pdftohtml version 0.22 to be more suitable for my purposes, so I have decided to discuss this
particular version in this report. It is sufficient to say that there is not a great enough difference
between the two versions to justify a separate discussion of them.
Installing pdftohtml
The pdftohtml tool is downloaded as a zipped ".tar" file. After unzipping and extracting the tar
archive, the binaries of the pdftohtml tool can be created by moving to the directory created by
extracting the ".tar" file, and invoking the make command. I had no problems with this installation.
Using pdftohtml
The pdftohtml tool has the following usage information:
pdftohtml version 0.22
Usage: pdftohtml [options] <PDF-file> [<html-file>]
  -f <int>       : first page to convert
  -l <int>       : last page to convert
  -q             : don't print any messages or errors
  -h             : print usage information
  -help          : print usage information
  -p             : exchange .pdf links by .html
  -c             : generate complex document
  -i             : ignore images
  -noframes      : generate no frames
  -stdout        : use standard output
  -ext <string>  : set extension for images (in the Html-file) (default png)
It can be seen that there are many options for the tool, such as telling it to convert only a certain
range of pages. I only made use of certain options when testing the tool. I made use of the
"-stdout" option, which allowed me to have the output of the tool streamed directly to the STDOUT
stream. I also made use of the "-noframes" option, which allowed me to prevent the tool from
creating HTML in which frames would be used. This is because for my purposes, I simply wanted
all of the text to appear as one page of HTML, and frames would not allow this. I also used the "-i"
flag, which told the tool to ignore all images within the document. This was because for my
purposes, images were not desired. The "-q" flag was also used, which enabled the suppression of
error messages.
The pdftohtml tool consists of two main tools. The first is "pdftohtml", which is a shell script that
executes "pdftohtml.bin" and then converts all PBM/PPM images into PNG format using the
pnmtopng converter. The other tool is the "pdftohtml.bin" tool, which is the actual tool that
performs the conversion itself. Since I was not interested in having any images converted, I
decided to perform my tests by directly calling the pdftohtml.bin tool.
The tool was invoked as follows:
«
pdftohtml.bin -i -noframes -stdout [filename]
»
When using the tool, I found it to be fairly quick. One advantage of using PDF files is that they are
frequently smaller in size than their PostScript counterparts. The tool took approximately 40
seconds to convert and stream to STDOUT a file of approximately 182 KB in size. This is fairly
fast compared with some of the PostScript converters, which took well over 1 minute to convert the
same file from the PostScript format.
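Timings of this kind can be collected with a small wrapper around any converter that streams to STDOUT. The following is a sketch only: the helper name `time_conversion` is hypothetical, and the demonstration at the bottom times a trivial stand-in command so that the code runs even where pdftohtml is not installed.

```python
import subprocess
import sys
import time


def time_conversion(command):
    """Run a converter command; return (elapsed seconds, captured STDOUT)."""
    start = time.perf_counter()
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.stdout


if __name__ == "__main__":
    # Stand-in command; for a real measurement the list would be, for example,
    # ["pdftohtml.bin", "-i", "-noframes", "-stdout", "paper.pdf"].
    seconds, output = time_conversion([sys.executable, "-c", "print('ok')"])
    print(f"{seconds:.2f} seconds, {len(output)} characters of output")
```

The measured time includes process start-up, which matters little for conversions that take tens of seconds.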
On the whole, the quality of the HTML source produced by the pdftohtml tool was quite bad. It
seemed to have one word per line, and sometimes a tag such as a "<BR>" tag also. This is a shame:
it would have been nicer for parsing and for readability if the lines had been broken only where
there were HTML "<BR>" tags or "<P>" tags, or for that matter, where there were any other line
break tags. This problem is not too difficult to neaten up with a simple script, which removes all of
the unwanted new line characters. However, this is yet another step, causing more time overheads.
A sample of some HTML output from the program is shown below:
Fe
Con-<br>
vention
of
the
Open
Archives
Initiative.
D-Lib
Magazine:
25
The<br>
Magazine
of
Digital
Library
Research,
6(2),
February
2000.<br>
19<br>
Page-19
</body>
</html>
Figure 7. Sample HTML output from the pdftohtml tool.
Notice that there is more or less one word per line. This is the philosophy of the program: the user
can clean up the HTML source as they see fit.
Of course, if the HTML is simply to be displayed in a browser, then this source does not matter, as
the browser displays it nicely without all of these breaks in the lines. The image in ‘figure 8’,
below, shows the way in which the article whose source was shown above appears within a
browser.
Figure 8. Browser representation of the bad quality HTML source produced by pdftohtml: it looks
good.
As I have discovered with some of the other converters, when an image that contains text is
encountered, the image is discarded, but the text is shown in the output created by pdftohtml. This
is a shame, but I believe that it must be unavoidable, as many of the tools do this.
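One possible clean-up is sketched below. It joins the one-word lines together and starts a new line only where the HTML contains a "<BR>" or "<P>" tag. The function name `tidy_lines` and the choice to drop all remaining tags are my own assumptions for illustration; HTML entities are left undecoded.

```python
import re

# Tags that mark a real line break: <br>, <br/>, <p ...> and </p>.
BREAK_TAG = re.compile(r"<\s*/?\s*(?:br|p)\b[^>]*>", re.IGNORECASE)
# Any other tag, which is simply dropped from the text.
ANY_TAG = re.compile(r"<[^>]+>")


def tidy_lines(html):
    """Return text lines that are broken only where the HTML has break tags."""
    # Collapse every run of whitespace, including the unwanted new lines.
    joined = " ".join(html.split())
    lines = []
    for piece in BREAK_TAG.split(joined):
        text = ANY_TAG.sub("", piece).strip()
        if text:
            lines.append(text)
    return lines


sample = "The<br>\nMagazine\nof\nDigital\nLibrary\nResearch,\n6(2),\nFebruary\n2000.<br>\n</body>"
for line in tidy_lines(sample):
    print(line)
```

For the fragment above, the words between the two break tags come out as the single line "Magazine of Digital Library Research, 6(2), February 2000.".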