3.5 XML in Information Processing
As I mentioned earlier, XML is a structure-based language. This makes it attractive for further
processing, either manual or automatic.
Manual processing of XML documents is much easier than that of unstructured documents,
because a human reader can recognize the logical structure and make use of it.
The complementary specifications of XML are also useful for information processing (IP)
purposes. XSLT is a good example of this because it can be used for each of the following
types of tasks [Alexiev, 2004, p.60]:
Transform from XML to a simple textual format (extract data).
Transform from XML to a publishing format for printing. Here the target document has
little if any logical relation to the input document, so usually we go through the intermediate
step of generating Formatting Objects (XSL-FO). Then the FOs are transformed to
printer-ready output (PDF, PCL, etc.) by an FO processor (e.g. Apache FOP).
Transform from XML to an XML format for publication, such as XHTML (and its poor
cousin HTML), SVG, MathML, etc. Here the target document may contain some “garbage”
(presentation stuff), but a lot of it has logical relation to the source.
Transform from one XML schema to another XML schema. Here almost all of the generated
data is logically related to the source.
Most relevant to data integration are the first and last tasks in the list above:
Extraction can pick up data fields from a semi-structured document and use them in further
processing (e.g. to save them to a database).
Transformation of data-centric XML schemas is key to XML data processing.
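The two tasks singled out above can be sketched in a few lines. The sketch below uses Python's standard xml.etree.ElementTree module rather than XSLT itself, so it only illustrates the idea of extraction and schema-to-schema transformation, not an XSLT stylesheet. The element names (orders, order, invoices, invoice, payer, amount) and the function names are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the vocabulary is an assumption for this sketch.
SOURCE = """\
<orders>
  <order id="1"><customer>Ada</customer><total>19.90</total></order>
  <order id="2"><customer>Linus</customer><total>5.00</total></order>
</orders>"""


def extract_rows(xml_text):
    """Extraction: pull data fields out as plain tuples (e.g. for a database insert)."""
    root = ET.fromstring(xml_text)
    return [
        (order.get("id"), order.findtext("customer"), order.findtext("total"))
        for order in root.iter("order")
    ]


def transform(xml_text):
    """Transformation: rebuild the same data under a different (hypothetical) schema."""
    target = ET.Element("invoices")
    for order_id, customer, total in extract_rows(xml_text):
        invoice = ET.SubElement(target, "invoice", number=order_id)
        ET.SubElement(invoice, "payer").text = customer
        ET.SubElement(invoice, "amount").text = total
    return ET.tostring(target, encoding="unicode")


rows = extract_rows(SOURCE)    # list of (id, customer, total) tuples
converted = transform(SOURCE)  # XML string in the target vocabulary
```

In the transformed document every generated value comes from the source, which matches the observation that almost all of the generated data is logically related to the input.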
The interface between an application and an XML file is the parser. A parser reads an XML
file and gives the application access to the content and structure of this XML file. There are
two common types of XML parser application programming interfaces (APIs): Document
Object Model (DOM) and Simple API for XML (SAX). These parsers have different
properties and are suitable for different purposes.
The DOM parser, for example, represents an XML document as a tree in which each element
of the document is a node. DOM provides an API to access and modify parts of the document
and to navigate within it. DOM requires the document's entire structure to be held in
memory. Thus, it consumes a lot of memory and can be slow, especially for large documents.
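As a minimal sketch of the tree-based approach, the following uses xml.dom.minidom from the Python standard library, one of many DOM implementations. The sample document and the added lang attribute and note element are assumptions made for illustration.

```python
from xml.dom.minidom import parseString

# The whole document is parsed into an in-memory tree before any access.
doc = parseString(
    "<library><book id='b1'><title>XML Basics</title></book></library>"
)

# Navigate: walk from the root element down to the text of the first title.
book = doc.documentElement.getElementsByTagName("book")[0]
title = book.getElementsByTagName("title")[0]
title_text = title.firstChild.data

# Modify: add an attribute and a new child element to the existing tree.
book.setAttribute("lang", "en")
note = doc.createElement("note")
note.appendChild(doc.createTextNode("added via DOM"))
book.appendChild(note)

# Serialize the modified tree back to XML text.
result = doc.documentElement.toxml()
```

Because the complete tree is available, the program can revisit and change any node at any time, which is exactly what costs memory for large inputs.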
The SAX parser, on the other hand, does not build the whole structure of an XML document.
Instead, it scans the document and fires events, such as the start or end of an element. The
handlers, implemented by the application program, receive these events and process them.
SAX is fast and well suited for large documents, but it cannot go back to earlier parts of the
document without parsing it again.
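The event-driven style can be sketched with the xml.sax module from the Python standard library. The handler class TitleCollector and the sample document are hypothetical; only ContentHandler, its callback methods, and parseString belong to the library.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records events and collects <title> texts."""

    def __init__(self):
        super().__init__()
        self.events = []      # every start/end event, in document order
        self.titles = []      # text content of each <title> element
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered until the end tag.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append(("end", name))
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


handler = TitleCollector()
xml.sax.parseString(
    b"<library><book><title>XML Basics</title></book>"
    b"<book><title>SAX in Practice</title></book></library>",
    handler,
)
```

The handler keeps only the state it needs (a flag and a small buffer), so memory use does not grow with the size of the document. Anything not stored during the single pass is gone once the parser has moved on.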
Many free parsers are available for different programming languages, which makes the
work of programmers much easier [Jaideep & Ramanujan, 2000].
3.6 Limitations of XML
So far we have seen what XML actually is, what XML documents look like, what possibilities
it offers, and its application areas. Now it is time to look at the limitations of XML, because it
cannot be said that XML is the perfect solution for every purpose. It is limited in terms of the
data types it supports: XML is a text-based format and has no means for supporting complex
data types such as multimedia data. Further, XML is limited in terms of security: XML has no