Adaptation can be required in the following respects [Ciravegna, 2001]:
1. Adapting an IES to a new domain
It is clear that IESs can only become a common-use technology if they support the
extraction of information from texts of different domains. No one would spend money on a
system that works only for a single domain, and often only for a limited time, because the
interest in a domain can fade. Extracting information about a new domain requires
new rules, and so on. This problem affects systems with different approaches in different
ways. Adapting rule-based systems, for example, is often harder than adapting an active
learning system.
2. Adapting an IES to new languages
Most of the existing IE systems are designed for textual data in a single language, in
general English (in special cases in Chinese or Japanese). The task to make an IE system
able to handle textual data in other languages is in many cases a very difficult one. Some
of the Asian languages, like Chinese, are good examples to illustrate this difficulty. In
Chinese, words are not delimited by a white-space. That makes an additional step in the
process of IE necessary, namely the word segmentation step. Word segmentation is in
many cases not a trivial task, because it requires a comprehensive lexicon. Building such a
lexicon anew for each language is clearly neither easy nor always feasible. Additionally, the
grammar has to be adapted, too. Work on automating these steps is ongoing.
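The word segmentation step just mentioned can be sketched with a simple forward maximum-matching algorithm. The toy lexicon and input below are invented for illustration; real segmenters rely on far larger lexicons and statistical models:

```python
def segment(text, lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position take the
    longest lexicon entry that matches, falling back to a single
    character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            if text[i:i + length] in lexicon:
                match = text[i:i + length]
                break
        words.append(match)
        i += len(match)
    return words

print(segment("中国人民", {"中国", "人民", "中国人"}))  # → ['中国人', '民']
```

Even this tiny case shows why segmentation is non-trivial: the greedy match "中国人" blocks the arguably better split "中国" + "人民", so a lexicon alone is not enough.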
3. Adapting an IES to different text genres
It is a common practice that IESs are trained on a corpus of documents with a specific
genre. But a portable IES also has to be able to handle documents of different genres,
because specific text genres (e.g. medical abstracts, scientific papers, police reports) may
have their own lexis, grammar, discourse structure, etc.
4. Adapting an IES to different types of data
One can broadly say that an IE system has to extract relevant information from text. The
term “text”, in fact, is not restricted to one form. Text can come as emails, newswire
stories, military reports, scientific texts, and so on, which have very different formats. An
email, for example, does not have a pre-defined or predictable format; it is just free natural
language text. Newswire stories, in contrast to emails, have a specific format: they have a
title and mostly summarize the principal topic in the first paragraph.
Additionally, there is the fact that the widespread use of the internet makes texts in
different structural forms available, such as (semi-)structured HTML or XML files. To
adapt an IES, which was initially developed for a specific type of text, is a non-trivial task
even if the various types of text are about the same domain.
The problem of scalability of an IES has two relevant dimensions. First, an IE system must be
able to process large document collections. This dimension often causes no problems, because
IESs generally use simple shallow extraction rules rather than sophisticated, slow techniques.
Second, an IE system must be able to handle different data sources. For example, weather
information from different forecast services can have different formats; therefore, the
system must contain customized extraction rules for such different sources. An IE
system that is able to master both dimensions would use, with a high probability, the active
learning approach. [Kushmerick & Thomas, 2002]
1.4 Information Integration
In a time in which we are overwhelmed with information from various data sources (e.g.
databases, documents, e-mails, etc.) in very different formats, it is a major challenge to make
use of all this data in an efficient way. Thus, research in the field of Information Integration
(II) is becoming more important today. Information Integration “is the process of extracting and
merging data from multiple heterogeneous sources to be loaded into an integrated information
resource.” [Angeles & MacKinnon, 2004]
To point out how II today differs from the past, I quote the following [Halevy &
Li, 2003, p.3]:
First, we noted that the emergence of the WWW and related technologies completely changed the
landscape: the WWW provides access to many valuable structured data sources at a scale not seen
before, and the standards underlying web services greatly facilitate sharing of data among
corporations. Instead of becoming an option, data sharing has become a necessity. Second,
business practices are changing to rely on information integration – in order to stay competitive,
corporations must employ tools for business intelligence and those, in turn, must glean data from
multiple sources. Third, recent events have underscored the need for data sharing among
government agencies, and life sciences have reached the point where data sharing is crucial in
order to make sustained progress. Fourth, personal information management (PIM) is starting to
receive significant attention from both the research community and the commercial world. A
significant key to effective PIM is the ability to integrate data from multiple sources.
We can distinguish between several kinds of II [Brujn, 2003][Breu & Ding, 2004]:
1. Technical Information Integration: This kind of integration can be split into two
levels: the hardware (platform) level and the software (platform) level. The hardware
level encompasses differences in the computer hardware, the network architecture, the
used protocols, etc. The software level encompasses differences in the used operating
system, the database platform, etc.
2. Structural Information Integration: The structure of the data may be based on
different principles, as for example relational database tables, hierarchical trees, etc.
3. Syntactical Information Integration: This encompasses differences in the data
formats, as for example databases, plain text, etc. The different naming of the same
entity in different databases is also an example of this kind of integration
problem (personal_id in one database and p_id in another to name the same
entity, namely the identification number of a person).
4. Semantic Information Integration: This kind of integration is the most difficult one.
It encompasses different intended meanings of similar concepts in a schema. It
has become standard to give concepts self-describing names. But the meanings that
different users infer from the names alone are often not unique. It is
possible that two concepts with the same name actually refer to different
meanings (homonyms), or that the same concept is named differently in two
schemas (synonyms).
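The naming difference used above to illustrate the syntactical integration problem (personal_id versus p_id) can be sketched as a simple field mapping; the schemas and records here are invented:

```python
# Map source-specific field names onto one integrated schema so that
# records from both databases become comparable.
FIELD_MAP = {
    "db_a": {"personal_id": "person_id", "name": "name"},
    "db_b": {"p_id": "person_id", "full_name": "name"},
}

def normalize(record, source):
    """Rename the fields of a source record to the integrated names."""
    mapping = FIELD_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

a = normalize({"personal_id": 42, "name": "Doyle"}, "db_a")
b = normalize({"p_id": 42, "full_name": "Doyle"}, "db_b")
print(a == b)  # → True
```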
We can distinguish between two fundamental aspects of II: Data Integration, and Function
Integration. The definition I used above for II can also be used for Data Integration, because
Data Integration deals with the problem of making heterogeneous data sources accessible by
using a common interface and an integrated schema. A “common interface” should give the
user the impression that the collection of data comes from a single data source. Function
Integration tries to make local functions from disparate applications available in a uniform
manner. Such an integration solution has to give the user the impression that the collection
of functions is homogeneous.
[Leymann & Roller, 2002]
An important derivative of Function Integration is used in enterprises and is named Enterprise
Application Integration (EAI). In the centre of an EAI system is an integration broker, which
acts as a hub between connected applications and routes the messages between them. Because
of this hub-and-spoke design, the data integration capabilities of EAI systems are limited. EAI
solutions often provide access to only one source at a time. But because business
transactions are becoming more complex and often require information distributed across
multiple data sources, data integration platforms have to be developed to complement EAI
systems.
A system that provides its users a uniform interface to a multitude of heterogeneous,
independently developed data sources is called an Information Integration System (IIS). A
user of such a system does not have to locate the data sources, interact with each one in
isolation, and manually combine data from multiple sources. [Halevy & Li, 2004]
There are two paradigms for II: ad-hoc integration and global integration.
For better understanding, I will first give an example of an ad-hoc integration process
from everyday life. Assume that you have developed an optimization algorithm for a graph
drawing problem and have implemented it. In the next step, you will test your implementation
with simple test graphs to get an idea of whether your implementation works or not.
Therefore, you have created some test graphs in a format that your implementation can
handle. After the first testing, you will test your implementation thoroughly with real world
test graphs of higher complexity. But, the test graphs you can find are all in a different format
than your program is prepared for. At this point you have two options: you can either rewrite
your implementation to be also able to handle this input format, or you can write a script to
transform the unsuitable test graphs into a suitable format. Both options are equally hard, but
if you are only a user of an implementation and have no access to the source code, or you do
not have the knowledge to make the required changes, you have only the latter option. For each
graph format that the implementation cannot handle you must write a new script. This kind of
integration is known as ad-hoc information integration.
Ad-hoc information integration solutions are neither scalable nor portable, because they are
established for a single case with certain requirements and are not applicable in different
cases with different requirements. It is even harder to maintain such a solution, because if the
requirements change, the solution also has to be changed. This might not require much
effort for a personal user with simple demands, but the situation is quite different in a business
environment. First, the requirements for an II solution are complex and subject to continuous
change. Second, the number of different applications that must operate together is large.
With every new application that has to be integrated, new integration solutions must be
developed.
Global integration tries to overcome the disadvantages of ad-hoc information integration
and works quite differently. An IIS designed with this approach consists of a global schema,
source schemas, and a mapping between the global schema and the source schemas that acts
like a mediator.
We can distinguish between four kinds of mapping approaches [Lenzerini, 2002]:
1. Local-as-View (LAV) or source-centric: The sources are defined in terms of the
global schema.
2. Global-as-View (GAV) or global-schema-centric: The global schema is defined in
terms of the sources.
3. GLAV: A mixed approach.
4. Point-to-Point (P2P): The sources are mapped to one another without a global
schema.
To choose one of these approaches, we have to consider the advantages and disadvantages of
each one. In the following, I will give an overview of the pros and cons of the LAV
approach, the GAV approach, and the P2P approach.
In the LAV approach, the quality depends on how well the sources are characterized. This
approach promises high modularity and extensibility, because if one source is changed, only
the definition of the source has to be changed. [Lenzerini, 2002] This approach is best suited
when many, relatively unknown data sources exist and data sources may be added or
deleted. [Borgida, 2003]
In the GAV approach, the quality depends on how well the sources are compiled into the
global schema through the mapping. Because the global schema is defined in terms of the sources,
the global schema has to be reconsidered every time a source changes. [Lenzerini, 2002] This
approach is best suited when few and stable data sources exist. [Borgida, 2003]
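As a toy illustration of the GAV idea, assuming two invented sources, the global relation can be written directly as a view over the sources, so that a query against the global schema is answered by evaluating this view:

```python
# Invented source relations: A holds (name, city), B holds (name, year).
SOURCE_A = [("Doyle", "London")]
SOURCE_B = [("Doyle", 1859)]

def global_author():
    """GAV view: author(name, city, birth_year), defined as a join
    of the two sources on the name attribute."""
    return [(n, c, y)
            for (n, c) in SOURCE_A
            for (m, y) in SOURCE_B
            if n == m]

print(global_author())  # → [('Doyle', 'London', 1859)]
```

The LAV approach would instead describe each source as a view over the global schema, leaving the system to rewrite queries against those descriptions.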
P2P integration addresses only an isolated integration need and is therefore not suitable for
reuse. For each pair of sources, a different mapping must be developed, which requires many
hours of effort. Even then the mapping may fail to deliver the full range of the desired results.
Each time a source is changed or a new source has to be integrated new mappings must be
developed, which means additional effort. With every new source, the number of links that
must be added, tested, deployed, and maintained grows quadratically: integrating 2 data
sources requires 1 link, 3 sources require 3 links, 4 sources require 6 links, 5 sources
require 10 links, and so on. [Nimble]
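The link counts quoted above follow the handshake formula n * (n - 1) / 2, since every unordered pair of sources needs its own mapping:

```python
def p2p_links(n_sources):
    """Number of pairwise mappings among n sources: n * (n - 1) / 2."""
    return n_sources * (n_sources - 1) // 2

print([p2p_links(n) for n in range(2, 6)])  # → [1, 3, 6, 10]
```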
1.4.1 Application Areas of II
Among other information technology (IT) areas, the field of database systems was one of the
first to show interest in II research. The reason may be that database systems became
the most commonly used structured data storage systems. The growing number of such
systems brought with it the interest in integrating databases from different systems.
II has also become a pressing issue in business areas. I quote the following list to give an
overview of IT and business areas that often create requirements for II [Alexiev, 2004, p.5, 6]:
The following IT and business areas often create requirements for data integration:
• Legacy application conversion and migration: convert data to a new application before retiring
the old application.
• Enterprise Application Integration and Application-to-application (A2A) integration, where
applications existing within the enterprise should be made to inter-operate.
• Business/executive reporting, OLAP, multidimensional analysis: regular loading of
operational data to a data warehouse for easier analysis and summarization.
• Business-to-business (B2B) integration between business partners.
• Business Process Integration: coordination (“orchestration”) of separate business processes
within and across enterprises in order to obtain a more synergistic and optimized overall
process. Includes needs for Business Process Modelling, enacting, workflow, data modelling
According to [Jhingran et al., 2002], II has three dimensions that make the task of managing
the data more complex: heterogeneity, federation, and intelligence. I will briefly describe
these three dimensions.
1. The heterogeneity of data: Currently, IISs have to deal with structured (e.g.
databases, etc.), unstructured (e.g. text, audio, video, etc.), and semi-structured content
(e.g. XML documents, etc.).
2. The federation of data: Today, the data sources that need to be integrated are mostly
distributed over multiple machines in different organizations. The federation problem
encompasses the question of who owns and controls the data and the access to the
data. Privacy and security issues also become important, because every organization
has different security and privacy policies.
3. Intelligence: Another important issue is that of analyzing the data to turn it into
information, and more precisely into intelligence (e.g. detecting trends in a business,
etc.).
Because of these three dimensions, many challenges arise. Some of them are long-term
challenges and others currently occupy the attention of II researchers.
The long-term goal of II and the capabilities of systems that reach this goal are explained in
[Halevy & Li, 2004, p.3, 4]:
The long-term goal of information integration research is to build systems that are able to provide
seamless access to a multitude of independently developed, heterogeneous data sources. These
systems should have the following capabilities:
• integrate sources at scale (hundreds of thousands of sources),
• support automated discovery of new data sources,
• protect data privacy,
• incorporate structured, semi-structured, text, multimedia, and data streams, and possibly
• provide flexible querying and exploration of the sources and the data,
• adapt in the presence of unreliable sources, and
• support secure data access.
In [Halevy & Li, 2004] there is also a list of specific challenges. I will explain these
challenges here, to give an idea of what work has to be done in the future.
Reconciling heterogeneous schemas/ontologies: The fundamental problem of II is that the
sources are heterogeneous, which means that they have different schemas and follow
different structuring methodologies. To integrate such heterogeneous sources, a semantic
mapping is needed (often referred to as schema matching or ontology alignment). Today,
these mappings are generated by humans. Because this task is time-consuming and error-prone,
tools have to be developed to aid the human designer in this work.
Data-sharing with no central control: In many cases data cannot be shared freely between
connected parts. Central control is therefore not possible in many environments. For such
cases architectures are needed that enable large-scale sharing of data with no central control.
On-the-fly integration: Currently, IISs cannot be easily scaled up with new data sources.
Thus, a challenge is to reduce the time and skill needed to integrate new data sources. This
would make it possible to integrate any data source immediately after discovering it.
Source discovery and deep-web integration: Over the past few years, the information
(mostly stored in databases) behind websites, which is queried on the fly when a user
makes a request, has deepened the web dramatically. Integrating these information sources is
a major challenge with great potential. Discovering these sources automatically, integrating
them appropriately, supporting efficient processing of user queries, etc. are some of the
challenges that arise in this context.
Management of changes in data integration: IISs need to be able to handle updates of the
underlying data sources.
Combining structured and unstructured data: Currently, IISs are often not able to handle
structured data sources (e.g. databases, XML documents, etc.) and unstructured text (e.g. web
pages, etc.) in combination. This problem arises because the querying methods for the two
kinds of data sources are quite different: whereas structured data is queried with predefined
query languages (e.g. SQL, XQL, etc.), unstructured text is queried by keyword search. Thus,
languages that are appropriate for such combined queries and efficient methods for processing
them are needed.
Managing inconsistency and uncertainty: In an IIS, it often occurs that the data sources are
inconsistent or uncertain. Methods must be developed to locate inconsistencies and
uncertainties and to reconcile them.
The use of domain knowledge: The usability of IISs can be increased by using domain
knowledge. Such knowledge can be used to guide the user in his work.
Interface integration and lineage: Often the data sources that need to be integrated into
the system have their own user interfaces that give users easy access to the data. Hence,
integrating such sources also requires combining the different visualisations.
Security and privacy: Data sources often have different security and privacy policies. When
integrating multiple data sources, these differences in policies must be considered to ensure
security and privacy.
In order to better understand the usefulness of so-called wrappers, the notion of
“semi-structured documents” must be explained first.
As the term “semi-structured” implies, this kind of structure is a form between free text and
fully structured text. The best-known examples of semi-structured documents are HTML
documents. HTML is a mark-up language that contains text and predefined tags to bring more
structure into the document. The text in such documents is often grammatically incorrect, and
does not always contain full sentences. This makes it difficult to apply standard IE techniques
on such documents, because IESs use linguistic extraction patterns. Thus, to extract
information from such documents, specialised IE systems are needed. Programs that aim
to locate relevant information in semi-structured data and to put it into a self-describing
representation for further processing are referred to as wrappers. It may seem as if IESs and
wrappers do just the same thing, but their application areas are different.
The most widespread application area of wrappers is the World Wide Web with its vast
number of web sites, which are mostly semi-structured. The structural differences between
documents on the web, and the fact that even the sites of, for example, a single company are
changed periodically, make it obvious that building such programs by hand is not feasible.
This leads to two main problems in this field: wrapper generation and wrapper
maintenance. [Chidlovskii, 2001]
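To make the notion of a wrapper concrete, here is a minimal hand-written one, assuming an invented page layout in which every row holds a title and a price inside fixed tags. Real wrappers are generated from sample pages rather than written like this:

```python
import re

# Invented sample of a semi-structured page with a repetitive layout.
PAGE = """
<tr><td class="t">The Hound of the Baskervilles</td><td class="p">9.99</td></tr>
<tr><td class="t">A Study in Scarlet</td><td class="p">7.50</td></tr>
"""

ROW = re.compile(r'<td class="t">(.*?)</td><td class="p">(.*?)</td>')

def extract(page):
    """Return the located items in a self-describing representation."""
    return [{"title": t, "price": float(p)} for t, p in ROW.findall(page)]

print(extract(PAGE)[0])  # → {'title': 'The Hound of the Baskervilles', 'price': 9.99}
```

Note how fragile this is: renaming the class attributes or reordering the cells silently breaks the pattern, which is exactly the maintenance problem.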
1.4.2 Wrapper Generation
Manual wrapper generation means that the wrappers are written by hand using some sample
pages. This procedure is time-consuming, error-prone, and labour-intensive. It is also not
scalable: even a small change in the page structure demands a rewriting of the extraction
rules. Automatic and semi-automatic generation approaches are the result of the search for
ways to overcome these limitations.
Approaches to automatic or semi-automatic wrapper generation use machine
learning techniques based on inductive learning. These can be based on heuristics or on
domain knowledge. Although the heuristic approach is relatively simple, it can only extract a
limited number of features. The knowledge-based approach, on the other hand, tries to make
use of domain knowledge in order to build more powerful wrappers. [Yang et al, 2001]
1.4.3 Wrapper Maintenance
Wrapper maintenance is the challenge of keeping a wrapper valid, because a wrapper cannot
control the sources from which it receives data. A small change in the structure of a web site
can make the wrapper useless (non-valid). The fact that some web sites change their structure
periodically only makes the task harder. There are two key challenges in wrapper
maintenance: wrapper verification (i.e., determining whether the wrapper is still operating
correctly) and wrapper re-induction. The second challenge is more difficult, because it
requires changing the rules used by the wrapper. But even the wrapper verification task is not
trivial, because the sources may have changed the content, the formatting, or both, and the
verification algorithm must distinguish these two cases. [Kushmerick & Thomas, 2002]
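A common verification heuristic is to compare simple statistics of a fresh extraction run against a profile recorded while the wrapper was known to be valid. The features and the tolerance below are invented for illustration:

```python
def profile(records):
    """Average number of fields per record and the fraction of
    non-empty field values."""
    if not records:
        return (0.0, 0.0)
    total_fields = sum(len(r) for r in records)
    avg_fields = total_fields / len(records)
    filled = sum(1 for r in records for v in r.values() if v)
    return (avg_fields, filled / max(1, total_fields))

def looks_valid(new_records, reference, tolerance=0.2):
    """Flag the wrapper as suspect when the new profile deviates from
    the reference profile by more than the relative tolerance."""
    return all(abs(new - ref) <= tolerance * max(ref, 1e-9)
               for new, ref in zip(profile(new_records), reference))

reference = profile([{"title": "A", "price": "9.99"}] * 10)
print(looks_valid([{"title": "B", "price": "7.50"}] * 10, reference))  # → True
print(looks_valid([{}] * 10, reference))  # → False
```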
Information Processing with PDF Files
In this chapter, I will give a short description of the Portable Document Format (PDF), the
native file format of the Adobe™ Acrobat™ family. My aim is just to introduce this file
format and not to give a full description of it. This chapter is based largely on the “PDF
Reference Manual” written by Bienz and Cohn [Bienz & Cohn, 1996]. For more detailed
information I refer to this manual.
2.1 The Portable Document Format (PDF)
Bienz and Cohn defined PDF as follows: “PDF is a file format used to represent a document
in a manner independent of the application software, hardware, and operating system used to
create it.” [Bienz & Cohn, 1996, p.5]
Further, Merz describes the PDF file format as follows: “PDF is a file format for saving documents
which are graphically and typographically complex. PDF ensures that layout will be preserved
both on screen and in print. All layout-related properties are fixed component of a PDF file
and do not allow any variation in its interpretation – the appearance of a PDF document is
completely fixed.” [Merz, 1998, p.4]
The PDF file format is the successor of the PostScript page description language, which was
also initiated by Adobe™, and came out in the early 1990s. The PostScript page description
language is a programming language like BASIC or C++, but it is designed solely to describe
extremely accurately what a page has to look like. During the development of PDF, Adobe
tried to avoid the weaknesses of the PostScript format. For example, PDF uses the same
imaging model as PostScript but does without its programming constructs in order to keep
the format simple.
PDF is a layout-oriented representation format. Layout-oriented means that human
readability is in the foreground, rather than machine readability. As you will see later on,
layout-oriented representation formats are not well suited for further processing. It can be
hard to extract even the text from PDF files, because in the worst case the text can be saved as
a graphic, which would require the use of more complex algorithms, such as text recognition
algorithms. [Gartner, 2003] I will give a more detailed overview of the challenges of
extracting information from PDF files later in 2.4.
2.2 Main Architecture of a PDF File
PDF files consist of four sections: a one-line header, a body, a cross-reference table, and a
trailer:
1. Header: The header line is the first line of each PDF file. This line specifies the
version number of the PDF specification to which the file adheres.
2. Body: The body of a PDF file consists of a sequence of indirect objects representing
the document. Comments can appear anywhere in the body section.
3. Cross-reference table: This table contains information about where the objects in the
PDF file can be found. Every PDF file has only one cross-reference table, which
consists of one or more sections.
4. Trailer: The trailer enables an application reading a PDF file to find the cross-
reference table and thus the objects in the file.
No line in a PDF file may be longer than 255 characters. The last line of a PDF file
contains the string “%%EOF” to indicate the end of the file.
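The section markers described above can be inspected directly in the raw bytes of a file. The sketch below only checks for the markers and is in no way a full PDF parser; the skeletal byte string stands in for a real file:

```python
def pdf_markers(data):
    """Look for the structural markers of a PDF file: the version
    header, the %%EOF marker, and the 'startxref' keyword through
    which the trailer locates the cross-reference table."""
    header = data[:8].decode("latin-1", "replace")   # e.g. "%PDF-1.4"
    has_eof = data.rstrip().endswith(b"%%EOF")
    startxref_at = data.rfind(b"startxref")          # -1 if missing
    return header, has_eof, startxref_at

# skeletal stand-in for a real file, showing marker positions only
fake = (b"%PDF-1.4\n"
        b"1 0 obj\n<< >>\nendobj\n"
        b"xref\n0 1\n"
        b"trailer\n<< >>\n"
        b"startxref\n9\n"
        b"%%EOF\n")
print(pdf_markers(fake)[0])  # → %PDF-1.4
```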
There are three kinds of PDF files [Weisz]:
1. PDF Normal: If you produce your text in a word processing or publishing system
with PDF output capability, the PDF file you get is a PDF Normal file. That means
that your file contains all the text of the page. Such files are relatively small.
2. PDF Image Only: This type is easy to produce. It is only an image of the page and
contains no searchable text. Such files are fairly large, and there is no way to
search for text. The image quality depends on the quality of the source document and
on the scanning operation.
3. PDF Searchable Image: Such files contain the image of the page and the text
portions of the image. This enables a user to search for text. These files are usually
larger than PDF Normal files.
2.3 Strengths and Weaknesses of PDF
If we surf the web, we find PDF files everywhere: here the technical details of an amazing
five-megapixel digital camera, there a statistic about an enterprise's income over the last two
years, and there a brilliant crime novel by Sir Arthur Conan Doyle, all saved as PDF files. The
widespread use of this file format must have its reasons. Thus, I will try to briefly explain the
strengths of this file format.
I mentioned that the PDF file format was created for a special purpose: to make it possible to
build a document in one environment and view it in a, perhaps, completely different
environment without any difference. This specificity implies that PDF is not the file format
par excellence for all purposes. Thus, I will also try to explain the weaknesses of this file format.
The PDF format has a number of properties which make it a sought-after format for use in
almost all hardware environments and operating systems.
• Portability: This is the most exciting property of PDF. PDF files use only the
printable subset of the ASCII character set, which makes them extremely portable
across diverse hardware and operating system environments.
• Compression: PDF supports a number of compression methods to keep the file size
within limits, even if the document contains images or other storage-intensive data.
• Font independence: The fact that people like to use different fonts in their
documents, sometimes very extraordinary ones, is a challenge in document exchange.
It could happen that the receiver of a document does not have the fonts needed to
re-create the document in her environment. PDF solves this problem by saving a font
descriptor for each font used in the file. Such a descriptor includes the font name,
character metrics, and style information. This additional information hardly affects
the file size, because one font descriptor takes only 1-2K of storage and contains all
the information needed to simulate missing fonts on the receiver's side. This does not
apply to so-called symbolic fonts (i.e., fonts that do not use the standard ISOLatin1
character set).
These three properties are most important from the reader's point of view. There are several
other properties which are important for PDF developers (i.e., for people who want to build or
change PDF files):
• Random access: Every PDF file has one cross-reference table which contains
information about the locations of the objects in the file. This permits random access
to each object in the file, so the whole file need not be read to locate any particular
object. Because the cross-reference table has a fixed location, namely
the end of the file, a developer can easily obtain the location of a specific page or
another object and does not have to go through the whole document to find what she
wants.
• Incremental update: This property allows a user to easily change a file. Every time a
change is made, the cross-reference table is updated and the changed objects are
appended to the file. The original data remains unchanged. This allows a user to
simply undo changes by deleting the added objects and cross-reference table sections.
All of the mentioned properties form the strengths of the PDF format. On the other side there
are some weaknesses, too. Merz explains the main disadvantage of PDF as follows [Merz, 1998]:
PDF’s biggest advantage is also its biggest drawback: it is purely layout-oriented and is not
concerned with the text and overall structure of a document. The basic unit of a PDF document is
the page. It can contain text, graphic and hypertext elements. The text is not saved as structural
objects like paragraphs or headings, but, as in PostScript, in layout-related elements: as characters,
words, or lines. This makes it much more difficult to modify or edit the text later on, as the
structural information can not easily be recreated.
This disadvantage implies that it is hard to use a PDF file for further processing. It is
understandable that many people are interested in further processing of PDF files, because
today a lot of information is saved in PDF files and must be extracted in some way or another.
Several tools are available, both commercial and free, to extract information from PDF files
and save it in a more structured file format such as HTML or XML. I will present some of
these conversion tools in CHAPTER 4.