CHAPTER 2. DOCUMENT REPOSITORY
documents, where detailed guidelines and strict attention to details are necessary.
Second, it is very difficult to ensure consistent categorization in the manual
categorization efforts. Even when a single individual is responsible for categorizing all
documents, that person may put the same document in different categories at different
points in time. When the categorization is a team project, the problem is multiplied since
different people may put the same document in different categories even though they use
the same set of elaborate guidelines. Despite these drawbacks, it is possible to build
high-quality categorization hierarchies manually. Some great successes in this area
include the MeSH,^21 the Yahoo directory,^22 and the open directory project.^23
In the automatic categorization approach, a categorization hierarchy is automatically
derived from the document set, and documents are automatically added to this
categorization hierarchy. There are many methods for accomplishing this goal, and it
continues to be an active area of research [27, 50]. In some cases, the categorization
hierarchy is automatically extracted from an uncategorized set of sample documents.
This methodology has a drawback illustrated by the Bailey quotation in Section 2.3.1;
that is, there are many possible dimensions along which to organize a categorization
hierarchy, and not all of them will be useful. Other methods for automatically creating a
categorization hierarchy involve the use of training sets that have already been properly
categorized. The system can then use the training set to automatically generate rules for
populating a categorization hierarchy with documents. The use of training sets is
generally quite effective when the training sets are reasonably large, but any manual
categorization errors in developing training sets may be magnified in the results of the
automatic classification system. There are two main drawbacks to a fully
automated classification system. First, the logical relationships within the categorization
hierarchy may not be explicit. Since the primary reason for using categorization
structures is that they tend to be intuitively clear and easy for people to work with,
constructing categorization hierarchies that are not intuitive to work with reduces their
value. Second, since the logical transparency of the structures may be low, it may be
difficult to audit the classification structures for quality. Thus, the quality of the
categorization structure could be reduced. Despite these drawbacks, fully automated
categorization systems may be quite useful when it is necessary to quickly and
inexpensively categorize large document sets, particularly when quality is not a primary
concern.

^21 MeSH, the Medical Subject Headings controlled vocabulary, is used for indexing articles, for cataloging
books, and for searching MeSH-indexed databases. The MeSH vocabulary facilitates the retrieval of
information that may span different terminologies. MeSH is managed by the National Library of
Medicine, and is available on the Internet at the web address http://www.nlm.nih.gov/mesh.
^22 The Yahoo directory is a directory of websites developed by Yahoo! Inc. The web address for this
directory is http://dir.yahoo.com/. A staff of editors at Yahoo categorizes web pages into the manually
developed classification hierarchy, which was one of the first to popularize this approach to organizing
the World Wide Web.
^23 The open directory project is a directory of websites maintained by a community of volunteer editors.
The web address for this directory is http://dmoz.org/. Editors volunteer to maintain a small portion of the
complete classification hierarchy. The open directory project forms the core web directory for a number
of search engines, such as Netscape Search, AOL Search, Google, and Lycos.

Partially automated classification systems seek to blend the advantages of both manual
and automatic categorization. There are many possible combinations of manual and
automatic categorization, so the discussion here will focus on the most common method.
In this approach, the categorization structure is designed manually, perhaps with the
assistance of software tools, and the categorization hierarchy is automatically populated
with documents. Constructing the categorization hierarchy manually allows the use of
human judgment to develop useful logical relationships within the hierarchy.
Automatically populating the hierarchy with documents according to pre-specified
categorization rules reduces the two main drawbacks to manual categorization. First,
populating the categorization hierarchy is fast and efficient, much less time-consuming
and less expensive than doing it manually. Second, automatically populating the
categorization hierarchy ensures that the application of categorization rules is consistent.
This form of partially automated classification must deal with the problems of developing
an effective categorization hierarchy and specifying good classification rules. Extensive
experimentation and iterations are necessary for building a good classification hierarchy
using a partially automated classification approach.
There are many factors to consider when deciding which of the three categorization
approaches to use for organizing documents: manual, automatic, or partially automated.
The most salient factor, however, is the trade-off between the error rates of fully
automated approaches and the time and cost of more manual approaches. In building the
regulatory document repository, we use a partially automated approach to categorizing
documents. Given the large volume of documents related to environmental regulations and
the limited resources available from government or industry to organize them, a manual
categorization approach would be impractical. This is particularly true when one
considers that industry and government groups would like to see a variety of different
perspectives, which would split these limited resources across a multitude of
categorization efforts. A fully automated approach to organizing environmental
regulatory information would not be a good fit either, since having clear logical structures
and low error rates is very important. Because locating relevant environmental regulatory
documents is critical, logically incoherent categorization structures or a high rate of
incorrectly categorized documents would not be acceptable. The problem of
incorrectly categorized documents is particularly acute for environmental regulatory
information, since the proper category for a document sometimes depends upon subtle
conceptual distinctions.
2.3.3.2 Approaches to Developing a Classification Hierarchy
An essential step in partially automated classification is developing the
classification hierarchies. This section addresses several approaches for developing these
hierarchies. Categorization hierarchies can be developed from a top-down perspective,
a bottom-up perspective, or a hybrid of these two methods.
A top-down approach to developing a classification hierarchy consists of
conceptualizing a meaningful way to break down documents into a set of categories, and
expanding these categories into subcategories to whatever depth seems appropriate. The
entire process is done without examining representative documents from the set of
documents to be categorized. Rules can then be developed to filter documents into
appropriate categories within the classification hierarchy. While very clean logical
structures can result from this type of approach, there are several weaknesses to this
method. First, a set of documents may not map well to an abstractly created
classification hierarchy. Some categories may be empty, or nearly empty. Other
categories may be populated with so many documents that it may be difficult to identify
those of interest. Second, there may be many documents in the set of input documents
that do not fit into any of the categories that were developed with the top-down approach.
These documents will be incorrectly classified, or not classified at all, thus making them
inaccessible.
A bottom-up approach consists of browsing through a set of input
documents and developing a classification hierarchy based upon the terms and concepts
that seem to stand out in the document collection. This approach can be very effective
for a static document collection. However, if the document collection grows or changes
over time, it can be difficult to adapt the classification hierarchy to the new data. In
addition, a bottom-up classification hierarchy will not generalize well if applied to other
document sets. This is because the prominent terms, concepts, and depth of topic
coverage will be very specific to the particular document set for which the classification
hierarchy is developed.
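As a rough illustration of the bottom-up idea, the following Python sketch ranks terms by the
number of documents in which they appear, so that a designer can scan the most prominent
terms as candidate categories. It is only a minimal sketch under stated assumptions: single
words stand in for the terms and concepts discussed above, the stopword list and the sample
documents are invented for illustration, and the function name candidate_terms and its
min_doc_count parameter are hypothetical.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption, not taken from the source).
STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in", "is", "are", "on"}


def candidate_terms(documents, min_doc_count=2):
    """Rank terms by how many documents contain them (document frequency)."""
    doc_counts = Counter()
    for text in documents:
        # Count each term once per document, ignoring stopwords.
        words = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        doc_counts.update(words)
    # Sort by descending document count, then alphabetically for stable output.
    ranked = sorted(doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, n) for term, n in ranked if n >= min_doc_count]


# Hypothetical sample documents.
docs = [
    "Permit renewal for hazardous waste storage",
    "Hazardous waste transport permit application",
    "Penalty for late permit renewal",
]
print(candidate_terms(docs))
```

Because the ranking is computed from one particular collection, running it on a different
document set will generally surface different terms, which is the generalization problem
noted above.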
A combination of top-down and bottom-up approaches, called a hybrid approach,
balances the strengths and weaknesses of the two methods. When using a hybrid
approach to developing a classification structure, a top-down conceptualization of the
classification hierarchy is iteratively refined using the data from a bottom-up perspective.
For example, the top levels of a classification hierarchy might be developed using a
top-down approach. Basic classification rules for adding documents to the respective
categories could then be developed, and an automated system could populate the
structure with documents. The designer could then survey the results, investigating how
well the categories break down the documents into manageable units, and what types of
documents failed to match any categories within the classification structure. Using an
iterative approach, subcategories can then be added to the initial classification structure
until the designer is satisfied with the distribution and coverage of the documents
included in the classification hierarchy. This hybrid approach to designing a
classification structure should mitigate some of the problems associated with a top-down
method, while improving the generalization of a bottom-up approach.
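The survey step of this iterative process can be sketched in Python as follows. This is a
minimal sketch, not the procedure used by any particular software package: the function name
survey, the max_per_category threshold, and the sample data are all hypothetical. Given the
categories each document was placed in, it reports the share of documents that matched at
least one category, the categories that are empty or overly full, and the documents that
matched nothing.

```python
def survey(assignments, categories, max_per_category=3):
    """Summarize how well a populated hierarchy covers a document set.

    assignments maps a document id to the list of categories it was placed in;
    an empty list means the document matched no category.
    """
    counts = {c: 0 for c in categories}
    unmatched = []
    for doc_id, cats in assignments.items():
        if not cats:
            unmatched.append(doc_id)
        for c in cats:
            counts[c] += 1
    total = len(assignments)
    coverage = 100.0 * (total - len(unmatched)) / total if total else 0.0
    return {
        "coverage_percent": coverage,
        "empty": sorted(c for c, n in counts.items() if n == 0),
        "overfull": sorted(c for c, n in counts.items() if n > max_per_category),
        "unmatched": sorted(unmatched),
    }


# Hypothetical categories and assignments for illustration.
categories = ["Permits", "Penalties and Sanctions", "Reporting"]
assignments = {
    "doc1": ["Permits"],
    "doc2": ["Permits"],
    "doc3": ["Permits", "Penalties and Sanctions"],
    "doc4": [],
    "doc5": ["Permits"],
}
report = survey(assignments, categories)
print(report)
```

In this sketch, an overfull category suggests where subcategories might be added, an empty
category suggests a branch that does not map to the data, and unmatched documents suggest
missing categories or rules.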
2.4 Document Repository Features
As mentioned in the previous section, a partially automated approach to developing
classification hierarchies is used for the development of the document repository for
environmental regulations. In this section we will discuss the process used for designing
and refining categorization hierarchies. A software package from Semio Corporation was
used to build the classification hierarchies. There are a number of
software programs available from companies, research entities, or the open source
software community that provide categorization tools. The use of a commercial software
package from Semio Corporation provides many useful features, such as a graphical user
interface, noun phrase extraction services, and other tools that greatly facilitate this
research work. Nevertheless, the issues discussed in this section are applicable to
designing and building classification hierarchies to organize sets of documents in general.
We will illustrate the process for building a categorization hierarchy using a hybrid
top-down, bottom-up strategy with the Semio software package. Once one is familiar with
the set of documents to be organized, the first step is to develop an initial high-level
categorization hierarchy. With the software tools used in this research project, this
entailed developing a small text file with a few high-level categories, and “latching” noun
phrases that help assign documents to those categories.
When a document is being processed, the software automatically extracts noun phrases
from the document that are characteristic of the topics the document is related to. For
convenience these noun phrases are termed “concepts”. Concepts are useful when
developing categorization hierarchies because they can be used to assign documents that
contain specific concepts to particular locations within the categorization hierarchy.
In Semio, a text file containing an initial categorization hierarchy has the form shown in
Figure 2.5. Category names are denoted by a word or a phrase preceded by an
exclamation point. An indented list of words and phrases preceded by plus or minus
characters indicates the latching concepts for that particular category. Concepts preceded
by a plus character indicate that documents containing those concepts should be placed
under the related category. For example, documents containing the concept
“amendment” should be placed under the “On the Topic of Regulation” category in
Figure 2.5. Concepts preceded by a minus character indicate that documents containing
those concepts should not be placed under that particular category even if they contain
other latching concepts for the category. For example, “penalty” and “sanction” are
exclusionary concepts under the “On the Topic of Regulation” category, and they prevent
documents containing these concepts from latching into this category. An indented line
that starts with an exclamation point indicates another category within the
categorization hierarchy. The indentation (tab) depth of the category name indicates its depth
within the categorization hierarchy. For example, “Permits” and “Penalties and
Sanctions” are both subcategories of “On the Topic of Regulation” in Figure 2.5.
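The latching rules described above can be approximated with a short Python sketch. It is not
Semio's actual file format or matching algorithm: the SPEC text below is reconstructed only
from the prose description (an exclamation point for categories, plus and minus for
concepts, tabs for depth), the concepts listed under the two subcategories are invented for
illustration, and the names parse_spec and latch are hypothetical. The sketch also assumes
that each category is evaluated independently and that a document's concepts have already
been extracted.

```python
# Hypothetical specification text, modeled on the prose description only.
SPEC = (
    "!On the Topic of Regulation\n"
    "\t+amendment\n"
    "\t-penalty\n"
    "\t-sanction\n"
    "\t!Permits\n"
    "\t\t+permit\n"               # assumed concept, not from the source
    "\t!Penalties and Sanctions\n"
    "\t\t+penalty\n"              # assumed concept, not from the source
    "\t\t+sanction\n"             # assumed concept, not from the source
)


def parse_spec(text):
    """Parse the spec into {category path: {"include": set, "exclude": set}}."""
    rules = {}
    stack = []  # names of the categories enclosing the current line
    for raw in text.splitlines():
        if not raw.strip():
            continue
        depth = len(raw) - len(raw.lstrip("\t"))
        marker, name = raw.strip()[0], raw.strip()[1:].strip()
        if marker == "!":
            stack = stack[:depth] + [name]
            rules[tuple(stack)] = {"include": set(), "exclude": set()}
        elif marker in "+-":
            # Concept lines sit one tab deeper than the category they belong to.
            owner = tuple(stack[:depth])
            key = "include" if marker == "+" else "exclude"
            rules[owner][key].add(name.lower())
    return rules


def latch(concepts, rules):
    """Return the category paths a document latches into, given its concepts."""
    found = {c.lower() for c in concepts}
    return sorted(
        path for path, rule in rules.items()
        if found & rule["include"] and not found & rule["exclude"]
    )


rules = parse_spec(SPEC)
print(latch({"amendment"}, rules))
print(latch({"amendment", "penalty"}, rules))
```

In the second call, the exclusionary concept keeps the document out of the top-level
category even though it also contains a latching concept for that category.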
Once an initial specification file for the categorization hierarchy is created, the software
package can be used to populate the classification structure with documents. When
the classification structure is populated with documents, it is possible to get statistics
indicating how well the classification structure represents the content of the document
corpus. For example, the percent of documents in the document corpus that are matched