the regulations. If regulations are not annotated with references in a standardized format
in advance, natural language reference tracking capabilities would be needed. As the
number and type of regulations increase, extracting the references can be even more
complicated since different regulations will have slightly different referencing styles.
Adding reference data at the time an XML regulation is created reduces the complexity
for the development of other processing systems which need to use the references.
Regulation references range from the relatively straightforward to the complex. An
example of a straightforward casual English reference is the text “as stated
in 40 CFR section 262.14(a)(2).” An example of a more complex reference is the text
“the requirements in subparts G through I of this part” (where the current part is part
265). This latter example can be converted manually into the following list of complete
references: 40.cfr.265.G, 40.cfr.265.H, and 40.cfr.265.I. However, given the large
volume of federal and state environmental regulations, such manual translation of
references is too time consuming to be practical for existing regulations. The same
problem of dealing with a huge number of natural language references has been faced by
at least one other researcher, Justin Needle, when he was working with JUSTIS (available
on the web at http://www.justis.com), a legal research data provider. In an article on the
automatic linking of legal citations, Needle writes [66]:
“The conventional method of creating hypertext links between documents
involves manually editing each document and inserting fixed links at the database
production stage. Unfortunately, there is a major problem. The JUSTIS databases
contain millions of citations which, in order to achieve the required functionality,
need to be converted into millions of corresponding hypertext links. The manual
creation of links on this scale is not really an option since link creation is a
laborious process, requiring the services of skilled, and expensive, editors. Even if
an editor is able to identify and process ten links per hour, which is optimistic,
then the human effort required will be approximately a hundred thousand hours
per million links created. … Clearly, there ought to be a better way of tackling
the problem.”
For the research work presented in this thesis, a parsing system was developed using a
context-free grammar and a semantic representation/interpretation system that is capable
of tagging regulation provisions with the list of references that they contain. The parsing
system consists of two phases. First, a context-free grammar parser scans through the
regulation text, constructing parse trees as shown in Figure 3.10. The original reference
for the parse tree in Figure 3.10 was, “…
Subpart O of part 264 or 265”. Then a
secondary parser converts the parse tree into lists of fully specified references. These
references are inserted into the XML regulation as new child elements of the appropriate
regElement XML element.
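As a rough illustration of this final step, the sketch below attaches a list of fully specified
references to a regElement node as child elements using Python's xml.etree.ElementTree. The
element and attribute names follow Figure 3.11, but the function name, the library choice, and
the sample provision are assumptions for illustration rather than the implementation developed
for this research.

    # Illustrative sketch only: attach fully specified references to a regElement
    # node as <reference> child elements.  Element and attribute names follow
    # Figure 3.11; the function name, library choice, and sample provision are
    # assumptions, not the implementation developed for this research.
    import xml.etree.ElementTree as ET

    def add_references(reg_element, reference_ids):
        """Append one <reference id="..."/> child per fully specified reference."""
        for ref_id in reference_ids:
            ET.SubElement(reg_element, "reference", {"id": ref_id})

    # Hypothetical provision containing "the requirements in subparts G through I of this part":
    provision = ET.Element("regElement", {"id": "40.cfr.265.1.b", "name": "Example provision"})
    ET.SubElement(provision, "regText").text = "the requirements in subparts G through I of this part"
    add_references(provision, ["40.cfr.265.G", "40.cfr.265.H", "40.cfr.265.I"])
    print(ET.tostring(provision, encoding="unicode"))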
Note that the references are not tagged as hyperlinks, which would tie the reference to a
particular source for the referred document. Rather, the reference tags simply provide a
complete specification for what regulation provision is referenced. Where the regulation
is located is not specified so that an XML regulation viewing system may select any
document repository of regulations from which to retrieve the referenced provision.
Examples demonstrating the usage of the XML element for regulation references are
shown in Figure 3.11. The next section discusses the development of the reference parser
in detail.
3.4.3.1 Development of a Reference Parser
3.4.3.1.1 Simple Tabular Parsing
Parsing can be viewed as a search problem of matching a particular grammar and lexicon
to a set of input tokens by constructing all possible parse trees [46]. The grammar
defines a set of categories and how the categories can be manipulated. The lexicon
defines to what categories the input tokens belong. The search problem is associated
with manipulating the grammar to find all possible matches with the input tokens. A
simple top-down, left-to-right parser is described in this section.
Figure 3.10 Example parse tree for identifying regulation references
[Parse tree image: the root category REF spans the reference text “… Subpart O of part 264
or 265”; the internal nodes visible in the figure include the categories ASSUME_LEV0
(resolved to 40.cfr), SUBPART, UL, BACKREFKEY, PART, INT, and CONN.]

Figure 3.11 Example of a reference XML element

<regElement id="40.cfr.279.12.a" name="Surface impoundment prohibition">
  <regText>
    Used oil shall not be managed in surface impoundments or waste piles unless the units are subject to
    regulation under parts 264 or 265 of this chapter.
  </regText>
  <reference id="40.cfr.264" />
  <reference id="40.cfr.265" />
</regElement>
Suppose we start with a very simple model of English grammar. In this grammar we
might say that all sentences are composed of a noun plus a verb phrase. Verb phrases can
be a verb plus a noun, or simply a single verb. This grammar can be represented as
shown in Figure 3.12. We can then create a small lexicon containing the words “cars”,
“oil”, and “use”, in which we define what categories these words may match. This
simple lexicon is shown in Figure 3.13.
We can use the simple grammar and lexicon to parse the sentence “cars use oil”. We can
model the parsing process with a category stack, an input stack, and a set of operations
for manipulating these stacks. We start the parsing by adding the “S”, or a sentence start
symbol, to the category stack and the input tokens to the input stack. We can then use
two operations to parse the input. The expand operation is used to expand the top
category on the category stack using one of the grammar rules in Figure 3.12. The match
operation is used to match the top category in the category stack with the top token in the
input stack according to the lexicon rules in Figure 3.13. A parse is considered
successful when both the category stack and the input stack are empty. Table 3.1 shows
the successful parsing of the sentence “cars use oil”, and Figure 3.14 shows the
corresponding parse tree.
In this simple tabular parsing strategy, it is necessary to try all possible expansions of the
grammar categories. For example, the “VP” category in Table 3.1 could also have been
expanded to be a “V”. Since this expansion would not have resulted in a successful parse,
the expansion was not used in the example in Table 3.1. When a program is searching
for a parse for an input stack, it will not know in advance which expansions will result in
a successful parse. Therefore it must perform all possible expansions. The general
procedure used, however, is the same as that illustrated in Table 3.1. It is possible to
have multiple parses for a single set of input tokens. It should also be noted that the
lexicon may map the same input token to multiple categories. The top-down parser design
described here cannot handle left-recursive grammar rules: if a grammar rule is left
recursive, the parsing algorithm will not terminate [46]. This is because a rule like
“VP → VP N” can be expanded an infinite number of times. In our work, left recursive
grammar rules are not permitted.
Table 3.1 Simple parsing example

Category Stack     Input Stack      Operation
S                  cars use oil     Start
N VP               cars use oil     Expand
VP                 use oil          Match
V N                use oil          Expand
N                  oil              Match
(empty)            (empty)          Finish
Figure 3.12 Simple grammar

    S  → N VP
    VP → V N
    VP → V

Figure 3.13 Simple lexicon

    N → cars
    N → oil
    V → use
    N → use

Figure 3.14 Simple parse tree

    [S [N cars] [VP [V use] [N oil]]]   (parse tree shown in bracketed form)
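To make the expand and match operations concrete, the following minimal Python sketch carries
out the top-down search described above for the grammar of Figure 3.12 and the lexicon of
Figure 3.13, using recursion in place of the explicit category and input stacks. It is an
illustration of the technique, not the parser developed for this research, and it only reports
whether a successful parse exists.

    # Minimal illustration of the top-down expand/match search described above,
    # using the grammar of Figure 3.12 and the lexicon of Figure 3.13 (recursion
    # replaces the explicit category and input stacks; not the thesis parser).

    GRAMMAR = {
        "S":  [["N", "VP"]],
        "VP": [["V", "N"], ["V"]],
    }
    LEXICON = {
        "cars": ["N"],
        "oil":  ["N"],
        "use":  ["V", "N"],   # "use" maps to two categories
    }

    def parse(categories, tokens):
        """Return True if both the category stack and the input stack can be emptied."""
        if not categories:
            return not tokens                  # success only when both are empty
        top, rest = categories[0], categories[1:]
        # Expand: try every grammar rule for the top category.
        for expansion in GRAMMAR.get(top, []):
            if parse(expansion + rest, tokens):
                return True
        # Match: consume the next input token if the lexicon allows it.
        if tokens and top in LEXICON.get(tokens[0], []):
            return parse(rest, tokens[1:])
        return False

    print(parse(["S"], ["cars", "use", "oil"]))   # True, as in Table 3.1

Because every applicable expansion is tried before a match is attempted, a left-recursive rule
such as “VP → VP N” would cause this search to recurse forever, which is why such rules are
excluded.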
3.4.3.1.2 Constructing a Reference Parse Tree
As mentioned above, the reference parsing system is based on a simple tabular parser.
This simple tabular parser must be adapted to the reference identification problem. The
simple tabular parser knows the start and end of the sentence in advance, and
incorporates this information into the algorithm. The reference identification problem is
different from general sentence parsing in that the start and end of the reference are not
known in advance. The termination conditions for the reference parser are changed so
that the parse is considered complete if the category stack is empty. In other words, the
input stack does not need to be empty for the parse to be complete. The parser was also
modified to recognize a number of special categories in addition to the lexicon. These
special categories are shown in Table 3.2. In addition, grammar specifications can use
categories of the form “txt(abc)” to match the literal input “abc”, which makes some
patterns more transparent by bypassing the lexicon.
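As a rough illustration, these special categories could be recognized with simple pattern tests
like the sketch below; the regular expressions and the function name are assumptions for
illustration and are not taken from the actual implementation.

    # Rough sketch of matching the special categories of Table 3.2 and the txt(...)
    # form against a single token; the regular expressions and function name are
    # assumed for illustration, not taken from the thesis implementation.
    import re

    ROMAN = r"[ivxlcdm]+|[IVXLCDM]+"
    SPECIAL_CATEGORIES = {
        "INT":      r"\d+",
        "DEC":      r"\d+\.\d+",
        "NUM":      r"\d+(\.\d+)?",
        "UL":       r"[A-Z]+",
        "LL":       r"[a-z]+",
        "ROM":      ROMAN,
        "BRAC_INT": r"\(\d+\)",
        "BRAC_UL":  r"\([A-Z]+\)",
        "BRAC_LL":  r"\([a-z]+\)",
        "BRAC_ROM": r"\((?:%s)\)" % ROMAN,
    }

    def matches_category(category, token):
        """True if the token satisfies a special category or a txt(...) literal."""
        if category.startswith("txt(") and category.endswith(")"):
            return token == category[4:-1]     # txt(abc) matches the literal "abc"
        pattern = SPECIAL_CATEGORIES.get(category)
        return bool(pattern and re.fullmatch(pattern, token))

    print(matches_category("BRAC_LL", "(d)"))      # True
    print(matches_category("txt(part)", "part"))   # True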
A third type of grammar category was of the form “ASSUME_LEV1”, where LEV1
represents the level within the reference hierarchy. When the parser produces a category
that starts with “ASSUME_” during a parse, it matches the assumed reference level
(whatever follows the underscore after ASSUME, in this case “LEV1”) with the
identifier for that level within the XML tree (this is similar to the standard treatment of
“empty” rules, except that an assumed value is matched). This is useful when a natural
language reference does not fully specify the reference. For example, in Figure 3.10 the
category “ASSUME_LEV0” was automatically assumed to be “40.cfr”, because the reference
“… Subpart O of part 264 or 265” did not explicitly state that parts 264 and 265 are in 40
CFR.
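As a rough illustration of this assumption mechanism, and assuming provision identifiers of the
dotted form shown in Figure 3.11 (e.g. “40.cfr.279.12.a”) with two leading components for level
0 and one additional component per deeper level, an ASSUME_ category could be resolved as
sketched below. The function name and this level-to-component mapping are assumptions for
illustration only.

    # Rough sketch of resolving an ASSUME_ category from the identifier of the
    # provision being parsed.  The mapping assumed here (two leading components
    # for level 0, one additional component per level) is illustrative only.

    def resolve_assumed(category, current_id):
        """Resolve e.g. "ASSUME_LEV0" against an identifier like "40.cfr.279.12.a"."""
        level = int(category[len("ASSUME_LEV"):])
        parts = current_id.split(".")
        return ".".join(parts[:2 + level])    # LEV0 -> "40.cfr", LEV1 -> "40.cfr.279", ...

    print(resolve_assumed("ASSUME_LEV0", "40.cfr.279.12.a"))   # 40.cfr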
Table 3.2 Special reference parsing grammar categories

Category     Matches
INT          Integers
DEC          Decimal numbers
NUM          Integers or decimal numbers
UL           Uppercase letters
LL           Lowercase letters
ROM          Roman numerals
BRAC_INT     Integers enclosed in ()
BRAC_UL      Uppercase letters enclosed in ()
BRAC_LL      Lowercase letters enclosed in ()
BRAC_ROM     Roman numerals enclosed in ()
A WordQueue object is used to tokenize and buffer the input. Regulation provisions are
read individually and added into the WordQueue. The tokenized regulation provision is
then passed to the parser to look for a reference. If a reference is found, the tokens that
constitute the reference are removed from the queue. Otherwise, the first token in the
queue is removed and the remaining input is returned to the parser. Once the input queue
is empty, the next regulation provision is read.
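A simplified sketch of this read-and-parse loop is given below; the tokenizer and the parser are
passed in as callbacks, and their names and interfaces are illustrative rather than the actual
WordQueue design.

    # Simplified sketch of the provision-scanning loop described above; the
    # callback interfaces are assumed for illustration and do not correspond to
    # the actual WordQueue class.
    from collections import deque

    def scan_provision(text, tokenize, find_reference):
        """Yield every reference found in one regulation provision."""
        queue = deque(tokenize(text))             # tokenized provision, buffered
        while queue:
            result = find_reference(list(queue))  # try to parse a reference at the front
            if result is not None:
                reference, length = result        # a reference was found:
                for _ in range(length):           # remove the tokens that form it
                    queue.popleft()
                yield reference
            else:
                queue.popleft()                   # no reference here: drop one token, retry
        # once the queue is empty, the caller moves on to the next provision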
Many experiments have been conducted to develop the algorithm for tokenizing the
regulation provision text input. Splitting first on white space and then splitting off any
trailing punctuation will not work. Some of the text includes lines like, “oil leaks (as in
§§279.14(d)(1)). Materials…”. The tokenized version of this line (using a space
delimiter) should look like, “oil leaks ( as in § § 279.14 (d) (1) ) . Materials”. The
algorithm cannot split on all “.” characters because some may occur as part of a number.
It cannot split on all the parenthesis characters because some may be part of an identifier
“(d)” that should be preserved. The solution is to first split the input on white space, and
then perform a second pass on each individual token. This second pass involves splitting
the token into smaller tokens until no more splits are possible. The process splits off