CHAPTER 3. XML REPRESENTATION OF REGULATIONS
18.104.22.168 Statistically-Based Reference Parser
In this research, an n-gram model is employed to make the parsing process more efficient
by skimming over text that is not predicted to contain a reference. An n-gram model is
a probabilistic model over sequences of n consecutive words. For example, one might use
unigrams, bigrams or trigrams in a model. A unigram is a single word, a bigram is a pair
of words, and a trigram is a sequence of three consecutive words. These n-grams can be
used to predict where a reference occurs in a regulation, based on how frequently each
n-gram precedes a reference string.
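The idea of collecting the n-grams that precede a reference can be sketched as follows. This is a minimal illustration, not the parser described in this chapter: the function name preceding_ngrams, the whitespace tokenization, and the example sentence are all assumptions made for the sketch.

```python
def preceding_ngrams(tokens, ref_index, max_n=3):
    """Return the n-grams (n = 1..max_n) that end immediately before ref_index.

    tokens    -- list of words (hypothetical whitespace tokenization)
    ref_index -- position in tokens where a reference string begins
    """
    grams = {}
    for n in range(1, max_n + 1):
        start = ref_index - n
        if start >= 0:  # skip n-grams that would run past the start of the text
            grams[n] = tuple(tokens[start:ref_index])
    return grams


# Illustrative sentence only; the reference is assumed to begin at token 3.
tokens = "as specified under 40 CFR 261.3".split()
result = preceding_ngrams(tokens, ref_index=3)
# result[1] is the unigram, result[2] the bigram, result[3] the trigram
```

Counting these tuples over every reference in a corpus would give the sets of unique unigrams, bigrams, and trigrams that precede references.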
To develop an n-gram model, a regulation corpus of about 650,000 words was assembled.
The parser found 8,503 references after training on this corpus. These 8,503 references
are preceded by 184 unique unigrams, 1,136 unique bigrams, and 2,276 unique trigrams.
For an n-gram to be a good predictor of a reference, it should occur frequently
enough to be useful, but it should not occur so frequently in the general
corpus that its reference prediction value is low.
For the unigrams, it is interesting to note that 18 of the most “certain” predictors are
identified as highly “certain” because they are only seen once in the entire corpus. Some
other unigrams that one might intuitively expect to be good predictors actually are weak
predictors for references. For example, “in” has a 5% prediction value. This is because
the 2,626 references that are preceded by “in” are so heavily outweighed by the 49,325
total occurrences of “in” in the corpus. These two factors make the unigram model a
weak one, since words with high certainty tend to be those that are rarely seen, and words
that precede many references tend to be common words that also appear often
throughout the corpus. One exception to this is the word “under”, which precedes 1,135
references and only appears 2,403 times in the corpus (a 47% prediction rate).
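The percentages quoted above are consistent with treating the prediction value of an n-gram as the number of times it precedes a reference divided by its total number of occurrences in the corpus. The sketch below assumes that definition; the function name prediction_value is hypothetical, and the counts are the ones reported in the text.

```python
def prediction_value(times_preceding_reference, total_occurrences):
    """Fraction of an n-gram's occurrences that immediately precede a reference.

    Assumed definition: the ratio of the two counts reported in the text.
    """
    if total_occurrences <= 0:
        raise ValueError("total_occurrences must be positive")
    return times_preceding_reference / total_occurrences


in_value = prediction_value(2626, 49325)    # "in": roughly 5%
under_value = prediction_value(1135, 2403)  # "under": roughly 47%
once_value = prediction_value(1, 1)         # a word seen once, before a reference

print(f"in: {in_value:.1%}, under: {under_value:.1%}, seen once: {once_value:.0%}")
```

The last case shows why a word seen only once can look perfectly “certain”: a single occurrence that happens to precede a reference yields a ratio of 1, even though it carries little evidence.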
The bigram model is a good predictor of references. While over 200 (18%) of the
bigrams only occur once in the corpus, the significance of bigrams that precede a
reference is not diminished by an even larger number of occurrences in the corpus (as is