a batch algorithm that iterates over the corpus multiple times. Gibbs samplers are guaranteed to
converge, which means that after a number of initial iterations (usually called “burn-in”), each
iteration produces a sample from the posterior distribution of the model in question (here, either
the unigram or bigram GGJ model). This convergence guarantee is what makes these samplers
popular for ideal learner problems, since it means that the true posterior of the model can be
examined without the effects of additional constraints imposed by the learning algorithm.
During each iteration of GGJ’s Gibbs sampler, every possible boundary location (position
between two phonemes) in the corpus is considered in turn. At each location b, the probability
that b is a boundary is computed, given the current boundary locations in the rest of the corpus
(details of this computation can be found in GGJ; critically, it is based on the equations defining
the Bayesian model and thus on the lexicon and frequencies implicit in the current
segmentation). Then the segmentation is updated by inserting or removing a boundary at b
according to this probability, and the learner moves on to the remaining boundary locations.
Pseudocode for this algorithm is shown in (6).
(6) Pseudocode for Gibbs sampler (Ideal Learner)
    Randomly initialize all word boundaries in corpus
    For i = 1 to number of iterations
        For each possible boundary location b in corpus
            (1) Compute p, the probability that b is a boundary (b = 1)
                given the current segmentation of the rest of the corpus
            (2) With probability p, set b to 1; else set b to 0
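To make the control flow in (6) concrete, the following sketch (in Python) shows one pass of such a boundary-wise sampler. It is an illustration rather than GGJ's implementation: the helper boundary_prob, which would return the model-based probability that a given location is a boundary given the rest of the current segmentation, is assumed here rather than taken from the original description.

    import random

    def gibbs_sweep(boundaries, boundary_prob):
        # boundaries    : dict mapping each possible boundary location to 0 or 1
        # boundary_prob : assumed helper returning P(b = 1 | rest of the current
        #                 segmentation), computed from the Bayesian model's equations
        for b in boundaries:
            p = boundary_prob(b, boundaries)                   # step (1) in (6)
            boundaries[b] = 1 if random.random() < p else 0    # step (2) in (6)
        return boundaries

    # The ideal learner repeats this sweep for many iterations (GGJ used 20000):
    # for i in range(20000):
    #     gibbs_sweep(boundaries, boundary_prob)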
GGJ found that in order to converge to a good approximation of the posterior, the Gibbs sampler
required 20000 iterations (i.e., each possible boundary in the corpus was sampled 20000 times),
with α = 20 for the unigram models, and β = 10, γ = 3000 for the bigram models.
Due to the convergence guarantees noted above, this algorithm is well-suited to the
computational-level analysis that GGJ were interested in, allowing them to ask what kinds of
segmentations would be learned by ideal learners with different assumptions about the nature of
language. GGJ discovered that an ideal learner that is biased to heed context (the bigram model)
achieves far more successful segmentation than one that is not (the unigram model). Moreover,
a unigram ideal learner will severely undersegment the corpus, identifying common collocations
as single words (e.g., you want segmented as youwant), most likely because the only way a
unigram learner can capture strong word-to-word dependencies is to assume those words are
actually a single word. This tells us what behavior to expect from learners that are able to
make optimal use of their input – that is, which biases are in principle useful for humans to
adopt, given the available data.
Turning to the algorithmic level of analysis, however, the GGJ learner is clearly less
satisfactory, since the Gibbs sampling algorithm requires the learner to store the entire corpus in
memory, and also to perform a significant amount of processing (recall that each boundary in the
corpus is sampled 20000 times). In the following section, we describe three algorithms that make
more cognitively plausible assumptions about memory and processing. These algorithms will
allow us to investigate how such memory and processing limitations might affect the learner’s
ability to achieve the optimal solution to the segmentation task (i.e., the solution found by the
ideal learners in GGJ).
2.3.2. Constrained learners
To simulate limited resources, all the learning algorithms we present operate in an online
fashion, so that processing occurs one utterance at a time rather than over the entire corpus
simultaneously. Under GGJ’s Bayesian model, the only information necessary to compute the
probability of any particular segmentation of an utterance is the number of times each word (or
bigram, in the case of the bigram model) has occurred in the model’s current estimation of the
segmentation. Thus, in each of our online learners, the lexicon counts are updated after
processing each utterance (and in the case of one learner, during the processing of each utterance
as well). The primary differences between our algorithms lie in the additional details of how
resource limitations are implemented, and whether the learner is assumed to sample
segmentations from the posterior distribution or choose the most probable segmentation. [end
note 2]
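As an illustration of why running word counts suffice, the sketch below maintains the counts for the unigram case and scores a candidate word with a Dirichlet-process predictive probability of the form (n_w + αP0(w)) / (n + α), in the spirit of GGJ's unigram model; the base distribution p0 over word forms is an assumed input rather than something specified here, and the bigram case would additionally track bigram counts.

    from collections import Counter

    class UnigramLexicon:
        # Running word counts: the only statistics the unigram model needs
        # in order to score a segmentation (a sketch; p0 is an assumed base
        # distribution over word forms, alpha the concentration parameter).
        def __init__(self, alpha, p0):
            self.alpha = alpha
            self.p0 = p0
            self.counts = Counter()   # n_w: occurrences of each word so far
            self.total = 0            # n: total word tokens segmented so far

        def word_prob(self, w):
            # Predictive probability of w given the current counts:
            # (n_w + alpha * P0(w)) / (n + alpha)
            return (self.counts[w] + self.alpha * self.p0(w)) / (self.total + self.alpha)

        def add_utterance(self, words):
            # Online update performed after (or during) processing an utterance
            for w in words:
                self.counts[w] += 1
                self.total += 1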
2.3.2.1 Dynamic Programming Maximization
We first tried to find the most direct translation of the ideal learner into an online
learner, such that the only limitation is that utterances must be processed one at a time. One
such idea is an algorithm we call Dynamic Programming
Maximization (DPM), which processes each utterance as a whole, using dynamic programming
(specifically the Viterbi algorithm) to efficiently compute the highest-probability segmentation
of that utterance given the current lexicon.[end note 3] It then adds the words from that
segmentation to the lexicon and moves to the next utterance. This algorithm is the only one of
our three that has been previously applied to word segmentation (Brent, 1999). Pseudocode for
this learner is shown in (7).
(7) Pseudocode for DPM Learner
    Initialize lexicon (initially empty)
    For u = 1 to number of utterances in corpus
        (1) Use Viterbi algorithm to compute the highest probability
            segmentation of utterance u, given the current lexicon
        (2) Add counts of segmented words to lexicon
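A minimal sketch of step (1) is given below, assuming a helper word_logprob that scores a candidate word under the current lexicon (e.g., the log of the word probability sketched earlier) and a practical cap on word length; neither assumption comes from the original description.

    import math

    def viterbi_segment(utterance, word_logprob, max_word_len=10):
        # utterance: a string (or list) of phonemes; returns the highest-probability
        # segmentation given the current lexicon, treating word scores as fixed
        # within the utterance (counts are added only after the utterance, as in DPM).
        n = len(utterance)
        best = [0.0] + [-math.inf] * n   # best[j]: log prob of best segmentation of utterance[:j]
        back = [0] * (n + 1)             # back[j]: start index of the last word in that segmentation
        for j in range(1, n + 1):
            for i in range(max(0, j - max_word_len), j):
                score = best[i] + word_logprob(utterance[i:j])
                if score > best[j]:
                    best[j], back[j] = score, i
        # Follow the backpointers to recover the word sequence
        words, j = [], n
        while j > 0:
            words.append(utterance[back[j]:j])
            j = back[j]
        return list(reversed(words))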
2.3.2.2 Dynamic Programming Sampling
We then created a variant that is similar to DPM, but instead of choosing the most
probable segmentation of each utterance conditioned on the current lexicon, it chooses a
segmentation based on how probable that segmentation is. This algorithm, called Dynamic
Programming Sampling (DPS), computes the probabilities of all possible segmentations using
the forward pass of the forward-backward algorithm, and then uses a backward pass to sample
from the distribution over segmentations. Pseudocode for this learner is shown in (8); the
backward sampling pass is an application of the general method described in Johnson, Griffiths,
and Goldwater (2007).
(8) Pseudocode for DPS learner
    Initialize lexicon (initially empty)
    For u = 1 to number of utterances in corpus
        (1) Use Forward algorithm to compute probabilities of all possible
            segmentations of utterance u, given the current lexicon
        (2) Sample segmentation, based on probability of the segmentation
        (3) Add counts of segmented words to lexicon
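A sketch of steps (1)-(2), under the same assumptions as the DPM sketch (a word_prob helper scoring candidate words under the current lexicon, and a cap on word length): the forward pass sums the probability of all segmentations ending at each position, and the backward pass then samples words from the end of the utterance back to the start.

    import random

    def sample_segmentation(utterance, word_prob, max_word_len=10):
        n = len(utterance)
        # Forward pass: fwd[j] is the total probability of all segmentations of utterance[:j]
        fwd = [1.0] + [0.0] * n
        for j in range(1, n + 1):
            for i in range(max(0, j - max_word_len), j):
                fwd[j] += fwd[i] * word_prob(utterance[i:j])
        # Backward pass: sample the last word, then the one before it, and so on
        words, j = [], n
        while j > 0:
            starts = list(range(max(0, j - max_word_len), j))
            weights = [fwd[i] * word_prob(utterance[i:j]) for i in starts]
            i = random.choices(starts, weights=weights)[0]
            words.append(utterance[i:j])
            j = i
        return list(reversed(words))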
2.3.2.3 Decayed Markov Chain Monte Carlo
We also examined a learning algorithm that recognizes that human memory decays over
time and so focuses processing resources more on recent data than on data heard further in the
past (a recency effect). We implemented this using a Decayed Markov Chain Monte Carlo
(DMCMC) algorithm (Marthi et al., 2002), which processes an utterance by probabilistically
sampling s word boundaries from all the utterances encountered so far. The sampling process is
similar to Gibbs sampling, except that the learner only has the information available from the
utterances encountered so far to inform its decision, rather than information derived from
processing the entire corpus.
The probability that a particular potential boundary b is sampled is given by the
exponentially decaying function b_a^(-d), where b_a is the number of potential boundary locations
between b and the end of the current utterance, and d is the decay rate. Thus, the further b is
from the end of the current utterance, the less likely it is to be sampled, with the exact
probability determined by the decay rate d. For example, suppose d is 1 and there are 5 potential boundaries
that have been encountered so far. The probabilities for sampling each boundary are shown in
Table 1.
[Insert Table 1 approximately here: Likelihood of sampling a given boundary in DMCMC]
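For concreteness, probabilities of this kind can be reproduced by normalizing the decay weights, on the assumption that the most recent potential boundary has b_a = 1 and that sampling probabilities are proportional to b_a^(-d):

    d = 1.0                      # decay rate in the example above
    num_boundaries = 5           # potential boundaries encountered so far

    # b_a = 1 for the most recent potential boundary, 5 for the oldest
    weights = [b_a ** -d for b_a in range(1, num_boundaries + 1)]
    probs = [w / sum(weights) for w in weights]
    print(probs)   # roughly [0.44, 0.22, 0.15, 0.11, 0.09]: recent boundaries dominate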
After each boundary sample is completed, the learner updates the lexicon. Pseudocode
for this learner is shown in (9).
(9) Pseudocode for DMCMC learner
    Initialize lexicon (initially empty)
    For u = 1 to number of utterances in corpus
        Randomly initialize word boundaries for utterance u
        For s = 1 to number of samples to be taken per utterance
            (1) Probabilistically sample one potential boundary from
                utterance u or earlier, based on decay rate d (has bias to
                sample more recent boundaries), and decide whether a word
                boundary should be placed there
            (2) Update lexicon if boundary changed (inserted or deleted)
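A sketch of the inner loop in (9), under the same assumptions as the earlier sketches: boundary_prob and update_lexicon are placeholder helpers for the model-based boundary probability and the corresponding count update, and are not part of the original description.

    import random

    def dmcmc_update(all_boundaries, d, s, boundary_prob, update_lexicon):
        # all_boundaries : 0/1 values for every potential boundary seen so far,
        #                  ordered from oldest to most recent (current utterance last)
        n = len(all_boundaries)
        # Decayed weights: site idx has b_a = n - idx potential boundaries between
        # it and the end of the current utterance, so its weight is b_a ** -d
        weights = [(n - idx) ** -d for idx in range(n)]
        for _ in range(s):
            site = random.choices(range(n), weights=weights)[0]   # step (1): pick a boundary
            p = boundary_prob(site, all_boundaries)
            new_value = 1 if random.random() < p else 0
            if new_value != all_boundaries[site]:                 # step (2): update if changed
                all_boundaries[site] = new_value
                update_lexicon(site, new_value)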
We note that one main difference between the DMCMC learner and the Ideal learner is
that the Ideal learner samples every boundary from the corpus on each iteration, rather than being
restricted to a certain number from the current utterance or earlier. The Ideal learner thus has
knowledge of future utterances when making its decisions about the current utterance and/or
previous utterances, while the DMCMC learner does not.[end note 4] In addition, restricting the
number of samples in the DMCMC learner means that it requires less processing time/resources
than the Ideal learner.
We examined a number of different decay rates, ranging from 2.0 down to 0.125. To
give a sense of what these really mean for the DMCMC learner, Table 2 shows the probability of
sampling a boundary within the current utterance assuming the learner could sample a boundary
from any utterances that occurred within the last 30 minutes of verbal interaction (i.e., this
includes child-directed speech as well as any silences or pauses in the input stream).
Calculations are based on samples from the alice2.cha file from the Bernstein corpus, where an
utterance occurs on average every 3.5 seconds. As we can see, lower decay rates cause the
learner to look further back in time, and thus require the learner to have a stronger memory in
order to complete the boundary decision process successfully.
[Insert Table 2 approximately here: Probability of sampling a boundary from the current
utterance, based on decay rate]
The DMCMC learner has some similarity to previous work on probabilistic human
memory, such as Anderson & Schooler (2000). Specifically, Anderson & Schooler argue for a
rational model of human memory that calculates a “need” probability for accessing words, which
is approximately how likely humans are to need to retrieve that word. The higher a word’s need
probability, the more likely a human is to remember it. The need probability is estimated based
on statistics of the linguistic environment. Anderson & Schooler demonstrate that the need
probability estimated from a number of sources, including child-directed speech, appears to
follow a power law distribution with respect to how much time has elapsed since the word was
last mentioned. Our DMCMC learner, when doing its constrained inference, effectively
calculates a need probability for potential word boundaries – this is the sampling probability
calculated for a given boundary, which is derived from an exponential decay function. Potential
word boundaries further in the past are less likely to be needed for inference, and so are less
likely to be retrieved by our DMCMC learner.
3. Bayesian Model Results
3.1 The data set
We tested the GGJ Ideal learner and our three constrained learners on data from the
Bernstein corpus (Bernstein-Ratner, 1984) from the CHILDES database (MacWhinney 2000).
We used the phonemic transcription of this corpus that has become standard for testing word
segmentation models (Brent, 1999).[end note 5] The phonemically transcribed corpus contains
9790 child-directed speech utterances (33399 tokens, 1321 types, average utterance length = 3.4
words, average word length = 2.9 phonemes). See Table 3 for sample transcriptions and
Appendix Figure 1 for the phonemic alphabet used. Unlike previous work, we used cross-
validation to evaluate our models, splitting the corpus into five randomly generated training sets
(~8800 utterances each) and separate test sets (~900 utterances each); each training set and its
corresponding test set were non-overlapping subsets of the data set used by GGJ. We used separate
training and test sets to examine the modeled learner's ability to generalize to new data it had not
seen before (and, in the case of the Ideal learner, had not been iterating over). Specifically, we wanted to test whether the
lexicon the learner inferred was useful beyond the immediate dataset it trained on. Temporal
order of utterances was preserved in the training and test sets, such that utterances in earlier parts
of each set appeared before utterances in later parts of each set.[end note 6]
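A sketch of one such split is given below, with the test-set size and random seed as our own illustrative parameters; iterating over the original utterance indices preserves the temporal order within each set.

    import random

    def make_split(utterances, test_size=900, seed=0):
        # Randomly choose ~900 test utterances; the remaining utterances form the
        # training set. Iterating over the original indices keeps temporal order.
        rng = random.Random(seed)
        test_idx = set(rng.sample(range(len(utterances)), test_size))
        train = [u for i, u in enumerate(utterances) if i not in test_idx]
        test = [u for i, u in enumerate(utterances) if i in test_idx]
        return train, test

    # e.g., five splits of this kind:
    # splits = [make_split(corpus, seed=k) for k in range(5)]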
[insert Table 3 approximately here: Samples of Bernstein corpus.]
3.2 Performance measures
We assessed the performance of these different learners, based on precision and recall
over word tokens, word boundaries, and lexicon items, where precision is # correct/# found and recall is # correct/# true.
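These measures amount to the following computation over sets of items (word tokens identified with their positions, boundary positions, or lexicon entries); the function is a generic sketch, not the authors' evaluation code.

    def precision_recall(found, true):
        # precision = #correct / #found, recall = #correct / #true,
        # where 'correct' means an item appears in both sets
        correct = len(found & true)
        precision = correct / len(found) if found else 0.0
        recall = correct / len(true) if true else 0.0
        return precision, recall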