Introduction
Words that appear frequently in the language are recognized more easily than words that appear
less frequently. This fact is perhaps the single most robust finding in the whole literature on visual
word recognition. The basic result holds across the entire range of laboratory tasks used to
investigate reading. For example, frequency effects are seen in lexical decision (Forster &
Chambers, 1973; Murray & Forster, 2004), in naming (Balota & Chumbley, 1985; Monsell, Doyle,
& Haggard, 1989), in semantic classification (Forster & Hector, 2002; Forster & Shen, 1996), in
perceptual identification (Howes & Solomon, 1951; King-Ellison & Jenkins, 1954), and in eye
fixation times (Inhoff & Rayner, 1986; Just & Carpenter, 1980; Rayner & Duffy, 1986; Rayner,
Sereno, & Raney, 1996; Schilling, Rayner, & Chumbley, 1998). Frequency effects are also seen
reliably in spoken word recognition (Connine, Mullennix, Shernoff, & Yelen, 1990; Dahan,
Magnuson, & Tanenhaus, 2001; Howes, 1954; Luce, 1986; Marslen-Wilson, 1987; Pollack,
Rubenstein, & Decker, 1960; Savin, 1963; Taft & Hambly, 1986) and therefore appear to be a
central feature of word recognition in general.
The fact that high frequency words are easier to recognize than low frequency words seems
intuitively obvious. However, possibly because the result seems so obvious, very little attention has
been given to explaining why it is that high frequency words should be easier to recognize than low
frequency words. All models of word recognition contain some mechanism to ensure that high
frequency words are identified more easily than low frequency words but, in many models, high
frequency words are easier by virtue of some arbitrary parameter setting. For example, in the E-Z
Reader model (Reichle, Pollatsek, Fisher, & Rayner, 1998; Reichle, Rayner, & Pollatsek, 1999,
2003), lexical access is specified to be a function of log frequency, where the slope of the
frequency function is a model parameter. In the logogen model (Morton, 1969) resting levels of
logogens for high frequency words are set to be higher than resting levels for low frequency words.
However, resting levels or thresholds could equally well be set to make low frequency words easier
than high. The only factor preventing this move is that it would conflict with the data. More
generally, even when models contain a mechanism that necessarily produces a frequency effect
(e.g. Forster’s, 1976, search model) one might still ask why there should be a frequency effect at
all. That is, wouldn't it be better if all words were equally easy to recognize? The present paper
attempts to answer this question by presenting a rational analysis (Anderson, 1990) of the task of
recognizing written words. This analysis assumes that people behave as optimal Bayesian
recognizers. This assumption leads to an explanation of why it is that high frequency words ought
to be easier to recognize than low frequency words. Furthermore, it explains why the function
relating frequency to reaction time (Whaley, 1978) and perceptual identification threshold (Howes
& Solomon, 1951; King-Ellison & Jenkins, 1954), is approximately logarithmic (although for
further qualification see Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004, and Murray &
Forster, 2004). It also explains why neighborhood density influences word recognition, and why
neighborhood density can have a different influence on tasks such as lexical decision, word
identification, and semantic categorization.
Explaining the word-frequency effect
Explanations of the word frequency effect fall into three main categories: First, frequency could
influence processing efficiency or sensitivity (e.g. Solomon & Postman, 1952). That is, perceptual
processing of high frequency words might be more effective than perceptual processing of low
frequency words. Second, frequency might alter the bias or threshold for recognition (e.g.
Broadbent, 1967; Grainger & Jacobs, 1996; McClelland & Rumelhart, 1981; Morton, 1969; Norris,
1986; Pollack et al., 1960; Savin, 1963). That is, frequency might not alter the effectiveness of
perceptual processing, but would simply make readers more prepared to recognize a high frequency
word on the basis of less evidence than would be required to identify a low frequency word.
Finally, lexical access could involve a frequency ordered search through some or all of the words in
the lexicon (Becker, 1976; Forster, 1976; Glanzer & Ehrenreich, 1979; Murray & Forster, 2004;
Paap, McDonald, Schvaneveldt, & Noel, 1987; Paap, Newsome, McDonald, & Schvaneveldt,
1982; Rubenstein, Garfield, & Millikan, 1970).
Frequency as sensitivity
The idea that frequency might have a direct effect on the efficiency or sensitivity of perceptual
processing was first proposed by Solomon and Postman (1952). However, there has been very little direct
evidence for this view, and there have been no explicit accounts of how changes in word frequency
might alter the nature of perceptual processing, beyond what might be involved in the initial
learning of a new word. Broadbent (1967) directly compared bias and sensitivity accounts of
frequency in spoken word perception and concluded that the evidence was entirely consistent with
a response bias account.
However, the debate over whether frequency effects are due to changes in sensitivity or bias has
been reengaged recently by Wagenmakers, Zeelenberg, and Raaijmakers (2000), Wagenmakers,
Zeelenberg, Schooler, and Raaijmakers (2000), and Ratcliff and McKoon (2000). The debate
centers on data from a two-alternative forced-choice tachistoscopic identification task. In this task,
participants see a briefly presented word followed by a display consisting of two words, one of
which is the briefly presented word. The participant's task is to decide which of these two words
was actually displayed. Wagenmakers, Zeelenberg, and Raaijmakers showed that participants
perform better when the alternatives are both high-frequency words than when they are both low in
frequency. This is exactly what one would expect if frequency increased perceptual sensitivity
rather than bias. A simple response bias account would predict that, with words of equal frequency,
the biases would cancel out, and performance on high and low frequency words would be identical.
However, Wagenmakers, Zeelenberg, Schooler, and Raaijmakers suggest that this result can be
explained by assuming that participants sometimes make a random guess when words have not
reached a sufficiently high level of activation. As participants will be more likely to guess high-
frequency words, high-frequency words will tend to be identified more accurately. This suggestion
is similar to Treisman’s (1978a, 1978b) Perceptual Identification model. Treisman assumed that
once perception had narrowed its search to some subvolume of perceptual space, words within that
subvolume were chosen at random, with a bias proportional to frequency, regardless of how near
they were to the centre of the subvolume. As Wagenmakers, Zeelenberg, Schooler, and
Raaijmakers note, this is not an optimal strategy, as some potentially useful information is
discarded.
However, it is also possible that this task may make it difficult for participants to behave optimally.
The best strategy would be to consider only the two alternative words presented, and to ignore the
rest of the lexicon. With words of equal frequency, participants should then choose the alternative
that best matches the target. However, because of the delay (300 ms) between presentation of the
target and the alternatives, participants may begin to identify the target word in the normal way,
such that all of the words in the lexicon are potential candidates. If this were to happen, low
frequency words would be less likely to be identified correctly, and participants might often
misidentify low frequency targets as a word other than one of the two alternatives. When the two
alternatives are presented, participants would then have to make a random choice between them.
There would therefore be more random guesses for low than high-frequency words. The important
point here is that participants can only behave optimally if they can completely disregard other
words in the lexicon.
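The logic of this alternative account can be made concrete with a small simulation. The sketch below is ours, not a model from any of the papers discussed, and the identification function and all parameter values are invented purely for illustration. It assumes that identification of the briefly presented target succeeds with a probability that grows with log frequency, and that when the target is misidentified the participant is forced to guess at chance between the two alternatives.

```python
import math
import random

# A minimal simulation of the account sketched above. This is our own
# illustration, not a model from the cited papers; the identification
# function and every parameter value are invented.

def p_identify(freq, base=0.35, gain=0.08):
    """Hypothetical probability of identifying the briefly presented
    target before the alternatives appear; grows with log frequency."""
    return min(0.95, base + gain * math.log(freq))

def trial(target_freq):
    if random.random() < p_identify(target_freq):
        return True   # target identified: the matching alternative is chosen
    # Misidentification: with a large lexicon the misidentified word will
    # almost never be one of the two alternatives, forcing a coin-flip guess.
    return random.random() < 0.5

def accuracy(freq, n=200_000):
    return sum(trial(freq) for _ in range(n)) / n

print(f"high-frequency pairs: {accuracy(1000):.3f}")  # few random guesses
print(f"low-frequency pairs:  {accuracy(10):.3f}")    # many random guesses
```

Even with no frequency bias in the guess itself, low-frequency targets produce more random guesses, and hence lower accuracy, than high-frequency targets.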
Response Bias Theories
The most familiar example of a response bias account of word frequency is Morton's (1969)
logogen model. In the logogen model, each word in the lexicon has a dedicated feature counter, or
logogen. As perceptual features arrive, they increment the counts in matching logogens. A word is
recognized when its feature count exceeds some threshold. Frequency effects are explained by
assuming that high-frequency words have higher resting levels (or equivalently, a lower threshold)
than low frequency words. High frequency words can therefore be identified on the basis of fewer
perceptual features than low-frequency words. In such a model, it might seem that word
identification would be most efficient when thresholds are set as low as possible, consistent with
each logogen responding to only the corresponding word, and not to any other word. However, as
Forster (1976) pointed out, if this were so, then increasing the resting level of high frequency words
beyond that point would often cause a logogen for a high frequency word to respond in error when
a similar word of lower frequency was presented. To avoid this, and allow headroom for frequency
to modulate recognition, thresholds must initially be set at a conservative level where many more
than the minimum number of features are required for recognition. So, if the baseline setting is that
words need N more features than are actually required for reliable recognition, the resting levels of
high-frequency words can be raised by up to N features before causing errors. However, all other
words will now be harder to recognize than they would have been using the original threshold. That
is, in order to be able to safely raise the resting levels of high-frequency words, all other words in
the lexicon have had to have their thresholds set higher than necessary. The overall effect of
making a logogen system sensitive to frequency is therefore to make word recognition harder. This
seems to be a quite fundamental problem with the account of frequency given by the logogen
model: Incorporating frequency into resting levels decreases the efficiency of word recognition.
However, as will be shown later, this is not a problem with criterion bias models in general, but
rather with the way the logogen model combines frequency and perceptual information in a
completely additive fashion.
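Forster's argument reduces to a simple piece of threshold arithmetic, illustrated in the toy sketch below (our own illustration with invented numbers; the logogen model itself is specified only verbally):

```python
# Toy illustration of the threshold arithmetic described above (our own
# sketch with invented numbers, not Morton's implementation).

REQUIRED = 10   # features that would suffice for reliable recognition
HEADROOM = 3    # safety margin added to every baseline threshold

def effective_threshold(frequency_boost):
    # The baseline must be conservative (REQUIRED + HEADROOM) so that
    # raising the resting level of a high-frequency word never lets its
    # logogen fire in error on a similar lower-frequency word.
    return REQUIRED + HEADROOM - frequency_boost

print(effective_threshold(HEADROOM))  # high-frequency word: back to 10
print(effective_threshold(0))         # every other word: 13, i.e. harder
```

The high-frequency words merely get back to the minimum; everything else has been pushed above it, so the net effect of the frequency mechanism is to slow recognition overall.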
Search Models
The other main class of theory is search models. The most fully developed search model is that of
Forster (1976). In Forster's search model, a partial analysis of a word directs the lexical processor
to perform a serial search through a frequency ordered subset of the lexicon (a ‘bin’). The
straightforward prediction of this model is that the time to identify a word should be a linear
function of the rank position of that word in a frequency ordered lexicon. Recently, Murray and
Forster (2004) have presented evidence that rank frequency actually gives a slightly better account of
the relation between RT and frequency in a lexical decision task than does log frequency. Although
the difference between rank frequency and log frequency correlations is small, it is quite possible
that the choice between alternative models will ultimately hinge on such subtle differences.
However, although the search model does give a principled explanation of the form of the relation
between frequency and RT, this is the only prediction that follows directly from the assumption of
a search process. For example, Murray and Forster (2004) showed that the function relating
frequency and errors was also closely approximated by a rank function. However, to explain this in
a search model, they had to suggest that, as each lexical entry was searched, there was some
probability that the search might get side-tracked and move to the wrong subset of the lexicon. A
search of the wrong bin should always lead to a 'no' response. Murray and Forster pointed out that
if the probability of a side-tracking error is sufficiently small, error rate will be approximately
proportional to the number of lexical entries that must be searched before encountering the correct
word, i.e. rank frequency.
Note that the search model also incorporates a degree of 'direct access'. The initial analysis of a
word directs the search to a bin containing only a subset of the words in the lexicon that share
some common orthographic characteristics. By reducing the bin size to 1, the search model would
become a direct access model. The fewer words there are to be searched in each bin, the faster
recognition would become. In other words, search is a suboptimal process. Direct access would be
more efficient, but then there would be no word frequency effect.
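A schematic implementation shows how this mechanism yields rank-frequency timing. The sketch below is ours: the bin key (first letter plus word length), the toy lexicon, and the timing assumption are all invented for illustration, since Forster's model does not specify them in code.

```python
from collections import defaultdict

# A schematic frequency-ordered bin search in the spirit of Forster (1976).

def build_bins(lexicon):
    """lexicon maps word -> frequency; each bin is sorted by descending
    frequency, so search position equals rank frequency within the bin."""
    bins = defaultdict(list)
    for word, freq in lexicon.items():
        bins[(word[0], len(word))].append((freq, word))
    for entries in bins.values():
        entries.sort(reverse=True)
    return bins

def lookup(bins, target):
    """Serial search; returns the number of comparisons (RT ~ a + b*rank),
    or None if the bin is exhausted (a 'no' response)."""
    for rank, (_, word) in enumerate(bins.get((target[0], len(target)), []), 1):
        if word == target:
            return rank
    return None

lex = {"time": 5000, "tide": 300, "tile": 120, "tine": 5}
bins = build_bins(lex)
print(lookup(bins, "time"))  # rank 1: found fastest
print(lookup(bins, "tine"))  # rank 4: found slowest
```

On this scheme the number of comparisons made before the target is found is exactly its rank frequency within the bin, and a search that exhausts the bin without a match yields a 'no' response.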
Frequency and learning
A deficiency in all of these explanations is that they simply indicate how a word frequency effect
might arise. None offers a convincing explanation for why word recognition should be influenced
by word frequency at all. Monsell (1991) has directly considered the question of why there should
be frequency effects. He argued that frequency effects follow naturally from a consideration of how
words are learned. His arguments were cast in the framework of connectionist learning models, and
he suggested that it was an inevitable consequence of such models that word recognition would
improve the more often a word was encountered. Indeed, all models of word recognition in the
parallel-distributed processing framework do show a word frequency effect (e.g. Plaut, 1997; Plaut,
McClelland, Seidenberg, & Patterson, 1996; Seidenberg & McClelland, 1989). Depending on the
exact details of the network, a connectionist model could predict that frequency would influence
bias, sensitivity, or both. However, two main factors undermine the learning argument somewhat.
The first is that not all connectionist learning networks need extensive experience to learn. In
particular, localist models are capable of single trial learning (see Page, 2000, for a discussion of
the merits of localist connectionist models). Human readers are also capable of very rapid and long-
lasting learning of new words (Salasoo, Shiffrin, & Feustel, 1985). The second is that readers will
have extensive experience with words from all but the lowest end of the frequency range. Why
should these words not end up being learned as well as high frequency words? In particular, one
would expect that readers with more experience would show a smaller frequency effect. Over time,
performance on high frequency words should approach asymptote, while low frequency words
would continue to improve. There is some support for this prediction. Tainturier, Tremblay, and
Lecours (1992) found that individuals with more formal education (18 years vs 11 years) showed a
smaller frequency effect. As Murray and Forster (2004) point out, the word frequency effect should
also get smaller with age. However, the frequency effect seems to either remain constant with age
(Tainturier, Tremblay, & Lecours, 1989) or to increase (Balota, Cortese, & Pilotti, 1999; Balota et
al., 2004; Spieler & Balota, 2000). This would seem to imply that learning is not very effective.
That is, a learning-based explanation of the word frequency effect seems to be predicated on the
assumption that the learning mechanism in the human brain fails to learn words properly even after
thousands of exposures. In effect, this form of explanation amounts to a claim that sensitivity to
word frequency is an unfortunate and maladaptive consequence of an inadequate learning system.
Search models are open to a similar criticism. A frequency ordered search process will make low
frequency words take longer to identify than high frequency words, but a parallel access system
would eliminate the disadvantage suffered by low frequency words completely. Once again, the
word frequency effect is explained as an undesirable side-effect of a suboptimal mechanism.
A Bayesian recognizer
One answer to the question of why word recognition should be influenced by frequency is provided
by asking how an ideal observer should make optimal use of the available information. The concept
of an ideal observer has a long history in vision research (Geisler, 1989; Hecht, Shlaer, & Pirenne,
1942; Weiss, Simoncelli, & Adelson, 2002), but has less commonly been applied to higher
cognitive processes. However, recently, the ideal observer analysis has also been applied to object
perception (Kersten, Mamassian, & Yuille, 2004), and eye movements in reading (Legge, Klitz, &
Tjan, 1997).
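To make the optimal-recognizer idea concrete, the computation at its heart is just Bayes' rule, stated here in our own notation as a standard identity rather than as a formula quoted from any particular model:

\[
P(w_i \mid e) \;=\; \frac{P(e \mid w_i)\, P(w_i)}{\sum_{j} P(e \mid w_j)\, P(w_j)}
\]

where $e$ is the perceptual evidence, $P(e \mid w_i)$ is the likelihood of that evidence given word $w_i$, and the prior $P(w_i)$ is most naturally estimated by the word's relative frequency. An observer computing this posterior must weight perceptual evidence by prior probability, so a high frequency word reaches any fixed posterior criterion on less evidence than a low frequency word. On this view frequency is not an arbitrary parameter setting but a necessary ingredient of optimal recognition.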