
Zipf’s word frequency law in natural language:
a critical review and future directions
Steven T. Piantadosi
June 2, 2015
Abstract
The frequency distribution of words has been a key object of study in statistical linguistics for the past 70 years. This distribution approximately follows a simple mathematical form known as Zipf's law. This paper first shows that human language has highly complex, reliable structure in the frequency distribution over and above this classic law, though prior data visualization methods obscured this fact. A number of empirical phenomena related to word frequencies are then reviewed. These facts are chosen to be informative about the mechanisms giving rise to Zipf's law, and are then used to evaluate many of the theoretical explanations of Zipf's law in language. No prior account straightforwardly explains all the basic facts, nor is supported with independent evaluation of its underlying assumptions. To make progress at understanding why language obeys Zipf's law, studies must seek evidence beyond the law itself, testing assumptions and evaluating novel predictions with new, independent data.
1 Introduction
One of the most puzzling facts about human language is also one of the most basic: words occur according to a famously systematic frequency distribution such that there are few very high frequency words that account for most of the tokens in text (e.g. "a", "the", "I", etc.), and many low frequency words (e.g. "accordion", "catamaran", "ravioli"). What is striking is that the distribution is mathematically simple, roughly obeying a power law known as Zipf's law: the rth most frequent word has a frequency f(r) that scales according to

    f(r) \propto \frac{1}{r^\alpha}    (1)

for α ≈ 1 (Zipf, 1936, 1949).^1 In this equation, r is called the "frequency rank" of a word, and f(r) is its frequency in a natural corpus. Since the actual observed frequency will depend on the size of the corpus examined, this law states frequencies proportionally: the most frequent word (r = 1) has a frequency proportional to 1, the second most frequent word (r = 2) has a frequency proportional to 1/2, the third most frequent word has a frequency proportional to 1/3, etc.

^1 Note that this distribution is phrased over frequency ranks because the support of the distribution is an unordered, discrete set (i.e. words). This contrasts with, for instance, a Gaussian, which is defined over a complete, totally-ordered field (R^n), and so has a more naturally visualized probability mass function.
Mandelbrot proposed and derived a generalization of this law that more closely fits the frequency distribution in language by "shifting" the rank by an amount β (Mandelbrot, 1962, 1953):

    f(r) \propto \frac{1}{(r+\beta)^\alpha}    (2)

for α ≈ 1 and β ≈ 2.7 (Zipf, 1936, 1949; Mandelbrot, 1962, 1953). This paper will study (2) as the current incarnation of "Zipf's law," although we will use the term "near-Zipfian" more broadly to mean frequency distributions where this law at least approximately holds. Such distributions are observed universally in languages, even in extinct and yet-untranslated languages like Meroitic (R. D. Smith, 2008).
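To make the shape of (2) concrete, here is a minimal Python sketch (my own illustration; the 50,000-word vocabulary size and the default parameter values are assumptions, not figures from this paper) that computes normalized Zipf-Mandelbrot probabilities over frequency ranks:

    import numpy as np

    def zipf_mandelbrot(V, alpha=1.0, beta=2.7):
        """Normalized probabilities p(r) ~ 1/(r + beta)^alpha for ranks 1..V (Equation 2)."""
        ranks = np.arange(1, V + 1)
        weights = 1.0 / (ranks + beta) ** alpha
        return weights / weights.sum()

    p = zipf_mandelbrot(50000)   # hypothetical 50,000-word vocabulary
    print(p[0] / p[1])           # how much more frequent rank 1 is than rank 2
    print(p[:10].sum())          # probability mass carried by the ten most frequent words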
It is worth reflecting on the peculiarity of this law. It is certainly a nontrivial property of human language that words vary in frequency at all; it might have been reasonable to expect that all words should be about
equally frequent. But given that words do vary in frequency, it is unclear why words should follow such a precise mathematical rule, in particular one that does not reference any aspect of each word's meaning. Speakers generate speech by needing to communicate a meaning in a given world or social context; their utterances obey much more complex systems of syntactic, lexical, and semantic regularity. How could it be that the intricate processes of normal human language production conspire to result in a frequency distribution that is so mathematically simple, perhaps "unreasonably" so (Wigner, 1960)?
This question has been a central concern of statistical language theories for the past 70 years. Derivations of Zipf's law from more basic assumptions are numerous, both in language and in the many other areas of science where this law occurs (for overviews, see Mitzenmacher, 2004; Newman, 2005; Farmer & Geanakoplos, 2006; Saichev, Malevergne, & Sornette, 2010). Explanations for the distribution across the sciences span many formal ideas, frameworks, and sets of assumptions. To give a brief picture of the range of explanations that have been worked out, such distributions have been argued to arise from random concatenative processes (Miller, 1957; W. Li, 1992; Conrad & Mitzenmacher, 2004), mixtures of exponential distributions (Farmer & Geanakoplos, 2006), scale-invariance (Chater & Brown, 1999), (bounded) optimization of entropy (Mandelbrot, 1953) or Fisher information (Hernando, Puigdomenech, Villuendas, Vesperinas, & Plastino, 2009), the invariance of such power laws under aggregation (see Farmer & Geanakoplos, 2006), multiplicative stochastic processes (see Mitzenmacher, 2004), preferential re-use (Yule, 1944; Simon, 1955), symbolic descriptions of complex stochastic systems (Corominas-Murtra & Sole, 2010), random walks on logarithmic scales (Kawamura & Hatano, 2002), semantic organization (Guiraud, 1968; D. Manin, 2008), communicative optimization (Zipf, 1936, 1949; Mandelbrot, 1962; Ferrer i Cancho & Sole, 2003; Ferrer i Cancho, 2005a; i Cancho, 2005; Salge, Ay, Polani, & Prokopenko, 2013), random division of elements into groups (Baek, Bernhardsson, & Minnhagen, 2011), first- and second-order approximation of most common (e.g. normal) distributions (Belevitch, 1959), optimized memory search (Parker-Rhodes & Joyce, 1956), among many others.
For language in particular, any such account of Zipf's law provides a psychological theory about what must be occurring in the minds of language users. Is there a multiplicative stochastic process at play? Communicative optimization? Preferential re-use of certain forms? In the face of such a profusion of theories, the question quickly becomes which, if any, of the proposed mechanisms provides a true psychological account of the law. This means an account which is connected to independently testable phenomena and mechanisms, and fits with the psychological processes of word production and language use.

Unfortunately, essentially all of the work in language research has focused solely on deriving the law itself in principle; very little work has attempted to assess the underlying assumptions of the hypothesized explanation, a problem for much work on power laws in science (Stumpf & Porter, 2012).^2 It should be clear why this is problematic: the law itself can be derived from many starting points. Therefore, the ability of a theory to derive the law provides very weak evidence for that account's cognitive validity. Other evidence is needed.
This paper reviews a wide range of phenomena any theory of word frequency distributions and Zipf's law must be able to handle. The hope is that a review of facts about word frequencies will push theorizing about Zipf's law to address a broader range of empirical phenomena. This review intentionally steers clear of other statistical facts about text (e.g. Heaps' law, etc.) because these are thoroughly reviewed in other work (see Baayen, 2001; Popescu, 2009). Instead, we focus here specifically on facts about word frequencies which are informative about the mechanisms giving rise to Zipf's law.^3

We begin first, however, by pointing out an important feature of the law: it is not as simple as Zipf and others since have suggested. Indeed, some of the simplicity of the relationship between word frequency and frequency rank is the result of a statistical sin that is pervasive in the literature. In particular, the plots which motivate equation (2) almost always have unaddressed, correlated errors, leading them to look simpler than they should. When this is corrected, the complexities of the word frequency distribution become more apparent. This point is important because it means that (2) is at best a good approximation to what is demonstrably a much more complicated distribution of word frequencies. This complication means that
^2 As they write, "Finally, and perhaps most importantly, even if the statistics of a purported power law have been done correctly, there is a theory that underlies its generative process, and there is ample and uncontroversial empirical support for it, a critical question remains: What genuinely new insights have been gained by having found a robust, mechanistically supported, and in-all-other-ways superb power law? We believe that such insights are very rare."

^3 Importantly, however, other statistical properties are also likely informative, as a "full" theory of word frequencies would be able to explain a wide range of empirical phenomena.
detailed statistical analysis of what particular form the word frequency distribution takes (e.g. (1) vs. (2) vs. lognormal distributions, etc.) will not be fruitful: none is strictly "right."

Following those results, this paper presents and reviews a number of other facts about word frequencies. Each fact about word frequencies is studied because of its relevance to a proposed psychological account of Zipf's law. Most strikingly, Section 3.7 provides experimental evidence that near-Zipfian word frequency distributions occur for novel words in a language production task. Section 4 then reviews a number of formal models seeking to explain Zipf's law in language, and relates each proposed account to the empirical phenomena discussed in Section 3.
2 The word frequency distribution is complex
Quite reasonably, a large body of work has sought to examine what form most precisely fits the word frequency distribution observed in natural language. Zipf's original suggestion of (1) was improved by Mandelbrot to that in (2), but many other forms have been suggested including, for instance, a log-normal distribution (Carroll, 1967, 1969), which might be considered a reasonably "null" (e.g. unremarkable) hypothesis.

A superb reference for comparing distributions is Baayen (2001, Chapter 3), who reviews evidence for and against a log-normal distribution (Carroll, 1967, 1969), a generalized inverse Gauss-Poisson model (Sichel, 1975), and a generalized Z-distribution (Orlov & Chitashvili, 1983) for which many other models (due to, e.g., Yule, 1924; Simon, 1955, 1960; Herdan, 1960, 1964; Rouault, 1978; Mandelbrot, 1962) are a special case (see also Montemurro, 2001; Popescu, 2009). Baayen finds with a quantitative model comparison that which model is best depends on which corpus is examined. For instance, the log-normal model is best for the text The Hound of the Baskervilles, but the Yule-Simon model is best for Alice in Wonderland. One plausible explanation for this is that none of these simple models, including the Zipf-Mandelbrot law in Equation (2), is "right,"^4 instead only capturing some aspects of the full distribution of word frequencies.
Indeed, none is right. The apparent simplicity of the distribution is an artifact of how the distribution is plotted. The standard method for visualizing the word frequency distribution is to count how often each word occurs in a corpus, and sort the word frequency counts by decreasing magnitude. The frequency f(r) of the r'th most frequent word is then plotted against the frequency rank r, yielding typically a mostly linear curve on a log-log plot (Zipf, 1936), corresponding to roughly a power law distribution.^5 This approach, though essentially universal since Zipf, commits a serious error of data visualization. In estimating the frequency-rank relationship this way, the frequency f(r) and frequency rank r of a word are estimated on the same corpus, leading to correlated errors between the x-location r and y-location f(r) of points in the plot.

This is problematic because it may suggest spurious regularity.^6 The problem can best be understood by a simple example. Imagine that all words in language were actually equally probable. In any sample (corpus) of words, we will find that some words occur more than others just by chance. When plotted in the standard manner, we will find a strikingly decreasing plot, erroneously suggesting that the true frequency-rank relationship has some interesting structure to be explained. This spurious structure is especially problematic for low frequency words, whose frequencies are measured least precisely. Additionally, in the standard plot, deviations from the Zipfian curve are difficult to interpret due to the correlation of measurement errors: it is hard to tell systematic deviations from noise.
Fortunately, the problem is easily fixed: we may use two independent corpora to estimate the frequency and frequency rank. In the above case where all words are equally probable, use of independent corpora will lead to no apparent structure, just a roughly flat frequency-rank relationship. In general, we need not have two independent corpora from the start: we can imagine splitting our initial corpus into two subcorpora before any text processing takes place. This creates two corpora which are independent bodies of text (conditioned on the general properties of the starting corpus), and so from which we can independently estimate r and f(r). A convenient technique to perform this split is to perform a binomial split on the observed frequency of each word: if we observe a word, say, 100 times, we may sample from a binomial (N = 100, p = 0.5) and arrive at a frequency of, say, 62 used to estimate its true frequency, and a frequency of N - 62 = 38 to estimate its true frequency rank. This exactly mirrors if we had randomly put tokens of each word into two independent corpora, before any text processing began. The choice of p = 0.5 is not necessary, but yields two corpora of approximately the same size. With this method, the deviations from flat are interpretable and our plotting method no longer introduces any erroneous structure.
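As a concrete illustration of this splitting procedure, here is a minimal Python sketch (the toy corpus is hypothetical and the code is my own; the paper's analysis code is not shown here):

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)

    def binomial_split(counts, p=0.5):
        """Split each word's count into two statistically independent halves,
        mirroring a random division of tokens into two subcorpora."""
        words = list(counts)
        n = np.array([counts[w] for w in words])
        a = rng.binomial(n, p)   # half used to estimate frequency
        b = n - a                # half used to estimate frequency rank
        return dict(zip(words, a)), dict(zip(words, b))

    counts = Counter("the cat sat on the mat the end".split())
    freq_half, rank_half = binomial_split(counts)
    ranking = sorted(rank_half, key=rank_half.get, reverse=True)
    for r, w in enumerate(ranking, start=1):
        # rank r comes from one half, the plotted frequency from the other
        print(r, w, freq_half[w])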
^4 See Ferrer-i-Cancho and Servedio (2005) for related arguments based on the range of Zipfian exponents.

^5 Since linearity on a log-log plot means that \log f = a \log r + b, so f = e^b r^a \propto r^a.

^6 Such estimation also violates the assumptions of typical algorithms used to fit Zipfian exponents, since most fitting algorithms assume that x is known perfectly and only y is measured with error. This concern applies in principle to maximum-likelihood estimation, least squares (on log-log values), and any other technique that places all of the measurement error on frequencies, rather than both frequencies and frequency ranks.
[Figure 1: panel (a) plots log_e normalized frequency against log_e frequency rank, with fit α=1.13, β=2.73, R^2=0.91 ***, R^2_adj=0.97; panel (b) plots error (log space) against log_e frequency rank.]
Figure 1: 1(a) shows the relationship between frequency rank (x-axis) and (normalized) frequency (y-axis) for words from the American National Corpus. This is plotted using a two-dimensional hexagonal histogram. Bins are shaded blue to green along a logarithmic scale depending on how many words fall into the bin. The red line shows the fit of (2) to this data. 1(b) shows frequency rank versus the difference (in log space) between a word's frequency and the prediction of (2). This figure shows only a subset of the full y range, cropping some extreme outliers on the right hand side of the plot in order to better visualize this error for the high frequency words.
Figure 1(a) shows such a plot, giving the frequency/frequency-rank relationship from the American National Corpus (Reppen & Ide, 2004), a freely available collection of written American English. All figures in this paper follow this plotting procedure, unless otherwise noted. The plot shows a two-dimensional histogram of where words fall in frequency/frequency-rank space.^7 The shading of the histogram is done logarithmically with the number of words falling into each hexagonal bin, and is white for zero-count bins. Because the plot has a logarithmic y-axis, words with zero frequency after the split are not shown. The fit of (2) using a maximum-likelihood method on the separate frequency and frequency rank portions of the corpus is shown in the red solid line. Additionally, a locally-smoothed regression line (LOESS) (Cleveland, Grosse, & Shyu, 1992) is shown in gray. This line corresponds to a local estimate of the mean value of the data, and is presented as a comparison point to see how well the fit of (2) matches the expected value of the points for each frequency rank (x-value). In the corner several key values are reported: the fit α and β, an R^2 measure giving the amount of variance explained by the red line fit, and an adjusted R^2_adj capturing the proportion of explainable variance captured by the fit, taking the smoothed regression as an estimate of the maximum amount of variance explainable. For simplicity, statistics are computed only on the original R^2, and its significance is shown with standard star notation (three stars means p < 0.001).

^7 In these plots, tied ranks are not allowed, so words of the same frequency are arbitrarily ordered.
This plot makes explicit several important properties of the distribution. First, it is approximately linear on a log-log plot, meaning the word frequency distribution is approximately a power law, and moreover is fit very well by (2) according to the correlation measures. This plot shows higher variability towards the low frequency end, (accurately) indicating that we cannot estimate the curve reliably for low frequency words. While the scatter of points is no longer monotonic, note that the true plot relating frequency to frequency rank must be monotonic by definition. Thus, one might imagine estimating the true curve by drawing any monotonic curve through this data. At the low frequency end we have more noise and so greater uncertainty
about the shape of that curve. This plot also shows that equation (2) provides a fairly accurate fit (red) to the overall structure of the frequency-rank relationship across both corpora.
Importantly, because we have estimated r and f(r) in a statistically independent way, deviations from the curve can be interpreted. Figure 1(b) shows a plot of these deviations, corresponding to the residuals of frequency once (2) is fit to the data. Note that if the true generating process were something like (2), the residuals should be only noise, meaning that which points are above and below the fit line (y = 0 in the residual plot) should be determined entirely by chance. There should be no observable structure to the residual plot. Instead, what Figure 1(b) reveals is that there is considerable structure to the word frequency distribution beyond the fit of the Zipf-Mandelbrot equation, including numerous minima and maxima in the error of this fit. This is most apparent by the "scoop" on the right hand side of the plot, corresponding to mis-estimation of higher ranked (lower-frequency) words. This type of deviation has been observed previously with other plotting methods and modeled as a distinct power law exponent by Ferrer i Cancho and Sole (2001), among others.
However, what is more striking is the systematic deviation observed in the left half of this plot, corresponding to low rank (high frequency) words. Even the most frequent words do not exactly follow Zipf's law. Instead, there is a substantial autocorrelation, corresponding to the many local minima and maxima ("wiggles") in the left half of this plot. This indicates that there are further statistical regularities, apparently quite complex, that are not captured by (2). These autocorrelations in the errors are statistically significant using the Ljung-Box Q-test (Ljung & Box, 1978) for residual autocorrelation (Q = 126810.1, p < 0.001), even for the most highly-ranked twenty-five (Q = 5.7, p = 0.02), fifty (Q = 16.7, p < 0.001), or hundred (Q = 39.8, p < 0.001) words examined.
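For readers who want to run this kind of diagnostic themselves, the Ljung-Box test is available in statsmodels; the residuals below are synthetic stand-ins, since the point is only to illustrate the mechanics:

    import numpy as np
    from statsmodels.stats.diagnostic import acorr_ljungbox

    # Synthetic residuals with some autocorrelation, standing in for the
    # log-space deviations from the Zipf-Mandelbrot fit, ordered by rank.
    rng = np.random.default_rng(0)
    residuals = np.convolve(rng.normal(size=200), np.ones(5) / 5, mode="valid")

    # Q statistics and p-values at a few lag choices; small p-values indicate
    # autocorrelation remaining in the residuals.
    print(acorr_ljungbox(residuals, lags=[5, 10, 20]))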
Such complex structure should have been expected: of course the numerous influences on language production result in a distribution that is complex and structured. However, the complexity is not apparent in standard ways of plotting power laws. Such complexity is probably incompatible with attempts to characterize the distribution with a simple parametric law, since it is unlikely a simple equation could fit all of the minima and maxima observed in this plot. At the same time, almost all of the variance in frequencies is fit very well by a simple law like Zipf's power law or its close relatives. A simple relationship captures a considerable amount about word frequencies, but clearly will not explain everything. The distribution in language is only near-Zipfian.
3 Empirical phenomena in word frequencies
Having established that the distribution of word frequencies is more complex than previously supposed, we now review several basic facts about word frequencies which any theory of the Zipfian or near-Zipfian distribution must account for. The plan of this paper is to present these empirical phenomena in this section, and then use them to frame specific model-based accounts of Zipf's law in Section 4. As we will see, the properties of word frequencies reviewed in this section will have much to say about the most plausible accounts of the word frequency distribution in general.
The general method followed in this section is to study relevant subsets of the lexicon and quantify the fit of (2). This approach contrasts somewhat with the vast literature on statistical model comparison to check for power laws (as compared to, e.g., lognormal distributions, etc.). The reason for this is simple: Section 2 provides strong evidence that no simple law can be the full story behind word frequencies because of the complexities of the frequency rank / frequency curve. Therefore, comparisons between simple models will inevitably be between alternatives that are both "wrong".
In general, it is not so important which simple distributional form is a better approximation to human language. What matters more are the general properties of word frequencies that are informative about the underlying mechanisms behind the observed distribution. This section tries to bring out those general properties. Do the distributions appear near-Zipfian for systematic subsets of words? Are distributions that look similar to power laws common across word types, or are they restricted to words with certain syntactic or semantic features? Any psychologically-justified theory of the word frequency distribution will depend on appreciating, connecting to, and explaining these types of high-level features of the lexical frequency distribution.
[Figure 2: one panel per language, plotting log_e normalized frequency against aggregate log_e frequency rank. Fit values: English α=1.40, β=1.88; Spanish α=1.84, β=3.81; Russian α=1.88, β=5.02; Greek α=1.17, β=-0.45; Portuguese α=3.68, β=26.26; Chinese α=1.46, β=3.13; Swahili α=0.79, β=-0.66; Chilean α=2.42, β=6.56; Finnish α=1.17, β=-0.60; Estonian α=1.75, β=6.71; French α=1.71, β=2.09; Czech α=1.02, β=-0.42; Turkish α=1.01, β=0.27; Polish α=1.67, β=5.03; Basque α=1.03, β=-0.13; Maori α=0.51, β=-0.94; Tok Pisin α=0.93, β=0.22; German α=1.10, β=0.40; all R^2 values significant at p < 0.001.]
Figure 2: Cross-linguistic word frequency distributions using words from a Swadesh list (data provided by Calude and Pagel, 2011). Here, the x-location of each point (word) is fixed across languages according to the aggregate frequency rank of the word's meaning on an independent set of data. The systematicity here means that the word frequency distribution falls off similarly according to word meaning across languages, and approximately according to a power law like (2) (red).
3.1 Semantics strongly influences word frequency

As a language user, it certainly seems like we use words to convey an intended meaning. From this simple point of view, Zipf's law is really a fact about the "need" distribution for how often we need to communicate each meaning. Surprisingly, many accounts of the law make no reference to meaning and semantics (except see 4.3 and some work in 4.4), deriving it from principles independent of the content of language. But this view is incompatible with the fact that even cross-linguistically, meaning is systematically related to frequency. Calude and Pagel (2011) examined Swadesh lists from 17 languages representing six language families and compared frequencies of words on the list. Swadesh lists provide translations of simple, frequent words like "mother" across many languages; they are often used to do historical reconstruction. Calude and Pagel (2011) report an average inter-language correlation in log frequency of R^2 = 0.53 (p < 0.0001) for these common words, indicating that word frequencies are surprisingly robust across languages and predictable from their meanings. Importantly, note that Swadesh words will tend to be high-frequency, so the estimated R^2 is almost certain to be lower for less frequent words. In any case, if meaning has any influence on frequency, a satisfying account of the frequency distribution will have to address it.
[Figure 3: panels (a)-(c) plot log_e normalized frequency against log_e cardinality, with fits (a) α=3.05, β=2.25, R^2=0.97 ***, R^2_adj=0.99; (b) α=2.52, β=0.22, R^2=0.97 ***, R^2_adj=1.00; (c) α=4.37, β=6.36, R^2=0.94 ***, R^2_adj=0.99.]
Figure 3: Power law frequencies for number words ("one", "two", "three", etc.) in English (a), Russian (b) and Italian (c) using data from Google (Lin et al., 2012). Note that here the x-axis is ordered by cardinality, not frequency rank, although these two coincide. Additionally, decades ("ten", "twenty", "thirty", etc.) were removed from this analysis due to unusually high frequency from their approximate usage. Here and in all plots the red line is the fit of (2) and the gray line is a LOESS.
We can also see systematic frequency-rank relationships across languages, grouping words by their meaning. Figure 2 shows frequency-rank plots of the Swadesh lists compiled in Calude and Pagel (2011),^8 plotted, like all other plots in the paper, according to the methods in Section 2. However, unlike other plots in this paper, the frequency rank here is fixed across all languages, estimated independently on 25% of the data from each language and then collapsed across languages. Thus, the rank ordering, corresponding to the x-location of each meaning on the Swadesh list, does not vary by language and is determined only by aggregate, cross-linguistic frequency (independently estimated from the y-location). We can then compare the frequencies at each rank to see if they follow similar distributions. As these plots reveal, the distributions are extremely similar across languages, and follow a near-Zipfian distribution for the pooled rank ordering.

In this plot, because the rank ordering is fixed across all languages, not only do frequencies fall off like (Equation 2), but they do so with roughly the same coefficients cross-linguistically. If frequency were not systematically related to meaning, these plots would reveal no such trends.

^8 We are extremely grateful to the authors for providing this data.
Another domain where the meaning-dependence of frequency is apparent is that of number words (Dehaene & Mehler, 1992; Piantadosi, in press). Figure 3 shows number word frequencies (e.g. "one", "two", "three", etc.), previously reported in Piantadosi (in press). These plots show cardinality vs. frequency in English, Russian, and Italian, using all the data from the Google Books N-gram dataset (Lin et al., 2012). This clearly shows that across languages, number words follow a near-Zipfian distribution according to their magnitude (meaning), in fact, a very particular one with exponent α ≈ 2 (the "inverse square law" for number frequency), a finding previously reported by Dehaene and Mehler (1992). Piantadosi (in press) shows that these trends also hold for the decade words, and across historical time. Thus, the frequency of these words is predictable from what cardinality the words refer to, even across languages.
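The inverse square law's predictions are easy to compute directly; here is a small sketch (my own illustration, with the decades excluded as in Figure 3):

    import numpy as np

    # p(n) ~ 1/n^2 for the number words "one" through "nine".
    n = np.arange(1, 10)
    p = 1.0 / n**2
    p /= p.sum()
    for cardinality, prob in zip(n, p):
        print(cardinality, round(float(prob), 3))
    # "one" is predicted to be four times as frequent as "two" and
    # nine times as frequent as "three".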
The general point from this section is therefore that word meaning is a substantial determinant of frequency, and it is perhaps intuitively the best causal force in shaping frequency. "Happy" is more frequent than "disillusioned" because the meaning of the former occurs more commonly in topics people like to discuss. A psychologically-justified explanation of Zipf's law in language must be compatible with the powerful influence that meaning has on frequency.
3.2 Near-Zipan distributions occur for xed referential content
Giventhatmeaningsinpartdeterminefrequencies,itisimportanttoaskifthereareanyphenomenawhich
cannot be straightforwardly explainedintermsofmeaning. Oneplacetolookiswordswhichhave roughly
8
Weareextremelygratefultotheauthorsfor providingthisdata.
[Figure 4: panels (a) and (b) plot log_e normalized frequency against log_e frequency rank, with fits (a) α=1.84, β=0.65, R^2=0.95 ***, R^2_adj=0.99; (b) α=3.75, β=8.48, R^2=0.92 ***, R^2_adj=1.00.]
Figure 4: Distributions for taboo words for (a) sex (gerunds) and (b) feces.
[Figure 5: panels (a)-(c) plot log_e normalized frequency against log_e frequency rank, with fits (a) α=1.54, β=9.43, R^2=0.97 ***, R^2_adj=1.00; (b) α=11.65, β=13.49, R^2=0.95 ***, R^2_adj=0.98; (c) α=3.53, β=17.17, R^2=0.91 ***, R^2_adj=0.94.]
Figure 5: Frequency distribution in the ANC for words whose scope of meaning has been highly constrained by the natural world: (a) months, (b) planets, (c) elements.
3.3 Near-Zipan distributions occur for naturally constrained meanings
Ifmeaningsinpart determinewordfrequencies,itisplausiblethat thedistributionarisesfromhowhumans
languagessegmenttheobservableworldintolabeledcategories(seeSection4.3). Forinstance,languagesare
insomesensefreeto choose the rangeof referentsfor eachword10: should\dog"refer toa specickindof
dog,or abroadclass,or toanimalsingeneral? Perhapslanguageevolution’sprocessfor choosingthescope
ofwordmeaningsgivesrisetothefrequencydistribution(foradetailedaccount,see D.Manin,2008).
However, the distribution follows a near-Zipfian distribution even in domains where the objects of reference are highly constrained by the natural world. Figure 5(a)-5(c) shows several of these domains,^11 chosen a priori for their semantic fixedness: months, planets, and element names. Intuitively, in each of these cases, it is likely that the lexicon did not have much freedom in how it labeled the terms in these categories, since the referents of these terms are salient, fixed natural kinds. For instance, our division of the world into 12 months comes from phases of the moon and the seasons, not from a totally free choice that language may easily adapt or optimize. These plots all show close fits by (2), shown in red, and high, reliable correlations.
^9 More common taboo words meaning "penis" and "vagina" were not used since many of their euphemisms have salient alternative meanings (e.g. "cock" and "box").

^10 Although there are compelling regularities in at least some semantic domains; see, e.g., Kemp and Regier (2012); Kay and Regier (2003).

^11 In the elements, "lead" and "iron" were excluded due to their ambiguity, and thus frequent use as non-elements. In the months, "May" and "March" were removed for their alternative meanings.
[Figure 6: log_e frequency plotted against log_e frequency rank for part of speech tags (nn, in, dt, jj, nnp, nns, rb, prp, ...), with fit α=112.93, β=687.28, R^2=0.93 ***, R^2_adj=1.00.]
Figure 6: Frequency distribution of syntactic categories from the Penn Treebank.
3.4 The t of Zipan distributions vary by category
Zipf’slaw isstated asa fact about the distributionof words, but it isimportant to remember that there
maynotbeanythingparticularlyspecialabout analyzinglanguageatthelevelofwords. Indeed,wordsmay
notevenbeapreciselydenedpsychologicalclass,withmanyidiomaticphrasesstoredtogetherbylanguage
processingmechanisms,andsomewordformspotentiallycreatedonthe ybygrammaticalormorphological
mechanisms. It isthereforeimportant toexaminethefrequencydistributionfor otherlevelsof analysis.
Figure6showsthefrequencydistributionofvarioussyntacticcategories(partofspeechtagsonindividual
words)fromthePennTreebank(Marcus,Marcinkiewicz,&Santorini,1993),usingthetaggedBrowncorpus.
Thisrevealsthatwordcategoriesarealsotnicelyby (2)|perhapsevenmorecloselythanwords|butthe
shapeofthet (parametersand) diers. The qualityoft appearstoofor the lowestfrequencytags,
althoughit is not clear howmuchof this eect isdue to datasparsity. The general pattern suggeststhat
afull explanation of the wordfrequency distribution would ideally call on mechanismsgeneral enough to
applytosyntactic categoriesandpossiblyevenother levelsofanalysis
12
.
The same corpus can also be used to examine the fit and parameters within syntactic categories. Figure 7(a)-7(f) shows the distribution of words within each of six categories from the treebank: determiners (DT), prepositions/subordinating conjunctions (IN), modals (MD), singular or mass nouns (NN), past participle verbs (VBN), and 3rd person singular present tense verbs (VBZ). None of these were predicted to pattern in any specific way by any particular theory, but were chosen post-hoc as interesting examples of distributions. Determiners, modals, and some verbs appear to have the lowest adjusted correlations when (2) is fit. These figures illustrate that the word types vary substantially in the best-fitting parameters α and β, but show in general fairly Zipfian distributions. Additionally, the residual structure (deviation from the red line fit) shows interesting variability between categories. For instance, the verbs (7(f)) show an interesting concavity that is the opposite of that observed in typical Zipfian distributions, bowing to the bottom rather than the top. This concavity is primarily driven by the much larger frequency of the first several words like "is", "has", and "does." These auxiliary verbs may in truth belong in a separate category than other verbs, perhaps changing the shape of this curve. There also appears to be a cluster of low-frequency modals of about all the same frequency. The determiner plot suggests that the rate at which frequency decreases with rank changes through two scaling regimes, a slow fall-off followed by a fast one, which is often argued for the lexicon in general (Ferrer i Cancho & Sole, 2001) and would be inconsistent with the simple fit of (2). Overall, the variability across part of speech categories suggests that some of the fit of the Zipfian distribution arises by collapsing together different parts of speech.
^12 It is apparently unclear whether N-grams in text follow Zipf's law (see Egghe (1999, 2000); cf. Ha, Sicilia-Garcia, Ming, and Smith (2002); Ha, Hanna, Ming, and Smith (2009)).
[Figure 7: panels (a)-(f) plot log_e normalized frequency against log_e frequency rank within categories, with fits (a) DT α=2.15, β=0.21, R^2=0.91 ***, R^2_adj=0.93; (b) IN α=2.37, β=4.25, R^2=0.95 ***, R^2_adj=0.97; (c) MD α=119.60, β=458.71, R^2=0.86 ***, R^2_adj=0.88; (d) NN α=1.15, β=77.06, R^2=0.86 ***, R^2_adj=0.96; (e) VBN α=0.82, β=-0.27, R^2=0.84 ***, R^2_adj=0.96; (f) VBZ α=1.04, β=-0.81, R^2=0.81 ***, R^2_adj=0.93.]
Figure 7: Frequency distribution of words within several syntactic categories from the Penn Treebank: determiners (DT), prepositions or subordinating conjunctions (IN), modals (MD), nouns (NN), past participle verbs (VBN), 3rd person singular present verbs (VBZ). These plots represent a post-hoc-selected subset of all syntactic categories.
ingeneralfairly Zipandistributions. Additionally, theresidualstructure (deviationfrom the redline t)
showsinterestingvariabilitybetweencategories. For instance,theverbs(7(f))showaninterestingconcavity
that istheoppositeofthat observedintypicalZipandistributions,bowingtothebottom rather thanthe
top. This concavity is primarily driven by the much larger frequency of the rst several words like \is",
\has", and \does." These auxiliary verbs may in truth belong in a separate category than other verbs,
perhaps changing the shape of this curve. There also appears to be a cluster of low-frequency modals of
about allthe same frequency. Thedeterminer plot suggeststhat therate atwhichfrequencydecreaseswith
rank changesthroughtwo scalingregimes|a slowfall-ofollowedbya fastone|whichisoftenarguedfor
thelexiconingeneral(FerreriCancho&Sole,2001)andwouldbeinconsistent withthesimpletof(2).
Overall,thevariabilityacrosspartofspeechcategoriessuggeststhatsomeofthetofZipandistribution
arisesby collapsingtogetherdierent partsof speech.
3.5 The distribution of word frequencies is not stationary
An often over-looked factor in the search for explanations of Zipf's law is that word frequencies are not stationary, meaning that the probability of uttering each word changes depending on other factors. This phenomenon occurs at, for instance, a long timescale reflecting the topic of discussion. One is more likely to utter "Dallas" in a discussion about Lyndon Johnson than in a discussion about Carl Sagan. The non-stationarity of text is addressed by Baayen (2001, Chapter 5), who notes that the clumpy randomness of real text leads to difficulties estimating vocabulary sizes and distributions. Recently, Altmann, Pierrehumbert, and Motter (2009) showed that word recurrences on a timescale compatible with semantics (not syntax) follow a stretched exponential distribution, with a certain degree of "burstiness." The variability in frequencies is an important method of classification of documents via topic models (see Blei, Ng, & Jordan, 2003; Steyvers & Griffiths, 2007; Blei & Lafferty, 2007, 2009) or latent semantic analysis (Landauer, Foltz, & Laham, 1998; Dumais, 2005). Such models work by essentially noting that word frequencies within a document are cues to its semantic topic; one can then work backwards from the frequencies to the topic or set of possible topics. The variability in word frequencies is also useful in information retrieval (Manning & Schutze, 1999, Chapter