context
and information rate in natural language
∗
D. Yu. Manin
December 26, 2006
Abstract
Based on data from a large-scale experiment with human subjects, we conclude that the logarithm of probability to guess a word in context (unpredictability) depends linearly on the word length. This result holds both for poetry and prose, even though with prose, the subjects don't know the length of the omitted word. We hypothesize that this effect reflects a tendency of natural language to have an even information rate.
1 Introduction
In this paper we report a particular result of an experimental study on predictability of words in context. The experiment's primary motivation is the study of some aspects of poetry perception, but the result reported here is, in the author's view, of a general linguistic interest.
The first study of natural text predictability was performed by the founder of information theory, C. E. Shannon [1]. (We'll note that even in his groundbreaking work [2], Shannon briefly touched on the relationship between literary qualities and redundancy by contrasting highly redundant Basic English with Joyce's Finnegans Wake which
∗
This is a somewhat expanded version of the paper Manin, D.Yu. 2006. Experiments on predictability of word in context and information rate in natural language. J. Information
semantic content.) Shannon presented his subject with random passages from Jefferson's biography and had her guess the next letter until the correct guess was recorded. The number of guesses for each letter was then used to calculate upper and lower bounds for the entropy of English, which turned out to be between 0.6 and 1.3 bits per character (bpc), much lower than that of a random mix of the same letters. Shannon's results also indicated that conditional entropy decreases as more and more text history becomes known to the subject, up to at least 100 letters.
Several authors repeated Shannon's experiments with some modifications. Burton and Licklider [3] used 10 different texts of similar style, and fragment lengths of 1, 2, 4, ..., 128, 1000 characters. Their conclusion was that, contrary to Shannon, increasing history doesn't affect measured entropy when history length exceeds 32 characters.
Fonagy [4] compared predictability of the next letter for three types of text: poetry, newspaper, and a conversation of two young girls. Apparently, his technique involved only one guess per letter, so entropy estimates could not be calculated (see below), and results are presented in terms of the rate of correct answers, poetry being much less predictable than both other types.
Kolmogorov reported the results of 0.9-1.4 bpc for Russian texts in his work [5] that laid the ground of algorithmic complexity theory. The paper contains no details on the very ingenious experimental technique, but it is described in the well-known monograph by Yaglom & Yaglom [6].
Cover and King [7] modified Shannon's technique by having their subjects place bets on the next letter. They showed that the optimal betting policy would be to distribute available capital among the possible outcomes according to their probability and so if the subjects play in an optimal way (which is not self-evident though), the letter probabilities could be inferred from their bets. Their estimate of the entropy of English was calculated at 1.3 bpc. This work also contains an extensive bibliography.
Moradi et al. [8] first used two different texts (a textbook on digital signal processing and a novel by Judith Krantz) to confirm Burton and Licklider's results on the critical history length (32 characters), then added two more texts (101 Dalmatians and a federal aviation manual) to study the dependence of entropy on text type and subject
language by means of statistical analysis, without using human subjects. One of the first attempts is reported in [9], where 39 English translations of 9 classical Greek texts were used to study entropy dependency on subject matter, style, and period. A very crude entropy estimate by letter digram frequency was used. For some of the more recent developments, see [10], [11] and references therein. By the very nature of these methods they can't utilize meaning (and even syntax) of the text, but by the brute force of contemporary computers they begin achieving results that come reasonably close to those demonstrated by human language speakers.
Our experimental setup differs from the previous work in two important aspects. First, we have subjects guess whole words, and not individual characters. Second, the words to be guessed come (generally speaking) from the middle of a context, rather than at the end of a fragment. In addition to filling blanks, we present the subjects with two other task types where authenticity of a presented word is to be assessed. The reason for this is that while most of the previous studies were eventually aimed at efficient text compression, we are interested in literary (chiefly, poetic) texts as works of literature, and not as mere character strings subject to application of compression algorithms¹.
Our goal in designing the experiment was to provide researchers in the field of poetics with hard data to ground some hypotheses that otherwise are unavoidably speculative. Guessing the next word in sequence is not the best way to treat literary text, because even an ordinary sentence like this one is not essentially a linear sequence of words or characters, but a complex structure with word associations running all over the place, both forward and backward. A poem, even more so, is a structure with strongly coordinated parts, which is not read sequentially, much less written sequentially. Also, practice shows that even when guessing letter by letter, people almost always base their next character choice on a tentative word guess. This is why guessing whole words in context was more appropriate for our purpose.
However, the results we present here, as already mentioned, are not relevant to poetics proper, so we will not dwell on this further, and refer the interested reader to [12].
¹ It should be noted though that efficient compression is important not only per se, but also for cryptographic applications as pointed out in [11]. In addition, language models developed for the purpose of compression are successfully used in applications like speech
In their Introduction to the special issue on computational linguistics using large corpora, Church and Mercer [13] note that "The 1990s have witnessed a resurgence of interest in 1950s-style empirical and statistical methods of language analysis." They attribute this empirical renaissance primarily to the availability of processing power and of massive quantities of data. Of course, these factors favor statistical analysis of texts as character strings. However, wide availability of computer networks and interactive Web technologies also made it possible to set up large-scale experiments with human subjects.
The experiment has the form of an online literary game in Russian². However, the players are also fully aware of the research side, have free access to theoretical background and current experimental results, and can participate in online discussions. The players are presented with text fragments in which one of the words is replaced with blanks or with a different word. Any sequence of 5 or more Cyrillic letters surrounded by non-letters was considered a word. Words are selected from fragments randomly. There are three different trial types:
type 1: a word is omitted, and is to be guessed.
type 2: a word is highlighted, and the task is to determine whether it is original or replaced.
type 3: two words are displayed, and the subject has to determine which one is the original word.
Incorrect guesses from trials of type 1 are used as replacements in trials of types 2 and 3.
Texts are randomly drawn from a corpus of 3439 fragments of mostly poetic works in a wide range of styles and periods: from avant-garde to mass culture and from 18th century to contemporary. Three prosaic texts are also included (two classic novels, and a contemporary political essay).
As of this writing, the experiment has been running almost continuously for three years. Over 8000 people took part in it and collectively made almost 900,000 guesses, about a third of which is of type 1. The traditional laboratory experiment could have never achieved this scale. Of course, the technique has its own drawbacks, which are discussed in
2
especially if it can't be achieved in any other way.
3 Results
The specific goal of the experiment is to discover and analyze systematic differences between different categories of texts from the viewpoint of how easy it is to a) reconstruct an omitted word, and b) distinguish the original word from a replacement. However here we'll consider a particular property of the texts that turns out to be independent of the text type and so probably characterizes the language itself rather than specific texts. This property is the dependency of word unpredictability on its length.
We define unpredictability $U$ as the negative binary logarithm of the probability to guess a word, $U = -\log_2 p_1$, where $p_1$ is the average rate of correct answers to trials of type 1. For a single word, this is formally equivalent to Shannon's definition of entropy, $H$. However, when multiple words are taken into account, entropy should be calculated as the average logarithm of probability, and not as the logarithm of average probability,
$$H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p_1^{i} \tag{1}$$
$$U = -\log_2 \frac{1}{N} \sum_{i=1}^{N} p_1^{i} \tag{2}$$
Indeed, the logarithm of probability to guess a word equals the amount of information in bits required to determine the word choice. Thus, it is this quantity that is subject to averaging. When dealing with experimental data, it is customary to use frequencies as estimates of unobservable probabilities. However, there are always words that were never guessed correctly and have $p_1 = 0$, for which the logarithm is undefined (this is why Shannon's technique involves repeated guessing of the same letter until the correct answer is obtained). Formally, if there is one element in the sequence with zero (very small, in fact) probability of being guessed, then the amount of information of the whole sequence may be determined solely by this one element.
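The difference between Eq. (1) and Eq. (2) can be illustrated with a minimal Python sketch. The function names and the three per-word guess rates are hypothetical and chosen only for illustration; they are not data from the experiment.

```python
import math

def unpredictability(rates):
    """U as in Eq. (2): minus log2 of the average guess rate."""
    mean_rate = sum(rates) / len(rates)
    return -math.log2(mean_rate)

def entropy(rates):
    """H as in Eq. (1): average of minus log2 of each word's guess rate.

    Returns infinity when some word was never guessed (rate 0),
    which is the problem described in the text.
    """
    if any(p == 0 for p in rates):
        return math.inf
    return -sum(math.log2(p) for p in rates) / len(rates)

# Hypothetical per-word guess rates (illustration only).
rates = [0.5, 0.25, 0.125]
print("U =", unpredictability(rates))
print("H =", entropy(rates))            # (1 + 2 + 3) / 3 = 2.0 bits
print("H with a never-guessed word =", entropy(rates + [0.0]))
```

Because the logarithm is concave, this sketch always gives U no greater than H, with equality when all words have the same guess rate.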
On the other hand, unpredictability as defined above is not
tries required to guess a randomly selected word, unpredictability characterizes the portion of words that would be guessed on the first try. They are equal, of course, if all words have the same entropy.
One way around the problem presented by never-guessed words would be to assign some arbitrary finite entropy to them. We compared unpredictability with entropy calculated under this approximation with two values of the constant: 10 bits (corresponding roughly to wild guessing using a frequency dictionary) and 3 bits (the low bound). In both cases, while $H$ is not equal numerically to $U$, they turned out to be in an almost monotonic, approximately linear correspondence. This probably means that the fraction of hard-to-guess words co-varies with unpredictability of the rest of the words. Because of this, we prefer to work in terms of unpredictability, rather than introducing arbitrary hypotheses to calculate an entropy value of dubious validity.
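The substitution described above can be sketched as follows. This is an assumed reading of the procedure, not the author's code: words with a zero guess rate contribute a fixed number of bits, and all other words contribute minus log2 of their rate. The function name and the sample rates are hypothetical.

```python
import math

def entropy_with_fallback(rates, fallback_bits):
    """Average per-word information, with never-guessed words
    (rate 0) assigned a fixed, arbitrary number of bits."""
    total = 0.0
    for p in rates:
        total += fallback_bits if p == 0 else -math.log2(p)
    return total / len(rates)

# Hypothetical guess rates; the last word was never guessed.
rates = [0.5, 0.25, 0.0]
for bits in (10, 3):   # the two constants mentioned in the text
    print(bits, "bits ->", entropy_with_fallback(rates, bits))
```

The result depends directly on the chosen constant, which is the arbitrariness the text cites as a reason to work with unpredictability instead.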
Unpredictability as a function of word length calculated over all words of the same length across all texts is plotted in Fig. 1 and Fig. 2 (where word length is measured in characters and syllables respectively). Confidence intervals on the graphs are calculated based on the standard deviation of the binomial distribution (since the data comes from a series of independent trials with two possible outcomes in each: a guess may be correct or incorrect).
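One way such intervals could be obtained is sketched below. The text does not spell out the exact procedure, so this is an assumption: the guess rate is estimated as correct/trials, its binomial standard deviation is sqrt(p(1-p)/n), and the band p ± z·sigma is mapped through -log2. The function name, the parameter z, and the counts are hypothetical.

```python
import math

def unpredictability_interval(correct, trials, z=1.0):
    """Return (U, U_low, U_high) for one group of type 1 trials.

    Assumes independent trials with two outcomes, so the estimated
    guess rate has binomial standard deviation sqrt(p * (1 - p) / n).
    """
    if not 0 < correct <= trials:
        raise ValueError("need 0 < correct <= trials")
    p = correct / trials
    sigma = math.sqrt(p * (1.0 - p) / trials)
    p_high = min(p + z * sigma, 1.0)
    p_low = p - z * sigma
    u_low = -math.log2(p_high)       # higher guess rate, lower U
    u_high = -math.log2(p_low) if p_low > 0 else math.inf
    return -math.log2(p), u_low, u_high

# Hypothetical counts: 250 correct guesses out of 1000 trials.
u, u_low, u_high = unpredictability_interval(250, 1000)
print(u, u_low, u_high)
```

The interval is asymmetric around U because the logarithm is nonlinear, and it widens quickly for word lengths with few trials.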
In the range from 5 to 14 characters and from 1 to 5 syllables, an excellent linear dependence is observed. Longer words are rare, so the data for them is significantly less statistically reliable. We'll only discuss the linear dependence in the range where it is definitely valid.
4 Discussion
It is very difficult, for the reasons mentioned above, to compare our results with previous studies. However, there are two points of comparison that can be made. First, we can roughly estimate the effect of word guessing in context as opposed to guessing the next word in sequence. Recall that Shannon [1] estimated zeroth-order word entropy for English based on Zipf's law to be 11.82 bits per word (bpw). Brown et al. [10] used a word trigram model to achieve an entropy estimate of 1.72 bpc, which translates to 7.74 bpw for average word length of 4.5 characters in English. This means that trigram word probabilities
[Figure 1 plot: unpredictability (vertical axis, 0 to 6) versus word length in characters (horizontal axis, 0 to 20).]
Figure 1: Unpredictability as a function of word length in characters, all texts
[Figure 2 plot: unpredictability (vertical axis, 0 to 3.5) versus word length in syllables (horizontal axis, 0 to 7).]
the middle and the first word of a trigram. Only the first trigram is available when the model is predicting the next word, but all three trigrams could be used to fill in an omitted word (this is a hypothetical experiment which was not actually performed). Of course, they are not statistically independent, and as a rough estimate we can assume that the last trigram contributes somewhat less information than the first one, while the middle trigram contributes very little (since all of its words are already accounted for). In other words, we could expect this model to have about 4 bpw more information when guessing words in context, which is very significant.
The second point of comparison is provided by [14] (Fig. 13 there), where entropy is plotted for the $n$-th letter of each word, versus its position $n$. Entropy was estimated using a Ziv-Lempel type algorithm. It is well-known that guessing is least confident at the word boundaries for both human subjects and computer algorithms, and this chart quantifies the observation: the first letter has the entropy of 4 bpc, which drops quickly to about 0.6-0.7 bpc for the 5th letter and then stays surprisingly constant all the way through the 16th character. This chart is practically the same for the original text and a text with randomly permuted words, which gives telling evidence of the current language models' strengths and weaknesses. For the purposes of this discussion, the data allow us to reconstruct the dependency of word entropy on the word length as $h^{(w)}_n = \sum_{i=1}^{n} h^{(l)}_i$, where $h^{(w)}_n$ is the entropy of words of length $n$, and $h^{(l)}_i$ is the entropy of the $i$-th letter in a word. This dependency, valid for the language model in [14], has a steep increase from 1 through 5 characters, and then an approximately linear growth with a much shallower slope of 0.6-0.7 bpc. This is very different from our Fig. 1, and even though our data is on unpredictability, rather than entropy, the difference is probably significant.
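The reconstruction of word entropy as a running sum of per-letter entropies can be sketched as follows. Only the first-letter value (4 bpc) and the plateau (taken here as 0.65 bpc from the 5th letter on) come from the text; the values for letters 2 to 4 are made up to give a plausible decay and are not read off the chart in [14].

```python
from itertools import accumulate

def word_entropy_by_length(letter_entropy):
    """Running sum: entry n-1 is the entropy of a word of length n,
    i.e. the sum of the entropies of its first n letters."""
    return list(accumulate(letter_entropy))

# Positions 1..16. Letters 2-4 are hypothetical placeholder values.
letter_entropy = [4.0, 2.5, 1.5, 1.0] + [0.65] * 12
word_entropy = word_entropy_by_length(letter_entropy)

for n, h in enumerate(word_entropy, start=1):
    print(n, round(h, 2))
```

With any profile of this shape, the curve rises steeply over the first few letters and then grows linearly with a slope equal to the plateau value, which is the behaviour the text contrasts with Fig. 1.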
In fact, our result may at first glance seem trivial. Indeed, according to a theorem due to Shannon (Theorem 3 in [2]), for a character sequence emitted by a stationary ergodic source, almost all subsequences of length $n$ have the same probability exponential in $n$: $P_n = 2^{-Hn}$ for large enough length ($H$ is the entropy of the source). However, this explanation is not valid here for several reasons. Even if we set aside the question of natural language ergodicity, from the formal point of view, the theorem requires that $n$ is large enough so that all possible letter
than that. Practically, if this explanation were to be adopted, we'd expect the probability to guess a word to be on the order $P_n$, which is much smaller than the observed probability. In fact, the only reason our subjects are able to guess words in context is that the words are connected to the context and make sense in it, while under the assumptions of Shannon's theorem, the equiprobable subsequences are asymptotically independent of the context.
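To get a feel for the magnitudes involved, the following sketch evaluates $P_n = 2^{-Hn}$ over the word lengths discussed in this paper. The value H = 1.0 bpc is an assumption picked from inside the 0.6 to 1.3 bpc range quoted for English in the Introduction; the function name is hypothetical.

```python
import math

def typical_sequence_probability(entropy_bpc, length):
    """P_n = 2^(-H n): probability of a typical length-n sequence
    from a source with entropy H bits per character."""
    return 2.0 ** (-entropy_bpc * length)

H = 1.0  # assumed entropy rate in bits per character
for n in range(5, 15):
    p = typical_sequence_probability(H, n)
    # -log2(p) is the number of bits the naive argument would predict
    print(n, p, -math.log2(p))
```

Under this assumption the predicted information grows by a full H bits for every added character.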
Another tentative argument is to presume that the total number of words in the language (either in the vocabulary or in texts, which is not the same thing) of a given length increases with length, which makes longer words harder to guess due to sheer expansion of possibilities. If there had been exponential expansion of vocabulary with word length, we could argue that contextual restrictions on word choice cut the number of choices by a constant factor (on the average), so the number of words satisfying these restrictions still grows exponentially with word length. However, the data does not support this idea. Distribution of words by length, whether computed from the actual texts or from a dictionary (we used a Russian frequency dictionary containing 32000 words [15]), is not even monotonic, let alone exponentially growing. The number of different words grows up to about 8 characters of length, then decreases. This behavior is in no way reflected in Figs 1, 2, so we can conclude that the total number of dictionary words of a given length is not a factor in guessing success.
In fact, the word length distribution could have had a direct effect on unpredictability only if the word length were known to the subject. But this is generally not the case. Subjects in our experiment are not given any external clue as to the length of the omitted word. Since Russian verse is for the most part metric, the syllabic length of a line is typically known, and this allows one to predict the syllabic length of the omitted word with a great deal of certainty. However, unpredictability depends on word length in exactly the same way for poetry and prose (see Fig. 3), and in prose there are no external or internal clues for the word length.³
³ It is also worth noting that average unpredictability of words in poetry and prose is surprisingly close. In poetry, it turns out, predictability due to meter and rhyme is counteracted by increased unpredictability of semantics and, possibly, grammar. Notably, these two tendencies almost balance each other. This phenomenon and its significance is
[Figure 3 plot: unpredictability (vertical axis, 0 to 6) versus word length in characters (horizontal axis, 0 to 20).]
Figure 3: Unpredictability as a function of word length in characters, prose only
This leaves us with the only reasonable explanation for the observed dependency: in course of its evolution, the language tends to even out information rate, so that longer words carry proportionally more information. This would be a natural assumption, since an uneven information rate is inefficient: some portions will underutilize the bandwidth of the channel, and some will overutilize it and diminish error-correction capabilities. In other words, as language changes over time, some words and grammatical forms that are too long will be shortened, and those that are too short will be expanded and reinforced.
It is interesting to note that this hypothesis was also proposed in passing by Church and Mercer in a different context in [13]. Discussing applications of trigram word-prediction models to speech recognition, they write (page 12):
In general, high-frequency function words like to and the, which are acoustically short, are more predictable than content words like resolve and important, which are longer. This is convenient for speech recognition because it means that the language model provides more powerful constraints
ural result of the evolution of speech to fill the human needs for reliable communication in the presence of noise.
A feature that is convenient for speech recognition is, indeed, not to be unexpected in natural language, and from our results it appears that its extent is much broader than could be suggested by Church and Mercer's observation. Of course, this is only one of many mechanisms that drive language change, and it only acts statistically, so any given language state will have low-redundancy and high-redundancy pockets. Thus, any Russian speaker knows how difficult it is to distinguish between mne nado 'I need' and ne nado 'please don't'. Moreover, it is likely that this change typically proceeds by vacillations. As an example consider the evolution of negation in English according to [16] (p. 175-176):
the original Old English word of negation was ne, as in ic ne wāt, 'I don't know'. This ordinary mode of negation could be reinforced by the hyperbolic use of either wiht 'something, anything' or nāwiht 'nothing, not anything' [...]. As time progressed, the hyperbolic force of (nā)wiht began to fade [...] and the form nāwiht came to be interpreted as part of a two-part, discontinuous marker of negation ne ... nāwiht [...]. But once ordinary negation was expressed by two words, ne and nāwiht, the stage was set for ellipsis to come in and to eliminate the seeming redundancy. The result was that ne, the word that originally had been the marker of negation, was deleted, and not, the reflex of originally hyperbolic nāwiht became the only marker of negation. [...] (Modern English has introduced further changes through the introduction of the helping word do.)
This looks very much like oscillations resulting from an iterative search for the optimum length of a particular grammatical form. It's all the more amazing then, how this tendency, despite its statistical and non-stationary character, beautifully manifests itself in the data.
Addendum. After this paper was published in J. Information Proc., the author became aware of the following works in which effects of the same nature were discovered on different levels:
The discourse level. Genzel and Charniak [17] study the entropy of a sentence depending on its position in the text. They show that
They conclude that the hypothetical entropy value with account for semantics would be constant, because the content of the preceding text would help predicting the following text.
The sentence level. The authors of [18] consider the English sentences with optional relativizer that. They demonstrate experimentally that the speakers tend to utter the optional relativizer more frequently in those sentences where information density is higher, thus diluting them. This can be interpreted as a tendency to homogenize information density.
The syllable level. Aylett and Turk [19] demonstrate that in spontaneous speech, those syllables that are less predictable are increased in duration and prosodic prominence. In this way, speakers tend to smooth out the redundancy level.
Acknowledgements. I am grateful to D. Flitman for hosting the website, to Yu. Manin, M. Verbitsky, Yu. Fridman, R. Leibov, G. Mints and many others for fruitful discussions. This work could not have happened without the generous support of over 8000 people who generated experimental data by playing the game.
References
[1] Shannon C.E. Prediction and entropy of printed English. Bell System Technical Journal, 1951, vol. 30, pp. 50-64.
[2] Shannon C.E. A mathematical theory of communication. Bell System Technical Journal, 1948, vol. 27, pp. 379-423.
[3] Burton N.G., Licklider J.C.R. Long-range constraints in the statistical structure of printed English. American Journal of Psychology, 1955, vol. 68, no. 4, pp. 650-653.
[4] Fonagy I. Informationsgehalt von Wort und Laut in der Dichtung. In: Poetics. Poetyka. Поэтика. Warszawa: Państwowe Wydawnictwo Naukowe, 1961, pp. 591-605.
[5] Kolmogorov A. Three approaches to the quantitative definition of information. Problems Inform. Transmission, 1965, vol. 1, pp. 1-7.
[6] Yaglom A.M. and Yaglom I.M. Probability and information
entropy of English. Information Theory, IEEE Transactions on, 1978, vol. 24, no. 4, pp. 413-421.
[8] Moradi H., Roberts J.A., Grzymala-Busse J.W. Entropy of English text: Experiments with humans and a machine learning system based on rough sets. Inf. Sci., 1998, vol. 104, no. 1-2, pp. 31-47.
[9] Paisley W.J. The effects of authorship, topic structure, and time of composition on letter redundancy in English text. J. Verbal. Behav., 1966, vol. 5, pp. 28-34.
[10] Brown P.F., Della Pietra V.J., Mercer R.L., Della Pietra S.A., Lai J.C. An estimate of an upper bound for the entropy of English. Comput. Linguist., 1992, vol. 18, no. 1, pp. 31-40.
[11] Teahan W.J., Cleary J.G. The entropy of English using PPM-based models. In: DCC '96: Proceedings of the Conference on Data Compression, Washington: IEEE Computer Society, 1996, pp. 53-62.
[12] Leibov R.G., Manin D.Yu. An attempt at experimental poetics [tentative title]. To be published in: Proc. Tartu Univ. [in Russian], Tartu: Tartu University Press
[13] Church K.W., Mercer R.L. Introduction to the special issue on computational linguistics using large corpora. Comput. Linguist., 1993, vol. 19, no. 1, pp. 1-24.
[14] Schürmann T., Grassberger P. Entropy estimation of symbol sequences. Chaos, 1996, vol. 6, no. 3, pp. 414-427.
[15] Sharoff S. The frequency dictionary for Russian. http://www.artint.ru/projects/frqlist/frqlist-en.asp
[16] Hock H.H., Joseph B.D. Language History, Language Change, and Language Relationship. Berlin-New York: Mouton de Gruyter, 1996.
[17] Genzel & Charniak, 2002. Entropy rate constancy in text. Proc. 40th Annual Meeting of ACL, 199-206.
[18] Anonymous authors (paper under review), 2006. Speakers optimize information density through syntactic reduction. To be
Hypothesis: A Functional Explanation for Relationships between Redundancy, Prosodic Prominence, and Duration in Spontaneous