Experiments on predictability of word in context and information rate in natural language

(1)

ontext

and information rate in natural language

∗

D. Yu. Manin

Deember 26, 2006

Abstrat

Basedondatafromalarge-saleexperimentwithhumansubjets,

we onlude thatthe logarithm of probabilityto guess aword in

on-text (unpreditability) depends linearlyon theword length. This

re-sult holds both for poetry and prose, even though with prose, the

subjets don'tknowthe length of theomitted word. We hypothesize

thatthiseetreetsatendenyofnaturallanguageto have aneven

information rate.

1 Introdution

Inthispaperwereportapartiularresultofanexperimentalstudyon

preditability ofwordsinontext. The experiment's primary

motiva-tion is thestudy of some aspets of poetry pereption, but theresult

reported hereis, inthe author'sview,of ageneral linguisti interest.

Therst studyofnaturaltext preditabilitywasperformedbythe

founder of information theory, C. E. Shannon [1 ℄. (We'll note that

even in his groundbreaking work [2℄, Shannon briey touhed on the

relationship between literaryqualities andredundany byontrasting

highlyredundantBasiEnglishwithJoye'sFinnegan'sWake whih

∗

ThisisasomewhatexpandedversionofthepaperManin,D.Yu. 2006. Experimentson

preditability ofwordinontextandinformationratein natural language. J.Information

(2)

semanti ontent.) Shannon presented his subjet withrandom

pas-sagesfromJeerson'sbiographyandhadherguessthenextletteruntil

the orretguess wasreorded. The number of guessesfor eahletter

wasthenused toalulate upperandlower boundsfor theentropyof

English, whih turned out to be between 0.6 and 1.3bits per

hara-ter (bp), muh lowerthan thatof arandom mixof thesame letters.

Shannon's resultsalso indiatedthatonditional entropydereases as

more and more text history beomes known to the subjet, up to at

least100 letters.

Several authors repeated Shannon's experiments with some

mod-iations. Burton and Liklider [3 ℄ used 10 dierent texts of similar

style, and fragment lengths of1, 2,4, ..., 128,1000 haraters. Their

onlusion was that, ontrary to Shannon, inreasing history doesn't

aet measured entropy whenhistory length exeeds32 haraters.

Fonagy[4 ℄omparedpreditabilityofthenextletterforthreetypes

of text: poetry, newspaper, and a onversation of two young girls.

Apparently, his tehnique involved only one guess per letter, so

en-tropy estimates ould not be alulated (see below), and results are

presented interms of the rate of orret answers, poetry being muh

lesspreditable than both othertypes.

Kolmogorovreportedtheresultsof0.91.4bpforRussiantextsin

hiswork[5℄thatlaidthegroundofalgorithmiomplexitytheory. The

paperontainsnodetailsontheveryingeniousexperimentaltehniue,

butitisdesribedinthewell-knownmonographbyYaglom &Yaglom

[6℄.

Cover and King [7℄ modied Shannon's tehnique by having their

subjets plae bets on the next letter. They showed that the

opti-mal betting poliywould be to distributeavailable apitalamong the

possibleoutomesaordingtotheirprobabilityandsoifthesubjets

play in an optimal way (whih is not self-evident though), the letter

probabilities ould be inferredfrom their bets. Their estimateof the

entropyof Englishwasalulated at 1.3bp. This workalso ontains

an extensive bibliography.

Moradietal[8℄rstusedtwodierenttexts(atextbookondigital

signal proessing and a novel by Judith Krantz) to onrm Burton

and Liklider's results on the ritial history length (32 haraters),

then added two more texts (101 Dalmatians and a federal aviation

manual) tostudy the dependeneofentropyon texttype andsubjet

(3)

language by means of statistial analysis, without using human

sub-jets. One of the rst attempts is reported in [9℄, where 39 English

translations of 9 lassial Greek textswere used to study entropy

de-pendeny on subjetmatter, style, andperiod. A very rudeentropy

estimate by letter digram frequeny wasused. For some of the more

reentdevelopments,see[10 ℄,[11℄ andreferenes therein. Bythevery

nature ofthesemethods theyan't utilize meaning(and evensyntax)

ofthetext,butbythebruteforeofontemporaryomputersthey

be-ginahievingresultsthatomereasonablylosetothosedemonstrated

byhuman languagespeakers.

Ourexperimental setup diers from theprevious work in two

im-portant aspets. First, we have subjets guess whole words, and not

individual haraters. Seond, the words to be guessed ome

(gener-allyspeaking)from themiddle ofaontext, ratherthan at theend of

a fragment. In additiontollingblanks, wepresent thesubjets with

two other tasktypes where authentiity of a presented word is to be

assessed. Thereasonforthisisthatwhilemostofthepreviousstudies

wereeventuallyaimedateienttextompression,weareinterestedin

literary (hiey, poeti) texts asworks of literature, and not as mere

harater strings subjet to appliation of ompression algorithms 1

.

Ourgoalindesigningtheexperimentwastoprovideresearhersinthe

eldof poetiswithhard data to ground somehypothesesthat

other-wiseareunavoidably speulative. Guessingthenextword insequene

is not the best way to treat literary text, beause even an ordinary

sentene like this one is not essentially a linear sequene of words or

haraters, but a omplex struture with word assoiations running

all over the plae, both forward and bakward. A poem, even more

so, is a struture with strongly oordinated parts, whih is not read

sequentially,muh lesswritten sequentially. Also, pratieshowsthat

even when guessing letter by letter, people almost always base their

next harater hoie ona tentative word guess. Thisiswhyguessing

wholewords inontext wasmore appropriatefor our purpose.

However,theresultswepresenthere,asalreadymentioned,arenot

relevant to poetis proper, so we will not dwell on this further, and

refer theinterestedreader to [12 ℄.

1

Itshouldbenotedthoughthateientompressionisimportantnotonlyperse,but

also for ryptographiappliations aspointed outin [11℄. In addition, languagemodels

developedforthepurposeofompressionaresuessfullyusedin appliationslikespeeh

(4)

In their Introdution to the speial issue on omputational

linguis-tis usinglargeorpora, ChurhandMerer [13℄notethatThe1990s

have witnessed a resurgene of interest in 1950s-style empirial and

statistial methods of language analysis. They attribute this

empir-ial renaissane primarily to the availability of proessing power and

of massive quantities of data. Of ourse, these fators favor

statisti-al analysis of texts as harater strings. However, wide availability

of omputer networks and interative Web tehnologies also made it

possibleto set uplarge-sale experiments withhuman subjets.

TheexperimenthastheformofanonlineliterarygameinRussian 2

.

However, the players are also fully aware of the researh side, have

freeaesstotheoretialbakgroundandurrentexperimentalresults,

and an partiipate in online disussions. The players are presented

withtext fragmentsinwhihone ofthe words isreplaedwithblanks

or with a dierent word. Any sequene of 5 or more Cyrilli letters

surroundedbynon-letterswasonsideredaword. Wordsareseleted

from fragments randomly. There are threedierent trial types:

type 1: a word is omitted, andis to be guessed.

type 2: a word is highlighted, and thetask isto determine whether

itis originalor replaed.

type 3: two words are displayed, and the subjet has to determine

whihone is the originalword.

Inorretguessesfrom trials oftype 1 areusedasreplaements in

trials oftypes2 and 3.

Texts are randomly drawn from a orpus of 3439 fragments of

mostlypoetiworksinawiderangeofstylesandperiods: fromA

vant-garde tomassultureand from18th entury to ontemporary. Three

prosaitextsarealsoinluded(twolassinovels,andaontemporary

politialessay).

Asofthiswriting,theexperimenthasbeenrunningalmost

ontin-uouslyforthreeyears. Over8000peopletookpartinitandolletively

madealmost900,000guesses, aboutathirdofwhihisoftype1. The

traditionallaboratoryexperimentouldhaveneverahievedthissale.

Ofourse,thetehniquehasitsowndrawbaks,whiharedisussedin

2

(5)

espeially ifit an'tbeahieved inanyotherway.

3 Results

Thespei goaloftheexperimentisto disoverandanalyze

system-atidierenesbetweendierentategoriesoftextsfromtheviewpoint

ofhoweasyitistoa)reonstrut anomittedword,andb)distinguish

the original word from a replaement. However here we'll onsider a

partiular property of the texts that turns out to be independent of

the text type and soprobablyharaterizes thelanguage itself rather

than spei texts. This property is the dependeny of word

unpre-ditabilityonits length.

We dene unpreditability

U

as the negative binary logarithm of theprobabilitytoguessaword,

U

=

−

log

2 p

1

,where

p

1

istheaverage

rate of orret answers to trials of type 1. For a single word, this is

formally equivalent to Shannon's denition of entropy,

H

. However,

when multiple words aretaken into aount,entropyshould be

alu-latedasthe averagelogarithmofprobability,andnotasthelogarithm

of average probability,

H

=

−

1 N

N

X

i

₌₁

log

₂

p

i

₁

(1)

U

=

−

log

2

1 N

N

X

i

₌₁

p

i

₁

(2)

Indeed, the logarithm of probability to guess a word equals the

amount of information inbits required to determine theword hoie.

Thus, it is this quantity that is subjet to averaging. When dealing

withexperimentaldata,itisustomarytousefrequeniesasestimates

of unobservable probabilities. However, there are always words that

were never guessed orretly and have

p

1 = 0

for whih logarithm is

undened (thisiswhyShannon'stehinqueinvolvesrepeatedguessing

of the same letter until the orret answer is obtained). Formally, if

there is one element in the sequene with zero (very small, in fat)

probability of being guessed, then the amount of information of the

wholesequene maybe determined solely by thisone element.

On the other hand, unpreditability as dened above is not

(6)

triesrequiredtoguessarandomlyseletedword,unpreditability

har-aterizes theportion of words that would be guessed on the rst try.

Theyareequal, ofourse, ifallwords havethesame entropy.

One way around the problem presented by never-guessed words

would be to assign some arbitrary nite entropy to them. We

om-paredunpreditability withentropyalulated underthis

approxima-tionwithtwovaluesoftheonstant: 10bits(orrespondingroughlyto

wildguessingusingafrequenyditionary)and3bits(thelowbound).

Inbothases,while

H

isnotequalnumeriallyto

U

,theyturnedoutto

beinanalmostmonotoni,approximatelylinearorrespondene. This

probablymeansthatthefrationofhard-to-guesswordso-varieswith

unpreditability of the rest of the words. Beause of this, we prefer

toworkintermsofunpreditability,ratherthanintroduingarbitrary

hypothesesto alulate anentropy value ofdubious validity.

Unpreditability as a funtion of word length alulated over all

words of the same length aross all texts is plotted in Fig. 1 and

Fig. 2 (where word length is measured in haraters and syllables

respetively). Condene intervalson the graphsarealulatedbased

on thestandard deviation ofthebinomialdistribution (sinethedata

omes froma series of independent trials withtwo possible outomes

ineah: a guessmay be orretor inorret).

In the range from 5 to 14 haraters and from 1 to 5 syllables,

an exellent linear dependene is observed. Longer words arerare, so

the datafor them is signiantly lessstatistially reliable. We'll only

disuss the lineardependeneinthe rangewhere itis denitely valid.

4 Disussion

It is very diult, for the reasons mentioned above, to ompare our

results withprevious studies. However, there aretwo points of

om-parison thatanbemade. First,wean roughlyestimatetheeetof

word guessingin ontext asopposedto guessing the nextword in

se-quene. ReallthatShannon [1℄ estimatedzeroth-order word entropy

forEnglishbasedonZipf'slawtobe11.82bitsperword(bpw). Brown

et al [10 ℄ used a word trigram model to ahieve an entropy estimate

of 1.72 bp, whih translates to 7.74 bpw for average word length of

4.5haraters inEnglish. Thismeansthat trigramword probabilities

(7)

word length in haraters

unpreditabilit

y

20 18 16 14 12 10 8

6 4 2 0 6

5

4

3

2

1

0

Figure 1: Unpreditability as a funtion of word length in haraters, all

texts

word length in yllables

unpreditabilit

y

7 6

5 4

3 2

1 0

3.5

3

2.5

2

1.5

1

0.5

0

(8)

the middle and the rst word of a trigram. Only therst trigram is

availablewhenthemodelispreditingthenextword,butallthree

tri-grams ould be used to llin anomitted word (this isa hypothetial

experiment whih was not atually performed). Of ourse, they are

not statistially independent, andasa rough estimatewean assume

that the last trigram ontributes somewhat lessinformation than the

rst one, whilethe middle trigram ontributes very little (sineall of

its words arealreadyaountedfor). Inother words, we ouldexpet

thismodeltohaveabout4bpwmoreinformationwhenguessingwords

inontext, whih isverysigniant.

Theseondpointofomparisonisprovided by[14℄(Fig. 13there),

where entropy is plotted for the

n

-th letter of eah word, versus its position

n

. Entropy was estimated using a ZivLempel type algo-rithm. It is well-known that guessing is least ondent at the word

boundaries for both human subjets and omputer algorithms, and

this hart quanties the observation: the rst letter has the entropy

of 4 bp, whih drops quikly to about 0.60.7 bp for the 5th

let-ter andthenstayssurprizingly onstant all thewaythroughthe 16th

harater. Thishartis pratially thesame for theoriginal text and

a text with randomly permutedwords, whih givesa telling evidene

of the urrent language models' strengths and weaknesses. For the

purposes of this disussion, thedata allows to reonstrut the

depen-deny of word entropy on theword length as

h

(

w

₎

n

=

P

n

i

₌₁

h

(

l

₎

i

, where

h

(

n

w

)

is the entropy of words of length

n

, and

h

(

l

₎

i

is the entropy of

the

i

-th letter in a word. This dependeny, valid for the language model in [14 ℄, has a steep inrease from 1 through 5 haraters, and

then an approximately linear growth with a muh shallower slope of

0.60.7 bp. This is very dierent from our Fig. 1, and even though

our data is on unpreditability, rather than entropy, the dierene is

probablysigniant.

Infat,our resultmayatrst glaneseem trivial. Indeed,

aord-ingtoatheoremduetoShannon(Theorem3in[2 ℄),foraharater

se-queneemittedbyastationaryergodisoure,almostallsubsequenes

oflength

n

havethesameprobabilityexponentialin

n

:

P

n

= 2

−

H n

for

largeenoughlength(

H

istheentropyofthesoure). However,this

ex-planationisnotvalidherefor severalreasons. Evenifwesetaside the

questionofnaturallanguageergodiity,fromtheformalpointofview,

the theoremrequiresthat

n

is large enough sothat all possible letter

(9)

than that. Pratially, if this explanation were to be adopted, we'd

expetthe probability to guess a word to be on the order

P

n

, whih

is muh smaller than the observed probability. In fat, the only

rea-son our subjets are able to guess words in ontext is that thewords

are onneted to the ontext and make sense in it, while under the

assumptionsofShannon'stheorem,the equiprobablesubsequenesare

asymptotially independent ofthe ontext.

Another tentative argument is to presume that the total number

of words inthe language (either in the voabulary or intexts, whih

is not the same thing) of a given length inreases with length, whih

makeslongerwordshardertoguessduetosheerexpansionof

possibil-ities. Iftherehadbeenexponentialexpansionofvoabularywithword

length,we ouldarguethatontextual restritionsonword hoie ut

the number of hoies by a onstant fator (on the average), so the

numberof wordssatisfying these restritionsstillgrows exponentially

withword length. However,thedatadoesnot supportthis idea.

Dis-tribution ofwordsbylength,whetheromputed fromtheatualtexts

or fromaditionary(weusedaRussianfrequenyditionary

ontain-ing 32000 words [15 ℄), is not even monotoni, let alone exponentially

growing. The number of dierent words grows up to about 8

har-aters of length, thendereases. This behavior isin no wayreeted

in Figs 1, 2, sowe an onlude that the total number of ditionary

words ofa givenlength isnot a fator inguessingsuess.

Infat,theword lengthdistribution ouldhave had adireteet

onunpreditabilityonlyifthe wordlengthwereknowntothesubjet.

Butthis isgenerallynot thease. Subjets inour experiment arenot

given any external lue as to the length of the omitted word. Sine

Russianverse isfor themost partmetri, thesyllabilength ofa line

istypiallyknown,andthisallowstopredit thesyllabilengthofthe

omittedwordwithagreat dealofertainty. Howeverunpreditability

dependsonword length inexatlythe same wayfor poetry andprose

(see Fig. 3), and in prose there are no external or internal lues for

the word length. 3

This leaves us with the only reasonable explanation for the

ob-3

It is also worth noting that average unpreditability of words in poetry and prose

is surprisinglylose. In poetry, it turns out, preditability due to meter and rhyme is

ounteratedbyinreasedunpreditabilityofsemantisand,possibly,grammar. Notably,

these twotendenies almostbalaneeahother. Thisphenomenonand itssignianeis

(10)

word length in haraters

unpreditabilit

y

20 18 16 14 12 10 8

6 4 2 0 6

5

4

3

2

1

0

Figure 3: Unpreditabilityas a funtion of word length in haraters, prose

only

served dependeny: in ourse of its evolution, the language tends to

even out information rate, so that longer words arry proportionally

more information. This would be a natural assumption, sine an

un-eveninformation rateisineient: someportionswillunderutilizethe

bandwidth of the hannel, and some will overutilize it and diminish

error-orretionapabilities. Inotherwords,aslanguagehangesover

time, some words and grammatial forms that are too long will be

shortened, and those that are too short will be expanded and

rein-fored.

It is interesting to note that this hypothesis was also proposed in

passingbyChurhandMererinadierentontextin[13 ℄. Disussing

appliations oftrigram word-predition modelsto speeh reognition,

they write(page 12):

In general, high-frequeny funtion words like to and

the,whihareaoustiallyshort,aremorepreditablethan

ontent wordslikeresolve and important,whih arelonger.

Thisis onvenient for speeh reognition beause itmeans

thatthelanguagemodelprovidesmorepowerfulonstraints

(11)

uralresultoftheevolutionofspeehtollthehumanneeds

for reliable ommuniation inthe preseneof noise.

Afeaturethatisonvenient forspeehreognition is,indeed, not

to be unexpetedinnatural language,andfromour resultsitappears

thatitsextentismuhbroaderthanouldbesuggestedbyChurhand

Merer's observation. Ofourse,this isonlyone ofmanymehanisms

thatdrive languagehange,and itonlyats statistially,soanygiven

languagestatewillhavelow-redundanyandhigh-redundanypokets.

Thus, any Russian speaker knows how diult it is to distinguish

between mne nado 'I need' and ne nado 'please don't'. Moreover,

it is likely that this hange typially proeeds by vaillations. As an

exampleonsidertheevolutionofnegationinEnglishaordingto[16 ℄

(p. 175176):

the original Old English word of negation was ne, as

in i ne w at, 'I don't know'. This ordinary mode of

nega-tionouldbereinforedbythehyperboliuseofeitherwiht

'something,anything'orn awiht'nothing,notanything'[...℄.

Astime progressed,thehyperbolifore of(n a)wiht began

to fade [...℄ andtheform n awihtame to beinterpreted as

part of a two-part, disontinuous marker of negation ne

... n awiht [...℄. But one ordinary negation was expressed

by two words, neand n awiht, the stage wasset for ellipsis

to ome inand to eliminate theseeming redundany. The

result was that ne, the word that originally had been the

markerofnegation,wasdeleted,andnot,thereexof

orig-inally hyperboli n awiht beame the only marker of

nega-tion. [...℄ (ModernEnglish hasintrodued further hanges

throughthe introdution of thehelpingword do.)

This looks very muh like osillations resulting from an iterative

searh for the optimum length of a partiular grammatial form. It's

all the more amazing then, how this tendeny, despite its statistial

and non-stationary harater, beautifullymanifestsitself inthedata.

Addendum. After this paper was published in J. Information

Pro.,the authorbeameawareofthefollowing worksinwhiheets

of the same naturewasdisovered ondierent levels:

The disoure level. Genzel and Charniak [17℄ study the entropy

of a sentene depending on its position in the text. They show that

(12)

They onlude that the hypothetial entropy value with aount for

semantiswouldbeonstant,beausetheontentofthepreedingtext

would helpprediting thefollowing text.

The sentene level. The authors of [18℄ onsider the English

sen-teneswithoptionalrelativizerthat. Theydemonstrateexperimentally

thatthespeakerstendtouttertheoptionalrelativizermorefrequently

inthose sentenes where information densityis higher thus diluting

them. This an be interpreted asa tendeny to homogenize

informa-tion density.

The syllabe level Aulett and Turk [19℄ demonstrate that in

spon-taneous speeh,thosesyllables thatarelesspreditable,areinreased

in duration and prosodi prominene. In this way, speakers tend to

smooth out theredundany level.

Aknowledgements. I am grateful to D. Flitman for hosting

the website, to Yu. Manin, M. Verbitsky, Yu. Fridman, R. Leibov,

G. Mints and many others for fruitful disussions. This work ould

not have happened without thegenerous support ofover8000 people

whogenerated experimental databy playingthe game.

Referenes

[1℄ Shannon C.E. Predition and entropy of printed English. Bell

System Tehnial Journal, 1951,vol. 30,pp.5064.

[2℄ ShannonC.E.Amathematialtheoryofommuniation.Bell

Sys-tem Tehnial Journal, 1948,vol. 27,pp.379423.

[3℄ Burton N.G., Liklider J.C.R.Long-range onstraintsin the

sta-tistial struture of printed English. Amerian Journal of

Psy-hology,1955, vol. 68,no.4,pp.650653

[4℄ Fonagy I. Informationsgehalt von wort und laut in der

dih-tung. In: Poetis. Poetyka. Ïîýòèêà. Warszawa: Panstwo

Wydawnitwo Naukowe, 1961,pp.591605.

[5℄ KolmogorovA.Threeapproahestothequantitativedenitionof

information. Problems Inform.Transmission,1965, vol.1,pp.1

7.

[6℄ Yaglom A.M. and Yaglom I.M. Probability and information

(13)

entropy of English. Information Theory, IEEE Transations on,

1978, vol. 24,no.4,pp.413421.

[8℄ Moradi H., Roberts J.A., Grzymala-Busse J.W. Entropy of

En-glish text: Experiments with humans and a mahine learning

system based on rough sets. Inf. Si., 1998, vol. 104, no. 12,

pp.3147.

[9℄ Paisley W.J.The eetsof authorship, topi struture,and time

of omposition on letter redundany in English text. J. Verbal.

Behav., 1966, vol.5,pp. 2834.

[10℄ Brown P.F., Della Pietra V.J., Merer R.L., Della Pietra S.A.,

LaiJ.C.AnestimateofanupperboundfortheentropyofEnglish.

Comput. Linguist.,1992, vol.18, no.1,pp.3140.

[11℄ Teahan W.J., Cleary J.G. The entropy of English using

PPM-based models. In: DCC '96: Proeedings of the Conferene on

Data Compression, Washington: IEEE Computer Soiety, 1996,

pp.5362.

[12℄ Leibov R.G., Manin D.Yu. An attempt at experimental poetis

[tentative title℄. To be published in: Pro. Tartu Univ. [in

Rus-sian℄, Tartu: TartuUniversityPress

[13℄ Churh K.W., Merer R.L. Introdution to the speial issue on

omputational linguistisusinglarge orpora.Comput.Linguist.,

1993, vol. 19,no.1,pp.124.

[14℄ T.Shurmann and P.Grassberger. Entropy estimation of symbol

sequenes. Chaos,1996,vol. 6,no.3,pp.414427.

[15℄ Sharo S., The frequeny ditionary for Russian.

http://www.artint.ru/projets/frqlist/frqlist-en.asp

[16℄ Hok H.H., Joseph B.D. Language History, Language Change,

and Language Relationship. BerlinNew York: Mouton de

Gruyter, 1996.

[17℄ Genzel & Charniak, 2002. Entropy rate onstany in text. Pro.

40thAnnualMeeting of ACL, 199206.

[18℄ Anonymous authors (paper under review), 2006. Speakers

opti-mize informationdensitythroughsyntatiredution.Tobe

(14)

Hypothesis: A Funtional Explanation for Relationships between

Redundany, ProsodiProminene, andDurationin Spontaneous

Experiments on predictability of word in context and information rate in natural language

∗

∗

U

U

=

−

log

2

p

1

p

1

H

H

=

−

1

N

N

X

i

=1

log

2

p

i

1

U

=

−

log

2

1

N

N

X

i

=1

p

i

1

p

1

= 0

H

U

n

n

h

(

w

)

n

=

P

n

i

=1

h

(

l

)

i

h

(

n

w

)

n

h

(

l

)

i

i

n

n

P

n

₌₁

₂

₁

₌₁

₁

₎

₌₁

₎

₎