3.2 Elements of an “aesthetic” theory of information
3.2.2 Utterance as a Markov process
In “A Mathematical Theory of Communication,” Shannon memorably illustrated the value of even rudimentary statistical analysis of language by reporting on a set of informal experiments that he carried out on written language. Shannon demonstrated how counting the relative frequencies of characters as they appear in a representative sample of English can be used to generate new text whose patterns reflect those found in reality. To do so, Shannon modeled the text generation process as a discrete Markov process. This model ties the generation of characters (or words) to the behavior of an abstract “machine” which exists in exactly one of a fixed, finite set of states ($S_1, S_2, \ldots, S_n$) at a time. The behavior of the simplest version of this machine, a so-called first-order Markov model, is completely specified in terms of the probability of transitioning from one state to another ($p_{ij}$ being the probability of transitioning from state $S_i$ to $S_j$), with all other things being equal. These transition probabilities are usually represented and stored as two-dimensional tables, whose rows and columns are labeled by the names of the states, allowing either a human or machine user of the table, given the current state of the machine, to determine the relative probabilities of transitioning to each of the other possible states (including the probability of staying in the same state).
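The table lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three states and their probabilities are invented for the example and come from neither Shannon nor Pinkerton, and the names `TRANSITIONS`, `next_state`, and `generate` are hypothetical.

```python
import random

# Hypothetical three-state transition table (illustrative values only).
# Each row is labeled by the current state; each entry in the row gives the
# probability p_ij of moving from that state to the state named by the key.
TRANSITIONS = {
    "S1": {"S1": 0.1, "S2": 0.6, "S3": 0.3},
    "S2": {"S1": 0.5, "S2": 0.2, "S3": 0.3},
    "S3": {"S1": 0.4, "S2": 0.4, "S3": 0.2},
}


def next_state(current, table, rng):
    """Pick the next state by consulting the row for the current state."""
    row = table[current]
    states = list(row)
    weights = [row[state] for state in states]
    return rng.choices(states, weights=weights, k=1)[0]


def generate(start, length, table, seed=0):
    """Produce a sequence of `length` states, beginning from `start`."""
    rng = random.Random(seed)  # seeded so that repeated runs agree
    sequence = [start]
    while len(sequence) < length:
        sequence.append(next_state(sequence[-1], table, rng))
    return sequence


if __name__ == "__main__":
    print(generate("S1", 10, TRANSITIONS))
```

Because each choice depends only on the current row of the table, the sketch is first-order in the sense used above: nothing earlier than the current state influences the next one.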
As abstract mathematical constructions, Markov processes need not necessarily be understood as information sources. Indeed, Markov processes were defined by their namesake, Andrey Markov, in advance of Shannon’s paper, as Shannon recognizes.79 Shannon makes explicit the function of the Markov process in the communications theory context:
To make [a] Markoff process into an information source we need only assume that a letter is produced for each transition from one state to another. The states will correspond to the “residue of influence” from preceding letters.80
The Markov process is deployed here not as an abstract statistical curiosity, interesting in its own right, but as a model of something. This is a subtle but important point because it suggests that identifying this particular mathematical technique is a necessary but not sufficient condition for tracing the influence of information theory discourse on, for example, music composition. Pointing out that a particular composer used or uses Markov chains to generate musical material does not tell us why (in the sense of either cause or teleology) they chose to do so. Nevertheless, Markov chains became part of the rhetoric of information theory primers, and served as concrete demonstrations of how information-theoretic tools provided a way to analyze the structure of a given domain of communication. Pinkerton’s discussion of the BANAL TUNE MAKER is a case in point. When melody-writing is idealized as a Markov process, the history of notes up to a fixed cutoff point is used to determine the probability with which the next note is selected.81 In “Information Theory and Melody,” Pinkerton published a matrix (shown in Figure 3.5) which contains the transition probabilities between the notes of the diatonic scale that he used to define the structure of his coin-flipping graph.82 Pinkerton explains:
79. Shannon and Weaver, The Mathematical Theory of Communication, 45. That said, Markov himself did apply his mathematical discoveries to the analysis of consonant patterns in text. In 1913, Markov presented an analysis of consonant and vowel distributions in Pushkin’s Eugene Onegin. At this time, of course, the discourse of information is many decades away, and though Markov’s analysis is of great relevance to the statistical study of written language more generally, it has little to add to the story of information. See David Link, “Traces of the Mouth: Andrei Andreyevich Markov’s Mathematization of Writing,” History of Science 44, no. 3 (September 2006): 321–48, https://doi.org/10.1177/007327530604400302.
80. Shannon and Weaver, The Mathematical Theory of Communication, 45. Shannon (along with many older authors) transliterates as “Markoff.” Here, “Markov” is preferred throughout except in direct quotations of sources that use the older form.
81. Christopher Ariza discusses Caplin and Prinz’s use of a related technique and connections with Hiller’s use of Markov chains. Ariza, “Two Pioneering Projects from the Early History of Computer-Aided Algorithmic Composition,” 46–47.
Figure 3.5: Transition probabilities for Pinkerton’s first-order Markov model of melodies based on his analysis of 39 nursery tunes. (In Richard Pinkerton, “Information Theory and Melody,” Scientific American 194 (1956), 80.)
TRANSITION PROBABILITIES show how frequently any note follows any other in the 39 nursery tunes. The first notes of all possible pairs are listed in the column at the left; the second notes, in the row at the top. Thus each number in the table gives the probability that the note at the top of its column will come after the note at the left of its row. The color pattern divides the table between likely transitions (colored) and unlikely (white).83
As Shannon had shown for generated English text, the comprehensibility of the output can be improved by increasing the length of the history (here, of the melody) used to compute the transition probabilities between successive notes. Whereas a first-order Markov chain only uses the last note as the basis for proposing a new note, a second-order model uses the last two notes. Changing this feature of the model, the “order” of the Markov process, allows the model to capture more of the structure latent in the source
material. Pinkerton neglects to mention this fact. Increasing the order of the model requires more computational resources both to compute the relevant probabilities and to store them. Pinkerton’s article makes it clear that he computed the transition probabilities for his first-order Markov process by hand, working with a set of eight states (the seven notes of the diatonic scale, plus a symbol to indicate an eighth-note rest). Estimating the transition matrix for a model of this order requires $8^2 = 64$ separate calculations (counting and normalizing) over the complete data set. To move from a first-order model to a second-order model would involve eight times as many computations, for a total of $8^3 = 512$ separate calculations. The exponential growth of the scale of the task means that, at least for hand-computation, this quickly becomes unmanageable, even for such a toy case as Pinkerton’s.84 Fortunately, as the order of the Markov chain is increased, the return on doing so diminishes.85 As the
order of the model is increased, more observations are required to accurately estimate the transition probabilities between the various states. A related issue occurs when the model begins to “memorize” and reproduce large segments (say, sentences) of recognizable material as it appeared verbatim in the input corpus. This behavior can be interpreted as a failure of the particular model to generalize; when it behaves in this way, the results of the model can be indistinguishable from the results of more naive “cut-and-paste” computer-assisted composition algorithms.86
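Both effects discussed in this paragraph, the growth of the table and the drift toward verbatim reproduction, can be illustrated with a short Python sketch. The function names are hypothetical, the sample sentence is made up for the example, and the “forced fraction” (the share of observed contexts that were only ever followed by a single symbol) is merely one crude, assumed proxy for memorization, not a measure taken from the sources discussed here.

```python
from collections import defaultdict


def table_size(alphabet_size, order):
    """Number of probabilities in an order-k table: n ** (k + 1)."""
    return alphabet_size ** (order + 1)


def successors_by_context(sequence, order):
    """Map each observed context of `order` symbols to the set of symbols
    that followed it somewhere in the sequence."""
    successors = defaultdict(set)
    for i in range(len(sequence) - order):
        context = tuple(sequence[i:i + order])
        successors[context].add(sequence[i + order])
    return successors


def forced_fraction(sequence, order):
    """Share of observed contexts with exactly one possible continuation.
    A value near 1 means generation can do little but replay the corpus."""
    successors = successors_by_context(sequence, order)
    forced = sum(1 for options in successors.values() if len(options) == 1)
    return forced / len(successors)


if __name__ == "__main__":
    print(table_size(8, 1))  # 64 entries for eight states, first order
    print(table_size(8, 2))  # 512 entries for eight states, second order

    corpus = "the cat sat on the mat"  # made-up sample, treated as characters
    for order in (1, 2, 3):
        print(order, forced_fraction(corpus, order))
```

On a corpus this small, raising the order quickly leaves most contexts with only one observed continuation, which is the situation in which a generated sequence begins to reproduce its source.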
Despite these limitations, Markov models remained an attractive formalism because they were simple to compute, did not require vast amounts of storage or memory (at least for low-order models), and behaved in ways that were relatively straightforward to interpret. Estimating the transition
84. More generally, if the alphabet is of size $n$, a first-order Markov model requires $n^2$ memory positions to store the associated probabilities, a second-order Markov model requires $n^3$, and so on. Wilhelm Fucks, who also worked with Markov models of pitch, alludes to this problem in one of his few published references to the practicalities of using a computer (Rechneranlage) to do information-theoretic research into music. Wilhelm Fucks, Mathematische Analyse der Formalstruktur von Musik ([Wiesbaden]: Springer Fachmedien Wiesbaden GmbH, 1958), 52.
85. The threshold at which this takes place is dependent on the data set and the size of the symbol set (“repertoire”).
86. This behavior, loosely called overfitting, must still be avoided in the design of mathematical models, including those used in the latest computer-aided algorithmic composition software.
probabilities for a first-order Markov model involves counting and normalizing symbol frequencies over a representative corpus of data. These estimates can be formed using the basic arithmetic operations available on almost all computer systems; no complicated or processor-intensive analytic functions are required. A final advantage is that the Markov chain is a “generative” model, meaning that the model explicitly specifies a process according to which the data it models is generated. This lends its parameters a degree of natural interpretability, since they are directly related to a particular probability distribution and can normally be expressed analytically as a straightforward mathematical formula. These advantages of the Markov model show that accuracy and reproductive fidelity are not the only criteria for choosing a model: material exigencies, such as the model’s consumption of computational resources (either human or machine), and questions of interpretability also come to bear on such a choice.
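The “counting and normalizing” procedure described above can be sketched as follows. This is a minimal illustration, not a reconstruction of Pinkerton’s hand calculation: the function name `estimate_transitions` is hypothetical, and the short note sequence is invented for the example rather than drawn from his 39 nursery tunes.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequence):
    """Estimate first-order transition probabilities from one sequence."""
    # Counting: tally how often each symbol follows each other symbol.
    counts = defaultdict(Counter)
    for previous, following in zip(sequence, sequence[1:]):
        counts[previous][following] += 1

    # Normalizing: divide each row by its total so that it sums to one.
    table = {}
    for previous, row in counts.items():
        total = sum(row.values())
        table[previous] = {note: n / total for note, n in row.items()}
    return table


if __name__ == "__main__":
    # Made-up melody fragment, used only to exercise the function.
    melody = ["C", "D", "E", "C", "D", "G", "C", "D", "E"]
    for note, row in estimate_transitions(melody).items():
        print(note, row)
```

Only addition and division are involved, which is the sense in which the estimate needs nothing beyond basic arithmetic.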
3.2.3 Redundancy: or, “the part of a message that can be eliminated without loss of essential