
Section Abstract

Modern information theory provides a theoretical framework for data, and its applicability on computer hardware demonstrates its adequacy for this explicative role. The notion of information entropy illuminates how patterns are always discussed against the background of an information space. The notion of Kolmogorov complexity illuminates that an important property of a pattern is that it should be constructible for epistemic reasons. General pattern theory is a sufficiently general algebraic approach to actually define the class of all patterns.

In the former sections of this chapter I discussed how patterns should roughly be classified, with regard to their relevance for phenomena, into concrete and general patterns (4.1), why there is historically a philosophical debate about the notion of patterns (4.2), and why mathematics is all we need to sufficiently explicate the notion for pragmatic, epistemological and metaphysical needs (4.3). In this final section I present an actual mathematical solution to the problem of the general explication of the notion of a pattern.

The route to the mathematical solution can be outlined as follows. First, we use the notion of complexity to define a pattern in the broadest and conceptually simplest way. The problem with this view is that we gain a general notion of patterns without any further knowledge about a specific pattern. This notion is not constructive in a somewhat stricter sense of constructivism than is common in the philosophy of mathematics. To put it in more concrete terms: according to this first explication of the notion of a pattern, a set of data may show a pattern to a certain statistical degree, but there is no way to find this out, because the pattern, according to this first notion, does not come with a description to test it in data or to construct it mathematically. That is why, secondly, I introduce general pattern theory as an approach to constructively define patterns without any unnecessary restrictions.

I prefer the constructive approach over the adaption of the notion of complexity due to its epistemological merits. Even if we do not restrict the agents of our epistemology to human agents, every agent (e.g. artificial intelligences; aliens) has to reason by constructive inferences. (It is another question whether a human agent is able to reenact these inferences or not.)

A Simple but Unconstructive Approach: Complexity

It seems natural to appeal to established mathematical theory for the aim of further specifying what a pattern in scientific data is. Since data can occur in various mathematical forms (images; sound waves; measurements; etc.), the main obstacle to a mathematical specification of patterns in data is the necessary generality of such a specification.

A common philosophical idea for a most general structural definition is to make use of the theory of a logic. But for the sake of our endeavour, I do not want to specify relations between non-mathematical objects, such as propositions or natural kinds; my aim is to specify what property of a set of data makes it show a pattern and what a pattern mathematically is. Fortunately, a fleshed-out mathematical theory is available that seems suitable for my endeavour: so-called information theory. Ladyman and Ross (2013) refer to this solution as "mere patterns":

Mere patterns—stable but nonredundant relationships in data—are distinguished from 'real' patterns [in Dennett's sense] by appeal to mathematical information theory. A pattern is redundant, and not an ultimately sound object of scientific generalization or naturalized ontology, if it is generated by a pattern of greater computational power (lower logical depth). (p. 108)

To me this description of patterns as "stable but nonredundant relationships in data" and the reference to "greater computational power" is not sufficiently precise. In my view, a thorough introduction into information theory and its notion of complexity is necessary.

In the following, I briefly introduce the field and explain what part of it is of specific use to us. I also mention other cornerstones of the discussion of data processing in the field to highlight why the approach that I chose is the most useful one for our purpose. My general idea here is to identify a set of data, as well as a pattern in data, with information as explored by this field; further specification follows.

Information theory is a discipline of mathematics and engineering that is motivated by applications in electrical engineering and computer science. Claude Shannon, in his now classical papers (1948a; 1948b), articulated a theoretical framework for the transmission of discrete data over channels without and with noise. A noisy channel implies that a set of data was sent and must be interpreted or reconstructed by a receiver, which receives a distorted version of the data set. The transmitted data is usually referred to as information. Figure 4.7 provides a schematic illustration. As we will see, due to its close relation to pattern recognition in the most general sense, we are mostly interested in the aspects of information theory concerning compressing data for transmission.

Figure 4.7: Schematic diagram of a general communication system (Shannon 1948a)

As Kolmogorov (1965) stresses, several very different approaches were introduced to mathematically quantify the amount of information in an information-theoretical framework. I want to explain the preconditions of the discussion by a very simple example. Assume we have a set of data $d$, which is a binary string[1] of length 3, i.e.

$$d \in \{(b_1, b_2, b_3) : b_1, b_2, b_3 \in \{0,1\}\} =: \{0,1\}^3,$$

and this set of data, for example $d = 010$, has to be transmitted via a noisy channel. For the sake of applicability to our problem I want to emphasise that, in

[1] In information theory information is often referred to as 'strings' or 'words' (instead of e.g. 'texts'; 'numbers') to highlight that the treatment of the data is purely syntactical, without taking any non-syntactical meaning (e.g. natural kinds; propositions; objects that are stipulated by a physical theory) of it into consideration.

principle, strings can also be texts, images, descriptions of waves or the like. An important insight for the designer of a technical communication system concerning the space of possible data $D := \{0,1\}^3$ is what the probability distribution of the $2^3 = 8$ possible different receivable outcomes

$$000, 001, 010, 011, 100, 101, 110, 111$$

is; if the actually sent strings can only be one or two entries from this list, the transmitter design has to be different from a scenario in which every outcome is equally probable. That is why the entropy of the space of possible data $D$ is a widely discussed measure in the context of quantifying the amount of information of a string. Let

$$P : 2^D \to [0,1]$$

be a probability measure (with $2^D$ being the power set of $D$) that induces a probability

distribution over $D$. The entropy for $D$ with regard to $P$ is defined as

$$H(D,P) := -\sum_{d \in D} P(\{d\}) \log_2 P(\{d\})$$

with the convention[1] $0 \log_2 0 := 0$. Figure 4.8 provides an intuitive illustration. A base of 2 for the logarithm is natural in a computer-theoretic setup since the addition of one bit doubles the number of expressible strings.

Figure 4.8: Entropy for two possibilities with probabilities $p$ and $1-p$ (Shannon 1948a, p. 11)

The entropy $H(D,P)$ is a measure of how much information is given by a string, e.g. 010, with regard to the space of possibilities $D$ and the probabilities $P$. In fact, as Shannon shows (sect. 6, theorem 2), entropy is the unique measure for this under some rather weak and natural constraints. That is why entropy plays such a prominent role in thermodynamics (in particular due to its use in formulations of the second law).

[1] Due to $\log_2(0) = -\infty$ we need to introduce this convention; null sets naturally occur in many setups: strings that will certainly never be sent. $0 \cdot (-\infty)$ is undefined in the common calculus.
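To make the definition concrete, the entropy of a finite information space can be computed in a few lines. The following Python sketch applies the formula together with the convention $0 \log_2 0 := 0$; the example distributions over $D = \{0,1\}^3$ are illustrative choices, not taken from the text:

```python
import math

def entropy(probabilities):
    """H(D, P) = -sum over d in D of P({d}) * log2(P({d})),
    with the convention 0 * log2(0) := 0 (terms with p == 0 are skipped)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Uniform distribution over the 2**3 = 8 strings of {0,1}^3:
# maximum entropy, 3 bits (one bit per position).
print(entropy([1/8] * 8))               # 3.0

# Only two of the eight strings are ever sent, equally often:
# the space carries just one bit of information per string.
print(entropy([0.5, 0.5] + [0.0] * 6))  # 1.0
```

The second call exercises the convention: the six impossible strings contribute nothing to the sum, exactly as the null-set footnote requires.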

How are entropy and complexity of information related? And how do these help to explicate patterns in data? As figure 4.6 (p. 119) illustrates, and as Ladyman and Ross' adaption of Dennett's notion of 'real patterns' hints at, the possibility of a distinction between the formerly defined pattern and the distorting noise in the data is the crucial aspect of a pattern. An information space $(D,P)$ with maximum entropy implies that every string has the same probability (or propensity) to occur, which means that no pattern is expected to play a relevant part at all. This is not a trivial point. Imagine the images from figure 4.6; maximum entropy means that all four images are realisations of the noise with the same probability to occur. And this is not what we want when we talk about the pattern in this figure. Another example that Shannon and Kolmogorov refer to is the use of everyday language; the string 'house' is much more likely to occur than 'KKHGU', because our language shows patterns like words, grammar and sentence structure.

Complexity of information can be seen as a measure of how epistemically simple a string is regarding a defined information space $(D,P)$. Given the information space of the English language, 'KKHGU' is more complex than 'house'; just imagine how an English-speaking agent could remember these strings. In other scientific discussions, like Bennett's (1990), complexity is used to describe the fundamental differences between living organisms (high complexity) and other matter (low complexity) against the background of the information space of physics and chemistry. In a low-entropy information space (e.g. the English language) many strings with low complexity (e.g. 'house') occur. In a high-entropy information space (e.g. the last four digits of the phone numbers in your personal phone book) most strings have a very high complexity.

What is a pattern? For epistemic reasons, a pattern should be relatively uncomplex; this seems to be Dennett's reason to introduce "real patterns" (against the background of the information space of human cognition and sensory capabilities), with the problems that they depend on human agents and carry ontological implications. But the complexity depends on the information space: 'house' is uncomplex in English, but complex in Latin. Therefore, it seems to make sense to loosely explicate a pattern as a string with relatively low complexity regarding the information space. Note that the information space in a scientific context is given by the full body of scientific background assumptions and the language that is used for it, which undergoes changes over time.
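The dependence of complexity on the information space can be given a computational flavour. A minimal Python sketch, using deflate compression with a preset dictionary as a crude stand-in for the shared information space; the phrase and the 'English'/'Latin' vocabularies are invented for the example:

```python
import zlib

def description_length(s: bytes, vocabulary: bytes = b"") -> int:
    """Deflate-compressed length of s, given a preset dictionary that
    models the background 'information space' shared with the receiver."""
    if vocabulary:
        comp = zlib.compressobj(level=9, zdict=vocabulary)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(s) + comp.flush())

message = b"the house and the mouse in the house"
# A background vocabulary that already contains the phrase,
# versus an unrelated one:
english = b"vocabulary: the house and the mouse in the house etc."
latin   = b"vocabularium: domus et mus in domo etc."

# Against a matching information space, the same string is simpler:
print(description_length(message, english) < description_length(message, latin))  # True
```

The design point mirrors the text: the string itself is unchanged; only the background against which it is described varies, and with it the description length.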

One could start an approach to explicate patterns by, firstly, explicating complexity and then by, secondly, stipulating that a pattern is a string under a certain complexity threshold, or with relatively low complexity in comparison to most other strings with regard to the information space, or the like. However, to make the notion of complexity more accessible and to make its weaknesses for our application more apparent, I briefly introduce an influential and useful approach, the Kolmogorov complexity. The Kolmogorov complexity, which is also applied to quantify the degree of randomness, also helps to further illuminate the important role of the information space.

Although many definitions are available that are well-embedded in modern information theory, for instance by Vitányi and Li (2000), I use a definition and syntax close to Kolmogorov's (1965) original introduction, with some simplifications, due to its brevity and clarity for our purpose. The intuition is that the Kolmogorov complexity of a pattern is given by the length of its shortest description in a given language. For simplicity, we assume to have $y \in \{0,1\}^n$, which we call a pattern$_K$, with some realistically large $n \in \mathbb{N}$, e.g. the number of pixels in the images in figure 4.6 (p. 119). We define

$$\{0,1\}^\infty := \bigcup_{i=0}^{\infty} \{0,1\}^i \times (0,0,0,\dots)$$

to denote the set of all infinitely long binary series, and

$$[\cdot] : \bigcup_{i=0}^{\infty} \{0,1\}^i \to \{0,1\}^\infty, \quad [\cdot] : y \mapsto y \times (0,0,0,\dots)$$

the translation of a finite binary series into $\{0,1\}^\infty$. Furthermore,

$$l : \{0,1\}^\infty \to \mathbb{N} \cup \{\infty\}, \quad l : s \mapsto \max\{\, i \in \mathbb{N} : s_i \neq 0 \,\} \quad \text{for } s = s_1 s_2 s_3 \dots \text{ with } s_n \in \{0,1\} \text{ for all } n \in \mathbb{N}$$

defines the length of a string $s \in \{0,1\}^\infty$. Let $p \in \{0,1\}^\infty$ with $l(p) < \infty$ be a program,

and let $\varphi$ with

$$\varphi : \{0,1\}^\infty \to \{0,1\}^\infty \quad \text{and} \quad \varphi \text{ partial recursive}$$

be the programming method. Partial recursiveness means that it can be computed by Turing machines and, intuitively speaking for our purpose, foremost avoids that $\varphi$ is chosen in a way that it cannot be explicated and computed in a finite way.[1] $\varphi$ is fixed for an information space and can be thought of as, for instance,

[1] Partial recursive functions are most often defined on the domain of $\mathbb{N}^n$ for some $n \in \mathbb{N}$. But the notion can easily be redefined for $\{0,1\}^\infty$ by the use of the standard transformation of binary numbers to decimal numbers and vice versa. In other words, there are trivial bijective mappings between $\{0,1\}^\infty$ and $\mathbb{N}$.

the parser of a programming language or the interpreter of the English language that provides the physical reference to ‘house’ (in a Fregean sense of reference).

Finally, we can define the complexity$_K$ $K_\varphi$ of a pattern$_K$ $y$ and a given programming method $\varphi$ as

$$K_\varphi : \{0,1\}^\infty \to \mathbb{N} \cup \{\infty\}, \quad K_\varphi : [y] \mapsto \min_{\varphi(p) = [y]} l(p).$$

$K_\varphi([y]) = l([y])$ describes the case of a maximum complexity$_K$. Importantly, the more powerful the stipulated $\varphi$ is (e.g. a C++ parser with a lot of libraries; a scientifically highly specialised terminology), the lower the complexity of many patterns is.

I give an example. Assume $\hat{\varphi}$ maps the binary codified version of our standard mathematical set-theoretical language to the binary number that is expressed with this language. Let $\hat{y}$ be a series of one million 1's and after that only zeros. A program could be outlined as

$$\hat{p} = \{1\}^{10^6} \times (0,0,0,\dots)$$

and $K_{\hat{\varphi}}([\hat{y}])$ would be very small: $K_{\hat{\varphi}}([\hat{y}]) \ll l([\hat{y}])$.
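The spirit of this example can be reproduced with an off-the-shelf compressor. The Python sketch below uses the deflate-compressed length as a computable upper-bound stand-in for the (uncomputable) $K_{\hat{\varphi}}$; the choice of compressor and the noise source are assumptions of the sketch, not part of the text:

```python
import random
import zlib

def proxy_complexity(s: bytes) -> int:
    """Compressed length of s: a computable upper bound that stands in
    for the Kolmogorov complexity under a fixed programming method."""
    return len(zlib.compress(s, 9))

# A million 1's, like the pattern y-hat above: drastically compressible.
regular = b"1" * 10**6

# Pseudo-random noise of the same length: practically incompressible.
random.seed(0)
noise = bytes(random.getrandbits(8) for _ in range(10**6))

print(proxy_complexity(regular))  # a few thousand bytes at most
print(proxy_complexity(noise))    # close to the full 10**6 bytes
```

The gap between the two outputs is the compressibility that, on this account, separates a pattern from noise.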

This is the general idea of Kolmogorov complexity. The approach is not restricted to series of binary numbers and can be adapted to every string, and therefore every set of data and patterns, with according definitions of the pattern$_K$, the length, the programming method and the program. It should be mentioned that Kolmogorov originally defines the complexity$_K$ of a pattern based on some data $d \in \{0,1\}^m$ with some sufficiently large $m \in \mathbb{N}$. His programming method is then $\varphi(p, x) = y$, but I ignored this further aspect for simplicity.
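The minimisation $\min_{\varphi(p)=[y]} l(p)$ can also be made fully explicit with a toy programming method. In the Python sketch below, the mini-language, its alphabet and the function names are invented for illustration; the exhaustive search over all programs already hints at the feasibility problem:

```python
from itertools import product

ALPHABET = "01c23456789"

def phi(p: str) -> str:
    """Toy programming method: 'cXn' decodes to the character X repeated
    n times; every other program decodes to itself (a literal)."""
    if len(p) == 3 and p[0] == "c" and p[2].isdigit():
        return p[1] * int(p[2])
    return p

def K_phi(y: str) -> int:
    """Length of the shortest program p with phi(p) == y, found by
    exhaustive search over all programs in order of increasing length."""
    for length in range(1, len(y) + 1):
        for chars in product(ALPHABET, repeat=length):
            if phi("".join(chars)) == y:
                return length
    return len(y)  # defensive fallback: y itself is always a program for y

print(K_phi("11111111"))  # 3, via the program 'c18'
print(K_phi("01101"))     # 5, no description shorter than the literal
```

Even in this toy setting the search space grows exponentially with program length, which is the practical obstacle discussed in the following paragraphs.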

Some further theoretical insights are of interest here. Regarding the programming method $\varphi$, the invariance theorem roughly states that, for a given pattern or class of patterns, a complexity-optimal programming method is only as good as any other descriptively sufficiently powerful programming method plus some constant that is necessary to describe the optimal programming method with the other programming method.[1] Chaitin (1992) showed that, roughly said and also very intuitively, the choice of the programming method $\varphi$ and a maximum program length $l(p)$ (which is necessary to actually run the routines on computers) always determines a maximum threshold of Kolmogorov complexity $L \in \mathbb{N}$ that can be determined. He denoted this result an 'incompleteness theorem', due to the unprovability of a statement like: $K_\varphi(s) < L + 2$, if we know that $K_\varphi(s) > L$ for some string $s \in \{0,1\}^\infty$.

Kolmogorov complexity seems very promising after our discussion so far. What are the problems? For the definition we use $\min_{\varphi(p)=[y]}$, which refers to the set of all programs $p$. Even if the set of all $p$ is countably infinite, it is practically very hard to feasibly determine the optimal $p$ under $\varphi$. If a certain pattern from a realistic example is given and we have a lot of computational power at hand, it may still take millions of years until even our best computers have made a decision about the optimal program to compress the pattern $x$ in question. But this is not the way we (including non-human agents) epistemically talk about patterns. Usually, when we refer to a statistical or visual pattern, we are able to provide a (maybe vague) description of it at first hand. We know that 'house' is a pattern to us English speakers, since we already have a list of vocabulary at hand. It is not the case that we see 'house' and then think about every possible combination of five letters, find possible references for all of these mostly made-up words and finally find out that houses are objects that can be referenced very easily. The pattern that is shown in the top left image from figure 4.6 (p. 119) is a pattern for us since we can construct the depicted geometric object very easily from everyday geometry, by referring to rectangles and lines, and not by going through every possible arrangement of black and white pixels and then finding out that it might be a pragmatically good idea to talk about lines and rectangles specifically. This route of explicating complexity is therefore not a good approach to provide a descriptive epistemological account of what patterns in science are; this inadequacy holds for all relevant agents (e.g. humans; AIs; aliens). Again, Dennett is right regarding his neo-Kantian implications, but he is wrong in his restriction to some kind of epistemically fixed human agent and in the ontological implications.

These are the reasons why, in the following, I want to focus on general pattern theory, which provides an answer to the problem of pattern construction while keeping the merit of the complexity approach, namely the distinction of compressibly describable patterns from noise.

The Constructive Approach: General Pattern Theory

In accordance with (Bogen and) Woodward's intentional use of the term, we discussed "patterns" ("in science") in the broadest possible meaning. Obviously, it is a very extensive endeavour to actually show that one can mathematically explicate all cases of patterns. Ulf Grenander's[1] œuvre revolves to a significant extent around exactly this goal. I want to point out that every judgement regarding how well he achieved his goal can only be unfair without a sufficiently comprehensive investigation of his work, and this is not the aim of this thesis.

[1] Mukhopadhyay (2006) provides a helpful overview over Grenander's œuvre and academic career.

General pattern theory is constructive in the most general sense, meaning that every pattern comes with a finite and recursive construction rule; but these construction rules use one of the most general epistemic and mathematically fleshed-out sets of vocabulary, which is mathematical algebra. (I justify this epistemological view in chapter 3 about mathematics and my ante rem structuralist position.) The epistemological aims of general pattern theory are also stated by Grenander in an interview with Mukhopadhyay (2006):

[T]he emphasis in pattern theory is on the actual act of knowledge and act