2. Bayesian Artificial Intelligence
2.2 Basic concepts in probability
2.2.8 Central limit theorem
The term central limit theorem traces back to a paper published by George Pólya in the 1920s titled “Central Limit theorem in probability theory” and has been used since then [37, p. 1]. However, the theorem itself was the result of the successive work of three of the most brilliant mathematicians of the eighteenth and nineteenth centuries: Abraham de Moivre, Pierre-Simon Laplace, and Carl Friedrich Gauss [38, p. 29].
Nowadays, the central limit theorem refers to an umbrella of statements that describe the convergence of the probability distributions of functions of one or many random variables [37, p. 1]. The importance of the central limit theorem in probability theory comes from its diverse applications and its ability to explain why some distributions, such as the normal distribution, are so widely used [38, p. 29].
The first and simplest of the limit theorems is the Markov inequality, which bounds the probability that a random variable takes a value far above its mean. It applies to any nonnegative random variable, even one whose distribution is unknown [39, p. 187]. If X is a random variable that takes only nonnegative values, then for any x > 0 [18, p. 430]:
$$P(X \ge x) \le \frac{E[X]}{x} \qquad (50)$$
However, the Markov inequality is not always useful: if x < E[X], then all it tells us is that P(X ≥ x) is less than a number larger than 1, which is an obvious statement, as all probabilities are less than or equal to 1 [39, p. 187].
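A small numerical sketch can make this limitation concrete. It compares the Markov bound E[X]/x with the exact tail probability of an exponential random variable; the exponential distribution, its mean of 2, and the function names are illustrative assumptions, not taken from the cited sources.

```python
import math


def markov_bound(mean: float, x: float) -> float:
    """Markov upper bound E[X]/x on P(X >= x) for a nonnegative X."""
    if x <= 0:
        raise ValueError("x must be positive")
    return mean / x


def exponential_tail(mean: float, x: float) -> float:
    """Exact P(X >= x) for an exponential random variable with the given mean."""
    return math.exp(-x / mean)


if __name__ == "__main__":
    mean = 2.0  # assumed mean, chosen only for illustration
    for x in (0.5, 1.0, 2.0, 4.0, 8.0):
        bound = markov_bound(mean, x)
        exact = exponential_tail(mean, x)
        # A bound of 1 or more says nothing a probability does not already say.
        note = "uninformative" if bound >= 1 else "informative"
        print(f"x={x:4.1f}  exact={exact:.4f}  bound={bound:.4f}  {note}")
```

For thresholds below the mean the bound exceeds 1 and carries no information, while for larger thresholds it is valid but can be far from the exact tail.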
Chebyshev’s inequality, which follows from the Markov inequality, works for random variables that take both positive and negative values. It is one of the most famous inequalities in probability theory and Chebyshev’s best work [40, p. 75].
Equation 51 gives a mathematical formulation of Chebyshev’s inequality for a random variable X with mean μ and variance σ², for any value of k > 0 [18, p. 431]:
$$P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2} \qquad (51)$$
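The link between the two inequalities can be sketched in one line, assuming the variance σ² is finite and writing E[·] for the expectation. Because (X − μ)² is nonnegative, the Markov inequality can be applied to it with threshold k²:

```latex
\begin{align*}
P\left(|X-\mu| \ge k\right) &= P\left((X-\mu)^2 \ge k^2\right) \\
&\le \frac{E\left[(X-\mu)^2\right]}{k^2} = \frac{\sigma^2}{k^2}, \qquad k > 0 .
\end{align*}
```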
Markov’s and Chebyshev’s inequalities are also important because the distribution of a variable is often unknown while its mean, or both its mean and variance, are known; in that case the inequalities can still be used to set bounds on probabilities around the mean [18, p. 431].
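As a hedged illustration of bounding a probability from the mean and variance alone, the sketch below compares the Chebyshev bound σ²/k² with the exact probability P(|X − μ| ≥ k) for a fair six-sided die. The die and all function names are assumptions made for this example; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Assumed example distribution: a fair six-sided die.
FAIR_DIE = {face: Fraction(1, 6) for face in range(1, 7)}


def mean_and_variance(pmf):
    """Return (mean, variance) of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance


def exact_tail(pmf, k):
    """Exact P(|X - mean| >= k), which requires the full distribution."""
    mean, _ = mean_and_variance(pmf)
    return sum(p for x, p in pmf.items() if abs(x - mean) >= k)


def chebyshev_bound(pmf, k):
    """Chebyshev bound variance / k**2, which needs only the mean and variance."""
    _, variance = mean_and_variance(pmf)
    return variance / Fraction(k) ** 2


if __name__ == "__main__":
    for k in (Fraction(3, 2), Fraction(2), Fraction(5, 2)):
        print(f"k={k}: exact={exact_tail(FAIR_DIE, k)}, bound={chebyshev_bound(FAIR_DIE, k)}")
```

The bound is never violated, but it is loose here; that is the price of using no information beyond the first two moments.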
The most important consequence drawn from Chebyshev’s inequality is the weak law of large numbers [18, p. 433]:
The weak law of large numbers (52)
If X₁, X₂, … are independent random variables, each with the same probability distribution and a finite mean μ, then for any ε > 0:
$$P\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - \mu\right| \ge \varepsilon\right) \to 0 \quad \text{as } n \to \infty$$
The weak law of large numbers shows that, as the number of trials grows, a probability estimated through experiment becomes increasingly unlikely to deviate from the theoretical one proposed by the frequency interpretation of probability [38, p. 19].
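The statement of the weak law can be explored with a short simulation. The sketch below estimates the probability that the mean of n fair-die rolls deviates from 3.5 by at least ε, and prints it next to the Chebyshev bound σ²/(nε²) obtained by applying Chebyshev’s inequality to the sample mean. The die, the value ε = 0.25, the trial counts, and the function names are all illustrative assumptions, and the simulated figures are estimates rather than exact probabilities.

```python
import random

DIE_MEAN = 3.5              # mean of a fair six-sided die
DIE_VARIANCE = 35.0 / 12.0  # variance of a fair six-sided die


def deviation_probability(n, epsilon, trials=500, seed=0):
    """Estimate P(|mean of n die rolls - 3.5| >= epsilon) by repeated simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(sample_mean - DIE_MEAN) >= epsilon:
            hits += 1
    return hits / trials


def chebyshev_bound_on_mean(n, epsilon, variance=DIE_VARIANCE):
    """Chebyshev bound variance / (n * epsilon**2) on the same probability, capped at 1."""
    return min(1.0, variance / (n * epsilon ** 2))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        estimate = deviation_probability(n, epsilon=0.25)
        bound = chebyshev_bound_on_mean(n, epsilon=0.25)
        print(f"n={n:5d}  estimated={estimate:.3f}  Chebyshev bound={bound:.3f}")
```

Both the estimate and the bound shrink as n grows, which is the behaviour the weak law describes.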
As previously mentioned, the most important result of probability theory is the central limit theorem [18, p. 434]. It simply tells us that the averages (or sums) of n independent and identically distributed random variables, each with mean μ and variance σ², tend, after suitable centering and scaling, toward a Gaussian distribution as n grows without bound [41, p. 47]. It thereby provides a theoretical framework that explains why many natural statistical phenomena have a bell-shaped distribution. It also gives a theoretical framework for dealing with measurement errors by proposing that they should be normally distributed; in fact, the central limit theorem used to be referred to as the law of frequency of errors in the seventeenth and eighteenth centuries [18, p. 442]. The central limit theorem in a very simplified mathematical form (that is, for a single random variable only) is given by [18, p. 434]:
Central limit theorem (53)
If X₁, X₂, … are independent random variables, each with the same probability distribution, mean μ, and variance σ², then the distribution of:
$$\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
will converge to the standard normal distribution as n → ∞; that is, for any real number a:
$$P\left(\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} \le a\right) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2}\,dx \quad \text{as } n \to \infty$$
However, there are examples of superimposed independent effects that lead to non-normal processes [42, p. 28]. Although the existence of such processes seems at first glance to invalidate the central limit theorem, careful analysis shows that they possess infinite variance, which places them outside the applicability of the central limit theorem [42, p. 28].
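Both points can be illustrated with a rough simulation. The first function below estimates the distribution function of the standardized sum of n Uniform(0, 1) variables and can be compared with the standard normal distribution function. The second draws from a standard Cauchy distribution, whose variance is not finite, and estimates how often the sample mean falls outside [−1, 1]. The uniform and Cauchy distributions, the sample sizes, and the function names are assumptions chosen for illustration; they do not come from the cited sources.

```python
import math
import random


def normal_cdf(a):
    """Standard normal distribution function, written in terms of math.erf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def standardized_sum_cdf(n, a, trials=2000, seed=0):
    """Estimate P((X1+...+Xn - n*mu) / (sigma*sqrt(n)) <= a) for Uniform(0, 1) terms."""
    rng = random.Random(seed)
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)  # mean and standard deviation of Uniform(0, 1)
    hits = 0
    for _ in range(trials):
        total = sum(rng.random() for _ in range(n))
        if (total - n * mu) / (sigma * math.sqrt(n)) <= a:
            hits += 1
    return hits / trials


def cauchy_mean_outside(n, trials=2000, seed=0):
    """Estimate P(|mean of n standard Cauchy draws| > 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy for uniform U.
        total = sum(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n))
        if abs(total / n) > 1.0:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 30):
        print(f"uniform, n={n:3d}: estimate={standardized_sum_cdf(n, 1.0):.3f}  "
              f"normal={normal_cdf(1.0):.3f}")
    for n in (1, 10, 100):
        print(f"Cauchy,  n={n:3d}: P(|mean| > 1) is about {cauchy_mean_outside(n):.3f}")
```

For the uniform terms the estimate approaches the normal value as n grows. For the Cauchy draws the sample mean does not concentrate at all: the mean of n standard Cauchy variables is again standard Cauchy, so the estimated probability stays near one half for every n.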
The strong law of large numbers states that, with probability 1, the averages of a sequence of independent random variables, each with the same distribution, will converge to the mean of that distribution [18, p. 443].
The strong law of large numbers shows that the averages of repeated experiments should converge to their expected value. For example, if a fair coin is tossed infinitely many times, then the proportion of heads (or of tails) will be ½ with probability 1. Jacob Bernoulli was the earliest mathematician to prove a law of large numbers [43, p. 79]. Bernoulli was interested in developing mathematical tools to help make good decisions in civil, economic, and moral issues. He thought that, once the law of large numbers was proved, the relative frequency of observations could be a cornerstone on which such decisions are established [43, p. 79].
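The coin example can be sketched as a simulation that follows a single long sequence of tosses and records the running proportion of heads at a few checkpoints. A finite run can only suggest the convergence that the strong law asserts, not demonstrate it; the seed, the checkpoints, and the function name are illustrative assumptions.

```python
import random


def running_proportion_of_heads(num_tosses, checkpoints, seed=0):
    """Toss a fair coin num_tosses times along one path; return {toss count: proportion of heads}."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    heads = 0
    proportions = {}
    for toss in range(1, num_tosses + 1):
        if rng.random() < 0.5:  # heads with probability 1/2
            heads += 1
        if toss in wanted:
            proportions[toss] = heads / toss
    return proportions


if __name__ == "__main__":
    checkpoints = [10, 100, 1_000, 10_000, 100_000]
    for toss, proportion in running_proportion_of_heads(100_000, checkpoints).items():
        print(f"after {toss:7d} tosses: proportion of heads = {proportion:.4f}")
```

Unlike the weak-law experiment, which repeats many short experiments, this sketch tracks one sequence, which matches the path-by-path nature of the strong law.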
The strong law of large numbers (54)
If X₁, X₂, … are independent random variables, each with the same probability distribution and a finite mean μ, then with probability 1:
$$\frac{X_1 + X_2 + \cdots + X_n}{n} \to \mu \quad \text{as } n \to \infty$$
There are many other famous inequalities in the inventory of probability theory that deal with various situations or help simplify others, such as the one-sided Chebyshev inequality [44, p. 70], Jensen’s inequality, and Chernoff bounds. These are beyond the scope of this section, whose main purpose is to provide a consistent mathematical background for the discussion of Bayesian networks in the next section and in chapter 3. Reference [44] gives a quick introduction to them.