1.4 Some Statistical Tools
1.4.3 Bayesian Statistics
Bayesian ideas provide a method for quantifying the uncertainties in a measurement (for a review, see D’Agostini 2003). The two crucial aspects of these ideas are
• Probability depends on our state of knowledge, which is different for different people. Therefore probability is necessarily subjective.
Consider two events A and B, where P(A) and P(B) are the probabilities of the events A and B, respectively. It is clear that for the probability P(A) the following holds: 0 ≤ P(A) ≤ 1. Furthermore, for two events A and B
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{1.75}$$
$$P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A), \tag{1.76}$$
where ∪ denotes the logical OR and ∩ denotes the logical AND. The probability P(A∪B) is the logical sum and P(A∩B) is the logical product of the two probabilities P(A) and P(B). The latter is also often called the joint probability. The term P(A|B) describes the probability of A under the condition that B is true, or in short the probability of A given B.
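The sum and product rules can be checked on a small discrete example. The sketch below assumes a fair six-sided die as sample space; the events A and B and the helper names `prob` and `cond_prob` are chosen purely for illustration and are not part of the text above.

```python
from fractions import Fraction

# Sample space: the six faces of a fair die (an illustrative assumption).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of OMEGA) with equally likely faces."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def cond_prob(event, given):
    """Conditional probability P(event | given) = P(event AND given) / P(given)."""
    return prob(event & given) / prob(given)

A = frozenset({2, 4, 6})   # "the face is even"
B = frozenset({4, 5, 6})   # "the face is larger than 3"

# Sum rule: P(A or B) = P(A) + P(B) - P(A and B)
sum_rule_lhs = prob(A | B)
sum_rule_rhs = prob(A) + prob(B) - prob(A & B)

# Product rule: P(A and B) = P(A|B) P(B) = P(B|A) P(A)
joint = prob(A & B)
joint_via_B = cond_prob(A, B) * prob(B)
joint_via_A = cond_prob(B, A) * prob(A)

print("sum rule:    ", sum_rule_lhs, "=", sum_rule_rhs)
print("product rule:", joint, "=", joint_via_B, "=", joint_via_A)
```

Exact fractions are used instead of floating-point numbers so that both sides of each rule can be compared for strict equality.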
Another important property is the probabilistic independence of events. If the occurrence of A does not change the probability of B, the events A and B are said to be independent. In that case P(A|B) = P(A) and P(B|A) = P(B). Inserting this into Eq. (1.76) yields
$$P(A \cap B) = P(A)\,P(B). \tag{1.77}$$
From Eq. (1.76), Bayes' theorem is easily derived:
$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}, \tag{1.78}$$
where P(B) is called the prior probability, P(B|A) is called the posterior probability and P(A|B) is the likelihood. If one identifies event A with an observation and event B with some set of model parameters, the likelihood can be literally described as the probability of the observation A given the specific hypothesis B. In the same context, the probability of the observation, P(A), is a constant, although it is unknown, leaving the proportionality
$$P(B|A) \propto P(A|B)\,P(B). \tag{1.79}$$
The prior probability P(B) is a statement about our knowledge of the hypotheses and is mostly assumed to be uniform when nothing is known about the probability of the hypotheses. This choice follows Bayes' postulate, according to which all hypotheses are assigned equal prior probability in the absence of other information.
So far, it was implicitly assumed that the variable x is discrete and that a probability function p(x) is interpreted as the probability P(A) of the proposition A, where A is true when the value of the variable is equal to x. In most cases, however, continuous variables x have to be considered, and the probability becomes a continuous function interpreted through a probability density function as p(x)dx. In terms of the probability P(A), this means that A is true when the value of the variable lies in the range x to x + dx. In the further discussion, the latter perspective is assumed.
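As a small illustration of Bayes' theorem for discrete hypotheses, the following sketch updates a uniform prior with a single observation. The three coin hypotheses, the observation (3 heads in 4 tosses) and the function name `posterior` are assumptions made up for this example.

```python
from fractions import Fraction
from math import comb

def posterior(prior, likelihood):
    """Bayes' theorem for a discrete set of hypotheses B_k.

    prior[k]      = P(B_k)
    likelihood[k] = P(A | B_k)
    Returns P(B_k | A), normalised by P(A) = sum_k P(A | B_k) P(B_k).
    """
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalised)          # P(A), the same for every hypothesis
    return [u / evidence for u in unnormalised]

# Hypothetical hypotheses: a coin lands heads with probability q.
q_values = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3              # uniform prior over the three hypotheses

# Observation A: 3 heads in 4 tosses, with a binomial likelihood P(A | q).
n_tosses, n_heads = 4, 3
likelihood = [comb(n_tosses, n_heads) * q**n_heads * (1 - q)**(n_tosses - n_heads)
              for q in q_values]

post = posterior(prior, likelihood)
for q, p in zip(q_values, post):
    print(f"q = {q}: posterior = {p}")
```

Because the prior is uniform, the posterior is simply the likelihood rescaled to unit sum, so the most probable hypothesis is the one with the largest likelihood.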
Given a set of data d, the observations, and a set of model parameters θ describing our expectations, Eq. (1.79) becomes
$$p(\theta|d) \propto p(d|\theta)\,p(\theta). \tag{1.80}$$
If the data d_i are independent, then the likelihood L(θ;d) = p(d|θ) can be expressed as
$$L(\theta; d) = p(d|\theta) = \prod_i L(\theta; d_i). \tag{1.81}$$
As mentioned, if so little is known about the appropriate values of the parameters of the hypotheses that a uniform distribution is a practical choice for the prior p(θ), then, using the independence described by Eq. (1.81), Eq. (1.80) becomes
$$p(\theta|d) \propto p(d|\theta) = \prod_i L(\theta; d_i). \tag{1.82}$$
Therefore, the maximum of the posterior probability p(θ|d), which is the quantity of interest, namely the probability of the model given the data, can be found by maximising the likelihood L(θ;d), that is, by maximising the probability of the data given the model. This consideration leads to the maximum likelihood principle.
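The maximum likelihood principle can be sketched numerically. The example below assumes independent Gaussian data with known width and an unknown mean, and scans a grid of candidate means; the data are simulated, and all names and numerical values are hypothetical. The logarithm of the likelihood is used, which turns the product over independent data into a sum without moving the position of the maximum.

```python
import math
import random

def log_likelihood(mu, data, sigma):
    """ln L(mu; d) = sum_i ln L(mu; d_i) for independent Gaussian data.

    Summing logarithms is equivalent to multiplying the individual
    likelihoods and avoids numerical underflow for large data sets.
    """
    norm = -0.5 * math.log(2.0 * math.pi * sigma**2)
    return sum(norm - 0.5 * ((d - mu) / sigma) ** 2 for d in data)

random.seed(1)                      # fixed seed so the sketch is repeatable
true_mu, sigma = 2.0, 0.5           # hypothetical values for the illustration
data = [random.gauss(true_mu, sigma) for _ in range(200)]

# Uniform prior over a grid of candidate parameters:
# the posterior is then proportional to the likelihood.
grid = [1.0 + 0.001 * i for i in range(2001)]        # mu in [1, 3]
mu_ml = max(grid, key=lambda mu: log_likelihood(mu, data, sigma))

sample_mean = sum(data) / len(data)
print(f"maximum-likelihood estimate: {mu_ml:.3f}")
print(f"sample mean:                 {sample_mean:.3f}")
```

For this particular model the maximum of the likelihood coincides with the sample mean, so the grid result should agree with it to within the grid spacing.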
In order to derive the least-squares formula (Eq. 1.51) presented in Sect. 1.4.1, one considers the likelihood to be described by a Gaussian distribution. Assume that the data are independent and consist of pairs {x_i, y_i}, whose true values {µ_{x_i}, µ_{y_i}} are related by a deterministic function µ_{y_i} = y(µ_{x_i}, θ), with Gaussian errors σ_i only in y_i (i.e. x_i ≈ µ_{x_i}). The likelihood function is then a multivariate Gaussian,
$$p(\theta|x,y) \propto L(\theta; x, y) \propto \prod_i \exp\left[-\frac{(y_i - y(x_i,\theta))^2}{2\sigma_i^2}\right] \tag{1.83}$$
$$= \exp\left[-\frac{1}{2}\chi^2(\theta)\right], \tag{1.84}$$
where
$$\chi^2(\theta) = \sum_i \frac{(y_i - y(x_i,\theta))^2}{\sigma_i^2}, \tag{1.85}$$
which is the same as Eq. (1.51). Maximising the likelihood function is therefore equivalent to minimising χ²(θ) with respect to θ. The important point is that this equivalence rests on the assumptions of independent data and Gaussian errors. One should keep these assumptions in mind when applying this method to data sets.
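The equivalence between maximising the Gaussian likelihood and minimising χ² can be illustrated with a straight-line model y = a + b x, for which the minimum of χ² is known in closed form (weighted least squares). The data values and the function names `chi2` and `fit_line` below are invented for this sketch.

```python
def chi2(theta, x, y, sigma):
    """chi^2(theta) for the straight-line model y = a + b*x, theta = (a, b)."""
    a, b = theta
    return sum(((yi - (a + b * xi)) / si) ** 2 for xi, yi, si in zip(x, y, sigma))

def fit_line(x, y, sigma):
    """Closed-form minimum of chi^2 for y = a + b*x (weighted least squares)."""
    w = [1.0 / s**2 for s in sigma]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = S * Sxx - Sx**2
    return (Sxx * Sy - Sx * Sxy) / det, (S * Sxy - Sx * Sy) / det

# Hypothetical data set, made up for the illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
sigma = [0.2, 0.2, 0.3, 0.3, 0.4]

theta_m = fit_line(x, y, sigma)
chi2_min = chi2(theta_m, x, y, sigma)

# The likelihood exp(-chi^2 / 2) is largest where chi^2 is smallest:
# every displaced parameter set below has a larger chi^2.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    shifted = (theta_m[0] + da, theta_m[1] + db)
    print(shifted, chi2(shifted, x, y, sigma) - chi2_min)

print("best-fit (a, b):", theta_m)
print("chi^2 at minimum:", chi2_min)
```

The printed differences are all positive, which is the numerical counterpart of the statement that the best-fit parameters maximise the likelihood.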
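The uncertainty in θ is determined by considering the covariance matrix V,
$$(V^{-1})_{ij}(\theta) = \frac{1}{2}\left.\frac{\partial^2\chi^2}{\partial\theta_i\,\partial\theta_j}\right|_{\theta=\theta_m}, \tag{1.86}$$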
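where θ_m is the set of parameters which minimises the χ² function. This is a consequence of the assumed multivariate Gaussian distribution of θ. Expanding χ² in a Taylor series around its minimum χ²(θ_m), where the first-order term vanishes, gives
$$\chi^2(\theta) \approx \chi^2(\theta_m) + \frac{1}{2}\,\Delta\theta^T \left[\frac{\partial^2\chi^2}{\partial\theta_i\,\partial\theta_j}\right]_{\theta=\theta_m} \Delta\theta, \tag{1.87}$$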
where Δθ is the difference θ − θ_m. Using Eq. (1.86), inserting Eq. (1.87) into Eq. (1.83) and applying an appropriate normalisation results in the likelihood function
$$L(\theta; x, y) \approx \frac{1}{(2\pi)^{n/2}(\det V)^{1/2}} \exp\left[-\frac{1}{2}\,\Delta\theta^T V^{-1} \Delta\theta\right], \tag{1.88}$$
where n is the dimension of θ and det V denotes the determinant of V. It is noteworthy that Eq. (1.88) is exact when y(µ_{x_i}, θ) depends linearly on the various θ_i.
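The relation between the second derivatives of χ² and the covariance matrix, Eq. (1.86), can be checked numerically. The sketch below assumes a straight-line model y = a + b x with invented data, estimates the second derivatives by central finite differences, and compares them with the closed-form result available for a model that is linear in θ. The helper names `inverse_covariance` and `invert_2x2` are hypothetical.

```python
import math

# Hypothetical straight-line data (y = a + b*x), made up for this illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
sigma = [0.2, 0.2, 0.3, 0.3, 0.4]
w = [1.0 / s**2 for s in sigma]                  # weights 1 / sigma_i^2

def chi2(theta):
    """chi^2(theta) for the model y = a + b*x, theta = (a, b)."""
    a, b = theta
    return sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))

def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def inverse_covariance(f, theta_m, h=1e-2):
    """(V^-1)_ij = 1/2 * d^2 f / (d theta_i d theta_j) at theta_m,
    estimated with central finite differences of step h."""
    n = len(theta_m)

    def shifted(i, si, j, sj):
        t = list(theta_m)
        t[i] += si * h
        t[j] += sj * h
        return f(t)

    vinv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            second = (shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
                      - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4.0 * h * h)
            vinv[i][j] = 0.5 * second
    return vinv

# For a model linear in theta the second derivatives are known in closed form.
S = sum(w)
Sx = sum(wi * xi for wi, xi in zip(w, x))
Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
Sy = sum(wi * yi for wi, yi in zip(w, y))
Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
vinv_exact = [[S, Sx], [Sx, Sxx]]

# Best-fit parameters from the normal equations V^-1 theta_m = (Sy, Sxy).
V = invert_2x2(vinv_exact)
theta_m = [V[0][0] * Sy + V[0][1] * Sxy, V[1][0] * Sy + V[1][1] * Sxy]

vinv_num = inverse_covariance(chi2, theta_m)
V_num = invert_2x2(vinv_num)
err_a, err_b = math.sqrt(V_num[0][0]), math.sqrt(V_num[1][1])
print("theta_m:", theta_m)
print("1-sigma uncertainties on (a, b):", err_a, err_b)
```

The square roots of the diagonal elements of V give the individual parameter uncertainties, while the off-diagonal elements describe the correlation between a and b. For a non-linear model the finite-difference estimate would still apply, but only as an approximation near the minimum.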
The likelihood function can also be used as a power spectrum estimator, since such an estimator has to minimise the variance. If one identifies the covariance matrix as defined by Eq. (1.60) and assumes the mean to be zero, ⟨y_i⟩ = 0, the likelihood function becomes
$$L(\theta; x) = \frac{1}{(2\pi)^{N/2}(\det C)^{1/2}} \exp\left[-\frac{1}{2}\,\Delta^T C^{-1} \Delta\right], \tag{1.89}$$
where C is the covariance matrix, which expresses the theoretical expectations and depends on the model parameters, Δ_i are the data and N is the dimension of the covariance matrix.
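A possible way to evaluate the logarithm of Eq. (1.89) numerically is sketched below, using a Cholesky factorisation to obtain both det C and the quadratic form Δ^T C⁻¹ Δ. The one-parameter covariance model (an amplitude multiplying a fixed correlation pattern plus uncorrelated noise), the data vector and all function names are assumptions invented for this example and are not the covariance of Eq. (1.60).

```python
import math

def cholesky(C):
    """Lower-triangular L with L L^T = C for a symmetric positive-definite C."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_likelihood(delta, C):
    """ln L for a zero-mean Gaussian data vector delta with covariance C."""
    n = len(delta)
    L = cholesky(C)
    # Solve L z = delta by forward substitution; delta^T C^-1 delta = z^T z.
    z = []
    for i in range(n):
        z.append((delta[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))   # ln det C
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))

def model_covariance(amplitude, n, noise=0.5):
    """Hypothetical model: C_ij = amplitude * 0.5^|i-j| + noise^2 * delta_ij."""
    return [[amplitude * 0.5 ** abs(i - j) + (noise**2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

# Made-up zero-mean data vector Delta.
delta = [0.8, 1.1, -0.3, -1.2, 0.4, 0.9]

# Scan the amplitude on a grid and keep the value with the largest likelihood.
amplitudes = [0.05 * k for k in range(1, 61)]          # 0.05 ... 3.0
amplitude_ml = max(
    amplitudes,
    key=lambda a: log_likelihood(delta, model_covariance(a, len(delta))),
)
print("amplitude maximising the likelihood on the grid:", amplitude_ml)
```

Working with ln L rather than L avoids overflow and underflow of the determinant for large N, and the Cholesky factor avoids forming C⁻¹ explicitly.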