Bayesian Statistics - Some Statistical Tools

1.4 Some Statistical Tools

1.4.3 Bayesian Statistics

A method for quantifying uncertainties in a measurement is provided by Bayesian ideas (for a review, see D’Agostini 2003). The two crucial aspects of these ideas are

• Probability depends on our state of knowledge, which is different for different people. Therefore probability is necessarily subjective.

Consider two events $A$ and $B$, where $P(A)$ and $P(B)$ are the probabilities of the events $A$ and $B$, respectively. For the probability $P(A)$, it holds that $0 \le P(A) \le 1$. Furthermore, for two events $A$ and $B$

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{1.75}$$

$$P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A), \tag{1.76}$$

where $\cup$ denotes the logical OR and $\cap$ denotes the logical AND. The probability $P(A \cup B)$ is the logical sum and $P(A \cap B)$ is the logical product of the two probabilities $P(A)$ and $P(B)$. The latter is also often called the joint probability. The term $P(A|B)$ describes the probability of $A$ under the condition that $B$ is true, read in short as the probability of $A$ given $B$.
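The sum and product rules of Eqs. (1.75) and (1.76) can be checked numerically on a small joint distribution. The following is a minimal sketch; the four joint probabilities are made-up illustration values and are not taken from the text.

```python
# Hypothetical joint probabilities for two binary events A and B.
# Keys are (A is true, B is true); the four values are made up and sum to 1.
joint = {
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.20,
    (False, False): 0.50,
}

p_a = sum(p for (a, b), p in joint.items() if a)            # P(A)
p_b = sum(p for (a, b), p in joint.items() if b)            # P(B)
p_a_and_b = joint[(True, True)]                             # P(A and B)
p_a_or_b = sum(p for (a, b), p in joint.items() if a or b)  # P(A or B)

p_a_given_b = p_a_and_b / p_b  # P(A|B)
p_b_given_a = p_a_and_b / p_a  # P(B|A)

# Eq. (1.75): logical sum
sum_rule_gap = abs(p_a_or_b - (p_a + p_b - p_a_and_b))
# Eq. (1.76): logical product, written both ways
product_rule_gap_1 = abs(p_a_and_b - p_a_given_b * p_b)
product_rule_gap_2 = abs(p_a_and_b - p_b_given_a * p_a)

print(sum_rule_gap < 1e-12, product_rule_gap_1 < 1e-12, product_rule_gap_2 < 1e-12)
```

Both forms of the product rule give the same joint probability, which is the symmetry that Bayes' theorem exploits.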

Another important property is the probabilistic independence of events. If the status of $B$ does not change the probability of $A$, the events $A$ and $B$ are said to be independent. In that case $P(A|B) = P(A)$ and $P(B|A) = P(B)$. Inserting this into Eq. (1.76) yields

$$P(A \cap B) = P(A)\,P(B). \tag{1.77}$$

From Eq. (1.76), Bayes' theorem is easily derived:

$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}, \tag{1.78}$$

where $P(B)$ is called the prior probability, $P(B|A)$ is called the posterior probability and $P(A|B)$ is the likelihood. If one identifies event $A$ with an observation and event $B$ with some set of model parameters, the likelihood can be described literally as the probability of the observation $A$ given the specific hypotheses $B$. In the same context, the probability of the observation $P(A)$ is an unknown constant, which leaves the proportionality

$$P(B|A) \propto P(A|B)\,P(B). \tag{1.79}$$

The prior probability $P(B)$ is a statement about our knowledge of the hypotheses. It is mostly assumed to be uniform when nothing is known about the probability of the hypotheses, which corresponds to Bayes' postulate that, in this case, all priors should be treated as equal.

So far, it was implicitly assumed that the variable $x$ is discrete and that a probability function $p(x)$ is interpreted as the probability $P(A)$ of the proposition $A$, where $A$ is true when the value of the variable is equal to $x$. In most cases, however, continuous variables $x$ have to be considered, and the probability is then described by a probability density function $p(x)$, with $p(x)\,dx$ interpreted in terms of $P(A)$ as the probability that $A$ is true when the value of the variable lies in the range from $x$ to $x + dx$. In the further discussion, the latter perspective is assumed.
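For the discrete case, Bayes' theorem in Eq. (1.78) amounts to multiplying prior and likelihood and normalising by the probability of the observation. The sketch below illustrates this with two hypothetical hypotheses about a coin (the names `fair` and `biased` and all numbers are assumptions made for illustration) and a uniform prior.

```python
def posterior(prior, likelihood):
    """Return P(B|A) for every hypothesis B, following Eq. (1.78)."""
    unnormalised = {b: likelihood[b] * prior[b] for b in prior}
    evidence = sum(unnormalised.values())  # P(A), the normalising constant
    return {b: u / evidence for b, u in unnormalised.items()}

# Hypothetical example: the observation A is "the coin shows heads".
prior = {"fair": 0.5, "biased": 0.5}       # uniform prior P(B)
likelihood = {"fair": 0.5, "biased": 0.8}  # P(A|B) for each hypothesis

post = posterior(prior, likelihood)
for hypothesis, p in post.items():
    print(hypothesis, round(p, 4))
```

With a uniform prior, the ratio of the two posteriors equals the ratio of the two likelihoods, which is the content of the proportionality in Eq. (1.79).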

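Assuming a set of data $d$, the observations, and a set of models $\theta$ describing our expectations, Eq. (1.79) becomes

$$p(\theta|d) \propto p(d|\theta)\,p(\theta). \tag{1.80}$$

If the data $d_i$ are independent, then the likelihood $L(\theta; d) = p(d|\theta)$ can be expressed as the product

$$L(\theta; d) = p(d|\theta) = \prod_i L(\theta; d_i). \tag{1.81}$$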

As mentioned, if one knows so little about the appropriate values of the hypothesis parameters that a uniform distribution is a practical choice for the prior $p(\theta)$, then, using the independence described by Eq. (1.81), Eq. (1.80) becomes

$$p(\theta|d) \propto p(d|\theta) = \prod_i L(\theta; d_i). \tag{1.82}$$

Therefore, the maximum of the posterior probability $p(\theta|d)$, which is the probability of interest (that of the model given the data), can be found by maximising the likelihood $L(\theta; d)$, that is, by maximising the probability of the data given the model. This consideration leads to the maximum likelihood principle.
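The maximum likelihood principle can be illustrated with a one-parameter toy problem. The sketch below assumes independent measurements with a Gaussian likelihood of known width and a single parameter `mu`; the data values, the width `sigma` and the search grid are all hypothetical. The product in Eq. (1.81) is evaluated as a sum of logarithms, which has its maximum at the same parameter value.

```python
import math

def log_likelihood(mu, data, sigma):
    """Logarithm of the product in Eq. (1.81) for a Gaussian L(mu; d_i)."""
    norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((d - mu) / sigma) ** 2 - norm for d in data)

data = [4.8, 5.1, 5.3, 4.9, 5.4]  # made-up independent measurements
sigma = 0.3                       # assumed known measurement error

# With a uniform prior the posterior is proportional to the likelihood
# (Eq. 1.82), so the posterior maximum is found by scanning the likelihood.
grid = [4.0 + 0.001 * k for k in range(2001)]  # mu from 4.0 to 6.0
log_l = [log_likelihood(mu, data, sigma) for mu in grid]
mu_ml = grid[log_l.index(max(log_l))]

sample_mean = sum(data) / len(data)
print(mu_ml, sample_mean)
```

For this Gaussian toy case the maximum of the likelihood coincides, up to the grid spacing, with the arithmetic mean of the data.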

In order to derive the least-squares formula (Eq. 1.51) presented in Sect. 1.4.1, one considers the likelihood to be described by a Gaussian distribution. Assume that the data are independent and consist of pairs $\{x_i, y_i\}$ whose true values $\{\mu_{x_i}, \mu_{y_i}\}$ are related by a deterministic function $\mu_{y_i} = y(\mu_{x_i}, \theta)$, with Gaussian errors $\sigma_i$ only in $y_i$ (i.e. $x_i \approx \mu_{x_i}$). Then the likelihood function is a multivariate Gaussian,

$$p(\theta|x, y) = L(\theta; x, y) \propto \prod_i \exp\left[-\frac{(y_i - y(x_i, \theta))^2}{2\sigma_i^2}\right] \tag{1.83}$$

$$= \exp\left[-\frac{1}{2}\chi^2(\theta)\right], \tag{1.84}$$

where

$$\chi^2(\theta) = \sum_i \frac{(y_i - y(x_i, \theta))^2}{\sigma_i^2}, \tag{1.85}$$

which is the same as Eq. (1.51). Maximising the likelihood function is equivalent to minimising $\chi^2(\theta)$ with respect to $\theta$. The important point is that this equivalence holds for independent variables and Gaussian distributions. One should keep these assumptions in mind when applying this method to data sets.
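As an illustration of Eq. (1.85), the sketch below minimises $\chi^2$ for a straight-line model $y(x, \theta) = a + b\,x$ with $\theta = (a, b)$. The model, the data and the errors are assumptions chosen for illustration; the minimum is obtained from the two normal equations of weighted linear least squares rather than from any method stated in the text.

```python
def chi2(theta, xs, ys, sigmas):
    """Eq. (1.85) for the assumed model y(x, theta) = a + b * x."""
    a, b = theta
    return sum(((y - (a + b * x)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))

def fit_line(xs, ys, sigmas):
    """Minimise chi2 analytically by solving the two normal equations."""
    w = [1.0 / s ** 2 for s in sigmas]  # weights 1 / sigma_i^2
    s0 = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = s0 * sxx - sx ** 2
    a = (sxx * sy - sx * sxy) / det
    b = (s0 * sxy - sx * sy) / det
    return a, b

# Made-up data with Gaussian errors in y only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
sigmas = [0.2, 0.2, 0.3, 0.3, 0.4]

theta_m = fit_line(xs, ys, sigmas)
chi2_min = chi2(theta_m, xs, ys, sigmas)
print(theta_m, chi2_min)
```

Because the likelihood is proportional to $\exp(-\chi^2/2)$, any parameter pair other than `theta_m` gives a larger $\chi^2$ and therefore a smaller likelihood.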

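The uncertainty in $\theta$ is determined by considering the covariance matrix $V$,

$$(V^{-1})_{ij}(\theta) = \frac{1}{2} \left. \frac{\partial^2 \chi^2}{\partial\theta_i\,\partial\theta_j} \right|_{\theta = \theta_m}, \tag{1.86}$$

where $\theta_m$ is the set of parameters which minimise the $\chi^2$-function. This is a consequence of the assumed multivariate Gaussian distribution of $\theta$. Expanding $\chi^2$ in a series around its minimum $\chi^2(\theta_m)$ gives

$$\chi^2(\theta) \approx \chi^2(\theta_m) + \frac{1}{2} \Delta\theta^T \frac{\partial^2 \chi^2}{\partial\theta_i\,\partial\theta_j} \Delta\theta, \tag{1.87}$$

where $\Delta\theta$ is the difference $\theta - \theta_m$. Using Eq. (1.86), inserting Eq. (1.87) into Eq. (1.83) and applying an appropriate normalisation results in the likelihood function

$$L(\theta; x, y) \approx \frac{1}{(2\pi)^{n/2} (\det V)^{1/2}} \exp\left[-\frac{1}{2} \Delta\theta^T V^{-1} \Delta\theta\right], \tag{1.88}$$

where $n$ is the dimension of $\theta$ and $\det V$ indicates the determinant. It is noteworthy that Eq. (1.88) is exact when $y(\mu_{x_i}, \theta)$ depends linearly on the various $\theta_i$.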
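Eq. (1.86) can be evaluated numerically when analytic second derivatives are inconvenient. The sketch below assumes the same kind of straight-line model $y(x, \theta) = a + b\,x$ with made-up data and equal errors, estimates the second derivatives of $\chi^2$ by central finite differences (a choice made here for illustration, with a hypothetical step size `h`), and inverts half the resulting matrix to obtain $V$.

```python
import math

def chi2(theta, xs, ys, sigmas):
    """chi2 for the assumed model y(x, theta) = a + b * x."""
    a, b = theta
    return sum(((y - (a + b * x)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))

def hessian(f, theta, h=1e-2):
    """Matrix of second derivatives of f at theta via central differences."""
    n = len(theta)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                t = list(theta)
                t[i] += di
                t[j] += dj
                return f(t)
            hess[i][j] = (shifted(h, h) - shifted(h, -h)
                          - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return hess

def invert_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

# Made-up data with equal Gaussian errors in y.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
sigmas = [0.5] * 5
theta_m = (1.10, 1.96)  # chi2 minimum for these made-up data (a, b)

hess = hessian(lambda t: chi2(t, xs, ys, sigmas), theta_m)
v_inverse = [[0.5 * value for value in row] for row in hess]  # Eq. (1.86)
V = invert_2x2(v_inverse)
errors = [math.sqrt(V[k][k]) for k in range(2)]  # 1-sigma errors on a and b
print(V, errors)
```

Since this model is linear in $a$ and $b$, $\chi^2$ is exactly quadratic, so the finite-difference estimate agrees with the analytic second derivatives up to rounding and the Gaussian form of Eq. (1.88) is exact.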

The likelihood function can also be used as a power spectrum estimator, since such an estimator has to minimise the variance. If one identifies the covariance matrix as defined by Eq. (1.60) and assumes the mean to be zero, $\langle y_i \rangle = 0$, the likelihood function becomes

$$L(\theta; x) = \frac{1}{(2\pi)^{N/2} (\det C)^{1/2}} \exp\left[-\frac{1}{2} \Delta^T C^{-1} \Delta\right], \tag{1.89}$$

where $C$ is the covariance matrix, which expresses the theoretical expectations and depends on the model parameters, $\Delta_i$ are the data and $N$ is the dimension of the covariance matrix.
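A likelihood of the form of Eq. (1.89) can be evaluated numerically through a Cholesky factorisation of $C$, which yields both $\det C$ and the quadratic form $\Delta^T C^{-1} \Delta$. The sketch below is only an illustration: the one-parameter covariance model $C_{ij}(\theta) = \theta\,\rho^{|i-j|}$ with fixed $\rho = 0.5$, the data vector and the search grid are hypothetical and are not the covariance matrix of Eq. (1.60).

```python
import math

def cholesky(c):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(c)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(c[i][i] - s)
            else:
                low[i][j] = (c[i][j] - s) / low[j][j]
    return low

def log_likelihood(c, delta):
    """Logarithm of Eq. (1.89) for covariance matrix c and data delta."""
    n = len(delta)
    low = cholesky(c)
    z = []  # solves low * z = delta, so that z.z = delta^T c^-1 delta
    for i in range(n):
        z.append((delta[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

def model_covariance(theta, n, rho=0.5):
    """Hypothetical model: C_ij = theta * rho**|i - j|."""
    return [[theta * rho ** abs(i - j) for j in range(n)] for i in range(n)]

delta = [0.3, -0.5, 0.8, -0.1]  # made-up zero-mean data

# Scan the single model parameter and keep the most likely value.
thetas = [0.01 * k for k in range(1, 301)]
log_l = [log_likelihood(model_covariance(t, len(delta)), delta) for t in thetas]
theta_best = thetas[log_l.index(max(log_l))]
print(theta_best)
```

For this amplitude-only model the maximum can also be written in closed form as $\Delta^T R^{-1} \Delta / N$ with $R_{ij} = \rho^{|i-j|}$, which provides a check on the grid scan.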

In document Vogt, Corina (2004): Investigations of Faraday Rotation Maps of Extended Radio Sources in order to determine Cluster Magnetic Field Properties. Dissertation, LMU München: Fakultät für Physik (Page 39-42)