

4.9 Uncertainty measures and inference

4.9.3 Statistical inference

Statistical inference is mainly concerned with the induction of general statements about a population of interest, on the basis of the observed sample. First we need to derive the expression for the distribution of a sample of observations from a statistical model. A sample of n observations on a random variable X is a sequence of random variables X1, X2, . . . , Xn that are distributed identically to X. In most cases it is convenient to assume that the sample is a simple random sample, with the observations drawn with replacement from the population modelled by X. It then follows that the random variables X1, X2, . . . , Xn are independent and therefore constitute a sequence of independent and identically distributed (i.i.d.) random variables. Let X denote the random vector formed by such a sequence, X = (X1, X2, . . . , Xn), and let x = (x1, x2, . . . , xn) denote the sample value actually observed. It can be shown that, if the observations are i.i.d., the cumulative distribution of X simplifies to

$$F(\mathbf{x}) = \prod_{i=1}^{n} F(x_i),$$

with F(xi) the cumulative distribution of X, evaluated for each of the sample values (x1, x2, . . . , xn). If x = (x1, x2, . . . , xn) are the observed sample values, this expression gives the probability, according to the assumed statistical model, of observing sample values less than or equal to the observed values. Furthermore, when X is a continuous random variable,

$$f(\mathbf{x}) = \prod_{i=1}^{n} f(x_i),$$

where f is the density function of X. And when X is a discrete random variable,

$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i),$$

where p is the discrete probability function of X. If x = (x1, x2, . . . , xn) are the observed sample values, this expression gives the probability, according to the assumed statistical model, of observing sample values exactly equal to the observed values. In other words, it measures how good the assumed model is for the given data. A high value of p(x), possibly close to 1, implies that the data are well described by the statistical model; a low value of p(x) implies that the data are poorly described. Similar conclusions can be drawn for f(x) in the continuous case. The difference is that the sample density f(x) is not constrained to lie in [0, 1], unlike the sample probability p(x). Nevertheless, higher values of f(x) also indicate that the data are well described by the model, and lower values indicate that the data are poorly described. In both cases we can say that p(x) or f(x) expresses the likelihood of the model for the given data.
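To make this concrete, here is a minimal sketch (an illustrative addition, not taken from the text) that evaluates the sample likelihood of an i.i.d. Gaussian sample as the product of the individual densities; the sample values and candidate parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical i.i.d. sample, assumed drawn from a normal population
x = np.array([4.8, 5.1, 5.3, 4.9, 5.0])

# Candidate statistical model: Normal(mu, sigma); values chosen for illustration
mu, sigma = 5.0, 0.2

# Sample density f(x) = prod_i f(x_i): product of the individual densities
sample_density = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# In practice the log-likelihood is preferred, since the product underflows easily
log_likelihood = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

print(sample_density, log_likelihood)
```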

These are fundamental ideas when considering inference. A statistical model is a rather general model, in the sense that, once a model is assumed to hold, it remains to specify precisely the distribution function or, if the model is parametric, the unknown parameters. In general, there remain unknown quantities to be specified. This can seldom be done theoretically, without reference to the observed data. As the data are typically observed on a sample, the main purpose of statistical inference is to 'extend' the validity of the calculations obtained on the sample to the whole population. In this respect, when statistical summaries are calculated on a sample rather than a whole population, it is more correct to use the term 'estimated' rather than 'calculated', to reflect the fact that the values obtained depend on the sample chosen and may therefore differ if a different sample is considered. The summary functions that produce the estimates, when applied to the data, are called statistics. The simplest examples of statistics are the sample mean and the sample variance; other examples are the statistical indexes of Chapter 3, when calculated on a sample.

The methods of statistical inference can be divided into estimation methods and hypothesis testing procedures. Estimation methods derive statistics, called estimators, of the unknown model quantities that, when applied to the sample data, can produce reliable estimates of them. Estimation methods can be divided into point estimate methods, where the quantity is estimated with a precise value, and confidence interval methods, where the quantity is estimated to lie within a region, usually an interval of the real line, with a high frequency. To provide a confidence interval, estimators are usually supplemented by measures of their sampling variability. Hypothesis testing procedures use the statistics to make decisions and take actions. More precisely, the chosen statistics are used to accept or reject a hypothesis about the unknown quantities by constructing rejection regions that impose thresholds on the values of the statistics.

We briefly present the most important inferential methods. For simplicity, we refer to a parametric model. Starting with estimation methods, consider some desirable properties of an estimator. An estimator T is said to be unbiased for a parameter θ if E(T) = θ. The difference E(T) − θ is called the bias of the estimator and is zero if the estimator is unbiased. For example, the sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

is always an unbiased estimator of the unknown population mean μ, as it can be shown that $E(\bar{X}) = \mu$. On the other hand, the sample variance

$$S^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

is a biased estimator of the population variance σ², as

$$E(S^2) = \frac{n-1}{n}\,\sigma^2.$$

Its bias is therefore

$$\mathrm{bias}(S^2) = -\frac{1}{n}\,\sigma^2.$$

This explains why an often used estimator of the population variance is the unbiased sample variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$
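As a quick illustrative check (not from the original text), the following sketch estimates the expected value of both variance estimators by simulation, assuming a normal population with a σ² chosen arbitrarily; the averages should approach ((n − 1)/n)σ² and σ² respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, n_reps = 10, 4.0, 200_000  # illustrative choices

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_reps, n))

s2_biased = samples.var(axis=1, ddof=0)    # divides by n
s2_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1

print(s2_biased.mean())    # close to (n - 1) / n * sigma2 = 3.6
print(s2_unbiased.mean())  # close to sigma2 = 4.0
```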

A related concept is the efficiency of an estimator, which is a relative notion. Among a class of estimators, the most efficient estimator is usually the one with the lowest mean squared error (MSE), which is defined on the basis of the Euclidean distance by

$$\mathrm{MSE}(T) = E[(T - \theta)^2].$$

It can be shown that

$$\mathrm{MSE}(T) = [\mathrm{bias}(T)]^2 + \mathrm{Var}(T).$$

MSE thus has two components: the bias and the variance. As we shall see in Chapter 5, there is usually a trade-off between these quantities: if one increases, the other decreases. The sample mean can be shown to be the most efficient estimator of the population mean. For large samples, this can be easily seen by applying the definition.

Finally, an estimator is said to be consistent (in quadratic mean) if MSE(T) → 0 as n → ∞. This implies that, for every ε > 0, P(|T − θ| < ε) → 1 as n → ∞; that is, for a large sample, the probability that the estimator lies in an arbitrarily small neighbourhood of θ approaches 1. Notice that both the sample mean and the sample variances introduced above are consistent.
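To illustrate consistency numerically, here is a small added sketch (simulation settings are arbitrary) that watches the empirical MSE of the sample mean shrink as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0  # illustrative population parameters

for n in (10, 100, 1000):
    sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    mse = np.mean((sample_means - mu) ** 2)  # empirical MSE of the sample mean
    print(n, mse)                            # shrinks roughly like sigma**2 / n
```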

In practice, the two most important estimation methods are the maximum likelihood and Bayesian methods.

Maximum likelihood methods

Maximum likelihood methods start by considering the likelihood of a model which, in the parametric case, is the joint density of X, expressed as a function of the unknown parameters θ:

$$p(\mathbf{x}; \theta) = \prod_{i=1}^{n} p(x_i; \theta),$$

where θ are the unknown parameters and X is assumed to be discrete. The same expression holds for the continuous case, with p replaced by f. In the rest of the text we will therefore use the discrete notation, without loss of generality.

If a parametric model is chosen, the model is assumed to have a precise form, and the only unknown quantities left are the parameters. Therefore the likelihood is in fact a function of the parameters θ. To stress this fact, the previous expression can also be denoted by L(θ; x). Maximum likelihood methods suggest that, as estimators of the unknown parameter θ, we take the statistics that maximise L(θ; x) with respect to θ. The heuristic motivation for maximum likelihood is that it selects the parameter value that makes the observed data most likely under the assumed statistical model. The statistics generated using maximum likelihood are known as maximum likelihood estimators (MLEs) and have many desirable properties. In particular, they can be used to derive confidence intervals. The typical procedure is to assume that a large sample is available (this is often the case in data mining), in which case the MLE is approximately distributed as a Gaussian (normal) distribution. The estimator can thus be used in a simple way to derive an asymptotic (valid for large samples) confidence interval. For example, let T be an MLE and let Var(T) be its asymptotic variance. Then a 100(1 − α)% confidence interval is given by

$$\left( T - z_{1-\alpha/2}\sqrt{\mathrm{Var}(T)},\; T + z_{1-\alpha/2}\sqrt{\mathrm{Var}(T)} \right),$$

where $z_{1-\alpha/2}$ is the 100(1 − α/2)th percentile of the standardised normal distribution, such that the probability of obtaining a value less than $z_{1-\alpha/2}$ is equal to 1 − α/2. The quantity 1 − α is also known as the confidence level of the interval, as it gives the confidence that the procedure is correct: in 100(1 − α)% of cases the unknown quantity will fall within the chosen interval. It has to be specified before the analysis. For the normal distribution, the estimator of μ is the sample mean, $\bar{X} = n^{-1}\sum_i X_i$. So a confidence interval for the mean, when the variance σ² is assumed to be known, is given by

$$\left( \bar{X} - z_{1-\alpha/2}\sqrt{\mathrm{Var}(\bar{X})},\; \bar{X} + z_{1-\alpha/2}\sqrt{\mathrm{Var}(\bar{X})} \right), \quad \text{where } \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$

When the distribution is normal from the start, as in this case, the expression for the confidence interval holds for any sample size. A common choice is a confidence level of 95%; for the normal distribution this leads to $z_{1-\alpha/2} = 1.96$.
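A small sketch of the known-variance confidence interval described above (added here for illustration, with made-up data and a σ assumed known):

```python
import numpy as np
from scipy.stats import norm

x = np.array([5.2, 4.7, 5.0, 5.4, 4.9, 5.1, 5.3, 4.8])  # hypothetical sample
sigma = 0.25      # population standard deviation, assumed known
alpha = 0.05      # 95% confidence level

x_bar = x.mean()
se = sigma / np.sqrt(len(x))          # sqrt(Var(X_bar)) = sigma / sqrt(n)
z = norm.ppf(1 - alpha / 2)           # ~1.96 for alpha = 0.05

ci = (x_bar - z * se, x_bar + z * se)
print(ci)
```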

Bayesian methods

Bayesian methods use Bayes' rule, which provides a powerful framework for combining sample information with (prior) expert opinion to produce an updated (posterior) expert opinion. In Bayesian analysis, a parameter is treated as a random variable whose uncertainty is modelled by a probability distribution. This distribution is the expert's prior distribution p(θ), stated in the absence of the sampled data. The likelihood is the distribution of the sample, conditional on the values of the random variable θ, p(x|θ). Bayes' rule provides an algorithm to update the expert's opinion in light of the data, producing the so-called posterior distribution p(θ|x):

$$p(\theta|\mathbf{x}) = c^{-1}\, p(\mathbf{x}|\theta)\, p(\theta),$$

with c = p(x), a constant that does not depend on the unknown parameter θ. The posterior distribution represents the main tool of Bayesian inference. Once it is obtained, it is easy to obtain any inference of interest. For instance, to obtain a point estimate, we can take a summary of the posterior distribution, such as the mean or the mode. Similarly, confidence intervals can be easily derived by taking any two values of θ such that the probability of θ belonging to the interval described by those two values corresponds to the given confidence level. As θ is a random variable, it is now correct to interpret the confidence level as a probability statement: 1 − α is the coverage probability of the interval, namely, the probability that θ assumes values in the interval. The Bayesian approach is thus a coherent and flexible procedure. On the other hand, it has the disadvantage of requiring a more computationally intensive approach, as well as more careful statistical thinking, especially in providing an appropriate prior distribution.

For the normal distribution example, assuming as a prior distribution for θ a constant distribution (expressing a vague state of prior knowledge), the posterior mode is equal to the MLE. Therefore, maximum likelihood estimates can be seen as a special case of Bayesian estimates. More generally, it can be shown that, when a large sample is considered, the Bayesian posterior distribution approaches an asymptotic normal distribution, with the maximum likelihood estimate as expected value. This reinforces the previous conclusion.
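As an added illustration (not part of the original text), the sketch below performs the standard conjugate normal–normal update for a mean with known variance; as the prior variance is made very large (a vague prior), the posterior mean approaches the sample mean, i.e. the MLE. All numbers are hypothetical.

```python
import numpy as np

x = np.array([5.2, 4.7, 5.0, 5.4, 4.9])  # hypothetical sample
sigma2 = 0.09                            # data variance, assumed known

# Hypothetical prior on the mean: theta ~ N(mu0, tau0^2)
mu0 = 0.0
for tau0_sq in (1.0, 100.0, 1e6):        # increasingly vague priors
    n, x_bar = len(x), x.mean()
    post_var = 1.0 / (1.0 / tau0_sq + n / sigma2)
    post_mean = post_var * (mu0 / tau0_sq + n * x_bar / sigma2)
    print(tau0_sq, post_mean)            # approaches x_bar (the MLE) as tau0_sq grows
```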

An important application of Bayes' rule arises in predictive classification problems. As explained in Section 4.4, the discriminant rule establishes that an observation x is allocated to the class with the highest probability of occurrence, on the basis of the observed data. This can be stated more precisely by appealing to Bayes' rule. Let Ci, for i = 1, . . . , k, denote a partition of mutually exclusive and exhaustive classes. Bayes' discriminant rule allocates each observation x to the class Ci that maximises the posterior probability:

$$p(C_i|\mathbf{x}) = c^{-1}\, p(\mathbf{x}|C_i)\, p(C_i), \quad \text{where } c = p(\mathbf{x}) = \sum_{i=1}^{k} p(C_i)\, p(\mathbf{x}|C_i),$$

and x is the observed sample. Since the denominator does not depend on Ci, it is sufficient to maximise the numerator. If the prior class probabilities are all equal to $k^{-1}$, maximisation of the posterior probability is equivalent to maximisation of the likelihood p(x|Ci). This is the approach often followed in practice. Another common approach is to estimate the prior class probabilities with the observed relative frequencies in a training sample. In any case, it can be shown that the Bayes discriminant rule is optimal, in the sense that it leads to the least possible misclassification error rate. This error rate is measured as the expected error probability when each observation is classified according to Bayes' rule:

$$p_B = \int \left[ 1 - \max_i\, p(C_i|\mathbf{x}) \right] p(\mathbf{x})\, d\mathbf{x},$$

also known as the Bayes error rate. No other discriminant rule can do better than a Bayes classifier; that is, the Bayes error rate is a lower bound on the misclassification rates. Only rules that derive from Bayes' rule are optimal. For instance, the logistic discriminant rule and the linear discriminant rule (Section 4.4) are optimal, whereas the discriminant rules obtained from tree models, multilayer perceptrons and nearest-neighbour models are optimal only for a large sample size.
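The sketch below (an illustrative addition, with made-up class-conditional normal densities and prior probabilities) applies Bayes' discriminant rule by allocating an observation to the class maximising p(x|Ci)p(Ci):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem with univariate normal class-conditional densities
priors = np.array([0.7, 0.3])                       # p(C1), p(C2)
means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def bayes_classify(x):
    # Posterior is proportional to p(x | C_i) * p(C_i); the constant c cancels
    scores = norm.pdf(x, loc=means, scale=sds) * priors
    return np.argmax(scores)  # index of the class with the highest posterior

print(bayes_classify(0.4))  # allocated to class 0
print(bayes_classify(1.8))  # allocated to class 1
```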

Hypothesis testing

We now briefly consider procedures for hypothesis testing. A statistical hypothesis is an assertion about an unknown population quantity. Hypothesis testing is generally performed in a pairwise way: a null hypothesis H0 specifies the hypothesis to be verified, and an alternative hypothesis H1 specifies the hypothesis against which to compare it. A hypothesis testing procedure is usually constructed by finding a rejection (critical) rule such that H0 is rejected when an observed sample statistic satisfies that rule, and not rejected otherwise. The simplest way to construct a rejection rule is by using confidence intervals. Let the acceptance region of a test be defined as the logical complement of the rejection region. An acceptance region for a (two-sided) hypothesis can be obtained from the two inequalities describing a confidence interval, swapping the parameter with the statistic and setting the parameter value equal to the null hypothesis value. The rejection region is finally obtained by inverting the signs of the inequalities. For instance, in our normal distribution example, the hypothesis H0: μ = 0 will be rejected against the alternative hypothesis H1: μ ≠ 0 when the observed value of $\bar{X}$ falls outside the interval

$$\left( 0 - z_{1-\alpha/2}\sqrt{\mathrm{Var}(\bar{X})},\; 0 + z_{1-\alpha/2}\sqrt{\mathrm{Var}(\bar{X})} \right).$$

The probability α has to be specified a priori and is called the significance level of the procedure. It corresponds to the probability of a type I error, that is, the probability of rejecting the null hypothesis when it is actually true. A common choice is α = 0.05, which corresponds to a confidence level of 0.95. The significance level is obtained, in this case, by summing two probabilities relative to the random variable $\bar{X}$: the probability that $\bar{X} < 0 - z_{1-\alpha/2}\sqrt{\mathrm{Var}(\bar{X})}$ and the probability that $\bar{X} > 0 + z_{1-\alpha/2}\sqrt{\mathrm{Var}(\bar{X})}$. Notice that the rejection region is derived by setting μ = 0; the significance level is calculated under the same assumption. These are general facts: statistical tests are usually derived under the assumption that the null hypothesis is true. This is expressed by saying that the test holds 'under the null hypothesis'. The limits of the interval are known as critical values. If the alternative hypothesis were one-sided, the rejection region would correspond to only one inequality. For example, if H1: μ > 0, the rejection region would be defined by the inequality $\bar{X} > 0 + z_{1-\alpha}\sqrt{\mathrm{Var}(\bar{X})}$. The critical value is different because the significance level is now obtained by considering only one probability.
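A brief sketch (an added illustration, with a made-up sample and σ assumed known) of the two-sided test of H0: μ = 0 using the rejection region just described:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -0.1, 0.5, 0.2, 0.4, 0.1, 0.6, 0.0])  # hypothetical sample
sigma, alpha = 0.5, 0.05                                  # sigma assumed known

x_bar = x.mean()
se = sigma / np.sqrt(len(x))
z_crit = norm.ppf(1 - alpha / 2)

# Reject H0: mu = 0 when X_bar falls outside (0 - z*se, 0 + z*se)
reject = abs(x_bar) > z_crit * se
print(x_bar, z_crit * se, reject)
```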

There are other methods for deriving rejection rules. An alternative approach to testing the validity of a certain null hypothesis is to calculate the p-value of the test. The p-value is the probability, calculated under the null hypothesis, of observing a test statistic at least as extreme as the one actually observed, where 'more extreme' means in the direction of the alternative hypothesis. For a two-sided hypothesis, the p-value is usually taken to be twice the one-sided p-value. In our normal distribution example, the test statistic is $\bar{X}$. Let $\bar{x}$ be the observed sample value of $\bar{X}$. The p-value would then be equal to twice the probability that $\bar{X}$ is greater than $\bar{x}$: p-value = 2P($\bar{X} > \bar{x}$). A small p-value indicates that $\bar{x}$ is far from the null hypothesis, which is thus rejected; a large p-value means that the null hypothesis cannot be rejected. The threshold value is usually the significance level of the test, which is chosen in advance. For instance, if the chosen significance level is α = 0.05, a p-value of 0.03 indicates that the null hypothesis can be rejected, whereas a p-value of 0.18 indicates that the null hypothesis cannot be rejected.
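Finally, a sketch added for illustration (reusing the hypothetical sample and known σ from the previous sketch) that computes the two-sided p-value for H0: μ = 0:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -0.1, 0.5, 0.2, 0.4, 0.1, 0.6, 0.0])  # hypothetical sample
sigma = 0.5                                               # assumed known

x_bar = x.mean()
se = sigma / np.sqrt(len(x))

# Two-sided p-value: twice the probability of a value more extreme than |x_bar|
p_value = 2 * (1 - norm.cdf(abs(x_bar) / se))
print(p_value)  # compare with the chosen significance level, e.g. 0.05
```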