Statistical models - Uncertainty measures and inference

Statistical data mining

5.1 Uncertainty measures and inference

5.1.2 Statistical models

Suppose that, for the problem at hand, we have defined all the possible elementary events, which form the set $\Omega$, as well as the event space $\mathcal{A}$. Suppose also that, on the basis of one of the operational notions of probability, we have constructed a probability measure $P$. The triplet $(\Omega, \mathcal{A}, P)$ defines a probability space; it is the basis for defining a random variable, and hence for building a statistical model.

Given a probability space $(\Omega, \mathcal{A}, P)$, a random variable is any function $X(\omega)$, $\omega \in \Omega$, with values on the real line. The cumulative distribution of a random variable $X$, denoted by $F$, is a function defined on the real line, with values in $[0,1]$, that satisfies $F(x) = P(X \le x)$ for any real number $x$. The cumulative distribution function, often called the distribution function, characterises the probability distribution of $X$. It is the main tool for defining a statistical model of the uncertainty in a variable $X$.

We now examine two important special cases of random variables and look at their distribution functions. A random variable is discrete if it can take only a finite, or countable, set of values. In this case

$$F(x) = \sum_{u \le x} p(u), \quad \text{with } p(x) = P(X = x).$$

Therefore in this case p(x), called the discrete probability function, also characterises the distribution. Both quantitative discrete variables and qualitative variables can be modelled using a discrete random variable, provided that numerical codes are assigned to qualitative variables. They are collectively known as categorical random variables.
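As a minimal sketch of how the discrete probability function determines the distribution function, consider a fair six-sided die. The die, the names `values`, `p` and `cdf` are illustrative assumptions, not part of the original text.

```python
# Hypothetical example: a fair six-sided die, X in {1, ..., 6}.
values = [1, 2, 3, 4, 5, 6]
p = {x: 1 / 6 for x in values}  # discrete probability function p(x) = P(X = x)

def cdf(x):
    """F(x) = P(X <= x): sum p(u) over all support points u <= x."""
    return sum(prob for u, prob in p.items() if u <= x)

print(cdf(0))    # below the support, so no probability has accumulated
print(cdf(3))    # P(X <= 3)
print(cdf(3.5))  # F is a step function: same value as F(3)
print(cdf(6))    # the whole support
```

The distribution function is a step function that jumps by $p(x)$ at each value $x$ in the support, which is why either $p$ or $F$ can be used to characterise the model.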

A random variable is said to be continuous if there exists a function $f$, called the density function, such that the distribution function can be obtained from it:

$$F(x) = \int_{-\infty}^{x} f(u)\,du \quad \text{for any real number } x.$$

Furthermore, the density function has these two properties:

$$f(x) \ge 0 \;\; \forall x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1.$$

In view of its deﬁnition, the density function characterises a statistical model for continuous random variables.
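The relation between a density and its distribution function can be checked numerically. The sketch below assumes, purely for illustration, an exponential density with rate 1 and a simple midpoint rule; neither choice comes from the original text.

```python
import math

def f(u):
    """Hypothetical density: exponential with rate 1, zero for u < 0."""
    return math.exp(-u) if u >= 0 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def F(x):
    """F(x) = integral of f from -infinity to x; f vanishes below 0."""
    return integrate(f, 0.0, x) if x > 0 else 0.0

total = integrate(f, 0.0, 50.0)  # the density is negligible beyond 50
print(total)                     # close to 1
print(F(2.0), 1 - math.exp(-2))  # numerical F against the known closed form
```

For this particular density the distribution function is known exactly, $F(x) = 1 - e^{-x}$, which makes it a convenient check on the numerical integration.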

By replacing relative frequencies with probabilities, we can treat random variables like the statistical variables in Chapter 3. For instance, the discrete probability function can be taken as the limiting relative frequency of a discrete random variable. On the other hand, the density function corresponds to the height of the histogram of a continuous variable. Consequently, the concepts in Chapter 3 – mean, variance, correlation, association, etc. – carry over to random variables. For instance, the mean of a random variable, usually called the expected value, is deﬁned by

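$$\mu = \sum_i x_i p_i \quad \text{if } X \text{ is categorical},$$

$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{if } X \text{ is continuous}.$$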
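Both forms of the expected value can be computed in a few lines. The example distributions (a fair die and the density $f(x) = 2x$ on $[0, 1]$) are assumptions chosen only because their means, 3.5 and 2/3, are easy to verify by hand.

```python
# Categorical case: mu = sum_i x_i * p_i, for a hypothetical fair die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
mu_discrete = sum(x * p for x, p in zip(values, probs))

# Continuous case: mu = integral of x * f(x) dx,
# for the hypothetical density f(x) = 2x on [0, 1].
def f(x):
    return 2 * x if 0 <= x <= 1 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

mu_continuous = integrate(lambda x: x * f(x), 0.0, 1.0)

print(mu_discrete)    # 3.5 up to rounding
print(mu_continuous)  # close to 2/3
```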

The concept of a random variable can be extended to cover random vectors or other random elements, thereby deﬁning a more complex statistical model. From here on, we use notation for random variables, but without loss of generality.

In general, a statistical model of uncertainty can be defined by the pair $(X, F(x))$, where $X$ is a random variable and $F(x)$ is the cumulative distribution attached to it. It is often convenient to specify $F$ directly, choosing it from a catalogue of models available in the statistical literature, models which have been constructed specifically for certain problems. These models can be divided into three main classes: parametric models, for which the cumulative distribution is completely specified by a finite set of parameters, denoted by $\theta$; non-parametric models, which require the whole of $F$ to be specified; and semiparametric models, where the specification of $F$ is eased by having some parameters, although these parameters do not fully specify the model.

We now examine the most used parametric model, the Gaussian distribution; Section 5.2 looks at non-parametric and semiparametric models. Let $Z$ be a continuous variable with real values. $Z$ is distributed according to a standardised Gaussian (or normal) distribution if its density function is

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$$

This is a bell-shaped distribution (Section 3.1), with most of the probability around its centre, which coincides with the mean, the mode and the median of the distribution (equal to zero for the standardised Gaussian distribution). Since the distribution is symmetric, the probability of having a value greater than a certain positive quantity is equal to the probability of having a value lower than the negative of the same quantity, i.e. $P(Z > 2) = P(Z < -2)$. Having defined the Gaussian as our reference model, we can use it to calculate some probabilities of interest; these probabilities are areas under the density function. We cannot calculate them in closed form, so we must use numerical approximation. In the past this involved statistical tables, but now it can be done with all the main data analysis packages. Here is a financial example.
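The density and the symmetry property can be checked with Python's standard library, used here as one possible stand-in for the data analysis packages mentioned above. The function name `f` mirrors the formula; everything else is an illustrative assumption.

```python
import math
from statistics import NormalDist

def f(z):
    """Standardised Gaussian density written out from its formula."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

Z = NormalDist(mu=0.0, sigma=1.0)  # library version of the same model

print(f(0.0), Z.pdf(0.0))  # the hand-written density agrees with the library

upper_tail = 1 - Z.cdf(2)  # P(Z > 2)
lower_tail = Z.cdf(-2)     # P(Z < -2)
print(upper_tail, lower_tail)  # equal by symmetry, about 0.0228
```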

Consider the valuation of the return of a certain financial activity. Suppose, as is often done in practice, that the future distribution of this return, $Z$, expressed in euros, follows the standardised Gaussian distribution. What is the probability of observing a return greater than 1 euro? To solve this problem it is sufficient to calculate the probability $P(Z > 1)$. The solution is not expressible in closed form, but using statistical software we find that the probability is equal to about 0.159. Now suppose that a financial institution has to allocate an amount of capital to be protected against the risk of facing a loss on a certain portfolio. This is a simplified version of a problem that credit operators face daily: calculating value at risk (VaR). VaR is a statistical index that measures the maximum loss to which a portfolio is exposed in a holding period $t$ and with a fixed level $\alpha$ of desired risk. Let $Z$ be the change in value of the portfolio during the considered period, expressed in standardised terms. The VaR of the portfolio is then the loss (corresponding to a negative return), implicitly defined by

$$P(Z > -\mathrm{VaR}) = 1 - \alpha.$$

Suppose the desired level of risk is 5%. This corresponds to fixing the right-hand side at 0.95; the area under the standardised density curve to the right of the value VaR (equal, by symmetry, to the area to the left of the value $-\mathrm{VaR}$) is then equal to 0.05. Therefore the VaR is given by the point on the x-axis of the graph that corresponds to this area. The equation has no closed-form solution, but statistical software easily computes that VaR = 1.64. Figure 5.1 illustrates the calculation. The histogram shows the observed returns and the continuous line is the standard Gaussian distribution, used to calculate the VaR. In quantitative risk management this approach is known as the analytic approach or the delta normal approach, in contrast to simulation-based methods.
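The two numbers quoted in this example can be reproduced with a short sketch. It uses `statistics.NormalDist` from the Python standard library as one possible choice of software; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standardised Gaussian: mean 0, standard deviation 1

# Probability of a return greater than 1 euro: P(Z > 1) = 1 - F(1).
p_gain = 1 - Z.cdf(1)
print(round(p_gain, 3))  # about 0.159

# VaR at risk level alpha: the point with area alpha to its right,
# i.e. the (1 - alpha) quantile of the standardised Gaussian.
alpha = 0.05
var_95 = Z.inv_cdf(1 - alpha)
print(round(var_95, 3))  # about 1.645, quoted as 1.64 in the text

# By symmetry the same area lies to the left of -VaR.
print(Z.cdf(-var_95))    # about 0.05
```

The quantile function `inv_cdf` performs the numerical inversion that the text attributes to statistical software.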

So far we have considered the standardised Gaussian distribution, with mean 0 and variance 1. It is possible to obtain a family of Gaussian distributions that differ only in their values for mean and variance. In other words, the Gaussian distribution is a parametric statistical model, parameterised by two parameters. Formally, if $Z$ is a standard Gaussian random variable and $X = \sigma Z + \mu$, then $X$ is distributed according to a Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

The family of Gaussian distributions is closed with respect to linear transformations; that is, any linear transformation of a Gaussian variable is also Gaussian. As a result, the Gaussian distribution is well suited to situations in which we hypothesize linear relationships among variables.
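Closure under linear transformations can be illustrated with `statistics.NormalDist`, which supports multiplication by a constant and addition of a constant. The parameter values below are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist(0.0, 1.0)  # standard Gaussian

mu, sigma = 3.0, 2.0      # hypothetical mean and standard deviation
X = Z * sigma + mu        # X = sigma * Z + mu
print(X.mean, X.variance) # mean mu, variance sigma squared

# A further linear transformation of X stays in the Gaussian family.
Y = X * (-0.5) + 10.0
print(Y.mean, Y.stdev)    # mean 10 - 0.5 * mu, standard deviation 0.5 * sigma
```

Note that the standard deviation scales with the absolute value of the multiplier, so the sign of the coefficient affects only the mean.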

Our definition of the Gaussian distribution can be extended to the multivariate case. The resulting distribution is the main statistical model for the inferential analysis of continuous random vectors. For simplicity, we present the bivariate case. A bidimensional random vector $(X_1, X_2)$ is distributed as a bivariate Gaussian distribution if there exist six real constants,

$$a_{ij},\; 1 \le i, j \le 2, \qquad \mu_i,\; i = 1, 2,$$

and two independent standardised Gaussian random variables, $Z_1$ and $Z_2$, such that

$$X_1 = \mu_1 + a_{11} Z_1 + a_{12} Z_2, \qquad X_2 = \mu_2 + a_{21} Z_1 + a_{22} Z_2.$$

In matrix terms, the previous equations can be stated as $X = \mu + AZ$, which easily extends to the multivariate case. In general, a multivariate Gaussian distribution is completely specified by two parameters, the mean vector $\mu$ and the variance–covariance matrix $\Sigma = AA^{\top}$.
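The construction $X = \mu + AZ$ translates directly into a simulation. The sketch below uses hypothetical values for $\mu$ and $A$, computes $\Sigma = AA^{\top}$ exactly, and compares it with the sample covariance of simulated draws; the seed and sample size are arbitrary choices.

```python
import random

mu = [1.0, -2.0]              # hypothetical mean vector
A = [[2.0, 0.0], [1.0, 1.0]]  # hypothetical constants a_ij

# Variance-covariance matrix implied by the construction: Sigma = A A^T.
Sigma = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
print(Sigma)  # [[4.0, 2.0], [2.0, 2.0]]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def draw():
    """One draw of (X1, X2) from two independent standardised Gaussians."""
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [mu[i] + A[i][0] * z[0] + A[i][1] * z[1] for i in range(2)]

sample = [draw() for _ in range(100_000)]
n = len(sample)
mean = [sum(x[i] for x in sample) / n for i in range(2)]
cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in sample) / (n - 1)
        for j in range(2)] for i in range(2)]
print(mean)  # close to mu
print(cov)   # close to Sigma
```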

Using the Gaussian distribution, we can derive three distributions of special importance for inferential analysis: the chi-squared distribution, the Student’s t distribution and the $F$ distribution.

The chi-squared distribution is obtained from a standardised Gaussian distribution. If $Z$ is a standardised Gaussian random variable, the random variable defined by $Z^2$ is said to follow a chi-squared distribution with 1 degree of freedom; it is indicated by the symbol $\chi^2(1)$. More generally, a parametric family of chi-squared distributions, indexed by one parameter, is obtained from the fact that the sum of $n$ independent $\chi^2(1)$ random variables follows a chi-squared distribution with $n$ degrees of freedom, $\chi^2(n)$. The chi-squared distribution has positive density only for positive real values. Probabilities from it have to be calculated numerically, as for the Gaussian distribution. Finally, a $\chi^2(n)$ random variable has an expected value equal to $n$ and a variance equal to $2n$.
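The definition lends itself to a simulation check: summing $n$ squared standard Gaussian draws should give a sample with mean near $n$ and variance near $2n$. The degrees of freedom, seed and sample size below are arbitrary assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 5                    # hypothetical degrees of freedom

def chi2_draw(df):
    """One chi-squared draw: sum of df squared standard Gaussian draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

sample = [chi2_draw(n) for _ in range(100_000)]
size = len(sample)
mean = sum(sample) / size
var = sum((x - mean) ** 2 for x in sample) / (size - 1)

print(mean)         # close to n
print(var)          # close to 2n
print(min(sample))  # strictly positive: the support is the positive reals
```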

The Student’s t distribution is characterised by a density symmetric around zero, like the Gaussian distribution but more peaked (i.e. with a higher kurtosis). It is described by one parameter, the degrees of freedom, $n$. As $n$ increases, the Student’s t distribution approaches the Gaussian distribution. Formally, let $Z$ be a standard Gaussian (normal) random variable, in symbols $Z \sim N(0,1)$, and let $U$ be a chi-squared random variable with $n$ degrees of freedom, $U \sim \chi^2(n)$. If $Z$ and $U$ are independent, then

$$T = \frac{Z}{\sqrt{U/n}} \sim t(n),$$

that is, $T$ follows a Student’s t distribution with $n$ degrees of freedom. It can be shown that the Student’s t distribution has an expected value of 0 and a variance given by

$$\mathrm{Var}(T) = \frac{n}{n-2} \quad \text{for } n > 2.$$
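A simulation following the construction $T = Z / \sqrt{U/n}$ gives a rough check of the stated mean and variance. The degrees of freedom, seed and sample size are assumptions made for illustration.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible
n = 10                  # hypothetical degrees of freedom

def t_draw(df):
    """One Student's t draw built from independent Gaussian and chi-squared draws."""
    z = rng.gauss(0.0, 1.0)
    u = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))  # chi-squared with df d.o.f.
    return z / math.sqrt(u / df)

sample = [t_draw(n) for _ in range(100_000)]
size = len(sample)
mean_t = sum(sample) / size
var_t = sum((x - mean_t) ** 2 for x in sample) / (size - 1)

print(mean_t)              # close to 0
print(var_t, n / (n - 2))  # sample variance against n / (n - 2)
```

The variance exceeds 1 for every finite $n$, reflecting tails heavier than the Gaussian, and it tends to 1 as $n$ grows.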

Finally, the $F$ distribution is also asymmetric and defined only for positive values, like the chi-squared distribution. It is obtained as the ratio between two independent chi-squared random variables, $U$ and $V$, with degrees of freedom $m$ and $n$, respectively, each divided by its degrees of freedom:

$$F = \frac{U/m}{V/n}.$$

The $F$ distribution is therefore described by two parameters, $m$ and $n$; it has an expected value equal to $n/(n-2)$ (for $n > 2$) and a variance that is a function of both $m$ and $n$. An $F$ distribution with $m = 1$ is equal to the square of a Student’s t with $n$ degrees of freedom.
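The ratio construction and the link with the Student’s t can both be sketched by simulation. The degrees of freedom, seed and sample size are hypothetical choices; the final check holds by construction, since the same Gaussian and chi-squared draws are reused.

```python
import math
import random

rng = random.Random(2003)  # fixed seed so the sketch is reproducible

def chi2_draw(df):
    """One chi-squared draw: sum of df squared standard Gaussian draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_draw(m, n):
    """One F(m, n) draw: two independent chi-squared draws, each over its d.o.f."""
    return (chi2_draw(m) / m) / (chi2_draw(n) / n)

m, n = 4, 12  # hypothetical degrees of freedom
sample = [f_draw(m, n) for _ in range(100_000)]
mean_f = sum(sample) / len(sample)
print(mean_f, n / (n - 2))  # sample mean against n / (n - 2)

# Case m = 1: the same draws z and v give F = T squared.
z = rng.gauss(0.0, 1.0)
v = chi2_draw(n)
t = z / math.sqrt(v / n)
f_1n = (z ** 2 / 1) / (v / n)
print(f_1n, t ** 2)         # equal up to floating-point rounding
```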

In document Applied Data Mining Statistical Methods for Business and Industry Giudici P (2003) pdf (Page 146-151)