Statistical data mining
5.1 Uncertainty measures and inference
5.1.2 Statistical models
Suppose that, for the problem at hand, we have defined all the possible elementary events, which make up the sample space $\Omega$, as well as the event space $\mathcal{A}$. Suppose also that, on the basis of one of the operational notions of probability, we have constructed a probability measure $P$. The triplet $(\Omega, \mathcal{A}, P)$ defines a probability space; it is the basis for defining a random variable, and hence for building a statistical model.
Given a probability space $(\Omega, \mathcal{A}, P)$, a random variable is any function $X(\omega)$, $\omega \in \Omega$, with values on the real line. The cumulative distribution of a random variable $X$, denoted by $F$, is a function defined on the real line, with values in $[0, 1]$, that satisfies $F(x) = P(X \le x)$ for any real number $x$. The cumulative distribution function, often called the distribution function, characterises the probability distribution of $X$. It is the main tool for defining a statistical model of the uncertainty in a variable $X$.
We now examine two important special cases of random variables and look at their distribution functions. A random variable is discrete if it can take only a finite, or countable, set of values. In this case
$$F(x) = \sum_{u \le x} p(u), \qquad \text{with } p(x) = P(X = x)$$
Therefore in this case $p(x)$, called the discrete probability function, also characterises the distribution. Both quantitative discrete variables and qualitative variables can be modelled using a discrete random variable, provided that numerical codes are assigned to qualitative variables. They are collectively known as categorical random variables.
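As an illustration of the discrete case, the following Python sketch builds the distribution function from a discrete probability function by summing $p(u)$ over the values $u \le x$. The fair six-sided die and the names `pmf` and `cdf` are assumptions made only for this example; they do not come from the text.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, so p(x) = 1/6 for x = 1, ..., 6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x, pmf):
    """F(x) = sum of p(u) over all values u in the support with u <= x."""
    return sum((p for u, p in pmf.items() if u <= x), Fraction(0))

print(cdf(3, pmf))    # P(X <= 3) = 1/2
print(cdf(0.5, pmf))  # below the smallest value, so 0
print(cdf(6, pmf))    # the whole support, so 1
```

Exact fractions are used here so that the probabilities sum to exactly 1; with floating-point probabilities the same function works, up to rounding error.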
A random variable is said to be continuous if there exists a function $f$, called the density function, such that the distribution function can be obtained from it:
$$F(x) = \int_{-\infty}^{x} f(u)\,du \quad \text{for any real number } x$$
Furthermore, the density function has these two properties:
$$f(x) \ge 0 \quad \forall x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$
In view of its definition, the density function characterises a statistical model for continuous random variables.
By replacing relative frequencies with probabilities, we can treat random variables like the statistical variables in Chapter 3. For instance, the discrete probability function can be taken as the limiting relative frequency of a discrete random variable. On the other hand, the density function corresponds to the height of the histogram of a continuous variable. Consequently, the concepts in Chapter 3 (mean, variance, correlation, association, etc.) carry over to random variables. For instance, the mean of a random variable, usually called the expected value, is defined by
$$\mu = \sum_i x_i p_i \quad \text{if } X \text{ is categorical}$$
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{if } X \text{ is continuous}$$
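The two definitions of the expected value can be checked numerically. The sketch below is only illustrative: the fair die, the uniform density on $[0, 2]$ and the function names are assumptions, and the integral is approximated with a simple midpoint rule rather than computed exactly.

```python
def discrete_mean(values, probs):
    """mu = sum_i x_i * p_i for a discrete (categorical) random variable."""
    return sum(x * p for x, p in zip(values, probs))

def continuous_mean(density, lower, upper, steps=100_000):
    """mu = integral of x * f(x) dx over [lower, upper], by the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * width   # midpoint of the k-th subinterval
        total += x * density(x) * width
    return total

# Hypothetical examples: a fair die, and a uniform density f(x) = 1/2 on [0, 2].
die_mean = discrete_mean(range(1, 7), [1 / 6] * 6)
uniform_mean = continuous_mean(lambda x: 0.5, 0.0, 2.0)

print(round(die_mean, 6))      # 3.5
print(round(uniform_mean, 6))  # 1.0
```

For a density with unbounded support, the integration limits would have to be truncated where the density is negligible; that choice is left to the reader.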
The concept of a random variable can be extended to cover random vectors or other random elements, thereby defining a more complex statistical model. From here on, we use notation for random variables, but without loss of generality.
In general, a statistical model of uncertainty can be defined by the pair $(X, F(x))$, where $X$ is a random variable and $F(x)$ is the cumulative distribution attached to it. It is often convenient to specify $F$ directly, choosing it from a catalogue of models available in the statistical literature, models which have been constructed specifically for certain problems. These models can be divided into three main classes: parametric models, for which the cumulative distribution is completely specified by a finite set of parameters, denoted by $\theta$; non-parametric models, which require $F$ to be specified in full; and semiparametric models, where the specification of $F$ is eased by having some parameters, but these parameters do not fully specify the model.
We now examine the most used parametric model, the Gaussian distribution; Section 5.2 looks at non-parametric and semiparametric models. Let $Z$ be a continuous variable with real values. $Z$ is distributed according to a standardised Gaussian (or normal) distribution if the density function is
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
This is a bell-shaped distribution (Section 3.1), with most of the probability around its centre, which coincides with the mean, the mode and the median of the distribution (equal to zero for the standardised Gaussian distribution). Since the distribution is symmetric, the probability of having a value greater than a certain positive quantity is equal to the probability of having a value lower than the negative of the same quantity, i.e. $P(Z > 2) = P(Z < -2)$. Having defined the Gaussian as our reference model, we can use it to calculate some probabilities of interest; these probabilities are areas under the density function. We cannot calculate them in closed form, so we must use numerical approximation. In the past this involved statistical tables, but now it can be done with all the main data analysis packages. Here is a financial example.
Consider the valuation of the return of a certain financial activity. Suppose, as is often done in practice, that the future distribution of this return, $Z$, expressed in euros, follows the standardised Gaussian distribution. What is the probability of observing a return greater than 1 euro? To solve this problem it is sufficient to calculate the probability $P(Z > 1)$. The solution is not expressible in closed form, but using statistical software we find that the probability is equal to about 0.159. Now suppose that a financial institution has to allocate an amount of capital to be protected against the risk of facing a loss on a certain portfolio. This is a simplified version of a problem that credit operators face daily: calculating value at risk (VaR). VaR is a statistical index that measures the maximum loss to which a portfolio is exposed in a holding period $t$ and with a fixed level $\alpha$ of desired risk. Let $Z$ be the change in value of the portfolio during the considered period, expressed in standardised terms. The VaR of the portfolio is then the loss (corresponding to a negative return), implicitly defined by
$$P(Z \ge -\mathrm{VaR}) = 1 - \alpha$$
Suppose the desired level of risk is 5%. This corresponds to fixing the right-hand side at 0.95; the value of the area under the standardised density curve to the right of the value VaR (i.e. to the left of the value $-\mathrm{VaR}$) is then equal to 0.05. Therefore the VaR is given by the point on the x-axis of the graph that corresponds to this area. The equation has no closed-form solution, but statistical software easily computes that VaR is approximately 1.64. Figure 5.1 illustrates the calculation. The histogram shows the observed returns and the continuous line is the standard Gaussian distribution, used to calculate the VaR. In quantitative risk management this approach is known as the analytic approach or the delta normal approach, in contrast to simulation-based methods.
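The two calculations in this example can be reproduced with any statistical package. As one possible sketch, the following uses `NormalDist` from Python's standard `statistics` module (our choice of tool, not the software used in the text), and reads the VaR at the 5% level as the 0.95 quantile of the standardised Gaussian. The variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standardised Gaussian

# P(Z > 1): area under the density to the right of 1.
p_return_above_one = 1.0 - Z.cdf(1.0)

# Symmetry of the distribution: P(Z > 2) equals P(Z < -2).
upper_tail = 1.0 - Z.cdf(2.0)
lower_tail = Z.cdf(-2.0)

# VaR at risk level alpha = 0.05: the point with 5% of the area to its right,
# which is the 0.95 quantile of the standardised Gaussian.
alpha = 0.05
var_95 = Z.inv_cdf(1.0 - alpha)

print(round(p_return_above_one, 3))   # 0.159
print(round(var_95, 2))               # 1.64
```

The quantile function `inv_cdf` performs the numerical inversion that once required statistical tables.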
So far we have considered the standardised Gaussian distribution, with mean 0 and variance 1. It is possible to obtain a family of Gaussian distributions that differ only in their values for mean and variance. In other words, the Gaussian distribution is a parametric statistical model, parameterised by two parameters. Formally, if $Z$ is a standard Gaussian random variable and $X = \sigma Z + \mu$, then $X$ is distributed according to a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. The family of Gaussian distributions is closed with respect to linear transformations; that is, any linear transformation of a Gaussian variable is also Gaussian. As a result, the Gaussian distribution is well suited to situations in which we hypothesize linear relationships among variables.
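The relationship $X = \sigma Z + \mu$ means that any probability for $X$ can be read off the standardised Gaussian after subtracting the mean and dividing by the standard deviation. A minimal sketch of this, again using the standard library's `NormalDist` and with hypothetical values for $\mu$, $\sigma$ and $x$:

```python
from statistics import NormalDist

mu, sigma = 2.0, 3.0           # hypothetical mean and standard deviation
Z = NormalDist(0.0, 1.0)       # standardised Gaussian
X = NormalDist(mu, sigma)      # Gaussian with mean mu and variance sigma ** 2

# NormalDist supports linear transformations by constants directly.
transformed = Z * sigma + mu

# Since X = sigma * Z + mu, the event {X <= x} is the event {Z <= (x - mu) / sigma}.
x = 5.0
p_direct = X.cdf(x)
p_standardised = Z.cdf((x - mu) / sigma)

print(transformed == X)                       # True
print(round(p_direct, 4), round(p_standardised, 4))
```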
Our definition of the Gaussian distribution can be extended to the multivariate case. The resulting distribution is the main statistical model for the inferential analysis of continuous random vectors. For simplicity, here is the bivariate case. A bidimensional random vector $(X_1, X_2)$ is distributed according to a bivariate Gaussian distribution if there exist six real constants,
$$a_{ij}, \; 1 \le i, j \le 2, \qquad \mu_i, \; i = 1, 2,$$
and two independent standardised Gaussian random variables, $Z_1$ and $Z_2$, such that
$$X_1 = \mu_1 + a_{11} Z_1 + a_{12} Z_2$$
$$X_2 = \mu_2 + a_{21} Z_1 + a_{22} Z_2$$
In matrix terms, the previous equations can be stated as $X = \mu + AZ$, which easily extends to the multivariate case. In general, a multivariate Gaussian distribution is completely specified by two parameters, the mean vector $\mu$ and the variance–covariance matrix $\Sigma = AA^{\top}$.
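The construction $X = \mu + AZ$ translates directly into a simulation recipe. The sketch below, in plain Python, computes $AA^{\top}$ for a $2 \times 2$ matrix and draws bivariate Gaussian vectors from two independent standardised Gaussian draws. The constants in `mu` and `A`, the seed and the function names are hypothetical, chosen only for illustration.

```python
import random

def covariance_from_coefficients(A):
    """Variance-covariance matrix A A^T for a 2 x 2 matrix A (list of rows)."""
    return [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sample_bivariate_gaussian(mu, A, rng):
    """Draw one (X1, X2) as mu + A Z, with Z1, Z2 independent N(0, 1)."""
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return tuple(mu[i] + A[i][0] * z[0] + A[i][1] * z[1] for i in range(2))

# Hypothetical constants (two means and four coefficients: six in total).
mu = [1.0, -2.0]
A = [[2.0, 0.0],
     [1.0, 1.0]]
cov = covariance_from_coefficients(A)   # [[4.0, 2.0], [2.0, 2.0]]

rng = random.Random(0)                  # fixed seed for reproducibility
draws = [sample_bivariate_gaussian(mu, A, rng) for _ in range(50_000)]
mean_x1 = sum(d[0] for d in draws) / len(draws)
mean_x2 = sum(d[1] for d in draws) / len(draws)
sample_cov = sum((d[0] - mean_x1) * (d[1] - mean_x2) for d in draws) / (len(draws) - 1)

print(cov)
print(round(mean_x1, 2), round(mean_x2, 2), round(sample_cov, 2))
```

With a large number of draws, the sample means and the sample covariance should be close to `mu` and to the off-diagonal entry of `cov`.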
Using the Gaussian distribution, we can derive three distributions of special importance for inferential analysis: the chi-squared distribution, the Student’s $t$ distribution and the $F$ distribution.
The chi-squared distribution is obtained from a standardised Gaussian distribution. If $Z$ is a standardised Gaussian random variable, the random variable defined by $Z^2$ is said to follow a chi-squared distribution with 1 degree of freedom; it is indicated by the symbol $\chi^2(1)$. More generally, a parametric family of chi-squared distributions, indexed by one parameter, is obtained from the fact that the sum of $n$ independent $\chi^2(1)$ random variables follows a chi-squared distribution with $n$ degrees of freedom, $\chi^2(n)$. The chi-squared distribution has positive density only for positive real values. Probabilities from it have to be calculated numerically, as for the Gaussian distribution. Finally, a $\chi^2(n)$ random variable has an expected value equal to $n$ and a variance equal to $2n$.
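The definition suggests a simple way to simulate chi-squared values: sum $n$ squared independent standardised Gaussian draws. The following sketch does this and compares the sample mean and variance with the theoretical values $n$ and $2n$. The degrees of freedom, sample size and seed are arbitrary choices for illustration.

```python
import random
from statistics import fmean, variance

def chi_squared_draw(n, rng):
    """Sum of n squared independent standardised Gaussian draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

n = 5                          # hypothetical degrees of freedom
rng = random.Random(1)         # fixed seed for reproducibility
sample = [chi_squared_draw(n, rng) for _ in range(50_000)]

sample_mean = fmean(sample)    # theoretical value: n
sample_var = variance(sample)  # theoretical value: 2 * n

print(round(sample_mean, 2), round(sample_var, 2))
```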
The Student’s $t$ distribution is characterised by a density symmetric around zero, like the Gaussian distribution but with heavier tails (i.e. with a higher kurtosis). It is described by one parameter, the degrees of freedom, $n$. As $n$ increases, the Student’s $t$ distribution approaches the Gaussian distribution. Formally, let $Z$ be a standard Gaussian (normal) random variable, in symbols $Z \sim N(0, 1)$, and let $U$ be a chi-squared random variable with $n$ degrees of freedom, $U \sim \chi^2(n)$. If $Z$ and $U$ are independent, then
$$T = \frac{Z}{\sqrt{U/n}} \sim t(n)$$
that is, $T$ follows a Student’s $t$ distribution with $n$ degrees of freedom. It can be shown that the Student’s $t$ distribution has an expected value of 0 and a variance given by
$$\mathrm{Var}(T) = \frac{n}{n-2} \quad \text{for } n > 2$$
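The ratio that defines $T$ can be simulated directly. The sketch below builds each draw from one standardised Gaussian value and an independent chi-squared value, then compares the sample mean and variance with 0 and $n/(n-2)$. The value $n = 10$, the sample size and the seed are illustrative assumptions.

```python
import math
import random
from statistics import fmean, variance

def student_t_draw(n, rng):
    """T = Z / sqrt(U / n), with Z ~ N(0, 1) and U ~ chi-squared(n) independent."""
    z = rng.gauss(0.0, 1.0)
    u = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
    return z / math.sqrt(u / n)

n = 10                          # hypothetical degrees of freedom
rng = random.Random(2)          # fixed seed for reproducibility
sample = [student_t_draw(n, rng) for _ in range(50_000)]

sample_mean = fmean(sample)     # theoretical value: 0
sample_var = variance(sample)   # theoretical value: n / (n - 2) = 1.25

print(round(sample_mean, 2), round(sample_var, 2))
```

Because the variance exceeds 1 for every finite $n$, the simulated values are more spread out than standardised Gaussian draws, in line with the heavier tails of the distribution.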
Finally, the $F$ distribution is also asymmetric and defined only for positive values, like the chi-squared distribution. It is obtained as the ratio between two independent chi-squared random variables, $U$ and $V$, each divided by its degrees of freedom, $m$ and $n$, respectively:
$$F = \frac{U/m}{V/n}$$
The $F$ distribution is therefore described by two parameters, $m$ and $n$; it has an expected value equal to $n/(n-2)$ for $n > 2$ and a variance that is a function of both $m$ and $n$. An $F$ distribution with $m = 1$ is equal to the square of a Student’s $t$ with $n$ degrees of freedom.
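Both statements can be illustrated by simulation. The following sketch draws $F$ values as ratios of scaled chi-squared sums, compares the sample mean with $n/(n-2)$, and then checks the special case $m = 1$ on a single draw, where the same $Z$ and $V$ give $F = T^2$. The degrees of freedom, sample size, seed and function names are assumptions made for the example.

```python
import math
import random
from statistics import fmean

def chi_squared_draw(df, rng):
    """Sum of df squared independent standardised Gaussian draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_draw(m, n, rng):
    """F = (U / m) / (V / n), with U ~ chi-squared(m), V ~ chi-squared(n) independent."""
    u = chi_squared_draw(m, rng)
    v = chi_squared_draw(n, rng)
    return (u / m) / (v / n)

m, n = 4, 12                    # hypothetical degrees of freedom
rng = random.Random(3)          # fixed seed for reproducibility
sample = [f_draw(m, n, rng) for _ in range(50_000)]
sample_mean = fmean(sample)     # theoretical value: n / (n - 2) = 1.2

# Special case m = 1: built from the same Z and V, F equals T squared.
z = rng.gauss(0.0, 1.0)
v = chi_squared_draw(n, rng)
t_value = z / math.sqrt(v / n)      # Student's t with n degrees of freedom
f_value = (z ** 2 / 1) / (v / n)    # F with (1, n) degrees of freedom

print(round(sample_mean, 2))
print(math.isclose(f_value, t_value ** 2))
```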