Methods for Comparison
7.3.3 Information theoretic measures
For the most part, the statistical measures above were based on the assumption of continuous attributes. The measures we discuss now are motivated by information theory and are most appropriate for discrete (and indeed categorical) attributes, although they are able to deal with continuous attributes also. For this reason, these measures are very much used by the machine learning community, and are often used as a basis for splitting criteria when building decision trees. They correspond to the deviance statistics that arise in the analysis of contingency tables (McCullagh & Nelder, 1989). For a basic introduction to the subject of information theory, see, for example, Jones (1979).
Entropy of attributes, $\bar{H}(X)$
Entropy is a measure of randomness in a random variable. In general terms the entropy $H(X)$ of a discrete random variable $X$ is defined as the sum
$$H(X) = -\sum_i q_i \log q_i$$
where $q_i$ is the probability that $X$ takes on the $i$-th value. Conventionally, logarithms are to base 2, and entropy is then said to be measured in units called "bits" (binary information units). In what follows, all logarithms are to base 2. The special cases to remember are:
Equal probabilities (uniform distribution). The entropy of a discrete random variable is maximal when all $q_i$ are equal. If there are $d$ possible values for $X$, the maximal entropy is $\log d$.
Continuous variable with given variance $\sigma^2$. Maximal entropy is attained for normal variables, and this maximal entropy is $0.5 \log(2\pi e \sigma^2)$.
In the context of classification schemes, the point to note is that an attribute that does not vary at all, and therefore has zero entropy, contains no information for discriminating between classes.
The entropy of a collection of attributes is not simply related to the individual entropies, but, as a basic measure, we can average the entropy over all the attributes and take this as a global measure of entropy of the attributes collectively. Thus, as a measure of entropy of the attributes we take $H(X)$ averaged over all attributes $X_1, \ldots, X_p$:
$$\bar{H}(X) = p^{-1} \sum_i H(X_i)$$
This measure is strictly appropriate only for independent attributes.
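The averaged attribute entropy can be illustrated with a minimal sketch. It assumes the attributes are already discrete and that probabilities are estimated by relative frequencies; the function names and the toy data are hypothetical and are not taken from the original study.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Entropy (base 2) of the empirical distribution of discrete values."""
    n = len(values)
    counts = Counter(values)
    # Sum of q_i * log2(1 / q_i), with q_i estimated as count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mean_attribute_entropy(columns):
    """Average of the individual attribute entropies."""
    return sum(entropy_bits(col) for col in columns) / len(columns)

# Hypothetical toy data: three discrete attributes observed on eight examples.
x1 = [0, 0, 0, 0, 1, 1, 1, 1]    # two equally likely values
x2 = ["a", "b", "c", "d"] * 2     # four equally likely values
x3 = [7] * 8                      # constant attribute, no information

for name, col in [("x1", x1), ("x2", x2), ("x3", x3)]:
    print(name, entropy_bits(col))
print("mean", mean_attribute_entropy([x1, x2, x3]))
```

The uniform attributes attain the maximal values $\log 2$ and $\log 4$, and the constant attribute has zero entropy, as in the special cases listed earlier.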
The definition of entropy for continuous distributions is analogous to the discrete case, with an integral replacing the summation term. This definition is no use for empirical data, however, unless some very drastic assumptions are made (for example assuming that the data have a normal distribution), and we are forced to apply the discrete definition to all empirical data. For the measures defined below, we discretised all numerical data into equal-length intervals. The number of intervals was chosen so that there was a fair expectation that there would be about ten observations per cell in the two-way table of attribute by class. As there are $kq$ cells in a two-way table of attribute (with $k$ discrete levels) by class (with $q$ classes), and there are $N$ examples, this means choosing $k \approx N/(10q)$. The simplest, but not the best, procedure is to divide the range of the attribute into $k$ equal intervals. A more refined procedure would have the number and width of intervals varying from attribute to attribute, and from dataset to dataset. Unless the data are very extensive, the estimated entropies, even for discrete variables, are likely to be severely biased. Blyth (1958) discusses methods of reducing the bias.
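The simple equal-width procedure can be sketched as follows. The function names are hypothetical, and rounding $N/(10q)$ down with a minimum of one interval is an assumption made for the example; the text does not say how the value was rounded.

```python
def n_intervals(n_examples, n_classes, per_cell=10):
    """Number of intervals k ~ N / (10 q); floor and minimum of 1 are assumptions."""
    return max(1, n_examples // (per_cell * n_classes))

def equal_width_bins(values, k):
    """Map each numeric value to an interval index 0..k-1 over the observed range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)          # a constant attribute has a single level
    width = (hi - lo) / k
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical example: 600 examples and 3 classes give k = 20 intervals.
print(n_intervals(600, 3))

values = [float(v) for v in range(100)]
bins = equal_width_bins(values, 5)
print(bins[0], bins[-1], sorted(set(bins)))
```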
Entropy of classes, $H(C)$
In many of our datasets, some classes have very low probabilities of occurrence, and, for practical purposes, the very infrequent classes play little part in the assessment of classification schemes. It is therefore inappropriate merely to count the number of classes and use this as a measure of complexity. An alternative is to use the entropy $H(C)$ of the class probability distribution:
$$H(C) = -\sum_i \pi_i \log \pi_i$$
where $\pi_i$ is the prior probability for class $A_i$. Entropy is related to the average length of a variable length coding scheme, and there are direct links to decision trees (see Jones, 1979 for example). Since class is essentially discrete, the class entropy $H(C)$ has maximal value when the classes are equally likely, so that $H(C)$ is at most $\log q$, where $q$ is the number of classes. A useful way of looking at the entropy $H(C)$ is to regard $2^{H(C)}$ as an effective number of classes.
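The idea of an effective number of classes can be illustrated with a short sketch. The two sets of prior probabilities are hypothetical, chosen only to contrast a balanced problem with one dominated by a single class.

```python
import math

def class_entropy(priors):
    """Class entropy in bits from prior probabilities (zero priors are skipped)."""
    return sum(p * math.log2(1 / p) for p in priors if p > 0)

balanced = [0.25, 0.25, 0.25, 0.25]   # four equally likely classes
skewed = [0.97, 0.01, 0.01, 0.01]     # four classes, three of them very rare

for priors in (balanced, skewed):
    h = class_entropy(priors)
    print(len(priors), "classes, entropy", h, "effective number", 2 ** h)
```

Both problems have four classes, but the skewed one has an effective number of classes only slightly above one, which reflects how little the rare classes contribute.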
Joint entropy of class and attribute, $H(C,X)$
The joint entropy $H(C,X)$ of two variables $C$ and $X$ is a measure of total entropy of the combined system of variables, i.e. the pair of variables $(C,X)$. If $p_{ij}$ denotes the joint probability of observing class $A_i$ and the $j$-th value of attribute $X$, the joint entropy is defined to be:
$$H(C,X) = -\sum_{ij} p_{ij} \log p_{ij}$$
This is a simple extension of the notion of entropy to the combined system of variables.
Mutual information of class and attribute, $\bar{M}(C,X)$
The mutual information $M(C,X)$ of two variables $C$ and $X$ is a measure of common information or entropy shared between the two variables. If the two variables are independent, there is no shared information, and the mutual information $M(C,X)$ is zero. If $p_{ij}$ denotes the joint probability of observing class $A_i$ and the $j$-th value of attribute $X$, if the marginal probability of class $A_i$ is $\pi_i$, and if the marginal probability of attribute $X$ taking on its $j$-th value is $q_j$, then the mutual information is defined to be:
$$M(C,X) = \sum_{ij} p_{ij} \log\left(\frac{p_{ij}}{\pi_i q_j}\right)$$
Equivalent definitions are:
$$M(C,X) = H(C) + H(X) - H(C,X)$$
$$M(C,X) = H(C) - H(C \mid X)$$
$$M(C,X) = H(X) - H(X \mid C)$$
The conditional entropy $H(C \mid X)$, for example, which we have not yet defined, may be defined formally by the equation in which it appears above, but it has a distinct meaning, namely, the entropy (i.e. randomness or noise) of the class variable that is not removed by knowing the value of the attribute $X$. Minimum mutual information $M(C,X)$ is zero, and this occurs when class and attribute are independent. The maximum mutual information $M(C,X)$ occurs when one of $H(C \mid X)$ or $H(X \mid C)$ is zero. Suppose, for example, that $H(C \mid X)$ is zero. This would mean that the value of class is fixed (non-random) once the value of $X$ is known. Class $C$ is then completely predictable from the attribute $X$, in the sense that attribute $X$ contains all the information needed to specify the class. The corresponding limits of $M(C,X)$ are
$$0 \le M(C,X) \le \min(H(C), H(X))$$
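The joint entropy, the mutual information and its limits can be checked numerically with a minimal sketch. It assumes the data are summarised as a two-way table of counts (class by attribute level) and that probabilities are estimated by relative frequencies; the function names and both tables are hypothetical.

```python
import math

def entropy(probs):
    """Entropy in bits of a collection of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def information_measures(counts):
    """counts[i][j] = number of examples in class i with attribute level j."""
    n = sum(sum(row) for row in counts)
    p = [[c / n for c in row] for row in counts]   # joint probabilities p_ij
    pi = [sum(row) for row in p]                   # class marginals pi_i
    q = [sum(col) for col in zip(*p)]              # attribute marginals q_j
    m = sum(
        pij * math.log2(pij / (pi[i] * q[j]))
        for i, row in enumerate(p)
        for j, pij in enumerate(row)
        if pij > 0
    )
    return {
        "H(C)": entropy(pi),
        "H(X)": entropy(q),
        "H(C,X)": entropy(v for row in p for v in row),
        "M(C,X)": m,
    }

# Attribute is only weakly related to class.
noisy = information_measures([[30, 10], [10, 30]])
# Attribute level determines the class exactly, so H(C|X) = 0.
exact = information_measures([[20, 20, 0], [0, 0, 40]])

for r in (noisy, exact):
    h_c_given_x = r["H(C,X)"] - r["H(X)"]          # conditional entropy H(C|X)
    print(r, "H(C|X) =", h_c_given_x)
```

In the second table the mutual information equals $H(C)$, the upper limit, because knowing the attribute level removes all uncertainty about the class.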
Since there are many attributes, we have tabulated an average of the mutual information $M(C,X)$ taken over all attributes $X_1, \ldots, X_p$:
$$\bar{M}(C,X) = p^{-1} \sum_i M(C,X_i)$$
This average mutual information gives a measure of how much useful information about classes is provided by the average attribute.
Mutual information may be used as a splitting criterion in decision tree algorithms, and is preferable to the gain ratio criterion of C4.5 (Pagallo & Haussler, 1990).
Equivalent number of attributes, EN.attr
The information required to specify the class is $H(C)$, and no classification scheme can be completely successful unless it provides at least $H(C)$ bits of useful information. This information is to come from the attributes taken together, and it is quite possible that the useful information $M(C,X)$ of all attributes together (here $X$ stands for the vector of attributes $(X_1, \ldots, X_p)$) is greater than the sum of the individual informations $M(C,X_1) + \cdots + M(C,X_p)$. However, in the simplest (but most unrealistic) case that all attributes are independent, we would have
$$M(C,X) = M(C,X_1) + \cdots + M(C,X_p)$$
In this case the attributes contribute independent bits of useful information for classification purposes, and we can count up how many attributes would be required, on average, by taking the ratio between the class entropy $H(C)$ and the average mutual information $\bar{M}(C,X)$. Of course, we might do better by taking the attributes with highest mutual information, but the assumption of independent useful bits of information is very dubious in any case, so this simple measure is probably quite sufficient:
$$\text{EN.attr} = \frac{H(C)}{\bar{M}(C,X)}$$
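As a small numerical illustration of this ratio, the sketch below uses made-up values for the class entropy and for the mutual information of four attributes; the function name is hypothetical.

```python
def en_attr(class_entropy, mutual_infos):
    """Equivalent number of attributes: H(C) divided by the mean M(C, X_i)."""
    mean_mi = sum(mutual_infos) / len(mutual_infos)
    return class_entropy / mean_mi

# Hypothetical values in bits: H(C) = 1.5 and four attributes.
h_c = 1.5
mi = [0.5, 0.25, 0.125, 0.125]     # mean mutual information is 0.25 bits
print(en_attr(h_c, mi))
```

Here about six "average" attributes would be needed to supply the 1.5 bits required, even though only four are available, which suggests that the class cannot be fully specified under the independence assumption.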
Noisiness of attributes, NS.ratio
If the useful information is only a small fraction of the total information, we may say that there is a large amount of noise. Thus, take $\bar{M}(C,X)$ as a measure of useful information about class, and $\bar{H}(X) - \bar{M}(C,X)$ as a measure of non-useful information. Then large values of the ratio
$$\text{NS.ratio} = \frac{\bar{H}(X) - \bar{M}(C,X)}{\bar{M}(C,X)}$$
imply a dataset that contains much irrelevant information (noise). Such datasets could be condensed considerably without affecting the performance of the classifier, for example by removing irrelevant attributes, by reducing the number of discrete levels used to specify the attributes, or perhaps by merging qualitative factors. The notation NS.ratio denotes the Noise-Signal-Ratio. Note that this is the reciprocal of the more usual Signal-Noise-Ratio (SNR).
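The noise-signal ratio can be computed from the per-attribute entropies and mutual informations, as in this minimal sketch. The function name and the numbers are hypothetical, and the sketch assumes at least one attribute carries some information so that the denominator is not zero.

```python
def ns_ratio(attr_entropies, mutual_infos):
    """Noise-signal ratio: (mean H(X_i) - mean M(C, X_i)) / mean M(C, X_i)."""
    mean_h = sum(attr_entropies) / len(attr_entropies)
    mean_m = sum(mutual_infos) / len(mutual_infos)
    return (mean_h - mean_m) / mean_m

# Hypothetical values in bits for three attributes.
attr_h = [2.0, 3.0, 1.0]    # mean entropy 2.0
attr_m = [0.5, 0.25, 0.0]   # mean mutual information 0.25
print(ns_ratio(attr_h, attr_m))
```

A ratio of 7 means that, on average, each bit of useful information is accompanied by seven bits that do not help to discriminate between classes.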
Irrelevant attributes
The mutual information $M(C,X_i)$ between class and attribute $X_i$ can be used to judge if attribute $X_i$ could, of itself, contribute usefully to a classification scheme. Attributes with small values of $M(C,X_i)$ would not, by themselves, be useful predictors of class. In this context, interpreting the mutual information as a deviance statistic would be useful, and we can give a lower bound to statistically significant values for mutual information. Suppose that attribute $X$ and class are, in fact, statistically independent, and suppose that $X$ has $k$ distinct levels. Assuming further that the sample size $N$ is large, then it is well known that the deviance statistic $2N M(C,X)$ is approximately equal to the chi-square statistic for testing the independence of attribute and class (for example Agresti, 1990). Therefore $2N M(C,X)$ has an approximate $\chi^2_{(k-1)(q-1)}$ distribution, and order of magnitude calculations indicate that the mutual information contributes significantly (in the hypothesis testing sense) if its value exceeds $(k-1)(q-1)/N$, where $q$ is the number of classes, $N$ is the number of examples, and $k$ is the number of discrete levels for the attribute.
In our measures, $k$ is the number of levels for integer or binary attributes, and for continuous attributes we chose $k = N/(10q)$ (so that, on average, there were about 10 observations per cell in the two-way table of attribute by class), but occasionally the number of levels for so-called continuous attributes was less than $N/(10q)$. If we adopt a critical level for the $\chi^2_{(k-1)(q-1)}$ distribution as twice the number of degrees of freedom, for the sake of argument, we obtain an approximate critical level for the mutual information as $2(q-1)(k-1)/(2N)$. With our chosen value of $k$, this is of order $1/10$ for continuous attributes.
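The rough critical level can be evaluated directly, as in the sketch below. The function names and dataset sizes are hypothetical. The sketch follows the order-of-magnitude argument of the text and ignores constant factors; in particular, the chi-square approximation strictly applies to the deviance computed with natural logarithms, so with base-2 logarithms a factor of $\ln 2$ would enter.

```python
def mi_critical_level(k, q, n):
    """Rough critical level (k - 1)(q - 1) / N for the mutual information."""
    return (k - 1) * (q - 1) / n

def looks_relevant(mi, k, q, n):
    """True if the mutual information exceeds the rough critical level."""
    return mi > mi_critical_level(k, q, n)

# Hypothetical dataset sizes, with k chosen by the rule k = N / (10 q).
for n, q in [(1000, 2), (10000, 10)]:
    k = n // (10 * q)
    print(n, q, k, mi_critical_level(k, q, n))

print(looks_relevant(0.2, 50, 2, 1000))
```

With $k = N/(10q)$ the level is roughly $(q-1)/(10q)$, which lies between $1/20$ and $1/10$ whatever the sample size.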
We have not quoted any measure of this form, as almost all attributes are relevant in this sense (and this measure would have little information content!). In any case, an equivalent measure would be the difference between the actual number of attributes and the value of EN.attr.
Correlated normal attributes
When attributes are correlated, the calculation of information measures becomes much more difficult, so difficult, in fact, that we have avoided it altogether. The above univariate
measures take no account of any lack of independence, and are therefore very crude approximations to reality. There are, however, some simple results concerning the multivariate normal distribution, for which the entropy is $0.5 \log\left((2\pi e)^p |\Sigma|\right)$, where $p$ is the number of variables and $|\Sigma|$ is the determinant of the covariance matrix of the variables. Similar results hold for mutual information, and there are then links with the statistical measures elaborated in Section 7.3.2. Unfortunately, even if such measures were used for our datasets, most datasets are so far from normality that the interpretation of the resulting measures would be very questionable.
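As a worked illustration of such a result (it is not one of the measures tabulated here), consider two jointly normal attributes $X_1$ and $X_2$. The variances $\sigma_1^2$, $\sigma_2^2$ and the correlation $\rho$ are symbols introduced only for this example, and the derivation uses the standard entropy of a $p$-variate normal, $0.5 \log\left((2\pi e)^p |\Sigma|\right)$, with $p = 2$:

```latex
\begin{aligned}
|\Sigma| &= \sigma_1^2 \sigma_2^2 (1 - \rho^2), \\
H(X_1, X_2) &= 0.5 \log\left( (2\pi e)^2 \, \sigma_1^2 \sigma_2^2 (1 - \rho^2) \right) \\
            &= H(X_1) + H(X_2) + 0.5 \log(1 - \rho^2), \\
M(X_1, X_2) &= H(X_1) + H(X_2) - H(X_1, X_2) = -0.5 \log(1 - \rho^2).
\end{aligned}
```

The mutual information therefore depends only on the correlation: it is zero when $\rho = 0$ and grows without bound as $|\rho|$ approaches 1, which is the kind of link with correlation-based statistical measures mentioned above.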
7.4 PRE-PROCESSING
Usually there is no control over the form or content of the vast majority of datasets. Generally, they are already converted from whatever raw data was available into some “suitable” format, and there is no way of knowing if the manner in which this was done was consistent, or perhaps chosen to fit in with some pre-conceived type of analysis. In some datasets, it is very clear that some very drastic form of pre-processing has already been done – see Section 9.5.4, for example.