3. Unlike the perceptron algorithm, the method of least squares computes the decision boundary in one shot.
Figure 2.3 presents the results of the experiment performed on the double-moon patterns for the separation distance d = -4, using the method of least squares. As expected, there is now a noticeable increase in the classification error, namely, 9.5%. Comparing this performance with the 9.3% classification error of the perceptron algorithm for the same setting, which was reported in Fig. 1.10, we see that the classification performance of the method of least squares has degraded slightly.
The important conclusion to be drawn from the pattern-classification computer experiments of Sections 1.5 and 2.5 is as follows:
Although the perceptron and the least-squares algorithms are both linear, they operate differently in performing the task of pattern classification.
FIGURE 2.3 Least-squares classification of the double-moon of Fig. 1.8 with distance d = -4, radius 10, and width 6. [Scatter plot of the two classes in the (x1, x2)-plane.]
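To make the "one-shot" computation concrete, the following sketch fits a linear decision boundary to a double-moon data set by solving the least-squares problem directly. It is a minimal illustration, not the code used for the experiment: the double_moon sampler below is a plausible reconstruction of the generator described in Section 1.5, the sample sizes and random seeds are arbitrary, and the resulting error will not match the 9.5% figure exactly.

```python
import numpy as np

def double_moon(n, radius=10.0, width=6.0, d=-4.0, seed=None):
    """Sample n points from each moon (illustrative reconstruction of the Section 1.5 generator)."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(radius - width / 2, radius + width / 2, size=(2, n))
    theta = rng.uniform(0.0, np.pi, size=(2, n))
    # Upper moon (class +1), centred at the origin, above the x1-axis
    upper = np.c_[r[0] * np.cos(theta[0]), r[0] * np.sin(theta[0])]
    # Lower moon (class -1), shifted right by the radius and vertically by -d
    lower = np.c_[radius + r[1] * np.cos(theta[1]), -d - r[1] * np.sin(theta[1])]
    X = np.vstack([upper, lower])
    y = np.r_[np.ones(n), -np.ones(n)]
    return X, y

# Training and test data for the d = -4 configuration
X_train, y_train = double_moon(1000, d=-4.0, seed=0)
X_test, y_test = double_moon(1000, d=-4.0, seed=1)

# The method of least squares computes the weight vector in one shot,
# by solving the normal equations (here via numpy's least-squares solver).
A = np.c_[np.ones(len(X_train)), X_train]          # design matrix with a bias column
w = np.linalg.lstsq(A, y_train, rcond=None)[0]     # w = (A^T A)^{-1} A^T y

# Classify test points by the sign of the linear discriminant w0 + w^T x
y_hat = np.sign(np.c_[np.ones(len(X_test)), X_test] @ w)
print("classification error: %.1f%%" % (100 * np.mean(y_hat != y_test)))
```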
2.6 THE MINIMUM-DESCRIPTION-LENGTH PRINCIPLE
The representation of a stochastic process by a linear model may be used for synthesis or analysis. In synthesis, we generate a desired time series by assigning a formulated set of values to the parameters of the model and feeding it with white noise of zero mean and prescribed variance; the model so obtained is referred to as a generative model. In
analysis, on the other hand, we estimate the parameters of the model by processing a
given time series of finite length, using the Bayesian approach or the regularized method of least squares. Insofar as the estimation is statistical, we need an appropriate measure of the fit between the model and the observed data. We refer to this second problem as that of model selection. For example, we may want to estimate the number of degrees of freedom (i.e., adjustable parameters) of the model, or even the general structure of the model.
A plethora of methods for model selection has been proposed in the statistics lit- erature, with each one of them having a goal of its own. With the goals being different, it is not surprising to find that the different methods yield wildly different results when they are applied to the same data set (Grünwald, 2007).
In this section, we describe a well-proven method for model selection, called the minimum-description-length (MDL) principle, which was pioneered by Rissanen (1978).
Inspiration for the development of the MDL principle is traced back to
Kolmogorov complexity theory. In this remarkable theory, the great mathematician
Kolmogorov defined complexity as follows (Kolmogorov, 1965; Li and Vitányi, 1993; Cover and Thomas, 2006; Grünwald, 2007):
The algorithmic (descriptive) complexity of a data sequence is the length of the shortest binary computer program that prints out the sequence and then halts.
What is truly amazing about this definition of complexity is the fact that it looks to the computer, the most general form of data compressor, rather than the notion of probability distribution for its basis.
Using the fundamental concept of Kolmogorov complexity, we may develop a theory of idealized inductive inference, the goal of which is to find “regularity” in a given data sequence. The idea of viewing learning as trying to find “regularity” provided the first insight that was used by Rissanen in formulating the MDL principle. The second insight used by Rissanen is that regularity itself may be identified with the “ability to compress.”
Thus, the MDL principle combines these two insights, one on regularity and the other on the ability to compress, to view the process of learning as data compression, which, in turn, teaches us the following:
Given a set of hypotheses ℋ and a data sequence d, we should try to find the particular hypothesis, or some combination of hypotheses, in ℋ that compresses the data sequence d the most.
This statement very succinctly sums up what the MDL principle is all about. The symbol d for a data sequence should not be confused with the symbol d used previously for the desired response.
There are several versions of the MDL principle that have been described in the literature. We will focus on the oldest, but also the simplest and best-known, version, known as the simplistic two-part code MDL principle for probabilistic modeling. By the term "simplistic," we mean that the codelengths under consideration are not determined in an optimal fashion. The terms "code" and "codelengths" used herein pertain to the process of encoding the data sequence in the shortest or least redundant manner.
Suppose that we are given a candidate model or model class ℳ. With all the elements of ℳ being probabilistic sources, we henceforth refer to a point hypothesis as p rather than h. In particular, we look for the probability density function p ∈ ℳ that best explains a given data sequence d. The two-part code MDL principle then tells us to look for the (point) hypothesis p ∈ ℳ that minimizes the description length of p, which we denote by L1(p), and the description length of the data sequence d when it is encoded with the help of p, which we denote as L2(d | p). We thus form the sum

L12(p, d) = L1(p) + L2(d | p)

and pick the particular point hypothesis p ∈ ℳ that minimizes L12(p, d).
It is crucial that p itself be encoded as well here. Thus, in finding the hypothesis that compresses the data sequence d the most, we must encode (describe or compress) the data in such a way that a decoder can retrieve the data even without knowing the hypothesis in advance. This can be done by explicitly encoding a hypothesis, as in the foregoing two-part code principle; it can also be done in quite different ways—for example, by averaging over hypotheses (Grünwald, 2007).
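A toy example may help fix ideas. The sketch below compares a few candidate Bernoulli point hypotheses for a short binary sequence by computing a simplistic two-part codelength: a fixed number of bits for describing the hypothesis, L1(p), plus the ideal codelength of the data under that hypothesis, L2(d | p). Encoding the hypothesis as a single fixed-precision parameter is a deliberately crude choice, in the spirit of the "simplistic" qualifier above, and is not a coding scheme taken from the text.

```python
import numpy as np

def two_part_codelength(d, theta, bits_per_param=8):
    """Simplistic two-part codelength (in bits) of a binary sequence d
    under a Bernoulli(theta) point hypothesis."""
    L1 = bits_per_param                          # L1(p): describe the quantised parameter
    ones = np.sum(d)
    zeros = len(d) - ones
    eps = 1e-12                                  # guard against log(0)
    L2 = -(ones * np.log2(theta + eps) + zeros * np.log2(1 - theta + eps))
    return L1 + L2                               # L12(p, d) = L1(p) + L2(d | p)

d = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1])
candidates = [0.5, 0.75, 0.9]                    # candidate point hypotheses in the model class
best = min(candidates, key=lambda t: two_part_codelength(d, t))
for t in candidates:
    print(f"theta = {t}: L12 = {two_part_codelength(d, t):.2f} bits")
print("pick theta =", best)
```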
Model-Order Selection

Let m(1), m(2), ..., m(k), ... denote a family of linear regression models that are associated with the parameter vector w(k), where the model order k = 1, 2, ...; that is, the weight spaces 𝒲(1), 𝒲(2), ..., 𝒲(k), ... are of increasing dimensionality. The issue of interest is to identify the model that best explains an unknown environment that is responsible for generating the training sample {x_i, d_i}, i = 1, 2, ..., N, where x_i is the stimulus and d_i is the corresponding response. What we have just described is the model-order selection problem.
In working through the statistical characterization of the composite length L12(p, d), the two-part code MDL principle tells us to pick the kth model that minimizes

-log[ p(d_i | w(k)) π(w(k)) ]  +  (k/2) log N + O(k)        (2.37)
        (error term)                (complexity term)

where π(w(k)) is the prior distribution of the parameter vector w(k), and the last term of the expression is of the order of the model order k (Rissanen, 1989; Grünwald, 2007). For a large sample size N, this last term gets overwhelmed by the second term, (k/2) log N. The expression in Eq. (2.37) is usually partitioned into two terms:
• the error term, denoted by -log[ p(d_i | w(k)) π(w(k)) ], which relates to the model and the data;
• the hypothesis complexity term, denoted by (k/2) log N + O(k), which relates to the model alone.
In practice, the O(k) term is often ignored to simplify matters when applying Eq. (2.37), with mixed results. The reason for mixed results is that the O(k) term can be rather large. For linear regression models, however, it can be explicitly and efficiently computed, and the resulting procedures tend to work quite well in practice.
Note also that the expression of Eq. (2.37) without the prior distribution π(w(k)) was first formulated in Rissanen (1978).
If it turns out that we have more than one minimizer of the expression in Eq. (2.37), then we pick the model with the smallest hypothesis complexity term. And if this move still leaves us with several candidate models, then we do not have any further choice but to work with one of them (Grünwald, 2007).
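The sketch below illustrates model-order selection with the two-part criterion of Eq. (2.37) for a family of polynomial regression models, standing in for m(1), m(2), .... Following the common practice noted above, the O(k) term is ignored, and a flat prior is assumed so that -log π(w(k)) reduces to a constant that does not affect the comparison; the data-generating polynomial, noise variance, and sample size are arbitrary illustrative choices.

```python
import numpy as np

def mdl_score(x, d, k, sigma2=1.0):
    """Two-part criterion for a polynomial model of order k:
    -log p(d | w(k)) + (p/2) log N, with p = k + 1 adjustable parameters,
    a flat prior, and the O(k) term dropped."""
    N = len(d)
    A = np.vander(x, k + 1, increasing=True)      # design matrix of the order-k model
    w = np.linalg.lstsq(A, d, rcond=None)[0]      # least-squares fit of w(k)
    resid = d - A @ w
    neg_log_lik = 0.5 * np.sum(resid ** 2) / sigma2 + 0.5 * N * np.log(2 * np.pi * sigma2)
    complexity = 0.5 * (k + 1) * np.log(N)        # (p/2) log N complexity term
    return neg_log_lik + complexity

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
d = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(0.0, 1.0, size=50)   # true order is 2

scores = {k: mdl_score(x, d, k) for k in range(1, 8)}
best_k = min(scores, key=scores.get)
print("selected model order:", best_k)   # the order with the smallest description length
```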
Attributes of the MDL Principle
The MDL principle for model selection offers two important attributes (Grünwald, 2007):