Statistical hypothesis testing - Evaluation Methods

4. Evaluation Methods

4.3. Statistical hypothesis testing

The task of inferential statistics is to measure how probable it is that a study outcome was not achieved by chance. On the other hand, it should be kept in mind that a significant difference between two distributions does not necessarily indicate the magnitude of this difference: for example, method A may outperform B with regard to the misclassification error, but the advantage could be so marginal that it does not make sense to replace B by A. Therefore, it is also reasonable to describe the differences and trends by means of descriptive statistics.

The choice of a proper test should be made very carefully. It depends on the data characteristics: the number of observations and their mathematical distributions, the value domain, the relation between the observations (paired or unpaired), etc.

First, the test objectives should be considered. Usually, the following disjoint hypotheses are formulated (cf. [84], p. 2 and [36], p. 4):

Definition 4.2 Null hypothesis (H0) postulates that there is no difference between the probability distributions of some study outcomes, i.e. they belong to the same sample probability distribution with unknown parameters.

Definition 4.3 Alternative hypothesis (H1) assumes that the distributions of the study outcomes are different. If the direction of the difference matters (a sample set, which describes a study outcome, contains either only significantly larger values, or only significantly smaller values than another sample set), H1 is one-tailed. If the direction does not play a role, H1 is two-tailed.

Because it is usually desired to show that a certain algorithm modification or improvement leads to a higher, and not merely equal, performance, the principle of proof by contradiction is applied: it is assumed that H0 is true. If this assumption can be rejected for the estimated test statistic at a certain significance level, the alternative hypothesis H1 is accepted; otherwise H0 is kept.

Table 4.1.: Errors in hypothesis testing.

                      H0 is actually true    H0 is actually false
H0 is rejected        Type I error           correct decision
                      p = α                  p = 1 − β
H0 is not rejected    correct decision       Type II error
                      p = 1 − α              p = β

Table 4.1 illustrates the probabilities of the two possible errors which may occur in this approach. By the choice of the significance level α, we can limit the risk of a type I error. The most common value is α = 0.05. With this choice, a true H0 is correctly retained, and not rejected by chance, with a probability of 1 − α = 95%.
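The meaning of α and 1 − β in Table 4.1 can be checked empirically. The following is a minimal simulation sketch and not part of the original study: it assumes normally distributed observations with known unit variance and a simple two-tailed z-test of H0 "the mean is 0". The names `rejects_h0` and `rejection_rate`, the sample size of 10 and the shifted mean of 1.0 are hypothetical choices made only for this illustration.

```python
import random
from statistics import NormalDist, mean


def rejects_h0(sample, alpha=0.05):
    """Two-tailed z-test of H0: mean = 0, assuming known unit variance."""
    z = mean(sample) * len(sample) ** 0.5
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical


def rejection_rate(true_mean, n=10, trials=20000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = sum(
        rejects_h0([rng.gauss(true_mean, 1.0) for _ in range(n)], alpha)
        for _ in range(trials)
    )
    return rejections / trials


# H0 is actually true: the rejection rate estimates the type I error alpha.
type_one_rate = rejection_rate(true_mean=0.0)
# H0 is actually false (assumed shift of 1.0): the rate estimates 1 - beta.
power = rejection_rate(true_mean=1.0)
print(type_one_rate, power)
```

With H0 true, the estimated rejection rate should lie close to α = 0.05. With the assumed shift of the true mean, it approximates the probability 1 − β for that particular alternative only; a smaller shift or fewer observations would lower it.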

The type II error occurs if H0 is not rejected although it does not actually hold. The probability of the correct rejection of H0, where H1 is indeed true, is 1 − β. This value is also called the test power, since it usually corresponds to the acceptance of the desired suggestion that the two distributions of the study outcomes are unequal (see above).

The test plan should contain the following steps, according to [199,36]:

• A description of the problem and the corresponding data that is as exact as possible.

• Formulation of the hypotheses H0 and H1 with respect to Definitions 4.2 and 4.3.

• Choice of α as a test risk level.

• Selection of the test statistic U, which should have different distributions for H0 and H1.

• Estimation of the distribution of U under H0 (F0(U)), which depends on the number of observations (degrees of freedom). Usually, these distributions are listed in the corresponding tables or are calculated by statistical software.

• Selection of the critical area A_C under F0(U), which will lead to the rejection of H0 if the test statistic falls into this area.

• Estimation of the test statistic value T_S.

• Rejection of H0 if T_S ∈ A_C, and keeping H0 if T_S ∉ A_C.

• Interpretation and reporting of results.
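The steps of this test plan can be sketched for a deliberately simple case. The example below assumes the two-tailed sign test, where U is the number of positive paired differences and F0(U) is a binomial distribution with success probability 0.5. This test is chosen only for illustration and is not one of the tests applied in this work; the function names and the data values are made up.

```python
from math import comb


def null_distribution(a):
    """F0(U): under H0, the number of positive differences is Binomial(a, 0.5)."""
    return {k: comb(a, k) / 2 ** a for k in range(a + 1)}


def critical_area(f0, alpha):
    """Symmetric two-tailed set of U values whose total H0 probability is <= alpha."""
    a = max(f0)  # largest possible value of U
    area, mass = set(), 0.0
    for k in range(a // 2 + 1):
        tail = {k, a - k}  # add values from both ends, most extreme first
        tail_mass = sum(f0[value] for value in tail)
        if mass + tail_mass > alpha:
            break
        area |= tail
        mass += tail_mass
    return area


def sign_test(u, v, alpha=0.05):
    """Return (T_S, sorted critical area A_C, whether H0 is rejected)."""
    diffs = [x - y for x, y in zip(u, v) if x != y]  # assumption: ties are dropped
    f0 = null_distribution(len(diffs))
    a_c = critical_area(f0, alpha)
    t_s = sum(d > 0 for d in diffs)
    return t_s, sorted(a_c), t_s in a_c


# Made-up accuracies (in percent) of two methods over 10 repetitions.
u = [81, 79, 84, 80, 83, 78, 82, 85, 80, 79]
v = [78, 77, 80, 81, 79, 75, 80, 81, 77, 76]
print(sign_test(u, v))  # (9, [0, 1, 9, 10], True)
```

Because U is discrete, the critical area found here has a total probability of 22/1024 under H0, which is below the chosen α = 0.05; adding the next pair of values would exceed α.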

The p-value corresponds to the probability that, if H0 holds, the same or a more extreme test statistic value would be achieved by chance for a repetition of the experiment. In case of H0 rejection, p ≤ α.
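As a hedged illustration of this definition, the p-value can be computed exactly for small paired samples by enumerating all outcomes that are equally likely under H0. The sketch below assumes a sign-flip (randomization) test on the sum of the paired differences: if H0 holds and the differences are symmetric around zero, every assignment of signs is equally probable. This test is not one of the tests used in this work, and the function name and the integer differences are hypothetical.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Two-tailed exact p-value: share of sign assignments whose absolute
    sum is at least as extreme as the observed one."""
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


# Made-up paired differences (integers keep the comparisons exact).
diffs = [3, 2, 4, -1, 4, 3, 2, 4, 3, 3]
p = sign_flip_p_value(diffs)
print(p)           # 0.00390625, i.e. 4 of 1024 sign assignments
print(p <= 0.05)   # True: H0 would be rejected at alpha = 0.05
```

The enumeration grows as 2^A, so this sketch is only practical for small numbers of observations such as the 10 used here.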

Nonparametric statistical tests have several advantages over parametric statistical tests [84]:

• No assumptions about the probability distributions of the observations are required. Many parametric tests assume a Gaussian or other distribution.

• The application procedure is simple and easy to understand.

• These tests can be applied when the sample sizes are relatively small. In our studies, we used 10 statistical repetitions for each experiment. Several references, which are mentioned in [36], suggest that the application of parametric tests requires more than 10 repetitions as the absolute minimum.

• The nonparametric tests are hardly influenced by outliers.

We apply the following two tests in this work:

• Wilcoxon signed rank test is used for paired observations, which are dependent on each other. The test statistic $T_S^W$ is estimated as follows:

$$T_S^W = \sum_{i \in \{1,\dots,A\}:\; u_i > v_i} R_W(|u_i - v_i|), \tag{4.25}$$

where u and v are sample vectors of the same length A, and $R_W(\cdot)$ is a rank function, which estimates the rank of its argument from all sorted values $\{|u_1 - v_1|, \dots, |u_A - v_A|\}$.

After the calculation of $T_S^W$, it can be decided if $T_S^W \in A_C$. The two-tailed Wilcoxon signed rank test rejects H0 if:

$$T_S^W \geq \tau_{\alpha/2} \quad \text{or} \quad T_S^W \leq \frac{A(A+1)}{2} - \tau_{\alpha/2}, \tag{4.26}$$

where $\tau_\alpha$ is the critical value from the corresponding table for the Wilcoxon signed rank test.

• Mann-Whitney U-test, which is also referred to as the Wilcoxon, Mann and Whitney test, compares two unpaired samples. The sample vectors may have different lengths. Let A be the length of u, and B the length of v. The test statistic $T_S^U$ is:

$$T_S^U = \sum_{i=1}^{B} R_U(v_i), \tag{4.27}$$

where $R_U(\cdot)$ is the rank function, which estimates the rank of its argument from all sorted values $\{u_1, \dots, u_A, v_1, \dots, v_B\}$.

The two-tailed Mann-Whitney U-test rejects H0 if:

$$T_S^U \geq \tau_{\alpha/2} \quad \text{or} \quad T_S^U \leq B(A + B + 1) - \tau_{\alpha/2}, \tag{4.28}$$

where $\tau_\alpha$ is the critical value from the corresponding table for the Mann-Whitney U-test.
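The two test statistics and their rejection rules can be sketched in a few lines. The following code is an illustration under assumptions that the text does not state: tied values receive the average of their ranks, zero differences are ranked like any other value, and the critical value is passed in by the caller from the corresponding table rather than computed. All function names and data values are hypothetical.

```python
def rank(x, values):
    """1-based rank of x among values; ties share the average rank (assumption)."""
    less = sum(value < x for value in values)
    equal = sum(value == x for value in values)
    return less + (equal + 1) / 2


def wilcoxon_statistic(u, v):
    """Eq. (4.25): sum of the ranks of |u_i - v_i| over all i with u_i > v_i."""
    abs_diffs = [abs(a - b) for a, b in zip(u, v)]
    return sum(rank(d, abs_diffs) for a, b, d in zip(u, v, abs_diffs) if a > b)


def wilcoxon_rejects_h0(t_sw, a, tau):
    """Eq. (4.26); tau is the tabulated critical value for level alpha/2."""
    return t_sw >= tau or t_sw <= a * (a + 1) / 2 - tau


def mann_whitney_statistic(u, v):
    """Eq. (4.27): sum of the ranks of the elements of v in the pooled sample."""
    pooled = list(u) + list(v)
    return sum(rank(x, pooled) for x in v)


def mann_whitney_rejects_h0(t_su, a, b, tau):
    """Eq. (4.28); tau is the tabulated critical value for level alpha/2."""
    return t_su >= tau or t_su <= b * (a + b + 1) - tau


# Made-up paired accuracies (in percent), A = 10.
u_paired = [81, 79, 84, 80, 83, 78, 82, 85, 80, 79]
v_paired = [78, 77, 80, 81, 79, 75, 80, 81, 77, 76]
print(wilcoxon_statistic(u_paired, v_paired))  # 54.0

# Made-up unpaired samples of different lengths, A = 5 and B = 4.
u_unpaired = [81, 79, 84, 80, 83]
v_unpaired = [78, 77, 80, 75]
print(mann_whitney_statistic(u_unpaired, v_unpaired))  # 11.5
```

In the paired example, only one difference is negative and it has the smallest absolute value, so the statistic is just one below its maximum A(A + 1)/2 = 55. The quadratic-time `rank` helper is kept for readability; for larger samples a sort-based ranking would be preferable.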


5. Application of Feature Selection

In document Improving supervised music classification by means of multi-objective evolutionary feature selection (Page 86-91)