Analysis of data
3.1 TESTING OF SAMPLE RANDOMNESS
When the decision is made to observe only a part of the population of interest, we must be sure that this part has all the significant properties of the population. In practice, interest centres on one feature or a few features, and the part selected as a sample is expected to characterise the whole population well, i.e. the sample should be representative. It was stated long ago that a sample is representative when it is taken in a random way¹, and that this is a necessary condition. Therefore, the subject under consideration here is a certain property of a sample: its randomness.
We leave aside here the problem of defining randomness. A study of papers concerning randomness (from Kendall and Smith 1938 up to Wolfram 2002, p. 1067) reveals some subtle differences. The definition also depends on whether the consideration belongs to mathematical statistics or to engineering problems. Our approach to randomness is a typical engineering one: a sample was taken and we want to know whether it is random. To resolve this problem it is necessary to choose an appropriate statistical tool, a test, within the theory of statistical hypotheses.
It is said that there are not many tests that can be used in this case². Here our analysis will be based on the number of series (runs) in the sample, determined with respect to the median of the variable being investigated³.
The procedure of the test is as follows.
3.1.1 The test procedure
The sample has the form of a sequence of numbers noted successively in their order of occurrence. To estimate the median, order the sequence monotonically. The number in the middle is the estimate of the unknown median of the population. When the sample has an even number of items, the arithmetic mean of the two middle numbers is the estimate of the median. Now, the original sequence of numbers is translated into a sequence of plus and minus signs; each number receives a sign. A plus sign is given to every number greater than the median and a minus sign to every number lower than the median. Any numbers equal to the median should be rejected. The next step in the test is the calculation of the number of series, i.e. the number of unbroken runs of identical signs. Denote the number of + signs by n+ and the number of − signs by n−.
1 If all of the samples of the same size have an equal chance of being selected from the general population, we say that the samples have a random character.
2 By the way, some tests for the randomness of a sample, for example ‘the rank correlation test for the randomness of a sample’ presented by Gopal (2006, test 71), do not test the randomness of a sample. Gopal’s test is, in fact, a test for the stationarity of a sequence. Similar examples of improper statements can be found in the literature related to the subject.
3 Recall that the median was defined by formula (1.34).
Now, the statistical hypothesis H0 is formulated, which states that the elements of the sample were selected in a random manner, whereas the alternative hypothesis H1 denies H0. In order to verify H0, one compares the number of series in the sample with the critical values taken from the statistical table for the given numbers of signs and a presumed level of significance α. The critical region consists of two sub-regions, the left side and the right side, which means that there are two limiting values: the minimum number of series (the critical value for α/2) and the maximum number of series (the critical value for 1 − α/2). If the number of series falls between these two critical values, we have no ground to reject the null hypothesis. Otherwise, H0 is rejected in favour of the alternative hypothesis H1. Rejection of the verified hypothesis therefore means that there are either too many series in the sample or too few.
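The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `runs_about_median` and `keep_h0` are hypothetical, the readings in the usage lines are made up, and the critical values are assumed to be read by the user from a statistical table; the strict inequalities are one reading of "falls between these two critical values".

```python
from statistics import median


def runs_about_median(values):
    """Return (signs, n_plus, n_minus, runs) for the series test about the median."""
    me = median(values)  # mean of the two middle values when the size is even
    # Values equal to the median are dropped, as the procedure prescribes.
    signs = ['+' if v > me else '-' for v in values if v != me]
    n_plus = signs.count('+')
    n_minus = signs.count('-')
    # A new series starts at the first sign and at every change of sign.
    runs = sum(1 for i, s in enumerate(signs) if i == 0 or s != signs[i - 1])
    return signs, n_plus, n_minus, runs


def keep_h0(runs, k_low, k_high):
    """True when the number of series lies strictly between the critical values.

    k_low and k_high are assumed to come from a table of critical values
    for the given n_plus, n_minus and significance level.
    """
    return k_low < runs < k_high


# Hypothetical readings, used only to exercise the functions.
signs, n_plus, n_minus, runs = runs_about_median([3, 1, 4, 1, 5, 9, 2, 6])
print(''.join(signs), n_plus, n_minus, runs)  # --+-++-+ 4 4 6
```

The sketch keeps the counting of series separate from the decision step, so that the tabulated critical values stay an explicit input.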
Consider an example.
■ Example 3.1
A reliability investigation of selected machines was carried out in an underground copper mine. The sequence of the repair times of one LHD machine was noted:
2.5; 1.4; 4.3; 0.8; 3.2; 0.4; 2.2; 3.4; 5.4; 7.2; 0.9; 2.8; 2.9; 1.8 h.
Verify the hypothesis that the observed sequence is random.
By arranging the sequence monotonically we have:
0.4; 0.8; 0.9; 1.4; 1.8; 2.2; 2.5; 2.8; 2.9; 3.2; 3.4; 4.3; 5.4; 7.2
The sample contains 14 elements. Calculate the median of the sample:
$$ Me = \frac{2.5 + 2.8}{2} = 2.65 $$
Convert the original sample into a sequence of signs. We have:
– – + – + – – + + + – + + –
The number of series is 9 and the numbers of signs are n+ = n− = 7. Presuming a level of significance α = 0.05 and using Table 9.8, we have:
K_{α/2}(7, 7) = 3 and K_{1−α/2}(7, 7) = 12
The empirical number of series fulfils the inequality 3 < 9 < 12, thus we have no ground to reject the verified hypothesis H0. We can accept the statement that the sample is random. ◀
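The count of nine series in Example 3.1 can be checked by bracketing each unbroken run of identical signs in the sign sequence of the example. The numbering under the braces is added here only as a reading aid and is not part of the original notation.

```latex
\[
\underbrace{-\,-}_{1}\;
\underbrace{+}_{2}\;
\underbrace{-}_{3}\;
\underbrace{+}_{4}\;
\underbrace{-\,-}_{5}\;
\underbrace{+\,+\,+}_{6}\;
\underbrace{-}_{7}\;
\underbrace{+\,+}_{8}\;
\underbrace{-}_{9}
\qquad n_+ = 7,\quad n_- = 7
\]
```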
The number of series is also used as a statistic to verify the hypothesis that two samples come from the same population.
As a rule, tables of critical values for the number of series cover up to 20 signs, so the problem arises of what to do when the sample size is greater than 20. For large n+ and n− the distribution of the number of series can be satisfactorily described by the normal distribution N(m, σ) with parameters determined by the formulas:
$$ m = \frac{2 n_+ n_-}{n_+ + n_-} + 1 \qquad (3.1) $$
and
$$ \sigma = \sqrt{\frac{2 n_+ n_- \,(2 n_+ n_- - n_+ - n_-)}{(n_+ + n_-)^2 \,(n_+ + n_- - 1)}}. \qquad (3.2) $$
The above relationships can be used for approximate calculations.
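A minimal sketch of how formulas (3.1) and (3.2) might be used is given below. The function names are hypothetical and the counts (30 plus signs, 30 minus signs, 22 series) are invented for illustration. Standardising the number of series and comparing it with the two-sided normal quantile is a common way of applying the approximation and is an assumption here, not a step stated in the text.

```python
from math import sqrt
from statistics import NormalDist


def runs_normal_parameters(n_plus, n_minus):
    """Mean m and standard deviation sigma of the number of series, (3.1) and (3.2)."""
    n = n_plus + n_minus
    m = 2 * n_plus * n_minus / n + 1
    sigma = sqrt(2 * n_plus * n_minus * (2 * n_plus * n_minus - n)
                 / (n ** 2 * (n - 1)))
    return m, sigma


def standardised_runs(runs, n_plus, n_minus):
    """Number of series expressed in standard deviations from its mean."""
    m, sigma = runs_normal_parameters(n_plus, n_minus)
    return (runs - m) / sigma


# Hypothetical large sample: 30 plus signs, 30 minus signs, 22 series.
alpha = 0.05
z = standardised_runs(22, 30, 30)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided quantile, about 1.96
print(round(z, 2), abs(z) > z_crit)  # -2.34 True
```

In this invented case there are too few series for the hypothesis of randomness to be retained at α = 0.05.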
3.1.2 Results of a randomness investigation
In mining practice, the finding that a given sample is non-random does not occur frequently. This regularity is undoubtedly associated with the fact that studies are usually prepared with a certain insight and diligence, bearing in mind the conditions for proper investigations. Nevertheless, some realisations of random variables in mining engineering are non-stationary, e.g. the total number of wire breaks in hoist head ropes versus time (or better, versus the number of hoist cycles executed). In this case, one observes the realisation of a non-stationary random process and randomness testing makes no sense.
If statistical testing shows that the sample is non-random, no further statistical inference concerning the random variable can be made; what remains is to trace why this regularity has appeared.
There are many possible reasons for such a set of circumstances. One possibility is the existence of a cyclic component in the realisation of the observed random variable. The operation of many pieces of equipment in mining has a cyclic character and this periodicity can generate a cyclic component in the process of their exploitation. A stream of excavated rock, whether continuous or discrete, has a periodic character because of the cyclic character of a mining operation. This, again, can influence the processes that are running in a mine. Another possibility is that during the repair of a technical object, a certain assembly has been replaced by an assembly from a different machine. As a rule, this new item is much more sensitive to failures than the original one. It can be much more susceptible to the periodic character of the operation. Generally, all these ‘abnormal’ events can generate non-randomness in the data observed.
For a researcher carrying out an investigation, information about the non-randomness of the sample should be a clear signal that something untypical has been noticed. Finding the reasons for this untypical regularity is strongly recommended. It may be a source of significant information on the object being investigated, whether it is a technical item, a process or a property of the surrounding rocks. Sometimes the reason can be prosaic: a software error in the system that collects the data.