
3.7 HOMOGENEITY OF DATA

The statement that something is homogeneous means that it is alike, similar or uniform from a certain point of view. In the statistical sense, homogeneity is a property of a data set and relates to the validity of the very advantageous assumption that the statistical properties of the samples that were taken are identical to those of the whole population.

There are some areas of analysis in mining engineering in which we cannot complain that the data are poor or small in number. This often concerns analyses from ore dressing areas.

A similar situation applies to gathering data on the regular vibrations of the rocks surrounding a mine. By contrast, there are also fields of mining engineering interest in which the information collected is usually scarce. In reality, the machines in operation are often of high quality and reliability, and failures do not occur very often. It is therefore difficult to gather a sufficiently large sample of information on how a given machine fulfils its duties from a reliability point of view; such a technical object would have to operate for a long time, frequently too long in comparison to mining reality. In some other cases, when research involves destructive tests, we cannot permit so many items to be destroyed. However, in many cases it is possible to examine a certain number of similar objects operating in similar conditions, and it can be expected that the data obtained will be homogeneous; all of the observations can then be gathered together to create a large sample, so that statistical inference has a strong foundation. In other words, we have observed a certain number of stochastic copies of the same phenomenon and these data create the entirety.

The homogeneity of random variables is also of interest in comparative studies. One has two slightly different technical objects or slightly different processes and the point of interest is to answer the question of whether the difference that exists is significant or not.

In engineering studies of a probabilistic nature, homogeneity is usually understood in two ways, namely:

a. As the equality of distributions, i.e. the probability distributions of random variables that are the subject of interest are identical ones

b. As the equality of parameters; the parameters of a statistical nature characterising the selected properties of the object of investigation, e.g. average values, standard deviations, probabilities of occurrence of determined events or states etc., differ from each other only negligibly from a statistical point of view.

Consider case (a).

There are a number of statistical tests that allow the hypothesis that the data are homogeneous to be verified from the distribution point of view. Divide this problem into two separate cases:

i. There are two samples

ii. There are three or more samples

and we are interested whether we can assume that they come from the same population.

This division is connected with the properties of the tests that are applied.

When the data in hand comprise two samples, using the Smirnov test is recommended²¹. The variable tested should be a continuous one. A model considered for this test is as follows.

21 There is also the Smirnov test, which allows three samples to be compared (Birnbaum and Hall 1960).


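Let (X₁, X₂, …, X_n) and (Y₁, Y₂, …, Y_m) be two simple samples. A verified null hypothesis states that both samples come from the same population.

Construct two empirical distribution functions, one for each sample. With the convention used in Example 3.10, each is the fraction of observations lying strictly below the argument:

$$F_n(x) = \frac{1}{n}\,\#\{i : X_i < x\}, \qquad G_m(y) = \frac{1}{m}\,\#\{j : Y_j < y\}$$

The statistic that measures the distance between the distributions F_n(x) and G_m(y) is determined in one of the following ways: either as one of the one-sided statistics D⁺_n,m and D⁻_n,m, which take the maximum of the signed difference between the two functions in one direction only, or as the two-sided statistic

$$D_{n,m} = \max_{u}\,\bigl|F_n(u) - G_m(u)\bigr|$$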

which means that the point of interest is the maximum of the mismatch. In other words, if you plot the sorted values of sample x against the sorted values of sample y as a series of increasing steps then the test statistic is the maximum vertical gap between these two plots.

Obviously, D_n,m = D_m,n. Both statistics D⁺_n,m and D⁻_n,m have the same distribution. Let us devote our attention to only one of them.

Denote by D⁺_n,m(α) and D_n,m(α) the critical values for both statistics if the level of significance in the test is α. Due to the discontinuity of the distribution of the statistic D_n,m, the corresponding critical value is determined by the formula:

$$D_{n,m}(\alpha) = \inf\{d : P(D_{n,m} \ge d) \le \alpha\}$$

The critical value D⁺_n,m(α) is determined analogously.

In practice statistics are calculated by applying one of the formulas:

D j

To apply the above test it is necessary to make use of Table 9.16a, which gives the critical values D_n,m(α). These values are valid for n = 3(1)20, m = 2(1)n and α = 0.01; 0.02; 0.05; 0.10.

The intersection of the row that corresponds to the given n and m with the column that corresponds to the probability α determines two numbers: the integer d_n,m(α) and the fractional number α*. At the intersection of the same row with the column denoted by k, read the integer k_n,m. The critical value is given by the formula:

$$D_{n,m}(\alpha) = \frac{d_{n,m}(\alpha)}{k_{n,m}} \qquad (3.52)$$

The number α* is the real level of significance of the test, in which α* ≤ α. The differences between α* and α come from the discontinuity of the distribution of statistic D_n,m.

The distribution of statistic D_n,m for n = m = 1(1)40 is presented in Table 9.16b. For a given n and k = 1(1)12, the probability P{D_n,n ≤ k/n} can be read off. Because the following approximate equality holds for α ≤ 0.10:

$$D^{+}_{n,m}(\alpha) \approx D_{n,m}(2\alpha)$$

there is no separate table for the statistic D⁺_n,m(α).

■ Example 3.10

A durability investigation of a certain mechanical part of an articulated dump truck was carried out. The point of interest was the number of load cycles, not as related to failure occurrence but as the difference between the assumed level and the number achieved. Two parts were tested, and so two samples were obtained. They were as follows:

(0.46 0.14 2.45 −0.32 −0.07 0.30) × 10³ cycles

(0.06 −2.53 −0.53 −0.19 0.54 −1.56 0.19 −1.19 0.02) × 10³ cycles

A hypothesis was formulated stating that these two samples came from the same population. The alternative supposition rejects this.

The calculation procedure is as follows.

1. Construction of the empirical distribution F_n(x) for the first sample:

i     x       F_n(x)
1   −0.32     0
2   −0.07     1/6
3    0.14     2/6
4    0.30     3/6
5    0.46     4/6
6    2.45     5/6

2. Construction of the empirical distribution G_m(y) for the second sample:

j     y       G_m(y)
1   −2.53     0
2   −1.56     1/9
3   −1.19     2/9
4   −0.53     3/9
5   −0.19     4/9
6    0.02     5/9
7    0.06     6/9
8    0.19     7/9
9    0.54     8/9

3. Sort the values of both samples together and express the distribution functions F_n(x) and G_m(y) as k/r, where r is the least common multiple of n and m (here r = 18).

4. Further calculation procedures are as follows:

  u      F_n(u)   G_m(u)   F_n(u) − G_m(u)
−2.53     0        0          0
−1.56     0        2/18      −2/18
−1.19     0        4/18      −4/18
−0.53     0        6/18      −6/18
−0.32     0        8/18      −8/18
−0.19     3/18     8/18      −5/18
−0.07     3/18    10/18      −7/18
 0.02     6/18    10/18      −4/18
 0.06     6/18    12/18      −6/18
 0.14     6/18    14/18      −8/18
 0.19     9/18    14/18      −5/18
 0.30     9/18    16/18      −7/18
 0.46    12/18    16/18      −4/18
 0.54    15/18    16/18      −1/18
 2.45    15/18     1         −3/18

5. Look for the maximum discrepancy in the last column. Here we have:

$$D_{6,9} = \max_{u}\,\bigl|F_6(u) - G_9(u)\bigr| = 8/18$$

From Table 9.16a one gets the critical value:

$$D_{6,9}(0.05) = 13/18$$

Since 8/18 < 13/18, there is no ground to reject the verified hypothesis proclaiming that both samples come from the same population. ◀
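The calculation in Example 3.10 can be checked with a few lines of Python. The sketch below is only an illustration, not the book's own procedure: the function names `ecdf_left` and `smirnov_statistic` are hypothetical, and the empirical distribution function is assumed to follow the convention of the tables above (the fraction of observations strictly below the argument). Exact fractions are used so that the result can be compared directly with 8/18.

```python
from fractions import Fraction

def ecdf_left(sample, u):
    """Fraction of observations strictly below u (convention of the example tables)."""
    return Fraction(sum(1 for v in sample if v < u), len(sample))

def smirnov_statistic(x, y):
    """Two-sided statistic: the largest |F_n(u) - G_m(u)| over all pooled sample points."""
    points = sorted(set(x) | set(y))
    return max(abs(ecdf_left(x, u) - ecdf_left(y, u)) for u in points)

# Data of Example 3.10 (in thousands of cycles)
x = [0.46, 0.14, 2.45, -0.32, -0.07, 0.30]
y = [0.06, -2.53, -0.53, -0.19, 0.54, -1.56, 0.19, -1.19, 0.02]

d = smirnov_statistic(x, y)
critical = Fraction(13, 18)   # D_{6,9}(0.05) as quoted from Table 9.16a

print("D =", d, "| reject H0:", d >= critical)
```

The fraction is printed in lowest terms, so 8/18 appears as 4/9. The critical value is simply copied from the text; the sketch does not reproduce Table 9.16a.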

Remark. If n, m → ∞ then the statistic:

$$\frac{4nm}{n+m}\,\bigl(D^{+}_{n,m}\bigr)^{2}$$

has a χ² distribution with 2 degrees of freedom.

The statistic:

$$\sqrt{\frac{nm}{n+m}}\;D_{n,m}$$

has a Kolmogorov K(y) distribution²².
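As a rough illustration of the asymptotic result, the Kolmogorov distribution function K(y) = Σ (−1)^i exp(−2i²y²), summed over all integers i, can be evaluated numerically. The sketch below is an assumption-laden example rather than part of the original procedure: the function name `kolmogorov_cdf` and the truncation at 100 terms are hypothetical choices, and the samples of Example 3.10 (n = 6, m = 9) are far too small for the asymptotic distribution to be more than indicative.

```python
import math

def kolmogorov_cdf(y, terms=100):
    """K(y) = 1 + 2 * sum_{i>=1} (-1)^i exp(-2 i^2 y^2), truncated after `terms` terms.

    The truncation is adequate for moderate y (say y >= 0.1); it is an
    illustrative choice, not a tuned numerical routine.
    """
    if y <= 0:
        return 0.0
    tail = sum((-1) ** i * math.exp(-2.0 * i * i * y * y) for i in range(1, terms + 1))
    return 1.0 + 2.0 * tail

# Asymptotic p-value for the statistic of Example 3.10 (D = 8/18, n = 6, m = 9)
n, m = 6, 9
d = 8 / 18
y = math.sqrt(n * m / (n + m)) * d
p_value = 1.0 - kolmogorov_cdf(y)

print(round(y, 3), round(p_value, 3))
```

A large p-value points in the same direction as the exact test based on Table 9.16a: there is no ground to reject the hypothesis of a common population.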

We continue to consider case (a), but now the number of samples is three or more.

In statistics there are a few tests that can be applied in such a case but the most popular in engineering practice seems to be the Kruskal-Wallis test based on the sum of ranks (Kruskal and Wallis 1952). It is a non-parametric test and is used to compare more than two samples that are independent, or not related. The model considered for this test is as follows.

Presume that k objects are observed with regard to a certain feature and therefore k samples are obtained. A convenient feature of the test is that the samples can have different sizes.

Denote the sample sizes by n_i; i = 1, 2, …, k. Assume that the random values of the measured feature for the i-th object can be described by a certain probability distribution F_i(x), and that a statistical hypothesis H₀ is formulated that all of these probability distributions are identical, that is:

$$H_0:\; F_1(x) = F_2(x) = \dots = F_k(x)$$

The alternative hypothesis H₁ rejects the null supposition.

3.7.1 The test procedure

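All elements of all of the samples are gathered together and ranks from 1 to N are assigned to the monotonically ordered set, where N is the total number of elements in all samples:

$$N = \sum_{i=1}^{k} n_i$$

If tied values exist, the average of their ranks must be assigned to each of them. Next, the value of the following statistic is calculated:

$$K_N = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{T_i^{2}}{n_i} - 3(N+1) \qquad (3.53)$$

where T_i is the sum of the ranks in the i-th sample. The same statistic can be written as $\frac{12}{N(N+1)} \sum_{i=1}^{k} n_i \bigl(\bar{T}_i - \frac{N+1}{2}\bigr)^{2}$, where $\bar{T}_i = T_i / n_i$ is the average rank of the i-th sample and (N + 1)/2 is the general mean rank.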

Looking at formula (3.53), it is easy to notice that if there are more differences between the average sample ranks and the general mean rank, statistic K_N is larger. A low dispersion in this regard, in turn, will be favourable for the hypothesis H₀—providing that there is no ground to reject it.

Kruskal and Wallis observed that if k grows and if the sizes n_i increase, the random variable K_N has the asymptotic probability distribution of χ² with k − 1 degrees of freedom.

Therefore, if the following inequality holds:

$$K_N \ge \chi^{2}_{\alpha}(k-1) \qquad (3.54)$$

where χ²_α(k − 1) is the critical value for the assumed level of significance α, the verified null hypothesis should be rejected.

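For a large k, the random variable $\sqrt{2\chi^{2}_{k-1}} - \sqrt{2(k-1)-1}$ has approximately the standardised normal distribution N(0, 1). Accordingly, for a large k, the following approximation can be applied:

$$\chi^{2}_{\alpha}(k-1) \approx \tfrac{1}{2}\Bigl(u_{1-\alpha} + \sqrt{2k-3}\Bigr)^{2}$$

where u_1−α is the quantile of order (1 – α) of the standardised normal distribution.

22 The Kolmogorov distribution has the cumulative distribution function $K(y) = \sum_{i=-\infty}^{\infty} (-1)^{i} e^{-2 i^{2} y^{2}}$ for y > 0.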
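The normal approximation can be turned into a quick numerical check. Solving √(2χ²) − √(2(k − 1) − 1) = u for χ² gives ½(u + √(2k − 3))², and the sketch below evaluates this expression with the standard library. The function name `chi2_critical_approx` is hypothetical, and the choice k = 31 (30 degrees of freedom) is only an illustrative assumption.

```python
import math
from statistics import NormalDist

def chi2_critical_approx(alpha, k):
    """Approximate upper critical value of chi-square with k - 1 degrees of freedom.

    Uses the normal approximation: sqrt(2*chi2) - sqrt(2*(k - 1) - 1) ~ N(0, 1).
    """
    u = NormalDist().inv_cdf(1.0 - alpha)   # quantile of order 1 - alpha of N(0, 1)
    return 0.5 * (u + math.sqrt(2 * (k - 1) - 1)) ** 2

approx = chi2_critical_approx(0.05, 31)
print(round(approx, 2))
```

For 30 degrees of freedom and α = 0.05, standard χ² tables give 43.77; the approximation falls slightly below this value and improves as k grows.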

Large amounts of computing resources are required to calculate the exact probabilities for the Kruskal-Wallis test. Existing software only provides exact probabilities for sample sizes of less than about 30 participants. These software programs rely on asymptotic approximation for larger sample sizes. Exact probability values for larger sample sizes are actually available. Spurrier (2003) published exact probability tables for samples with as many as 45 participants. Meyer and Seaman (2006) made precise probability distributions for samples as large as 105 participants.

If some of n_i values are small (that is, less than 5), the probability distribution of K_N can be quite different from this Chi-square distribution.

In order to obtain more precise reasoning when tied ranks are present in the samples, a correction should be made by first calculating the following measure:

$$\upsilon = \left(1 - \frac{\sum_{j=1}^{g}\bigl(t_j^{3} - t_j\bigr)}{N^{3} - N}\right)^{-1}$$

where g is the number of groups of tied ranks, and t_j is the number of tied ranks in the j-th group.

Then multiply the estimate K_N by υ. When ties are present υ > 1, and for this reason the new value of statistic (3.53) will be greater than the one calculated without correction. This means that taking the correction into account increases the chance of rejecting the verified null hypothesis.

■ Example 3.11

The investigation concerned four scrapers with the same parameters, made by the same producer. Repair times were noted and the following data were gathered:

Machine I: 85, 150, 430, 30, 170, 600, 210
Machine II: 50, 80, 750, 140, 320, 260, 360, 180
Machine III: 135, 90, 490, 110, 145, 190
Machine IV: 580, 120, 330, 100, 160, 240.

All times are given in minutes.

Here, we have small size samples. In order to create a large sample, a hypothesis was formulated that all of these data are homogeneous.

The Kruskal-Wallis test was applied to verify this supposition. The calculation procedure was as follows.

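No.    Sample I       Sample II      Sample III     Sample IV
       Time  Rank     Time  Rank     Time  Rank     Time  Rank
1        85     4       50     2      135     9      580    25
2       150    12       80     3       90     5      120     8
3       430    23      750    27      490    24      330    21
4        30     1      140    10      110     7      100     6
5       170    14      320    20      145    11      160    13
6       600    26      260    19      190    16      240    18
7       210    17      360    22
8                      180    15
Sum of ranks   97            118             72             91

Calculate the estimate for the K_N statistic from formula (3.53). We have:

$$K_N = \frac{12}{27 \cdot 28}\left(\frac{97^{2}}{7} + \frac{118^{2}}{8} + \frac{72^{2}}{6} + \frac{91^{2}}{6}\right) - 3 \cdot 28 \approx 0.58$$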

Compare this value with the critical one. From the table of χ² critical values (Table 9.4), we have χ²_α(3) = 7.8 for the presumed level of significance α = 0.05. Because the critical value is greater than the empirical value, there is no ground to reject the verified hypothesis. This means that all of the data can be treated as one sample. If so, this new sample has 27 elements. Now, we can try to find a theoretical probability distribution that will satisfactorily describe the empirical data. ◀
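The rank sums and the statistic of Example 3.11 can be recomputed directly from the repair times. The sketch below is an illustration under stated assumptions: it uses the usual rank-sum form of the Kruskal-Wallis statistic, K_N = 12/(N(N + 1)) · Σ T_i²/n_i − 3(N + 1), and the usual tie correction 1/(1 − Σ(t³ − t)/(N³ − N)); the helper names `average_ranks` and `kruskal_wallis` are hypothetical.

```python
from collections import Counter

def average_ranks(values):
    """Map each distinct value to its rank in the pooled sample; ties get the mean rank."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return {v: sum(p) / len(p) for v, p in positions.items()}

def kruskal_wallis(samples):
    """Return (rank sums, K_N, K_N multiplied by the tie correction)."""
    pooled = [v for s in samples for v in s]
    N = len(pooled)
    rank_of = average_ranks(pooled)
    rank_sums = [sum(rank_of[v] for v in s) for s in samples]
    K = 12.0 / (N * (N + 1)) * sum(T * T / len(s) for T, s in zip(rank_sums, samples)) - 3 * (N + 1)
    tie_sizes = Counter(pooled).values()
    correction = 1.0 / (1.0 - sum(t ** 3 - t for t in tie_sizes) / (N ** 3 - N))
    return rank_sums, K, K * correction

# Repair times of Example 3.11, in minutes
machines = [
    [85, 150, 430, 30, 170, 600, 210],
    [50, 80, 750, 140, 320, 260, 360, 180],
    [135, 90, 490, 110, 145, 190],
    [580, 120, 330, 100, 160, 240],
]

rank_sums, K, K_corrected = kruskal_wallis(machines)
critical = 7.8   # chi-square critical value for 3 degrees of freedom, alpha = 0.05 (from the text)

print(rank_sums, round(K, 3), "| reject H0:", K_corrected >= critical)
```

There are no tied values in these data, so the correction factor equals 1 and both versions of the statistic coincide.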

Our previous considerations on homogeneity of data are important for two reasons at least.

• If there is no ground to reject the hypothesis that the investigated data are homogeneous, there is a possibility to gather all of the data and to create a large sample. This is important because the unit samples in some cases can be small. Working with a large sample, there is a greater likelihood that stronger statistical inferences can be made.

• If there is a basis to reject the hypothesis on homogeneity in the data, a further investigation should be done to find the reason why the data are inhomogeneous and which object has ‘made’ it so. Discovering the reason that is generating this ‘unfitness’ can give valuable information from an operational point of view.

Let us now consider the second of the cases listed, case (b), in which some parameters are the points of our interest. It was stated that our interest in homogeneity is not always as strong as the equality of distributions. One can be interested in the identity of certain parameters of a statistical nature that characterise specific features of the object of interest. In reliability investigations of mine equipment, we can be interested, for instance, in whether some probabilities of failure occurrence are identical from a stochastic point of view.

Let us study the following probabilistic model.

We investigate k technical objects and we are interested in the number of work cycles executed by these objects. Let n₁, n₂, …, n_k denote these numbers up to the moment of the occurrence of the m-th failure. Denote by Q_i the probability of the appearance of a failure in one work cycle of the i-th object. We would like to check whether the following hypothesis holds:

H₀: Q₁ = Q₂ = … = Q_k

which states that the probability of the occurrence of a failure in a work cycle is the same for all of the objects. An alternative hypothesis rejects this.

To verify the basic supposition, one can apply Cochran’s test²³. The measure in this test is the statistic Θ of formula (3.57), formed from the largest of the numbers, max(n₁, n₂, …, n_k), and their sum n₁ + n₂ + … + n_k.

The verified hypothesis should be rejected if the following inequality holds:

$$\Theta \ge q_{k,2m}(1-\alpha) \qquad (3.58)$$

where q_k,2m(1 – α) is the quantile of order (1 – α) of Cochran’s statistic (Table 9.9).

In the literature on the subject it is recommended (see for instance Migdalski 1992) that Hartley’s test should be applied when the number of objects being observed is small (k ≤ 12).

The measure in this test is the statistic Λ of formula (3.59), formed from the largest and the smallest of the numbers, max(n₁, n₂, …, n_k) and min(n₁, n₂, …, n_k).

The verified hypothesis should be rejected if the following inequality holds:

$$\Lambda \ge \eta_{k,2m}(1-\alpha) \qquad (3.60)$$

where η_k,2m(1 – α) is the quantile of order (1 – α) of Hartley’s statistic.

■ Example 3.12

In a certain quarry seven wheel loaders operated that loaded blasted rock into the crushers and onto dumpers; they were also used in some auxiliary works. The operating machines came from two different producers. However, their reliability was similar. In a different quarry, owned by the same contractor, it was planned to replace some machines of this type but the problem was which producer has better machines.

Let us ignore the problem of negotiations, the possible discounts offered by producers, conditions of payment, realisation of the purveyance and assurance of spare parts, all of which are connected with the potential transaction, and let us devote our attention to the reliability of the equipment that will be purchased.

23 There are a few tests in mathematical statistics that are connected with the name of William Cochran.


A reliability study of the machines, comprising maintenance problems and especially repairs, was done. The applied statistical test gave no ground to reject the hypothesis stating that all of the repair times could be satisfactorily described by one probability distribution. It was presumed that a satisfactory deciding criterion would be the frequency of occurrence of failures. A new reliability investigation was performed with this criterion in mind, and a day was presumed as the basic elementary period of operation.

A decision was made to observe machines up to the moment of the 10-th failure occurrence.

For machines from the first producer, the following sequence was noted:

22.6 18.0 26.0 19.8 days

and for machines from the second producer:

10.9 18.6 16.7 days

At first glance, the reliability of the machines of the first producer looks better than those of the second one. However, formulating the problem from a statistical point of view, we should answer the question of whether this ‘difference’ is significant statistically or not.

Because there are only seven machines in operation, we should apply Hartley’s test.

Calculate an estimate of the statistic Λ from formula (3.59), using the largest observation, 26.0, and the smallest, 10.9.

This value should be compared with the critical value. Presume a level of significance α = 0.05 as usual. The corresponding critical value (Table 9.10) is:

$$\eta_{k,2m}(1-\alpha) = \eta_{7,20}(0.95) = 3.94$$

Comparing these two values, we have no doubts that we have no basis to reject the verified hypothesis. The observed differences in values are not statistically significant. ◀

Let us notice that the above test used the information in hand rather carelessly: it takes into consideration only the maximum and minimum values, and the rest of the information is left unused. It seems more appropriate to formulate the hypothesis that the times of repair can be described by one probability distribution and to apply an appropriate statistical procedure to verify this supposition. Another approach can also be used to check whether our guess that the data can be treated as homogeneous is correct.

We can also investigate, using the Wilcoxon-Mann-Whitney’s test (see for instance Lehmann & D’Abrera 2006), whether the supposition stating that the data come from two different populations having the same expected values is a true one. The condition of the application of the test is in the form of a probability distribution that should be similar to the probability distribution of the random variables tested.

■ Example 3.13

In the seismic station of a certain underground coal mine, tremors of the rock surrounding the mine were noted. A point of interest was the rock vibrations whose energy exceeded 10⁵ J.

In the selected period of observation, 33 events were recorded:

(7, 15, 280, 190, 900, 8, 8, 100, 10, 2, 800, 2000, 1000, 900, 850, 95, 25, 100, 950, 9, 320, 210, 20, 20, 40, 6, 600, 40, 5, 7, 105, 2, 80) × 10⁵ J

The randomness of the sample was tested first. The sample has 33 elements and for this reason the 17-th element is the sample median for the sample arranged monotonically. This element is recorded as the last one: 80 × 10⁵ J.

Converting the sample into a sequence of signs we have:

− − + + + − − + − − + + + + + + − + + − + + − − − − + − − − + −

The number of series in this sequence is 15. The numbers of plus and minus signs are n₊ = n₋ = 16. Presuming a level of significance α = 0.05 and using Table 9.8, we have two critical values:

K_α/2(16, 16) = 11 and K_1−α/2(16, 16) = 22

The empirical number of series lies between the two critical values and for this reason we have no ground to reject the hypothesis stating randomness. We can presume that the sample has a random property.
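The conversion into signs and the count of series can be reproduced with a short script. This is a minimal sketch, assuming (as the 32-sign sequence in the text implies) that the observation equal to the median is dropped and that values above the median receive a plus sign; the critical values 11 and 22 are copied from the text rather than computed.

```python
import statistics

# Tremor energies of Example 3.13, in units of 10^5 J, in order of occurrence
tremors = [7, 15, 280, 190, 900, 8, 8, 100, 10, 2, 800, 2000, 1000, 900, 850,
           95, 25, 100, 950, 9, 320, 210, 20, 20, 40, 6, 600, 40, 5, 7, 105, 2, 80]

median = statistics.median(tremors)

# '+' above the median, '-' below it; the element equal to the median is omitted
signs = ['+' if v > median else '-' for v in tremors if v != median]

# A new series starts whenever the sign changes
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
n_plus, n_minus = signs.count('+'), signs.count('-')

lower, upper = 11, 22   # critical values quoted from Table 9.8 for alpha = 0.05
print(median, n_plus, n_minus, runs, "| randomness not rejected:", lower < runs < upper)
```

The hypothesis of randomness is kept when the number of series falls strictly between the two critical values, which is the comparison made in the last line.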

Figure 3.21 is an illustration of the sequence of tremors noted taking into account their energy.

Let us now investigate whether this sequence is a stationary one. Apply the test based on Spearman’s rank correlation coefficient. Using the procedure described in Chapter 3.3, we can construct a table which facilitates further reasoning (Table 3.3). It contains the sum of squares of differences, which equals 6975.5, the number that is important for further analysis.

Calculate Spearman’s rank correlation coefficient from formula (3.18) supported by (3.19). We have:

$$r_S = 1 - \frac{6 \times 6975.5}{33\,(33^{2} - 1)} = -0.166$$

Formulate hypothesis H₀ stating there is no dependence between the values with respect to time. This hypothesis is set against a hypothesis H₁ : ρ ≠ 0, stating that the values of the variable depend on time.

Figure 3.21. The sequence of the energy of several seismic tremors (×10⁵ J).

To verify the null supposition the critical value r_α(n) should be taken from Table 9.14 for a given level of significance α and the sample size n = 33. However, the sample size is large and in such a case the approximation can be applied, i.e.

$$r_S(\alpha, n) \approx \frac{u_{1-\alpha/2}}{\sqrt{n-1}}$$

where u_1−α/2 is the quantile of order (1 – α/2) of the standardized normal distribution N(0, 1); the test is two-sided.

In our case we have:

$$r_S(0.05, 33) \approx \frac{1.96}{\sqrt{32}} = 0.346$$

Comparing the absolute value of the empirical coefficient, 0.166, with the corresponding critical value, we have no ground to reject the null hypothesis. We can assume that the sequence noted is free from dependence on time.

Table 3.3. Auxiliary calculations.

No.   Value   Rank   (v_i − i)²
 1       7     5.5     20.25
 2      15    11       81
 3     280    24      441
 4     190    22      324
 5     900    29.5    600.25
 6       8     7.5      2.25
 7       8     7.5      0.25
 8     100    19.5    132.25
 9      10    10        1
10       2     1.5     72.25
11     800    27      256
12    2000    33      441
13    1000    32      361
14     900    29.5    240.25
15     850    28      169
16      95    18        4
17      25    14        9
18     100    19.5      2.25
19     950    31      144
20       9     9      121
21     320    25       16
22     210    23        1
23      20    12.5    110.25
24      20    12.5    132.25
25      40    15.5     90.25
26       6     4      484
27     600    26        1
28      40    15.5    156.25
29       5     3      676
30       7     5.5    600.25
31     105    21      100
32       2     1.5    930.25
33      80    17      256
Σ                    6975.5
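The auxiliary calculations of Table 3.3 can be verified programmatically. The sketch below assumes the simple form of Spearman's coefficient used in the worked numbers, r_S = 1 − 6Σ(v_i − i)²/(n(n² − 1)), with average ranks for tied values and no further tie correction; the helper name `average_ranks` is hypothetical. The value 1.96 is the 0.975 quantile of N(0, 1), which corresponds to a two-sided test at α = 0.05.

```python
import math

def average_ranks(values):
    """Map each distinct value to its rank; tied values share the mean of their ranks."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return {v: sum(p) / len(p) for v, p in positions.items()}

# Tremor energies of Example 3.13, in units of 10^5 J, in order of occurrence
tremors = [7, 15, 280, 190, 900, 8, 8, 100, 10, 2, 800, 2000, 1000, 900, 850,
           95, 25, 100, 950, 9, 320, 210, 20, 20, 40, 6, 600, 40, 5, 7, 105, 2, 80]

n = len(tremors)
rank_of = average_ranks(tremors)

# Squared differences between the rank of each value and its position in time
d2 = sum((rank_of[v] - i) ** 2 for i, v in enumerate(tremors, start=1))

r_s = 1 - 6 * d2 / (n * (n * n - 1))
critical = 1.96 / math.sqrt(n - 1)   # large-sample approximation used in the text

print(d2, round(r_s, 3), round(critical, 3), "| dependence on time:", abs(r_s) >= critical)
```

The script recovers the sum of squared differences shown in the last row of the table, so the ranks in the table are consistent with the raw data.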


Let us conduct our consideration further. Check whether a memory exists in the realisation

In document Statistics for Mining Engineering-(2014) (Page 113-126)