2.3 Techniques to treat Small Sample Size problem

2.3.3 Proposing new modified formulas for statistics estimation to reduce the error caused by SSS

Another technique for reducing the error that results from SSS is to use additional information to modify the conventional formulas for calculating the statistics of a sample.

The additional information can be obtained from the sample size, median, quartiles, and minimum and maximum values. Different studies have used different subsets of this information to modify the common formulas for estimating sample statistics (e.g. the mean and standard deviation of the sample).

Using the sample size, median, minimum and maximum, Hozo et al. [25] modified the mean and standard deviation formulas. Their proposed formula for estimating the mean of a sample has the following form:

$$ \text{mean} = \frac{a + 2m + b}{4}, \tag{2.20} $$

in which a, b and m are the minimum, maximum and median of the sample, respectively. The authors also proposed the following formula for the variance of a sample, whose square root estimates the standard deviation:

$$ S^2 \approx \frac{1}{n-1}\left( a^2 + m^2 + b^2 + \frac{n-3}{2}\cdot\frac{(a+m)^2 + (b+m)^2}{4} - \frac{n\,(a + 2m + b)^2}{16} \right), \tag{2.21} $$

where n is the sample size.
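As an illustration, Eqs. 2.20 and 2.21 can be written as two small functions. This is a minimal sketch rather than code from Hozo et al.; the function names `hozo_mean` and `hozo_variance` are hypothetical and chosen only for this example.

```python
def hozo_mean(a, m, b):
    """Eq. 2.20: mean estimated from the minimum a, median m and maximum b."""
    return (a + 2 * m + b) / 4


def hozo_variance(a, m, b, n):
    """Eq. 2.21: variance estimated from a, m, b and the sample size n."""
    if n < 2:
        raise ValueError("the sample size n must be at least 2")
    inner = (
        a**2 + m**2 + b**2
        + (n - 3) / 2 * ((a + m) ** 2 + (b + m) ** 2) / 4
        - n * (a + 2 * m + b) ** 2 / 16
    )
    return inner / (n - 1)
```

For the three-point sample {1, 5, 9}, `hozo_mean(1, 5, 9)` returns 5.0 and `hozo_variance(1, 5, 9, 3)` returns 16.0, which coincide with the ordinary sample mean and sample variance because the minimum, median and maximum are the whole sample.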

Bland [188] extended Hozo et al.'s method by adding information about the first quartile q1 and the third quartile q3 of a sample. The proposed formulas for the mean and variance of a sample are as follows:

$$ \text{mean} = \frac{a + 2q_1 + 2m + 2q_3 + b}{8}, \tag{2.22} $$

$$ S^2 \approx \frac{1}{16}\left( a^2 + 2q_1^2 + 2m^2 + 2q_3^2 + b^2 \right) + \frac{1}{8}\left( a q_1 + q_1 m + m q_3 + q_3 b \right) - \frac{1}{64}\left( a + 2q_1 + 2m + 2q_3 + b \right)^2. \tag{2.23} $$
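The quartile-based estimators can be sketched in the same way. The function names below are hypothetical. The mean uses the divisor 8, which is what the 1/64 factor on the squared sum in Eq. 2.23 implies, since the last term of the variance is the square of the estimated mean.

```python
def bland_mean(a, q1, m, q3, b):
    """Eq. 2.22: mean from minimum a, quartiles q1 and q3, median m, maximum b."""
    return (a + 2 * q1 + 2 * m + 2 * q3 + b) / 8


def bland_variance(a, q1, m, q3, b):
    """Eq. 2.23: estimated second moment minus the squared estimated mean."""
    squares = (a**2 + 2 * q1**2 + 2 * m**2 + 2 * q3**2 + b**2) / 16
    cross = (a * q1 + q1 * m + m * q3 + q3 * b) / 8
    return squares + cross - bland_mean(a, q1, m, q3, b) ** 2
```

For the evenly spaced summary (0, 2, 4, 6, 8), `bland_mean` returns 4.0 and `bland_variance` returns 5.0. Unlike Eq. 2.21, these formulas do not take the sample size n as an input.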

In [23], Wan et al. showed that the methods of Hozo et al. and Bland are unstable and overestimate the statistics of a sample when the sample size is large. Wan et al.'s method models the dependence on sample size in more detail and provides more stable and accurate estimates of the sample statistics.

Kwon et al. [26] used the Approximate Bayesian Computation (ABC) method to approximate a likelihood function for the sample. The likelihood function for a parameter θ and observed data D is denoted by f(D|θ) and can be used to estimate the posterior distribution of θ by

$$ p(\theta \mid D) \propto f(D \mid \theta)\, p(\theta), \tag{2.24} $$

where p(θ) is the prior distribution of θ. Eq. 2.24 becomes an equality once the omitted normalization term p(D) is placed in the denominator of the right-hand side, which turns it into the familiar Bayes rule. In brief, Eq. 2.24 states that the probability of θ given that D has been observed equals the probability of D given θ multiplied by the ratio of the probability of θ to the probability of D. Here p(θ) is the prior, p(θ|D) is the posterior and f(D|θ) is the likelihood.

Kwon et al. utilized the above formula to estimate the mean and standard deviation of a sample.

For situations in which the likelihood is difficult to evaluate, Kwon et al. suggested that the ABC approach [189] performs well in approximating it. The authors conducted a simulation-based study to test the performance of their technique and showed that it performs better than the above-mentioned methods, except Wan et al.'s [23], for normally distributed samples.
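The general idea of ABC can be illustrated with a basic rejection sampler. The sketch below is not Kwon et al.'s algorithm: the normal data model, the uniform priors, the Euclidean distance on (minimum, median, maximum), the rule of keeping the closest draws, and the name `abc_mean_sd` are all assumptions made for this example.

```python
import math
import random
import statistics


def abc_mean_sd(a, m, b, n, draws=5000, keep=100, seed=0):
    """Rejection-style ABC estimate of (mean, sd) from min a, median m, max b.

    Assumed model: n normal observations with uniform priors
    mu ~ U(a, b) and sigma ~ U(0, b - a).
    """
    rng = random.Random(seed)
    scored = []
    for _ in range(draws):
        # Draw candidate parameters from the prior.
        mu = rng.uniform(a, b)
        sigma = rng.uniform(0.0, b - a)
        # Simulate a sample and reduce it to the same summary statistics.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        summary = (min(sample), statistics.median(sample), max(sample))
        # Distance between simulated and observed summaries.
        scored.append((math.dist(summary, (a, m, b)), mu, sigma))
    # Keep the draws whose summaries are closest to the observed ones.
    scored.sort()
    accepted = scored[:keep]
    mu_hat = statistics.fmean([mu for _, mu, _ in accepted])
    sigma_hat = statistics.fmean([sigma for _, _, sigma in accepted])
    return mu_hat, sigma_hat
```

A call such as `abc_mean_sd(40.0, 50.0, 60.0, n=30)` returns the averages of the accepted parameter draws as point estimates of the mean and standard deviation. No likelihood is ever evaluated; the comparison of simulated and observed summaries takes its place.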

Another useful source of information for correcting the estimated mean and standard deviation is the statistics of the population. The Maximum a Posteriori (MAP) technique can use this extra information to correct the mean and standard deviation values [24]. In this method, the mean of a sample is modified according to the following formula:

$$ \hat{\mu} = \alpha \cdot \mu + (1 - \alpha) \cdot \hat{M}, \tag{2.25} $$

where the parameter α is a weight whose optimum value is learned from the training dataset; µ is the actual mean of the small sample; $\hat{\mu}$ is the modified mean of the sample; and $\hat{M}$ is the corresponding mean taken from the population statistics.
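The correction in Eq. 2.25 is a weighted average of the small-sample mean and a population-level mean, which can be sketched as follows. The function names, the grid search, and the squared-error criterion used to choose α are assumptions made for illustration; the source only states that α is learned from the training dataset.

```python
def map_mean(sample_mean, population_mean, alpha):
    """Eq. 2.25: weighted average of the sample mean and the population mean."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sample_mean + (1.0 - alpha) * population_mean


def fit_alpha(training_pairs, population_mean, steps=100):
    """Grid search for alpha (assumed training procedure).

    training_pairs holds (small_sample_mean, reference_mean) tuples, where
    the reference mean is a trusted value for the same variable.
    """
    best_alpha, best_error = 0.0, float("inf")
    for i in range(steps + 1):
        alpha = i / steps
        error = sum(
            (map_mean(s, population_mean, alpha) - ref) ** 2
            for s, ref in training_pairs
        )
        if error < best_error:
            best_alpha, best_error = alpha, error
    return best_alpha
```

With α = 1 the sample mean is left unchanged, and with α = 0 it is replaced entirely by the population mean, so α expresses how much the small sample is trusted.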

The combination method proposed in this thesis uses the information obtained from other samples to correct the estimated statistics (e.g. mean and standard deviation); therefore, the method belongs to the third class. Compared with the existing methods, the proposed method has two major novelties.

1. For each variable and based on two conditions, the proposed method selects a subset of variables that are more likely to cause statistical improvement and combines the selected subset of variables with the targeted variable.

2. The proposed method defines a bias parameter between every two variables and decides whether to combine two variables based on this bias parameter as well as the sample sizes available for the two variables.

Although all the methods discussed in this section are able to reduce the SaE and can be useful in the field of human behavior modeling, almost none of them has been used in this field. The present study concentrates on reducing the SaE for a model developed to predict human behaviors. However, the proposed combination method can also benefit other fields of research in which the SSS problem is a relatively large source of error.

In document Pattern profiling of users' behaviour. (Page 74-77)