This section evaluates the performance of the proposed combination method for biased samples on synthetic and real-world datasets.
5.4.1 Evaluation by synthetic data
In the first part of our evaluation, we use a number of generated datasets with normal or binomial distributions, which serve as stand-ins for real-world datasets. The normally distributed datasets are examples of continuous datasets, such as datasets containing users’ mobility. The binomially distributed datasets are examples of discrete datasets, such as datasets containing the video requests of the users of a content provider.
The datasets also have different sample sizes and different numbers of dimensions. Since the true statistics of the generated datasets are known, we can assess how well the proposed combination method recovers them.
The Relative Error (RE), used in [23], is utilized to assess the performance of our combination method. The RE is defined as
\[
\mathrm{RE} = \left| \text{true mean} - \text{estimated mean} \right| \tag{5.10}
\]
To show the performance of our proposed model, we report the Percentage of Improvement (denoted as PoI) with respect to the standard method of mean calculation. PoI is defined as,
\[
\mathrm{PoI} = \left( 1 - \text{mean of } \frac{\mathrm{RE}_{\text{proposed method}}}{\mathrm{RE}_{\text{standard method}}} \right) \times 100. \tag{5.11}
\]
The values of PoI are plotted against the absolute value of the true bias. In what follows, we explain the dataset generation process and the results of our evaluation.
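The two metrics above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes that the "mean" in the PoI definition is taken over the per-repetition ratios of the two REs, which is one possible reading of the formula.

```python
from statistics import fmean


def relative_error(true_mean, estimated_mean):
    """RE as defined in the text: |true mean - estimated mean|."""
    return abs(true_mean - estimated_mean)


def percentage_of_improvement(re_proposed, re_standard):
    """PoI = (1 - mean(RE_proposed / RE_standard)) * 100.

    Assumption: the mean is taken over the ratios, one ratio per repetition.
    The RE of the standard method must be non-zero in every repetition.
    """
    ratios = [p / s for p, s in zip(re_proposed, re_standard)]
    return (1.0 - fmean(ratios)) * 100.0


# Hypothetical REs from two repetitions, for illustration only.
poi = percentage_of_improvement([0.5, 0.25], [1.0, 1.0])
```

A positive PoI means the proposed method has, on average, a smaller RE than the standard method; a PoI of 0 means no improvement.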
Evaluation by normally distributed dataset
For the normally distributed dataset, data generation is performed under different scenarios. Each scenario fixes the bias, the number of samples and the number of dimensions, and is repeated 10000 times. To obtain the average performance, we average the results over all repetitions of each scenario.
In every repetition, the parameters of the normal distribution used to generate the samples, i.e. the mean and the standard deviation (denoted by std), are drawn anew and therefore differ from those of the other repetitions. The absolute value of the bias increases from 0 to 1.5 in steps of 0.1. To take the sign of the bias into account, the sign of the true bias changes in each repetition. Table 5.2 summarizes the values of the simulation parameters used in our evaluation.
Table 5.2 The range of the values of simulation parameters for the normal distribution
Parameter                        Range of values
Number of repetitions            10000
True mean of samples 1           Drawn from N(20, 50)
True value of bias               Increasing from 0 to 1.50
True mean of samples 2           µ1 + bias (µ1 is the true mean vector of samples 1)
True STD of samples 1 and 2      Drawn from N(0, 200)
Number of samples                30 and 30
Number of dimensions             20 and 40
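The generation of one repetition under the parameters of Table 5.2 could look roughly like the sketch below. Several details are assumptions rather than statements from the text: the second argument of N(·, ·) is treated as a standard deviation, the drawn std is made non-negative by taking its absolute value, both samples share the same per-dimension std, and the sign of the bias alternates with the repetition index. All names are hypothetical.

```python
import random

# Bias magnitudes 0.0, 0.1, ..., 1.5, as stated in the text.
BIAS_GRID = [round(0.1 * k, 1) for k in range(16)]


def signed_bias(magnitude, repetition):
    """Assumption: the sign of the bias alternates between repetitions."""
    return magnitude if repetition % 2 == 0 else -magnitude


def generate_normal_pair(bias, n_samples=30, n_dims=20, rng=None):
    """Draw two normally distributed samples whose true means differ by `bias`."""
    rng = rng or random.Random()
    # Per-dimension true mean of samples 1, drawn from N(20, 50).
    mu1 = [rng.gauss(20.0, 50.0) for _ in range(n_dims)]
    # Per-dimension std drawn from N(0, 200); abs() keeps it valid (assumption).
    std = [abs(rng.gauss(0.0, 200.0)) for _ in range(n_dims)]
    s1 = [[rng.gauss(m, s) for m, s in zip(mu1, std)] for _ in range(n_samples)]
    s2 = [[rng.gauss(m + bias, s) for m, s in zip(mu1, std)] for _ in range(n_samples)]
    return mu1, s1, s2


# One repetition of one scenario (bias magnitude 0.5, 30 samples, 20 dimensions).
mu1, s1, s2 = generate_normal_pair(signed_bias(0.5, repetition=0), rng=random.Random(0))
```

A full scenario would repeat this call 10000 times for each value in `BIAS_GRID` and average the resulting REs.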
Figure 5.2a presents the PoI, defined in Eq. 5.11, for different values of bias. As expected, the performance of the proposed combination model decreases as the magnitude of the bias grows. When the sample size and the number of dimensions equal 30 and 20, respectively, the PoI is more than ∼ 45% for bias = 0.1 and decreases to ∼ 30% for bias = 1.5. Figure 5.2b shows that increasing the number of dimensions to 40 generally improves the performance of the model, which is due to the improved bias estimation. For bias = 0.1, the performance of the proposed combination method is close to that of the 20-dimensional case; as the bias increases, however, the method performs better than in the 20-dimensional case. In the 40-dimensional case with bias = 1.5, the PoI is ∼ 36%, which is ∼ 6 percentage points higher than in the 20-dimensional case.
(a) The PoI against bias for the sample size and dimensions equal to 30 and 20, respectively.
(b) The PoI against bias for the sample size and dimension equal to 30 and 40, respectively.
Fig. 5.2 Graphs represent PoI against different values of bias for two normally distributed random variables.
Evaluation by binomially distributed dataset
In this section, we use a synthesized dataset containing a number of multi-dimensional samples generated from a binomial distribution. The samples in each dimension form a sequence of 0s and 1s, and the mean of each sequence gives the probability of observing 1. As with the normally distributed dataset, we repeat our simulation 10000 times and, in each repetition, consider different values for the main parameters of the samples. Table 5.3 summarizes the range of values of the parameters.
Table 5.3 The range of the values of simulation parameters for the binary distribution
Parameter                        Range of values
Number of repetitions            10000
True value of bias               Increasing from 0 to 0.3
True mean of samples 1           Drawn from U(0, 1)
True mean of samples 2           µ1 + bias (µ1 is the true mean vector of samples 1)
Number of samples                30 and 30
Number of dimensions             20 and 40
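A comparable sketch for the binary case is given below. It assumes that the true mean of the second sample is the true mean of the first sample plus the bias, and that the result is clipped to [0, 1] so that it remains a valid probability; the clipping and the function name are illustrative assumptions, not part of the described method.

```python
import random


def generate_binary_pair(bias, n_samples=30, n_dims=20, rng=None):
    """Draw two 0/1 samples whose per-dimension success probabilities differ by `bias`."""
    rng = rng or random.Random()
    # Per-dimension probability of observing 1 in samples 1, drawn from U(0, 1).
    p1 = [rng.random() for _ in range(n_dims)]
    # Shift by the bias and clip to [0, 1] (assumption).
    p2 = [min(1.0, max(0.0, p + bias)) for p in p1]
    s1 = [[1 if rng.random() < p else 0 for p in p1] for _ in range(n_samples)]
    s2 = [[1 if rng.random() < p else 0 for p in p2] for _ in range(n_samples)]
    return p1, p2, s1, s2


# One repetition with bias 0.1, 30 samples and 20 dimensions.
p1, p2, s1, s2 = generate_binary_pair(0.1, rng=random.Random(0))
```

The column mean of `s1` is then the standard estimate of `p1`, whose RE can be compared with that of the proposed method.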
Figures 5.3a and 5.3b show how the PoI changes as a function of bias for a binary sample. The only difference between the synthesized datasets used for these graphs is the number of dimensions (20 and 40). As can be seen, there is no significant difference between the PoIs observed for these two samples. For both samples, the PoI equals ∼ 50% for bias = 0.02 and decreases to ∼ 15% for bias = 0.30.
(a) The PoI against bias for the sample size and dimension equal to 30 and 20, respectively.
(b) The PoI against bias for the sample size and dimension equal to 30 and 40, respectively.
Fig. 5.3 Graphs represent PoI against different values of bias for two binomially distributed random variables.
The results presented in this section show that the proposed combination method estimates statistics more accurately than the standard method for samples with binary and normal distributions. In the following section, the proposed method is evaluated using a real-world dataset.
5.4.2 Evaluation by real-world data
In the second part of our evaluation, we use a dataset containing one month of content request records of 787 users of BBC iPlayer. The raw dataset records which episodes of the programmes each user watched during that month. Initially, the dataset was divided into two parts. We randomly extracted 33% of the dataset and treated it as a small input sample for the proposed combination method. The method was then used to estimate the statistics of the whole dataset (all 100%). For each user, we extracted from both the 33% sample and the whole dataset the set of watched episodes and the set of episodes that could have been watched.
Each user in the smaller and in the whole dataset was represented as a vector whose elements are either 0 or 1 and whose dimension equals the total number of episodes in that dataset. If an episode had been watched by the user, the corresponding element of the representation vector was set to 1, and otherwise to 0. The representation vector constructed for each user can then be considered as a binary random variable.
The probability of the occurrence of 1 in the representation vector of a user is defined as the Interest, denoted by I, of that user in each programme. The proposed combination method and the standard method for statistics calculation [24] are then used to estimate the Interest of the users from 33% of the whole dataset. 300 users are used to train the model parameters, and the performance of the models is tested on the remaining users. We compare the two methods in terms of how accurately they estimate the Interest of the users obtained from the whole dataset.
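The construction of the representation vector and the Interest described above can be illustrated as follows. The episode identifiers and function names are hypothetical, and the sketch computes a single Interest value over all episodes passed in; a per-programme Interest would, under this reading, restrict the list of available episodes to those of one programme.

```python
def representation_vector(watched, available):
    """Return a 0/1 list with one element per available episode."""
    watched = set(watched)
    return [1 if episode in watched else 0 for episode in available]


def interest(vector):
    """Interest I: the fraction of 1s in the representation vector."""
    if not vector:
        return 0.0  # assumption: no available episodes means zero Interest
    return sum(vector) / len(vector)


# Hypothetical episode identifiers, for illustration only.
available = ["ep1", "ep2", "ep3", "ep4"]
vector = representation_vector(["ep1", "ep3"], available)
user_interest = interest(vector)
```

Applying the same two functions to the 33% sample and to the whole dataset gives the estimated and the reference Interest of a user, whose absolute difference is the RE of Eq. 5.10.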
Comparison with the standard method
To compare the two methods, we divide the RE (defined in Eq. 5.10) of our proposed method by the RE of the standard method for finding statistics. The comparison shows that the proposed method outperforms the standard method: on average, it improves on the standard method by ∼ 15%.
In this section, we evaluated the proposed combination method on a real-world dataset. The evaluation shows that the method also performs well in a real-world setting.