5. Robust Estimation of Visit Potential under Missing Data
5.3. Experimental Set-up
5.3.1. Test Scenario
We conducted our experiments on the GPS data set of the German audience measurement study introduced in Section 3.2.2. Because using the complete data set would have resulted in very long computation times, we restricted the analysis to a subset of the data. For our experiments we selected the city of Hamburg, i.e. the universal set of entities consists of all GPS participants in Hamburg and the universal set of locations consists of all poster sites in Hamburg. Note that selecting a single city, instead of distributing entities and locations randomly over Germany, is necessary in order to concentrate visits and to obtain a reasonable level of visit potential. As most individual movements take place locally, a distributed location set would strongly decrease the probability of a visit. We chose Hamburg because it offers a comparatively large set of test persons and possesses a complex city structure. Although we selected only a single city for the experiments, we expect the results to generalize because the subset comprises typical movement behavior.
Note that even though we use data from the outdoor advertising application in our experiments, we focus only on one part of the modeling process. The obtained results are therefore not directly comparable to the actual values used in the application.
One problem in evaluating missing data methods is that, given data with missing values, the true value of any derived quantity is unknown, so the estimates cannot be compared against a ground truth. We therefore use only test persons who are completely observed and introduce artificial missingness into the data. This allows us, on the one hand, to control the missing data mechanism and, on the other hand, to vary the amount of missingness. The mobility data provides up to seven measurement days; however, a restriction to test persons with seven complete measurement days reduces the data set considerably. Therefore, we decrease the number of observed measurement days to five and evaluate visit potential for t = 5. For our experiments we selected all test persons with at least five observed measurement days and restricted the trajectory data set to the first five observed days of each of these persons. The reduced entity set contains 393 of the original 548 test persons in Hamburg.
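The selection step can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments; the function name `restrict_to_first_days` and the dictionary layout (person id mapped to a list of observed day indices) are assumptions made for the example.

```python
from typing import Dict, List


def restrict_to_first_days(
    observed_days: Dict[str, List[int]], t: int = 5
) -> Dict[str, List[int]]:
    """Keep persons with at least t observed days, truncated to their first t days."""
    return {
        person: sorted(days)[:t]
        for person, days in observed_days.items()
        if len(days) >= t
    }


# Hypothetical toy input: day indices on which each person was observed.
example = {
    "p1": [1, 2, 3, 4, 5, 6, 7],  # seven observed days, first five are kept
    "p2": [1, 2, 4],              # fewer than five observed days, dropped
    "p3": [2, 3, 4, 6, 7],        # exactly five observed days, kept as is
}
reduced = restrict_to_first_days(example)
```

Applied to the Hamburg data, a filter of this kind is what reduces the entity set from 548 to 393 test persons.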
In Section 4.4.1 we showed how visit potential can be used to precisely define poster performance indicators. The most important visit potential quantities in this context are gross visits, average visits per entity for visit class vc = 1 and entity coverage for visit class vc = 1. We will therefore conduct our experiments with respect to these three quantities. Note that we will shorten the names of the tested visit potential quantities to gross visits, average visits and entity coverage because only the entity perspective is applied in the scenario and the quantities therefore cannot be confused.
For a given entity set, visit potential varies with the size of the location set. Clearly, the more posters a campaign contains, the higher the chance that a test person passes a poster of the campaign. We will therefore vary the size of the location set in order to test the performance of missing data methods at different levels of visit potential. In particular, we will conduct our experiments for location sets of size 25, 50, 100, 250 and 500.
In order to obtain stable results, we test each missing data method on 30 different location sets of the same size. The location sets are sampled at the beginning of the experiments and are the same for each method.
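Sampling the location sets once and reusing them for every method could look like the sketch below. The function name, the fixed seed and the integer poster ids are assumptions for illustration; the text only states that 30 sets per size are drawn up front and shared across methods.

```python
import random
from typing import Dict, List, Sequence


def sample_location_sets(
    all_locations: Sequence[int],
    sizes: Sequence[int] = (25, 50, 100, 250, 500),
    n_sets: int = 30,
    seed: int = 0,
) -> Dict[int, List[List[int]]]:
    """Draw n_sets random location sets (without replacement) for each size."""
    rng = random.Random(seed)  # fixed seed so every method sees the same sets
    return {
        size: [rng.sample(all_locations, size) for _ in range(n_sets)]
        for size in sizes
    }


# Hypothetical universe of 1,000 poster ids.
poster_ids = list(range(1000))
location_sets = sample_location_sets(poster_ids)
```

Fixing the sets before the experiments means that differences between methods cannot be attributed to differences between the sampled campaigns.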
A detailed parameterization of all experiments is given in Appendix B.
5.3.2. Error Measurement
We measure the performance of each missing data method using the mean error, the relative mean error and the root mean squared error. The mean error expresses the bias of a method in absolute numbers, while the relative mean error relates the bias to the value of the measured variable. The root mean squared error reflects both the bias and the variance of an estimation method and is expressed in units of the analyzed variable. We will refer to these errors as basic errors. More precisely, the errors are defined as follows. Note that, to reduce the notation to essentials, we refrain from including the data set and variable in the parameterization of the basic errors and only specify the name of the missing data method.
Definition 5.3.1 (Mean error) Let $y_i$ with $i = 1..n$ denote the true values of some variable $Y$ and $\hat{y}_i$ denote the estimated or predicted values of the variable by some method $m$. The mean error is defined as
\[ me(m) = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)}{n}. \]
Definition 5.3.2 (Relative mean error) Let $y_i$ with $i = 1..n$ denote the true values of some variable $Y$ and $\hat{y}_i$ denote the estimated or predicted values of the variable by some method $m$. The relative mean error is defined as
\[ rme(m) = \frac{\sum_{i=1}^{n} \frac{\hat{y}_i - y_i}{y_i}}{n}. \]
Definition 5.3.3 (Root mean squared error) Let $y_i$ with $i = 1..n$ denote the true values of some variable $Y$ and $\hat{y}_i$ denote the estimated or predicted values of the variable by some method $m$. The root mean squared error is defined as
\[ rmse(m) = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}. \]
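The three basic errors translate directly into code. The sketch below follows Definitions 5.3.1 to 5.3.3; the function names and the toy values are chosen for illustration only.

```python
import math
from typing import Sequence


def mean_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average signed deviation of the estimates from the true values."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_mean_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average signed deviation relative to the true value (requires y_i != 0)."""
    return sum((p - t) / t for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the average squared deviation."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical true and estimated values for three experiments.
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 40.0]

me_value = mean_error(y_true, y_pred)
rme_value = relative_mean_error(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

In this toy example the over- and underestimation cancel in the mean error, whereas the root mean squared error remains positive, which illustrates why both measures are reported.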
As we conduct our experiments for different rates of missing data as well as for different sizes of the location set, we form three further errors that aggregate the results of the mean error, relative mean error and root mean squared error over all parameterizations. We will call these aggregated errors compound errors.
Definition 5.3.4 (Average absolute compound error) Let $err(m)_{ij}$ denote an arbitrary basic error of some method $m$ measured for location size $s_i$ and missing data rate $r_j$ with $i = 1..n$, $j = 1..m$. We then define the average absolute compound error as
\[ aace(err, m) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} | err(m)_{ij} |}{n \cdot m}. \]
The main purpose of the compound error is to provide a single error value for a series of tests and thus to ease the complexity of method comparison. We choose the absolute values of the basic errors so that possible positive and negative biases for different parameterizations do not cancel each other out. The average value instead of a sum of errors was selected in order to retain the relation to the true value of the evaluated variable.
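A minimal sketch of the compound error, assuming the basic errors are stored as a rectangular nested list indexed by location set size and missing data rate (this layout and the toy values are illustrative):

```python
from typing import Sequence


def aace(errors: Sequence[Sequence[float]]) -> float:
    """Average of the absolute basic errors over all parameterizations.

    errors[i][j] is the basic error for location set size i and
    missing data rate j.
    """
    values = [abs(e) for row in errors for e in row]
    return sum(values) / len(values)


# Hypothetical basic errors with opposite signs for two sizes and two rates.
basic_errors = [[0.5, -0.5], [1.0, -1.0]]

compound = aace(basic_errors)
plain_average = sum(e for row in basic_errors for e in row) / 4
```

Here the plain average is zero although every single parameterization is biased, while the compound error of 0.75 reflects the typical magnitude of the error.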
In summary, our basic errors are formed for a given rate of missing data and a given location set size over 30 experiments with different poster campaigns. Our compound errors aggregate the basic errors for 10 (during parameter tuning only 6) different rates of missing data and 5 location set sizes, i.e. the compound error summarizes 1,500 (respectively 900) experiments.
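The experiment counts follow from multiplying the number of missing data rates, the number of location set sizes and the number of location sets per size:
\[ 10 \cdot 5 \cdot 30 = 1{,}500, \qquad 6 \cdot 5 \cdot 30 = 900. \]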
5.3.3. Generation of Artificial Missing Data
In order to evaluate the robustness of the selected methods for missing data, we implemented the mechanisms MCAR, CDMAR and MAR (see Section 5.1.2). Further, we varied the rate of missing data. Note that we use the term rate to refer to the proportion of partially observed entities, i.e. the proportion of the entity set with at least one missing measurement day. The term does not refer to the proportion of missing measurement days in total. The reason for our definition is that a completely random introduction of missing measurement days according to a given rate, i.e. each day has a given probability of being missing, may lead to the deletion of all measurement days of a test person. This, however, reduces the size of the entity set, which distorts the rate of missingness for the remaining entities and increases the standard error of the results. We therefore follow a two-step strategy. First, a group of persons is selected according to a given missingness rate. Second, one or more measurement days of each selected person are deleted, while at least one measurement day is always retained. Table 5.5 shows the corresponding expected percentage of missing measurement days for the applied rates of partially observed entities.
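The relation between the two notions of rate can be made explicit. As a sketch, let $r$ denote the rate of partially observed entities, $t$ the number of measurement days per person and $K$ the number of days deleted for a selected person, with $1 \le K \le t - 1$; the symbol $K$ is introduced here for illustration only. The expected proportion of missing measurement days is then
\[ r \cdot \frac{E[K]}{t}. \]
If, purely as an assumption, $K$ were uniformly distributed on $\{1, \dots, 4\}$ with $t = 5$, then $E[K] = 2.5$ and the expected proportion would be $r / 2$. The values that apply to the experiments are those given in Table 5.5.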
For MCAR the group of persons in the first step is chosen randomly. The number of deleted measurement days within the second step is also determined at random.
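The two steps for MCAR can be sketched as below. The function name, the data layout and the uniform choice of the number of deleted days are assumptions; the text only states that both the group and the number of deleted days are chosen at random and that at least one day is retained.

```python
import random
from typing import Dict, List


def introduce_mcar(
    days_by_person: Dict[str, List[int]], rate: float, seed: int = 0
) -> Dict[str, List[int]]:
    """Delete days for a random share of persons, keeping at least one day each.

    Assumes every person has at least two measurement days.
    """
    rng = random.Random(seed)
    persons = sorted(days_by_person)
    # Step 1: choose the partially observed persons completely at random.
    selected = set(rng.sample(persons, round(rate * len(persons))))
    result = {}
    for person in persons:
        days = list(days_by_person[person])
        if person in selected:
            # Step 2: delete between 1 and len(days) - 1 randomly chosen days.
            k = rng.randint(1, len(days) - 1)
            removed = set(rng.sample(days, k))
            days = [d for d in days if d not in removed]
        result[person] = days
    return result


# Hypothetical entity set: 100 persons with five complete measurement days.
complete = {f"p{i:03d}": [1, 2, 3, 4, 5] for i in range(100)}
with_missing = introduce_mcar(complete, rate=0.3)
```

Because the selection happens per person rather than per day, the entity set keeps its size and the share of partially observed entities equals the requested rate.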
For CDMAR we select different proportions of persons with missing data within different sociodemographic groups. It is important here that the sociodemographic variable influences mobility behavior. Otherwise, missingness would still be random with respect to mobility behavior.
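The group-dependent selection can be sketched as follows. The group labels, the rates per group and the function name are hypothetical; only the idea of applying a different selection rate within each sociodemographic group is taken from the text.

```python
import random
from typing import Dict, Set


def select_cdmar(
    group_by_person: Dict[str, str], rate_by_group: Dict[str, float], seed: int = 0
) -> Set[str]:
    """Select persons at random within each group, with a group-specific rate."""
    rng = random.Random(seed)
    selected: Set[str] = set()
    for group, rate in rate_by_group.items():
        members = sorted(p for p, g in group_by_person.items() if g == group)
        selected.update(rng.sample(members, round(rate * len(members))))
    return selected


# Hypothetical entity set with two groups of 50 persons each.
groups = {f"p{i:03d}": ("employed" if i < 50 else "retired") for i in range(100)}
partially_observed = select_cdmar(groups, {"employed": 0.2, "retired": 0.6})
n_employed = sum(groups[p] == "employed" for p in partially_observed)
n_retired = sum(groups[p] == "retired" for p in partially_observed)
```

Within each group the selection is still random, so conditioning on the sociodemographic variable is sufficient to account for the missingness.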
Finally, we introduce a version of MAR where the selection of persons depends solely on their mobility behavior. In this case only the inclusion of mobility information can help to reduce the bias.
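One possible way to make the selection depend on mobility is sketched below. It is an assumption, not the mechanism used in the experiments: persons are split at the median of a hypothetical mobility score, and a higher selection rate is applied to the more mobile half.

```python
import random
from typing import Dict, Set


def select_mar_by_mobility(
    mobility_by_person: Dict[str, float],
    rate_low: float,
    rate_high: float,
    seed: int = 0,
) -> Set[str]:
    """Select persons with a rate that depends on their mobility score."""
    rng = random.Random(seed)
    # Rank persons by mobility score; the person id breaks ties deterministically.
    ranked = sorted(mobility_by_person, key=lambda p: (mobility_by_person[p], p))
    half = len(ranked) // 2
    low, high = ranked[:half], ranked[half:]
    selected = set(rng.sample(low, round(rate_low * len(low))))
    selected.update(rng.sample(high, round(rate_high * len(high))))
    return selected


# Hypothetical mobility scores, e.g. the number of trips per person.
mobility = {f"p{i:03d}": float(i) for i in range(100)}
chosen = select_mar_by_mobility(mobility, rate_low=0.1, rate_high=0.5)
n_low = sum(mobility[p] < 50 for p in chosen)
n_high = sum(mobility[p] >= 50 for p in chosen)
```

Since the selection rate is tied directly to the mobility score, sociodemographic variables alone cannot explain the missingness in this setting.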