Assignment using Bayesian Network

We proceed to explain how the Bayesian network is used to assign a stop location to a PoI. The assignment method employs distance-based filtering to reduce the number of possible categories: it keeps only the PoIs within a distance threshold that is determined by a distance factor parameter (df) together with ad_th, introduced in Section 3.2.

Algorithm E.1: The Assignment Algorithm

Input: bn - Bayesian network, sl - stop location, ad_th - distance threshold for labeled dataset construction, df - distance factor, tp - time period duration, dg - day granularity

Output: p - visited PoI

1: ti ← time index of sl wrt tp

2: di ← day index of sl wrt dg

3: sd ← stay duration of sl

4: adt ← ad_th · df ▷ calculation of distance threshold

5: pSet ← set of PoIs within adt meters of sl

6: cSet ← getCategories(pSet) ▷ set of categories of PoIs

7: cSelected ← argmax_{cat ∈ cSet} P(cat, ti, di, sd) ▷ computation of joint probability using Bayesian network bn

8: if ∃! p ∈ pSet : (p.c = cSelected) then

9: return p

10: else

11: return ▷ unique assignment is not possible

12: end if

The algorithm is given in Algorithm E.1. It takes a Bayesian network, a stop location, a distance threshold, a distance factor, a time period duration, and a day granularity as inputs and outputs the PoI that the stop location is assigned to. First, it computes the time index, day index, and stay duration from the input stop location (lines 1–3). Then it calculates the distance threshold (line 4), filters the PoIs around the given stop location, and forms the set of possible categories (lines 5–6). The filtering also uses opening hours information if it is available. Next, it selects the most probable category from the set of possible categories by computing the joint probability of category, stay duration, day index, and time index using the Bayesian network (line 7). The joint probability is computed using Equation E.1.

P(cat, ti, di, sd) = P(di) · P(ti) · P(cat | di, ti) · P(sd | cat)    (E.1)

If only one PoI has the selected category, that PoI is returned (lines 8–9). Otherwise, the stop location is not assigned to any PoI (line 11).
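The steps of Algorithm E.1 and the factorization in Equation E.1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionary layout of `bn` (keys `P_di`, `P_ti`, `P_cat`, `P_sd`), the `PoI` class, the planar coordinates in meters, the inclusive distance test, and the toy probabilities are all assumptions made for illustration. The time index, day index, and stay duration are taken as already computed, and opening hours are ignored.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional


@dataclass(frozen=True)
class PoI:
    name: str
    category: str
    x: float  # planar coordinates in meters (assumption)
    y: float


def joint_probability(bn, cat, ti, di, sd):
    """Equation E.1: P(di) * P(ti) * P(cat | di, ti) * P(sd | cat)."""
    return (bn["P_di"].get(di, 0.0)
            * bn["P_ti"].get(ti, 0.0)
            * bn["P_cat"].get((cat, di, ti), 0.0)
            * bn["P_sd"].get((sd, cat), 0.0))


def assign(bn, sl, pois, ad_th, df) -> Optional[PoI]:
    ti, di, sd = sl["ti"], sl["di"], sl["sd"]       # lines 1-3 (precomputed here)
    adt = ad_th * df                                 # line 4
    p_set = [p for p in pois
             if hypot(p.x - sl["x"], p.y - sl["y"]) <= adt]  # line 5
    c_set = sorted({p.category for p in p_set})      # line 6 (sorted for determinism)
    if not c_set:
        return None                                  # no candidate PoI in range
    c_selected = max(
        c_set, key=lambda c: joint_probability(bn, c, ti, di, sd))  # line 7
    matches = [p for p in p_set if p.category == c_selected]
    return matches[0] if len(matches) == 1 else None  # lines 8-11


# Toy data (hypothetical values, chosen only to exercise the code).
bn = {
    "P_di": {0: 1 / 7},
    "P_ti": {24: 1 / 48},
    "P_cat": {("supermarket", 0, 24): 0.6, ("restaurant", 0, 24): 0.3},
    "P_sd": {("short", "supermarket"): 0.7, ("short", "restaurant"): 0.2},
}
sl = {"x": 0.0, "y": 0.0, "ti": 24, "di": 0, "sd": "short"}
pois = [
    PoI("Corner Market", "supermarket", 50.0, 0.0),
    PoI("Bistro", "restaurant", 0.0, 150.0),
    PoI("Far School", "school", 500.0, 0.0),
]
result = assign(bn, sl, pois, ad_th=100.0, df=2.0)
```

With ad_th = 100 m and df = 2, the threshold is adt = 200 m, so the school at 500 m is filtered out before any probability is computed. If a second supermarket were within range, the sketch would return `None`, mirroring line 11.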

4 Experimental Evaluation

We continue by evaluating the proposed method. We present the experimental setup in Section 4.1. Then we report on studies aiming to understand the effect of the parameters in Section 4.2. In Section 4.3, we present the effect of distance-based filtering (DBF) on VPE. Finally, we report on the stay duration distributions of the output assignments in Section 4.4.

4.1 Experimental Setup

We use GPS data collected from 354 cars during the period 01/03/2014–31/12/2014 with a frequency of 1 Hz. The trajectory dataset contains around 0.4 billion GPS records and the PoI dataset contains around 10,000 PoIs of 88 categories. The majority of GPS records and all of the PoIs are located in or around Aalborg, Denmark. The complete dataset is used in all experiments.

We use default values for the parameters controlling stop location extraction and home/work stop location extraction since our focus is the assignment of stop locations to PoIs. The parameters Δ_th, d_th, Δ_hw, and p_hw are set to 10 minutes, 250 meters, 240 minutes, and 3/7 (three days a week), respectively. With the default parameters, we obtain 349,637 stop locations, out of which 129,836 correspond to home/work stops.

In order to evaluate the proposed VPE method, we construct a ground truth dataset. We use a labeled dataset constructed by using the method explained in Section 3.2. The labeled dataset obtained with ad_th = 100 m contains 36,691 assignments. The top-5 PoI categories and the number of stop locations that are assigned to a PoI of each category are as follows: supermarket - 3,961, store - 3,925, school - 3,020, restaurant - 2,832, and lodging - 2,214.

To evaluate our algorithm, we apply 10-fold cross-validation to split the labeled data into training and test datasets. The training dataset is used to learn the Bayesian network, and the test dataset is used to evaluate it. In order to make sure that the test dataset contains stop locations with more than one possible PoI, we extend the region defined by a stop location and ad_th using the parameter df. If the number of PoIs in this region exceeds the minimum PoI count parameter (mpc), the stop location is included in the test set. In other words, the test set contains only stop locations with more than mpc PoIs around them.
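The test-set inclusion rule can be sketched as a small predicate. This is a hedged illustration, not the authors' code: it assumes the extended region is a circle of radius ad_th · df around the stop location (consistent with line 4 of Algorithm E.1), that coordinates are planar and in meters, and that the boundary is inclusive. The function name `eligible_for_test` is hypothetical.

```python
from math import hypot


def eligible_for_test(stop_xy, poi_xys, ad_th=100.0, df=2.0, mpc=3):
    """Return True if more than mpc PoIs lie within ad_th * df meters of the stop."""
    radius = ad_th * df
    sx, sy = stop_xy
    count = sum(1 for (px, py) in poi_xys
                if hypot(px - sx, py - sy) <= radius)
    return count > mpc  # strictly more than mpc, as stated in the text


# Toy coordinates (hypothetical): four PoIs within 200 m, one outside.
nearby = [(10.0, 0.0), (0.0, 50.0), (-120.0, 0.0), (0.0, -199.0), (300.0, 0.0)]
included = eligible_for_test((0.0, 0.0), nearby)
excluded = eligible_for_test((0.0, 0.0), nearby[:3] + [(300.0, 0.0)])
```

The strict comparison follows the statement that the test set contains only stop locations with more than mpc PoIs around them.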

For experimental purposes, we modify the assignment algorithm used in VPE. Instead of returning a single PoI, it returns a list of possible categories sorted according to the joint probability value obtained from the Bayesian network. We report the following metrics: (i) p@n - Precision at position n: the percentage of stop locations for which the category of the PoI that the stop location is assigned to is among the first n categories in the output list; (ii) mrr - Mean Reciprocal Rank: the average position, in the output list, of the category of the PoI that a stop location is assigned to (as defined here, lower values are better); (iii) npc - Number of Possible Categories: the average number of possible categories after distance-based filtering.
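The three metrics can be computed from a list of (true category, ranked category list) pairs, as in the following sketch. It follows the definitions as worded in the text, so `mrr` is computed as the average 1-based position of the true category. The function names and toy data are hypothetical, and the sketch assumes the true category always appears in the ranked list.

```python
def precision_at_n(results, n):
    """Percentage of stop locations whose true category is in the first n entries."""
    hits = sum(1 for true_cat, ranked in results if true_cat in ranked[:n])
    return 100.0 * hits / len(results)


def mrr(results):
    """Average 1-based position of the true category, per the text's definition."""
    ranks = [ranked.index(true_cat) + 1 for true_cat, ranked in results]
    return sum(ranks) / len(ranks)


def npc(results):
    """Average number of candidate categories per stop location."""
    return sum(len(ranked) for _, ranked in results) / len(results)


# Toy results (hypothetical): (true category, categories sorted by joint probability).
results = [
    ("supermarket", ["supermarket", "store", "school"]),
    ("store", ["restaurant", "store"]),
    ("school", ["store", "lodging", "school", "restaurant"]),
]
p_at_1 = precision_at_n(results, 1)
p_at_2 = precision_at_n(results, 2)
mean_position = mrr(results)
mean_candidates = npc(results)
```

In this toy input the true category sits at positions 1, 2, and 3, so the average position is 2 while the lists contain 3 categories on average.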

We also report the results of applying our algorithm without distance-based filtering to show the effect of this filtering. When the assignment algorithm is applied without distance-based filtering, the set of possible categories cSet in line 6 of Algorithm E.1 is set to all possible categories in the PoI database.

4.2 Exploring the Parameters

We first explore the effect of the parameters of the proposed method on the stop location assignment. We vary the value of an explored parameter while fixing all other parameters to their default values. The parameters and their default values are given in Table E.1.

Table E.1: Parameters and Default Values

| Notation | Name | Default Value |
|---|---|---|
| ad_th | Distance Threshold | 100 meters |
| df | Distance Factor | 2 |
| tp | Time Period | 0.5 hours |
| dg | Day Granularity | 1 |
| mpc | Minimum PoI Count | 3 PoIs |

Fig. E.5: Effect of ad_th. Panel (b): on mrr and npc. x-axis: Parameter ad_th (meters); y-axis: Mean Reciprocal Rank / Number of Possible Categories; series: mrr - w/o DBF, npc - with DBF, mrr - with DBF.

Figure E.5a shows that precision values decrease when ad_th increases, since the number of possible categories also increases, as shown in Figure E.5b. However, VPE is still able to achieve a mean reciprocal rank of 2 out of 6–8 categories.

Fig. E.6: Effect of df. Panel (b): on mrr and npc. y-axis: Mean Reciprocal Rank / Number of Possible Categories; series: npc - with DBF, mrr - with DBF.

Figure E.6a shows that the precision decreases when the distance factor increases. This occurs because the number of possible categories increases sharply, as shown in Figure E.6b. The decrease in precision is expected when more categories are possible. However, the increase in number of possible categories is sharper than the increase in mean reciprocal rank, which shows that VPE performs well even though the number of possible categories is high.

Fig. E.7: Effect of mpc. Panel (b): on mrr and npc. y-axis: Mean Reciprocal Rank / Number of Possible Categories; series: npc - with DBF, mrr - with DBF.

Figure E.7a shows that the precision decreases when the minimum PoI count, and thus the number of possible categories, increases, as shown in Figure E.7b. This is also expected since having more PoIs to choose from makes assignment more difficult.

Figures E.8a and E.8b show that the time period of a time slot affects the model's performance. The best performance is achieved when the time period is 30 minutes or 1 hour. From these figures, we infer that increasing the time period reduces the proposed method's ability to distinguish PoI categories.

Fig. E.8: Effect of tp. Panel (b): on mrr and npc. x-axis: Parameter tp (hours); y-axis: Mean Reciprocal Rank / Number of Possible Categories; series: mrr - w/o DBF, npc - with DBF, mrr - with DBF.
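To make the role of tp concrete, the sketch below shows one plausible way to map a timestamp to a time index. This section does not spell out how the time index is derived, so the scheme here is an assumption: the day is divided into consecutive slots of tp hours, and the index is the slot containing the stop's start time. The function name `time_index` is hypothetical.

```python
from datetime import datetime


def time_index(ts: datetime, tp_hours: float) -> int:
    """Index of the tp-hour slot of the day that contains ts (assumed scheme)."""
    minutes_since_midnight = ts.hour * 60 + ts.minute
    return int(minutes_since_midnight // (tp_hours * 60))


# Hypothetical stop starting at 09:40.
start = datetime(2014, 3, 1, 9, 40)
half_hour_slot = time_index(start, 0.5)  # 48 slots per day under this scheme
one_hour_slot = time_index(start, 1.0)   # 24 slots per day under this scheme
```

Under this reading, a larger tp merges neighboring slots, so stops at different times of day fall into the same index and carry less information for distinguishing categories.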

Our experiments show that the day granularity does not have a significant effect on the proposed method. This suggests that even though some specific categories have different day distributions, most of the categories have similar distributions for each day.

In document Spatial Keyword Querying: Ranking Evaluation and Efficient Query Processing (Page 170-174)