Building the Bayesian Network - Spatial Keyword Querying: Ranking Evaluation and Efficient Quer

Fig. E.1: Flowchart of Building the Bayesian Network. Inputs: STR (set of user trajectories) and SP (PoI database). Steps: Start, Stop Location Extraction, Determining Home/Work Locations, Labeled Dataset Construction, Learning the Bayesian Network, End.

The flowchart for building the Bayesian network is given in Figure E.1. It consists of four phases: stop location extraction, determining home/work locations, labeled dataset construction, and learning the Bayesian network. We determine home/work locations so that stops corresponding to a user's home or workplace are not assigned to any PoI. Excluding these stops is a necessary step for identifying true visits to PoIs.

Stop Location Extraction and Determining Home/Work Locations

To extract stop locations and to determine the home/work locations of users, we use methods employed in previous work [10]. For stop location extraction, we use a duration threshold parameter ∆_th and a distance threshold parameter d_th to infer whether a GPS record with ignition mode IGNITION-OFF corresponds to a stop location. If the time difference between an IGNITION-OFF record and the next IGNITION-ON record exceeds ∆_th and the spatial distance between them is below d_th, the location of the IGNITION-OFF record is classified as a stop location.
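The stop rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GpsRecord` type, the function names, and the use of haversine distance are not from the original method in [10], and any threshold values passed in are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class GpsRecord:  # hypothetical record layout
    time: datetime
    lat: float
    lon: float
    ignition: str  # "IGNITION-ON" or "IGNITION-OFF"


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (assumed distance measure)."""
    r = 6_371_000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def extract_stops(records, delta_th, d_th):
    """Return (off_record, next_on_record) pairs classified as stops.

    delta_th is a timedelta, d_th a distance in metres.
    """
    stops = []
    pending_off = None
    for rec in sorted(records, key=lambda r: r.time):
        if rec.ignition == "IGNITION-OFF":
            pending_off = rec
        elif rec.ignition == "IGNITION-ON" and pending_off is not None:
            long_enough = rec.time - pending_off.time > delta_th
            close_enough = haversine_m(
                pending_off.lat, pending_off.lon, rec.lat, rec.lon
            ) < d_th
            if long_enough and close_enough:
                stops.append((pending_off, rec))
            pending_off = None
    return stops
```

The stop's arrival and departure times can then be read from the two records of each pair.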

To determine the home/work locations of a user, we employ a density-based clustering approach. First, we cluster the stop locations of a user with DBSCAN [14]. The parameters of DBSCAN are determined with respect to a proportionality parameter (p_hw). For instance, a p_hw value of 5/7 means that the user should have at least 5 stop locations per week for a set of stop locations to form an output cluster. If the average stay duration of the stops in an output cluster exceeds a duration threshold (∆_hw), we conclude that the stop locations forming the cluster are home/work stops.
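A minimal sketch of this step follows. It makes several assumptions that the text does not state: stops are given in planar coordinates (metres), the DBSCAN minimum cluster size is derived as p_hw times the number of observed days, and the neighbourhood radius `eps_m` is supplied separately. The small pure-Python DBSCAN is included only to keep the example self-contained.

```python
from fractions import Fraction
from math import ceil, hypot


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over planar (x, y) points; label -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (x, y) in enumerate(points) if hypot(x - xi, y - yi) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(nbrs)
    return labels


def home_work_clusters(stops, p_hw, observed_days, eps_m, delta_hw_min):
    """stops: list of (x_m, y_m, stay_minutes).

    Returns (labels, set of cluster labels judged to be home/work).
    """
    # Assumption: p_hw = 5/7 over 14 observed days requires 10 stops per cluster.
    min_pts = ceil(Fraction(p_hw) * observed_days)
    labels = dbscan([(x, y) for x, y, _ in stops], eps_m, min_pts)
    durations = {}
    for label, (_, _, stay) in zip(labels, stops):
        if label != -1:
            durations.setdefault(label, []).append(stay)
    home_work = {c for c, d in durations.items() if sum(d) / len(d) > delta_hw_min}
    return labels, home_work
```

Stops whose label falls in the returned set would be excluded before any PoI assignment.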

Labeled Dataset Construction

To learn a Bayesian network, we need labeled stop locations to estimate the distributions of stop locations over weekdays, arrival times, and stay durations for each PoI category. A labeled stop is an assignment that maps a stop location to the visited PoI. Unfortunately, labeled stops are typically not available for vehicle trajectories. Thus, VPE includes a method for extracting labeled stops directly from the trajectories. To generate labeled stops, we rely on distance based assignment (DBA) [10] with extreme parameter settings. The DBA method takes a stop location, a distance threshold parameter (ad_th), and a limit parameter (lim) as input. It assigns the stop location to the closest PoI if the number of PoIs in the circular region centered at the stop location with radius ad_th does not exceed lim. We use lim = 1, which makes it highly probable that the PoI a stop is assigned to is the actual visited PoI, because DBA then only assigns a stop location to a PoI if that PoI is the only one within the region surrounding the stop. Thus, the accuracy of the resulting labeled stops should be sufficiently high to derive distributions for the temporal parameters in the Bayesian network. Note that the labels are derived from a different feature base: they are completely distance-based, so the temporal samples are labeled using an independent feature space.
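The labeling rule can be illustrated with a short Python sketch. It is not the DBA implementation from [10]: the function name, the PoI tuple layout, and the planar coordinates are assumptions, and the limit is read as inclusive so that lim = 1 admits exactly one PoI in the region, as the explanation above implies.

```python
from math import hypot


def dba_assign(stop_xy, pois, ad_th, lim=1):
    """Assign a stop to its closest PoI, or return None if the region is crowded.

    pois: list of (poi_id, category, x, y) in planar metres (hypothetical layout).
    """
    sx, sy = stop_xy
    nearby = []
    for poi_id, category, x, y in pois:
        dist = hypot(x - sx, y - sy)
        if dist <= ad_th:
            nearby.append((dist, poi_id, category))
    # No PoI in range, or more PoIs than the limit allows: leave the stop unlabeled.
    if not nearby or len(nearby) > lim:
        return None
    _, poi_id, category = min(nearby)
    return poi_id, category
```

With lim = 1, a stop next to a single isolated PoI is labeled, while a stop in a dense area with several PoIs within ad_th is discarded from the labeled dataset.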

Learning the Bayesian Network

The Bayesian network contains four nodes. One node is the category of the PoI that the stop is assigned to. The three remaining nodes correspond to three attributes that can be inferred from a stop location: time index, day index, and stay duration, which is the difference between the at and dt attributes of a stop location. The time and day index values are determined from the arrival time (at) at the stop location according to two parameters: the time period duration tp and the day granularity dg. The day is divided into equal time slots of duration tp, and the time index is a value identifying the time slot. For instance, if tp is 2 hours and at is 08:23, the time index value is 4, counting slots from 0. Possible values of dg are 1, 2, and 3, corresponding to a daily level, a weekdays-weekend level, and no distinction, respectively. So, if dg is set to 1, we have 7 possible values for the day index, and if dg is set to 2, we have 2 possible values.
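The index computation can be sketched as follows. The function names and the concrete encodings of the day index (Monday = 1 through Sunday = 7 for dg = 1, weekday = 1 and weekend = 2 for dg = 2, a single value 1 for dg = 3) are assumptions chosen for illustration; the sketch also assumes tp is a whole number of hours that divides 24.

```python
from datetime import datetime


def time_index(at, tp_hours):
    """Zero-based index of the tp-hour slot containing the arrival time."""
    minutes_since_midnight = at.hour * 60 + at.minute
    return minutes_since_midnight // (tp_hours * 60)


def day_index(at, dg):
    """Day index under day granularity dg (encodings are assumed, see lead-in)."""
    weekday = at.isoweekday()  # Monday = 1 ... Sunday = 7
    if dg == 1:
        return weekday
    if dg == 2:
        return 1 if weekday <= 5 else 2
    if dg == 3:
        return 1
    raise ValueError("dg must be 1, 2, or 3")


arrival = datetime(2020, 1, 6, 8, 23)  # a Monday
print(time_index(arrival, 2), day_index(arrival, 1))  # prints "4 1"
```

This reproduces the example in the text: an arrival at 08:23 with tp = 2 hours falls in slot 4.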

Fig. E.2: Structure of the Bayesian Network (nodes: Time Index, Day Index, Category, Stay Duration)

The structure of the Bayesian network is shown in Figure E.2. Each node refers to an attribute inferred from a stop location and contains a conditional probability table. A directed edge from node A to node B shows that the value of attribute B depends on the value of attribute A. The structure is based on a preliminary analysis of the labeled dataset constructed by the method explained in Section 3.2 with parameter ad_th = 100 m.

Fig. E.3: Day and Time Index Distributions for Universities, panels (a) and (b)

The day and time distributions for the university category when tp = 1 and dg = 1 are illustrated in Figure E.3. According to these distributions, people tend to visit universities on weekdays (1–5), arriving during the time period 08:00–10:00. These distributions also show that the day of the week and the time have an effect on the probability of visiting a category, which is why we have edges from the time index and day index nodes to the category node in the Bayesian network.

Fig. E.4: Stay Duration Distribution for Different Categories (categories shown: Restaurant, Hospital, University, Grocery Store; axes: Category and Stay Duration (min), 0 to 1200)

The stay duration distribution for some categories is shown in Figure E.4, where the green and red lines represent mean and median values, respectively. This figure shows that the category has a significant impact on the stay duration. For this reason, we have an edge from the category node to the stay duration node. We model the stay duration values of each category as a log-normal distribution [11]:

sd ∼ ln N(µ, σ²),

where µ and σ are the mean and standard deviation of the stay duration values for a category. For the stay duration node, we compute the observed mean and standard deviation for each category while learning the Bayesian network. The observed distribution is then used to compute the conditional probability of a stay duration value given a PoI category.
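A minimal sketch of how such a per-category density could be fitted and evaluated is given below. It uses the standard log-normal parameterization, in which µ and σ are estimated from the logarithms of the stay durations; this is an assumption of the sketch and may differ from how the parameters are computed in this work. The function names are illustrative.

```python
from math import exp, log, pi, sqrt
from statistics import fmean, pstdev


def fit_lognormal(stay_minutes):
    """Estimate (mu, sigma) from the logs of positive stay durations (assumed convention)."""
    logs = [log(v) for v in stay_minutes]
    return fmean(logs), pstdev(logs)


def lognormal_pdf(sd, mu, sigma):
    """Density of a log-normal distribution at stay duration sd (minutes)."""
    if sd <= 0:
        return 0.0
    z = (log(sd) - mu) / sigma
    return exp(-0.5 * z * z) / (sd * sigma * sqrt(2 * pi))
```

Evaluating `lognormal_pdf` with each category's fitted parameters gives the likelihood of an observed stay duration under that category.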

After deciding on the structure, we learn the Bayesian network from the labeled dataset. Learning here means constructing the conditional probability tables for the nodes of the Bayesian network.
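As an illustration, the table for the category node can be estimated by counting labeled stops. The sketch below uses plain relative frequencies without smoothing, which is an assumption; the function names and the tuple layout are also hypothetical. The second function shows how, given the network structure (time index and day index pointing to category, category pointing to stay duration), a posterior over categories is proportional to P(category | time, day) times the stay duration likelihood.

```python
from collections import Counter, defaultdict


def learn_category_cpt(labeled_stops):
    """labeled_stops: iterable of (time_idx, day_idx, category).

    Returns {(time_idx, day_idx): {category: P(category | time_idx, day_idx)}}.
    """
    counts = defaultdict(Counter)
    for t, d, category in labeled_stops:
        counts[(t, d)][category] += 1
    cpt = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        cpt[key] = {c: n / total for c, n in counter.items()}
    return cpt


def category_posterior(cpt, t, d, sd_likelihood):
    """Combine the table with p(sd | category) values for one observed stop.

    sd_likelihood: {category: likelihood of the observed stay duration}.
    """
    prior = cpt.get((t, d), {})
    scores = {c: p * sd_likelihood.get(c, 0.0) for c, p in prior.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total > 0 else {}
```

Time and day combinations that never occur in the labeled dataset have no table entry here; a practical implementation would need some form of smoothing or a fallback.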


In document Spatial Keyword Querying: Ranking Evaluation and Efficient Query Processing (Page 166-170)