
We use Danske Kvadratnet¹, the official geographical grid of Denmark, as the underlying grid structure. The grid consists of equal-sized square cells of 1 km² each, and it contains 111,000 cells.

Initializing Values of Grid Cells. Next, for each PoI p_i, we form the set T_{p_i} of trips to p_i. Using these sets, we initialize a grid structure for each PoI.

For each cell, the number of trips from the cell to the PoI and the number of distinct users making these trips are computed and recorded.
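The initialization step can be sketched as follows. This is a minimal illustration under stated assumptions: trip origins are assumed to be available as projected coordinates in metres, and the names `cell_of`, `initialize_grid`, and the tuple layout of a trip are hypothetical, not taken from the original data set.

```python
from collections import defaultdict

CELL_SIZE_M = 1000  # assumption: 1 km x 1 km cells on a metric projection


def cell_of(x_m, y_m):
    """Map a projected coordinate (metres) to a (column, row) cell index."""
    return (int(x_m // CELL_SIZE_M), int(y_m // CELL_SIZE_M))


def initialize_grid(trips):
    """Count trips and distinct users per origin cell for a single PoI.

    trips: iterable of (user_id, origin_x_m, origin_y_m) tuples
           (hypothetical record layout).
    Returns (trip_counts, user_counts), both dicts keyed by cell index.
    Cells without trips are simply absent, i.e. implicitly zero.
    """
    trip_counts = defaultdict(int)
    users = defaultdict(set)
    for user_id, x_m, y_m in trips:
        cell = cell_of(x_m, y_m)
        trip_counts[cell] += 1
        users[cell].add(user_id)
    user_counts = {cell: len(ids) for cell, ids in users.items()}
    return dict(trip_counts), user_counts


# Made-up trips to one PoI: two by user "u1" from the same cell, one by "u2".
example_trips = [("u1", 100, 200), ("u1", 900, 999), ("u2", 1500, 200)]
trip_counts, user_counts = initialize_grid(example_trips)
```

Storing only the non-empty cells keeps the per-PoI grids small, which matters when most of the 111,000 cells have no trips.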

Table D.3: Number of Cells (out of 111,000) with Non-zero Values for the Top-5 PoIs

PoI ID    Before Smoothing    After Smoothing
1         148                 2,021
2         142                 1,521
3         115                 2,123
4         98                  2,148
5         98                  1,652

Smoothing the Values. After the initialization, many PoIs have sparse grids, where many cells have no trips to the PoIs. Table D.3 shows the number of cells with non-zero values for the top-5 PoIs. Only three of these PoIs have more than 100 cells with non-zero values after initialization. Sparse grids are a problem since they reduce the number of spatial keyword queries that we can construct rankings for. If neighboring cells of an empty cell have non-zero values, it is reasonable to assume that trips starting in these cells are also relevant for the empty cell. So, the neighboring cell values can be used for smoothing to address the sparsity problem. The smoothing also helps reduce noise in cell values.

As the smoothed grids of multiple PoIs will be used in the ranking building phase, a smoothing method should have two properties. First, for a PoI, a smoothing algorithm should not change the sum of all the values in the grid for that PoI. Inflating or deflating the sum of values would unfairly promote or demote the PoI in relation to other PoIs in a constructed ranking.

Second, the ordering of the values for all PoIs in a specific cell before and after smoothing should be similar in order to reduce distortion of the original spatial popularity data for the PoIs.
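The two properties can be expressed as simple checks. The sketch below is illustrative only: the text does not prescribe how ordering similarity is measured, so the pairwise agreement score used here is an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations


def preserves_sum(before, after, rel_tol=1e-9):
    """Property 1: the grid total for one PoI is unchanged by smoothing.

    before, after: dicts mapping cell -> value for a single PoI.
    """
    return math.isclose(sum(before.values()), sum(after.values()), rel_tol=rel_tol)


def ordering_agreement(before, after):
    """Property 2 (assumed measure): fraction of PoI pairs whose relative
    order within one cell is the same before and after smoothing.

    before, after: dicts mapping PoI id -> value in one specific cell.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    pairs = list(combinations(before, 2))
    if not pairs:
        return 1.0
    unchanged = sum(
        sign(before[a] - before[b]) == sign(after[a] - after[b]) for a, b in pairs
    )
    return unchanged / len(pairs)
```

A score of 1.0 means that smoothing kept every pairwise ordering of PoIs in that cell; lower scores indicate more distortion.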

The literature contains some smoothing and interpolation methods for spatial grid-based data. The inverse distance weighting (IDW) method [26] was proposed to interpolate missing values using distance-based weighting.

¹ http://www.dst.dk/da/TilSalg/produkter/geodata/kvadratnet

3. Proposed Method

This method builds on the intuition that the effect of a cell's value on the value of an originally empty cell should depend on how close the cell is to the empty cell. However, IDW only interpolates missing values; it does not smooth the existing ones, and it changes the sum of the values in the grid. Therefore, we do not utilize IDW in our work. Kernel-based methods [27] have also been proposed for smoothing. For these, it is possible to preserve the sum of the values since they produce a probability distribution as output. The sum of the values can then be distributed according to this distribution. However, they might change the ordering of grid cells, since kernel-based methods yield continuous functions that might not reflect the original properties of the data.
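To illustrate why IDW changes the grid total, the following sketch fills empty cells with a distance-weighted average of the non-empty ones. It is a generic textbook form of IDW with an assumed power parameter of 2, not necessarily the exact variant of [26]; the function name and the use of `None` for empty cells are hypothetical.

```python
import math


def idw_fill(values, power=2):
    """Fill empty cells (None) with an inverse-distance-weighted average.

    values: dict mapping (col, row) -> number, or None for an empty cell.
    Known cells are left unchanged; only empty cells receive a value.
    """
    known = {cell: v for cell, v in values.items() if v is not None}
    if not known:
        return dict(values)  # nothing to interpolate from
    filled = dict(known)
    for cell, v in values.items():
        if v is not None:
            continue
        weights = {k: 1.0 / math.dist(cell, k) ** power for k in known}
        total_weight = sum(weights.values())
        filled[cell] = sum(weights[k] * known[k] for k in known) / total_weight
    return filled


# Made-up row of three cells: one known value, two empty cells.
row = {(0, 0): 10.0, (1, 0): None, (2, 0): None}
filled_row = idw_fill(row)
```

In this made-up row the total grows from 10 to 30, because every filled cell adds value while the known cell keeps its own. This is the inflation of the sum that rules out IDW here.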

We use a smoothing algorithm based on personalized PageRank [28] to interpolate values for cells with no trips. The PageRank algorithm was proposed for web graphs, where web pages are the vertices and hyperlinks are the edges. The algorithm assigns a PageRank value to each web page to indicate its relative importance within the set. A web page is considered important if other important web pages link to it. The algorithm can be described as a random walk over a directed graph G = ⟨V, E⟩. A random walker starts from a randomly chosen vertex. Then, at each step, with probability 1 − α it follows an outgoing edge, and with probability α it teleports to another randomly chosen vertex y ∈ V, where α has the same value for each web page and 0 < α < 1. The PageRank of a vertex is the probability that a random walker will end up at the vertex.
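The random-walk description corresponds to a simple power iteration. The sketch below is a minimal illustration, not a tuned implementation: the value α = 0.15, the fixed iteration count, and the uniform treatment of vertices without outgoing edges are assumptions.

```python
def pagerank(out_edges, alpha=0.15, iterations=100):
    """Basic PageRank by power iteration.

    out_edges: dict mapping each vertex to a list of its successors.
    alpha: teleportation probability (the walker follows an edge with
           probability 1 - alpha), an assumed default.
    """
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Teleportation: every vertex receives an equal share of alpha.
        nxt = {v: alpha / n for v in vertices}
        for v in vertices:
            # Assumption: a vertex without out-edges jumps uniformly.
            targets = out_edges[v] or vertices
            share = (1.0 - alpha) * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank


# Made-up graph: both "a" and "b" link to "c", and "c" links back to "a".
toy_rank = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, "c" ends up with the highest rank because two pages link to it, and "b" has the lowest because it is reached only by teleportation.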

Personalized PageRank [9] was proposed in order to incorporate personalized preferences. This is achieved by changing the uniform probability distribution of teleportation to a random web page to a personalization parameter that is basically a distribution based on user preferences. We use this parameter to utilize the initial values of the grid cells while smoothing the values.

The PageRank algorithm is a good candidate for smoothing, since, if a cell is close to another cell, they should have similar values just like the page rank values for the web pages linking to each other. The main idea is that if a PoI is of interest to drivers leaving from a cell, it might also be of interest to drivers leaving from nearby cells.

We first convert the underlying grid into a directed graph. For each cell, we introduce a vertex. Then, we add edges from each cell to its neighboring cells with weight w = 1/d², where d denotes the distance between the centers of the cells. The edge weights define how the PageRank value of a vertex is distributed to the adjacent vertices. In the original version of PageRank, it is distributed equally. In our case, we use distance-based weights so that the share passed to a neighbor is inversely proportional to the squared distance between the corresponding grid cells. Then we apply PageRank to this graph for both the number-of-trips values and the number-of-users values. We use the initial cell values obtained after the initialization to determine the personalization parameters. The probability of teleportation to a vertex is set proportional to the actual value of the corresponding grid cell.
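The graph construction and the personalized PageRank step can be sketched together as follows. This is an illustration under several assumptions: neighbors are taken to be the up to eight surrounding cells, outgoing weights are normalized per vertex, α = 0.15, and the initial values and function names are made up for the example.

```python
def build_grid_graph(cells):
    """Build weighted edges between neighboring cells.

    cells: dict mapping cell name -> (col, row).
    Returns dict name -> {neighbor: weight} with weight = 1 / d^2, where d is
    the distance between cell centers (1 for side neighbors, sqrt(2) for
    diagonal neighbors). The 8-neighborhood is an assumption.
    """
    graph = {}
    for a, (ax, ay) in cells.items():
        graph[a] = {}
        for b, (bx, by) in cells.items():
            dx, dy = abs(ax - bx), abs(ay - by)
            if a != b and max(dx, dy) == 1:
                graph[a][b] = 1.0 / (dx * dx + dy * dy)
    return graph


def personalized_pagerank(graph, initial, alpha=0.15, iterations=200):
    """Personalized PageRank with teleportation proportional to `initial`.

    Assumes every vertex has at least one neighbor and sum(initial) > 0.
    """
    total = sum(initial.values())
    teleport = {v: initial.get(v, 0.0) / total for v in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {v: alpha * teleport[v] for v in graph}
        for v, edges in graph.items():
            weight_sum = sum(edges.values())
            for u, w in edges.items():
                # Each neighbor receives a share proportional to its edge weight.
                nxt[u] += (1.0 - alpha) * rank[v] * w / weight_sum
        rank = nxt
    return rank


# Hypothetical 2 x 3 grid; the initial trip counts are made up.
cells = {"c1": (0, 0), "c2": (1, 0), "c3": (2, 0),
         "c4": (0, 1), "c5": (1, 1), "c6": (2, 1)}
graph = build_grid_graph(cells)
rank = personalized_pagerank(graph, {"c2": 10})
```

With all initial trips in `c2`, the resulting probabilities are largest in `c2`, followed by its direct neighbor `c5` in the same column, so cells that started empty receive non-zero values that decay with distance from the observed trips.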

The procedure yields a probability distribution that indicates the likelihood that a random walker will end up at a particular vertex. We distribute the total number of trips and the total number of distinct users to the cells in proportion to the output probability distribution. For instance, assume that we are smoothing the numbers of trips and that the total number of trips to the PoI is 100. A cell with probability 0.23 then gets the value 23. Note that this smoothing procedure is done for each PoI. Table D.3 shows that smoothing provides a significant increase in the number of cells with non-zero values.
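The redistribution step is a single multiplication per cell. In the sketch below, the function name and the probabilities other than 0.23 are made up for illustration; smoothed values are kept fractional, which is an assumption since rounding is not discussed in the text.

```python
def distribute(total, probabilities):
    """Spread `total` over the cells in proportion to their probabilities.

    probabilities: dict mapping cell -> probability, summing to 1.
    The grid total is preserved because the probabilities sum to 1.
    """
    return {cell: total * p for cell, p in probabilities.items()}


# 100 trips in total; the cell with probability 0.23 receives about 23.
smoothed = distribute(100, {"a": 0.23, "b": 0.50, "c": 0.27})
```

Because the output of PageRank is a probability distribution, this step satisfies the first required property: the sum of the smoothed values equals the original total for the PoI.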

Example. Let G be a grid with cells c1, c2, c3, c4, c5, c6. The grid structure is shown in Figure D.3a. The first value represents the number of trips from each cell to a PoI before smoothing.

Fig. D.3: Grid Structure and Corresponding Graph

The graph representing the grid is given in Figure D.3b. Each vertex has an edge to each vertex that represents a neighboring cell. Each edge is assigned a weight as explained above. For instance, the distance between c1 and c2 is 1 unit, and the distance between c1 and c5 is √2 units, so the weight of edge (c1, c5) is 1/d² = 1/(√2)² = 0.5.
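The weights leaving c1 can be worked out explicitly. The derivation below assumes a layout of two rows and three columns in which c4 lies directly below c1, so that the neighbors of c1 are c2, c4, and c5; it also assumes that the outgoing weights of a vertex are normalized to obtain the shares P passed to each neighbor.

```latex
\begin{align*}
w(c_1, c_2) &= \frac{1}{1^2} = 1 \\
w(c_1, c_4) &= \frac{1}{1^2} = 1 \\
w(c_1, c_5) &= \frac{1}{(\sqrt{2})^2} = 0.5 \\
P(c_1 \to c_2) &= \frac{1}{1 + 1 + 0.5} = 0.4 \\
P(c_1 \to c_4) &= \frac{1}{1 + 1 + 0.5} = 0.4 \\
P(c_1 \to c_5) &= \frac{0.5}{1 + 1 + 0.5} = 0.2
\end{align*}
```

Under these assumptions, the diagonal neighbor c5 receives half the share of each side neighbor.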

Then, we apply personalized PageRank using the initial values as the personalization input. The second value of each cell in Figure D.3a represents the resulting probability of the cell.

Finally, we distribute the sum of the values according to the PageRank values. The third value of each cell in Figure D.3a represents the smoothed value. Here, c5 has the second largest value because it is closer to the cell with the largest value (c2) than c4 and c6 are, and it has more edges than c3 and c1 since it is in the middle column. The effect of the number of edges is not an issue when smoothing is applied on a large grid, since all grid cells except those on the boundary of the grid structure have the same number of edges.