Calculating risk values using historical records

Chapter 6 Forecasting 311 Complaints in New York City

6.3.2 Calculating risk values using historical records

For the first part of our analysis, we use historical 311 records to decide whether a non-emergency complaint will be reported from a given grid cell. For each week and each grid cell, we calculate a risk value using three different models: the spatiotemporal model, the static model and the seasonal model. Each model seeks to generate a risk surface for each incident type, where risk values are calculated for each cell. To assess the models’ performance, we use the risk surfaces to generate forecasts of cells in which we would expect 311 incidents to occur. We use data from 2012 to train the seasonal model, which requires 52 weeks of history in order to generate predictions, and we use data from the first eight weeks of 2013 to train the spatiotemporal model, following the initial two-month period used for training in Bowers et al. (2004). For this reason, we evaluate the performance of the models during the period from week 9 in 2013 to the final week of 2014.

Figure 6.1: Total number of Flickr pictures taken and uploaded in the same week between 2012 and 2014. Famous attractions including Central Park and the Statue of Liberty, as well as John F Kennedy Airport, stand out as some of the photo sharing hotspots in New York City. Colour breaks were calculated using the k-means clustering algorithm on logarithmically transformed numbers. (Colour scale label: Total number of pictures; colour breaks: 12.0, 6.7, 5.0, 3.7, 2.7, 1.9, 1.2, 0.3, 0. Labelled locations: Central Park, John F Kennedy Airport, Statue of Liberty.)

6.3.2.1 Spatiotemporal model

The first model, the “spatiotemporal model”, takes inspiration from the approach proposed in Bowers et al. (2004) for anticipating the location of future crimes. In this model, it is assumed that problems are most likely to occur in and around cells which have recently seen higher volumes of such problems. In other words, the location of previous events is of relevance, as is the recency with which they occurred. For each cell, we define a neighbourhood area A, with a radius of 5 km. We consider previous events in all weeks before week t from the first week of 2013 onwards, which we denote week 1, and begin to assess the quality of our forecasts in week 9. To calculate the risk value for a given incident type for each cell i in week t, we use the formula

\[
\text{Risk Value}_i(t) = \sum_{a \in A} \sum_{\tau = 1}^{t-1} \frac{1}{d(a, i) + 1} \cdot \frac{1}{t - \tau} \cdot N_a(\tau), \tag{6.1}
\]

where N_a(t) is the number of 311 reports relating to the given incident type in cell a during week t, and where d(a, i) is the distance between the centre of cell a and the centre of cell i, measured in metres. We illustrate the implementation of this model in Figure 6.2.
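As a minimal sketch of how equation (6.1) could be evaluated, the following Python function computes the risk value of every cell for a given week. The function name, the array layout of the counts, and the use of planar coordinates in metres are assumptions made for illustration and do not come from the original analysis. The dense pairwise distance matrix is only practical for small grids.

```python
import numpy as np

def spatiotemporal_risk(counts, centres, t, radius_m=5000.0):
    """Risk value of equation (6.1) for every cell i in week t.

    counts  : array (n_weeks, n_cells); counts[tau - 1, a] holds N_a(tau)
    centres : array (n_cells, 2) of cell-centre coordinates in metres
              (assumed planar projection)
    t       : 1-based week index, week 1 being the first week of 2013
    """
    # Pairwise distances d(a, i) between cell centres, in metres.
    diff = centres[:, None, :] - centres[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Spatial weight 1 / (d(a, i) + 1); zero outside the neighbourhood A.
    w_space = np.where(dist <= radius_m, 1.0 / (dist + 1.0), 0.0)
    # Temporal weight 1 / (t - tau) for tau = 1, ..., t - 1.
    taus = np.arange(1, t)
    w_time = 1.0 / (t - taus)
    # Recency-weighted counts per cell a, summed over past weeks.
    weighted_counts = w_time @ counts[: t - 1]
    # Sum over a of weighted_counts[a] * w_space[a, i].
    return weighted_counts @ w_space

# Hypothetical toy grid: three cells on a line, three weeks of counts.
centres = np.array([[0.0, 0.0], [500.0, 0.0], [6000.0, 0.0]])
counts = np.array([[1, 0, 0], [0, 2, 0], [5, 5, 5]])
print(spatiotemporal_risk(counts, centres, t=3))
```

Only weeks 1 to t − 1 enter the sum, so the counts for week t itself (the last row in the toy example) do not affect the forecast for week t.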

We note that the influence of a previous incident occurring in cell a on the risk value for cell i is inversely proportional to the distance of the centre of cell a from the centre of cell i, plus one metre. For this reason, an incident which occurred in a cell on the boundary of the neighbourhood area, 5 km from cell i, would have roughly 10% of the influence of an incident which occurred in a neighbouring cell, 500 m from cell i. In comparison to an incident which occurred in cell i itself, an incident in a cell 5 km from cell i would have 0.02% of the influence on the final risk value for cell i. As this number is already very low, we set the neighbourhood area radius at 5 km and do not consider incidents in cells more than 5 km away in order to optimise the speed of risk value calculations.
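The relative influences can be checked directly from the spatial weight 1/(d(a, i) + 1) in equation (6.1), with distances in metres. The worked ratios below are an illustration derived from that formula only:

```latex
\[
\frac{1/(5000 + 1)}{1/(500 + 1)} = \frac{501}{5001} \approx 0.100,
\qquad
\frac{1/(5000 + 1)}{1/(0 + 1)} = \frac{1}{5001} \approx 0.0002 = 0.02\%.
\]
```

The first ratio compares a cell 5 km away with a neighbouring cell 500 m away, and the second compares a cell 5 km away with cell i itself.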

To help determine whether data on how recently similar incidents have occurred nearby is of value in anticipating the future location of 311 incidents, we compare the spatiotemporal model to two further baseline models, the static model and the seasonal model.

6.3.2.2 Static model

In the “static model”, it is assumed that the location at which similar incidents have occurred is of relevance, but the time at which they occurred is of no relevance. To implement this model, data on incidents which took place between the first week of 2013 and the final week of 2014 are used to calculate a static risk value for cell i. The calculated risk value for cell i therefore remains constant throughout the time period. While it is still affected by the proximity of other incidents, it is not affected by the time at which incidents occurred. For each cell i, we calculate the risk value,

\[
\text{Risk Value}_i = \sum_{a \in A} \sum_{\tau = 1}^{T} \frac{1}{d(a, i) + 1} \cdot N_a(\tau), \tag{6.2}
\]

where τ = 1 is the first week of 2013, and τ = T is the final week of 2014. Again, we illustrate the implementation of this model in Figure 6.2. Forecasts are assessed from week 9 of 2013 until the final week of 2014, as for the spatiotemporal model. By comparing the performance of the spatiotemporal model to the performance of the static model, we can investigate whether information on the recency of similar events nearby helps improve the quality of predictions. If this is the case, we would expect to see better predictions generated by the spatiotemporal model than the static model.
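A minimal sketch of equation (6.2) follows, under the same illustrative assumptions as any other sketch of these models: a hypothetical function name, counts stored as a weeks-by-cells array, and planar cell-centre coordinates in metres. Because the temporal weight is absent, the counts can simply be summed over all T weeks before the spatial weighting is applied.

```python
import numpy as np

def static_risk(counts, centres, radius_m=5000.0):
    """Risk value of equation (6.2) for every cell i.

    counts  : array (T, n_cells); counts[tau - 1, a] holds N_a(tau)
    centres : array (n_cells, 2) of cell-centre coordinates in metres
              (assumed planar projection)
    """
    # Pairwise distances d(a, i) between cell centres, in metres.
    diff = centres[:, None, :] - centres[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Spatial weight 1 / (d(a, i) + 1); zero outside the neighbourhood A.
    w_space = np.where(dist <= radius_m, 1.0 / (dist + 1.0), 0.0)
    # Total count per cell a over all weeks tau = 1, ..., T.
    total_counts = counts.sum(axis=0)
    # Sum over a of total_counts[a] * w_space[a, i].
    return total_counts @ w_space

# Hypothetical toy grid: three cells on a line, two weeks of counts.
centres = np.array([[0.0, 0.0], [500.0, 0.0], [6000.0, 0.0]])
counts = np.array([[1, 0, 0], [0, 2, 0]])
print(static_risk(counts, centres))
```

The result does not depend on t, which is what makes the risk surface constant across the evaluation period.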

6.3.2.3 Seasonal model

If an incident type were to occur in a seasonal fashion, for example with more reports in winter, we might also expect to see better predictions generated by the spatiotemporal model than the static model, as a higher number of recent events may reflect that the season for a particular incident has begun. To distinguish between the possibilities of incidents clustering in time because the problem is seasonal, and incidents clustering in time in a non-seasonal fashion which is better captured by the concept of recency, we create a second baseline model, the “seasonal model”. In the seasonal model, it is assumed that the location at which similar incidents have occurred is of relevance, and that incidents occur with a seasonal pattern. To implement this model, data on incidents which took place during week t − 52 are used to calculate risk values for cells in week t. Again, the risk value is affected by the proximity of other incidents, but only those which took place at the same time of year in the previous year. To enable forecasts to be assessed from week 9 of 2013 until the final week of 2014 as for the previous two models, we draw on data from week 9 of 2012 onwards. For each cell i, we calculate the risk value,

\[
\text{Risk Value}_i(t) = \sum_{a \in A} \frac{1}{d(a, i) + 1} \cdot N_a(t - 52). \tag{6.3}
\]
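Equation (6.3) can be sketched in the same way. The function name, the `first_week` offset used to address weeks before week 1 (that is, weeks in 2012), and the planar coordinates in metres are assumptions made for this illustration rather than details taken from the original analysis.

```python
import numpy as np

def seasonal_risk(counts, centres, t, first_week, radius_m=5000.0, lag=52):
    """Risk value of equation (6.3) for every cell i in week t.

    counts     : array (n_weeks, n_cells); row 0 holds the counts for
                 week `first_week`, row 1 for `first_week + 1`, and so on
    centres    : array (n_cells, 2) of cell-centre coordinates in metres
                 (assumed planar projection)
    t          : week index, week 1 being the first week of 2013
    first_week : week index of the first row of `counts` (may be <= 0)
    """
    # Row of `counts` that stores week t - lag.
    row = (t - lag) - first_week
    if row < 0 or row >= counts.shape[0]:
        raise IndexError("no counts available for week t - lag")
    # Pairwise distances d(a, i) between cell centres, in metres.
    diff = centres[:, None, :] - centres[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Spatial weight 1 / (d(a, i) + 1); zero outside the neighbourhood A.
    w_space = np.where(dist <= radius_m, 1.0 / (dist + 1.0), 0.0)
    # Sum over a of N_a(t - lag) * w_space[a, i].
    return counts[row] @ w_space

# Hypothetical toy data: rows start 43 weeks before week 0, so that
# row 0 is the week lying 52 weeks before week 9.
centres = np.array([[0.0, 0.0], [500.0, 0.0], [6000.0, 0.0]])
counts = np.zeros((60, 3))
counts[0] = [1, 0, 0]
print(seasonal_risk(counts, centres, t=9, first_week=-43))
```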

Once again, we illustrate the implementation of this model in Figure 6.2.

Once a risk surface consisting of risk values for all cells has been calculated, a risk threshold θ must be set so that predictions can be derived. A cell is considered to be at risk if its risk value is greater than θ (Figure 6.2). We evaluate the performance of all models for a wide range of values of θ, as described in more detail in the following section.
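The thresholding step can be sketched as follows. The function names and the example threshold values are hypothetical. The only rule taken from the text is that a cell is at risk when its risk value is strictly greater than θ.

```python
import numpy as np

def at_risk_cells(risk_values, theta):
    """Boolean mask of cells whose risk value is strictly greater than theta."""
    return np.asarray(risk_values) > theta

def count_at_risk(risk_values, thetas):
    """Number of cells flagged as at risk for each candidate threshold."""
    return [int(at_risk_cells(risk_values, theta).sum()) for theta in thetas]

# Hypothetical risk surface for three cells and a small sweep of thresholds.
risk_values = np.array([0.0, 0.5, 2.0])
print(at_risk_cells(risk_values, theta=0.5))
print(count_at_risk(risk_values, thetas=[0.0, 0.5, 1.0, 3.0]))
```

Raising θ can only shrink the set of cells flagged as at risk, which is why sweeping θ traces out a range of forecasts from many flagged cells to none.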

In document Quantifying human behaviour with online images (Page 98-101)