Pooled model - Quantifying and modelling online decentralised systems: a complex systems approa

7.3 Results

7.3.1 Pooled model

For valid time-series inference, we require the distribution of our data to be stationary across time [166]. To formally test stationarity, we conduct Augmented Dickey-Fuller (ADF) tests on the sales data for each drug. We find monthly sales to be non-stationary, with an average ADF p-value of 0.42 across drugs. In contrast, the average ADF p-value after converting the sales data to percentage changes is approximately 0.00. We therefore conduct our analysis on the monthly percentage change in sales over time.
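As a minimal sketch of the transformation (not the authors' code), the snippet below converts a series of levels into period-on-period percentage changes. The function name `pct_change` and the toy sales figures are hypothetical. A series that grows at a steady rate trends upward in levels but is flat in percentage changes, which is the intuition behind the ADF results above. The test itself could be run with, for example, `statsmodels.tsa.stattools.adfuller`; that choice of tooling is an assumption, not something the text states.

```python
def pct_change(levels):
    """Return period-on-period percentage changes as fractions (0.10 = 10%)."""
    changes = []
    for prev, curr in zip(levels, levels[1:]):
        if prev == 0:
            raise ValueError("percentage change is undefined after a zero level")
        changes.append((curr - prev) / prev)
    return changes


# Hypothetical monthly sales growing 10% per month: trending in levels.
sales = [100.0 * 1.1 ** k for k in range(6)]
growth = pct_change(sales)
print([round(g, 3) for g in growth])  # every entry is 0.1
```

Note that one observation is lost at the start of the series, since the first level has no predecessor.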

Our data is longitudinal with three dimensions: drug, country and time. Let $y_{i,j,t}$ denote the percentage change in sales of drug $i$ in country $j$ in time period $t$. Our baseline model is:

$$y_{i,j,t} = \beta_0 y_{i,j,t-1} + \alpha_i + \delta_j + \gamma_t \qquad (7.1)$$

where $y_{i,j,t-1}$ is an autoregressive term, in case of serial correlation. We also engineer binary variables ("dummies") from the longitudinal data structure, which add complexity¹ to the baseline model:

• $\alpha_i$ are dummies for each drug.

• $\delta_j$ are dummies for each country.

• $\gamma_t$ are dummies for each month, in case of seasonality.
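To make the dummy structure concrete, here is a minimal sketch of how one regressor row could be assembled from the lagged outcome and the three sets of dummies. The function `design_row`, the factor levels, and the choice to drop the first level of each factor as a reference category are all assumptions for illustration; the text does not say how the dummies are coded.

```python
def design_row(lag_y, drug, country, month, drugs, countries, months):
    """One regressor row: [lagged y] + drug, country and month dummies.

    The first level of each factor is dropped as the reference category,
    a common convention when the regression includes an intercept
    (an assumption; the text does not specify the coding).
    """
    row = [lag_y]
    for value, levels in ((drug, drugs), (country, countries), (month, months)):
        row.extend(1.0 if value == level else 0.0 for level in levels[1:])
    return row


# Hypothetical factor levels, for illustration only.
drugs = ["drug_a", "drug_b", "drug_c"]
countries = ["GB", "US"]
months = list(range(1, 13))

row = design_row(0.05, "drug_b", "US", 3, drugs, countries, months)
print(len(row))  # 1 lag + 2 drug + 1 country + 11 month dummies = 15
print(row[:4])   # [0.05, 1.0, 0.0, 1.0]
```

An observation for the reference drug, country and month has all dummies equal to zero, so its fitted value depends only on the intercept and the lagged term.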

In this specification (the "pooled" model), we model all drugs jointly. The advantage is that we have more data to fit each of the pooled parameters, which makes overfitting less likely. For example, if we have $N$ drugs and $J$ countries then we have $N \times J$ observations to fit each time dummy $\gamma_t$. Having fewer drug-specific parameters may also allow prediction of drugs that are not in our sample. However, the disadvantage is that we restrict complexity relative to modelling each drug separately, an alternative we analyse in Section 7.3.2.

¹ Complexity means how much the model's predictions can vary, rather than computa-

To estimate the performance improvement from Wikipedia data, we add it to the baseline model. Letting $X_{i,j,t}$ be the percentage change in Wikipedia views for drug $i$ in country $j$ in time period $t$, the "Wikipedia model" is:

$$y_{i,j,t} = \beta_0 y_{i,j,t-1} + \beta_1 X_{i,j,t} + \alpha_i + \delta_j + \gamma_t \qquad (7.2)$$

Table 7.1 presents in-sample results comparing the pooled models. All models are unpenalised regressions. Scores are adjusted $R^2$, which includes a penalty term for models with more features. The baseline score is the model accuracy without Wikipedia views; the Wikipedia model adds data on Wikipedia views. The models in the first column use only the autoregressive term and, for the Wikipedia model, Wikipedia views as predictors. The models in the second column add complexity with country, drug and month dummies.
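For reference, the standard adjusted $R^2$ penalty can be sketched as below. The function name `adjusted_r2` is hypothetical, and the raw $R^2$ of 0.70 is an illustrative value, not a result from the text; the sample size and feature counts are those reported in Table 7.1.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_features - 1 <= 0:
        raise ValueError("need more observations than features plus intercept")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# n = 1918 and p in {2, 35} come from Table 7.1; R^2 = 0.70 is hypothetical.
for p in (2, 35):
    print(p, round(adjusted_r2(0.70, 1918, p), 4))
```

With close to two thousand observations the penalty for 35 features is small, so differences in adjusted $R^2$ between the columns mostly reflect differences in fit rather than the penalty.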

The Wikipedia model outperforms the baseline by between 49 and 64 percentage points (pp), depending on the model choice. Wikipedia data is therefore a strong in-sample indicator of drug demand. This effect is also much larger than the boost from adding the dummies, which Table 7.1 puts at roughly 7-22pp. However, in-sample performance may not reflect true predictive accuracy because of possible overfitting.

TABLE 7.1: Pooled model - in-sample accuracy

                          Simple Model    All Dummies
Baseline $R^2$            0.003           0.22
Wikipedia Model $R^2$     0.64            0.71
Sample Size               1918            1918
Number of features        2               35

We cannot evaluate out-of-sample performance with a random train-test split, as time series data is not independent and identically distributed (i.i.d.). A random split would put some data in the training set that occurs chronologically after some of the testing set. We would therefore be using data from the future to fit a model predicting the past, which is not possible when performing an actual prediction.
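The look-ahead problem can be illustrated with a small sketch. The helper names and the twelve month indices are hypothetical; a split that holds out every fourth month stands in for a random split so that the example is deterministic.

```python
def chronological_split(periods, n_test):
    """Hold out the last n_test periods; everything earlier is training data."""
    if not 0 < n_test < len(periods):
        raise ValueError("n_test must be between 1 and len(periods) - 1")
    ordered = sorted(periods)
    return ordered[:-n_test], ordered[-n_test:]


def count_future_training_periods(train, test):
    """Number of training periods that fall after the earliest test period."""
    earliest_test = min(test)
    return sum(1 for t in train if t > earliest_test)


periods = list(range(12))  # hypothetical month indices 0..11

train, test = chronological_split(periods, n_test=3)
print(count_future_training_periods(train, test))  # 0: no look-ahead

# Non-chronological split: every fourth month is held out.
test_mixed = periods[::4]  # [0, 4, 8]
train_mixed = [t for t in periods if t not in test_mixed]
print(count_future_training_periods(train_mixed, test_mixed))  # 9
```

In the mixed split, every training month lies after the earliest test month, so the fitted model would have seen the future of the observations it is asked to predict.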

We instead use a one-step-ahead nowcasting procedure to measure out-of-sample performance [238]. We first set a training window, $w$, that determines the size of the training set. Then, for each period $t \in [w, T]$ in the data, the training set is the data from periods $[t-w-1, t-1]$. To prevent overfitting, we penalise the model's coefficients using LASSO, with 5-fold cross-validation in the training set. The penalised model then predicts the test set from period $t$, which is completely held out from training. This procedure only predicts the present with data from the past, so it is truly out-of-sample.

We record the errors in period $t$ and use the mean absolute error (MAE) to measure that period's accuracy. Each time we increase $t$, we slide the training window to update the data and re-fit the model. The model therefore "adapts" over time to new data, which helps maintain accuracy if the underlying relationship changes over time. We set a training window of 12 months, which allows the model to see each month in the training set and fit the seasonality dummies. The first period in our test set is therefore October 2016.
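The sliding-window procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the LASSO fit with 5-fold cross-validation is swapped for a one-parameter least-squares slope to keep the example dependency-free (in practice something like scikit-learn's `LassoCV(cv=5)` could take its place), the training rows are simply the `window` periods immediately before the test period, and the function names and toy data are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_slope(xs, ys):
    """Least-squares slope through the origin: a stand-in for the LASSO fit."""
    denom = sum(x * x for x in xs)
    return sum(x * y for x, y in zip(xs, ys)) / denom if denom else 0.0


def rolling_nowcast(data, window):
    """data maps period -> list of (x, y) pairs, one pair per drug-country.

    For each test period, fit on the `window` periods before it,
    predict the held-out period, and record that period's MAE.
    """
    periods = sorted(data)
    mae_by_period = {}
    for k in range(window, len(periods)):
        train = [pair for p in periods[k - window:k] for pair in data[p]]
        beta = fit_slope([x for x, _ in train], [y for _, y in train])
        test = data[periods[k]]
        preds = [beta * x for x, _ in test]
        mae_by_period[periods[k]] = mean_absolute_error([y for _, y in test], preds)
    return mae_by_period


# Hypothetical data: y = 2x exactly in periods 0-3, then y = 3x from period 4.
data = {t: [(1.0, 2.0), (2.0, 4.0)] for t in range(4)}
data.update({t: [(1.0, 3.0), (2.0, 6.0)] for t in range(4, 8)})

for period, mae in rolling_nowcast(data, window=2).items():
    print(period, round(mae, 3))
```

In this toy run the error jumps when the relationship changes in period 4, halves in period 5 once the window contains one period of new data, and returns to zero in period 6 when the window has fully moved past the break. This is the adaptive behaviour described above.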

Figure 7.1 compares out-of-sample results from the pooled models. We include month, drug and country dummies in both models.

[Figure 7.1: out-of-sample MAE by test period. x-axis: month, with ticks at 2016.10, 2017.01, 2017.04 and 2017.07; y-axis: MAE, from 0.0 to 1.2. Legend: Baseline, average MAE 0.52; Augmented, average MAE 0.32.]

FIGURE 7.1: Out-of-sample adaptive nowcasting results -

Adding Wikipedia data to the model reduces nowcast mean absolute error (MAE) in almost every time period. The average reduction in error across the sample is 43%. These results are robust to a range of training windows as shown in Table 7.2. Therefore, Wikipedia data is also a strong out-of-sample predictor for drug demand.
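The 43% figure need not equal the reduction implied by the two average MAEs in Figure 7.1 (0.52 and 0.32). One possible explanation, which is an assumption and not stated in the text, is that it averages the per-period reductions. The sketch below shows that the two measures generally differ; the function names and the per-period values are hypothetical.

```python
def reduction_of_averages(baseline_avg, augmented_avg):
    """Relative reduction computed from two average MAEs."""
    return (baseline_avg - augmented_avg) / baseline_avg


def average_of_reductions(baseline, augmented):
    """Mean of the per-period relative reductions."""
    ratios = [(b - a) / b for b, a in zip(baseline, augmented)]
    return sum(ratios) / len(ratios)


# Average MAEs as reported in the Figure 7.1 legend.
print(round(reduction_of_averages(0.52, 0.32), 3))  # 0.385

# Hypothetical per-period MAEs, chosen only to show the two measures can differ.
baseline = [1.0, 0.2]
augmented = [0.7, 0.1]
print(round(reduction_of_averages(sum(baseline) / 2, sum(augmented) / 2), 3))  # 0.333
print(round(average_of_reductions(baseline, augmented), 3))                    # 0.4
```

Averaging per-period reductions gives more weight to periods where the baseline error is small, so it can exceed the reduction computed from the averages.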

TABLE 7.2: Out-of-sample results with different training windows. The main text results use a 12 month window.

Training Window    Baseline MAE    Augmented MAE
10 months          0.51            0.30
11 months          0.52            0.30
13 months          0.52            0.31
14 months          0.53            0.29

We also address two potential data limitations, data sparsity and deleted reviews (discussed in Section 3.3.3), by considering different aggregation frequencies and different start dates. Table 3.9 shows the effect of the aggregation frequency on the MAE; in all cases the augmented model outperforms the baseline model. Similarly, when using different start dates, Table 7.3 shows that the augmented model outperforms the baseline model.

TABLE 7.3: Out-of-sample results at different aggregation frequencies. The main text results aggregate data to 1 month frequency.

Aggregation Frequency    Baseline MAE    Augmented MAE
2 weeks                  0.45            0.30
4 weeks                  0.50            0.30
6 weeks                  0.55            0.32
8 weeks                  0.61            0.29
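Changing the aggregation frequency amounts to summing the raw counts over longer bins before computing percentage changes. A minimal sketch, assuming weekly counts as the starting point; the function name `aggregate` and the data are hypothetical:

```python
def aggregate(counts, k):
    """Sum consecutive blocks of k observations; an incomplete final block is dropped."""
    if k <= 0:
        raise ValueError("k must be positive")
    n_blocks = len(counts) // k
    return [sum(counts[b * k:(b + 1) * k]) for b in range(n_blocks)]


# Hypothetical weekly sales counts with sparse (zero) weeks.
weekly = [0, 3, 0, 0, 2, 0, 1, 4, 0]
for k in (2, 4):
    print(k, aggregate(weekly, k))
# 2 [3, 0, 2, 5]
# 4 [3, 7]
```

Coarser bins leave fewer zero counts, which matters because a percentage change is undefined after a zero, at the cost of fewer observations per series.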
