Posterior analysis - Simulation study - Bayesian modelling and analysis of ranked data

2.4 Simulation study

2.4.1 Posterior analysis

Before we begin our investigation into the posterior distribution we shall first give some computational details. Our MCMC algorithm was initialised using a random draw from the prior distribution. We then performed 11K iterations, the first 1K of which were discarded as a burn-in period. This left us with 10K (almost) un-autocorrelated samples from our posterior distribution. The computational time required to perform inference on these data is approximately 1 and 1.3 seconds for Datasets 1 and 2 respectively. This inference scheme is implemented in C and computation is performed on a single thread of an Intel Core i7-4790S CPU (3.20 GHz clock speed).

Our posterior distribution is of high dimension, namely (n × K) + K in these analyses. Assessing convergence and mixing of each individual parameter is therefore problematic, especially as the dimension of our parameter space will increase significantly for larger datasets. Consequently it is desirable to obtain a method for conveniently assessing the convergence and mixing of a Markov chain for high dimensional sample spaces such as this. As opposed to considering each random variable in turn we instead propose to consider an overall summary of our random variables, namely the complete data likelihood, π(Z, D|λ) = π(Z|D, λ)π(D|λ). Gelman et al. (2014) advocate this approach to assessing convergence (especially when implementing mixture models, which we consider in Chapter 3). Figure 2.1 depicts trace plots of the log complete data likelihood (after burn-in) for the analyses of both Datasets 1 and 2. We observe that our chains appear to be mixing well and furthermore each chain appears to be sampling from its stationary distribution. Convergence (to the stationary distribution) was also verified by initialising numerous chains at different starting values and checking that the resulting posterior distributions are equivalent (up to stochastic noise) in all cases.

Figure 2.1: Trace plots of the log complete data likelihood for Datasets 1 and 2 from left to right respectively. [Axes in each panel: Iteration (0 to 10000) against Log-likelihood.]
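The multiple-chain check described above can also be summarised numerically. The sketch below computes the Gelman-Rubin potential scale reduction factor on a scalar trace such as the log complete data likelihood. This is a swapped-in diagnostic, not the comparison the text reports having used. The function name and the synthetic traces are hypothetical stand-ins and are not output from the study.

```python
import numpy as np

def potential_scale_reduction(traces):
    """Gelman-Rubin R-hat for a scalar summary (e.g. a log-likelihood trace).

    traces: array of shape (m, n), one row per chain, burn-in already removed.
    Values close to 1 suggest the chains agree; larger values suggest they do not.
    """
    traces = np.asarray(traces, dtype=float)
    m, n = traces.shape
    within = traces.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * traces.mean(axis=1).var(ddof=1)    # B: n times variance of chain means
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return float(np.sqrt(pooled / within))

# Synthetic stand-in traces (assumption: NOT results from the study).
rng = np.random.default_rng(0)
agree = rng.normal(-1500.0, 20.0, size=(4, 10_000))          # four chains, same target
disagree = agree + np.array([[0.0], [0.0], [0.0], [200.0]])  # one chain shifted away

rhat_agree = potential_scale_reduction(agree)
rhat_disagree = potential_scale_reduction(disagree)
print(rhat_agree, rhat_disagree)
```

Applying the diagnostic to a single scalar summary keeps the cost independent of the (n × K) + K parameter dimension, which is the motivation given above for monitoring the complete data likelihood.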

Given that we are satisfied that our MCMC scheme is generating realisations from the posterior distribution (for both analyses) we can now begin our investigation into the inferences on our skill parameters λ. In order to ease comparison we perform (offline) rescaling by letting λk → λk/λ20 for k = 1, . . . , 20 at each iteration of our MCMC output.

As λ20 now takes its true value we can compare our posterior marginals (for the remaining skill parameters) relative to the true values chosen at the beginning of this study, λ = (20, 19, . . . , 1). Figure 2.2 depicts boxplots of the marginal posterior distribution for each log λk. The distributions corresponding to the analyses of Datasets 1 and 2 are shown in white and red respectively. The blue crosses denote the true values from which the (informative) rankings were simulated. Note that, due to our rescaling, λ20 is constant and is therefore omitted from the plot. Furthermore, outliers, defined as those observations lying more than 1.5 times the interquartile range (IQR) beyond either the upper or lower quartile, have also been omitted.
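The offline rescaling and the outlier rule used for the boxplots can be sketched as follows. This is a minimal illustration, assuming the MCMC output is stored as an array with one row per iteration and one column per entity. The function names and the toy draws are hypothetical.

```python
import numpy as np

def rescale_by_reference(samples, ref=-1):
    """Divide every draw by its reference component (last column by default),
    i.e. lambda_k -> lambda_k / lambda_ref at each iteration."""
    samples = np.asarray(samples, dtype=float)
    return samples / samples[:, [ref]]

def boxplot_fences(values):
    """Limits beyond which an observation is treated as an outlier:
    more than 1.5 * IQR below the lower or above the upper quartile."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy draws for K = 3 entities (assumption: illustrative numbers only).
draws = np.array([[4.0, 2.0, 2.0],
                  [9.0, 6.0, 3.0],
                  [2.0, 1.0, 0.5]])
scaled = rescale_by_reference(draws)   # last column becomes exactly 1
log_scaled = np.log(scaled[:, :-1])    # drop the constant reference column
low, high = boxplot_fences([1.0, 2.0, 3.0, 4.0, 5.0])
print(scaled, low, high)
```

Because the reference column is identically 1 after rescaling, it carries no information and is dropped before plotting, which mirrors the omission of λ20 from Figure 2.2.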

Under the analysis of Dataset 1 we observe that the posterior marginal distributions typically have significant support for the true parameter values. This is not particularly surprising given these data were simulated from the true model. There are of course some

Figure 2.2: Boxplots summarising the marginal posterior densities for each log λk given that λ20 = 1. The densities in each case are shown in white and red for Datasets 1 and 2 respectively. The blue crosses depict the true values from which these data were simulated (log scale). [Axes: entity k = 1, . . . , 19 against log λk.]

support for larger values of λ than the values from which these data were generated. In other words the analysis suggests these entities are “stronger” (or more preferred) than we know they actually are. We believe this is a feature of these data and had we analysed a larger dataset consisting of say n = 1000 rankings we would expect our marginal posteriors to be somewhat more focussed around the true values. From the analysis of Dataset 1 we can conclude that the Plackett–Luce model (and our associated sampling scheme) are capable of making reasonable inferences from a set of ranked data.

We now consider the analysis of Dataset 2. It is interesting to see how the introduction of an additional 10 uninformative rankings has a significant effect on our marginal posterior distributions; see Figure 2.2. In this analysis the marginal posterior distributions often show little support for the true values from which the 40 informative rankings were simulated. However it is worth noting that, for this analysis at least, the model is still able to detect a (downward) trend in the preference of the entities similar to that under the analysis of Dataset 1. The trend however does appear less pronounced and the uninformative rankings seem to have induced a “flattening” effect, that is, the (relative) differences between our marginal posterior distributions are less compelling. The result of this is that there is more (posterior) uncertainty on the preference order of the entities. This is perhaps not surprising given that the uninformative rankings each express a random preference and therefore our model takes this into account by increasing uncertainty.

Often the aim of analysing ranked data is to obtain a so-called aggregate ranking. An aggregate ranking is a single ranking that summarises the preferences across all rankings contained within a particular dataset; to this extent it could be interpreted as an “average” ranking. There are numerous ways in which to obtain such a ranking. Here we choose to form our aggregate ranking by ordering the entities based upon their marginal posterior mean. The aggregate ranking, which we denote xagg, is therefore equivalent to the “optimal” ranking given λ̄ where λ̄ = (λ̄1, . . . , λ̄K) is the parameter vector containing the marginal posterior means for each entity. Formally xagg = x̂|λ̄ where x̂|λ is as in (2.7).

                Dataset 1       Dataset 2
  x̂       λ   xagg₁      λ̄1   xagg₂      λ̄2
  1   20.00       3   27.47       3   11.67
  2   19.00       1   25.83       1   11.44
  3   18.00       5   23.90       2   11.29
  4   17.00       2   22.66       4    9.84
  5   16.00       4   18.11       9    9.54
  6   15.00       6   17.98       5    9.41
  7   14.00       8   17.31       8    9.11
  8   13.00       7   16.12       6    8.55
  9   12.00       9   15.91       7    8.10
 10   11.00      10   14.50      10    7.48
 11   10.00      11   11.84      11    6.36
 12    9.00      12    9.95      12    5.42
 13    8.00      13    9.00      13    5.02
 14    7.00      14    7.17      14    4.17
 15    6.00      16    7.08      16    3.98
 16    5.00      15    4.99      15    3.47
 17    4.00      17    4.58      17    3.10
 18    3.00      18    2.81      19    2.16
 19    2.00      19    2.68      18    2.08
 20    1.00      20    1.00      20    1.00

Table 2.2: Aggregate rankings under our analysis of Datasets 1 and 2 along with the corresponding posterior means (denoted λ̄). The value of λ which was used to simulate these data is also reproduced for ease of comparison.
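The ordering rule described above amounts to sorting the entities by decreasing marginal posterior mean. A minimal sketch follows. The function name is hypothetical, and the input vector is the Dataset 1 column of Table 2.2 re-indexed by entity label (entity 1 first), which is a rearrangement made here for illustration.

```python
import numpy as np

def aggregate_ranking(post_means):
    """Return 1-based entity labels ordered from largest to smallest posterior mean."""
    post_means = np.asarray(post_means, dtype=float)
    order = np.argsort(-post_means, kind="stable")  # descending; ties keep label order
    return (order + 1).tolist()

# Posterior means for Dataset 1 from Table 2.2, indexed by entity 1, ..., 20.
lambda_bar_1 = [25.83, 22.66, 27.47, 18.11, 23.90, 17.98, 16.12, 17.31, 15.91, 14.50,
                11.84, 9.95, 9.00, 7.17, 4.99, 7.08, 4.58, 2.81, 2.68, 1.00]

x_agg_1 = aggregate_ranking(lambda_bar_1)
print(x_agg_1)
# [3, 1, 5, 2, 4, 6, 8, 7, 9, 10, 11, 12, 13, 14, 16, 15, 17, 18, 19, 20]
```

The printed ordering reproduces the Dataset 1 aggregate ranking column of Table 2.2.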

Table 2.2 provides the aggregate rankings for the analyses of Datasets 1 and 2 (denoted xagg₁ and xagg₂ respectively) along with the corresponding marginal posterior means (λ̄1 and λ̄2). To ease comparison the true values from which these data were simulated along with the “optimal” ranking based upon the true values (x̂) are also given. We observe how the aggregate rankings under both analyses are broadly consistent with the optimal ranking, particularly for the entities which are ranked tenth or lower. The Kendall-tau distance, Kτ(a, b), is a measure of distance which is defined for any two orderings, a and b. The value of the Kendall-tau distance is equal to the minimum number of adjacent (bubble sort) swaps which must be performed on b such that it becomes aligned with a. It is therefore a useful distance to use when comparing rankings (Marden, 1995). We can compute the Kendall-tau distance between the optimal ranking and the aggregate rankings under each

analysis of Dataset 1 results in an aggregate ranking that is, in some sense, more similar to the true preference ordering in comparison to the equivalent summaries based upon Dataset 2.
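The Kendall-tau distance described above can be computed by counting the pairs of entities that two orderings place in opposite order, which equals the minimum number of adjacent swaps. The sketch below applies this to the orderings listed in Table 2.2. The function name is hypothetical, and the two distances are computed here from the table, not quoted from the text.

```python
from itertools import combinations

def kendall_tau_distance(a, b):
    """Number of entity pairs that orderings a and b place in opposite order."""
    if sorted(a) != sorted(b):
        raise ValueError("a and b must order the same set of entities")
    pos_a = {entity: i for i, entity in enumerate(a)}
    pos_b = {entity: i for i, entity in enumerate(b)}
    return sum(
        (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) < 0
        for u, v in combinations(a, 2)
    )

# Orderings taken from Table 2.2 (most preferred entity first).
x_hat = list(range(1, 21))
x_agg_1 = [3, 1, 5, 2, 4, 6, 8, 7, 9, 10, 11, 12, 13, 14, 16, 15, 17, 18, 19, 20]
x_agg_2 = [3, 1, 2, 4, 9, 5, 8, 6, 7, 10, 11, 12, 13, 14, 16, 15, 17, 19, 18, 20]

d1 = kendall_tau_distance(x_hat, x_agg_1)
d2 = kendall_tau_distance(x_hat, x_agg_2)
print(d1, d2)  # 6 10
```

The smaller distance for Dataset 1 agrees with the statement that its aggregate ranking is closer to the true preference ordering.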

Another interesting feature of the posterior distributions is that although the aggregate rankings are somewhat similar the marginal posterior means of the entities are significantly different between the two analyses; see Table 2.2 and the boxplots shown in Figure 2.2. In other words, although λ̄1 and λ̄2 define different distributions over rankings, the modal ranking is similar under each parameter vector. However, as the marginal posterior means (of the skill parameters) are significantly less dispersed under the analysis of Dataset 2, this suggests an increased level of uncertainty about an entity’s position. To summarise this uncertainty we look at the probability of the modal ranking (xagg), calculated using the posterior means of the skill parameters (λ̄), relative to the same probability calculated under the uniform distribution. The probability of any (complete) ranking x under the uniform distribution is 1/K! and so the quantity of interest is r = K! Pr(X = xagg | λ̄). The idea is that large values of r indicate that the modal ranking has much larger (posterior) support than a ranking under the uniform distribution and so we can conclude that the ranking distribution (defined by λ̄) is, in some sense, more concentrated around the modal ranking. In contrast, small values of r suggest that the ranking distribution is not as concentrated around the modal ranking and so there is more variation within the ranking distribution. In other words, the differences between the probabilities of different rankings are much smaller. Small values of r therefore indicate increased levels of uncertainty about the position of entities within the ranking. Note that r = 1 corresponds to the uniform distribution over rankings and so in this case each ranking is equally likely. For the analyses considered here we obtain r1 = 103625 and r2 = 12364 for Datasets 1 and 2 respectively, and so the probability of the aggregate ranking under the analysis of Dataset 1 is over one hundred thousand times larger than the probability of a ranking under the uniform distribution, with this reducing to around twelve thousand times for the analysis of Dataset 2. It follows that, although the aggregate rankings under both analyses are similar, there is much more posterior support for the aggregate ranking under the analysis of Dataset 1. This again highlights how the uninformative rankings in Dataset 2 have weakened our inferences on the skill parameters.
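The ratio r can be evaluated directly from a vector of posterior means. The sketch below assumes the standard Plackett–Luce form, in which each stage selects the next entity with probability proportional to its skill among those not yet ranked. This section does not restate that formula, so treating it as the model definition is an assumption. Function names are hypothetical, and the inputs are the rounded means from Table 2.2, so the results should only be close to the reported r1 and r2.

```python
import math

def plackett_luce_prob(ranking, lam):
    """Probability of a complete ranking (0-based entity indices, most preferred
    first) under the standard Plackett-Luce form (assumed here)."""
    remaining = sum(lam[e] for e in ranking)  # total skill of entities not yet ranked
    prob = 1.0
    for e in ranking:
        prob *= lam[e] / remaining
        remaining -= lam[e]
    return prob

def concentration_ratio(lam):
    """r = K! * Pr(X = modal ranking | lam); r = 1 for the uniform distribution."""
    K = len(lam)
    modal = sorted(range(K), key=lambda e: -lam[e])  # entities by decreasing skill
    return math.factorial(K) * plackett_luce_prob(modal, lam)

# Posterior means from Table 2.2 (rounded to two decimal places).
lambda_bar_1 = [27.47, 25.83, 23.90, 22.66, 18.11, 17.98, 17.31, 16.12, 15.91, 14.50,
                11.84, 9.95, 9.00, 7.17, 7.08, 4.99, 4.58, 2.81, 2.68, 1.00]
lambda_bar_2 = [11.67, 11.44, 11.29, 9.84, 9.54, 9.41, 9.11, 8.55, 8.10, 7.48,
                6.36, 5.42, 5.02, 4.17, 3.98, 3.47, 3.10, 2.16, 2.08, 1.00]

r1 = concentration_ratio(lambda_bar_1)
r2 = concentration_ratio(lambda_bar_2)
print(r1, r2)  # compare with the reported r1 = 103625 and r2 = 12364
```

Since r depends only on the multiset of skill values, the means can be supplied in any entity order. Equal skills give r = 1, matching the uniform case noted above.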

To conclude, this study has shown how outlying rankings can have a significant effect on our posterior distribution. This is a feature of our model which is not particularly desirable. A more robust model would be one which was more flexible and had the capability to allow for potential heterogeneity between the strength of particular rankings. In the next section we shall detail an extension of the Plackett–Luce model which allows us to account for such potential heterogeneity.
