4.1 Sampling-based scatterplot and parallel coordinates
4.1.2 Reality Check
A Reality Check function was added to give the user some confidence that a particular pattern is real rather than an artefact of the sampling. It essentially displays a completely new sample of the data (or as much of a new sample as possible if the sampling rate is greater than 50%).
Parallel coordinate plots following successive Reality Checks. [5% sampling rate - 5K Synthetic dataset]
Figure 4-6
The same section from three scatterplots following successive Reality Checks. Whereas the small areas below A and C appear in all three plots, the group of points below B is not apparent in the middle plot, which suggests that it is an artefact of the sampling rather than the data. [20% sampling rate - Parcels dataset]
Figure 4-7
[Diagram not reproduced: sample windows over records 1 to 1000 in panels (a) to (f), with boundaries marked at 300/301, 600/601, 900/901, 200/201 and 500; sampling rate 30% in (a) to (e) and 15% in (f); the 1st to 4th Reality Checks are indicated; key: start, end, main sample]
Figure 4-8
To illustrate the Reality Check in action, Figure 4-5 and Figure 4-6 show plots following three successive Reality Checks at low sampling rates. The scatterplots in Figure 4-5 each show a different 2% of the full dataset, and the denser regions near the bottom of each plot show a fairly consistent distribution of data points. This demonstrates that even small samples of data provide a representative plot and suggests that the patterns present are artefacts of the data and not of the sampling. In sparse regions, differences between the plots are to be expected; in this example they appear not just in the position of points but also in their colour, as there are 8 categories of educational achievement within the dataset. The successive parallel coordinate plots in Figure 4-6 are very similar, which strongly suggests that the clusters are real. This is not wholly surprising, as this particular synthetic dataset was created to test clustering algorithms. However, each plot is a relatively small random sample of the data (5%), and a visual inspection suggests that the sampling technique gives a representative set of data.
To illustrate a situation where Reality Check identifies sampling artefacts, Figure 4-7 shows the same section from a scatterplot following successive Reality Checks on a 20% sample of data from the Parcels dataset (details in Appendix B.3). The plot shows package weight against volume for a selection of 7760 parcels delivered by the German postal service. We are interested in the small areas of the plot just below the letters A, B and C. The groups of points at A and C appear consistently in all three plots, suggesting that they are inherent in the data. However, the group of points at B is not so apparent in the middle plot; it could well be an artefact of the sampling and should be investigated further.
Now we will look at how the Reality Check is implemented using the z-index method. We saw in Figure 4-3 how the data items are randomly selected for different sample sizes, ensuring display continuity. Figure 4-8 follows on to illustrate how Reality Check samples are generated. Diagram (a) shows a 30% sample of a 1000-record dataset. A Reality Check event attempts to produce a sample of new data items: it moves the start of the new sample to the item following the end of the previous sample and then calculates the end of the new sample from the sample size. As mentioned earlier, this results in a completely new sample only if the sample size is no more than 50%; otherwise, some of the previous data items will necessarily be included.
Successive Reality Checks are shown in (b) to (e). Note that when the sample window reaches the end of the dataset it wraps around and continues from the start of the dataset. Given the start and end points of the sample window, the visualisation can very easily calculate whether an item should be displayed. Finally, in Figure 4-8(f) the sampling rate has been reduced to 15%, which is easily achieved by moving the sample end point. To summarise, the sampling rate control moves the end, whilst the Reality Check moves the start to just after the previous end; a new end point is then calculated based on the sample size.
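The window arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the visualisation: the class name `SampleWindow` and its methods are hypothetical, positions are 0-based (so the 30% window of Figure 4-8(a) is positions 0 to 299 rather than 1 to 300), and the data items are assumed to be already arranged in their random order.

```python
from dataclasses import dataclass


@dataclass
class SampleWindow:
    """Hypothetical sample window over positions 0..n-1 of a randomly ordered dataset."""
    n: int          # number of records in the dataset
    size: int       # number of records in the current sample
    start: int = 0  # position of the first record in the sample

    def set_rate(self, rate: float) -> None:
        # The sampling rate control moves only the end point; the start stays put.
        self.size = round(rate * self.n)

    def reality_check(self) -> None:
        # Move the start to the item after the previous end, wrapping around
        # to the beginning of the dataset when the end is reached.
        self.start = (self.start + self.size) % self.n

    def contains(self, pos: int) -> bool:
        # Distance of pos from the start, measured with wrap-around.
        return (pos - self.start) % self.n < self.size


window = SampleWindow(n=1000, size=300)
for _ in range(3):
    window.reality_check()
# After three Reality Checks the window covers positions 900..999 and 0..199,
# which corresponds to panel (d) of Figure 4-8.
```

The modulo in `contains` is what makes the display test cheap: one subtraction and one comparison per item, whether or not the window has wrapped.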
The z-index method efficiently generates new samples whilst maintaining display continuity or, in the case of Reality Check, display discontinuity. However, several issues arise from its use. If the sampling rate divides the full dataset into a whole number of equal parts (e.g. 50%, 25%, 20%), successive Reality Checks will eventually wrap around and replicate previous samples. One solution would be to reshuffle the data items at the point when no new samples are possible (e.g. after four Reality Checks at a 25% sampling rate). This could be achieved by filling the sample column with new random values (as is done when the visualisation initialises) to generate a new index by which the items are ordered. This would avoid the user being presented with recurring samples, but it raises an issue when the sampling rate does not divide the dataset evenly.
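As a rough sketch of the two ideas in this paragraph, the following Python computes how many Reality Checks it takes for the window start to return to its original position, and produces a fresh random ordering. Both function names are hypothetical. The repeat count follows from simple modular arithmetic (the start advances by the sample size each time) and is not a result reported in this work; the reshuffle assumes the sample column is just a list of random numbers used as a sort key.

```python
import math
import random


def checks_until_repeat(n: int, size: int) -> int:
    """Number of Reality Checks after which the window start is back where it began.

    The start advances by `size` positions (mod n) per check, so it returns
    to its initial position after n / gcd(n, size) checks.
    """
    return n // math.gcd(n, size)


def reshuffle(n: int, rng: random.Random) -> list[int]:
    """Fill a hypothetical sample column with new random values and
    return the record indices ordered by those values."""
    sample_column = [rng.random() for _ in range(n)]
    return sorted(range(n), key=sample_column.__getitem__)


print(checks_until_repeat(1000, 250))  # 25% sampling rate
print(checks_until_repeat(1000, 300))  # 30% sampling rate
new_order = reshuffle(1000, random.Random())
```

For a 25% rate on 1000 records the fourth Reality Check reproduces the original sample exactly, whereas at 30% an exact repeat takes longer, although partially overlapping samples appear as soon as the window wraps.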
Referring to Figure 4-8(d) as an example, with a sampling rate of 30% and 1000 records, the sample after the third Reality Check includes 100 unseen data items plus 200 from wrapping around to the start. There might well be some features in the display which prompt the user to say that they have seen this sample before, and hence a case could be made for reshuffling whenever the sample wraps around. But would this be necessary when, say, the unseen component of the new sample is about half? The processing overhead of producing a new random index needs to be taken into account, as this is approximately 300 ms for the full cars dataset (almost 6000 records). One solution to the reshuffle-or-not dilemma would be to provide a reshuffle button so that the user could initiate the action, but it was felt that this might be an unnecessary source of confusion and so it was not included.
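The unseen component discussed here is easy to compute, which is what any automatic reshuffle policy would need. The sketch below is illustrative only: the function names are hypothetical, positions are 0-based, and it assumes the sampling rate stays fixed between Reality Checks.

```python
def window_positions(start: int, size: int, n: int) -> set[int]:
    """Positions covered by a window of `size` items beginning at `start`, with wrap-around."""
    return {(start + k) % n for k in range(size)}


def unseen_after_checks(n: int, size: int, checks: int) -> int:
    """Count the items in the sample shown after `checks` Reality Checks
    that did not appear in any earlier sample."""
    seen: set[int] = set()
    start = 0
    for _ in range(checks):
        seen |= window_positions(start, size, n)
        start = (start + size) % n
    return len(window_positions(start, size, n) - seen)


# 30% of 1000 records, third Reality Check: 100 unseen items, 200 repeated.
unseen = unseen_after_checks(n=1000, size=300, checks=3)
```

A policy such as "reshuffle when fewer than half of the items are unseen" could be built on this count, but whether that threshold is appropriate is exactly the open question raised above.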
Another issue was whether to provide a back function for the Reality Check. This would enable the user to go back to review the previous state, or even several previous states, of the sampled display. However, if the user had changed the sampling rate following the Reality Check (or the system had changed the rate, as with the auto-sampling described in the next chapter), a return to the previous state would not be possible. In addition, the function of the Reality Check is to “convince me that the patterns I see are real”; it is not a navigation function. The back feature was therefore not implemented.