8.6.8 Case Study: Discussion
This case study applied the extended HCI-KDD methodology to a published application of SDG that generates electronic healthcare records in the midwifery domain for the labour and birth event. By describing each element of the methodology in turn, the case study investigated a comparatively simple SDG model that uses only one statistical dataset, some published guidelines and a procedural flowchart, known in the healthcare domain as a CareMap. It demonstrates that while some data is obvious on the surface, a wealth of knowledge lies hidden beneath it, and that the methodology described in this chapter is able to expose that knowledge. Knowing more about the inner aspects and characteristics of the data we seek to synthesise can only improve the likeness, accuracy and realism of the synthetic dataset.
8.7 Conclusion
It was discussed earlier that a large number of authors claim some requirement for realism in their SDG methods (see Chapters 2 and 7). All of the SDG methods reviewed in this research claimed success, giving rise to the impression that unsuccessful methods are not published. We also saw that the vast majority of SDG methods provide little or no evidence of a structured approach to identifying and recording the elements of the dataset they seek to synthesise, beyond ensuring superficial comparability to the obvious structural and statistical elements (see Chapters 4 and 7). One final point to reinforce here is that any validation of an SDG method that commences with anything less than the most complete set of the highest quality knowledge available to the researcher will be hampered and unable to justify the resulting synthetic data. Therefore, the use of a structured knowledge discovery methodology is possibly the only sound way of bringing together the seed knowledge required for any such validation process.
This chapter presented an enhanced and extended KDD method that can resolve both issues. The proposed method draws on a range of qualitative and quantitative observations and incorporates HCI-KDD principles, which are extended through the identification of concept hierarchies, formal concept analysis and a systematic analysis that delivers characteristic and classification rules. The knowledge recorded from these efforts resolves the issue of ensuring we have abundant, high-quality information about the observed data that we seek to synthesise. In identifying and recording that framework of knowledge, we have also provided a systematic approach to validating both the generation method used and the resulting synthetic data.
9. The HORUS Approach to Validation of Realism
This chapter presents the HORUS approach to validating and justifying the existence of realism in synthetic data. This chapter sets out to achieve the requirements for functional goal 8:
Functional Goal 8. Develop and integrate the additional processes to validate realism in SDG.
This chapter is structured as follows:
9.1 Introduction
9.2 The Validation Approach
9.3 Case Study
9.4 Discussion and Summary
9.1 Introduction
The presence of realism should only be asserted if it is verified (Penduff et al., 2006; Putnam, 1997). The domain of science should always be concerned primarily with testing: the validation and justification of any claim (Gallagher, Ritter, Satava, 2003; Haig, 1995). Validation is necessary to ensure the synthetic data we create is not skewed. The ability to validate should therefore be built in from the beginning, as many models that use synthetic data may become unreliable when given skewed data (Gao et al., 2007). It has been argued that self-validation of your own methodology is meaningless (Forer, 1949). Moreover, it has been observed, and is now confirmed by this research, that very few published models are validated (Barlas, 1996; Carley, 1996). In Appendix B we find many SDG models that claim success in the absence of a rigorous method of scientific validation. Some form of validation is necessary to support claims for realism in synthetic data (Penduff et al., 2006).
If the reader finds no documented evidence demonstrating the reliability of the SDG model, then the validity of the approach must be questioned (Moss, 1994). More than one of the SDG models reviewed discussed the fact that their generator could be, or was, run through many permutations or tweaks, with each rendering different synthetic data; however, the authors discuss and display the outcome of only one such operation (see Efstriadis et al., 2014; Gafurov et al., 2015; Ngoko et al., 2014). This gives rise to the question of whether the operation presented represents the only generation pass to deliver synthetic data remotely close to the real observations being modelled.
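One way to address the single-pass concern is to score every generation pass and report the whole spread of results. The following is a minimal sketch only, not a method drawn from any of the reviewed models: `summarise_runs`, `toy_generator`, `toy_score` and `REAL_MEAN` are hypothetical names and values introduced purely for illustration, and the toy generator simply draws random numbers.

```python
import random
from statistics import mean, median


def summarise_runs(generate, score, seeds):
    """Score every generation pass and report the full spread of scores."""
    scores = [score(generate(seed)) for seed in seeds]
    return {
        "runs": len(scores),
        "best": min(scores),      # lowest score (closest to the observed data)
        "median": median(scores),
        "worst": max(scores),     # highest score (furthest from the observed data)
    }


# Hypothetical stand-ins, assumed for illustration only.
REAL_MEAN = 50.0  # assumed summary value taken from the observed data


def toy_generator(seed):
    """Toy 'SDG model': 200 normally distributed values from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(50.0, 10.0) for _ in range(200)]


def toy_score(synthetic):
    """Lower is better: distance between the synthetic mean and the observed mean."""
    return abs(mean(synthetic) - REAL_MEAN)


summary = summarise_runs(toy_generator, toy_score, range(20))
print(summary)
```

Reporting the best, median and worst pass together lets the reader judge whether a displayed result is typical of the generator or an isolated favourable pass.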
Each of the four CM validation approaches discussed in Chapter 2, Section 2.6 is used to confirm a CM model’s relationship to observed data, and therefore it could be argued that each detects some degree of realism. At the very least, these CM validation approaches represent a legitimate starting point for the discussion of validating realism in SDG models. The CM validation methods demonstrate a largely quantitative approach, or at best possess only some minor qualitative properties. The common tendency observed in the literature has been to measure some number of statistical properties in the synthetic data and draw comparisons to the same properties in real datasets. When validating realism in numerical or forecasting models, quantitative methods are highly regarded. However, while statistical approaches may be of benefit to those models, they should not be relied upon solely, or to the complete exclusion of qualitative assessments of the interactive metrics, structure and characteristics that should have been identified from the real dataset (Penduff et al., 2006).
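The kind of quantitative comparison described above can be sketched as follows. This is an illustrative example only: the function names and the sample values are assumptions, and the chosen measures (difference in mean, difference in standard deviation, and the two-sample Kolmogorov-Smirnov statistic) are common choices rather than the measures used by any particular reviewed model.

```python
from bisect import bisect_right
from statistics import mean, stdev


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def compare_column(real, synthetic):
    """Compare one numeric field of the real and synthetic datasets."""
    return {
        "mean_diff": abs(mean(real) - mean(synthetic)),
        "stdev_diff": abs(stdev(real) - stdev(synthetic)),
        "ks": ks_statistic(real, synthetic),
    }


# Made-up values for illustration only; they are not taken from any dataset.
real_values = [3.1, 3.4, 3.6, 2.9, 3.8, 3.3]
synthetic_values = [3.0, 3.5, 3.7, 2.8, 3.9, 3.2]
print(compare_column(real_values, synthetic_values))
```

Measures of this kind describe one field at a time. Reordering the values of a field leaves all three measures unchanged even though it destroys that field's relationship to the other fields in each record, which is one reason such comparisons cannot stand in for an assessment of structure and characteristics.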