Chapter 3: Contextual Background of the Administrative Dataset and the

3.7 Functional Components of the Framework

In the preceding sections, we have discussed the major theoretical concepts that form the basis of our framework. We now discuss the functional components of the framework that are derived from the aforementioned concepts and how they fit together in our research. This discussion is brief, as further details of the methodological components of the framework are discussed in the next chapter.

The input to the research framework is an administrative health dataset. For each patient, this dataset is assumed to contain at least the disease codes representing the patient’s health condition at a point in time (e.g., during a hospital admission), along with key demographic and behavioural information such as age, sex and smoking status. The output of the framework takes two forms. First, the framework creates a baseline network, which represents an aggregated health trajectory for all chronic disease patients. Second, based on the baseline network, the framework should provide an interface for predicting the likelihood of chronic disease development for a new test patient who is as yet non-chronic. The interpretation and utility of the baseline network and its predictive capability will depend on the dataset, the context and the interests of the stakeholders. To illustrate this capability, we implement the framework empirically on an administrative dataset in Chapter 5 and interpret the results in that context.

Figure 3.7: Functional components of the framework

Figure 3.7 shows the functional components of the framework. The overall framework has two major parts, each of which offers an output, as described in the previous paragraph. These two outputs are discussed below.

3.7.1 Part 1: Generating a baseline network

The first part of the framework focuses on generating the baseline network (a concept introduced in Section 3.6.4). This network should represent the health trajectory of patients with a particular chronic disease (e.g., type 2 diabetes, or T2D). An administrative dataset containing the medical histories of patients serves as the input for this part. Initially, this dataset goes through three stages: an integrity check, code translation and organisation of the data. Together, these stages can be treated as data filtering, after which the resultant dataset is ready for analysis. We then need to partition the data into chronic and non-chronic cohorts. Depending on the context and requirements, one chronic disease (for example, T2D) is selected before implementing the framework. If a patient in the research dataset has T2D, they are placed in the chronic cohort; otherwise, they are placed in the non-chronic cohort. Each cohort is then further partitioned into two groups: the baseline partition and the training partition. As the names suggest, the baseline partition is used to construct the baseline network, and the training partition is reserved for training the predictive model later.
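The cohort and partition split described above can be sketched as follows. This is a minimal illustration, not the implementation used in this research: the Patient record, the function names, the random 50/50 split and the ICD-10-style codes (with E11 standing in for T2D) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Patient:
    # Hypothetical record layout: an identifier plus chronologically
    # ordered admissions, each admission being a list of disease codes.
    patient_id: str
    admissions: list = field(default_factory=list)


def split_cohorts(patients, chronic_code):
    """Place each patient in the chronic or the non-chronic cohort."""
    chronic, non_chronic = [], []
    for p in patients:
        has_code = any(chronic_code in codes for codes in p.admissions)
        (chronic if has_code else non_chronic).append(p)
    return chronic, non_chronic


def split_partitions(cohort, baseline_fraction=0.5, seed=0):
    """Randomly divide one cohort into baseline and training partitions.

    The split fraction is an assumption; the text does not prescribe one.
    """
    shuffled = list(cohort)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * baseline_fraction)
    return shuffled[:cut], shuffled[cut:]


# Illustrative data only.
patients = [
    Patient("p1", [["I10"], ["E11", "I10"]]),
    Patient("p2", [["J45"]]),
    Patient("p3", [["E11"]]),
    Patient("p4", [["I10"], ["J45"]]),
]
chronic, non_chronic = split_cohorts(patients, "E11")
chronic_baseline, chronic_training = split_partitions(chronic)
non_chronic_baseline, non_chronic_training = split_partitions(non_chronic)
```

Fixing the random seed keeps the partitions reproducible, which matters because the baseline network and the trained model must be built from disjoint sets of patients.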

The medical histories of all patients in the baseline partition of the chronic cohort, up until the point at which they are first diagnosed with the chronic disease (i.e., T2D), are iteratively aggregated to form the first baseline network, named the ‘positive baseline network’. This network represents the disease progression (i.e., patterns and prevalence) that leads to the particular chronic disease. Similarly, a ‘negative baseline network’ is derived from the baseline partition of the non-chronic patients. The process of generating a baseline network from multiple patients’ medical histories is called Statistical Aggregation and is discussed in detail in the next chapter (Section 4.7). Finally, the two baseline networks are merged into one network by adjusting for the attribution effect, as discussed earlier (see Section 3.6.6). In this process, properties of the baseline network, such as the prevalence of diseases and transitions, are given higher scores if they are more exclusive to the ‘positive baseline network’ and lower (negative) scores if they are more exclusive to the ‘negative baseline network’. The resultant network is simply called the baseline network and represents the final output of the first part of the framework.
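To make the aggregation and merging steps concrete, the sketch below counts disease occurrences and transitions between consecutive admissions, then scores each item by how exclusive it is to the positive network. It is only an illustration under stated assumptions: the actual Statistical Aggregation procedure and attribution adjustment are defined in Sections 4.7 and 3.6.6, and the function names, the exclusion of the diagnosing admission, and the normalised-difference score used here are hypothetical stand-ins.

```python
from collections import Counter


def truncate_at_first_diagnosis(admissions, chronic_code):
    """Keep only the admissions before the chronic code first appears.

    Assumption: the diagnosing admission itself is excluded.
    """
    for i, codes in enumerate(admissions):
        if chronic_code in codes:
            return admissions[:i]
    return admissions


def aggregate_network(histories):
    """Count disease occurrences (nodes) and transitions between
    consecutive admissions (edges) across many patient histories."""
    nodes, edges = Counter(), Counter()
    for admissions in histories:
        for codes in admissions:
            nodes.update(set(codes))
        for prev, curr in zip(admissions, admissions[1:]):
            for a in set(prev):
                for b in set(curr):
                    edges[(a, b)] += 1
    return nodes, edges


def merge_scores(pos, neg, n_pos, n_neg):
    """Score each item in [-1, 1] (illustrative formula, not the thesis's).

    +1 means the item occurs only in the positive network, -1 only in the
    negative network. Counts are normalised by cohort size first.
    """
    scores = {}
    for key in set(pos) | set(neg):
        p = pos[key] / n_pos
        q = neg[key] / n_neg
        scores[key] = (p - q) / (p + q)
    return scores


# Illustrative histories: lists of admissions, each a list of codes.
positive = [[["I10"], ["E78"]], [["I10"]]]
negative = [[["J45"]], [["I10"]]]
pos_nodes, pos_edges = aggregate_network(positive)
neg_nodes, neg_edges = aggregate_network(negative)
node_scores = merge_scores(pos_nodes, neg_nodes, len(positive), len(negative))
edge_scores = merge_scores(pos_edges, neg_edges, len(positive), len(negative))
```

Normalising by cohort size before comparing the two networks prevents the larger cohort from dominating the merged scores.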

3.7.2 Part 2: Generating predictive model

The second part of the framework focuses on generating a predictive model for the chronic disease of interest. As with any generic predictive model, this part is divided into training and test phases. In general, the predictive model compares the baseline network with the medical history of a patient who has not yet been diagnosed with the particular chronic disease, and calculates this patient’s risk of developing the chronic disease in the future. The comparison method used here is called Longitudinal Distance Matching, which calculates how similar the test patient’s network is to the baseline network. The similarity is scored against three risk factors based on graph theory and social network theory. The detailed formulation of these scores is given in the next chapter, “Methodology and Framework”. Along with these scores, which are calculated by comparison with the baseline network, the framework should also consider the demographic and behavioural risk factors of the test patient, such as age, sex or smoking history. There are therefore multiple risk scores for each test patient, and the overall risk of chronic disease is calculated by developing a linear prediction model from these scores. In this model, a weighting factor (i.e., a parameter) is associated with each of the risk factors. The optimal values of these parameters are estimated through standard binary logistic regression or another parameter estimation model, using the training partitions from each cohort (established in the previous step). Once the weighting parameters are known and optimised from the training phase, the framework can predict a test patient’s likelihood of developing the chronic disease by comparing their medical history with the baseline network and calculating the risk scores.

We should also consider data mining approaches to prediction; for example, a binary classification method can be used. The choice of predictive model can differ based on the context and performance, and it is also of interest to see which predictive modelling approach performs better. Therefore, in our implementation of the framework, we follow an exploratory approach, running different models and discussing their relative performance.
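As a rough illustration of the training and test phases, the sketch below estimates the weighting parameters by binary logistic regression and then scores a new patient. It assumes that the risk scores have already been computed and scaled to comparable ranges. The feature layout (one graph-based similarity score standing in for the three, plus scaled age and a smoking flag), the function names, the learning rate and the data are all hypothetical, and plain gradient descent is used only to keep the example self-contained.

```python
import math


def predict_risk(weights, scores):
    """Return the predicted probability of developing the chronic disease.

    weights[0] is the intercept; the remaining entries weight each score.
    """
    z = weights[0] + sum(w * s for w, s in zip(weights[1:], scores))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Estimate logistic regression weights by batch gradient descent
    on the mean log-loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for scores, label in zip(X, y):
            err = predict_risk(weights, scores) - label
            grad[0] += err
            for j, s in enumerate(scores):
                grad[j + 1] += err * s
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


# Training phase: rows are [graph_score, scaled_age, smoker] (illustrative).
# Label 1 = patient from the chronic cohort, 0 = non-chronic cohort.
X_train = [[0.9, 0.7, 1], [0.8, 0.6, 0], [0.2, 0.3, 0], [0.1, 0.5, 1]]
y_train = [1, 1, 0, 0]
weights = fit_logistic(X_train, y_train)

# Test phase: score a new, as yet non-chronic patient.
risk = predict_risk(weights, [0.85, 0.6, 1])
```

In practice, an established statistical package would normally replace the hand-written optimiser, and the same feature matrix could be passed to other binary classifiers to compare their relative performance.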

3.8 Summary

In this chapter, we have introduced the dataset and the functional concepts of the framework in the context of the Australian health system. Chapter 2 established the methodological background in an abstract way, in relation to the current literature on disease prediction. It was therefore necessary for this chapter to discuss the specific characteristics of a real-world dataset and other contextual information. This included a detailed description of how the data is generated and transmitted, which enabled a closer look at the common inconsistencies that we should be aware of. We then discussed the associated security and privacy issues and introduced a database structure suitable for supporting our methodological framework. Finally, we discussed some key concepts related to the framework and showed how these concepts fit together to make up its functional components.

Having introduced the contextual background of the administrative dataset and the framework, we now focus on the methodological and mathematical details of the framework. We will discuss these in the next chapter.

In document Predicting the Risk of Chronic Disease: A Framework Based on Graph Theory and Social Network Analysis (Page 104-109)