
Chapter 4: Methodology and Framework

4.9 Risk Prediction

This is the second part of our framework, and perhaps more analytically complex than the previous ‘baseline creation’ part. In this phase, we predict the risk of developing the chronic disease for a test patient who has not yet been diagnosed with that disease. To accomplish this, we compare the medical history of that patient with the baseline network of the chronic disease (i.e., T2D) derived in the first part. Note that the baseline network has some important properties: the number of times a disease has occurred across all patients diagnosed with the chronic disease, denoted as strength; the average time gap between the occurrence of any disease a and a subsequent disease b; and the sequence of disease occurrence, called the direction of the network. When matching the test patient’s network with the baseline network, we consider the following principles:

1. The patient’s chance of developing the chronic disease is increased when the patient has a history of diseases that have a high strength in the baseline network.

2. The patient’s risk is increased if the patient has more diseases from the baseline network (this principle is analogous to collaborative filtering).

3. The patient’s risk is increased if the patient’s sequence of diseases largely matches the sequence of those diseases in the baseline network.

4. The patient’s risk is increased if the time gap between any two diseases in the patient matches the time gap of those two diseases in the baseline network.

In addition, we consider behavioural, sex- and age-related risk factors, as these are often associated with the onset of chronic disease. We discuss how to calculate the risk score considering these factors in the sections below.

4.9.1 Age, sex and behavioural risk factors

People approaching old age are often at higher risk of chronic diseases because of various pathophysiological, environmental and lifestyle factors. This is especially true for T2D, as there is strong evidence that age is a risk factor. Some chronic diseases are also likely to have a gender bias. To keep the framework generic, we also kept a sex-based risk score for diabetes. Finally, some behavioural risk factors, such as alcohol use or smoking, are also considered, as these act as risk factors for T2D. Whether or not the patient has a current or previous history of alcohol use or smoking is coded in the HCP data and reflected in the patient records. Therefore, by looking in the patient’s recorded list of diagnoses for the specific ICD codes that represent behavioural risk factors such as smoking and alcohol use, we can determine a behavioural risk score for a patient.


The age risk factor ($rf_{\text{age}}$) is a continuous score, ranging from 0 to 1. We divide the patient’s actual age (in years) by the difference between the maximum and minimum ages in the cohort in order to normalise the age score within the range 0 to 1. The sex risk factor ($rf_{\text{sex}}$) is essentially a categorical score and needs no further calculation, as the patient record in the dataset already has a flag indicating whether the patient is male or female. Finally, the score for behavioural risk factors ($rf_{\text{behav}}$) has the discrete value of 0 if the patient does not have any ICD codes that are considered risk factors (e.g., smoking) and 1 if at least one match is found.
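The age and behavioural scores described above can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the function names and the set `BEHAVIOURAL_CODES` are assumptions, and the ICD codes in it are placeholders, since the text does not list the codes used.

```python
# Placeholder ICD codes for behavioural risk factors (assumed for illustration;
# the actual code list used by the framework is not given in the text).
BEHAVIOURAL_CODES = {"F10", "F17"}


def age_score(age_years, cohort_min_age, cohort_max_age):
    """Divide the patient's age by the cohort's age range, as the text describes."""
    age_range = cohort_max_age - cohort_min_age
    if age_range <= 0:
        # Degenerate cohort (a single age): no meaningful normalisation.
        return 0.0
    return age_years / age_range


def behavioural_score(patient_codes, risk_codes=BEHAVIOURAL_CODES):
    """Return 1 if at least one recorded code is a behavioural risk code, else 0."""
    return 1 if any(code in risk_codes for code in patient_codes) else 0


example_age = age_score(40, cohort_min_age=20, cohort_max_age=100)
example_behav = behavioural_score(["E11", "F17"])
```

The sketch follows the text literally in dividing the age by the cohort age range; whether the minimum age is subtracted first is not stated, so that detail is left out. The sex score needs no function because it is read directly from the patient record.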

After the scores for age, sex and behavioural risk factors are set, the framework moves on to calculate the scores for the graph- and social network-based risk factors of a test patient by comparing the patient's network with the baseline network.

4.9.2 Longitudinal Distance Matching

In this part of the prediction framework, we compare the disease network of a non-chronic test patient with the baseline network and score it against three graph theory- and network-based risk factors. We call the overall comparison method ‘longitudinal distance matching’. The network comparison method and its name are motivated by the concept of the ‘String Edit Distance’ algorithm, also known as ‘Levenshtein distance’ (Levenshtein, 1966). String edit distance methods are widely used in spell checkers, word suggestion and optical character recognition. ‘Edit distance’ is defined as the minimum number of operations (insertions, deletions or substitutions) required to change one word into another. For example, to change ‘worde’ to ‘world’, we need to delete ‘e’ from the end of ‘worde’ and then insert ‘l’ before ‘d’. Thus the cost of this conversion is one deletion and one insertion. This algorithm considers the sequence of characters in the words and calculates the minimum cost of conversion; this is analogous to our target. According to our assumption above, if the baseline network has the disease sequence ‘a-b-c-d’, then a patient’s disease sequence ‘a-b-c’ is more similar to the baseline than another patient’s disease sequence ‘a-c-d’. Although both patient sequences have the same number (three) of overlap with the


However, for our case, we have two disease networks—the baseline and the test patient’s network—instead of a flat sequence of characters, as in typical string edit distance problems. In addition, the two networks have several attributes that also need to be matched. These factors make the two scenarios different in terms of data structure and implementation. Besides, the edit distance method would be computationally expensive for our case if we were to adapt it for our large sequence of diseases with networked structure. Further, the number of diseases in a test patient’s network should be considerably smaller than the overall baseline network, resulting in a great number of mismatches between the two networks. Therefore, we match the graph similarity against three different network-based risk factors and find the corresponding similarity-matching scores. Scores for each risk factor have a mathematical formulation, and their motivations come from network theory and SNA (see Chapter 2). These scores are discussed below.
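For reference, the string edit distance that motivates the method's name can be computed with the standard dynamic-programming recurrence. The sketch below is illustrative only and is not part of the framework, which matches networks rather than flat strings; the function name `edit_distance` is an assumption.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into an empty string
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


worde_to_world = edit_distance("worde", "world")  # one deletion plus one insertion
```

For ‘worde’ and ‘world’ the function returns 2, matching the one deletion and one insertion described earlier.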

Graph node match score: This measures the similarity between the test patient network ($N_{\text{test}}$) and the baseline network ($N_B$) in terms of disease prevalence. As diseases are depicted as the network nodes, this measure is considered to calculate the ‘node-based risk factor’, or $rf_{\text{node}}$. The measure also considers the prevalence intensity, or node frequency, of the baseline and test patient networks while calculating the score. Therefore, for a test patient, a high value in graph node match score is only possible under the following three scenarios:

1. The test patient has more diseases that are also present in the baseline network.

2. These diseases (that are common in both networks) have higher prevalence in the baseline network.

3. These diseases (that are common in both networks) have higher prevalence in the test patient’s network.


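Mathematically, we can define the score for the node-based risk factor $rf_{\text{node}}$ over the nodes $d_i^{\text{test}}$ of a test patient’s disease network $N_{\text{test}}$ as follows:

\[
rf_{\text{node}} =
\begin{cases}
\dfrac{\text{match score}}{\text{total preval}}, & \text{total preval} \neq 0 \\[1ex]
0, & \text{otherwise}
\end{cases}
\]

where

\[
\text{match score} = \sum_{i=1}^{|V(N_{\text{test}})|} \operatorname{freq}\left(d_i^{\text{test}}\right) \cdot \operatorname{freq}\left(d_j^{N_B}\right), \quad \text{where } d_i^{\text{test}} = d_j^{N_B}
\]

\[
\text{total preval} = \sum_{i=1}^{|V(N_{\text{test}})|} \operatorname{freq}\left(d_i^{\text{test}}\right)
\]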

The numerator, the ‘match score’, multiplies the frequency of each common disease in the baseline network by its frequency in the test patient’s disease network, and sums the results over all diseases common to both networks. The denominator normalises the score by the overall sum of frequencies for the nodes of the test network. If the denominator is 0, the score is kept at 0 to avoid a divide-by-zero error, although this case does not arise in practice, as it is ruled out earlier during the filtering step.
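The node match score can be sketched in a few lines. This is a minimal illustration under stated assumptions: each network is represented as a plain dictionary mapping a disease code to its node frequency, and the function name `node_match_score` and the example values are hypothetical.

```python
def node_match_score(test_freq, baseline_freq):
    """Match score over common diseases, normalised by total test-network frequency."""
    total_preval = sum(test_freq.values())
    if total_preval == 0:
        # Guard against division by zero, as in the 'otherwise' case.
        return 0.0
    match_score = sum(
        freq * baseline_freq[disease]
        for disease, freq in test_freq.items()
        if disease in baseline_freq  # only diseases present in both networks
    )
    return match_score / total_preval


# Hypothetical frequencies: 'a' and 'b' are common to both networks, 'x' is not.
test_network = {"a": 2, "b": 1, "x": 1}
baseline_network = {"a": 10, "b": 4, "c": 7}
score = node_match_score(test_network, baseline_network)  # (2*10 + 1*4) / 4
```

In this example the score is 6.0: the unmatched disease ‘x’ adds nothing to the numerator but still counts in the denominator, so diseases absent from the baseline lower the score.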

Graph pattern match score: Like the graph node match score, the graph pattern match score measures similarity, but in terms of disease transitions: the calculation considers the number of matching edges and their corresponding frequencies. The risk factor against which the score is given is denoted as the edge-based risk factor, or $rf_{\text{edge}}$. The equations are also similar to those of the graph node match score. The only difference is that, instead of disease prevalence, the calculation involves the transition prevalence between disease pairs; that is, it uses edge frequency.
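Since the text states only that the edge-based equations mirror the node-based ones, the following sketch is an assumption by analogy rather than the framework's stated formula. Edges are represented as directed `(source, target)` tuples mapped to their transition frequency; the function name `edge_match_score` and the example values are hypothetical.

```python
def edge_match_score(test_edge_freq, baseline_edge_freq):
    """Match score over common directed edges, normalised by total test edge frequency."""
    total_preval = sum(test_edge_freq.values())
    if total_preval == 0:
        return 0.0
    match_score = sum(
        freq * baseline_edge_freq[edge]
        for edge, freq in test_edge_freq.items()
        if edge in baseline_edge_freq  # direction matters: (a, b) != (b, a)
    )
    return match_score / total_preval


# Hypothetical transition frequencies.
test_edges = {("a", "b"): 2, ("b", "x"): 1}
baseline_edges = {("a", "b"): 5, ("b", "a"): 3}
score = edge_match_score(test_edges, baseline_edges)  # (2*5) / 3
```

Keying the dictionaries on ordered tuples means that a transition from ‘b’ to ‘a’ in the test network would not match the baseline transition from ‘a’ to ‘b’, which respects the direction of the network.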

Graph cluster match score: This measure is scored against the cluster-based risk factor, $rf_{\text{cluster}}$. The process is slightly different from the previous two network-based risk factors. This risk factor is based on the social network theory of clustering. The motivation behind the score is that diseases do not occur in isolation, but rather, a group of diseases tend to occur together: these diseases have a higher number of transitions (i.e., existence of edges, high frequency) between them and a lower number of transitions to diseases of other groups. Therefore, this score measures the proportion of edges in the test patient’s network that lie inside the same cluster in the baseline network.

To calculate the score, the framework first runs a clustering algorithm (Blondel et al., 2008) on the baseline network. The algorithm assigns a cluster ID to every node (disease) of the network. If two diseases in the network receive the same ID, they are in the same cluster. Once the baseline network nodes have received their cluster IDs, the test network is compared against them to measure cluster similarity. To do that, each edge of the test network is considered in turn. If the two nodes that make up the edge have the same cluster ID in the baseline network, we count that as a cluster match. The final match score is the sum of all cluster matches normalised by the number of edges present in the test network. Therefore,

\[
rf_{\text{cluster}} = \frac{\text{count of edges in } N_{\text{test}} \text{ whose corresponding nodes have the same cluster ID in } N_B}{\text{count of total edges in } N_{\text{test}}}
\]
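The cluster match calculation can be sketched as below. The sketch assumes the clustering step has already been run and its result is available as a dictionary from disease to cluster ID; the function name `cluster_match_score`, the handling of diseases missing from the baseline (counted as no match) and the example values are assumptions.

```python
def cluster_match_score(test_edges, baseline_cluster_id):
    """Proportion of test-network edges whose two endpoints share a baseline cluster."""
    edges = list(test_edges)
    if not edges:
        return 0.0
    matches = sum(
        1
        for source, target in edges
        # Assumption: an endpoint missing from the baseline cannot match.
        if source in baseline_cluster_id
        and target in baseline_cluster_id
        and baseline_cluster_id[source] == baseline_cluster_id[target]
    )
    return matches / len(edges)


# Hypothetical cluster IDs for the baseline network and edges of a test network.
baseline_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
patient_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "x")]
score = cluster_match_score(patient_edges, baseline_clusters)  # 2 matches out of 4 edges
```

Here the edges (a, b) and (c, d) fall inside a single baseline cluster, while (b, c) crosses clusters and (a, x) involves a disease absent from the baseline, giving a score of 0.5.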