5.4 Experimental Analysis
5.4.2 Experiment Result
The DERL model for the clinical data is shown in Figure 5.4(a). For both the hospital attributes and the patient attributes we learned multinomial mixture models using the hidden mixture attributes Hospital.Zho and Patient.Zpa. The system was optimized to have 60 patient clusters and 3 hospital clusters. Both the relations between patients and procedures and the relations between patients and diagnoses are modeled with reference uncertainty. Thus the two relationship classes have the additional attributes Selectpr and Selectdi, respectively. The values of Selectpr and Selectdi indicate which procedure or diagnosis, respectively, the physician assigns to the patient. Selectpr and Selectdi follow multinomial distributions with parameters $\theta_i$ and $\phi_i$, which are individual to each patient. The two parameters share a prior $G_{pc,Z_{ho}}$¹, which is a sample from a Dirichlet process. Note that
the base distribution $G_0$ of the Dirichlet process is a product of two independent Dirichlet distributions, as given in Equation 5.5. In the experiments we assume that
\[ \beta_{pr} = \left(\tfrac{1}{M_{pr}}, \tfrac{1}{M_{pr}}, \ldots, \tfrac{1}{M_{pr}}\right), \qquad \beta_{di} = \left(\tfrac{1}{M_{di}}, \tfrac{1}{M_{di}}, \ldots, \tfrac{1}{M_{di}}\right), \]
where $M_{pr}$ and $M_{di}$ denote the number of procedures and diagnoses, respectively (367 and 703 in our case). The base distribution expresses unbiased priors: before the arrival of the data, we believe that every procedure (respectively, every diagnosis) has the same probability. It also specifies that, a priori, procedures and diagnoses are independent. However, the posterior learned by the Dirichlet enhanced model is able to represent dependencies
between procedures and diagnoses. The confidence parameters $\beta^0_{pr}$ and $\beta^0_{di}$ for $G_0$ are optimized via v-fold cross-validation. Since the relations depend on Patient.PrimeComplaint and Hospital.Zho, we use a separate prior distribution for each configuration of the parents. As mentioned in Section 5.2.4, this raises the issue of overfitting. To alleviate it, we employ linear interpolation smoothing. In this case, it yields:
\[ \hat{P}(s_{pr} \mid Z_{ho}, pc) = \lambda_0 P(s_{pr}) + \lambda_1 P(s_{pr} \mid Z_{ho}) + \lambda_2 P(s_{pr} \mid pc) + \lambda_3 P(s_{pr} \mid Z_{ho}, pc), \]
and a corresponding expression for the diagnosis selections $s_{di}$. The weights $\lambda_\ell > 0$, $\sum_\ell \lambda_\ell = 1$, can be estimated using the EM algorithm. For readability, the smoothing variables are not shown in Figure 5.4(a).
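The interpolation above and the EM estimate of its weights can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function names, the fixed number of iterations, and the toy probabilities are assumptions. Each row holds the four component probabilities that the estimators $P(s)$, $P(s \mid Z_{ho})$, $P(s \mid pc)$ and $P(s \mid Z_{ho}, pc)$ assign to one held-out selection.

```python
import math


def interpolate(lambdas, probs):
    """Smoothed probability of one event: sum over l of lambda_l * P_l."""
    return sum(lam * p for lam, p in zip(lambdas, probs))


def log_likelihood(lambdas, rows):
    """Held-out log-likelihood of the interpolated model."""
    return sum(math.log(interpolate(lambdas, probs)) for probs in rows)


def em_interpolation_weights(rows, n_iter=100):
    """Estimate the interpolation weights by EM.

    rows[n][l] is the probability that component l assigns to the n-th
    held-out event. The number of iterations is an assumption; a
    convergence test could be used instead.
    """
    k = len(rows[0])
    lambdas = [1.0 / k] * k  # uniform initialisation
    for _ in range(n_iter):
        totals = [0.0] * k
        for probs in rows:
            mix = interpolate(lambdas, probs)
            for l in range(k):
                # E-step: responsibility of component l for this event
                totals[l] += lambdas[l] * probs[l] / mix
        # M-step: new weight is the average responsibility
        lambdas = [t / len(rows) for t in totals]
    return lambdas


# Toy held-out data (hypothetical numbers). The most specific estimator
# (last column) assigns zero probability to two events, which is the
# overfitting problem that the smoothing is meant to address.
toy_rows = [
    [0.01, 0.02, 0.10, 0.30],
    [0.01, 0.03, 0.05, 0.00],
    [0.02, 0.02, 0.20, 0.40],
    [0.01, 0.01, 0.02, 0.00],
]
weights = em_interpolation_weights(toy_rows)
print("interpolation weights:", weights)
```

The weights should be fitted on data that was not used to estimate the component distributions; otherwise EM tends to put nearly all mass on the most specific component.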
¹ Model selection showed that we obtain a better predictive model by using prime complaint as a
Figure 5.4: (a) DERL model for the medical application, where the model parameters $\theta_i$ and $\phi_i$ are individual to each patient. (b) PRM model for the same application, where the model parameters are global.
72 CHAPTER 5. DIRICHLET ENHANCED RELATIONAL MODELS
Figure 5.5: (a) ROC curves for predicting procedures, given prime complaint and patient and hospital attributes. The plots are averaged over all test patients. (b) ROC curves for predicting procedures, given the prime complaint respiratory problem and patient and hospital attributes.
The DERL model is compared with a standard PRM, which is shown in Figure 5.4(b). For more details about PRMs, see e.g. Friedman et al. (1999) and Chapter 2. The difference from the DERL model is that the multinomial distributions for selecting procedures (and diagnoses) are global rather than individual to each patient.
We evaluate model performance by predicting which procedures are applied. In the first experiment we predicted any of the procedures that a patient received, given the hospital attributes, the patient attributes, and the prime complaint. The corresponding ROC curve (averaged over all patients) for the DERL model is shown as E2 in Figure 5.5(a). In this experiment we selected the top C procedures recommended by the model. Sensitivity is the percentage of the procedures actually performed that were correctly proposed by the model; (1 - specificity) is the percentage of the procedures not actually performed that were nevertheless recommended by the model. Along the curves, C varies from left to right as C = 5, 10, ..., 50. E1 in Figure 5.5(a) shows the result of the standard PRM given the same information as E2; it is essentially identical to the result of E2. The situation changes when additional information, such as past procedures or diagnoses, becomes available: the standard PRM does not change its proposal probabilities. In contrast, the prediction of a subsequent procedure improves for the DERL model if the first diagnosis is available (E3) or if both the first diagnosis and the first procedure are available (E4). For example, if we propose 15 procedures after learning the prime complaint, the first diagnosis, and the first procedure, we cover approximately 83% of the procedures actually prescribed. Figure 5.5(b) shows the corresponding plots for patients with the prime complaint respiratory problem, which exhibit similar trends.
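The evaluation protocol described above (propose the top C procedures, then measure sensitivity and 1 - specificity, averaged over patients) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names and the toy scores are hypothetical, every performed procedure is assumed to appear among the scored procedures, and each patient is assumed to have at least one performed and one non-performed procedure.

```python
def roc_point(scores, performed, c):
    """Return (1 - specificity, sensitivity) when the top-c procedures are proposed.

    scores:    dict mapping procedure -> predicted selection probability
    performed: set of procedures actually performed for this patient
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    proposed = set(ranked[:c])
    true_pos = len(proposed & performed)
    false_pos = len(proposed - performed)
    n_pos = len(performed)
    n_neg = len(scores) - n_pos
    sensitivity = true_pos / n_pos
    false_pos_rate = false_pos / n_neg  # equals 1 - specificity
    return false_pos_rate, sensitivity


def average_roc(patients, cs):
    """Average the ROC points over patients for every cut-off C in cs.

    patients: list of (scores, performed) pairs.
    """
    curve = []
    for c in cs:
        points = [roc_point(scores, performed, c) for scores, performed in patients]
        fpr = sum(p[0] for p in points) / len(points)
        tpr = sum(p[1] for p in points) / len(points)
        curve.append((fpr, tpr))
    return curve


# Toy example with four hypothetical procedures and two patients.
toy_scores = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05}
toy_patients = [(toy_scores, {"a", "c"}), (toy_scores, {"b"})]
for c, point in zip([1, 2, 3], average_roc(toy_patients, [1, 2, 3])):
    print("C =", c, "-> (1 - specificity, sensitivity) =", point)
```

Sweeping C over 5, 10, ..., 50 as in the text traces one curve per information setting (E1 to E4); only the scores passed in change between settings.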
In the second set of experiments, we investigated how the procedure probabilities change sequentially as additional information becomes available. Figure 5.6(a) shows the selection probabilities for 20 procedures that are relevant to myocardial infarction. The top ten procedures are listed in Table 5.1. The first column indicates the predicted
Figure 5.6: (a) Procedure probabilities conditioned on increasing information. (b) Procedure probabilities for different hospital clusters.
Table 5.1: The most frequent procedures for disease No. 410.71.
Rank  Code   Description
1     88.56  coronary arteriography using two catheters
2     37.22  left heart cardiac catheterization
3     88.53  angiocardiography of left heart structures
4     36.06  insertion of coronary artery stent(s)
5     36.01  single vessel percutaneous transluminal coronary angioplasty
6     99.20  injection or infusion of platelet inhibitor
7     36.15  single internal mammary-coronary artery bypass
8     39.61  extracorporeal circulation auxiliary to open heart surgery
9     88.72  diagnostic ultrasound of heart
10    99.04  transfusion of packed cells
probabilities for the case that only patient attributes and hospital attributes are available. The second column shows the procedure probabilities when, in addition, the prime complaint circulatory problem becomes available. The third column shows the situation when, in addition, the first diagnosis acute myocardial infarction becomes available. The fourth column shows the situation when, in addition, the procedure single vessel percutaneous transluminal coronary angioplasty has been performed. One sees that the probabilities for procedures relevant to myocardial infarction increase when the prime complaint becomes available. In general, as more information becomes available, the model becomes more certain about the upcoming procedures for a patient. Figure 5.6(b) shows the probabilities of selecting procedures given the diagnosis single live-born in hospital for different hospital clusters. The probabilities vary significantly across clusters, which demonstrates that hospital attributes are highly relevant for procedure prediction. In this experiment, each hospital is assigned to its most likely cluster based on a mixture model.
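Assigning a hospital to its most likely cluster under a multinomial mixture amounts to maximizing the posterior over the hidden mixture attribute. The Python sketch below illustrates this step only; the function names, the two categorical attributes, and all probability tables are hypothetical, and the attributes are assumed to be conditionally independent given the cluster.

```python
import math


def cluster_posterior(x, weights, tables):
    """Posterior P(z = k | x) for a mixture of categorical attributes.

    x:       list of observed attribute values (as integer indices)
    weights: weights[k] is the mixing proportion of cluster k
    tables:  tables[k][a][v] = P(attribute a takes value v | cluster k)
    """
    log_joint = []
    for k, w in enumerate(weights):
        lp = math.log(w)
        for a, v in enumerate(x):
            lp += math.log(tables[k][a][v])
        log_joint.append(lp)
    # Normalise in log space for numerical stability.
    m = max(log_joint)
    unnorm = [math.exp(lp - m) for lp in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def most_likely_cluster(x, weights, tables):
    """Index of the cluster with the highest posterior probability."""
    post = cluster_posterior(x, weights, tables)
    return max(range(len(post)), key=post.__getitem__)


# Three hospital clusters as in the experiments; the two binary
# attributes and all numbers below are hypothetical.
toy_weights = [0.5, 0.3, 0.2]
toy_tables = [
    [[0.8, 0.2], [0.7, 0.3]],
    [[0.2, 0.8], [0.6, 0.4]],
    [[0.5, 0.5], [0.1, 0.9]],
]
print("cluster for x = [0, 0]:", most_likely_cluster([0, 0], toy_weights, toy_tables))
```

A hard assignment of this kind discards the posterior uncertainty; it is a convenient choice for plotting per-cluster probabilities as in Figure 5.6(b).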
5.5 Summary
In this chapter we analyzed how nonparametric hierarchical Bayesian modeling can be useful in relational learning and proposed a new model, DERL, which is one of the major contributions of this thesis. In DERL, model parameters can be attributes of entities or relations and can thus be non-global. These individual parameters share a common nonparametric prior, technically a distribution sampled from a Dirichlet process. As an important result, the posterior learned by DERL can exhibit rich structure and parameter dependencies that cannot be represented in a parametric formulation. We demonstrated the performance of the DERL model using data from a medical database. The relations are explicitly incorporated into the probabilistic model with reference uncertainty (Getoor et al., 2003), and the DERL model is used to encode the dependencies between patients and diagnoses and between patients and procedures. Although the base distribution (prior belief) exhibits parameter independence, the learned posterior does display parameter dependencies, so the couplings between diagnoses and procedures can be modeled faithfully.
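To make the role of the nonparametric prior concrete, the following Python sketch draws a truncated stick-breaking approximation of a sample from a Dirichlet process whose base distribution is a product of two independent Dirichlet distributions. It is an illustration only, not the inference procedure used in the experiments. The truncation level, the concentration parameter alpha, the toy numbers of procedures and diagnoses, the choice of Dirichlet parameters as confidence times a uniform mean vector, and all function names are assumptions.

```python
import random


def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet distribution via normalised Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_base(rng, m_pr, m_di, conf_pr, conf_di):
    """Draw (theta, phi) from the product base distribution.

    Assumption: each Dirichlet parameter vector is a confidence value
    times the uniform mean vector (1/M, ..., 1/M).
    """
    theta = sample_dirichlet(rng, [conf_pr / m_pr] * m_pr)
    phi = sample_dirichlet(rng, [conf_di / m_di] * m_di)
    return theta, phi


def sample_dp(rng, alpha, n_atoms, base_sampler):
    """Truncated stick-breaking draw G = sum_k w_k * delta(atom_k)."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last stick takes the leftover mass
    atoms = [base_sampler() for _ in range(n_atoms)]
    return weights, atoms


rng = random.Random(0)
# Toy sizes (4 procedures, 6 diagnoses) instead of the real 367 and 703.
weights, atoms = sample_dp(rng, 2.0, 50, lambda: sample_base(rng, 4, 6, 4.0, 6.0))
# Each patient draws its own (theta_i, phi_i) from the discrete G.
patient_params = rng.choices(atoms, weights=weights, k=5)
print("distinct parameter pairs among 5 patients:",
      len({id(p) for p in patient_params}))
```

Because every atom carries a (theta, phi) pair, patients that pick the same atom share both parameters. This is one way to see how couplings between procedures and diagnoses can arise in the posterior even though the base distribution treats them independently.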
Part III
Relational Learning with Infinite Mixture Models
Chapter 6
Finite Mixture Models
6.1 Introduction
The mixture model is a very common modeling tool that is well suited to situations where the samples are generated under different conditions. For example, suppose we want to survey people's reaction times when driving. It is better to divide the people into two subsets, those who have drunk alcohol and those who have not, and to model the reaction time with separate distributions conditioned on alcohol or non-alcohol, rather than to fit a single bimodal distribution with two peaks directly. Let $y$ denote the reaction time of a person, let $\theta_1$ and $\theta_2$ be the distribution parameters in the two situations, respectively, and let $\pi$ be the probability that a person has drunk alcohol. Then the distribution of the reaction time is represented as
\[ P(y \mid \pi, \theta_1, \theta_2) = \pi P(y \mid \theta_1) + (1 - \pi) P(y \mid \theta_2). \tag{6.1} \]
The atom distributions $P(y \mid \theta_1)$ and $P(y \mid \theta_2)$ are referred to as mixture components. When the atom distributions are parameterized, we can refer directly to the parameters ($\theta_1$ and $\theta_2$) as mixture components. The parameter $\pi$, referred to as the mixture proportion or mixture weight, specifies the proportion in which the atom distributions are mixed. The finite mixture model can be viewed as a special case of a more general specification, the continuous mixture model:
\[ P(y) = \int \pi(\theta) P(y \mid \theta)\, d\theta = \sum_{k=1}^{\infty} \pi_k P(y \mid \theta_k). \tag{6.2} \]
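Equation 6.1 can be evaluated directly once a form for the atom distributions is chosen. The Python sketch below assumes Gaussian components, which the text does not specify; the parameter values (reaction times in seconds) and the function names are likewise hypothetical.

```python
import math


def normal_pdf(y, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, pi, theta1, theta2):
    """Two-component mixture density as in Equation 6.1.

    pi is the mixture proportion of the first component; theta1 and
    theta2 are (mean, std) pairs of the assumed Gaussian components.
    """
    return pi * normal_pdf(y, *theta1) + (1.0 - pi) * normal_pdf(y, *theta2)


# Hypothetical parameters: alcohol (slower) and non-alcohol (faster).
pi_alcohol = 0.3
theta_alcohol = (1.2, 0.3)
theta_sober = (0.8, 0.2)
for y in (0.6, 0.8, 1.0, 1.2, 1.4):
    print(y, mixture_pdf(y, pi_alcohol, theta_alcohol, theta_sober))
```

Since the mixture is a convex combination of densities, it is itself a density; a numerical integral over a wide enough interval should therefore come out close to one.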
In the running example about reaction time, if the atom distributions are conditioned on the amount of alcohol a person has drunk, rather than on the binary variable alcohol/non-alcohol, we obtain a continuous mixture model. Furthermore, in terms of the mathematical form of the density function, the hierarchical Bayesian model introduced in Chapter 4 can be thought of as a variant of the continuous mixture model. The two models are, however, applied in different situations. The mixture model is applied when the samples
Figure 6.1: (a) Samples drawn from a population that mixes 3 Gaussian distributions in equal proportions. (b) The graphical representation of an empirical finite mixture model. $\Theta = \{\theta_1, \ldots, \theta_K\}$ are the $K$ mixture components and $\pi$ are the mixture proportions. The parameters $\Theta$ and $\pi$ are unknown but not random. Each observation $y_i$ is associated with an auxiliary variable $z_i$, which specifies the mixture component from which the observation $y_i$ is generated.
belong to a single data set and are generated under different conditions; it is widely used in clustering and classification problems. In comparison, the hierarchical model is used when multiple parallel data sets are available and these data sets come from different but related settings. The parameters for each data set are distinct but share a common prior, through which knowledge learned from previous data sets can be transferred to new data sets. The hierarchical model is widely used in meta-learning.
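The generative view of the finite mixture model in Figure 6.1, in which each observation $y_i$ first draws an auxiliary variable $z_i$ and then a value from the selected component, can be sketched as follows. The sketch mirrors the setting of Figure 6.1(a) with three equally weighted Gaussian components; the component means and standard deviations, the sample size, and the function name are assumptions.

```python
import random


def sample_mixture(rng, n, weights, components):
    """Draw n pairs (z_i, y_i) from a finite Gaussian mixture.

    weights:    mixture proportions pi_1, ..., pi_K
    components: list of (mean, std) pairs, one per mixture component
    """
    data = []
    for _ in range(n):
        # The auxiliary variable z_i selects the mixture component.
        z = rng.choices(range(len(weights)), weights=weights)[0]
        mean, std = components[z]
        # The observation y_i is generated from the selected component.
        data.append((z, rng.gauss(mean, std)))
    return data


rng = random.Random(0)
toy_components = [(-4.0, 1.0), (0.0, 1.0), (4.0, 1.0)]  # hypothetical values
samples = sample_mixture(rng, 3000, [1 / 3, 1 / 3, 1 / 3], toy_components)
counts = [sum(1 for z, _ in samples if z == k) for k in range(3)]
print("samples per component:", counts)
```

In practice only the $y_i$ are observed; the $z_i$ are latent and are inferred together with $\Theta$ and $\pi$, for example by EM.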