3.4 Machine Learning Driven Prognostic Model
3.4.2 Data Pre-Processing
As stated earlier, a non-knowledge-based CDSS relies on evidence extrapolated from a known dataset in order to provide predictions on unseen cases. Clinical data is composed of a number of patient data records (data points), each of which has a number of inputs expressed as independent variables and one output expressed as the dependent variable. Data pre-processing entails a number of sub-processes, which are vital for developing efficient and accurate prognostic models.
Clinical Variables Pre-Processing
In prognostic model development, candidate variables are pre-processed according to the data type of each clinical variable. Clinical variables can be categorised as follows:
1. Categorical variables, a type of variable that can take a finite number of values, thus assigning each individual to a specific group or “category”. Categorical variables can be further divided into:
Nominal variables, which have two or more values without an intrinsic order;
Ordinal variables, which have two or more values with an intrinsic order or ranking;
Binary (dichotomous) variables, which can assume only two values or categories. For example, if we were looking at gender, we would most probably categorise somebody as either “male” or “female”.
2. Continuous variables, also known as quantitative variables, which can take any real value within a given interval.
Continuous variables are normally used “as is” or after a normalisation process. Categorical variables cannot be used “as is”: they need to be encoded into a series of n - 1 binary variables, where n is the number of categories to be represented. It should be noted that n - 1 binary variables are able to define exactly n categories, whereas using n binary variables would produce an n-th variable that can be expressed as a function of the other n - 1, causing problems for learning algorithms (e.g. making the matrix inversion in the estimation algorithm impossible). This coding is necessary to avoid the well-known dummy variable trap, which makes the regression problem unsolvable [83]. Generally speaking, model specifications should always explain how variables are collected (including units of measure), calculated and used, in order to guarantee that the model will always be applied to datasets which are consistent with the one used for developing it, i.e. the variables are of the same kind and measured in the same units [84]. The “Effect Coding Scheme”, shown in Table 3.1, is often utilised to alleviate the collinearity problem in categorical clinical variables. This coding scheme generates additional independent variables, since each categorical variable is replaced by several coded variables.
Group   Dummy Codes    Effect Codes   Contrast Codes   Trend Codes
        a1  a2  a3     a1  a2  a3     a1  a2  a3       a1  a2  a3
A1       1   0   0      1   0   0      3   0   0       -3   1  -1
A2       0   1   0      0   1   0     -1   2   0       -1  -1   3
A3       0   0   1      0   0   1     -1  -1   1        1  -1  -3
A4       0   0   0     -1  -1  -1     -1  -1  -1        3   1   1
Table 3.1: Different types of coding schemes for categorical variables, adapted from the “Multiple Regression (MR) Using Categorical Variables in MR” tutorial.
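The dummy and effect codes of Table 3.1 can be illustrated with a minimal Python sketch. The level names A1 to A4 follow the table; the function names are hypothetical and chosen for illustration only, and A4 is assumed to be the reference level. The last lines show why encoding all n levels is problematic when the model also contains an intercept.

```python
# Hypothetical four-level categorical variable with levels A1..A4;
# A4 is treated as the reference level, as in Table 3.1.
LEVELS = ["A1", "A2", "A3", "A4"]


def dummy_code(value, levels=LEVELS):
    """Return n - 1 indicator variables; the last level maps to all zeros."""
    return [1 if value == level else 0 for level in levels[:-1]]


def effect_code(value, levels=LEVELS):
    """Like dummy coding, but the last level is coded -1 on every variable."""
    if value == levels[-1]:
        return [-1] * (len(levels) - 1)
    return dummy_code(value, levels)


def full_indicator_code(value, levels=LEVELS):
    """n indicators (one per level): the encoding that leads to the trap."""
    return [1 if value == level else 0 for level in levels]


for level in LEVELS:
    print(level, dummy_code(level), effect_code(level))

# With an intercept column of ones, the n indicators of every record sum to
# the intercept, so the columns of the design matrix are linearly dependent.
rows = [[1] + full_indicator_code(v) for v in ["A1", "A3", "A4", "A2", "A1"]]
print(all(row[0] == sum(row[1:]) for row in rows))
```

Dropping one indicator (dummy coding) or replacing the reference row with -1 (effect coding) removes this exact linear dependence while still distinguishing all n categories.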
Normalisation
Normalisation involves transforming the data to fall within a common range such as [-1, 1] or [0.0, 1.0]. The terms standardisation and normalisation are often used interchangeably in data pre-processing. Normalising the data attempts to give all clinical variables an equal weight. It is often useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbour classification and clustering. The most commonly used normalisation technique is z-score normalisation (zero-mean normalisation), which converts all variables to a common scale with a mean of zero and a standard deviation of one.
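The two kinds of rescaling mentioned above can be sketched as follows. This is a minimal illustration and not a prescribed implementation: the function names and the blood pressure readings are hypothetical, and the z-score uses the population standard deviation (dividing by the number of records), which is an assumption; some tools divide by one less than that.

```python
from math import sqrt


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("constant variable: standard deviation is zero")
    return [(v - mean) / std for v in values]


def min_max(values, low=0.0, high=1.0):
    """Linearly rescale values into the interval [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("constant variable: range is zero")
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]


# Hypothetical systolic blood pressure readings (mmHg), for illustration only.
systolic = [110.0, 120.0, 130.0, 140.0, 150.0]
print(z_score(systolic))             # common scale: mean 0, std 1
print(min_max(systolic))             # rescaled into [0.0, 1.0]
print(min_max(systolic, -1.0, 1.0))  # rescaled into [-1, 1]
```

In practice the mean, standard deviation, minimum and maximum would be estimated on the development dataset and then reused unchanged when the model is applied to unseen cases.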
Collinearity issue among independent variables
A collinearity test is often carried out to find out whether two or more independent variables are strongly correlated: if there is perfect collinearity between independent variables, it becomes impossible to obtain unique estimates of the model coefficients [85]. However, high levels of collinearity short of this also present a problem for any regression analysis [86], increasing the probability that a good predictor (i.e. an independent variable which has good explanatory power) is found to be non-significant and is then rejected by the model. It is estimated that less than 20% of the published literature on medical logistic regression models reported appropriate tests for detecting collinearity problems [87]. The prognostic model development process recommends that an appropriate test is carried out to detect collinearity issues. Various collinearity diagnostics are available, for example the variance inflation factor (VIF) or the tolerance statistic (defined as 1/VIF). The VIF provides an estimate of how much the variance of an estimated coefficient is increased by the effect of collinearity [88]. Common criteria for deciding that a collinearity problem is present are a tolerance value of less than 0.1 [89] or, equivalently, a VIF value greater than 10 [90].
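As an illustration of the VIF and tolerance diagnostics, the sketch below computes the VIF of each predictor as the corresponding diagonal element of the inverse of the predictors' correlation matrix, which is equivalent to 1/(1 - R²) from regressing that predictor on the others. The function names and the three predictors (age, weight, BMI) are hypothetical, and the data are invented for illustration; a statistical package would normally be used instead of the hand-written matrix inversion.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return one VIF per predictor; columns is a list of equal-length lists."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictors; weight and BMI are deliberately almost collinear.
age = [34, 45, 52, 61, 29, 48, 57, 66]
weight = [70, 82, 91, 65, 77, 88, 95, 72]
bmi = [23.1, 27.0, 29.8, 21.9, 25.2, 28.7, 31.0, 23.9]

for name, value in zip(["age", "weight", "bmi"], vif([age, weight, bmi])):
    flag = "collinearity problem" if value > 10 else "ok"
    print(f"{name}: VIF={value:.2f} tolerance={1 / value:.4f} {flag}")
```

With perfectly collinear predictors the correlation matrix is singular and the sketch raises an error, which mirrors the impossibility of obtaining unique coefficient estimates in that case.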
Missing Data Handling
Clinical decision making frequently involves making decisions under uncertainty because key patient data (e.g. demographics, episodic and clinical diagnosis details) are missing; this information is essential for modern clinical decision support systems to perform learning, inference and prediction operations. Machine learning and clinical informatics experts aim to reduce this clinical uncertainty by learning in the presence of missing clinical attributes, with a view to improving overall decision making. These high-dimensional clinical datasets are often complex and carry multifaceted patterns of missing key clinical attributes.
The problem of learning from incomplete real patient data acquired from hospital repositories can be handled from a statistical perspective. This could entail using the likelihood-based approach, one of the best-known techniques for dealing with this challenging issue. A statistical framework based on a set of statistical machine learning algorithms derived from the likelihood-based framework can handle clustering, classification and function approximation from missing or incomplete data in an intelligent and resourceful manner. Implementing mixture modelling algorithms, and utilising Expectation-Maximization techniques both for estimating the mixture components and for dealing with the missing clinical data, can provide useful insights into how best to approach classification once the missing values have been estimated. Another technique often used in such cases is mean substitution, in which a missing value is replaced by the mean of the observed values of that variable. For example, if the cholesterol level of a patient is not known, the mean cholesterol level of the other patients is substituted and classification of the dataset continues. It should be noted that mean substitution introduces only a trivial change in the correlation coefficient and no change in the regression coefficient; nevertheless, the likelihood-based approach is often the preferred choice due to its efficiency and consistency (maximum likelihood always produces the same results for the same set of data) when dealing with missing data in clinical datasets.
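The effect of mean substitution can be illustrated with a minimal Python sketch. The data, variable names and function names below are hypothetical, and the comparison is made against a complete-case analysis (records with a missing value dropped), which is an assumption about the baseline rather than something stated above.

```python
def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def slope_and_r(x, y):
    """Least-squares slope of y on x and the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Hypothetical data: cholesterol (mmol/L) with two missing readings and a
# fully observed outcome score. Values are illustrative only.
cholesterol = [4.2, 5.1, None, 6.3, 5.8, None, 4.9, 6.0]
outcome = [10.0, 13.0, 14.0, 17.0, 15.0, 13.5, 12.0, 16.0]

# Complete-case analysis: drop records with a missing cholesterol value.
pairs = [(c, o) for c, o in zip(cholesterol, outcome) if c is not None]
cc_slope, cc_r = slope_and_r([c for c, _ in pairs], [o for _, o in pairs])

# Mean substitution: keep every record, filling gaps with the observed mean.
imputed = mean_impute(cholesterol)
ms_slope, ms_r = slope_and_r(imputed, outcome)

print("slope (complete case, mean substitution):", cc_slope, ms_slope)
print("correlation (complete case, mean substitution):", cc_r, ms_r)
```

In this sketch the slope is the same under both analyses, because the imputed records sit exactly at the mean of the predictor and contribute nothing to the sums that define it. The correlation can only shrink, by an amount that depends on the outcome values of the imputed records. The imputed values add no new information about cholesterol, which is one reason a likelihood-based method may be preferred.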