4. Disease State Index and Disease State Fingerprint
4.2 Requirements specification process
4.2.4 Scalability requirements
The number of training cases used for building the disease models was expected to be at most a few thousand previously diagnosed patients. The number of features was expected to range from a few to thousands. The whole range should be supported while still fulfilling the speed requirements defined earlier. Supporting scaling at this level would allow relatively large amounts of data to be processed interactively. The scaling should also extend to the reporting of analysis results, enabling clinicians to absorb all the important information regardless of the scale of the data.
4.2.5 Interpretability requirements
Interpretability is a subjective measure and therefore more difficult to assess. In the discussions with clinicians it became apparent that their ability to interpret the results is as important as the accuracy of the method. Incorporating new information into the diagnostic process would be challenging if the method provided, for example, only a single number indicating the probability of a patient having a disease. Thus, the method was required to provide a comprehensive and objective estimate of a patient’s disease state that also corresponded to his/her clinical status.
Another requirement for interpretability was to keep the algorithm simple enough that clinicians could verify the results using pen and paper if they wanted. The reasoning for this ‘white box’ approach was that if clinicians are able to see and understand how the algorithm arrives at its results, they should be more comfortable using this information in their decisions. Obviously, with enough data, manual verification would become inconvenient, but it should nevertheless remain possible. This requirement was in clear contrast to many modern classifiers, which process the data as a ‘black box’ that cannot be easily inspected by humans, especially if they are not machine learning experts.
The final major interpretability requirement was to indicate clearly the influence of diagnostic tests and any raw measurement values on the results. This would allow clinicians to see how much the different tests affect the classification, possibly determine which tests should be performed next and evaluate the results appropriately. Lastly, related to the speed requirement, the inclusion and exclusion of variables was required to be interactively modifiable, allowing exploration of patient data in search of answers to several clinical questions.
4.2.6 Consideration of existing methods
Several existing methods were considered after the requirements became clear. Not surprisingly, quick quantification of the disease state from heterogeneous and sparse data in a deterministic and understandable manner had not been extensively addressed in previous research at the time the work started. Although several promising approaches were found, they invariably fell short in areas of robustness, speed or interpretability.
Well-known classifiers and regression methods were considered first. Ensemble methods like RF and stochastic methods such as GP were the most promising. Ultimately, none of the existing methods fully satisfied all the requirements set for this work: their outputs (probability estimates of having the disease or not) did not always reflect disease progression. The algorithms were also often overwhelmingly complex for clinicians. Some research using these methods has since been done, and they appear to be reasonably good solutions for assessing disease progression [Young et al. 2012, Chincarini et al. 2011]. The other recently introduced disease state quantification methods mentioned in Section 3.4 were published in parallel with this thesis work and thus were not available for consideration when the work started. The method proposed in this thesis work is compared with the other disease quantification methods in more detail in Section 6.2.
4.3 Disease State Index
Research and development work for this thesis was done using the commercial software package MATLAB. This work resulted in the supervised learning method DSI and an associated visualization method DSF, described later in this chapter. These methods are the main topic of this thesis. Results from evaluating the methods with various data sets are provided in the next chapter, in which the thesis publications are summarized.
In short, the DSI is a supervised learning method that processes heterogeneous patient data to derive numeric index values denoting the disease state of a patient. Disease state can be considered a condition related to the progression of a disease based on data measured from a patient. DSI is the numeric quantification of disease state, taking values in the interval [0, 1]. In practice, the DSI is computed by comparing a patient’s measurement values comprehensively with previously diagnosed subjects with and without a disease. Previously diagnosed patients are provided as training data for the method, containing examples of control (healthy) cases and positive (disease) cases. The numeric values resulting from evaluating DSI, i.e. disease indices or DSI values, are defined as the location or rank of the patient between the control and positive cases. They denote the similarity of patient data to the positive cases in the training data. Thus, increasing DSI values indicate greater similarity to patients having the disease, based on the comparison with the training data. The following sections describe in detail how the DSI is computed.
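To make the idea of a patient's "location or rank between the control and positive cases" concrete, the following Python sketch computes a rank-based index in [0, 1] for a single feature. It is only an illustration under stated assumptions, not necessarily the formulation given in the following sections: the function name rank_index, the assumption that higher feature values indicate disease, and the choice of returning 0.5 when the value separates the groups perfectly are all hypothetical.

```python
from typing import Sequence


def rank_index(x: float, controls: Sequence[float], positives: Sequence[float]) -> float:
    """Illustrative rank-based index in [0, 1] for one feature.

    Assumption (not from the source): higher values indicate disease.
    The index is 0 when x lies below all training values and 1 when it
    lies above all of them.
    """
    if not controls or not positives:
        raise ValueError("both training groups must be non-empty")

    # Share of positive cases with a value below x: the cases that would be
    # missed if x were used as a decision threshold.
    fn = sum(1 for v in positives if v < x) / len(positives)

    # Share of control cases with a value above x: the cases that would be
    # wrongly flagged if x were used as a decision threshold.
    fp = sum(1 for v in controls if v > x) / len(controls)

    if fn + fp == 0:
        # x separates the two groups perfectly; return the midpoint
        # (an arbitrary convention chosen for this sketch).
        return 0.5
    return fn / (fn + fp)


# Hypothetical training values for a single measurement.
controls = [1.0, 2.0, 3.0, 4.0]
positives = [3.0, 4.0, 5.0, 6.0]

low = rank_index(0.0, controls, positives)    # below every training value
mid = rank_index(3.5, controls, positives)    # inside the overlap region
high = rank_index(7.0, controls, positives)   # above every training value
```

In this sketch the index never decreases as the measurement moves from the control range toward the positive range, which mirrors the statement that increasing values indicate greater similarity to patients having the disease. Each step is a simple count over the training cases, so a result of this kind could be checked by hand for small data sets.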