Applications of supervised descriptive rule induction methods
We have applied SDRI methods to practical problem domains in medicine and biology.
In the medical application (Chapter 3), we developed and used our CSM-SD methodology
for contrast set mining through subgroup discovery on a real-life problem of analyzing
a dataset of patients with brain ischemia, where the goal of data analysis was to determine
the type of brain ischemia from risk factors obtained from anamnesis, physical examination,
laboratory tests and ECG data. The data analysis process was iterative and included
interaction with the domain expert in each iteration. First, standard data mining methods
were used, such as decision tree (Quinlan, 1986) and classification rule (Cohen, 1995)
learning, but both led to results that were not satisfactory to the domain expert. Second,
the data mining task was formulated as a contrast set mining task, and a formally justified
transformation of contrast set mining to subgroup discovery, named the pairwise
transformation, was introduced. In this iteration, the domain expert was not fully satisfied
with the result. Brainstorming on the methods used and discussing the expectations and
way of thinking in the analyzed domain led to important new insights that triggered
a new approach. In the next phase, an application-driven transformation from contrast
set mining to subgroup discovery was developed, which incorporated the experience from
the previous iterations and was well fitted to the problem and the expert’s expectations.
We named it the one-versus-all transformation. Last, we improved the explanatory
potential of the discovered patterns by providing the supporting factors for each discovered
pattern. The supporting factors, which had previously been defined and used in subgroup
discovery, were adapted to the contrast set mining task and successfully used in our
experiments. Visualization was also used in every iteration.
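
The two transformations named above can be illustrated at the level of data relabeling.
The following is a minimal sketch, not the CSM-SD implementation: it assumes that the
pairwise transformation yields one two-class task per ordered pair of groups (using only
the examples of those two groups), and that the one-versus-all transformation yields one
two-class task per group (using all examples). The function names and the group labels
"A", "B", "C" are hypothetical.

```python
from itertools import combinations


def pairwise_tasks(labels):
    """Assumed pairwise reading: one two-class task per ordered pair of groups.

    Each task keeps only the examples of the two groups and marks the
    first group as the target (True) and the second as non-target (False).
    """
    groups = sorted(set(labels))
    tasks = {}
    for a, b in combinations(groups, 2):
        for target, other in ((a, b), (b, a)):
            tasks[(target, other)] = [
                (i, lab == target)
                for i, lab in enumerate(labels)
                if lab in (target, other)
            ]
    return tasks


def one_versus_all_tasks(labels):
    """Assumed one-versus-all reading: one two-class task per group.

    Every example is kept; examples of the chosen group are the target
    (True) and all remaining examples are non-target (False).
    """
    groups = sorted(set(labels))
    return {
        g: [(i, lab == g) for i, lab in enumerate(labels)]
        for g in groups
    }


if __name__ == "__main__":
    labels = ["A", "B", "C", "A"]  # hypothetical group labels
    print(pairwise_tasks(labels)[("A", "B")])
    print(one_versus_all_tasks(labels)["C"])
```

Each resulting task is an ordinary two-class problem, so under these assumptions any
subgroup discovery algorithm could be run on it with the target group as the property
of interest.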
The analysis results were interpreted by the medical domain expert. The main aim of
our research was to discover the differences between two types of stroke, embolic stroke
and thrombotic stroke, in order to define the risk factors for the diseases. Both types
of stroke are ischemic (a clot blocks the blood flow to the brain), but the origin of the
clot is different. The results confirmed some already known risk factors for stroke, and
new insights were also gained. For example, high systolic blood pressure (above 139)
is in medical practice considered characteristic for both diseases. Our results confirm
this finding and also indicate that extremely high systolic blood pressure (above 185) is
not typical for embolic stroke patients. To summarize, our application proved successful:
known risk factors were confirmed and new insights were gained.
Several lessons have been learned from this experiment. First, iterative knowledge
discovery is a necessity. Second, the descriptive data analysis task is not concluded when
individual patterns are discovered; presenting the results to the end user with proper
visualization methods and additional information (in our case the supporting factors) makes
the discovered patterns more tangible and therefore more acceptable to the end user. Third,
the involvement of the end user is beneficial for achieving better analysis results. Last,
the involvement of the end user also benefits the development of the theory and
methodology. To summarize, the interaction with the end user is of vital importance, not
only for the application itself, but also for the development of the theory and methodology.
In the biological application (Chapter 4, Section 6.3), we applied the mining of closed
sets for labeled data (RelSets for short) to potato microarray data, where the goal was to
find differences in the response to viral infection between virus-resistant and
virus-sensitive transgenic potato lines (see Kralj et al., 2006, for more details). The
RelSets method was developed independently of any application, and the motivation for its
development was mainly theoretical. We applied it to microarray data because data mining
tasks on such data differ from traditional data mining tasks: microarray domains are
characterized by very large numbers of attributes (genes) relative to the number of
examples (observations, samples). Standard SDRI algorithms do not perform well on microarray
data because of this high dimensionality. In contrast, RelSets copes with high
dimensionality without difficulty.
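
The notion of closed sets on labeled data can be illustrated with a small sketch. This is
not the RelSets algorithm itself, only a naive illustration under stated assumptions:
examples are sets of binary features, closed sets are computed on the positive examples
as all non-empty intersections of positive examples, and each closed set is then
described by how many positive and negative examples it covers. The feature names
("g1", "g2", ...) and function names are hypothetical. In this formulation the number of
closed sets is bounded by the number of non-empty subsets of examples, independently of
how many attributes there are.

```python
def closed_sets(transactions):
    """Return all non-empty intersections of non-empty subsets of transactions.

    These intersections are exactly the closed itemsets with non-zero
    support. This naive enumeration is meant for small inputs only.
    """
    closed = set()
    for t in transactions:
        t = frozenset(t)
        # New closed sets: t itself and t intersected with every set found so far.
        closed |= {t} | {t & c for c in closed}
    closed.discard(frozenset())
    return closed


def covers(itemset, transaction):
    """An itemset covers an example if all its items occur in the example."""
    return itemset <= frozenset(transaction)


def describe(positives, negatives):
    """List (items, covered positives, covered negatives) per closed set."""
    rows = []
    for cs in closed_sets(positives):
        tp = sum(covers(cs, p) for p in positives)
        fp = sum(covers(cs, n) for n in negatives)
        rows.append((sorted(cs), tp, fp))
    return sorted(rows)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the potato experiment.
    positives = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g2", "g4"}]
    negatives = [{"g2"}, {"g3", "g4"}]
    for items, tp, fp in describe(positives, negatives):
        print(items, tp, fp)
```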
The involvement of the biological expert was crucial in the non-trivial data preparation
phase. Besides data cleaning and normalization, which are standard preprocessing
steps in microarray data analysis, expert-driven data discretization was also performed for
semi-automatic feature generation, which was the most complex step due to the
complicated biological experimental setup. Such data preprocessing can be used in other
similar settings. Once adequate features were generated, the algorithm was run and the
results were visualized with heatmaps, which are a standard method in microarray data
visualization. RelSets is not only fast (much faster than other SDRI algorithms on
microarray data) but also returns a small set of rules that are meaningful and easy for
domain experts to interpret (see Table 6 in Chapter 4).
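
To give a rough idea of what discretization into features can look like, the sketch below
turns numeric expression values into binary "up"/"down" features by thresholding. It is a
deliberately simplified assumption: the actual expert-driven procedure depended on the
experimental setup and is not reproduced here, and the thresholds, gene names and function
name are hypothetical.

```python
def discretize(expression, up=1.0, down=-1.0):
    """Map {gene: value} to a set of binary features such as 'g1=up'.

    Values at or above `up` give an 'up' feature, values at or below
    `down` give a 'down' feature, and values in between give no feature.
    The default thresholds are placeholders, not values from the study.
    """
    features = set()
    for gene, value in expression.items():
        if value >= up:
            features.add(f"{gene}=up")
        elif value <= down:
            features.add(f"{gene}=down")
    return features


if __name__ == "__main__":
    sample = {"g1": 1.5, "g2": -2.0, "g3": 0.2}  # hypothetical values
    print(sorted(discretize(sample)))
```

Each sample then becomes a set of features, which is the kind of input an itemset-based
method works on.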
The analysis results were interpreted by a biology expert. The expert was, for example,
able to determine the categories of genes that influence the sensitivity of potato plants
to the tested virus. The analysis results also helped to elucidate the time response of the
plants to the virus: all the plants responded similarly in the first eight hours after infection,
while the response twelve hours after the infection was different for resistant and sensitive
transgenic potato lines (Baebler et al., 2009).
In summary, we have shown the adequacy of supervised descriptive rule induction methods
in real-life data analysis problems where the goal is to gain new insights into the domain.
In our applications, the domain experts were intensively involved in the knowledge discovery
process, which was beneficial for both the expert and the methodology development.