Applications of supervised descriptive rule induction methods

We have applied SDRI methods to practical problem domains in medicine and biology.

In the medical application (Chapter 3), we developed and used our CSM-SD methodology for contrast set mining through subgroup discovery on a real-life problem of analyzing a dataset of patients with brain ischemia, where the goal of data analysis was to determine the type of brain ischemia from risk factors obtained from anamnesis, physical examination, laboratory tests and ECG data. The data analysis process was iterative and included interaction with the domain expert in each iteration. First, standard data mining methods were used, such as decision tree (Quinlan, 1986) and classification rule (Cohen, 1995) learning, but both led to results that were not satisfactory to the domain expert. Second, the data mining task was formulated as a contrast set mining task, and a formally justified transformation of contrast set mining to subgroup discovery, named the pairwise transformation, was introduced. In this iteration, the domain expert was still not fully satisfied with the result. Brainstorming on the methods used and discussion of the expectations and way of thinking in the analyzed domain led to important new insights that triggered a new approach. In the next phase, an application-driven transformation from contrast set mining to subgroup discovery was developed, which incorporated the experience from the previous iterations and was well fitted to the problem and the expert's expectations. We named it the one-versus-all transformation. Last, we improved the explanatory potential of the discovered patterns by providing the supporting factors for each discovered pattern. The supporting factors, which had been defined and used in subgroup discovery, were adapted to the contrast set mining task and successfully used in our experiments. Visualization was also used in every iteration.
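
The difference between the two transformations can be illustrated with a minimal sketch. It assumes that the pairwise transformation builds one two-class problem for every pair of classes, and that the one-versus-all transformation contrasts each class with all remaining examples. The function names, the third class "other" and the toy records are hypothetical and are not taken from the brain ischemia dataset. The sketch only prepares the binary targets on which a subgroup discovery algorithm would then be run.

```python
from itertools import combinations


def pairwise_tasks(examples):
    """Build one binary task per pair of classes.

    `examples` is a list of (features, class_label) tuples. For the pair
    (a, b), only examples of class a or b are kept; class a is the target.
    """
    classes = sorted({label for _, label in examples})
    tasks = {}
    for a, b in combinations(classes, 2):
        tasks[(a, b)] = [(x, y == a) for x, y in examples if y in (a, b)]
    return tasks


def one_versus_all_tasks(examples):
    """Build one binary task per class: that class against all the others."""
    classes = sorted({label for _, label in examples})
    return {c: [(x, y == c) for x, y in examples] for c in classes}


# Hypothetical toy records, invented for illustration only.
toy = [
    ({"systolic_bp": "high"}, "embolic"),
    ({"systolic_bp": "very_high"}, "thrombotic"),
    ({"systolic_bp": "normal"}, "other"),
    ({"systolic_bp": "high"}, "thrombotic"),
]

pairwise = pairwise_tasks(toy)
ova = one_versus_all_tasks(toy)
```

In this sketch, a pairwise task discards the examples of every class outside the pair, whereas a one-versus-all task keeps all examples and only relabels them.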

The analysis results were interpreted by the medical domain expert. The main aim of our research was to discover the differences between two types of stroke, embolic stroke and thrombotic stroke, in order to define the risk factors for the diseases. Both types of stroke are ischemic (a clot blocks the blood flow to the brain), but the origin of the clot is different. The results confirmed some already known risk factors for stroke and also yielded new insights. For example, high systolic blood pressure (above 139 mmHg) is in medical practice considered characteristic of both diseases. Our results confirm this finding and also indicate that extremely high systolic blood pressure (above 185 mmHg) is not typical for embolic stroke patients. To summarize, our application proved successful: known risk factors were confirmed and new insights were gained.

Several lessons have been learned from this experiment. First, iterative knowledge discovery is a necessity. Second, the descriptive data analysis task is not concluded when individual patterns are discovered; presenting the results to the end user with proper visualization methods and additional information (in our case the supporting factors) makes the discovered patterns more tangible and therefore more acceptable to the end user. Third, the involvement of the end user is beneficial for achieving better analysis results. Last, the involvement of the end user is also beneficial for the development of the theory and methodology. To summarize, the interaction with the end user is of vital importance, not only for the application itself, but also for the development of the theory and methodology.

In the biological application (Chapter 4, Section 6.3), we applied the mining of closed sets for labeled data (RelSets for short) to potato microarray data, where the goal was to find differences in the response to viral infection between virus-resistant and virus-sensitive transgenic potato lines (see Kralj et al., 2006, for more details). The RelSets method was developed independently of any application, and the motivation for its development was mainly theoretical. The reason for applying it to microarray data was that data mining tasks on microarray data differ from traditional data mining tasks: microarray domains are characterized by very large numbers of attributes (genes) relative to the number of examples (observations, samples). Standard SDRI algorithms do not perform well on microarray data because of this high dimensionality problem. In contrast, RelSets does not have any difficulty when faced with the high dimensionality problem.
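
The notion of a closed set can be made concrete with a small sketch. An itemset is closed if no proper superset is contained in exactly the same examples; equivalently, the closed itemsets with non-zero support are the intersections of groups of examples. The code below is a naive illustration of that definition on the positive examples, followed by a coverage count on both classes. It is not the RelSets algorithm itself, and the feature names (such as `g1_up`) and the toy examples are hypothetical.

```python
def closed_itemsets(transactions):
    """Return all non-empty closed itemsets of `transactions`.

    Closed itemsets are obtained as intersections of examples, built
    incrementally: each new example is intersected with every closed
    itemset found so far.
    """
    closed = set()
    for t in transactions:
        t = frozenset(t)
        closed |= {t & c for c in closed} | {t}
    closed.discard(frozenset())  # ignore the empty itemset for simplicity
    return closed


def coverage(itemset, transactions):
    """Count the examples that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= frozenset(t))


# Hypothetical discretized expression features, invented for illustration.
positives = [
    {"g1_up", "g2_up"},
    {"g1_up", "g2_up", "g3_down"},
    {"g1_up", "g3_down"},
]
negatives = [
    {"g3_down"},
    {"g2_up", "g4_up"},
]

closed = closed_itemsets(positives)
for itemset in sorted(closed, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), coverage(itemset, positives), coverage(itemset, negatives))
```

In this toy setting the number of closed itemsets is bounded by the intersections of the few available examples rather than by the number of attributes, which suggests one reason why a closed-set approach can remain tractable when attributes greatly outnumber examples.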

The involvement of the biological expert was crucial in the non-trivial data preparation phase. Besides data cleaning and normalization, which are standard preprocessing steps in microarray data analysis, expert-driven data discretization was also performed for semi-automatic feature generation, which was the most complex step due to the complicated biological experimental setup. Such data preprocessing can be used in other similar settings. Once adequate features were generated, the algorithm was run and the results were visualized with heatmaps, which are a standard method in microarray data visualization. RelSets is not only fast (much faster than other SDRI algorithms on microarray data) but also returns a small set of rules that are meaningful and easy for domain experts to interpret (see Table 6 in Chapter 4).

The analysis results were interpreted by a biology expert. The expert was, for example, able to determine the categories of genes that influence the sensitivity of potato plants to the tested virus. The analysis results also helped to elucidate the time response of the plants to the virus: all the plants responded similarly in the first eight hours after infection, while the response twelve hours after the infection was different for resistant and sensitive transgenic potato lines (Baebler et al., 2009).

In summary, we have shown the adequacy of supervised descriptive rule induction methods in real-life data analysis problems where the goal is to gain new insights into the domain. In our applications, the domain experts were intensively involved in the knowledge discovery process, which was beneficial for both the expert and the methodology development.
