Classification and Regression Tree (CART): A Data Mining Technique

CHAPTER 2: LITERATURE REVIEW

2.4 Statistical Tools and Techniques

2.4.2 Classification and Regression Tree (CART): A Data Mining Technique

Compared with logistic regression, data mining techniques have not been widely

applied in tobacco-related research. A few studies that employed this class of methods

are Moon et al. (2012), Schane et al. (2010) and Gervilla et al. (2011). Data mining

(DM) is the application of specialized algorithms drawn from several disciplines,

namely statistics, artificial intelligence, machine learning, database science, and

information retrieval (Han & Kamber, 2001). DM techniques can be used for different

data types covering databases, text, spatial data, temporal data, images, and other

complex data (Frawley et al., 1991; Hearst, 1999; Roddick & Spiliopoulou, 1999;

Zaïane et al., 1998). The purpose of these techniques is knowledge discovery in databases

and in text and web mining; they use toolsets and processes to yield useful knowledge that

is distinct from the original data set (Benoît, 2002; Fayyad et

al., 1996; Han & Kamber, 2001). DM is a way of discovering interesting patterns that

are not obviously part of the data and that can be used to gain new knowledge about

the data and to make predictions (Witten & Frank, 2005). DM is a multi-stage process that

examines data and infers associations and rules from them. This mined information can be applied

for prediction and in classification models by detecting relations within the data records

or between the databases. The identified patterns and guidelines can then be used for

decision making and forecasting the effects of those decisions (Clifton, 2010).

The fundamental principle of data mining is that there are unseen but useful

patterns inside data and these patterns can be used to infer rules that allow for the

forecast of future results (GAO, 2004). Before the 1960s and the beginning of

the computer age, a data analyst with expert knowledge and training in statistics could

find patterns, make extrapolations, and discover interesting information, which was then

conveyed via written reports, graphs, and charts. Today, however, the task is too complex for a

single expert (Fayyad et al., 1996). Information is spread across multiple platforms and

deposited in a wide variety of formats, some of which are structured and some

unstructured. Data sources are often inadequate and some data are continuous while

others are discrete. All forms of DM rest on the principle of learning new

characteristics of the data by applying certain procedures to find patterns and to create

models which can then be used to make forecasts, or to find new data associations

(Benoît, 2002; Fayyad et al., 1996; Hearst, 2003). The other significant principle is the

importance of presenting the patterns in an understandable way. Once patterns have

been recognized, they must be delivered to the end user in a form that allows the

user to act on them and to provide feedback for decision making (Han & Kamber, 2001).

DM techniques such as neural networks (NN), decision trees (ID3, C4.5,

CHAID, QUEST, and CART), self-organizing maps (SOM), linear regression (local,

global), exponential regression, logistic regression, k-means, CN2, k-NN, radial basis

functions, and Bayes classifiers are divided into two broad groups, namely descriptive

(clustering) and predictive (classification and regression) (Benoît, 2002; Dunham, 2003;

[...]). The challenge is to decide on the appropriate data mining technique and its proper use for

a given application: for instance, when are neural networks (NN) appropriate and when are

decision trees (DTs)? When is data mining suitable as opposed to just working with

relational databases and reporting? When would online analytical processing (OLAP)

and multidimensional databases be appropriate? An approach commonly

followed in finding a suitable technique is trial and error. The choice of technique

depends on the type of problem and the information available. The advice is to prefer a

robust model, even one that may underperform, and to carry out the analysis without delay,

rather than one of the most sophisticated data mining techniques, which may perform better but require a

great deal of time to understand and interpret (Benoît, 2002; Dunham, 2003; Witten &

Frank, 2005).

Decision trees are predictive models that partition the data into nodes and leaves,

viewed as parts of a tree, until the entire set has been analyzed. Each branch of the tree is

created according to a classification criterion, and the leaves of the tree are formed

from all the possible outcomes of the criteria under study. Decision trees produce

rules that are mutually exclusive and collectively exhaustive, and they work from the forecast

target downward in what is known as a “greedy” search, choosing the best available split at each step. A tree partitions the data at each branch point without losing any of it: for instance, the total number of

observations in a parent node equals the sum of the observations contained in its two

child nodes. The decision tree approach is easy to understand compared with other DM

techniques (Romei & Turini, 2011; Sarker et al., 2011; Yoo et al., 2012), so it can be

used either to search for new information within databases or to build predictive

models.
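
The partition property described above can be illustrated with a minimal Python sketch. The records, the helper name split, and the age threshold of 30 are hypothetical values chosen for illustration; they are not drawn from any study cited here.

```python
# Hypothetical toy records: (age, smoking status) pairs invented for illustration.
records = [(18, "no"), (25, "yes"), (31, "yes"), (40, "no"), (52, "no"), (60, "yes")]

def split(node, threshold):
    """Send each record to exactly one child, based on a single age rule."""
    left = [r for r in node if r[0] <= threshold]
    right = [r for r in node if r[0] > threshold]
    return left, right

left, right = split(records, 30)

# Mutually exclusive: no record appears in both children.
assert not set(left) & set(right)
# Collectively exhaustive: the children together hold every parent record.
assert len(left) + len(right) == len(records)

print(len(records), len(left), len(right))  # 6 2 4
```

Because every record satisfies exactly one of the two conditions, applying the same step recursively to each child yields leaves whose rules remain mutually exclusive and collectively exhaustive.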

Decision tree algorithms include ID3, C4.5, Chi-Square Automatic

Interaction Detector (CHAID), Quick, Unbiased, Efficient Statistical Tree (QUEST) and

CART [...] (Agresti, 2007; Daeppen et al., 2000; Dunham, 2003; Giskes et al., 2005; Hagman et al.,

2008; Moon et al., 2012; Ruben & Canlas, 2009; Soni et al., 2011; Srinivas et al., 2010)

due to the following reasons: (a) CART offers a concise way of describing groups whose

elements vary in terms of the dependent variable. A set of rules governing the

decisions taken to assign an element to a class is presented graphically.

CART detects “splitting” variables through an exhaustive search of all possibilities. Because efficient algorithms are used, it can consider all candidate variables as splitters,

even in problems with many hundreds of possible predictors. (b) The predictor

variables are rarely well behaved: many are not normally distributed, and

different groups may have markedly different degrees of variation or variance.

Complex interactions or patterns may exist in the data; for instance, the value of one

variable (e.g., age) may markedly affect the importance of another variable (e.g.,

weight). Such relations are generally difficult, if not virtually impossible, to

model when the number of relations and variables becomes large. CART is often

able to discover complex relations between predictors that may be difficult or

impossible to discover using traditional multivariate techniques. CART can handle

numerical data that are highly skewed or multi-modal, as well as categorical predictors

with either ordinal or non-ordinal structure. Time is therefore saved that

would otherwise be spent determining whether variables are normally distributed and

transforming them if they are not. (c) CART can competently handle data with a

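combination of categorical and continuous variables. In tree algorithms such as QUEST, the chi-square test is used for

categorical variables and the F-test for continuous variables, whereas CART itself chooses splits by impurity reduction (e.g., the Gini index). For instance, most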

studies of smoking behavior among adults have used logistic regression,

which rests on parametric assumptions about the dependent variable. However, the

parametric assumptions of logistic regression often limit its application to data that are

[...] CART can be used to overcome the limitations posed by logistic regression. CART is

inherently non-parametric: it makes no assumptions about the underlying

distribution of the predictor variables and can handle both categorical and continuous data. (d)

The CART algorithm can efficiently handle missing data through surrogates. For cases in

which the value of a variable is missing, other independent variables that are highly

associated with the original variable are used for classification. (e) CART is a relatively

automatic ‘machine learning’ method, and little analyst input is needed for analysis. This contrasts markedly with other multivariate modeling methods, in which extensive input from

the analyst, analysis of provisional results, and subsequent modification or refinement

of the method are essential. (f) CART lends itself to visualization and is simple for

non-statisticians to interpret; it is also more likely to be feasible and practical, since the

structure of the rule and its inherent logic are apparent to readers.
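
The exhaustive search for splitting variables in reason (a) can be sketched as follows. This is a simplified illustration rather than the full CART algorithm: it assumes the Gini index as the impurity measure, restricts categorical predictors to one-level-versus-rest splits, and uses invented data and hypothetical function names (gini, candidate_splits, best_split).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_splits(values):
    """Yield (description, test) pairs for one predictor.

    Numeric predictors get one threshold per midpoint between sorted
    distinct values; categorical predictors get one "equals level" test
    per level (a simplification assumed for this sketch).
    """
    distinct = sorted(set(values))
    if all(isinstance(v, (int, float)) for v in distinct):
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            yield f"<= {t}", (lambda v, t=t: v <= t)
    else:
        for level in distinct:
            yield f"== {level!r}", (lambda v, level=level: v == level)

def best_split(rows, labels):
    """Try every candidate split on every predictor; keep the purest one."""
    best = None
    n = len(labels)
    for name in rows[0]:
        column = [row[name] for row in rows]
        for desc, test in candidate_splits(column):
            left = [y for v, y in zip(column, labels) if test(v)]
            right = [y for v, y in zip(column, labels) if not test(v)]
            if not left or not right:
                continue
            # Weighted average impurity of the two child nodes.
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, name, desc)
    return best

# Hypothetical toy data, invented for illustration only.
rows = [
    {"age": 19, "sex": "M"}, {"age": 23, "sex": "F"}, {"age": 35, "sex": "M"},
    {"age": 41, "sex": "F"}, {"age": 58, "sex": "M"}, {"age": 63, "sex": "F"},
]
labels = ["smoker", "smoker", "smoker", "non-smoker", "non-smoker", "non-smoker"]

score, name, desc = best_split(rows, labels)
print(name, desc, score)  # age <= 38.0 0.0
```

In this toy data the age threshold separates the two classes perfectly, so it is preferred over any split on sex. The same loop treats the numeric and the categorical predictor uniformly, which reflects the mixed-data handling noted in reason (c).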

CART is descriptive in nature, which makes the results of the model easy to understand and

interpret. In addition, it offers the efficiency and scalability of data mining

algorithms; is useful for handling high dimensionality and noise as well as uncertainty and

incompleteness; uses knowledge in data mining; supports pattern evaluation and knowledge

integration; and provides protection of security and privacy in data mining (Daeppen et al.,

2000; Dunham, 2003; Giskes et al., 2005; Hagman et al., 2008; Ruben & Canlas, 2009;

Soni et al., 2011; Srinivas et al., 2010).
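
The handling of incompleteness through surrogates, noted in reason (d) above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the complete surrogate procedure: the data, the variable names (age, years_employed, height_cm), the thresholds, and the helper functions are all hypothetical, and each candidate surrogate is scored only by how often it sends a case to the same side as the primary split.

```python
def agreement(primary_sides, surrogate_sides):
    """Share of cases that both rules send to the same side."""
    same = sum(p == s for p, s in zip(primary_sides, surrogate_sides))
    return same / len(primary_sides)

def best_surrogate(rows, primary, candidates):
    """Return the candidate rule that best mimics the primary split.

    Each rule is a (variable, test) pair. Only rows in which both the
    primary and the candidate variable are observed are compared.
    """
    pvar, ptest = primary
    best_name, best_score = None, -1.0
    for name, (var, test) in candidates.items():
        usable = [r for r in rows if r[pvar] is not None and r[var] is not None]
        if not usable:
            continue
        score = agreement([ptest(r[pvar]) for r in usable],
                          [test(r[var]) for r in usable])
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

def goes_left(row, primary, surrogate):
    """Route a case with the primary rule, or the surrogate if it is missing."""
    pvar, ptest = primary
    if row[pvar] is not None:
        return ptest(row[pvar])
    svar, stest = surrogate
    return stest(row[svar])

# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 20, "years_employed": 1, "height_cm": 180},
    {"age": 25, "years_employed": 4, "height_cm": 165},
    {"age": 28, "years_employed": 9, "height_cm": 172},
    {"age": 35, "years_employed": 12, "height_cm": 168},
    {"age": 50, "years_employed": 25, "height_cm": 175},
    {"age": None, "years_employed": 3, "height_cm": 178},
]
primary = ("age", lambda v: v <= 30)
candidates = {
    "years_employed <= 8": ("years_employed", lambda v: v <= 8),
    "height_cm <= 170": ("height_cm", lambda v: v <= 170),
}

name, score = best_surrogate(rows, primary, candidates)
print(name, score)                                     # years_employed <= 8 0.8
print(goes_left(rows[-1], primary, candidates[name]))  # True
```

The last record has no recorded age, so it is routed by the surrogate rule on years_employed, which agrees with the primary split in four of the five complete cases. The record is therefore retained in the analysis rather than discarded.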

In document Tobacco consumption, environmental tobacco smoke exposure and illicit drug use: A study on selected south Asian countries / Mohammad Alamgir Kabir (Page 72-76)