CHAPTER 2: LITERATURE REVIEW
2.4 Statistical Tools and Techniques 1 Logistic Regressions
2.4.2 Classification and Regression Tree (CART): A Data Mining Technique
In comparison to logistic regression, data mining techniques have not been widely
applied in tobacco-related research. The few studies that have employed this class of
methods include Moon et al. (2012), Schane et al. (2010) and Gervilla et al. (2011). Data mining
(DM) is the application of specialized algorithms drawn from several disciplines,
namely, statistics, artificial intelligence, machine learning, database science, and
information retrieval (Han & Kamber, 2001). DM techniques can be used for different
data types covering databases, text, spatial data, temporal data, images, and other
complex data (Frawley et al., 1991; Hearst, 1999; Roddick & Spiliopoulou, 1999;
Zaïane et al., 1998). The purpose of these techniques is knowledge discovery in databases,
text and the web; they use toolsets and processes to yield useful knowledge that is
distinct from the original data set (Benoît, 2002; Fayyad et
al., 1996; Han & Kamber, 2001). DM is the process of discovering interesting patterns that
are not obviously part of the data and that can be used to gain new knowledge about the
data and to make predictions (Witten & Frank, 2005). It is a multi-stage process that
examines data and infers associations and rules from them. The mined information can be applied
to prediction and classification models by detecting relations within the data records
or between databases. The identified patterns and rules can then be used for
decision making and for forecasting the effects of those decisions (Clifton, 2010).
The fundamental principle of data mining is that there are unseen but useful
patterns inside data and these patterns can be used to infer rules that allow for the
forecast of future results (GAO, 2004). Before the 1960s and the beginning of
the computer age, a data analyst with expert knowledge and training in statistics could
find patterns, make extrapolations, and discover interesting information, which was then
conveyed via written reports, graphs and charts. Today, the task is too complex for a
single expert (Fayyad et al., 1996). Information is spread across multiple platforms and
deposited in a wide variety of formats, some of which are structured and some
unstructured. Data sources are often inadequate and some data are continuous while
others are discrete. All forms of DM rest on the principle of learning new
characteristics of the data by applying certain procedures to find patterns and to create
models that can then be used to make forecasts or to find new associations in the data
(Benoît, 2002; Fayyad et al., 1996; Hearst, 2003). A second significant principle is the
importance of presenting the patterns in an understandable way. Once patterns have
been recognized, they must be conveyed to the end user in a way that allows the
user to act on them and to provide feedback for decision making (Han & Kamber, 2001).
DM techniques such as neural networks (NN), decision trees (ID3, C4.5,
CHAID, QUEST, and CART), self-organizing maps (SOM), linear regression (local,
global), exponential regression, logistic regression, k-means, CN2, k-NN, radial basis
functions and Bayes classifiers are divided into two broad groups, namely, descriptive
(clustering) and predictive (classification and regression) (Benoît, 2002; Dunham, 2003;
challenge is to decide which data mining technique is appropriate for a given
application: for instance, when are neural networks (NN) appropriate, and when are
decision trees (DTs)? When is data mining suitable as opposed to just working with
relational databases and reporting? When would OLAP (online analytical
processing) and multidimensional databases be appropriate? An approach commonly
followed in finding a suitable technique is trial and error. The choice of technique
depends on the type of problem and the information available. The advice is to take a
robust model, even one that may underperform, and carry out the analysis without delay,
rather than wait for what some of the finest data mining techniques could provide at the
cost of a great deal of time to understand and interpret (Benoît, 2002; Dunham, 2003; Witten &
Frank, 2005).
Decision trees are predictive models that partition the data into nodes and leaves,
arranged as a tree, until the entire data set has been analyzed. Each branch of the tree is
created according to the classification criteria, and the leaves of the tree are divided
according to all the possible outcomes of the criteria under study. Decision trees produce
rules that are mutually exclusive and collectively exhaustive, and they work from a forecast
target downward in what is known as a “greedy” search. The tree classifies information at each
branch point without losing any of the data. For instance, the total number of
observations in a parent node is equal to the sum of the observations contained in its two
child nodes. The decision tree approach is easy to understand in contrast with other DM
techniques (Romei & Turini, 2011; Sarker et al., 2011; Yoo et al., 2012), so it can be
used either to search for new information within databases or to build predictive
models.
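The partitioning property described above can be illustrated with a minimal sketch. The records, the variable names and the threshold of 35 are hypothetical values invented for illustration and do not come from any of the cited studies; the function `split_node` is likewise an illustrative helper, not part of any library.

```python
# Hypothetical toy records: (age, smoker) pairs invented for illustration only.
records = [(19, "yes"), (23, "no"), (31, "yes"), (38, "no"), (45, "no"), (52, "yes")]


def split_node(rows, threshold):
    """Send each row to exactly one child node, based on age <= threshold."""
    left = [r for r in rows if r[0] <= threshold]
    right = [r for r in rows if r[0] > threshold]
    return left, right


# The threshold of 35 is an arbitrary assumption chosen for the example.
left, right = split_node(records, threshold=35)

# The two rules (age <= 35, age > 35) are mutually exclusive and together
# cover every record, so no observation is lost or counted twice.
assert len(left) + len(right) == len(records)
assert not set(left) & set(right)
```

Because the two conditions cannot both hold for the same record, the observation count of the parent node always equals the sum of the counts in its two child nodes, whatever threshold is chosen.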
The decision tree algorithms include ID3, C4.5, Chi-Square Automatic
Interaction Detector (CHAID), Quick, Unbiased, Efficient Statistical Tree (QUEST) and
(Agresti, 2007; Daeppen et al., 2000; Dunham, 2003; Giskes et al., 2005; Hagman et al.,
2008; Moon et al., 2012; Ruben & Canlas, 2009; Soni et al., 2011; Srinivas et al., 2010)
due to the following reasons: (a) CART offers a concise way of describing groups whose
elements differ in terms of the dependent variable. A set of rules concerning the
decisions to be taken to assign a certain element to a class is presented graphically.
CART detects “splitting” variables through an exhaustive search of all possibilities. Because efficient algorithms are used, it is able to consider all potential variables as splitters,
even in problems with many hundreds of possible predictors. (b) Predictor
variables are rarely well behaved: many are not normally distributed, and
different groups may have markedly different degrees of variation or variance.
Complex interactions or patterns may exist in the data; for instance, the value of one
variable (e.g., age) may markedly affect the importance of another variable (e.g.,
weight). These types of relations are generally difficult, if not virtually impossible, to
model when the number of relations and variables becomes large. CART is often
able to discover complex relations between predictors that may be difficult or
impossible to discover using traditional multivariate techniques. CART can handle
numerical data that are highly skewed or multi-modal, as well as categorical predictors
with either ordinal or non-ordinal structure. Therefore, time can be saved that
would otherwise be spent determining whether variables are normally distributed, and
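transforming them if they are not. (c) CART can efficiently handle data with a
combination of categorical and continuous variables. It selects splits using impurity
measures such as the Gini index for a categorical dependent variable and least-squares
deviation for a continuous one; the Chi-square test and F-test are the splitting criteria
of algorithms such as CHAID and QUEST rather than of CART. For instance, most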
studies of smoking behaviors among adults have used logistic regression,
which rests on parametric assumptions about the dependent variable. However, these
parametric assumptions often limit its application to data that are
can be used to overcome the limitations posed by logistic regression. CART is
inherently non-parametric: it makes no assumptions concerning the underlying
distribution of the predictor variables and can handle any data type. (d)
The CART algorithm can efficiently handle missing data through surrogates. For cases in
which the value of a variable is missing, other independent variables that are highly
associated with the original variable are used for classification. (e) CART is a relatively
automatic ‘machine learning’ method that needs little input from the analyst. This is a noticeable difference from other multivariate modeling methods, in which extensive input from
the analyst, analysis of provisional results, and subsequent modification or refinement
of the method are essential. (f) CART lends itself to visualization and is simple for
non-statisticians to interpret; its results are more likely to be feasible and practical, since the
structure of the rule and its inherent logic are apparent to readers.
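The exhaustive search for splitting variables mentioned under (a) can be sketched as follows. This is a simplified illustration rather than a full CART implementation: it assumes the Gini impurity as the splitting criterion and numeric predictors only, and the data, the column meanings and the function names `gini` and `best_split` are hypothetical choices made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Try every predictor and every observed value as a candidate threshold.

    Returns (feature_index, threshold, weighted_impurity) for the split with
    the lowest weighted Gini impurity, or None if no valid split exists.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must send records to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical toy data: columns are (age, weight); labels are smoking status.
rows = [(19, 60), (23, 82), (31, 75), (38, 58), (45, 90), (52, 70)]
labels = ["yes", "yes", "yes", "no", "no", "no"]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

In this toy data set the age column separates the two classes perfectly at a threshold of 31, so the search selects that split with a weighted impurity of zero. A full tree would repeat the same search within each child node, which is what makes the procedure "greedy".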
CART is descriptive in nature, which makes the results of the model easy to understand
and interpret. In addition, it offers the efficiency and scalability of data mining
algorithms; is useful for handling high dimensionality and noise as well as uncertainty and
incompleteness; uses knowledge in data mining; supports pattern evaluation and knowledge
integration; and protects security and privacy in data mining (Daeppen et al.,
2000; Dunham, 2003; Giskes et al., 2005; Hagman et al., 2008; Ruben & Canlas, 2009;
Soni et al., 2011; Srinivas et al., 2010).