♦ No weights will be applied (the Model tab)
♦ 10-fold cross validation for testing (the Testing tab)
♦ Minimum cost tree will become the best tree (the Best Tree tab)
♦ Only five surrogates will be tracked and they will all count equally in the variable importance formula (the Best Tree tab)
♦ GINI splitting criterion for classification trees and least squares for regression trees (the Method tab)
♦ Unit (equal) misclassification costs (the Costs tab)
♦ Equal priors: all classes treated as if they were equal size (the Priors tab)
♦ No penalties (the Penalty tab)
♦ Parent node requirements set to 10 and child node requirements set to 1 (the Advanced tab)
♦ Allowed sample size set to the currently-open data set size (the Advanced tab)
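To make the default GINI splitting criterion listed above more concrete, the following is a minimal Python sketch of the textbook Gini impurity of a single node. It is not CART's internal implementation: the function name gini_impurity is hypothetical, and the sketch ignores priors, costs, and case weights.

```python
from collections import Counter


def gini_impurity(labels):
    """Textbook Gini impurity, 1 - sum(p_k^2), for the class labels in one node.

    Hypothetical helper for illustration only; priors, costs and case
    weights are deliberately ignored.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# A pure node has impurity 0; an evenly mixed two-class node has impurity 0.5.
pure = gini_impurity([0, 0, 0, 0])
mixed = gini_impurity([0, 0, 1, 1])
```

A split is attractive under this criterion when the child nodes it produces are, on average, purer than the parent.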
Many other options are available to the advanced user and we invite you to explore them at your leisure in the chapters that follow. The good news about CART is that you can get started by focusing only on the essentials, deferring advanced topics.
The remainder of this section discusses the model setup process. Subsequent sections cover additional options.
The Model tab
The Model Setup—Model tab is the central location for model control—where you identify the target or dependent variables. This is the one and only task that CART requires of you. CART will not know which column of your data to try to analyze without your guidance. Once you provide that information CART is technically able to do everything else for you.
In practice you will probably also want to select the candidate predictor (independent) variables, because data sets typically contain bookkeeping columns such as ID variables that are not suitable for prediction. In some cases you may also have a weight variable. Where possible CART will automatically realize that you want to grow a classification tree. But when the target variable is numeric you do have the choice of growing a classification or regression tree and you may need to correct the selection indicated on the Model Setup dialog. This is the heart of the Model Setup dialog.
Target Variable Selection
The target variable is specified by checking off ONE variable in the target column of the Model Setup—Model tab. Locate the row with LOW as Variable Name and put a checkmark in the Target column.
After the target has been checked, the Model tab switches from red to black, indicating that CART is ready to start an analysis according to the default settings.
Specifying Tree Type
CART uses a set of Tree Type radio buttons to determine if the tree grown will be a classification tree or a regression tree. The difference between the two tree types is simple. Classification trees use a "categorical" target variable (e.g., YES/NO), while regression trees use a "continuous" target variable (such as AGE or INCOME).
The purpose of classification is to accurately discriminate between (usually a small number of) classes; the purpose of regression is to predict values that are close to a true outcome (with usually a large number or even an infinity of possible outcomes).
When the Tree Type: Classification radio button is checked, the target variable automatically will be considered categorical regardless of the Categorical check-box designation defined in Model tab. Similarly, the Regression radio button will automatically cancel the categorical status of the target variable (so long as the variable is coded as a number and not as text). In other words, the specified Tree Type determines whether a numeric target is treated as categorical or continuous, superseding any Categorical check-box designation.
Predictor Variable Selection
Candidate predictor (independent) variables are specified by check marks in the Predictor column. In this example, include the following subset of variables as predictors: AGE, RACE, SMOKE, HT, UI, FTV, PTD, and LWD, by placing checkmarks in the Predictor column against the above variables. Alternatively, hold down the <Ctrl> key to simultaneously highlight the variables with left-mouse clicks and then place a checkmark in the Select Predictors box at the bottom of the column. The Model tab will appear as follows:
If you inadvertently include a variable as a predictor, simply uncheck the corresponding box.
Note also that each of the model setup tabs contains a [Save Grove...] button in the lower left corner. This allows you to request saving of the model for future review, scoring, or export.
For command-line users, the MODEL command sets the target variable, while the KEEP command defines the predictor list. See the following command line syntax.
MODEL <depvar>
KEEP <indep_var1, indep_var2, …, indep_var#>
---
MODEL LOW
KEEP AGE, RACE, SMOKE, HT, UI, FTV, PTD, LWD
Categorical Predictors
Put checkmarks in the Categorical column against those predictors that should be treated as categorical. For our example, specify RACE, UI, and FTV as categorical predictor variables. Alternatively, as for predictor variables, hold down the <Ctrl> key to simultaneously highlight the variables with left-mouse clicks and then place a checkmark in the Select Categorical box at the bottom of the column.
When the Tree Type: Classification radio button is checked, the target variable will be automatically defined as categorical and appear with the corresponding checkmark at later invocations of the Model Setup. Similarly, the Regression radio button will automatically cancel the categorical status of the target variable. In other words, the specified Tree Type determines whether the target is treated as categorical or continuous.
Annotation On Categorical Variables
Categorical targets and predictors are those that take on a conceptually finite set of discrete values, for example, data naturally in text form (e.g., "Male," "Female"). You may declare any variable categorical but you should do so only when this is sensible.
It should be noted that CART 6 supports "high-level categoricals" through its proprietary algorithms that quickly determine effective splits in spite of the daunting combinatorics of many-valued predictors. This feature was introduced in CART 4 and is increasingly important considering CART 6's character predictors, which in "real world" datasets often have hundreds or even thousands of levels. When forming a categorical splitter, traditional CART searches all possible combinations of levels, an approach in which time increases geometrically with the number of levels.
In contrast, the running time of CART's high-level categorical algorithm increases linearly with the number of levels, yet the algorithm yields the optimal split in most situations. See the section below titled "High-Level Categorical Predictors" for additional details.
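The combinatorics mentioned above can be made concrete with a short Python sketch. It relies on the standard counting result that k levels can be divided into two non-empty groups in 2^(k-1) - 1 distinct ways. The function name n_binary_partitions is hypothetical, and the sketch says nothing about how CART's proprietary algorithm avoids the exhaustive search.

```python
def n_binary_partitions(k):
    """Number of distinct left/right groupings of k category levels.

    Each level goes left or right (2^k assignments); mirror images are the
    same split (divide by 2) and the all-on-one-side case is not a split
    (subtract 1), giving 2^(k-1) - 1.
    """
    if k < 2:
        return 0
    return 2 ** (k - 1) - 1


# The count roughly doubles with every additional level.
for k in (3, 10, 50):
    print(k, n_binary_partitions(k))
```

Even at a few dozen levels the count is far too large to enumerate, which is why an exhaustive search over level combinations becomes impractical.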
Character Variable Caveats
Character variables are implicitly treated as categorical (discrete), so there is no need to "declare" them categorical. CART 6 has no internal limit on the length of character data values (strings). You are limited in this respect only by the data format you choose (e.g., SAS, text, Excel, etc.).
Character variables (marked by “$” at the end of variable name) will always be treated as categorical and cannot be unchecked.
Occasionally columns stored in an Excel spreadsheet will be tagged as "Character" even though the values in the column are intended to be numeric. If this occurs with your data refer to the READING DATA section to remedy this problem.
Categorical vs. Continuous Predictors
Depending on whether a variable is declared as continuous or categorical, CART will search for different types of splits. Each takes on a unique form.
Continuous Split Form
Continuous splits will always use the following form.
A case goes left if
[split-variable] <= [split-value]
A node is partitioned into two children such that the left child receives all the cases with the lower values of the [split-variable].
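The continuous split rule above can be sketched in a few lines of Python. This is an illustration of the rule only, not CART code: the function name split_continuous and the AGE values are invented for the example, and missing values (which CART handles through surrogates) are not considered.

```python
def split_continuous(cases, variable, value):
    """Partition cases by the rule: go left if variable <= value, else right.

    Hypothetical helper; each case is a dict of variable name -> number.
    """
    left = [case for case in cases if case[variable] <= value]
    right = [case for case in cases if case[variable] > value]
    return left, right


# Illustrative data only: three cases split on AGE <= 25.
cases = [{"AGE": 19}, {"AGE": 25}, {"AGE": 31}]
left, right = split_continuous(cases, "AGE", 25)
```

Note that a case whose value equals the split value goes left, because the comparison is "less than or equal to".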
Categorical Split Form
Categorical splits will always use the following form.
A case goes left if
[split-variable] = [level_i OR … level_j OR … level_k]
In other words, we simply list the values of the splitter that go left (and all other values go right).
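The categorical split rule can be sketched the same way in Python. Again this only illustrates the form of the rule: the function name split_categorical and the RACE codes are invented for the example and are not taken from CART or from the data set.

```python
def split_categorical(cases, variable, left_levels):
    """Partition cases by the rule: go left if variable is one of left_levels.

    Hypothetical helper; every level not listed in left_levels goes right.
    """
    left_levels = set(left_levels)
    left = [case for case in cases if case[variable] in left_levels]
    right = [case for case in cases if case[variable] not in left_levels]
    return left, right


# Illustrative data only: levels 1 and 3 go left, everything else goes right.
cases = [{"RACE": 1}, {"RACE": 2}, {"RACE": 3}, {"RACE": 2}]
left, right = split_categorical(cases, "RACE", [1, 3])
```

Because the levels sent left need not be adjacent codes, a categorical split can group levels in ways that a single threshold on a numeric coding cannot.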
If a categorical variable with many levels is coded as a number it may actually be helpful to treat it as a continuous variable. This is discussed further in a later chapter.
One should exercise caution when declaring continuous variables as categorical because a large number of distinct levels may result in significant increases in running times and memory consumption.
Any categorical predictor with a large number of levels can create problems for the model. While there is no hard and fast rule, once a categorical predictor exceeds about 50 levels there are likely to be compelling reasons to try to combine levels until it meets this limit. We show how CART can conveniently do this for you later in the manual.
For command-line users, categorical variables are defined using the CATEGORY command. See the following command line syntax.
CATEGORY <cat_var1, cat_var2, …, cat_var#>
---
CATEGORY LOW, RACE, SMOKE, UI
Case Weights
In addition to selecting target and predictor variables, the Model tab allows you to specify a case-weighting variable.
Case weights, which are stored in a variable on the dataset, typically vary from observation to observation. An observation’s case weight can, in some sense, be thought of as a repetition factor. A missing, negative or zero case weight causes the observation to be deleted, just as if the target variable were missing. Case weights may take on fractional values (e.g., 1.5, 27.75, 0.529, 13.001) or whole numbers (e.g., 1, 2, 10, 100).
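The "repetition factor" reading of case weights, together with the rule that missing, negative, or zero weights remove an observation, can be sketched in Python as follows. This is an illustration under stated assumptions rather than CART's implementation: the function name weighted_class_totals and the sample rows are invented, and a missing weight is represented as None or NaN.

```python
import math


def weighted_class_totals(rows):
    """Sum case weights per target class, skipping unusable weights.

    rows is an iterable of (target, weight) pairs. An observation with a
    missing (None or NaN), negative, or zero weight is dropped, mirroring
    the rule described in the text. Hypothetical helper for illustration.
    """
    totals = {}
    for target, weight in rows:
        if weight is None or math.isnan(weight) or weight <= 0:
            continue
        totals[target] = totals.get(target, 0.0) + weight
    return totals


# Illustrative data only: three of the six rows carry unusable weights.
rows = [(0, 1.0), (1, 2.5), (1, 0.0), (0, None), (1, -3.0), (0, 0.5)]
totals = weighted_class_totals(rows)
```

Here class 0 ends up with a total weight of 1.5 and class 1 with 2.5, as if the retained observations had been repeated that many times.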
To select a variable as the case weight, simply put a checkmark against that variable in the Weight column.
Case weights do not affect linear combinations in CART-SE, but are otherwise used throughout CART. CART-Pro and ProEX include a new linear combination facility that does recognize case weights.
If you are using a test sample contained in a separate dataset, the case weight variable must exist and have the same name in that dataset as in your main (learn sample) dataset.
For command line users, the variable containing observation case weights is specified with the WEIGHT command, which is issued after the USE command and before the BUILD command. See the following command line syntax:
WEIGHT <wgtvar>
Auxiliary Variables
Auxiliary variables are variables that are tracked throughout the CART tree but are not necessarily used as predictors. By marking a variable as Auxiliary you indicate that you want to be able to retrieve basic summary statistics for such variables in any node in the CART tree. In our modeling run based on the HOSLEM.CSV data, we mark AGE, SMOKE and BWT as auxiliary.
Later in this chapter, in the section titled "Viewing Auxiliary Variable Information," we discuss how to view auxiliary variable distributions on a node-by-node basis.
Command-line users will use the following command syntax to specify auxiliary variables.
AUXILIARY <auxvar1>, <auxvar2>, … etc.
---
AUXILIARY AGE, SMOKE, BWT
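Conceptually, tracking an auxiliary variable means summarizing it within each node without using it to split. The Python sketch below illustrates that idea with per-node means. It is not how CART computes or reports these statistics: the function name auxiliary_means, the node labels, the node-assignment rule, and the data values are all invented for the example.

```python
from statistics import mean


def auxiliary_means(cases, node_of, aux_vars):
    """Mean of each auxiliary variable within each node.

    node_of is a function mapping a case (a dict) to a node label.
    Hypothetical helper for illustration.
    """
    by_node = {}
    for case in cases:
        by_node.setdefault(node_of(case), []).append(case)
    return {
        node: {var: mean(case[var] for case in members) for var in aux_vars}
        for node, members in by_node.items()
    }


# Illustrative data only; the two "nodes" are defined by an invented rule.
cases = [
    {"AGE": 20, "SMOKE": 0, "BWT": 3000},
    {"AGE": 30, "SMOKE": 0, "BWT": 3200},
    {"AGE": 25, "SMOKE": 1, "BWT": 2500},
]
stats = auxiliary_means(
    cases,
    node_of=lambda case: "left" if case["SMOKE"] == 0 else "right",
    aux_vars=["AGE", "BWT"],
)
```

The same grouping could report other summaries (counts, minima, maxima) in place of the mean.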
Setting Focus Class
In classification runs some of the reports generated by CART (gains, prediction success, color-coding, etc.) have one target class in focus. By default, CART will put the first class it finds in the dataset in focus. A user can override this by pressing the [Set Focus Class…] button.
Sorting Variable List
The variable list can be sorted either in physical order or alphabetically by changing the Sort: control box. Which mode is preferable depends on the dataset; being able to switch between them is especially helpful when dealing with large variable lists.