We develop a regression tree using the Boston Housing Price dataset, which reports the median value of owner-occupied homes in about 500 U.S. census tracts in the Boston area, together with several variables that might help to explain the variation in median value across tracts. For ease of reference, definitions of the variables in the BOSTON.CSV data (included with your installation sample data) are given below.
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centers
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PT pupil-teacher ratio by town
LSTAT % population of lower status
MV Median value of owner-occupied homes in $1000's
After you open a data set, setting up a CART regression analysis entails several logical steps, all carried out in one of the Model Setup dialog tabs available after clicking on the [Model…] button in the Activity Window.
Model selects target and predictor variables, specifies categorical predictors and weight variables, chooses tree type (regression), specifies auxiliary variables
Categorical sets up categorical class names
Force Split specifies splitter for root node and its children
Constraints specifies structural constraints on a tree
Testing selects a testing or self-validation method
Select Cases selects a subset of original data
Best Tree defines the best tree-selection method
Method selects a splitting rule
Penalty sets penalties on variables, missing values, and high-level categorical predictors
Advanced specifies other model-building options
Battery specifies batteries of automated runs
Regression tree models differ from classification trees in both model setup and resulting output. The key differences are:
♦ Certain Model Setup dialog tabs are grayed when you select the regression tree type in the Model dialog. These include the Costs and Priors tabs that provide powerful means of control over classification trees.
♦ Least Squares (default setting) and Least Absolute Deviation are the only splitting rules available. Even though classification splitting rules are not grayed out, the actual setting is ignored in all regression runs.
♦ Gains charts, misclassification tables and prediction success tables are no longer displayed in the Tree Summary Reports because they are not applicable.
♦ The Mean (or within-node average) of the target variable is reported for each node (rather than a class assignment) and node distributions are displayed as box plots (rather than as bar/pie graphs).
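To make the two regression splitting rules concrete, here is a minimal Python sketch of how a single split on one continuous predictor could be scored under each rule. It assumes least squares measures node impurity as the sum of squared deviations from the node mean, and least absolute deviation as the sum of absolute deviations from the node median. The function names and the toy data are hypothetical, and CART's own improvement measure may be scaled differently.

```python
from statistics import mean, median

def ls_impurity(values):
    """Sum of squared deviations from the node mean (least squares)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def lad_impurity(values):
    """Sum of absolute deviations from the node median (least absolute deviation)."""
    m = median(values)
    return sum(abs(v - m) for v in values)

def best_split(x, y, impurity):
    """Return (threshold, improvement) for the best split 'x <= threshold'."""
    pairs = sorted(zip(x, y))
    parent = impurity([t for _, t in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        gain = parent - impurity(left) - impurity(right)
        if gain > best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, gain)
    return best

# Toy data, made up for illustration (not taken from BOSTON.CSV)
rooms = [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
value = [12.0, 14.0, 15.0, 20.0, 22.0, 33.0, 35.0, 50.0]

print("least squares:", best_split(rooms, value, ls_impurity))
print("least absolute deviation:", best_split(rooms, value, lad_impurity))
```

Because the absolute-deviation criterion is built on medians rather than means, it is less sensitive to a few extreme target values, which is the usual reason for preferring it on data with outliers.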
The only required step for growing a regression tree is to specify a target variable and a tree type in the Model Setup—Model tab.
If the other Model Setup dialog tabs are left unchanged, the following defaults are used:
♦ All remaining variables in the data set other than the target will be used as predictors (the Model tab)
♦ No weights will be applied (the Model tab)
♦ 10-fold cross validation will be used for testing (the Testing tab)
♦ The minimum cost tree will become the best tree (the Best Tree tab)
♦ Only five surrogates will be tracked and they will all count equally in the variable importance formula (the Best Tree tab)
♦ The least squares splitting criterion for regression trees will be used (the Method tab)
♦ No penalties will be applied (the Penalty tab)
♦ Parent node requirements will be set to 10 and child node requirements set to 1 (the Advanced tab)
♦ Allowed sample size will be set to the currently-open data set size (the Advanced tab)
♦ The 3000 limit warning for cross validation will be activated
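The 10-fold cross validation default listed above can be pictured with a short Python sketch. It only shows how 506 cases might be divided into ten nearly equal folds, each held out once while a tree is grown on the other nine. The function name, the shuffling, and the seed are assumptions made for illustration; the source does not describe how CART assigns cases to folds.

```python
import random

def kfold_indices(n_cases, n_folds=10, seed=0):
    """Shuffle case indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]

n_cases = 506  # number of observations in BOSTON.CSV
folds = kfold_indices(n_cases)

for k, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(n_cases) if i not in held_out]
    # A tree would be grown on train_idx and scored on test_idx here.
    print(f"fold {k}: train={len(train_idx)} test={len(test_idx)}")
```

Every case is used for testing exactly once, so the cross-validated error is computed over all cases even though no separate test file is supplied.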
With respect to the command line, CART determines which tree to grow (classification or regression) depending on whether the target appears in the CATEGORY command. A classification tree is built for categorical targets and a regression tree for continuous targets.
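The decision rule in the paragraph above amounts to a membership test. The following Python sketch mimics it; the function name and the case-insensitive comparison are assumptions for illustration and are not part of the CART command language.

```python
def tree_type(target, categorical_vars):
    """Pick the tree type from whether the target was declared categorical."""
    declared = {name.upper() for name in categorical_vars}  # assumed case-insensitive
    return "classification" if target.upper() in declared else "regression"

# MV is continuous and not declared categorical, so a regression tree is grown.
print(tree_type("MV", ["CHAS"]))
# A target that is declared categorical would get a classification tree.
print(tree_type("CHAS", ["CHAS"]))
```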
To illustrate the regression tree concept, we use the following steps to start the analysis:
7. Select File->Open->Data File to open the BOSTON.CSV dataset (506 observations).
8. In the Model Setup dialog, check MV as the target variable and click on the Regression Tree radio button. Check all the other variables as predictors.
9. In the Model Setup—Advanced tab, set “Parent Node Minimum Cases” to 40 and “Terminal Node Minimum Cases” to 20. This will ensure that the terminal nodes will not become too small.
10. Click [Start].
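The effect of the two node-size settings chosen on the Advanced tab can be sketched in Python as a pair of checks applied to every candidate split. This is an illustration only: the function names are hypothetical, and treating both limits as inclusive (at least 40, at least 20) is an assumption.

```python
PARENT_MIN_CASES = 40    # "Parent Node Minimum Cases" set in the steps above
TERMINAL_MIN_CASES = 20  # "Terminal Node Minimum Cases" set in the steps above

def can_split(n_node):
    """A node is eligible for splitting only if it holds enough cases."""
    return n_node >= PARENT_MIN_CASES

def split_allowed(n_left, n_right):
    """A candidate split is kept only if the parent and both children are large enough."""
    return (can_split(n_left + n_right)
            and n_left >= TERMINAL_MIN_CASES
            and n_right >= TERMINAL_MIN_CASES)

print(split_allowed(486, 20))  # root node of 506 cases, small but legal right child
print(split_allowed(30, 15))   # right child below the terminal-node minimum
print(split_allowed(20, 19))   # parent of 39 cases is too small to split
```

Raising either limit stops the tree from growing sooner and yields a smaller maximal tree.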
Tree Navigator
At the end of the model-building process, a navigator window for a regression tree will appear.
By default, CART uses the least squares splitting rule to grow the maximal tree and cross-validated error rates to select the “optimal” tree. In this example, the optimal tree is the tree with 18 terminal nodes, as displayed in the Navigator above.
The upper button in the group cycles over three possible display modes in the lower part of the Navigator Window:
Default Mode shows the relative error profile (either Test, Cross-Validated, or Learn depending on the testing method chosen in the Testing tab of the Model Setup window):
1-SE Mode shows the relative error profile where all trees with performance within one standard error of the minimal error tree are marked in green:
Node Size mode shows the node size bar chart for the currently-selected tree:
You can click on any of the bars to see the corresponding node highlighted in yellow on the tree display.
To change the currently-selected tree, go to one of the previous modes, pick a new tree, and switch back to the Node Size mode.
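The 1-SE display described above can be illustrated with a short Python sketch that, given a pruning sequence, finds the minimal-error tree and every tree whose error lies within one standard error of it. The error values below are invented for illustration (only the 18-node optimum matches this example), and picking the smallest tree inside the band is the common 1-SE convention rather than something the Navigator does automatically.

```python
def one_se_trees(trees):
    """trees: list of (n_terminal_nodes, relative_error, standard_error)."""
    best = min(trees, key=lambda t: t[1])          # minimal-error tree
    cutoff = best[1] + best[2]                     # minimum error plus one SE
    within = [t for t in trees if t[1] <= cutoff]  # trees that would be marked
    smallest = min(within, key=lambda t: t[0])     # simplest tree in the band
    return best, within, smallest

# Hypothetical pruning sequence: (terminal nodes, relative error, standard error)
sequence = [
    (2, 0.55, 0.05),
    (5, 0.34, 0.04),
    (9, 0.26, 0.03),
    (12, 0.25, 0.03),
    (18, 0.24, 0.03),
    (25, 0.29, 0.03),
]

best, within, smallest = one_se_trees(sequence)
print("minimal-error tree:", best[0], "terminal nodes")
print("within one SE:", [t[0] for t in within])
print("smallest within one SE:", smallest[0], "terminal nodes")
```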
The tree picture can be made smaller or larger by pressing the corresponding buttons in the upper-left corner of the Navigator window.
As with classification trees, to change the level of detail you see when hovering over nodes, right-click on the background of the Navigator window and select your preferred display from the local pop-up menu.
The [Learn] and [Test] group of buttons controls whether Learn or Test data partitions are used to display the node details on the hover displays or all related Tree Details windows.
Color Coding
The terminal nodes can be color coded by either target mean or median. Make your selection in the Color Code Using: selection box.
Viewing Tree Splitters and Details
The [Splitters…] button and the [Tree Details...] buttons work similarly to the classification case described previously (see Chapter 3: CART BASICS). The only difference is that node information now displays target means and variances instead of frequency tables and class assignments.
The Tree Details display can be configured using the View->Node Detail… menu.
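The per-node quantities mentioned in this section (target mean and variance in the node details, and mean or median for color coding) can be computed as in the Python sketch below. The function name and sample values are hypothetical, and using the population variance is an assumption; the source does not say which variance formula is reported.

```python
from statistics import mean, median, pvariance

def node_summary(target_values):
    """Summarize the target variable for the cases falling in one node."""
    return {
        "n": len(target_values),
        "mean": mean(target_values),          # reported instead of a class assignment
        "median": median(target_values),      # alternative basis for color coding
        "variance": pvariance(target_values)  # assumed: population variance
    }

# Made-up median home values (in $1000's) for the cases in one terminal node
node_values = [10.0, 20.0, 30.0, 40.0]
print(node_summary(node_values))
```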