
2.5.1

Forecasting “Bakeoff”

Finally, we compare the performance of bartMachine to BayesTree (and RF). We run the bake-off on nine regression data sets and assess out-of-fold RMSE using 10-fold cross-validation, averaging the results across 20 replications of cross-validation. The results are displayed in Table 2.2.
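The evaluation protocol described above can be sketched in a few lines. This is a minimal illustration in Python, not the code used for the bake-off: the one-predictor least-squares learner `fit_line` is a stand-in for any regression method, the data are synthetic, and pooling squared errors over all folds before taking the square root is an assumption about how out-of-fold RMSE is defined.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a stand-in for any regression learner)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def out_of_fold_rmse(xs, ys, k, rng):
    """One pass of k-fold CV: each point is predicted by a model that never saw it."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all points
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - predict(xs[i])) ** 2 for i in fold)
    # Assumption: errors are pooled over folds before taking the root.
    return math.sqrt(sq_err / len(xs))


def replicated_cv_rmse(xs, ys, k=10, reps=20, seed=0):
    """Average the out-of-fold RMSE over several independent fold assignments."""
    rng = random.Random(seed)
    scores = [out_of_fold_rmse(xs, ys, k, rng) for _ in range(reps)]
    return sum(scores) / reps


# Synthetic data (illustrative only): linear signal plus unit-variance noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + data_rng.gauss(0, 1) for x in xs]
avg_rmse = replicated_cv_rmse(xs, ys, k=10, reps=20)
print(f"average out-of-fold RMSE: {avg_rmse:.3f}")
```

Replicating the whole procedure with fresh fold assignments reduces the dependence of the reported RMSE on any single random partition of the data.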

              bartMachine   BayesTree        RF
boston              4.451       4.503     4.582
triazine            0.128*      0.130     0.119
ozone               4.147       4.144     4.064
baseball          709.197     709.437   729.188
wine.red            0.656       0.651*    0.642
ankara              1.348*      1.461     1.574
wine.white          0.759*      0.766     0.746
pole               11.713*     12.755    10.691
compactiv           3.262       2.795*    2.957

Table 2.2: RMSE values for three machine learning algorithms averaged across 20 replicates. Asterisks indicate a significant difference between bartMachine and BayesTree at a significance level of 5% with a Bonferroni correction. Comparisons

We conclude that the implementation outlined in this paper performs approximately the same as the previous implementation with regard to predictive accuracy.
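The Bonferroni correction mentioned in the caption of Table 2.2 amounts to dividing the 5% level by the number of comparisons, here one per data set. The sketch below shows only that arithmetic. The function name `bonferroni_flags` and the p-values are hypothetical illustrations, not results from this chapter, and the underlying test statistic is not specified here.

```python
def bonferroni_flags(p_values, family_alpha=0.05):
    """Return the per-comparison threshold and which p-values fall below it."""
    m = len(p_values)
    per_test_alpha = family_alpha / m
    return per_test_alpha, [p < per_test_alpha for p in p_values]


# Hypothetical p-values, one per data set; NOT taken from the bake-off.
example_p = [0.20, 0.001, 0.60, 0.90, 0.004, 0.0001, 0.003, 0.0002, 0.03]
per_test_alpha, flags = bonferroni_flags(example_p)
print(f"per-comparison threshold: {per_test_alpha:.4f}")
print(flags)
```

With nine comparisons the per-comparison threshold is 0.05 / 9, roughly 0.0056, so a p-value such as 0.03 that would pass an uncorrected 5% test is not flagged.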

Table 2.3 shows the average run-time for each algorithm. Note that bartMachine is run using 4 cores.

              bartMachine   BayesTree      RF
boston                7.8        28.5     5.1
triazine              5.7        10.7     2.6
ozone                 4.7        17.6     2.1
baseball              5.6        18.6     3.3
wine.red             13.5        51.1    10.6
ankara               12.8        27.0    10.9
wine.white           13.5        56.0    11.0
pole                 18.2         7.0    12.0
compactiv            16.3        18.4    19.2

Table 2.3: Average run-times (in seconds) for each complete k-fold estimation for three machine learning algorithms.
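The run-times in Table 2.3 are easier to compare as ratios. The short sketch below computes the BayesTree-to-bartMachine ratio for each data set; the numbers are copied from the table, while the helper name `speedup` is an illustrative choice.

```python
# Average run-times in seconds, copied from Table 2.3.
run_times = {
    "boston":     {"bartMachine": 7.8,  "BayesTree": 28.5, "RF": 5.1},
    "triazine":   {"bartMachine": 5.7,  "BayesTree": 10.7, "RF": 2.6},
    "ozone":      {"bartMachine": 4.7,  "BayesTree": 17.6, "RF": 2.1},
    "baseball":   {"bartMachine": 5.6,  "BayesTree": 18.6, "RF": 3.3},
    "wine.red":   {"bartMachine": 13.5, "BayesTree": 51.1, "RF": 10.6},
    "ankara":     {"bartMachine": 12.8, "BayesTree": 27.0, "RF": 10.9},
    "wine.white": {"bartMachine": 13.5, "BayesTree": 56.0, "RF": 11.0},
    "pole":       {"bartMachine": 18.2, "BayesTree": 7.0,  "RF": 12.0},
    "compactiv":  {"bartMachine": 16.3, "BayesTree": 18.4, "RF": 19.2},
}


def speedup(baseline, candidate, times=run_times):
    """Ratio baseline / candidate per data set; above 1 means candidate is faster."""
    return {name: row[baseline] / row[candidate] for name, row in times.items()}


ratios = speedup("BayesTree", "bartMachine")
faster_on = sorted(name for name, r in ratios.items() if r > 1.0)

for name, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:<11s} {r:5.2f}x")
```

By this measure bartMachine is faster than BayesTree on eight of the nine data sets, with pole the exception. Because bartMachine was run on 4 cores, these ratios describe that configuration rather than a per-core comparison.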

2.5.2

Discussion

This article introduced bartMachine, a new R package which implements Bayesian additive regression trees. The goal of this package is to provide a fast, extensive and user-friendly implementation accessible to a wide range of data analysts, and increase the visibility of BART to a broader statistical audience. We hope we have provided organized, well-documented open-source code and we encourage the community to make innovations on this package.

Citation

Kapelner, A. and Bleich, J. (2015). bartMachine: Machine learning with Bayesian Additive Regression Trees. Journal of Statistical Software, forthcoming.

3

Variable Selection for BART

Abstract

We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. BART provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine.

3.1

Introduction

An important statistical problem in many application areas is variable selection: identifying the subset of covariates that exert influence on a response variable. We consider the general framework where we have a continuous response variable y and a large set of predictor variables x1, ..., xK. We focus on variable selection in the sparse setting: only a relatively small subset of those predictor variables truly influences the response variable.

One such example of a sparse setting is the motivating application for this paper: inferring the gene regulatory network in budding yeast (Saccharomyces cerevisiae). In this application, we have a collection of approximately 40 transcription factor proteins (TFs) that act to regulate cellular processes in yeast by promoting or repressing transcription of specific genes. It is unknown which of the genes in our yeast data are regulated by each of the transcription factors. Therefore, the goal of the analysis is to discover the corresponding network of gene-TF relationships, which is known as a gene regulatory network. Each gene, however, is regulated by only a small subset of the TFs, which makes this application a sparse setting for variable selection. The available data consist of gene expression measures for approximately 6000 genes in yeast across several hundred experiments, as well as expression measures for each of the approximately 40 transcription factors in those experiments (Jensen et al., 2007).

This gene regulatory network was previously studied in Jensen et al. (2007) with a focus on modeling the relationship between genes and transcription factors. The authors considered a Bayesian linear hierarchical model with first-order interactions. In high-dimensional data sets, specifying even first-order pairwise interactions can substantially increase the complexity of the model. Additionally, given the elaborate nature of biological processes, there may be interest in exploring nonlinear relationships as well as higher-order interaction terms. In such cases, it may not be possible for the researcher to specify these terms in a linear model a priori. Indeed, Jensen et al. (2007) acknowledge the potential utility of such additions, but highlight the practical difficulties associated with the size of the resulting parameter space. Thus, we propose a variable selection procedure that relies on BART. BART dynamically estimates a model from the data, thereby allowing the researcher to potentially identify genetic regulatory networks without the need to specify higher-order interaction terms or nonlinearities ahead of time.

Additionally, we have data from chromatin immunoprecipitation (ChIP) binding experiments (Lee et al., 2002). Such experiments use antibodies to isolate specific DNA sequences which are bound by a TF. This information can be used to discover potential binding locations for particular transcription factors within the genome. The ChIP data can be considered “prior information” that one may wish to make use of when investigating gene regulatory networks. Given the Bayesian nature of our approach, we propose a straightforward modification to BART which incorporates such prior information into our variable selection procedure.

In Section 3.2, we review some common techniques for variable selection. We emphasize the limitations of approaches relying on linear models and highlight variable selection via tree-based techniques. Section 3.3 focuses on modifying BART for variable selection. In Sections 3.3.1 and 3.3.2, we introduce how BART computes variable inclusion proportions and explore the properties of these proportions. In Section 3.3.3, we develop our permutation-based procedure for variable selection. In Section 3.3.4, we extend the BART procedure to incorporate prior information about predictor variable importance. In Section 3.4, we compare our methodology to alternative variable selection approaches in both linear and nonlinear simulated data settings. In Section 3.5, we apply our BART-based variable selection procedure to the discovery of gene regulatory networks in budding yeast. Section 3.6 concludes with a brief discussion. As noted in Chapter 2, our variable selection procedures as well as the ability to incorporate informed prior information are readily available features in the bartMachine package. Code demonstrations are shown in Sections 2.3.9-2.3.10.
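The general idea of permutation-based selection mentioned in the abstract can be sketched before the details are developed. The Python illustration below is not the procedure of this chapter and does not use bartMachine: the importance score is the absolute Pearson correlation, a deliberately simple stand-in for BART variable inclusion proportions, and the single threshold taken from the null distribution of the maximum score is an assumed choice. The data are simulated in a sparse setting where only the first two of ten predictors influence the response.

```python
import math
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation (stand-in for a BART inclusion proportion)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / math.sqrt(sxx * syy)


def importance_scores(columns, y):
    """One importance score per predictor column."""
    return [abs_corr(col, y) for col in columns]


def permutation_select(columns, y, n_permutations=200, alpha=0.05, seed=0):
    """Keep predictors whose score exceeds a permutation-null threshold."""
    rng = random.Random(seed)
    observed = importance_scores(columns, y)
    null_maxima = []
    for _ in range(n_permutations):
        y_perm = y[:]
        rng.shuffle(y_perm)  # permuting y breaks any link to the predictors
        null_maxima.append(max(importance_scores(columns, y_perm)))
    null_maxima.sort()
    # Assumed rule: the (1 - alpha) quantile of the largest null score.
    cut = min(n_permutations - 1, math.ceil((1 - alpha) * n_permutations) - 1)
    threshold = null_maxima[cut]
    selected = [j for j, s in enumerate(observed) if s > threshold]
    return selected, threshold


# Simulated sparse setting: K = 10 predictors, only x0 and x1 affect y.
rng = random.Random(42)
n, K = 200, 10
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(K)]
y = [2.0 * columns[0][i] - 1.5 * columns[1][i] + rng.gauss(0, 1) for i in range(n)]

selected, threshold = permutation_select(columns, y)
print("selected predictors:", selected, "threshold:", round(threshold, 3))
```

A correlation score can only detect linear marginal effects, which is the limitation that motivates replacing it with a BART-based importance measure when the relationships may be nonlinear or involve interactions.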