3.1 Motivating Problem
3.1.1 Real World Data Example
The methods we explore are designed to identify association between sets of response and explanatory variables when the data under consideration comes from multiple data sources or when a nontrivial amount of inconsistency has been introduced into the data. Specifically, we consider cases where integration of the data is so complicated and contradictory that the only reliable information that can be deduced is binary, in the sense that either a response occurred or it did not occur. Furthermore, determining relationships within this inconsistent data is so difficult that traditional methods of analysis fail to produce substantial results. A real world example of data that meets the above criteria is the Environmental Protection Agency’s (EPA) ToxCast and Toxicity Reference Database (ToxRefDB) data programs. These datasets share a common set of potentially toxic chemicals: ToxRefDB contains a multitude of animal study endpoints based upon exposure to these chemicals, and ToxCast contains bioassay responses to those same chemicals (Dix et al., 2007; EPA, 2010a; Judson et al., 2009; Martin et al., 2009a; Knudsen et al., 2009; Martin et al., 2009b; EPA, 2010b). The EPA would like to integrate these data to determine, based primarily upon ToxCast bioassay responses, which chemicals will cause the maladies seen in the animal endpoint studies (Dix et al., 2007; EPA, 2010a; Judson et al., 2009). Their reasoning is described below, along with a more detailed explanation of the data and how it qualifies as a real world example for our methods.
The efficient testing of chemicals for possible human health effects is a continually growing challenge, with over 83,000 chemicals currently within the Toxic Substances Control Act inventory and over 30,000 in widespread use (Agency, 2011; Judson et al., 2009). Complicating the demand for increased screening of chemicals having industrial and agricultural importance is the fact that the current process for testing chemicals is extremely complex, expensive and time intensive, with heavy reliance on animal studies that take 2-3 years and millions of dollars to complete (Judson et al., 2009). With the majority of such chemicals having undergone little to no safety testing, there is a significant need to develop complementary or alternative approaches that help prioritize the chemicals to be tested and identify the types of tests that will be most informative in the regulatory decision making process.
To help understand and address these challenges, the EPA established the ToxCast program, a high-throughput screening (HTS) effort focused on the development of methods for accurate and cost-effective chemical screening and prioritization (Dix et al., 2007; EPA, 2010a). The initial phase of the ToxCast program consisted of testing 309 unique chemicals against a panel of over 650 toxicity-relevant assays. While the chemicals chosen for this first effort consist largely of food-use pesticide active components, the assays vary greatly in the type of technology used, the target measured, and the biological context in which the assay is performed. While still in the early stages, programs such as ToxCast and Tox21 are expected to provide the methodological foundations for future sustainable efforts in chemical screening (Dix et al., 2007; EPA, 2010a; Kavlock et al., 2009).
Although they provide a wealth of data across a broad spectrum of chemicals, high-throughput approaches such as those used in ToxCast present their own challenges with regard to data integration and downstream interpretation. There is a great deal of variation in the types of assays used for screening, with associated variation in the levels of quality, sensitivity and specificity. Furthermore, tests are performed in cells or tissues of a number of different species, including rat, mouse and human. As our understanding of the mechanisms of toxicity for different chemicals is far from complete, methods that can use such data to help establish more integrated pictures of the linkages between chemicals, biomolecular players and disease endpoints are of significant value. Specifically, the ToxCast data provide an ideal dataset to demonstrate the utility of our methods for integrating inconsistent/noisy datasets in a response and explanatory variable framework, where the data is subset to consider only the strongest relationships as an alternative to considering the entire data record.
Preliminary investigations by both Reif et al. (2010), using the ToxPi measure, and DiMaggio et al. (2010), with their biclustering and logistic regression framework, indicate that the data collected from the first round of the ToxCast program is quite sparse, with highly variable amounts of inconsistency within the bioassay data. This is demonstrated by the modest subset of the data that is used in both methods and by the small number of important results that are reported, as discussed in greater detail below. The ToxPi measure presented by Reif et al. does integrate several sources of data into a measure that ranks a chemical’s toxicity with a score (Reif et al., 2010). The difficulty with ToxPi is that it does not indicate which results are statistically significant with regard to chemical toxicity. Moreover, the paper provides scant evidence that the top ranking chemicals are toxic and that the bottom ranking chemicals are benign (Reif et al., 2010). The methodology of DiMaggio et al. indicates the minimal set of bioassays that maximizes the separation between 8 liver and 10 reproductive animal endpoints (DiMaggio et al., 2010). Their framework helps determine which bioassays can be uniquely associated with either liver or reproductive animal endpoints. Yet their results fail to provide any goodness-of-fit measures for the logistic regression models that were used to determine the association between animal endpoint and bioassay. Furthermore, they are only able to determine a unique association between animal endpoint and bioassay for 18 of the over 300 active animal endpoints (DiMaggio et al., 2010). Similarly, the ToxPi measure uses only 90 of the over 650 bioassays to create its integrated measure of chemical toxicity (Reif et al., 2010).
The primary goal of this work is to use our methods to account for the underlying inconsistency/noise within the ToxCast data by focusing the analyses on subsets of the data. We assume that, because of this underlying inconsistency, the desired associations are most prominent within subsets of the data. We integrate the ToxCast and ToxRefDB data to identify associations between the explanatory variables (ToxCast bioassays) and the response variables (ToxRefDB animal endpoints) with regard to identifying chemical toxicity. Our methods provide statistical measures that quantify the strength of association for the subsets of data, together with statistical significance that accounts for multiple hypothesis testing. Additionally, our methods allow some approximation/fuzziness to be incorporated into the results in an intuitive manner.
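To make the binary response and explanatory variable setting concrete, the following is a minimal sketch of how one might test a single bioassay against a single animal endpoint and then adjust for multiple comparisons. It is an illustration under stated assumptions and not the method developed in this work: it substitutes a one-sided Fisher exact test and a Bonferroni adjustment as familiar stand-ins, and the function names (`contingency`, `fisher_one_sided`, `bonferroni`) and the toy vectors are hypothetical.

```python
from math import comb

def contingency(assay_hits, endpoint_hits):
    """Return 2x2 counts (a, b, c, d) for two equal-length binary sequences.

    a: assay active and endpoint observed; b: assay active only;
    c: endpoint observed only; d: neither.
    """
    pairs = list(zip(assay_hits, endpoint_hits))
    a = sum(1 for x, y in pairs if x and y)
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    d = sum(1 for x, y in pairs if not x and not y)
    return a, b, c, d

def fisher_one_sided(a, b, c, d):
    """P(at least `a` co-occurrences) under independence with fixed margins."""
    n = a + b + c + d
    row1 = a + b  # chemicals for which the assay is active
    col1 = a + c  # chemicals for which the endpoint is observed
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

def bonferroni(p_values):
    """Bonferroni adjustment (assumed here; any multiplicity correction could be used)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy data (hypothetical): one entry per chemical, 1 = response occurred.
assay = [1, 1, 1, 0, 0, 0]
endpoint = [1, 1, 1, 0, 0, 0]
p = fisher_one_sided(*contingency(assay, endpoint))
adjusted = bonferroni([p, 0.5])
```

In this sketch each chemical contributes one binary pair, so the test uses the entire data record for a given assay and endpoint; restricting attention to subsets of chemicals, as motivated above, would change which pairs enter the table.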