PHARMACEUTICAL INDUSTRY
2.1.5 The Available Information in Medical Data
Capturing massive amounts of data from patients can help us get to the new chemical entity that represents a trial candidate, as well as understand its action during trials. Such scientific data that are currently being generated and are potentially relevant to a patient in clinical trials and the patient in the physician's office are often called translational science or translational research. Medical imaging alone is or soon will be producing many petabytes (1 petabyte = 8 × 10^15 bits) worldwide, with new imaging modalities pumping out as much as 13 GB/s per device (though this is significantly reduced by considering the resolution required on a local and on-demand basis). Other sources of biomedical information, from genomics and proteomics, and including human expertise and information in the form of medical text, add significantly to the still considerable load. All this can, in principle, be stored and transmitted (and the latter may be much more problematic because of bandwidth issues). Note that artificial storage is not so efficient: DNA could store at approximately 1 bit/nm^3, while existing routine storage media require 10^12 nm^3 to store 1 bit.
However, the universe has allocated the human race a lot more space to play with for artificial storage (say, soon some 10^17 bits) than has been allowed to the tiny living cell (10^10–10^13 bits). The trick is in using this artificial data. It is still 10^17 bits that have to be sifted for relevance. It is not information “in the hand” but rather more like the virtual reservoir that evolution has tapped in its trial-and-error process. The above numbers and information rates suggest that normal processing that relies on trial and error to sift the data would demand some 8 billion years. Clearly, a strategy is required, and one that leaves little room for fundamental error, which would collapse some of it back to a trial-and-error basis or, worse, point us in wrong directions.
An emerging dilemma for the physician reflects that for the drug researcher.
In fact, we are fast approaching an age when the physician will work hand in hand with the pharmaceutical companies, every patient a source of information in a global cohort, that information being traded in turn for patient and physician access to the growing stockpile of collective wisdom of best diagnoses and therapies. But in an uncertain world that often seems to make the role of the physician as much an art as a science, physicists are not surprised that medical inference has always been inherently probabilistic. The patient is a very complex open system of some 10^28 atoms interacting with a potentially accessible environment of perhaps 10^35–10^43 atoms. Just within each human, then, there are roughly 10^15 atoms, mostly behaving unpredictably, for every bit of information that we considered relevant above. There are many hidden variables, most of which will be inaccessible to the physician for the long foreseeable future. Balanced against this, the homeostatic nature of living organisms has meant that they show fairly predictable patterns in time and space. Thus so far, there have been relatively rigid guidelines of best practice and contraindications based on the notion of the more-or-less average patient, even if their practical application on a case-by-case basis still taxes daily the physician's art.
Ironically, however, while the rise of genomics and proteomics substantially increases the number of medical clues or biomarkers relevant to each patient, and so provides massive amounts of personal medical data at the molecular level, it brings the physician and researcher closer to the atomic world of uncertainty and underlying complexity. It demands a probabilistic approach to medical inference beyond the medical textbooks, notably since the development of disease and the prognosis of the patient based on the molecular data are often inherently uncertain. Importantly, the high dimensionality of the data includes many relevant features, but also variations and abnormal features that may be harmless or otherwise irrelevant. They are poised to overwhelm the physician, increase the number of tests, and escalate clinical costs, thus imminently threatening rather than aiding the healthcare system [14].
2.1.6 The Information Flow
The useful information that is available in biomedicine is best understood as a flow, which is pretty much the same in any discipline. We shall define useful information as that which leads to an actionable decision with a required beneficial outcome.
Data → structured data → rules → inference → decisions → beneficial outcome
In an overview of what follows, a brief comment may be made on this sequence.
Data (or “raw data”) should necessarily be qualified as accessible data and should ideally be in a structured data form suitable for analysis. In business and industry generally, roughly 95% of it is not. In medicine, medical text and medical images well exemplify unstructured forms. Explicitly or perhaps implicitly in an analysis procedure, conversion to at least a transient structured form is required.
This structured form is then transformed into a set of elemental statements about associations and correlations, indicated above as rules, which express the content in a succinct way suitable for inference. However, classically, the rules step has been represented by statistical analysis, with inference and decision making left to the human expert based on the results. Numerous tools have of course been developed to analyze data, and these obviously remain of interest. Probability theory [23] underlies classical statistics [24–26]. Of particular interest here, because of the high dimensionality of clinical data combined with genomic and proteomic data, is multivariate analysis [27–32]. Dimensional reduction techniques such as multidimensional scaling [33] and principal coordinate analysis are essentially clustering (and by implication dendrogram or “tree”) methods that reveal useful patterns in data in fewer dimensions while preserving the rank order of distances or the distances themselves, respectively. There are several pharmaceutical and biotechnological applications. For example, multidimensional scaling in conjunction with structure–activity data seems very useful for identification of active drug conformers [34–37].
Less classically, direct use of information (as opposed to probability)-based methods seems well suited to the automation of the above sequence, which is, after all, an information flow. Information theory has already long been recognized as of value in inference from rules, and in the decision process based on that inference [38,39], whence it is closely related to decision theory [40,41]. Application of information theory in commercial methods of data mining for the rules, i.e., empirical rule generation, as the first step has been less common, though it is the approach taken by one of the authors (B. Robson) [18,42,43] and applied to 667,000 patient records [44]. Because the method is somewhat less orthodox, it is worth stating that it has its roots in the theory of expected information [44] and in the subsequent widely used application as the Garnier-Osguthorpe-Robson (GOR) method [45] for data mining protein sequences.
Widely cited and used since its publication in 1978, the latter had some 109,000 Google hits on Robson GOR protein in September 2007. The “rules” here were basically rules in the same sense as in subsequent data mining efforts, though then known as the “GOR parameters,” and concerned the relationships between amino acid residues and their conformation in proteins. The difficulty was that the GOR method and its rules took advantage of, and were “hard wired” to, the chemistry and biology of protein structure. In effect, the more recent papers [18,42,43] developed a more general data mining approach where there is no imposition on what the rules are about, except for a choice of plug-in cartridges, which customize the method to particular domains such as clinical data.
A simple example of such a rule may be that if a patient is tall, he will be heavy. This illustrates that rules are not in general 100% guaranteed to be correct. Rather, rules will, in general, be associated with a quantity (weight) expressing uncertainty in an uncertain world, even if some or many of them, such as “if the patient is pregnant, the patient is female,” emerge as having a particular degree of certainty of 100% and may constitute ontology in the sense of “All A are B.” In the abovementioned rule generation methods [18,42–44], the probabilistic weight was actually an estimate of the information available to the observer, reflecting both the strength of the relationships and the amount of data available for estimating them (a natural and formal combination; see below). Weights will be discussed in several contexts in what follows.
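To make the idea of a weighted rule concrete, the following is a minimal sketch and not the published method of [18,42–44]. It assumes a simple log ratio of the observed joint count to the count expected under independence, with a hypothetical pseudocount added to both so that the weight shrinks toward zero when little data are available. The function name, the pseudocount, and the patient counts are all illustrative assumptions.

```python
from math import log

def association_information(n_ab, n_a, n_b, n_total, pseudocount=1.0):
    """Illustrative rule weight in nats (an assumption, not the source's formula).

    n_ab    : records in which both A and B occur
    n_a     : records in which A occurs
    n_b     : records in which B occurs
    n_total : all records
    The pseudocount is a hypothetical device that pulls the weight toward
    zero when the counts are small.
    """
    expected = n_a * n_b / n_total  # joint count expected if A and B were independent
    return log((n_ab + pseudocount) / (expected + pseudocount))

# Hypothetical counts for the rule "tall -> heavy" with the same observed-to-
# expected ratio (1.5) in a large and in a small sample.
large_sample = association_information(300, 400, 500, 1000)
small_sample = association_information(3, 4, 5, 10)
```

With these made-up counts the large sample yields a weight close to log 1.5, while the small sample yields a noticeably smaller one, which mirrors the statement that the weight reflects both the strength of a relationship and the amount of data behind it.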
As in a large study of patient data [44], the rules themselves represented the end of the road as far as basic research is concerned, with the important qualification that they were automatically fed to medical databases such as PubMed to ascertain how many hits were associated with the rule. Some (3–4%) had few or no hits and represented potential new discoveries to be further investigated. The significance of subsequent inference is that it allows for the fact that rules are not independent; indeed many weak positive and negative rules with topics in common like patient weight may add up to a strong weight of evidence regarding that topic. Rules interact to generate further rules within an inference process without further information except for certain established laws of inference used in logical and probabilistic argument. It may be noted in passing that this is more easily said than done because some of the laws of higher-order logic required for much inference, such as syllogistic reasoning, are not well agreed upon in the matter of handling uncertainty.
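How many weak rules can add up to a strong weight of evidence can be sketched as follows. This is only an illustration under a strong simplifying assumption: each rule contributes a log-likelihood-ratio weight (in nats) that is simply added to the prior log odds, as if the rules were conditionally independent. That is precisely the assumption a fuller inference process would need to relax. The function name, the prior, and the weights are hypothetical.

```python
from math import exp, log

def posterior_probability(prior, weights):
    """Combine a prior probability with additive evidence weights (nats).

    Assumes, for illustration only, that the weights are log-likelihood
    ratios from conditionally independent rules.
    """
    log_odds = log(prior / (1.0 - prior)) + sum(weights)
    return 1.0 / (1.0 + exp(-log_odds))  # logistic function maps log odds to probability

prior = 0.10                                       # hypothetical prior for some conclusion
weak_rules = [0.4, 0.3, -0.2, 0.5, 0.4, 0.3]       # hypothetical weak positive and negative weights
single = posterior_probability(prior, weak_rules[:1])
combined = posterior_probability(prior, weak_rules)
```

No single weight here moves the probability far from the prior, but their sum does, which is the sense in which weak rules sharing a topic may accumulate.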
When focus is on a specific decision or a set of decisions as opposed to general discovery, there is funneling or selection, focusing on the domain of relevant rules. A decision is in that sense an inference step out of many possible inference steps. To choose the appropriate decision, one must consider what exactly beneficial means. In medicine, conveniently, we can characterize this in terms of outcomes and specifically a sense of enhancement in the well-being of a patient in particular and of the population in general. Of course, well-being is a somewhat fuzzy and not invariant concept, but then so is the sense of lack of well-being in the first place; fuzziness and, conversely, distinguishability are some of the recurrent issues that are important to deal with at several points.
Though it seems odd at first, the files containing extracted rules can in principle be much larger (though in practice this is currently rarely so) than the files including the raw data analyzed. That does not mean that information is created, but that there is an overhead price to pay in putting data in a more knowledge-related form, which is appropriate for inference. The important notion is that these rules may be used to some fundamental underlying principles comparable to laws of nature. The explosion potentially occurs because relations between things are rendered specific in terms of, behind the scenes at least, combinatorial mathematics. This can be glimpsed by stating that in studying the relationships in a mere four items, A, B, C, and D, the relationships to be explored are (A, B), (A, C), (A, D), (B, C), (B, D), (C, D), (A, B, C), (A, B, D), (A, C, D), (B, C, D), and (A, B, C, D). The consequent “combinatorial explosion” as the number of items is increased is considerable. It is at least 10^30 for 100 items, still an incredibly small number of items for, for example, a patient record including genomic and proteomic data and image data. This makes the discovery of relevant rules difficult and computationally expensive and represents the “dragon” protecting the discovery of the gold of knowledge therein.
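The size of this explosion can be checked with a short sketch. It assumes that a candidate relationship is any unordered subset of two or more items, which gives 2^n − n − 1 relationships for n items; the function names are illustrative only.

```python
from itertools import combinations

def relationships(items):
    """Enumerate every subset of two or more items (feasible only for small inputs)."""
    return [subset
            for size in range(2, len(items) + 1)
            for subset in combinations(items, size)]

def relationship_count(n):
    """Closed form: all 2**n subsets minus the empty set and the n single items."""
    return 2**n - n - 1

four_items = relationships("ABCD")       # six pairs, four triples, one quadruple
count_100 = relationship_count(100)      # far too many to enumerate explicitly
```

For four items the enumeration gives 11 relationships, and for 100 items the closed form exceeds 10^30, consistent with the figure quoted above.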
There may also be more rules generated in the inference process, in the sense of logical or essentially probabilistic interim or final deductions from the data-mined rules. For example, in the PC, the syllogisms generate a further rule, which can follow from two given rules. One may say that the increase of information available to the researcher is inevitable because it is necessarily so that these interim or final rules are unexpected or at least are hidden from consideration, else why acquire inference engine software that performs the inference process? That accepted, the data mining process, as a combinatorial expansion of the description of the relationship between things in the raw data, can be considered a part of inference, which is another reason why data mining and inference cannot be divorced. The feature that dictates the severity of combinatorial expansion is not the explicit information content of the whole file in terms of bits, but rather the width of the data, reflecting the number of parameters to consider, not the depth of data, reflecting the sample size. The terms width and depth come particularly from the concept of archives of analyzable records discussed below, the width of the record representing the number of parameters and the depth representing the number of records. Width makes analysis more challenging; depth makes it statistically more reliable, and there is a relationship in that increasing width demands increasing the depth to obtain statistical significance. The information is a logarithmic function of aspects arising from data and hence rises only as the logarithm of the depth. The information in terms of the actual rule content rises proportionally, however, to the width, this representing an explosive increase. The width as number of parameters represents the true complexity of any analogous problem in both the colloquial and mathematical sense, as follows.