PROBABILITIES, RULES, AND HYPOTHESES - PHARMACEUTICAL INDUSTRY

2.2 PROBABILITIES, RULES, AND HYPOTHESES

Classical statistics also starts early with structured data. The statistician’s collection sheets are highly structured, clearly identifying independent variables (e.g., patient age) and dependent variables (e.g., blood pressure as a function of age) and ordering them if appropriate.

2.2.1 Semantic Interpretation of Probabilities

Classical statistics, being rooted in probability theory, is interested in the probability of any datum A. How it considers and calculates values of the form P( ) is discussed below, but for probability theory in general, it is a reasonably intuitive reflection of probability as used in colloquial speech, not least in regard to laying bets. There are also coincident, conjoint, or compound probabilities such as P(A & B), which in everyday conversation might be paralleled by speaking of the chances that A and B are seen together, or of the extent to which A and B are two qualities or quantities describing a common thing. There may be several symbols, A, B, C, and so on, as in P(A & B & C); their number (here 3) is in fact the complexity of that probability expression, in the sense of the word complexity used above. The data as a whole will in general be of much higher complexity, so P(A & B & C) is just one facet of it.

Probability functions of states like those above are basically associations and hence quantify existential or “some” statements probabilistically. There is, however, a case for using probability ratios such as P(A & B)/[P(A) P(B)], which express departure from randomness, because observing just one or a few coincidences of A and B may not be meaningful as indicating “Some A are B”; they may merely represent experimental errors. In that case, for complexity 3 or more, there are complications. For example, note that P(A & B & C)/[P(A) P(B) P(C)] for quantifying existential statements is not necessarily the same quantity as P(A & B & C)/[P(A & B) P(C)], and that P(A & B & C)/[P(A) P(B & C)] and P(A & B & C)/[P(A & C) P(B)] can be different again. The correct perspective is arguably that there should be distinct existential statements, each expressing departure from a particular kind of prior expectation [3], analogous to issues in defining the free energy of the ABC molecular complex relative to the interacting molecules A, B, and C.

There are also conditional probabilities such as P(A | B), equal to P(A & B)/P(B) when B is defined, countable, and exists, such that P(B) is greater than zero. These requirements are generally met when P(A & B) also satisfies them, and then P(A) = Σ_X P(A & X) over all possible X that may exist. An analogy exists with the above discussion on random association: if P(A | B) indicates “All B are A,” i.e., a universal or “All” statement, just one single observation of B not being A, again perhaps an error, can break its validity and make “Some A are B” the appropriate semantic interpretation [3].

There are in fact ways of treating this problem for both Some and All statements by adding the caveat “for all practical purposes” to the statement. This raises the argument that perhaps we should then take the square root of that probability, i.e., make the probability larger, because we hold a strong belief in a weaker statement [3]. Semantically, this constitutes a hedge on a statement, as in “A is fairly large” compared with “A is large.” Nonetheless, the general sense in human thought and conversation seems to be that one observation in a trillion that A is B justifies the Some statement less than one observation in a trillion of B not being A invalidates the All statement.
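The point that several different ratios can be written for a triple of complexity 3 can be illustrated with a minimal Python sketch. The joint counts below are invented purely for illustration, and the helper name `prob` is hypothetical rather than anything taken from the text.

```python
from itertools import product

# Hypothetical joint counts for three binary attributes (A, B, C);
# keys are (a, b, c) with 1 = present and 0 = absent. The numbers are invented.
counts = {
    (1, 1, 1): 30, (1, 1, 0): 10, (1, 0, 1): 10, (1, 0, 0): 10,
    (0, 1, 1): 5, (0, 1, 0): 15, (0, 0, 1): 5, (0, 0, 0): 15,
}
total = sum(counts.values())


def prob(a=None, b=None, c=None):
    """Probability of the stated values, summing over attributes left as None."""
    wanted = (a, b, c)
    hits = sum(
        n for state, n in counts.items()
        if all(w is None or w == s for w, s in zip(wanted, state))
    )
    return hits / total


p_abc = prob(1, 1, 1)

# Four ratios for the same triple, each measured against a different
# prior expectation of what "random" would look like.
ratios = {
    "A, B, C all independent": p_abc / (prob(a=1) * prob(b=1) * prob(c=1)),
    "(A & B) independent of C": p_abc / (prob(a=1, b=1) * prob(c=1)),
    "A independent of (B & C)": p_abc / (prob(a=1) * prob(b=1, c=1)),
    "(A & C) independent of B": p_abc / (prob(a=1, c=1) * prob(b=1)),
}

# Conditional probability P(A | B) = P(A & B) / P(B), defined only when P(B) > 0.
p_a_given_b = prob(a=1, b=1) / prob(b=1)

# Marginalization: P(A) equals the sum of P(A & X) over every state X of (B, C).
p_a_summed = sum(prob(1, b, c) for b, c in product((0, 1), repeat=2))

for label, value in ratios.items():
    print(f"{label}: {value:.3f}")
print(f"P(A | B) = {p_a_given_b:.3f}")
```

With these invented counts the four ratios all differ, which is the complication described above: each one answers the question "departure from which expectation?" differently.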

2.2.2 Probability Theory as Quantification of Logic

Boole published binary logic as “the laws of thought,” so one should be able to drill deeper with this more rigorous perspective. Probability theory is actually a quantification of binary logic, that is, of functions of states such as L(A & B), which can only take the value 0 (false) or 1 (true). P(A) would be a quantification of statements like L(A) = 1, which can be interpreted as a statement that A exists. Probability theory thus handles uncertainty, i.e., intermediate values. L(A & B) = 1 is an existential statement that A and B exist and coexist, and P(A & B) quantifies the extent to which L(A & B) = 1. If meaning can be attached to L(A | B), it is the universal statement that “All B are A.” Existential and universal statements form the core of a higher-order logic called the predicate calculus (PC), which goes back to the ancient Greeks.

Interestingly, there is no widely agreed quantification of the PC, one that handles uncertainty in ontology combined with that for associations, in the same sense that probability theory is a quantification of binary logic, though some obvious methodologies follow from the statements below. This lack of agreement is a handicap in the inference to be deduced from data-mined rules.
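The relation between the 0/1 logic values L and the probabilities P that quantify them can be sketched as follows. This is only an illustration: the five records are invented, and treating a probability as a simple relative frequency over records is an assumption made here for brevity.

```python
# Hypothetical observations: each record states whether A and B hold.
records = [
    {"A": True, "B": True},
    {"A": True, "B": False},
    {"A": True, "B": True},
    {"A": False, "B": False},
    {"A": True, "B": True},
]

# Existential statement L(A & B): 1 if at least one record shows A and B together.
l_a_and_b = int(any(r["A"] and r["B"] for r in records))

# P(A & B): the fraction of records in which A and B coexist.
p_a_and_b = sum(r["A"] and r["B"] for r in records) / len(records)

# Universal statement L(A | B), "All B are A": 1 only if every B record is also A.
b_records = [r for r in records if r["B"]]
l_a_given_b = int(all(r["A"] for r in b_records))

# P(A | B): the fraction of B records that are also A.
p_a_given_b = sum(r["A"] for r in b_records) / len(b_records)

print(l_a_and_b, p_a_and_b, l_a_given_b, p_a_given_b)
```

Adding a single record with B true and A false would flip L(A | B) to 0 while lowering P(A | B) only slightly, which is the fragility of universal statements noted in Section 2.2.1.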

2.2.3 Comparison of Probability and Higher-Order Logic Perspective Clarifies the Notions of Hypotheses

Statements like “Refute the null hypothesis” sound scientifically compelling but do not mean much more than “Get rid of that notion that we don’t like.” Clearer statements have a more elaborate higher-order logic structure. PC is one example of a higher-order logic because we can write nested things like L[L(A | B) = 1 & L(B & C) = 1] = 1 (All B are A, some B are C, so some A are C, an example of a syllogism). For a statement like this, the values of 1 reflect that the syllogism is valid, not necessarily true: given that L(A | B) = 1 and L(B & C) = 1, then L[L(A | B) = 1 & L(B & C) = 1] = 1 (but L(A | B) and L(B & C) may not actually equal 1). In that sense, it is useful to consider them, and certainly the inner terms, as hypotheses, hunches, postulates, or propositions, which do not actually have to be the case, reserving L for actual empirical truth (maybe T for “truth of,” or R for “reality,” would be better than L). Then one writes H in place of L as, e.g., H(A | B).

Despite the above comments on quantification, it is certainly meaningful to build quantified examples, e.g., P[H(A | B) = 1 | L(A | B) = 1], meaning the probability that the hypothesis that “All B are A” takes a truth value of 1 when it is empirically true, a semantic overkill that statisticians use (or ought to; see below) in the contracted form P(H₊ | D). This is the probability of the positive hypothesis H(A | B) = 1 being true given data D, which here means that L(A | B) = 1. Alternatively, there is P(H₋ | D), the probability of the negative hypothesis H(A | B) = 0 being consistent with data D, which here actually means L(A | B) = 0. In practice, as analyzed below, the preference in classical statistics is to use the probability of the null hypothesis. By analogy to the above, this would be P(H₀ | D), which ought by that name and 0 subscript to be the type of hypothesis associated with H(A | B) = 0. One hopes the probability is low so that the hypothesis can be rejected. Actually, it relates to something like H(A′ | B) = 1, where A′ is variously some most expected state, or the most boring, or even the most costly, and such that H(A′ | B) = 1 implies something much closer to H(A | B) = 0 than to H(A | B) = 1.

This unsatisfactory account of a state is discussed below. As it happens, things are even more tortuous because it is P(D | H₀) that is used in classical statistics, a matter also discussed below.
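The distinction between a syllogism being valid and its premises being true can be checked mechanically. The sketch below is an illustration only: it enumerates every way of assigning the properties A, B, and C to a small universe of three individuals (the universe size and the helper names `all_are`, `some_are`, and `worlds` are assumptions made here) and tests whether the conclusion holds in every world where the premises hold.

```python
from itertools import product


def all_are(xs, ys):
    """'All X are Y': every individual with property X also has property Y."""
    return all(y for x, y in zip(xs, ys) if x)


def some_are(xs, ys):
    """'Some X are Y': at least one individual has both properties."""
    return any(x and y for x, y in zip(xs, ys))


def worlds(n):
    """Yield every assignment of (A, B, C) truth values to n individuals."""
    for world in product(product((False, True), repeat=3), repeat=n):
        a, b, c = zip(*world)
        yield a, b, c


# Form from the text: All B are A, some B are C, so some A are C.
valid_form = all(
    some_are(a, c)
    for a, b, c in worlds(3)
    if all_are(b, a) and some_are(b, c)
)

# Contrasting form: Some B are A, some B are C, so some A are C.
invalid_form = all(
    some_are(a, c)
    for a, b, c in worlds(3)
    if some_are(b, a) and some_are(b, c)
)

print(valid_form, invalid_form)
```

The first form holds in every world that satisfies its premises, so it is valid whether or not the premises are true of any real data; the second fails in at least one world, so it is not.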

2.2.4 Pharmaceutical Implications

For simplicity, we start off with the positive hypothesis H₊. This seems reasonable, and arguably it is the basis of the inference process that often goes on (and should go on), at least qualitatively, behind the scenes in R&D before a more classically framed statistical report is produced. After all, in any scientific paper about drug action or in a project of drug R&D, the hope that a new drug will work is H₊. The probability of that being true prior to generating or seeing any hard data is the prior probability P(H₊). With the data subsequently considered, the probability of the hypothesis typically changes for better or worse, to the posterior probability, which is the conditional probability P(H₊ | D) = P(H₊ & D)/P(D). The vertical bar again means “conditional on.”
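As a minimal numerical sketch of the step from prior to posterior, the function below evaluates P(H₊ | D) = P(H₊ & D)/P(D). It assumes, for illustration only, that the sole alternative to H₊ is a negative hypothesis H₋, so that P(D) = P(H₊ & D) + P(H₋ & D); the function name and all numbers are hypothetical.

```python
def posterior(prior, p_data_given_pos, p_data_given_neg):
    """Return P(H+ | D) = P(H+ & D) / P(D) for a two-hypothesis split."""
    joint_pos = prior * p_data_given_pos            # P(H+ & D)
    joint_neg = (1.0 - prior) * p_data_given_neg    # P(H- & D)
    p_data = joint_pos + joint_neg                  # P(D)
    if p_data == 0.0:
        raise ValueError("P(D) is zero, so P(H+ | D) is undefined")
    return joint_pos / p_data


# Invented numbers: a 10% prior that the drug works, with the observed data
# four times as likely if it works as if it does not.
prior = 0.10
p_d_given_pos = 0.80
p_d_given_neg = 0.20

post = posterior(prior, p_d_given_pos, p_d_given_neg)
print(f"P(H+) = {prior:.2f}, P(D | H+) = {p_d_given_pos:.2f}, P(H+ | D) = {post:.3f}")
```

With these numbers the posterior is well below P(D | H₊), which illustrates why P(D | H) and P(H | D) should not be read interchangeably.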

As indicated above, the writing of P(H₊) and P(H₊ | D) is really shorthand because inference involving hypotheses has a more complex higher-order structure. For easy comprehension, this will be framed in terms of more specific examples and need not drill down quite as far, at least in terms of symbolic overkill, as the previous section. It still requires a significant elaboration. In the pharmaceutical industry, even the so-called prior probability P(H₊) typically really corresponds to a conditional probability such as P(drug works | drug X & disease Y), a probability which as written is thus really of complexity 3.

An even greater complexity may emerge as important for proper analysis. Notably, while the above notation should convey adequate sense, something such as Pr[P(disease Y at time T + t = false | drug X given at time T & disease Z at time T = true) > 0.9 | D] is implied. By analogy with the discussion above on higher-order logic, this new P implies a higher-order inference process, and here a higher-order probability theory. In practice, Pr signifies a probability distribution (probability density) or one value on such, and X, Y, T, and t are all variables underlying that density, as follows.
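One way to make a quantity of the form Pr[P(...) > 0.9 | D] concrete is to treat the inner probability as an unknown success rate with its own distribution. The sketch below is a simplification made here, not a method stated in the text: it assumes a uniform prior on the rate and binomial data, which gives a Beta(1 + successes, 1 + failures) posterior, and it uses the standard identity that expresses the tail of a Beta distribution with integer parameters as a binomial sum. The patient counts are invented.

```python
from math import comb


def beta_tail(a, b, x):
    """Pr[p > x] for p ~ Beta(a, b) with positive integer a and b.

    Uses the identity Pr[p > x] = sum_{j=0}^{a-1} C(n, j) x^j (1-x)^(n-j),
    where n = a + b - 1.
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(a))


# Invented data D: 48 of 50 treated patients are disease-free at time T + t.
successes, failures = 48, 2

# Uniform prior Beta(1, 1) updated by the data gives Beta(49, 3).
a, b = 1 + successes, 1 + failures

pr_exceeds = beta_tail(a, b, 0.9)
print(f"Pr[P(disease-free | drug) > 0.9 | D] = {pr_exceeds:.3f}")
```

The inner P is the response rate itself; the outer Pr is a statement about how much of the posterior density for that rate lies above 0.9.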

2.2.5 Probability Distributions

In reality, no one but a mathematician interested in statistical methodology wants a probability distribution. It implies uncertainty (perhaps even uncertainty about degrees of uncertainty about specific things) and ultimately that we can at best expect rather than rely on something (and perhaps not expect with any great reliability). From a Bayesian perspective (see below), it can represent a spread of different degrees of belief in the observer’s brain as opposed to a firm, judiciously held opinion (which would appear as a single spike, a so-called delta function, if the distribution perspective is still taken).

This all reflects ignorance. A distribution arises instead of discrete points because we are, for example, pooling studies on many clinical trial patients. Here it is impossible to completely know and control the observation system and process: there are experimental errors. Even if we could, we cannot know and, with some 10¹⁰–10¹³ bits of information capacity to potentially worry about, may perhaps never be able to know all the features and mechanisms of each patient and environment.

Challenges can arise even in a single dimension, say, along a single parameter like height in a population. For example, even when there is an inkling of the genes involved, the genes can of course interact to express phenotype in a complex way with each other and with the environment. It is thus difficult to set up the many detailed conditional probabilities; conditional, that is, on each relevant factor, which if finely conditional enough would be a sharp spike (a delta function). Without complete resolution onto different conditions, one might at least hope that separate peaks would be seen to inspire the hunt for appropriate conditions. Unfortunately, that does not always happen. There are, for example, not just two variations of a single gene that would make patients five feet tall and six feet tall, and no other height, but many. All unknown factors may effectively appear as a random influence when taken together, and a typical distribution is thus the normal (Gaussian, bell curve) distribution, the basis of the z- and t-tests.

Worse still, there is no reason a priori to expect that the ideal distribution based on many factors can be adequately expressed on a one-dimensional axis.

In two dimensions, the probability peaks will look like hills on a cartographer’s contour map, yet hills seen in perspective from the roadway, with the horizon as a one-dimensional axis, can blur into an almost continuous if ragged mountain range profile. And if we use a two-dimensional map to plot the positions of currants in a three-dimensional bun (or berries in a blueberry muffin), the picture will be confusing: some currants would overlap to look like a bigger, more diffuse, and perhaps irregularly shaped currant, and even those that remain distinct may look closer than they really are. With big enough currants, they can look like a normal distribution describing one big fuzzy currant. Many parameters as dimensions arise in targeted medicine, where we wish to consider the application to specific cohort populations, and rather similarly in personalized medicine, where attention is directed toward drug selection for a specific patient in the clinic. Keeping the first notation (the one without T and t) for relative simplicity, a typical probability of interest, at least in the researcher’s head, elaborates to

P(H₊ | D) = P(drug works | drug X & disease Y & clinical record feature & clinical record feature & lifestyle feature & proteomic feature & proteomic feature)

Supposing that there is enough data for all this, which is as yet uncommon, the distribution seen in the full number of dimensions may have a potentially complicated shape (including, in many dimensions, complicated topology).

The question is, when we have knowledge only of fewer dimensions (relevant parameters), when is the shape described real?
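The way a projection onto fewer dimensions can hide real structure can be sketched numerically. The example below is an illustration under assumptions chosen here: two equally weighted, isotropic Gaussian peaks in two dimensions with invented centers, for which the marginal along each axis is simply a one-dimensional mixture of the projected centers. It counts the peaks visible along each axis.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a one-dimensional normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def count_peaks(values):
    """Count strict local maxima in a sequence of density values."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Two hypothetical peaks in a two-dimensional parameter space (x, y).
centers = [(0.0, 0.0), (1.0, 4.0)]
sigma = 1.0

# Evaluation grid from -5 to 9 in steps of 0.05.
grid = [-5.0 + i / 20.0 for i in range(281)]


def marginal(axis):
    """Density seen when only one coordinate (0 = x, 1 = y) is observed."""
    return [
        0.5 * sum(normal_pdf(g, center[axis], sigma) for center in centers)
        for g in grid
    ]


peaks_along_x = count_peaks(marginal(0))
peaks_along_y = count_peaks(marginal(1))
print(f"peaks seen along x: {peaks_along_x}, along y: {peaks_along_y}")
```

Viewed along y, the two peaks stay separate; viewed along x, where the centers are only one standard deviation apart, they merge into a single bell-like hump, much like the currants that overlap on the two-dimensional map.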

In document Pharmaceutical Data Mining (Page 63-68)