PHARMACEUTICAL INDUSTRY
2.6.6 Good Data Mining Is Not Just Testing Many Randomly Generated Hypotheses
Even if the core parts of a data mining program look like iteration over many arbitrary hypotheses, the code overall, and its effect, is much larger than the sum of those parts. If it were not, we would simply run out of project time, patience, or computer power long before the whole space was covered for discovery (i.e., generating all the possible rules), and there would be an arbitrary focus on what was examined first. Thus, data mining really implies many algorithms that try to enhance genuine discovery.
At its rawest, data mining has no sense of what is interesting, or even new, to the researcher. It has no sense of physics, chemistry, or biology. It reports, or should report, the surprising absence of pregnant males with the same enthusiasm as relationships implying a potential cure for cancer. Of course, well-founded prior judgment is not excluded, and it is useful to have “interesting” and, importantly, “not interesting” commands, ideally with a probabilistic element rather than being set in stone, so that related matters do not escape discovery. However, these imply the presence and application of D*. They draw data mining back toward classical hypothesis testing. This is sometimes a good thing, but it carries the dangers as well as the benefits posed by D*.
There is much that can be done with more general heuristic algorithms that are relatively “D*-free” and yet restrict the early calculations and search to where it counts [16–18]. The most common of these is based on the amount of data, related to “the level of support for a rule.” Where there is inadequate data, why bother to calculate? Inadequate data does not mean, however, that n(A & B & C & …) is a small number of observations. Consider the situation in which thousands of female patients taking, say, a cholesterol-lowering drug X for a month never get pregnant. Here n(female & pregnant & drug X) equals zero, yet the effect is very significant indeed.
The information theoretic approach makes the situation clearer. In the huge number of potential rules above, most are not likely to be rules in the everyday colloquial sense. Some will contain little information: their probability is close to what we might expect on a chance basis. On the face of it, we would have to look at every possible rule to calculate that. However, rules can be excluded from further consideration where there is not enough data to obtain reliable rules. The information value can be positive or negative, and a value close to zero implies no information. Thus, the algorithm would typically halt calculation where it is detected early that information greater than +x or less than −x cannot be obtained. The rule for this is again not that n(A & B & C & …) is below a critical number, but that the terms of lower complexity are below a critical number. Moreover, we may start with the terms of least complexity, n(A), working up to n(A & B), n(A & B & C), …, which can never be larger. When the count falls below a critical value for any subset of the parameters in the full set A & B & C & … of interest, we may halt.
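The level-by-level halting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `count` and `conjunctions_to_evaluate`, the `min_count` threshold, and the toy records are all hypothetical. A conjunction is evaluated only if every term of lower complexity has enough observations; its own count may still be zero, as in the pregnancy example.

```python
from itertools import combinations


def count(records, itemset):
    """Count the records (sets of attributes) that contain every item of `itemset`."""
    return sum(1 for record in records if itemset <= record)


def conjunctions_to_evaluate(records, min_count, max_size):
    """Return {conjunction: count} for conjunctions worth evaluating.

    A conjunction of k attributes is evaluated only when all of its
    (k-1)-attribute sub-conjunctions reached `min_count` (hypothetical
    threshold). The search starts from single attributes and works upward.
    """
    items = {item for record in records for item in record}
    evaluated = {frozenset([i]): count(records, frozenset([i])) for i in items}
    frontier = {s for s, n in evaluated.items() if n >= min_count}
    size = 1
    while frontier and size < max_size:
        size += 1
        # Join supported conjunctions of the previous level.
        candidates = {a | b for a in frontier for b in frontier if len(a | b) == size}
        # Halt on any candidate with an under-supported lower-complexity term.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frontier for sub in combinations(c, size - 1))
        }
        level = {c: count(records, c) for c in candidates}
        evaluated.update(level)
        # Only conjunctions with enough data of their own are extended further.
        frontier = {c for c, n in level.items() if n >= min_count}
    return evaluated


# Toy data (invented for illustration only).
records = (
    [{"female", "drugX"}] * 5
    + [{"female", "pregnant"}] * 4
    + [{"male", "drugX"}] * 3
    + [{"male", "rare"}]
)
result = conjunctions_to_evaluate(records, min_count=3, max_size=3)
print(result[frozenset({"drugX", "pregnant"})])  # prints 0: evaluated despite a zero count
print(frozenset({"male", "rare"}) in result)     # prints False: "rare" has too little data
```

The design choice worth noting is that the zero count for `drugX & pregnant` is kept and reported, because both of its lower-complexity terms are well supported, while anything involving the sparsely observed attribute is never calculated.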
Note that this impact of data has nothing to do with estimates of probabilities P(A), P(A & B), and so on, per se, since these convey nothing about the levels of data involved. We might get the same probability (depending on the estimate measure) by taking a subset of just one thousandth of the overall data.
A direct measure of information including the level of data is philosophically sound and feasible. It measures the information in a system that is available to the observer. On such grounds, the real form of interest having the above properties dependent on data levels arises naturally. It is an expectation calculated by integration over Bayesian probabilities given the data [18,12,20]. This was used in the GOR method in bioinformatics [20], which was based on several preceding studies including Robson’s expected theory of information [21]. In the latter study, the integration of information functions log_e P is made over the probability distribution Pr(P | D) dP, where Pr(P | D) is given by Bayes’ equation as Pr(D | P) Pr(P)/Pr(D). Consider also that what we imply by Pr(P) is really the estimate or expectation E(P | D) of an underlying P “out there in nature,” conditional on data D, say, D = [n(A), n(B), n(A & B)]. It means that the estimate arises only by considering cases when here D = [n(A), n(B), n(A & B)].
The integration over information measures is similar. The most general way to write it is to simply state I(P) as some function of P, viz,

Equation 2.18 is in some respects the most complete because it includes not only A but the contrary information in ∼A. In fact, whatever terms A, ∼A, and

Since the log terms are separable, we may focus on
The “plug-in” point in the above for actually introducing the counted numbers is the likelihood, whence one must be specific about the arguments of P. It is a binomial, or in general multinomial, function of the number of observations of something (or joint occurrences of something), say, n. The integration then yields
\[
E(I \mid D) = \zeta(s=1,\, n) + C. \tag{2.37}
\]
Here ζ is the incomplete Riemann zeta function discussed above, actually more general than the complete one, which implies n = ∞.
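One way to see why a harmonic sum appears in Equation 2.37 is the following worked case. It is only a sketch under assumptions not stated in the text: a uniform prior Pr(P), a binomial likelihood for n occurrences, and the symbols N (total number of observations), B (the beta function), and ψ (the digamma function), all introduced here for illustration.

```latex
\[
\Pr(P \mid D) = \frac{P^{\,n}(1-P)^{N-n}}{B(n+1,\; N-n+1)},
\]
\[
E(\log_e P \mid D)
  = \int_0^1 \log_e P \;\Pr(P \mid D)\, dP
  = \psi(n+1) - \psi(N+2)
  = \sum_{k=1}^{n} \frac{1}{k} \;-\; \sum_{k=1}^{N+1} \frac{1}{k}.
\]
```

In this special case the part that depends on n is exactly the sum 1 + 1/2 + … + 1/n, that is, ζ(s=1, n), while the remaining term depends only on N.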
When n becomes indefinitely large,
\[
\zeta(s=1,\, n) - \gamma \to \log_e(n), \quad n \to \infty \tag{2.38}
\]
\[
E(I \mid D) = \zeta(s=1,\, n) - \gamma, \quad n \to \infty. \tag{2.39}
\]
The constant γ = 0.5772156649… is the Euler–Mascheroni constant. In fact, there seems to be no interesting case in data mining yet noted where one zeta function is not subtracted from another, so the constant C always cancels, as when we wrote for Equation 2.27
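\[
\begin{aligned}
E[I(A : {\sim}A;\, B;\, C;\, \ldots) \mid D]
  &= \zeta(s=1,\, o[A \,\&\, B \,\&\, C \,\&\, \ldots]) - \zeta(s=1,\, e[A \,\&\, B \,\&\, C \,\&\, \ldots]) \\
  &\quad - \zeta(s=1,\, o[{\sim}A \,\&\, B \,\&\, C \,\&\, \ldots]) + \zeta(s=1,\, e[{\sim}A \,\&\, B \,\&\, C \,\&\, \ldots]).
\end{aligned}
\]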
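The zeta-function expressions above can be explored numerically with a short sketch. Several things here are assumptions rather than statements from the text: o[…] and e[…] are read as observed and expected counts, non-integer counts are handled by simple linear interpolation between neighbouring integers, and the function names and example numbers are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # value quoted in the text


def incomplete_zeta_s1(n):
    """zeta(s=1, n) = 1 + 1/2 + ... + 1/n, with zeta(s=1, 0) = 0.

    Non-integer n is handled by linear interpolation between the two
    neighbouring integers (an assumption of this sketch).
    """
    if n < 0:
        raise ValueError("count must be non-negative")
    whole = math.floor(n)
    total = math.fsum(1.0 / k for k in range(1, whole + 1))
    return total + (n - whole) / (whole + 1)


def expected_information(o_a, e_a, o_not_a, e_not_a):
    """Four-term difference of zeta functions for A versus not-A.

    Arguments are the assumed observed (o) and expected (e) counts for
    A & B & C & ... and for ~A & B & C & ... respectively.
    """
    return (
        incomplete_zeta_s1(o_a) - incomplete_zeta_s1(e_a)
        - incomplete_zeta_s1(o_not_a) + incomplete_zeta_s1(e_not_a)
    )


# Same observed/expected ratios, but a thousand times more data.
small = expected_information(2, 1, 1, 2)
large = expected_information(2000, 1000, 1000, 2000)
print(small, large, 2 * math.log(2))

# Zero observations against a hypothetical expectation of 50 gives
# strongly negative information for the first pair of terms alone.
print(incomplete_zeta_s1(0) - incomplete_zeta_s1(50))

# Numerical check of the large-n limit: zeta(s=1, n) - gamma approaches log_e(n).
n = 10**5
print(incomplete_zeta_s1(n) - EULER_GAMMA, math.log(n))
```

With sparse counts the result stays well below the log-ratio value, and it approaches that value as the counts grow, which is the dependence on data level that a probability estimate alone does not convey.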
2.7 INFERENCE FROM RULES