PHARMACEUTICAL INDUSTRY
2.3 PATTERN AND NECESSITY
2.3.1 Mythological Constellations Can Appear in Projection
In many displays of data, the data may be very distinct, typically well-separated points. That may of course be because there is simply not enough sampling. However, there may be some very crisp measurements or collections of measurements that we know are functions of several parameters, but not all the relevant parameters are known. Only the perception of three dimensions of space, plus one more of time (and many more properties of the light from the stars), allows scientists to build a true picture of the universe and to deduce the underlying physical relationships of the stars in the night sky. The two-dimensional view of the night sky yields the constellations, which are mostly artifacts of perspective, reflecting the worldview of the observers. Many of the constellations of the northern hemisphere were seen by the ancient civilizations and are mythological in the traditional meaning of the word: flying horses and supernatural beings. However, even the stars that are contained within a constellation can vary with culture: the Chinese constellations are different from the European.
The Southern Constellations were mostly named in modern times by European seamen and scientists and include the ship’s keel, the compass, the clock, the pump, and the microscope.
In the quest for real patterns, valid techniques exist for reducing such multidimensional data into fewer dimensions. Principal coordinate analysis seeks to do so while preserving with minimal stress some metric of distance between the points, while multidimensional scaling preserves the rank order of such metrics [4]. Both can produce meaningful patterns, for example, clusters and clusters of clusters, and so on. With such a result, and with all points envisaged as the intersection of branches with a horizontal cross section through a tree, that tree may be deduced and may reveal genuine ontological relationships.
However, this cross section may not produce an evenly spread or random distribution of points, so that shapes such as a circle (with many points clustered round the circumference) or part circle (crescent moon shape) may also appear. Occasionally, more angular shapes like triangles may emerge.
The dimensions into which reduction occurs may be arbitrary: they may not happen to correspond to the real parameters, or they may simply represent axes at non-right angles to the dimensions representing the real parameters.
Correlating the shapes with the parameters describing the original points can yield genuine, if sometimes surprising, relationships with physical meaning.
The principle has been applied to drug design based on analysis of predicted conformers and in regard to protein structure analysis [5–7].
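As a minimal sketch of the kind of dimensional reduction described above, the following applies classical (Torgerson) scaling, one common formulation of principal coordinate analysis, to a distance matrix. The function name `principal_coordinates` and the synthetic test data are illustrative assumptions, not taken from the cited work; the data are deliberately constructed to lie on a two-dimensional plane inside a five-dimensional space so that two coordinates suffice to reproduce the distances.

```python
import numpy as np

def principal_coordinates(dist, n_components=2):
    """Classical (Torgerson) scaling of a symmetric distance matrix.

    Returns coordinates whose Euclidean distances approximate `dist`.
    """
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives caused by rounding.
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical data: 30 points in 5-D that really depend on only 2 parameters.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 5))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = principal_coordinates(dist, n_components=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
max_error = np.abs(recovered - dist).max()
```

Here the recovered axes are an arbitrary rotation of the two underlying parameters, which illustrates why the reduced dimensions need not coincide with the real ones even when the distances are reproduced faithfully.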
However, while such dimensional reduction approaches are valid, the golden rule should perhaps be that persuasive patterns based on a viewpoint that implies a projection, like stars in the sky, should always be suspect.
Ramsey theory [8,9] studies the conditions under which order must appear. In particular, how many elements of some structure must there be to guarantee that a particular property will hold? Consider the night sky as a graph of n vertices (the stars) in which each vertex is connected to every other vertex by an edge (a line in the pictorial rendering of a constellation as an image). Now color every edge randomly green or red. Imagine that the ancient Chinese happened to pick constellations corresponding to the green lines, and observers of the ancient Middle East picked those corresponding to the red.
How large must n be in order to ensure that we see either a green triangle or a red triangle? It turns out that the answer is just 6, and triangles of either color emerge with equal probability. A common party puzzle of the same structure is this: what is the minimum number of people at a party such that there are either three people who are all mutual acquaintances (each one knows the other two) or three who are mutual strangers (each one does not know either of the other two)? The answer is again six.
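The claim that six vertices are needed, and that five are not enough, is small enough to check by brute force. The sketch below enumerates every green/red coloring of the complete graph on n vertices; the function names are illustrative choices.

```python
from itertools import combinations, product

def has_monochromatic_triangle(n, coloring):
    """coloring maps each edge (i, j) with i < j to 0 (green) or 1 (red)."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_triangle(n):
    """True if no two-coloring of the complete graph on n vertices avoids
    a single-colored triangle."""
    edges = list(combinations(range(n), 2))
    for colors in product((0, 1), repeat=len(edges)):
        if not has_monochromatic_triangle(n, dict(zip(edges, colors))):
            return False  # found a coloring with no green or red triangle
    return True

five = every_coloring_has_triangle(5)  # False: some colorings avoid triangles
six = every_coloring_has_triangle(6)   # True: all 2**15 colorings contain one
```

With five stars a pattern-free coloring exists (color the edges of a pentagon green and its diagonals red); with six, a triangle is unavoidable however the colors fall.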
2.3.2 The Hunger for Higher Complexity
To avoid the above and other problems, there is of course, or should be, a hunger for higher complexity. Throughout, an important bottom line is that expansion of human knowledge is reflected by our ability to increase the complexity of probability terms or other measures of uncertainty that we are able to quantify. Each term, A, B, C, … that makes up the complexity of a probability, P(A & B & C, …), represents a dimension. A rule of high complexity can be very strong, yet rules of lower dimensionality like P(A & B) may not be deducible from it and vice versa. That there were no pregnant males is not deducible from the abundance of male patients and the abundance of female patients (see below). This is not always the case. In a study of some kind of metric distance between data points, multidimensional scaling, principal coordinate analysis, clustering, and other techniques can produce meaningful patterns and relationships.
But that depends on the nature of the system under study and is not in general true.
Obviously, a critical factor in that is the amount of data available. The sparseness of data points increases with the number of dimensions, that is, the number of parameters represented, and hence with the complexity of any rule associated with a probability P(A & B & C & …). This means of course that we have much less data from which to deduce an n-dimensional probability distribution such as P(A & B & C & D & E) than one such as P(A & B & C). The number of possible potentially interesting combinations P(A & B), P(A & D & H & Q), and so on, rises at least approximately as 2^N for N parameters A, B, C, …. Data thus run out fast. Many thousands of complexity 2 and complexity 3 rules, mostly known but many new, came from an analysis of 667,000 patient records.
Yet many rules of complexity just 4 and especially 5 might be represented by a single observation, if any. That said, a few strong rules of much higher complexity can show up. Nonetheless, there is always a level of analysis that plows into higher dimensions which, in principle, can contain data, and with it a tendency to overreach in interpreting the sparse data encountered.
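The arithmetic behind "data run out fast" can be sketched as follows. The count of parameter combinations is exact; the records-per-cell estimate rests on a loudly simplifying assumption that is not in the text, namely that every parameter takes the same number of equally likely values (`values_per_param`, a hypothetical argument).

```python
from math import comb

def count_rules(n_params, min_complexity=2):
    """Number of distinct parameter combinations of at least the given
    complexity, e.g. P(A & B), P(A & D & H & Q), ... for n_params parameters."""
    return sum(comb(n_params, k) for k in range(min_complexity, n_params + 1))

def records_per_cell(n_records, complexity, values_per_param):
    """Average number of records per joint state of a rule.

    ASSUMPTION (illustrative only): each parameter has the same number of
    equally likely values, so a rule of complexity k has values_per_param**k cells.
    """
    return n_records / values_per_param ** complexity

combos_10 = count_rules(10)   # 2**10 - 10 - 1
combos_20 = count_rules(20)   # 2**20 - 20 - 1

# 667,000 records spread over rules of rising complexity, assuming a
# hypothetical 30 values per parameter.
per_cell = {k: records_per_cell(667_000, k, 30) for k in (2, 3, 4, 5)}
```

Under these assumed numbers, the average cell of a complexity 4 rule already holds less than one record, which is consistent with the observation above that such rules may rest on a single observation, if any.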
2.3.3 Does Sparseness of Data Breed Abundance of Pattern?
At first inspection, the answer is no (but see Discussion and Conclusions).
When data are sparse, they at least tend to look more random, in the sense that a true pattern distinct from randomness will only emerge as data build up. We tend to look forward to, for example, the beautiful and smooth normal curve that will one day emerge from our ragged bar chart that currently looks more like the Manhattan skyline. The reliability of our statistical summaries, assuming the normal curve is the right choice, and the convergence of our bar chart to it, rise with N, the amount of data, a consideration taken into account by the t-test, which makes it more robust than the z-test in utilizing the normal curve model. In gathering data to plot a normal distribution, there may be many modes that appear, meaning that several values will be the same or similar.
But our dreams of convergence to that curve reflect our expectation that the normal curve is the correct underlying model. For distributions in general, such modes in the raw data may, but may well not, survive to be ultimately perceived as the true modes of a multimodal (i.e., non-normal) probability distribution.
So our occasional initial assumption that we might adequately pool data into a single dimension may be too optimistic. In any event, whether a multidimensional description is reached later or is there from the outset, increasing the number of dimensions increases the opportunities for greater separation between points. In many dimensions, rogue outlying points due to experimental error and representing a rare probability of belonging to a cluster (while physically, chemically, or biologically entitled to belong to it nonetheless) tend to lie at greater Euclidean distances when that distance is in more dimensions.
This can be distracting in visual analysis, attracting too much attention to such points.
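A small simulation illustrates how Euclidean distance grows with dimensionality. It is a sketch under assumed conditions that are not from the text: each point belongs to a single cluster centered at the origin, and the error on every axis is independent with unit standard deviation.

```python
import math
import random

def mean_distance_from_center(dim, n_points=2000, seed=0):
    """Average Euclidean distance from the cluster center for points whose
    error on each of `dim` axes is an independent standard normal draw."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_points):
        total += math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
    return total / n_points

# The same per-axis error produces a larger distance as dimensions are added.
distances = {dim: mean_distance_from_center(dim) for dim in (1, 2, 10, 100)}
```

The average distance grows roughly as the square root of the number of dimensions, so a point that is unremarkable on every individual axis can still appear far from its cluster when all axes are combined.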
Now it may be countered that Ramsey theory does increase the chances that we read too much into sparse data as they are encountered. Ramsey theory does indeed mean that we will tend to find irrelevant patterns in any data, and this presents a particular lure to the unwary when there is not enough data to converge to the true distributions.
But in another sense, Ramsey theory runs in the opposite direction. It predicts that more elaborate patterns will emerge as the number of data points increases, and that the number of them rises explosively.
2.3.4 Sparse Data Can in Context Be Strong Data When Associated with Contrary Evidence
High dimensionality is not the only cause of sparse data: in certain specific circumstances, there can be a strong pattern of sorts in the absence of observations. This applies to negative associations. Obviously, noting an unexpected large black hole in a starry sky will be significant (hopefully indicating just a cloud!). In the case where there is just one dimension, a marked local gap or gaps in data may be equally significant. However, with two or more dimensions, it is also true that the hole represents less data than we would expect on the basis of the projections onto the axes. In other words, data may be sparser in the volume or hypervolume in many dimensions than they ought to be, based on the data for fewer dimensions. A negative association expressed most simply means, for example, that P(A & B & C) is much lower than we would expect on chance bases, as calculated by P(A) × P(B) × P(C) and P(A) × P(B & C) and P(A & B) × P(C) and P(A & C) × P(B). The first of these is the projection on three one-dimensional axes, the others on one axis and the implied plane formed by the two remaining axes. As with the “black hole,” strong negative associations (“pregnant males”) in the limit mean that the events linked by & in the probability measure never show up at all. That does not mean that there is inadequate data to support the implied negative association rule. The weight of such a rule is strengthened by the fact that the observed P(A & B & C) seems to be zero or close to it, as well as by a large value of P(A & B & C) recalculated on the above bases of random association, say, as P(A) × P(B) × P(C). In the above, notice that there is strong data, a kind of prior data, of lower dimensionality, that sets an expectation of something.
That it does not occur is evidence to the contrary.
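The weight carried by an empty cell can be sketched numerically. The counts below are hypothetical and chosen only for illustration, and the calculation assumes that records are independent, so that the number of joint observations expected by chance follows a binomial distribution.

```python
def expected_joint_count(total, *marginal_counts):
    """Expected number of joint observations if all events were independent,
    i.e. total * P(A) * P(B) * ... estimated from the marginal counts."""
    prob = 1.0
    for count in marginal_counts:
        prob *= count / total
    return total * prob

def chance_of_seeing_none(total, *marginal_counts):
    """Binomial probability of zero joint observations under independence
    (ASSUMPTION: each record is an independent trial)."""
    p_joint = expected_joint_count(total, *marginal_counts) / total
    return (1.0 - p_joint) ** total

# Hypothetical counts: 1000 patient records, 500 male, 100 pregnant.
total, males, pregnant = 1000, 500, 100
expected = expected_joint_count(total, males, pregnant)
p_none = chance_of_seeing_none(total, males, pregnant)
```

With these assumed counts, random association would predict about 50 pregnant males, so observing none is vanishingly unlikely by chance. The strength of the negative rule comes from the well-populated lower-dimensional marginals, not from the empty cell alone.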