
Data and Metadata

In document Pharmaceutical Data Mining (Page 59-63)


2.1.9 Data and Metadata

Any datum (entry) is susceptible to data analysis because of its relationship to other data. Whether qualitative or quantitative, as far as the product of data mining is concerned, it is always a countable entity by definition, even if in practice we postulate it and it is never seen at all. Nothing nonetheless prevents inserting rules into data-mined output that are based on human expertise and have appropriate weights representing human confidence or degree of belief, and as described below, this can be a parameter combined with the count to obtain weights for data-mined rules too.

That a datum is countable does not mean that it is in a useful form.

Even the basic datum can have a composite form that enhances its utility by enlarging on its meaning. In the form age := 63, the whole of age := 63 is the item, age is called the metadata, and 63 is the data value (parameter value). Metadata is indicated by the symbol :=, which we can consider as an operator meaning “is metadata of.” Metadata is optional. Hence, other examples of actual plausible data items are male, Asian, height := 6 ft, systolic BP := 125, weight := > 200 lb, Rx := chloramphenicol, and outcome := infection_eradicated. Though less commonly used in practice in the same context, there may be higher-order metadata, as in animal := vertebrate := primate := human := patient_#65488, which reveals the relation to ontology or taxonomy, i.e., classification of things. Incidentally, one can of course write a taxonomic tree with the use of brackets, but for the present purpose, the descriptor relates to just one path from a selected point, as from the trunk of the tree to one particular selected leaf node. Though these descriptors were described above as atomic, it is clear that operations could be applied to them, and this could, for example, take place in inference.

However, the above fundamental form represents the state in which they come to data mining, and they are typically immutable for the duration of that process.
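As a minimal sketch of the composite form described above, the following Python splits an item on the := operator into its metadata chain and its data value. The names `Datum` and `parse_datum` are illustrative assumptions, not part of the original notation; the immutable tuple type is chosen only to mirror the remark that items are typically immutable during mining.

```python
from typing import NamedTuple, Tuple


class Datum(NamedTuple):
    """One item: zero or more metadata levels plus a data value (hypothetical type)."""
    metadata: Tuple[str, ...]  # outermost metadata first; empty if there is none
    value: str                 # the last element of the := chain


def parse_datum(text: str) -> Datum:
    """Split 'a := b := c' into metadata ('a', 'b') and value 'c'."""
    parts = [p.strip() for p in text.split(":=")]
    if any(p == "" for p in parts):
        raise ValueError(f"empty element in datum: {text!r}")
    return Datum(metadata=tuple(parts[:-1]), value=parts[-1])


examples = [
    "male",                                                        # no metadata
    "age := 63",                                                   # first-order metadata
    "weight := > 200 lb",                                          # value may contain '>'
    "animal := vertebrate := primate := human := patient_#65488",  # higher order
]
parsed = [parse_datum(e) for e in examples]
for d in parsed:
    print(d)
```

An item without metadata, such as male, simply yields an empty metadata tuple, so the same type covers every example in the text.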

The input for structured data mining can come in a variety of formats, but very often as comma-separated value (CSV) files, which are interchangeable with Excel and Lotus spreadsheets, as well as relational databases such as Oracle and DB2. The records relate to patients, chemical compounds, and so on, and the first or zero row, i.e., the column heads, is typically the metadata.
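To illustrate how a header row can serve as metadata, here is a small sketch that reads CSV text with Python's standard `csv` module and emits one metadata := value item per non-empty cell. The function name `rows_to_items` and the sample rows are made up for illustration (only the column names and the first row echo examples from the text); skipping empty cells is an assumption about how missing values might be treated.

```python
import csv
import io


def rows_to_items(csv_text):
    """Turn each CSV record into a list of 'column := value' items."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first (zero) row holds the metadata
    records = []
    for row in reader:
        items = [
            f"{name.strip()} := {cell.strip()}"
            for name, cell in zip(header, row)
            if cell.strip()  # assumption: an empty cell contributes no item
        ]
        records.append(items)
    return records


# Hypothetical sample data, not taken from any real data set.
SAMPLE = "age,systolic BP,Rx\n63,125,chloramphenicol\n41,,penicillin\n"

for record in rows_to_items(SAMPLE):
    print(record)
```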

These and more complex inputs such as graphs are really better classified in a more fundamental way, however, since many of them are essentially the same thing. In contrast, a graph structure for data relationships can be placed on a spreadsheet where each row of the spreadsheet represents, say, a node followed by its input and output arcs to other nodes, but the implied data structure is fundamentally different.

It is clear that the above way of treating a composite datum provides a universal description into which more structured data can potentially be rendered, even if the result of that rendering is as banal as Column_6 := smoker, Pixel_1073 := 1, or Base_Pair_10073 := G. This theme can now be expanded upon. To begin, note that, typically, structured records of maximum interest in current data mining may be classified into

1. Graphs, in which data appear as nodes on a graph and are structured in their relationships by the arcs connecting them. They are harder to handle for data mining input since some self-consistent fragmentation of the network into maximally useful and logically sensible input chunks is required. Indeed, it is best to think of this kind of data as a step in unstructured data analysis, in which further kinds of analysis transform the data into the following forms, 2–6. However, probabilistic semantic nets (concept maps, etc.) in which nodes are nouns or noun phrases connected by arcs representing verbs, prepositions, and so on, may themselves be the ultimate inference structure of the future.

2. Trees. Easier to handle for data mining input are trees, in which all items are nodes on branches going back to a common root. They lend themselves to the extended higher-order metadata description such as A := B := C := D, and to ontological systems for holding data, notably XML.

3. Lists, such as biological sequences, spreadsheet rows, and relational data entries, in which a specific order has meaning for descriptors. An image might be included here, as an array, i.e., in general a multidimensional list, of pixels. A vector or matrix of discrete elements is thus a generalization of a list, and so in principle is a continuous distribution, i.e., of indiscrete elements, since by one means or another, it can be rendered as discrete data including data that are parameters of a distribution.

4. Sets, in which descriptors can appear in any order but only once or not at all.

5. Collections or bags, in which descriptors can occur in any order but now more than once (or once or not at all).

6. Partially distinguishable item collections. Since an item can be counted more than once in a collection, the issue arises as to the extent to which its occurrences are really distinct. If A occurs twice or more and is not counted more than once, it reflects the fact that the occurrences are considered identical, i.e., redundant duplications, and we are back to the set. If they are all counted, then they are distinguishable by recurrence, and measurements become repeated measurements that happen to be identical, to be taken into account in the statistics. Between these two, there are potentially intermediate degrees of distinguishability that can be discovered as strong relationships by a first pass of data mining. The degree of distinguishability is then entered in a second pass.
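The difference between the list, set, and bag views of the same items can be sketched with Python's built-in `set` and `collections.Counter`. The observations below are hypothetical, and using `Counter` as the bag is an illustrative choice rather than anything prescribed by the text.

```python
from collections import Counter

# Hypothetical record in which one item recurs.
observations = ["smoker := yes", "Rx := chloramphenicol", "smoker := yes"]

as_list = list(observations)  # order matters, repeats kept
as_set = set(observations)    # any order, each item once or not at all
as_bag = Counter(observations)  # any order, recurrences are counted

print(len(as_list), len(as_set), sum(as_bag.values()))
print(as_bag["smoker := yes"])
```

Treating the two occurrences of smoker := yes as redundant duplicates gives the set; counting both gives the bag, in which they are distinguishable by recurrence.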

The closer to the top of the list, the more rigid is the structure specification.

Nonetheless, that is an illusion and, transformed properly from one to the other, the information content is equivalent. Consistent with Equation 2.3 and the associated discussion, the notation used here has abolished the distinctions of graphs, lists, sets, and collections by making collections (also known as bags) the general case. We know that G is the 100th item in a DNA sequence (a list) because we now write Base_100 := G. At the very worst, in a spreadsheet without specified metadata, we can always write, e.g., Column_26 := yes. In consequence also, original data could be a mix of the above types 1–4 and could be converted to a collection as the lingua franca form. Such mixed data are not unstructured, but are merely of mixed structure, providing the entries in each structure class of 1–4 are clearly indicated as such.
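The claim that a list loses no information when rewritten as a bag can be checked with a small round trip. In this sketch, position is moved into the metadata (Base_1 := G, and so on) and then recovered; the helper names `list_to_bag` and `bag_to_list` and the short sequence are assumptions made for illustration.

```python
from collections import Counter


def list_to_bag(sequence, prefix):
    """Encode each element's position in its metadata, e.g. 'Base_1 := G'."""
    return Counter(
        f"{prefix}_{i} := {item}" for i, item in enumerate(sequence, start=1)
    )


def bag_to_list(bag, prefix):
    """Recover the original order from the positional metadata."""
    indexed = {}
    for item in bag:
        name, value = (part.strip() for part in item.split(":=", 1))
        position = int(name[len(prefix) + 1:])  # text after 'prefix_'
        indexed[position] = value
    return [indexed[i] for i in sorted(indexed)]


dna = "GATTACA"  # hypothetical short sequence
bag = list_to_bag(dna, "Base")
restored = "".join(bag_to_list(bag, "Base"))
print(sorted(bag))
print(restored)
```

Because every item carries its own position, the bag can be shuffled freely and the sequence is still recoverable, which is the sense in which the bag serves as a general form.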

How does one build or choose such a composite datum? It is not always so easy. First, we specify a general principle of notation introduced informally above. In much of this review, it is found that it is convenient to use A, B, C, and so on, to stand generally for a datum in whatever structured form it is in, much as mathematicians use x to stand for any number. Occasionally, to avoid cumbersome use of subscript indices where they would be abundant in equations, it is important to recognize that B immediately follows A, and so on, i.e., A = X1, B = X2, and so on, and certainly A, B, C, … Z means all the data that there are for consideration, not just 26 of them (the number of letters in the alphabet). Each of A, B, C, and so on, can stand for, for example, hydropathy of a molecule, the gender of a patient, the ethnic group, the height, weight, systolic blood pressure, an administered drug, a clinical outcome, and so on. At the point of structured data mining, a symbol such as A will be potentially a composite datum such as E := F. Prior to that, however, the A, B may not have yet come together to form such a composite structure, e.g., prior to the reading of text. The most general approach for managing the A, B, C, … is to assume that all items are potentially data values, not just metadata, and then to discover ontological relations such as A := B or A := C := F by unstructured data mining, being in part the process that defines which is the metadata. Where all nodes on a graph have unique names, one may find B := A where A is always associated with B, though not vice versa, suggesting that “All A are B.” Unstructured data analysis is not confined to ontology. In other instances, the fact that C is merely sometimes associated with B does not imply an ontological relationship. From the perspective of higher-order logic, an association is an existential relationship, e.g., “Some A are B,” while an ontological relationship is a universal one, e.g., “All A are B.”
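The distinction between an existential association (“Some A are B”) and a universal, candidate ontological relation (“All A are B”) can be sketched as a simple co-occurrence check over records. This is only an illustrative heuristic under stated assumptions: the function name `relation`, the labels it returns, and the toy records are all hypothetical, and a universal pattern in a finite sample is merely evidence for B := A, not proof.

```python
def relation(records, a, b):
    """Classify how item a relates to item b across records (sets of items)."""
    with_a = [r for r in records if a in r]
    if not with_a:
        return "no data"
    hits = sum(1 for r in with_a if b in r)
    if hits == len(with_a):
        return "universal"    # every record with a also has b: candidate b := a
    if hits > 0:
        return "existential"  # some records with a also have b: an association
    return "none"


# Hypothetical records, each reduced to a set of bare items.
records = [
    {"antibiotic", "sulfonamide"},
    {"antibiotic", "sulfonamide", "oral"},
    {"antibiotic", "oral"},
]

print(relation(records, "sulfonamide", "antibiotic"))
print(relation(records, "antibiotic", "sulfonamide"))
```

Here sulfonamide is always accompanied by antibiotic but not the other way round, which is the asymmetric pattern the text associates with antibiotic := sulfonamide.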

This building or choosing process for a composite datum is not always so easy. The human brain appears to handle concepts as a kind of concept map or semantic net, which is a graph that is used in a way that can handle uncertainty, e.g., probabilities. Representing and utilizing such a structure as efficiently as does a human is a holy grail of artificial intelligence and actually of data mining for rules and drawing inference from them. In the interim, in the absence of fulfillment of that goal, defining and mining composite data (with metadata) in the best way can pose conceptual challenges that are practical matters. A descriptor can be a specific path through the graph represented by metadata of various orders (not just first order) such as molecule := pharmaceutical := antibiotic := sulfonamide. A record could be transformed to a collection form with items representing several such paths as descriptors (and thus separated by &s). The data mining then “merely” has to extract data leading to a terminal leaf node item such as “sulfonamide” to identify a descriptor. In this case, one is told or assumes that the structure is purely ontological (specifically, taxonomic). But there is, regarding the semantic net that the human brain somehow holds, more than one way to relate it to a practical graph for data analysis. One might have substance_abuse := legal_substance_abuse := tobacco := cigarettes := emphysema. In such a case, one must extract indirectly linked combinations including, for example, substance_abuse := tobacco, and worse still, need to recognize it as analogous to simpler useful entries in isolation such as smoker := yes.
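Extracting the terminal leaf and the indirectly linked combinations from a path descriptor can be sketched as follows. The helper names `path_elements`, `leaf`, and `linked_pairs` are assumptions; enumerating every ordered pair along the path is one simple reading of "indirectly linked combinations," and recognizing that a pair is analogous to smoker := yes is not attempted here.

```python
from itertools import combinations


def path_elements(descriptor):
    """Split a path descriptor into its elements, root first."""
    return [p.strip() for p in descriptor.split(":=")]


def leaf(descriptor):
    """Return the terminal leaf node item of the path."""
    return path_elements(descriptor)[-1]


def linked_pairs(descriptor):
    """All directly or indirectly linked pairs, preserving path order."""
    elems = path_elements(descriptor)
    return [f"{a} := {b}" for a, b in combinations(elems, 2)]


taxonomy = "molecule := pharmaceutical := antibiotic := sulfonamide"
chain = ("substance_abuse := legal_substance_abuse := tobacco"
         " := cigarettes := emphysema")

print(leaf(taxonomy))
pairs = linked_pairs(chain)
print("substance_abuse := tobacco" in pairs)
```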

To paraphrase the above, thinking about data and metadata in the above way provides a flexible, though not traditional, way to think about proceeding.

Once a composite datum is constructed, including any metadata and higher-order metadata (involving several := symbols), it represents one of the data in the bag or collection form. To some extent, the data mining can be conveniently phased: the data, whether structured, unstructured like images and text, or both, are converted first to bag form, and then analysis proceeds again starting with that form. The first phase is not considered much here because it is (arguably) starting to fall in the realm of unstructured data analysis and specifically analysis of written text. We are interested in the next step: considering what we do with structured data.
