PHARMACEUTICAL INDUSTRY
2.1.8 The Datum, Element of Information
When discussing the information content of a human being, the numbers obtained relate to storage capacity, much as one would talk about a hard drive or the amount of RAM in a computer. A measure of information that is capable of imparting knowledge is information about something. Unlike with a computer, the associated information content does not come written on (or in a manual for) that something; instead, we must form opinions, statistically rigorous or otherwise, based on multiple occurrences of it.
The basic something that is collected for analysis, say the datum, is variously called an entry, item, observation, measurement, parameter, quality, property, event, or state. When discussing matters like patient records, the term entry is usually used. When discussing more abstract matters, and by analogy with quantum mechanics (QM) and statistical mechanics, the term state is frequently used, perhaps even when the intended meaning is the measurable value of a property of a state. An example of a datum is the weight of a patient.
In the most general definition of a biomarker, a biomarker is simply a datum and vice versa, though often the term “biomarker” is reserved for genomic, proteomic, image, and clinical laboratory data for a patient.
Structured data mining (in contrast to unstructured data mining, which addresses text and images) places emphasis not only on the datum but also on the record. The patient record, including specifically the clinical trial patient record, is an excellent example. The patient or the arbitrary unique patient identifier is a kind of true underlying state, analogous to an eigenvalue in QM, of which there may be many observable properties or qualities over time. The datum represents such properties or qualities and corresponds to an entry on the record for that patient, such as patient name and identification (if the record is not “anonymized”), date of birth, age, ethnic group, weight, laboratory work results, outcomes of treatment, and so forth. They are observables of that patient. In the complicated world of data analytics, including data mining, it is good news that in many respects the above clinical examples of a datum all describe a form that can, for present purposes, be treated in the same way, as discussed in the following section. Better still, anything can have a record. A molecule can also have a record with entries on it, for example, indicating a molecular weight of 654. When considering theoretical aspects related to prior belief and its impact on statistics, then even more generally, a record is any kind of data structure that contains that entry, even if it is only a transient repository like the short-term working memory in our heads. The terms observation and measurement do imply a distinction, as something that is done before an entry is placed in a record, such as the moment it is found that the patient's weight is 200 lb. However, for analysts of other people's data, and for present purposes, a datum only comes into existence when we get our hands on a record and inspect it: that is an observation of a sort for the data analyst.
The set of records is an archive, and the order of records in it is immaterial, except of course when they are separated into specific cohorts or subcohorts, in which case each cohort relates to a distinct archive wherein the order is again immaterial. Perhaps contrary to the reader's expectations, the entries on each record can be rendered immaterial with respect to order on the record, as discussed below, though a meaning can be attached to entries that occur more than once on the record, as, say, multiple measurements with error (see below). In contrast, records cannot recur. Even if the record is anonymized, it has an implied unique index (analogous to an eigenvalue), which may simply be its arbitrary position in the list of records that comprises the archive. A duplicate entry such as Hispanic on different records is, however, considered the same state; it is just that it is associated with a particular patient, implicitly or explicitly, in each record. Each occurrence is an incidence of that state. More importantly, an entry is associated with all of the other entries on a record. Association analysis, which quantifies that association as a kind of statistical summary over many records, is a key feature of data mining, both structured and unstructured.
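The record and archive described above can be sketched as a small data structure. This is only an illustrative sketch in Python: the names `archive` and `incidence`, the dictionary layout, and all sample entries are assumptions introduced here, not part of the original text.

```python
from collections import Counter

# Hypothetical anonymized archive: implied unique record index -> entries.
# A Counter is a multiset, so the order of entries on a record is immaterial,
# while an entry that occurs more than once keeps its multiplicity.
archive = {
    0: Counter(["ethnic group: Hispanic", "sex: female",
                "systolic BP: high", "systolic BP: high"]),
    1: Counter(["sex: male", "ethnic group: Hispanic"]),
    2: Counter(["sex: female", "ethnic group: Asian"]),
}


def incidence(archive, state):
    """Count the records on which `state` appears at least once."""
    return sum(1 for entries in archive.values() if entries[state] > 0)


print(incidence(archive, "ethnic group: Hispanic"))  # same state on two records
print(archive[0]["systolic BP: high"])               # duplicate entry on one record
```

Using the dictionary key as the unique index reflects the point that records cannot recur, whereas the same state may appear on many records.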
Above all things, a datum is an observable, ideally based on an unambiguous state as in physics, though with the following two caveats (provided that redundant information is removed in subsequent inference). First, a degenerate state, such as the blood pressure being greater than a specified level, is allowable, whereas it is not in the world of QM. Second, states that show degrees of distinguishability (from none to complete distinguishability) are allowed. As the above examples imply, the observable may be qualitative or quantitative.
If it is qualitative and distinguishable by recurrence, it is countable; if it is quantitative, it is measurable. The counting implied in countable is typically over the analogous state. Because states can be degenerate, a range of values, e.g., of blood hemoglobin, can be used to represent a state, e.g., the state of being in the normal range for hemoglobin, and can be counted. Measurements that relate to the same state distinguishable by recurrence can also be counted. An event can also be considered as the appearance of a state, or of a measurable value of a property of that state, distinguishable by recurrence and ideally qualified or “time stamped” at a moment of time or a range of time.
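As a hedged illustration of how a measurable quantity can be turned into countable degenerate states, the sketch below partitions hemoglobin values into low, normal, and high states and counts them. The cutoffs, the sample values, and the function name `hemoglobin_state` are hypothetical and chosen only for illustration; they are not clinical reference values.

```python
from collections import Counter

# Hypothetical cutoffs, for illustration only (not clinical reference values).
LOW_CUTOFF = 12.0
HIGH_CUTOFF = 16.0


def hemoglobin_state(value):
    """Map a measured value onto one of three degenerate states."""
    if value < LOW_CUTOFF:
        return "hemoglobin: low"
    if value > HIGH_CUTOFF:
        return "hemoglobin: high"
    return "hemoglobin: normal"


# Made-up measurements, one per record.
measurements = [11.2, 13.5, 14.1, 16.8, 12.9]

# Each state now covers a range of values and can simply be counted.
state_counts = Counter(hemoglobin_state(v) for v in measurements)
print(state_counts)
```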
A measurement may not yield the same value twice or more due to error. An error is a process such that the measured values are random when applied to the same state, or to what is considered the same state, but are random in such a way that the mean square difference between measurements in an indefinitely large sampling set of measured values is not considered significantly different from that for many subsequent sampling sets of an indefinitely large number of measured values. A state that shows continuity in time but with a change in the measurable value of a property of it that is not attributable to error is not strictly the same state but represents an evolution of the previous state. However, a state that represents an evolution of a state or shows multiple occurrences at the same time may be held to be the same state in an elected context, even if there are means to distinguish it outside that context.
This is such that we may, for example, consider the patients in a cohort as subjected to repeated measurements on the same state and amenable to statistical analysis based on the concept of error in observation, even though there are means to distinguish those patients. The model here is that measurements on different states are treated as representing repeated measurements on the same state with error, in which case the notion of the normal (Gaussian, bell curve) distribution applies until proven otherwise. The mean or average value is the expectation or expected value of the measured property, and the variance in the values from that expectation is a function of the magnitude of the error, specifically the mean square value. In practice, sometimes with the same raw data, account is taken of patient differences. For example, pharmacogenomics requires us to distinguish patients by their genomic characteristics, and if that is done, only patients with the same selected genomic features are treated as the same state (see below).
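The expectation and the mean square deviation from it can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the weights are made-up values, and treating one cohort's measurements as repeated measurements on the same state is the modeling assumption described above.

```python
from statistics import fmean, pvariance

# Made-up weights (lb) for patients in one cohort, treated as repeated
# measurements on the same state with error.
weights_lb = [198.0, 202.0, 200.0, 204.0, 196.0]

# Expectation (mean) of the measured property.
expectation = fmean(weights_lb)

# Variance as the mean square deviation from that expectation.
mean_square_deviation = pvariance(weights_lb, mu=expectation)

print(expectation, mean_square_deviation)
```

`pvariance` divides by the number of values, which matches the mean square reading; `statistics.variance` would instead give the sample estimate that divides by one fewer.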
Countable states can be counted with one or more other states, so that the number of times that they occur together as opposed to separately is known. This usually means incrementing by one the counter function n(A) for any state A each time that state is encountered on a further record, and also n(A & B), n(B & F), n(A & B & C), n(C & F & Y & Z), and so on for all combinations of states with which it is associated on the record encountered. The functions with more than one argument, such as n(A & B) and n(A & B & C), represent the counting of concurrences of, here, A and B, and A, B, and C, respectively.
Combinatorial mathematics reveals that there are 2^N such counter functions to be considered for a record of N entries, though one usually writes 2^N − 1 since one of these relates to the potential empty record and hence null entry.
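The counting scheme can be sketched as follows. This is an illustrative Python sketch rather than a definitive implementation: the function name `update_counters`, the use of sorted tuples as keys, and the sample records are assumptions introduced here, and each record is treated as a set of distinct states for simplicity.

```python
from collections import Counter
from itertools import combinations


def update_counters(counters, record):
    """Increment n(...) for every non-empty combination of states on a record."""
    # Assumption: duplicate entries are collapsed to distinct states here.
    states = sorted(set(record))
    for size in range(1, len(states) + 1):
        for combo in combinations(states, size):
            counters[combo] += 1


# Made-up records, each a list of states.
records = [["A", "B", "C"], ["A", "B"], ["B", "F"]]

counters = Counter()
for record in records:
    update_counters(counters, record)

print(counters[("A",)])           # n(A)
print(counters[("A", "B")])       # n(A & B)
print(counters[("A", "B", "C")])  # n(A & B & C)
```

A record of N distinct states touches 2^N − 1 counters, so a practical version would presumably cap the combination size for long records.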
Because duplicate entries on a record can have meaning as discussed above, the counter function would be incremented n times for n duplicate entries.
When the value of the counter function is greater than zero, the occurrence of the state such as A or the concurrence of states such as A and B indicates that the states are existentially qualified, which means that the specified state exists or the specified states can coexist. For example, in terms of the PC discussed below, one can say that “Some A are B” and “Some B are A,” some meaning at least one. Computationally, that may be the first time that a counter function is created to handle those arguments (why waste space creating variables otherwise?); hence, from a programming perspective, counter functions for combinations never encountered are not of zero value but are undefined, which data mining interprets as zero.
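The treatment of undefined counters as zero maps naturally onto a sparse counter. The sketch below uses Python's `collections.Counter`, which returns 0 for a missing key without creating it; the helper name `exists` and the sample keys are hypothetical.

```python
from collections import Counter

# Counters are created only when a combination is first encountered.
counters = Counter()
counters[("A", "B")] += 1  # first time A and B are seen together


def exists(counters, *states):
    """True if the given states have been observed to occur together."""
    return counters[tuple(sorted(states))] > 0


print(exists(counters, "B", "A"))  # "Some A are B" and "Some B are A"
print(exists(counters, "A", "C"))  # never seen: undefined, read as zero
print(("A", "C") in counters)      # the lookup did not allocate a counter
```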
The number of concurrences observed, as indicated by the final or latest value of a counter function with more than one argument, is a raw measure of some degree of association between the states, here “degree of” meaning that a tendency to occur together (positive association), random occurrence (zero association), and a tendency to avoid each other (negative association) are all degrees of association. A crude measure with these features, which can be thought of as associated primarily with the n(A & B) counter function, is the ratio N × n(A & B)/[n(A) × n(B)], where N is the normalizing total amount of appropriate data. The logarithm of this measure may be positive, zero, or negative, which relates to the notion of positive, zero, and negative association. As noted above, because states can show degeneracy, continuous values can be partitioned into states (e.g., low, normal, and high values in clinical laboratory measurements) and can be counted, including together with other states. Association can thus be applied to both qualitative and quantitative data.
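A small worked sketch of this crude measure follows. The function name `log_association` and the counts are made up for illustration, and N is assumed here to be the total number of records; the natural logarithm is an arbitrary choice of base.

```python
import math


def log_association(n_ab, n_a, n_b, total):
    """Log of N * n(A & B) / (n(A) * n(B)); undefined when n(A & B) is zero."""
    if n_ab == 0:
        raise ValueError("n(A & B) is zero, so the logarithm is undefined")
    return math.log(total * n_ab / (n_a * n_b))


# With N = 100, n(A) = 20, and n(B) = 50, chance alone predicts 10 concurrences.
print(log_association(20, 20, 50, 100))  # more than chance: positive
print(log_association(10, 20, 50, 100))  # exactly chance: zero
print(log_association(5, 20, 50, 100))   # less than chance: negative
```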
A related idea to association, but applying only to quantitative data, is that of a common trend in variance between lists of values, i.e., intervariance, covariance, or multivariance. Furthermore, because states can show degeneracy and degrees of distinguishability, the results of intervariance between values could be expressed in a fuzzy set approach so that the result looks analogous to the case when the values are partitioned into two states above and below a value, say, a mean value. Essentially, a Pearson correlation coefficient (which lies in the range −1 to +1) is rescaled by the number of values analyzed in such a way that the values for very strong positive or negative correlation cover the same range as the true association values. Since that aspect is “rigged,” “looks analogous” basically means only that a positive correlation reflects a positive association, a zero correlation reflects a zero association, and a negative correlation reflects a negative association, though data that reflect a strong linear regression will also show a strong correlation between the values from association interpretations and corresponding values from the corresponding covariance interpretations.
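The sign agreement between correlation and association can be illustrated on toy data. The sketch below does not implement the rescaling described above, whose exact form is not given here; it only compares the sign of a Pearson coefficient with the sign of the log association obtained after partitioning each list into states above and below its mean. All names and values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def log_association_above_mean(xs, ys):
    """Log association between the states 'x above mean' and 'y above mean'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    n_a = sum(x > mx for x in xs)
    n_b = sum(y > my for y in ys)
    n_ab = sum(x > mx and y > my for x, y in zip(xs, ys))
    # Fails if n_ab is zero; the toy data below avoid that case.
    return math.log(n * n_ab / (n_a * n_b))


xs = [1, 2, 3, 4, 5, 6]
ys_up = [2, 1, 4, 3, 6, 5]    # rises with xs
ys_down = [5, 6, 3, 4, 1, 2]  # falls as xs rises

print(pearson(xs, ys_up), log_association_above_mean(xs, ys_up))      # both positive
print(pearson(xs, ys_down), log_association_above_mean(xs, ys_down))  # both negative
```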
Also, we will take here the position that even continuous data, like a cardiogram, can be decomposed into datum elements for analysis. If a Fourier analysis is applied, implying that the information is captured as a wave, the parameters of that wave still each represent a datum. It is true that much data can appear in forms that have various degrees of structure by virtue of their interrelationships, having a graphic structure or representing arrays like medical images or lists (such as biosequences, entities on a spreadsheet, or a relational database), or data types called sets and collections. However, these distinctions are an illusion to the extent that each datum in such data can be represented in a form that can meaningfully stand alone (see next section).
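As a hedged illustration of the Fourier point, the sketch below decomposes a short synthetic signal with a plain discrete Fourier transform and lists each resulting amplitude as a separate labeled datum. The signal, the labels, and the function name `dft` are assumptions made for this example; a real cardiogram would need proper sampling and preprocessing.

```python
import cmath
import math


def dft(samples):
    """Plain discrete Fourier transform of a list of real samples."""
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
        for k in range(n)
    ]


# Synthetic signal: exactly one sine cycle over eight samples.
samples = [math.sin(2 * math.pi * t / 8) for t in range(8)]

# Each wave parameter becomes a datum that carries its own label.
data = [
    (f"amplitude at frequency index {k}", abs(c) / len(samples))
    for k, c in enumerate(dft(samples))
]

for label, value in data:
    print(label, round(value, 6))
```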