4.4 Correlations and Problem Definition
4.4.1 Correlation between a Grouping and a Feature
i is correlated
with $f_j$, objects in the same object subset of the partition tend to have similar $f_j$ values,
while objects from different object subsets tend to have different $f_j$ values. We use the data matrix in Figure 4.2 as an example. Consider feature $f_3$ and the two groupings indicated by trees $T^{\{f_1\}}$ and $T^{\{f_2\}}$ in Figure 4.3:
• $G^{\{f_1\}}_8 = \{\{s_2, s_3\}, \{s_4, s_5\}, \{s_7, s_8\}\}$
• $G^{\{f_2\}}_1 = \{\{s_2, s_5\}, \{s_6, s_8\}, \{s_3, s_4\}\}$
Their $f_3$ values are plotted in Figure 4.4, using different markers to represent objects in different subsets of a grouping. As we can see, grouping $G^{\{f_1\}}_8$ is more correlated with $f_3$ than $G^{\{f_2\}}_1$.
We use normalized RSS (residual sum of squares) and BIC (Bayesian information criterion) to measure the correlation between a grouping and a feature.
Figure 4.4: Correlation between feature $f_3$ and groupings $G^{\{f_1\}}_8$ and $G^{\{f_2\}}_1$
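Given a grouping $G^F_i = \{S_1, S_2, \ldots, S_u\}$ with $S = \bigcup_{l=1}^{u} S_l$ and a feature $f_j$, we calculate the total variance $SST$, the variance between object subsets $SSB$, and the variance inside object subsets $SSE$ as follows:
\[ M(S_l) = \frac{1}{|S_l|} \sum_{s_k \in S_l} f_j(s_k), \qquad MM = \frac{1}{|S|} \sum_{s_k \in S} f_j(s_k) \tag{4.1} \]
\[ SSB = \sum_{l=1}^{u} |S_l| \, (M(S_l) - MM)^2 \tag{4.2} \]
\[ SSE = \sum_{l=1}^{u} \sum_{s_k \in S_l} (f_j(s_k) - M(S_l))^2 \tag{4.3} \]
\[ SST = \sum_{s_k \in S} (f_j(s_k) - MM)^2 = SSE + SSB \tag{4.4} \]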
Then we get the normalized RSS
\[ RSS(G^F_i, f_j) = \frac{SSE}{SST} = \frac{SSE}{SSE + SSB} \tag{4.5} \]
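As a rough illustration of Equations 4.1 to 4.5, the following Python sketch computes SSE, SSB, SST and the normalized RSS for one grouping. The function names and the feature values are hypothetical (the values are not taken from Figure 4.2), and the sketch assumes the feature is not constant over the grouped objects, so that SST is nonzero.

```python
from typing import Dict, List, Tuple

def rss_components(groups: List[List[str]], values: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (SSE, SSB, SST) of one feature over a grouping (Equations 4.1-4.4)."""
    members = [s for g in groups for s in g]
    grand_mean = sum(values[s] for s in members) / len(members)  # MM
    sse = 0.0
    ssb = 0.0
    for g in groups:
        mean = sum(values[s] for s in g) / len(g)                # M(S_l)
        sse += sum((values[s] - mean) ** 2 for s in g)           # within-subset part
        ssb += len(g) * (mean - grand_mean) ** 2                 # between-subset part
    sst = sum((values[s] - grand_mean) ** 2 for s in members)
    return sse, ssb, sst

def normalized_rss(groups: List[List[str]], values: Dict[str, float]) -> float:
    """Equation 4.5: SSE / SST. Undefined when SST is zero (constant feature)."""
    sse, _, sst = rss_components(groups, values)
    return sse / sst

# Hypothetical feature values, chosen only for illustration.
values = {"s2": 1.0, "s3": 1.2, "s4": 3.0, "s5": 3.4, "s7": 6.0, "s8": 6.2}
grouping = [["s2", "s3"], ["s4", "s5"], ["s7", "s8"]]

sse, ssb, sst = rss_components(grouping, values)
rss = normalized_rss(grouping, values)
```

On these toy values the subsets are tight and well separated, so SSE is small relative to SST and the normalized RSS is close to 0.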
For example, given the two groupings and the feature in Figure 4.4, their RSS scores are
• $RSS(G^{\{f_1\}}_8, f_3) = \frac{3.57}{28.93} = 0.12$
• $RSS(G^{\{f_2\}}_1, f_3) = \frac{33.14}{34.14} = 0.97$
As we can see, a lower RSS value implies a stronger correlation.
Property 4.4.1. Monotonicity of RSS: Given two partitions $G^F_i$ and $G^F_j$, if $G^F_j$ is a child (finer) partition of $G^F_i$, then for any feature $f_u$ ($f_u \in \mathcal{F} - F$) we have $RSS(G^F_i, f_u) \geq RSS(G^F_j, f_u)$.
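Proof: Because $G^F_j$ is a child (finer) partition of $G^F_i$, $G^F_i$ and $G^F_j$ contain the same set of objects. Since $SST$ depends only on the objects and not on how they are grouped, we have
\[ SST(G^F_i, f_u) = SST(G^F_j, f_u) \]
For each subset $S_{il}$ in $G^F_i$:
• If $\exists S_{jt} \in G^F_j$ such that $S_{il} = S_{jt}$, then
\[ \sum_{s_k \in S_{il}} (f_u(s_k) - M(S_{il}))^2 = \sum_{s_k \in S_{jt}} (f_u(s_k) - M(S_{jt}))^2 \]
• If $S_{il}$ is partitioned into $\{S_{j1}, S_{j2}, \ldots, S_{jw}\}$ in $G^F_j$, then
\[ \sum_{s_k \in S_{il}} (f_u(s_k) - M(S_{il}))^2 = \sum_{x=1}^{w} \sum_{s_k \in S_{jx}} (f_u(s_k) - M(S_{jx}))^2 + \sum_{x=1}^{w} |S_{jx}| \, (M(S_{jx}) - M(S_{il}))^2 \]
The second term on the right-hand side is nonnegative. Thus,
\[ \sum_{s_k \in S_{il}} (f_u(s_k) - M(S_{il}))^2 \geq \sum_{x=1}^{w} \sum_{s_k \in S_{jx}} (f_u(s_k) - M(S_{jx}))^2 \]
Therefore, according to Equation 4.3, we have
\[ SSE(G^F_i, f_u) \geq SSE(G^F_j, f_u) \]
Since $RSS = \frac{SSE}{SST}$, we get
\[ RSS(G^F_i, f_u) \geq RSS(G^F_j, f_u) \]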
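The monotonicity property can also be checked numerically. The sketch below is only a sanity check under stated assumptions, not part of the proof: it draws random hypothetical feature values, evaluates the normalized RSS of a coarse grouping and of one of its refinements, and counts how often the coarse grouping scores lower. The object names reuse the example groupings, while the values, the seed and the helper name `normalized_rss` are illustrative.

```python
import random

def normalized_rss(groups, values):
    """RSS = SSE / SST for a grouping given as a list of lists of object ids."""
    members = [s for g in groups for s in g]
    grand_mean = sum(values[s] for s in members) / len(members)
    sse = 0.0
    for g in groups:
        mean = sum(values[s] for s in g) / len(g)
        sse += sum((values[s] - mean) ** 2 for s in g)
    sst = sum((values[s] - grand_mean) ** 2 for s in members)
    return sse / sst

# The fine grouping splits the second subset of the coarse grouping in two.
coarse = [["s2", "s3"], ["s4", "s5", "s7", "s8"]]
fine = [["s2", "s3"], ["s4", "s5"], ["s7", "s8"]]

rng = random.Random(0)  # fixed seed so the check is repeatable
violations = 0
for _ in range(1000):
    # Hypothetical random feature values for the six objects.
    values = {s: rng.uniform(0.0, 10.0) for g in coarse for s in g}
    # Small tolerance guards against floating-point rounding only.
    if normalized_rss(coarse, values) < normalized_rss(fine, values) - 1e-12:
        violations += 1
```

If the property holds, `violations` stays at zero for every draw, because refining a subset can only remove the between-child term from SSE.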
According to Property 4.4.1, if we measure correlation using RSS only, the finest groupings always have the lowest RSS score and hence appear to have the strongest correlation. To correct this bias toward the finest groupings, we penalize the score by the number of subsets in the grouping, since a finer grouping must contain a larger number of object subsets.
We utilize BIC (Bayesian information criterion) (Schwarz, 1978) to define the correlation between a grouping and a feature, taking into account RSS, the number of subsets, and the total number of objects in the grouping.
Definition 4.4.1. BIC Correlation between a Grouping and a Feature: Given grouping $G^F_i$ and feature $f_j$, $f_j \in \mathcal{F} - F$, the correlation between them is defined as
\[ C(G^F_i, f_j) = \log(RSS(G^F_i, f_j)) + u \cdot \frac{\log(|S|)}{|S|} \tag{4.6} \]
where $u$ is the number of object subsets in the grouping and $|S|$ is the total number of objects in the grouping. The worked values in this section are consistent with base-2 logarithms.
A lower $C(G^F_i, f_j)$ value indicates a stronger correlation between grouping $G^F_i$ and feature $f_j$. For example, the correlation scores in Figure 4.4 are
• $C(G^{\{f_1\}}_8, f_3) = \log(0.12) + 3 \cdot \frac{\log(6)}{6} = -1.76$
• $C(G^{\{f_2\}}_1, f_3) = \log(0.97) + 3 \cdot \frac{\log(6)}{6} = 1.25$
The correlation score between $G^{\{f_1\}}_8$ and $f_3$ is much lower than the score between $G^{\{f_2\}}_1$ and $f_3$.
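The two scores can be reproduced with a few lines of Python. This is a minimal sketch, assuming base-2 logarithms (an inference from the quoted numbers, since the text does not state the base) and taking the rounded RSS values 0.12 and 0.97 as inputs. The function name `bic_correlation` is hypothetical.

```python
import math

def bic_correlation(rss: float, u: int, n: int) -> float:
    """Equation 4.6: log(RSS) + u * log(|S|) / |S|.

    Assumption: the logarithm is base 2, which matches the worked values.
    rss must be strictly positive, otherwise the logarithm is undefined.
    """
    return math.log2(rss) + u * math.log2(n) / n

# Both example groupings have u = 3 subsets over |S| = 6 objects.
c_f1 = bic_correlation(0.12, u=3, n=6)
c_f2 = bic_correlation(0.97, u=3, n=6)
```

With these inputs `c_f1` evaluates to roughly -1.77 and `c_f2` to roughly 1.25, which agrees with the values above up to rounding.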
The incorporation of $u$ (the number of subsets in a grouping) into Equation 4.6 avoids the case where the finest groupings always have the lowest score. When two groupings have similar RSS scores, the BIC correlation favors the one containing the smaller number of object subsets. Consider two groupings that we used as examples before:
• $G^{\{f_1\}}_8 = \{\{s_2, s_3\}, \{s_4, s_5\}, \{s_7, s_8\}\}$
• $G^{\{f_1\}}_6 = \{\{s_2, s_3\}, \{s_4, s_5, s_7, s_8\}\}$
$G^{\{f_1\}}_8$ is a finer grouping than $G^{\{f_1\}}_6$. Their RSS values are similar:
• $RSS(G^{\{f_1\}}_8, f_3) = 0.12$
• $RSS(G^{\{f_1\}}_6, f_3) = 0.13$
However, since $G^{\{f_1\}}_6$ contains fewer subsets, its BIC correlation score is lower than that of $G^{\{f_1\}}_8$:
• $C(G^{\{f_1\}}_8, f_3) = -1.76$
• $C(G^{\{f_1\}}_6, f_3) = -2.08$
The reason for incorporating $|S|$ (the total number of objects (samples) in a grouping) into Equation 4.6 is straightforward: a correlation supported by a large number of objects is more reliable than one supported by a smaller number of objects.
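To make the roles of $u$ and $|S|$ concrete, the following sketch isolates the penalty term $u \cdot \log(|S|)/|S|$ of Equation 4.6. It assumes base-2 logarithms, as suggested by the worked values. The helper name `bic_penalty` and the object counts 60 and 600 are hypothetical and chosen only for illustration.

```python
import math

def bic_penalty(u: int, n: int) -> float:
    """Penalty term of Equation 4.6, with base-2 logarithm (an assumption)."""
    return u * math.log2(n) / n

# Effect of u: going from 2 to 3 subsets over 6 objects adds this much.
extra_subset_cost = bic_penalty(3, 6) - bic_penalty(2, 6)

# The finer grouping lowers log(RSS) only slightly (0.13 -> 0.12).
rss_gain = math.log2(0.13) - math.log2(0.12)

# Effect of |S|: for a fixed u = 3, the penalty shrinks as objects are added.
penalties = [bic_penalty(3, n) for n in (6, 60, 600)]
```

Here the cost of the extra subset (about 0.43) exceeds the gain in log RSS (about 0.12), which is why the coarser grouping receives the lower score. For a fixed number of subsets, the penalty also decreases as the number of objects grows.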
As we discussed in Section 4.3, each tree hierarchy $T^F$ implies a set of groupings $\{G^F\}$. Similar to TreeQA, we define the correlation between tree $T^F$ and feature $f_i$, $f_i \in \mathcal{F} - F$, as the strongest correlation achieved between a grouping in $\{G^F\}$ and feature $f_i$.
Definition 4.4.2. Correlation between a Tree and a Feature: Given tree $T^F$ and feature $f_i$, $f_i \in \mathcal{F} - F$, the correlation between them is
\[ C(T^F, f_i) = \min\{ C(G^F_j, f_i) \mid G^F_j \in \{G^F\} \} \tag{4.7} \]
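Definition 4.4.2 amounts to a minimum over the groupings implied by a tree. The sketch below shows one way this could be computed, assuming the set of groupings has already been enumerated and is passed in as a list. How the groupings are derived from the tree is not covered here. The function names, the feature values and the base-2 logarithm are assumptions made for illustration.

```python
import math
from typing import Dict, List

Grouping = List[List[str]]

def bic_correlation(grouping: Grouping, values: Dict[str, float]) -> float:
    """Equation 4.6 for one grouping; assumes base-2 logs and 0 < SSE, 0 < SST."""
    members = [s for subset in grouping for s in subset]
    grand_mean = sum(values[s] for s in members) / len(members)
    sse = 0.0
    for subset in grouping:
        mean = sum(values[s] for s in subset) / len(subset)
        sse += sum((values[s] - mean) ** 2 for s in subset)
    sst = sum((values[s] - grand_mean) ** 2 for s in members)
    n, u = len(members), len(grouping)
    return math.log2(sse / sst) + u * math.log2(n) / n

def tree_correlation(groupings: List[Grouping], values: Dict[str, float]) -> float:
    """Equation 4.7: the lowest (strongest) score over the tree's groupings."""
    return min(bic_correlation(g, values) for g in groupings)

# Hypothetical feature values and two groupings assumed to come from one tree.
values = {"s2": 1.0, "s3": 1.2, "s4": 3.0, "s5": 3.4, "s7": 6.0, "s8": 6.2}
fine = [["s2", "s3"], ["s4", "s5"], ["s7", "s8"]]
coarse = [["s2", "s3"], ["s4", "s5", "s7", "s8"]]

best = tree_correlation([fine, coarse], values)
```

On these toy values the finer grouping separates the objects far better, so its much lower RSS outweighs the penalty for the extra subset and it determines the tree's score.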