The optimal solution for the WTGPP is an undirected graph G' = (V, E') that follows the transitivity rule, i.e. all of its connected components are cliques. The clusters induced by this graph are exactly these cliques. The similarity threshold serves as the density parameter of this approach and hence determines the number of clusters and their sizes. Since the WTGPP is NP-hard and some applications induce large problem instances, the main approach presented here uses heuristic methods. In TransClust, a clustering environment based on TC, a combination of heuristic and exact methods is applied to find a close-to-optimal solution for each problem in reasonable time. Currently available algorithms to solve the WTGPP are described in Section 3.4. The clustering framework TransClust, which integrates most of these algorithms, is presented in the next chapter.
Advantages of TC over other approaches are its flexibility and its intuitive density parameter. The similarity threshold corresponds directly to the chosen similarity function, so choosing a threshold defines what is considered "similar enough". Only a few changes, namely the edge additions and deletions of the WTGPP, are necessary to detect outliers and produce homogeneous clusters. Edges between elements whose similarity is close to the threshold are more likely to be changed, since the costs of these modifications are low. The following property of a partitional clustering with TC gives an impression of the clustering results and thus helps to specify an appropriate threshold.
Lemma 3.5. Let C = {C1, ..., Cm} be the cliques/clusters of a solution G' = (V, E') for a given WTGPP with threshold t and similarity function sim. Then:
(i) The mean similarity between an element u and all other elements of its clique Cu is greater than or equal to the threshold t for all elements u ∈ V.
(ii) The mean similarity between all elements of one cluster Ci is greater than or equal to t for all cliques Ci ∈ C.
Proof. (ii) is a direct consequence of (i). To prove (i), the negated proposition is assumed and led to a contradiction. Let u be an element of the cluster Ci of size |Ci| ≥ 2. Assume the mean similarity between u and all other elements of Ci is below t:
\[
\operatorname{meansim}(u, C_i) = \frac{1}{|C_i| - 1} \sum_{v \in C_i \setminus \{u\}} \operatorname{sim}(uv) < t
\;\Leftrightarrow\;
\sum_{v \in C_i \setminus \{u\}} \operatorname{sim}(uv) < t \cdot (|C_i| - 1)
\]
Then C' = {C1, ..., Ci \ {u}, ..., Cm, {u}} is also a decomposition of the elements into cliques and hence a candidate solution for the underlying WTGPP. The costs of C' can be calculated by taking the costs incurred to build C and adding the costs of removing all edges between u and the remaining elements of Ci. Note that these additional costs are negative for edges that did not exist in the initial graph and had to be added to create Ci. Using the assumption that the mean similarity between u and all other elements of Ci is below t, the cost difference costs(C') − costs(C) is consequently:
\[
\sum_{v \in C_i \setminus \{u\}} \bigl(\operatorname{sim}(uv) - t\bigr)
= \sum_{v \in C_i \setminus \{u\}} \operatorname{sim}(uv) - t \cdot (|C_i| - 1) < 0
\]
This contradicts the assumption that C is an optimal solution for the WTGPP, since there exists a decomposition into cliques with lower costs.
No comparable guarantee holds for the average similarity between an object and all the objects of a foreign cluster. The following example illustrates that an element may have a mean similarity above the threshold to all elements of a different cluster.
Example 3.6. Let V = {a, b, c} be the elements of interest. Let the similarities between these elements be sim(ab) = 0.5, sim(ac) = 0, and sim(bc) = 1. For a threshold t = 0.4, the clustering obtained by solving the corresponding WTGPP is C = {C1, C2} = {{a}, {b, c}}. The mean similarity between the objects within each cluster is above the threshold (sim(bc) = 1 for C2), and the mean similarity between the two clusters is below the threshold:
\[
\frac{\operatorname{sim}(ab) + \operatorname{sim}(ac)}{2} = 0.25 < t
\]
However, the mean similarity between b, a single element of C2, and all elements of the other cluster C1 = {a} is sim(ab) = 0.5 and hence above the threshold.
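Example 3.6 is small enough to be verified by exhaustive enumeration. The following Python sketch is an illustration only and is not one of the algorithms of Section 3.4. It assumes the cost model used in the proofs above: adding a missing edge costs t − sim(uv) and deleting an existing edge costs sim(uv) − t. The names set_partitions and wtgpp_cost are hypothetical helpers introduced for this illustration.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def wtgpp_cost(partition, sim, t):
    """Assumed cost model: pay |sim(uv) - t| for every edge that is changed."""
    cluster_of = {v: i for i, block in enumerate(partition) for v in block}
    cost = 0.0
    for u, v in combinations(sorted(cluster_of), 2):
        s = sim[frozenset((u, v))]
        same_cluster = cluster_of[u] == cluster_of[v]
        if same_cluster and s < t:
            cost += t - s  # missing edge has to be added
        elif not same_cluster and s > t:
            cost += s - t  # existing edge has to be deleted
    return cost


# Similarities and threshold of Example 3.6.
sim = {frozenset("ab"): 0.5, frozenset("ac"): 0.0, frozenset("bc"): 1.0}
t = 0.4

# Exhaustive search over all five partitions of {a, b, c}.
best = min(set_partitions(list("abc")), key=lambda p: wtgpp_cost(p, sim, t))
print(sorted(sorted(block) for block in best), round(wtgpp_cost(best, sim, t), 3))
```

Under these assumptions the enumeration selects {{a}, {b, c}}: only the edge ab has to be deleted, at cost 0.5 − 0.4 = 0.1, whereas merging all three elements would cost 0.4 for adding the edge ac.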
It is possible though to make a statement about the average similarity between two clusters.
Lemma 3.7. Let C = {C1, ..., Cm} be the cliques of a solution for a given WTGPP with threshold t and similarity function sim. The mean similarity between two cliques Ci and Cj is less than or equal to the threshold t for all 1 ≤ i < j ≤ m.
Proof. Again, the lemma is proven by assuming the negated proposition and leading it to a contradiction. Let Ci and Cj with i ≠ j be cliques with average similarity above the threshold t. The decomposition of the objects into cliques C' = (C \ {Ci, Cj}) ∪ {Ci ∪ Cj} is a candidate solution for the WTGPP. The costs of C' can again be calculated by taking the costs of C and adding all costs for adding the connecting edges between Ci and Cj:
\[
\operatorname{costs}(C') = \operatorname{costs}(C) + \sum_{u \in C_i} \sum_{v \in C_j} \bigl(-\operatorname{sim}(uv) + t\bigr)
\]
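To obtain a contradiction to the assumption that C is a solution for the WTGPP, all that remains is to show that the second term is below zero. This can be derived from the initial assumption that the average similarity between Ci and Cj is above the threshold:
\[
\begin{aligned}
\operatorname{meansim}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{u \in C_i} \sum_{v \in C_j} \operatorname{sim}(uv) &> t \\
\Leftrightarrow\quad \sum_{u \in C_i} \sum_{v \in C_j} \operatorname{sim}(uv) - t \cdot (|C_i| \cdot |C_j|) &> 0 \\
\Leftrightarrow\quad \sum_{u \in C_i} \sum_{v \in C_j} \bigl(\operatorname{sim}(uv) - t\bigr) &> 0 \\
\Leftrightarrow\quad \sum_{u \in C_i} \sum_{v \in C_j} \bigl(-\operatorname{sim}(uv) + t\bigr) &< 0
\end{aligned}
\]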
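Both lemmas can also be checked empirically on small random instances. The following Python sketch is a sanity check under stated assumptions, not a proof and not part of TransClust. It solves each instance exactly by brute force, using the cost model from the proofs (t − sim(uv) for an added edge, sim(uv) − t for a deleted edge), and then tests Lemma 3.5 (i) and the bound of Lemma 3.7 in the form "the mean similarity between two clusters does not exceed t". All function names, the instance size, and the tolerance eps are hypothetical choices made for this illustration.

```python
import random
from itertools import combinations


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition


def wtgpp_cost(partition, sim, t):
    """Assumed cost model: pay |sim(uv) - t| for every edge that is changed."""
    cluster_of = {v: i for i, block in enumerate(partition) for v in block}
    cost = 0.0
    for u, v in combinations(sorted(cluster_of), 2):
        s = sim[frozenset((u, v))]
        same_cluster = cluster_of[u] == cluster_of[v]
        if same_cluster and s < t:
            cost += t - s  # edge added
        elif not same_cluster and s > t:
            cost += s - t  # edge deleted
    return cost


def mean_sim(us, vs, sim):
    """Mean similarity over all pairs (u, v) with u in us, v in vs, u != v."""
    pairs = [(u, v) for u in us for v in vs if u != v]
    return sum(sim[frozenset(p)] for p in pairs) / len(pairs)


def check_lemmas(n=5, t=0.5, trials=200, seed=0, eps=1e-9):
    """Return True if no random instance violates Lemma 3.5 (i) or Lemma 3.7."""
    rng = random.Random(seed)
    nodes = list(range(n))
    for _ in range(trials):
        sim = {frozenset(p): rng.random() for p in combinations(nodes, 2)}
        # Exact solution by exhaustive search (feasible only for tiny n).
        best = min(set_partitions(nodes), key=lambda p: wtgpp_cost(p, sim, t))
        # Lemma 3.5 (i): each element is on average similar enough to its clique.
        for block in best:
            for u in block:
                others = [v for v in block if v != u]
                if others and mean_sim([u], others, sim) < t - eps:
                    return False
        # Lemma 3.7: no pair of clusters has a mean similarity above t.
        for ci, cj in combinations(best, 2):
            if mean_sim(ci, cj, sim) > t + eps:
                return False
    return True


print(check_lemmas())
```

The exhaustive search grows with the Bell number of n, so this check is only practical for a handful of elements; the tolerance eps guards against floating-point rounding when several partitions have nearly identical costs.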
3.3 Extensions
In order to improve the clustering results, one can modify the WTGPP. One option is to include existing knowledge. For instance, objects that are known to belong to the same cluster can be marked as inseparable, or a second threshold may specify a limit above which two elements are forced into the same cluster. Another extension of TC is to compute a hierarchical or an overlapping clustering.
3.3.1 Upper and lower bounds
In some cases it is useful to force some elements to be in one cluster. If two elements are more similar than a second threshold tu (upper bound), they can be considered as "similar enough" to be inseparable. For instance, proteins whose best bidirectional BLAST E-value is below 10^-200 can be forced to belong to the same cluster. The