
3.5 Examples and Results

3.5.1 Artificial 2-D Data Set

For demonstration purposes, the first example data set is two-dimensional. This makes the topological approach easier to understand because relationships and intermediate results can be illustrated directly. The observations and explanations made for this example also hold for higher-dimensional data. The example data consists of a noisy 2-D point cloud with clusters of varying shape, compactness, and size. It is illustrated in Figure 3.8a and explained in more detail in Appendix A.1.

Because the data is only two-dimensional, its density function can be imagined as a height field. As indicated in Figure 3.8c, the data can be extended to 3-D by assigning each data point a height value according to its density as determined by density estimation. This causes regions of higher density to stand out as separated hills. To extend the points to a domain suitable for topological analysis, we use the Delaunay triangulation and obtain the terrain shown in Figure 3.8d. The equally distributed isolines on the hills represent superlevel set borders and accentuate the height information. Consider a single isoline whose height decreases: it first appears on a peak, merges with another isoline in a valley, and, after it has enclosed the whole landscape, (typically) vanishes at zero height. The objective of the topological analysis is to capture these events for all isolines of the landscape and to summarize their evolution in terms of critical points. Figure 3.8e shows all critical points of the height field. It contains a maximum (red) on each peak, a saddle (green) in each valley where superlevel sets merge, and one global minimum (blue) to represent the point of lowest density. The merge tree, which is not shown in the terrains, connects these critical points according to the merging behavior of neighboring regions. Because noise in the data easily complicates the structural description and increases visual complexity, we eliminate insignificant features with topological simplification. Figure 3.8f shows the critical points that remain after the simplification. There is one density maximum per cluster, and saddles of lower density indicate subclusters or cluster separation. The merge tree for this data set is shown in Figure 3.8b.
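The sweep over decreasing isoline heights described above can be sketched with a union-find structure over a height graph. This is a minimal illustration, not the implementation used for the results in this section: the function name `merge_tree_events`, the 1-D chain used as a height graph, and the height values are all hypothetical.

```python
def merge_tree_events(heights, edges):
    """Sweep vertices from high to low and record superlevel-set events.

    heights: one density (height) value per vertex
    edges:   (u, v) index pairs of the height graph
    Returns (maxima, saddles); each saddle is (saddle_vertex, merged_maximum).
    """
    neighbors = [[] for _ in heights]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    parent = {}  # union-find forest over the vertices processed so far
    top = {}     # component root -> highest maximum of that component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    maxima, saddles = [], []
    for v in sorted(range(len(heights)), key=lambda i: -heights[i]):
        parent[v] = v
        roots = {find(n) for n in neighbors[v] if n in parent}
        if not roots:
            # No higher neighbor: a new superlevel-set component is born.
            maxima.append(v)
            top[v] = v
            continue
        # The component with the highest maximum survives; every other
        # component merges into it at v, which makes v a saddle.
        ranked = sorted(roots, key=lambda r: heights[top[r]], reverse=True)
        survivor = ranked[0]
        for r in ranked[1:]:
            saddles.append((v, top[r]))
            parent[r] = survivor
        parent[v] = survivor
    return maxima, saddles


# Hypothetical 1-D terrain: three peaks separated by two valleys.
heights = [1.0, 5.0, 2.0, 8.0, 0.5, 3.0, 0.0]
edges = [(i, i + 1) for i in range(len(heights) - 1)]
maxima, saddles = merge_tree_events(heights, edges)
# Persistence of each merged maximum: its height minus its saddle height.
persistence = {m: heights[m] - heights[s] for s, m in saddles}
```

Vertices that join exactly one existing component are regular nodes and produce no event; the vertex processed last plays the role of the global minimum.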

A major advantage of the density-based approach is its capability to detect clusters of arbitrary shape, as long as dense regions are separated by regions of lower density. In terms of the density function’s topology, this means that saddle-maximum pairs evolve inside a cluster and take its form until they reach lower density at its border (cf. Figure 3.9a). However, this behavior depends on the selected filter radius σ. While σ must be sufficiently small to discriminate nearby clusters, a small filter radius also increases topological noise and can split those clusters that are larger than σ itself. Although evolving saddle-maximum pairs are the key to finding clusters of arbitrary shape, the resulting topological noise typically needs to be reduced prior to further exploration (cf. Figure 3.9b).
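The trade-off in choosing σ can be illustrated with a small 1-D experiment. The sketch below assumes a compact quadratic kernel of radius σ, which is an assumption of this example and not necessarily the kernel used for the results here; the point positions, the evaluation grid, and the function names are hypothetical as well.

```python
def density(x, points, sigma):
    """Sum of compact quadratic kernels of radius sigma (assumed kernel)."""
    total = 0.0
    for p in points:
        u = abs(x - p) / sigma
        if u < 1.0:
            total += 1.0 - u * u
    return total


def count_maxima(points, sigma, lo=-5.0, hi=28.0, step=0.05):
    """Count local maxima of the density on a regular evaluation grid."""
    n = int(round((hi - lo) / step)) + 1
    f = [density(lo + i * step, points, sigma) for i in range(n)]
    # A plateau of equal values is counted once, at its left end.
    return sum(1 for i in range(1, n - 1) if f[i - 1] < f[i] >= f[i + 1])


# Two hypothetical clusters of three points each.
points = [0.0, 1.5, 3.0, 20.0, 21.5, 23.0]
counts = {sigma: count_maxima(points, sigma) for sigma in (0.3, 3.0, 100.0)}
```

With these values, the smallest radius yields one maximum per data point (topological noise), the intermediate radius yields one maximum per cluster, and the largest radius merges both clusters into a single maximum.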

Figure 3.8: Artificial 2-D data set (cf. Appendix A.1) imagined as a height field: (a) Noisy point cloud with clusters of different shape, size, and compactness. Some clusters are intertwined, others contain each other. Colors highlight the relation between clusters and classes; noise points are colored in black. (b) The final merge tree accurately captures the clustering structure. There is one density maximum (red) per cluster, and cluster hierarchy and separation are described by connecting saddles (green). (c) The topological analysis can be imagined as analyzing the implicit height field defined by the points’ densities. (d) Landscape-like representation of the height field that results from rendering the (2-D) Delaunay triangulation with augmented isolines to indicate some superlevel sets. (e) The large number of critical points (red=maximum, green=saddle, blue=global minimum) reflects a noisy density function. (f) These fluctuations are countered with topological simplification. Given suitable simplification thresholds, one density maximum per (sub)cluster remains. Remaining saddles and their densities indicate cluster hierarchy, separation, or ambient noise between the clusters.

Figure 3.9: Artificial 2-D data set: (a) Height graph consisting of the (upsampled) RNG edges (red) and with vertices colored according to their density (dark=dense). The unsimplified merge tree is augmented. Depending on the point distribution and the selected filter radius σ, arbitrarily shaped clusters are detected as maximum-saddle pairs that evolve inside the clusters until they reach lower density. (b) The large number of insignificant saddle-maximum pairs is reduced with topological simplification. After simplification, only one density maximum per cluster remains.

Upsampling is necessary to detect missing saddles of the density function. However, as illustrated in Figure 3.9b, in case of noisy data, using the noise points as saddles could also suffice. Although a lower saddle might be found at an upsampled position in the noisy areas, its density is unlikely to be much lower. That is, the overall clustering would still be captured by the merge tree and the time-intensive upsampling could be omitted. Nevertheless, because separation could be missed if saddles are non-zero, it is not recommended to rely on noise in unknown data sets.

Figure 3.10: The importance of upsampling demonstrated with the artificial 2-D data set: (a) After noise removal, the critical points of the density function are located on the upsampled midpoints (black dots) of the Gabriel graph edges. Cluster separation is detected reliably. (b) Without upsampling, the critical points reside on the cluster borders, which is why cluster separation is missed. From a clustering point of view, the non-zero saddle densities can only reflect one big cluster with several sub-structures.

To demonstrate the importance of upsampling, Figure 3.10a shows the 2-D data set after noise removal together with the upsampled Gabriel graph edges and the critical points of the density function. Except for the subcluster hierarchy in the top left corner, all saddles and the global minimum are located on upsampled vertices between the clusters. Note that not all upsampled positions act as saddles between the dense regions and that most of them become regular nodes once two regions have been identified as separated. Moreover, upsampled vertices are not stored as implicit regular nodes because they do not represent real data points. Figure 3.10b shows the same scenario without upsampling. Because there are no virtual upsamples in the height graph anymore, only real data points of lowest density can act as saddles. This implies that saddles are located on the cluster borders. From a clustering point of view, these non-zero saddles can only describe one big cluster with several sub-structures and, because saddle densities are now higher, the persistence of these features also decreases. This is why some density maxima vanish in Figure 3.10b: they are now considered noise and are removed by topological simplification using the same threshold as in Figure 3.10a. While reducing the simplification thresholds would restore the previously found features, their actual separation would still be missed.
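The effect of upsampling on a single edge that bridges two clusters can be sketched as follows. The example assumes a compact quadratic kernel of radius σ and uses hypothetical point coordinates and names. It relies on the fact that, in a sweep over a height graph, two components joined by one edge merge at the lower of the two endpoints unless a lower vertex is inserted on that edge.

```python
import math


def density(x, points, sigma):
    """Sum of compact quadratic kernels of radius sigma (assumed kernel)."""
    total = 0.0
    for p in points:
        u = math.dist(x, p) / sigma
        if u < 1.0:
            total += 1.0 - u * u
    return total


def midpoint(p, q):
    return tuple((a + b) / 2.0 for a, b in zip(p, q))


# Two hypothetical clusters and the edge that bridges them.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 0.0), (11.0, 0.0), (10.0, 1.0)]
bridge = ((1.0, 0.0), (10.0, 0.0))
sigma = 3.0

# Without upsampling, the clusters merge at the lower endpoint of the
# bridging edge, which is a real data point on a cluster border.
saddle_without = min(density(bridge[0], points, sigma),
                     density(bridge[1], points, sigma))

# With upsampling, the virtual midpoint becomes a vertex of the height graph.
saddle_with = density(midpoint(*bridge), points, sigma)
```

Here the saddle without upsampling has a clearly non-zero density, so the two clusters would read as sub-structures of one cluster, whereas the upsampled midpoint has zero density and exposes the separation.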

The runtime of the topological analysis depends on the individual choice of parameters. Table 3.2 provides some statistics for several parameter settings. We distinguish primarily between the neighborhood graph and different sampling strategies. In all configurations, the filter radius is fixed to σ = 30.0 (the data domain has an extent of 800x800 pixels) and the simplification threshold is fixed to 10% of the maximum persistence, which is typically a good value to remove noise. Note that simplifying by another region property or using other thresholds would slightly change the runtimes for the simplification step. We use the Euclidean minimum spanning tree (EMST), the relative neighborhood graph (RNG), and the Gabriel graph (GG). The sampling strategies include running the topological analysis (i) without sampling, (ii) with sampling (20% random and 20% density-based), and (iii) with sampling, but without reinserting the non-samples afterwards.

Table 3.2: Statistics for the artificial 2-D data set (times in seconds). The filter radius is fixed to σ = 30.0 (pixels) and the simplification threshold is threshpers = 10% of the maximum persistence.

| sampling strategy | neighborhood graph | number of edges | time for graph | time for upsampling | time for reinsertion | time for merge tree | time for simplification | total time¹ |
|---|---|---|---|---|---|---|---|---|
| without sampling | EMST | 31 833 | 27.37 | 1.8 | - | 0.21 | 1.03 | 36.07 |
| without sampling | RNG | 44 942 | 96.01 | 2.87 | - | 0.28 | 0.46 | 105.59 |
| without sampling | GG | 77 077 | 98.14 | 4.94 | - | 0.91 | 0.22 | 111.05 |
| with sampling (20% random, 20% density) | EMST | 5 365 | 0.57 | 0.34 | 11.34 | 0.29 | 5.38 | 18.42 |
| with sampling (20% random, 20% density) | RNG | 6 540 | 2.01 | 0.34 | 12.66 | 0.28 | 3.69 | 19.47 |
| with sampling (20% random, 20% density) | GG | 10 046 | 1.80 | 0.47 | 12.95 | 0.27 | 3.46 | 19.44 |
| with sampling (20% random, 20% density), without reinsertion² | EMST | 5 177 | 0.51 | 0.28 | - | 0.02 | 0.04 | 1.33 |
| with sampling (20% random, 20% density), without reinsertion² | RNG | 6 287 | 1.73 | 0.33 | - | 0.03 | 0.04 | 2.59 |
| with sampling (20% random, 20% density), without reinsertion² | GG | 10 586 | 2.24 | 0.54 | - | 0.04 | 0.02 | 3.35 |

¹ Total times include preprocessing such as computing a distance matrix or removing duplicates. Duplicates still contribute to the density function and are added correctly to the merge tree.

The results clearly reveal that the total times are dominated by the times required to construct the neighborhood graph and that there are huge differences in the total time depending on the sampling strategy. Because the Gabriel graph produces the most edges, it also requires the most time to compute and consumes the highest amount of memory. Furthermore, the cost of the subsequent upsampling of the graph increases linearly with the number of edges. Compared to graph construction and upsampling, creating and simplifying the merge tree is generally fast and takes only a small part of the total time. The runtime bottleneck, hence, is located in the first part of the analysis. This is also the reason why the sampling strategies aim to reduce the costs of these phases. The table reveals that sampling reduces the total time almost linearly with the amount of sampling applied. The fastest result and lowest memory consumption can be obtained by applying sampling but skipping the reinsertion. Not only does this approach minimize the runtime and memory consumption, but working on a smaller subset also accelerates the merge tree construction and the simplification step.
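The simplification threshold of 10% of the maximum persistence used for these measurements can be illustrated with a small filter over maximum-saddle pairs. This is only a sketch: the labels and density values are made up, and a full merge tree simplification also has to update the tree hierarchy when a branch is removed, which is omitted here.

```python
def simplify_by_persistence(pairs, fraction=0.10):
    """Keep the maxima whose persistence reaches the relative threshold.

    pairs: list of (label, maximum_density, saddle_density); the global
           maximum is paired with the global minimum.
    """
    persistence = {label: top - saddle for label, top, saddle in pairs}
    threshold = fraction * max(persistence.values())
    kept = [label for label, _, _ in pairs if persistence[label] >= threshold]
    return kept, threshold


# Hypothetical maximum-saddle pairs of a noisy density function.
pairs = [("A", 9.0, 0.0), ("B", 7.5, 0.5), ("C", 4.0, 3.7), ("D", 2.0, 1.2)]
kept, threshold = simplify_by_persistence(pairs)
```

In this example the two pairs with a small height difference between maximum and saddle fall below the threshold and are treated as noise, while the two prominent maxima remain.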

Changing the filter radius σ also has an effect on the runtime and memory consumption of the topological analysis. Table 3.3 summarizes statistics for different filter radii, using the Gabriel graph and fixing the simplification threshold to a constant 10% of the maximum persistence.

Table 3.3: Statistics for the artificial 2-D data set (times in seconds). The neighborhood graph type is fixed to the Gabriel graph and the simplification threshold is threshpers = 10% of the maximum persistence.

| filter radius σ | time for upsampling | number of upsamples | time for merge tree | number of maxima | time for simplification | total time |
|---|---|---|---|---|---|---|
| 0.1 | 0.14 | 77 077 | 35.39 | 31 834 | 0.0 | 137.87 |
| 10.0 | 7.81 | 13 049 | 0.33 | 3 884 | 0.53 | 106.93 |
| 30.0 | 4.64 | 2 109 | 0.91 | 543 | 0.22 | 105.85 |
| 60.0 | 5.10 | 623 | 1.30 | 204 | 0.09 | 108.65 |
| 150.0 | 10.11 | 171 | 2.06 | 104 | 0.02 | 122.58 |
| 500.0 | 32.54 | 62 | 1.66 | 99 | 0.13 | 139.49 |

The filter radius σ varies between the smallest possible value, which is σ < 0.5 for two neighboring pixels, and a value that is too large to separate all clusters. For an increasing σ, the table reveals an inverse relation between the time required for upsampling and the number of upsamples found. This is due to the cut-off radius, which allows us to skip the evaluation of a midpoint’s density once it cannot increase anymore because all other points are too far away. As a consequence, for smaller filter radii, fewer points are relevant for the density estimation of a single evaluation. For σ < 0.5, the filter radius is even smaller than (half of) the shortest possible edge length in this 2-D example data set. This is why the density estimation of a midpoint can be skipped entirely, which also holds in general if the filter radius is smaller than half of the length of the currently processed graph edge. For an increasing filter radius, the effect of this optimization vanishes. Furthermore, the number of upsamples found also decreases with an increasing filter radius because the midpoint densities increase as well, so that the midpoints turn into (uncaptured) regular nodes or local maxima of the density function. The worst case for the number of required upsamples in this example occurs for σ < 0.5, when every graph edge hosts a zero-density saddle. This is also the worst runtime for the merge tree construction, which is otherwise rather constant and independent of the filter radius.
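The shortcut for short filter radii can be sketched as follows, under two assumptions that are not confirmed by the text: the kernel has compact support of radius σ (so σ doubles as the cut-off radius), and the edge satisfies the Gabriel property, meaning that no data point lies closer to the midpoint than the two endpoints. The function name and the pixel coordinates are hypothetical.

```python
import math


def midpoint_density(p, q, points, sigma):
    """Return (density, contributing_points) at the midpoint of edge (p, q)."""
    half_length = math.dist(p, q) / 2.0
    if sigma < half_length:
        # Assumed Gabriel-type edge: every data point is at least
        # half_length away from the midpoint, so no kernel reaches it.
        return 0.0, 0
    m = tuple((a + b) / 2.0 for a, b in zip(p, q))
    total, contributing = 0.0, 0
    # Brute-force loop for clarity; a spatial index would limit the
    # candidates to points within the cut-off radius.
    for x in points:
        u = math.dist(m, x) / sigma
        if u < 1.0:
            total += 1.0 - u * u
            contributing += 1
    return total, contributing


# Four neighboring pixels; the edge between two of them has length 1.
pixels = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
skipped = midpoint_density(pixels[0], pixels[1], pixels, sigma=0.4)
evaluated = midpoint_density(pixels[0], pixels[1], pixels, sigma=2.0)
```

For σ = 0.4 the midpoint is a zero-density saddle that needs no kernel evaluation at all, whereas for σ = 2.0 all four pixels contribute to the midpoint density.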