
While the algorithmic core of the topological analysis requires only a few mandatory parameters, several additional parameters, some of them optional, are needed to make the algorithm applicable to more complex data and to control the accuracy of the topological abstraction and simplify it. Although the topological approach was designed to work on raw point data, some variable parts are not apparent at first sight, yet their adjustment can be critical in particular fields of application. This section summarizes all mandatory and optional parameters. Table 3.1 provides an overview of all available parameters, sorted by the order in which they are needed.

Similarity measure. Although it defaults to the Euclidean distance, the similarity measure is technically interchangeable. Using another metric adapted for higher dimensions can be beneficial, e.g., to overcome problems with the Euclidean distance in very high-dimensional spaces. The default metric could even be replaced by an application-driven metric that works directly on the domain entities. For example, instead of representing text documents by vectors and using their distances or angles to describe similarity, they could also be compared based on their original content with automatic language processing and sophisticated text mining approaches. In this context, however, it should be mentioned that leaving the vector-based information space could lead to problems with respect to upsampling, which requires comparing real objects with samples at midpoints that do not have a valid representative in the data. Particularly for the text example, it is not obvious how to define a (virtual) document that resides in the middle between two real documents.
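The interchangeability of the similarity measure can be illustrated with a minimal sketch. It assumes vector data, and all names (`euclidean`, `cosine_distance`, `distance_matrix`, `midpoint`) are hypothetical and not taken from the described implementation. The sketch also shows why a midpoint is only well defined as long as the data lives in a vector space.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Metric = Callable[[Vector, Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    """Default metric: straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """Angle-based alternative: 1 - cosine similarity (non-zero vectors only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(points: Sequence[Vector],
                    metric: Metric = euclidean) -> List[List[float]]:
    """Symmetric matrix; the metric is evaluated (n^2 - n) / 2 times."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = metric(points[i], points[j])
    return dist


def midpoint(a: Vector, b: Vector) -> List[float]:
    """Well defined for vectors; a metric on raw documents has no analogue."""
    return [(x + y) / 2 for x, y in zip(a, b)]
```

Swapping the metric only changes the `metric` argument, whereas `midpoint` has no counterpart for an application-driven metric on, e.g., raw text documents.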

Sampling. This is an optional step to increase the algorithm’s scalability by working on a representative subset with a very similar clustering structure. This phase has two parameters: a threshold for random sampling that defines the percentage of the input data to keep, and a threshold for density-based sampling that defines the percentage of the maximum density that a point has to exceed in order to be kept.
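The two sampling thresholds can be sketched as follows. This is only an illustration under assumptions: the function and parameter names are hypothetical, thresholds are given as fractions in (0, 1] rather than percentages, and the density values are assumed to be precomputed.

```python
import random
from typing import List, Sequence


def random_sample(indices: Sequence[int], keep_fraction: float,
                  seed: int = 0) -> List[int]:
    """Keep a random share of the points (keep_fraction in (0, 1])."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)  # linear-time shuffle
    keep = max(1, round(keep_fraction * len(shuffled)))
    return sorted(shuffled[:keep])


def density_sample(indices: Sequence[int], density: Sequence[float],
                   min_fraction_of_max: float) -> List[int]:
    """Keep points whose density exceeds a share of the maximum density."""
    threshold = min_fraction_of_max * max(density[i] for i in indices)
    return [i for i in indices if density[i] > threshold]  # linear sweep
```

The points rejected by either function are the non-samples that a later reinsertion step may attach again.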

Filter kernel. The filter kernel used for density estimation is also interchangeable. For the sake of simplicity, it defaults to the Gaussian filter because this one is used frequently for density-based clustering. However, the Gaussian filter is isotropic in that it has the same variance in every dimension, i.e., the region of influence is actually a hypersphere. The estimated density function is also subject to the filter radius σ and requires evaluating Euclidean distances. As it is conceivable that neither the assumptions about the underlying data nor the data size or dimensionality justify using this kernel type, replacing it by another kernel could be reasonable. Still, it is important to understand that basically every structural insight taken from the topological analysis depends on this parameter and its window width σ. While too large a σ merges regions that are actually separated, too small a σ splits clusters and can even assign every data point to its own cluster. Because finding a suitable σ is vital, Chapter 5.2.1 presents topology-based strategies and an interactive widget to determine the window width of this parameter effectively.
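The influence of σ can be made concrete with a small sketch of an isotropic Gaussian density estimate. It is an assumption-laden illustration: the names `density_at` and `gaussian_density` are hypothetical, and the estimate is left unnormalized because only relative density values matter for the comparison.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def density_at(x: Vector, points: Sequence[Vector], sigma: float) -> float:
    """Unnormalized isotropic Gaussian density at an arbitrary position x."""
    total = 0.0
    for p in points:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total


def gaussian_density(points: Sequence[Vector], sigma: float) -> List[float]:
    """Density at every input point; quadratic in the number of points."""
    return [density_at(p, points, sigma) for p in points]


# Two well-separated one-dimensional groups.
pts = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
gap_small = density_at([2.6], pts, sigma=0.2)       # moderate sigma: empty gap
gap_large = density_at([2.6], pts, sigma=5.0)       # large sigma: gap filled
between_tiny = density_at([0.05], pts, sigma=0.01)  # tiny sigma: points isolated
```

With the large σ, the position between the two groups is denser than the groups themselves, so they merge into one mode. With the tiny σ, the density already vanishes between direct neighbors, so every point forms its own peak.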

Neighborhood graph. This parameter selects the type of neighborhood graph that is used to approximate each vertex’s neighborhood. Choosing between the Gabriel graph (GG), the relative neighborhood graph (RNG), or the Euclidean minimum spanning tree (EMST) is a trade-off between runtime and accuracy of the clustering description. Not only are the RNG and the GG worse in runtime complexity than the EMST, they also contain substantially more edges that require additional upsampling. As a rule of thumb, with increasing data size and dimensionality, less complex neighborhood graphs should be used. In our experiments it turned out that while the GG can handle up to around 30 000 points in some dozens of dimensions, for more complex data, the RNG or even the EMST is recommended. The effect of using sparser neighborhood graphs is increasing structural noise that can be countered with topological simplification. Instead of using the upsampled versions of these graphs, other graphs and (up)sampling schemes are also imaginable as long as they match the target application, i.e., cluster analysis, and can be processed by the merge tree algorithm, which requires connected graphs by default.
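The difference in edge count between the three graphs can be illustrated with naive constructions. The following sketch assumes Euclidean vector data and uses the textbook emptiness tests; the function names are hypothetical, points on the boundary of a test region do not block an edge, and no implementation-specific optimization is attempted.

```python
import math
from itertools import combinations
from typing import Sequence, Set, Tuple

Edge = Tuple[int, int]
Vector = Sequence[float]


def _sq(p: Vector, q: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def gabriel_graph(pts: Sequence[Vector]) -> Set[Edge]:
    """Edge (i, j) iff no third point lies inside the ball with diameter ij."""
    n = len(pts)
    return {(i, j) for i, j in combinations(range(n), 2)
            if all(_sq(pts[i], pts[k]) + _sq(pts[j], pts[k]) >= _sq(pts[i], pts[j])
                   for k in range(n) if k not in (i, j))}


def relative_neighborhood_graph(pts: Sequence[Vector]) -> Set[Edge]:
    """Edge (i, j) iff no third point is closer to both i and j than they are to each other."""
    n = len(pts)
    return {(i, j) for i, j in combinations(range(n), 2)
            if all(max(_sq(pts[i], pts[k]), _sq(pts[j], pts[k])) >= _sq(pts[i], pts[j])
                   for k in range(n) if k not in (i, j))}


def euclidean_mst(pts: Sequence[Vector]) -> Set[Edge]:
    """Prim's algorithm on the complete graph, quadratic in the number of points."""
    n = len(pts)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    edges: Set[Edge] = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.add((min(u, parent[u]), max(u, parent[u])))
        for v in range(n):
            if not in_tree[v] and _sq(pts[u], pts[v]) < best[v]:
                best[v], parent[v] = _sq(pts[u], pts[v]), u
    return edges
```

With these definitions the graphs are nested (EMST ⊆ RNG ⊆ GG), which matches the observation that the sparser graphs require less upsampling but describe the neighborhood more coarsely.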

Upsampling. Strictly speaking, upsampling is optional rather than mandatory. However, because omitting this part can heavily distort the clustering description, it is recommended. Upsampling aims to improve the accuracy of the approximated density function by detecting missing saddles of zero density that are required to confirm cluster separation. It is important to understand that without upsampling, the number of saddles does not necessarily change. Because a cluster typically exhibits less density at its border than in its center, without upsampling, saddles would likely only “move” from the empty space between the clusters to one of their borders. Although separation is likely missed in this case, the number of clusters found is still the same. For noisy data, upsampling could also be omitted because noise points between the clusters could act as saddle points of low density. Taking such considerations into account, skipping the upsampling step can accelerate the analysis. However, if the semantics and the structure of the data are largely unknown, upsampling is recommended to avoid misleading insights about the data.
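The idea of detecting a missing saddle can be sketched under strong simplifying assumptions: only a single midpoint is probed per edge, the density is an unnormalized Gaussian estimate, and an edge is split whenever its midpoint is less dense than both endpoints. The names `density_at` and `upsample_edges` are hypothetical, and the actual upsampling scheme may differ.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Edge = Tuple[int, int]


def density_at(x: Vector, points: Sequence[Vector], sigma: float) -> float:
    """Unnormalized isotropic Gaussian density at position x."""
    return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2.0 * sigma ** 2))
               for p in points)


def upsample_edges(points: Sequence[Vector], edges: Sequence[Edge], sigma: float):
    """Split each edge whose midpoint is less dense than both of its endpoints."""
    verts: List[List[float]] = [list(p) for p in points]
    dens = [density_at(p, points, sigma) for p in points]
    new_edges: List[Edge] = []
    for i, j in edges:
        mid = [(a + b) / 2 for a, b in zip(points[i], points[j])]
        d_mid = density_at(mid, points, sigma)
        if d_mid < min(dens[i], dens[j]):
            m = len(verts)  # index of the inserted saddle candidate
            verts.append(mid)
            dens.append(d_mid)
            new_edges += [(i, m), (m, j)]
        else:
            new_edges.append((i, j))
    return verts, dens, new_edges


pts = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
verts, dens, edges = upsample_edges(pts, chain, sigma=0.2)
```

In this example only the edge bridging the two groups is split. Without the inserted vertex, the lowest density along the path between the groups would be that of a border point, which is exactly the “moved” saddle described above.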

Reinsertion. Reinserting non-samples is optional and primarily ensures that the merge tree contains all input data points. It requires two boolean parameters that define whether randomly skipped points and low-density points should be reinserted into the neighborhood graph. Because sampling aims to work on a representative subset of the input data, skipping the reinsertion can accelerate the analysis if the focus is primarily on the clustering structure rather than on analyzing individual points.
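A minimal sketch of the reinsertion step, assuming Euclidean vector data: every non-sample is attached to its nearest sample by one additional edge. The function names and the two flags `reinsert_random` and `reinsert_low_density` are hypothetical stand-ins for the two boolean parameters, and the density handling of the new edges is omitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Edge = Tuple[int, int]


def nearest_sample(points: Sequence[Vector], sample_ids: Sequence[int], q: int) -> int:
    """Index of the sample closest to non-sample q (one linear sweep)."""
    return min(sample_ids, key=lambda s: math.dist(points[s], points[q]))


def reinsert(points: Sequence[Vector], sample_ids: Sequence[int],
             edges: Sequence[Edge],
             skipped_random: Sequence[int], skipped_low_density: Sequence[int],
             reinsert_random: bool = True,
             reinsert_low_density: bool = True) -> List[Edge]:
    """Attach the selected non-samples to the graph via their nearest sample."""
    result = list(edges)
    todo: List[int] = []
    if reinsert_random:
        todo += list(skipped_random)
    if reinsert_low_density:
        todo += list(skipped_low_density)
    for q in todo:
        result.append((nearest_sample(points, sample_ids, q), q))
    return result
```

Setting both flags to `False` returns the graph unchanged, which corresponds to skipping the reinsertion entirely.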

Topological simplification. Although the presence of noise might be an interesting insight in some applications, in general, it is recommended to remove small fluctuations of the density function to obtain a clear and precise clustering description. Still, topological simplification is optional and requires up to three parameters that define minimum thresholds for a dense region’s persistence, size, and stability. Note that vague features could be missed if one of these thresholds is too large, especially if the clustering is inhomogeneous, e.g., if some clusters are scattered and thus less persistent and stable, but large in size. These aspects will be discussed in the conclusion at the end of this chapter (cf. Chapter 3.6).
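The persistence threshold can be illustrated with a sketch that computes the persistence of every density maximum on a graph by a union-find sweep from high to low density and then discards weak maxima. This is a generic construction under stated assumptions, not the described implementation: the names are hypothetical, the size and stability thresholds are not covered, and the highest maximum of each connected component is assigned infinite persistence by convention.

```python
import math
from typing import Dict, List, Sequence, Tuple


def peak_persistence(density: Sequence[float],
                     edges: Sequence[Tuple[int, int]]) -> Dict[int, float]:
    """Map each density maximum to its persistence (peak minus merge saddle)."""
    n = len(density)
    neighbors: List[List[int]] = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    parent = list(range(n))  # invariant: each root is its component's peak
    seen = [False] * n
    persistence: Dict[int, float] = {}

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for v in sorted(range(n), key=lambda i: -density[i]):
        seen[v] = True
        for u in neighbors[v]:
            if not seen[u]:
                continue
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            hi, lo = (ru, rv) if density[ru] >= density[rv] else (rv, ru)
            if lo != v:  # a real peak dies at saddle v
                persistence[lo] = density[lo] - density[v]
            parent[lo] = hi
    for v in range(n):
        if find(v) == v:
            persistence[v] = math.inf  # surviving peak of each component
    return persistence


def significant_peaks(persistence: Dict[int, float], min_persistence: float) -> List[int]:
    """Keep only maxima whose persistence reaches the threshold."""
    return sorted(p for p, value in persistence.items() if value >= min_persistence)
```

Raising `min_persistence` removes small fluctuations first. If it is chosen too large, scattered but genuine clusters with low persistence are removed as well.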

Parameter changes

As already mentioned in Chapter 3.2 about the basic algorithm, the algorithmic core of the topological analysis follows a straightforward modular design. This implies that parameter changes usually affect subsequent parts of the algorithm. Because changing parameters will play an important role during the visual analysis, an efficient reuse of intermediate computational results is important to facilitate fluent and interactive exploration later on (cf. Chapter 5). Fortunately, the most time-consuming parts will not be changed frequently during the analysis. This primarily holds for computing the distance matrix, which is used by several sub-routines, but also for sampling and the choice of the neighborhood graph. Once these parameters are adjusted, subsequent parts like the density estimation and upsampling can reuse the pre-constructed neighborhood graph. A parameter that likely requires repeated adjustment is the filter radius σ. As will be explained in more detail in Chapter 5.2.1, the analyst typically has to run the topological analysis for different values of σ in order to detect the clustering suitably. According to the parameter order summarized in Table 3.1, changing σ requires repeating the upsampling, the merge tree construction, and the simplification. Note that the situation changes if density-based sampling is part of the analysis. In this case, changing σ also affects the neighborhood graph construction and the reinsertion of non-samples. The most frequently changed parameters are probably the simplification thresholds. Because simplifying the merge tree is fast and is the last sub-routine, changing these thresholds is not a runtime bottleneck and hardly affects the overall runtime. Of course, changing any of the above parameters also updates the merge tree visualization later on (cf. Chapter 4).
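The reuse of intermediate results can be sketched as a lookup that returns the sub-routines to repeat after a parameter change. The stage names, their order, and the parameter names are assumptions made for illustration; the sketch only encodes the dependencies stated in this section, plus the density estimation itself, which necessarily depends on σ.

```python
from typing import List

# Assumed stage order (hypothetical names).
PIPELINE = [
    "unique", "random_sampling", "distance_matrix", "density_estimation",
    "density_sampling", "neighborhood_graph", "upsampling", "reinsertion",
    "merge_tree", "simplification",
]


def stages_to_rerun(changed: str, density_sampling: bool = False) -> List[str]:
    """Return the stages that must be repeated, in pipeline order."""
    if changed == "sigma":
        # Implied: sigma is the window width of the density estimation.
        dirty = {"density_estimation", "upsampling", "merge_tree", "simplification"}
        if density_sampling:
            dirty |= {"density_sampling", "neighborhood_graph", "reinsertion"}
    elif changed == "graph_type":
        # Assumption: everything from the graph construction onward is repeated.
        dirty = set(PIPELINE[PIPELINE.index("neighborhood_graph"):])
    elif changed == "simplification_thresholds":
        dirty = {"simplification"}
    else:
        raise ValueError(f"unknown parameter: {changed}")
    return [stage for stage in PIPELINE if stage in dirty]
```

In all three cases the distance matrix is kept, which is the most expensive intermediate result to recompute.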

3.4.1 Runtime Complexity

Although the sub-routines of the algorithmic pipeline have a defined limiting behavior with respect to the input data’s size and dimensionality, the expected runtime of a particular execution of the topological analysis also depends crucially on some factors unrelated to the data size. The three most important factors are the structure of the data, its dimensionality, and random factors, e.g., during random sampling. The structure of the data, i.e., the number of (sub)clusters and their hierarchy, affects the complexity of the topological description which, in turn, affects those steps that work on the merge tree, e.g., the simplification, which depends on the number of leaf nodes. The data dimensionality affects the runtime of the algorithm in that the number of neighbors typically increases with each additional dimension. This affects those operations working on the neighborhood graph edges, like upsampling or the reinsertion of non-samples. Finally, random factors affect the expected runtime because even equally sized samples of the same data can lead to very different neighborhood graphs (and merge trees). Taking these considerations into account, it is difficult to quantify the expected costs for an arbitrary data set because the expected runtime is not determined by the data size alone. For example, it is easily possible that a specific number of points is faster to analyze in a high-dimensional space than in a two-dimensional space.

Nevertheless, the asymptotic runtime complexity results from the sum of the runtime complexities of the individual sub-routines. Taking only the most relevant sub-routines into account and assuming no implementation-specific optimizations, these are the following steps (with $n$ being the data’s size and $d$ being the dimensionality): making the initial data set unique requires a sort, i.e., linearithmic time $O(n \log n)$, and a sweep in linear time $O(n)$ to remove adjacent duplicates; random sampling requires a shuffle of the input data in linear time; calculating the distance matrix with $(n^2 - n)/2$ entries takes quadratic time, $O(d n^2)$ in total; estimating the density function takes quadratic time; density-based sampling requires a sweep in linear time; constructing the neighborhood graph naively for arbitrary dimensions requires cubic time for the Gabriel graph, cubic time for the relative neighborhood graph, and quadratic time for the Euclidean minimum spanning tree; performing the upsampling takes $O(e\,n')$, where $e$ is the number of graph edges and $n'$ is the number of sampled points; density-based and random-based reinsertion each take quadratic time, i.e., $O(2 n' n'')$, to determine in two sweeps, for each of the $n'' = n - n'$ non-samples, the nearest sample and the density at the midpoint of this edge; constructing the merge tree takes $O(\hat{n} \log \hat{n} + N + M \alpha(M))$ [24], where $\hat{n}$ is the number of neighborhood graph vertices, $N$ is the number of edges, and $M$ is the number of union-find merges performed; and simplifying the merge tree has an asymptotic cost of mainly $O(t \log t)$ [25], where $t$ is the original size of the merge tree.
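The first of these steps, making the data set unique by a sort followed by a sweep over adjacent duplicates, can be sketched as follows. The function names are hypothetical, and `pair_count` merely evaluates the number of distance matrix entries given above.

```python
from typing import List, Sequence, Tuple


def unique_points(points: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Remove exact duplicates: sort, then drop adjacent repeats."""
    ordered = sorted(tuple(p) for p in points)  # linearithmic sort
    result: List[Tuple[float, ...]] = []
    for p in ordered:                           # linear sweep
        if not result or p != result[-1]:
            result.append(p)
    return result


def pair_count(n: int) -> int:
    """Number of distinct point pairs, i.e., (n^2 - n) / 2 matrix entries."""
    return (n * n - n) // 2
```

For example, four unique points yield six distance matrix entries, each of which costs a number of operations proportional to the dimensionality.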