3.3 Optimizations
3.3.3 Sampling and Reinsertion
Sampling is an optional phase to reduce the number of input points and, therefore, to reduce the algorithm's overall runtime. As we assume the given points to be samples of an unknown probability density function $\hat{f}$, which we approximated with $f$, we can consider the effect of a further reduction in samples. The mean integrated square error between $\hat{f}$ and $f$ can be approximated ([154]):
$$\varepsilon = \frac{1}{4}\,\sigma^{4} \int \bigl(\Delta \hat{f}(x)\bigr)^{2}\,dx + \frac{1}{n\,(\sigma\sqrt{2})^{d}},$$
where the first term can be thought of as a systematic error resulting from kernel density estimation in general and the second term as a random error based on sampling. The random error is inversely proportional to the number of samples $n$, and for high dimensions the systematic error dominates. As $\hat{f}$ is unknown, the error cannot be computed, but it can still be minimized with respect to $\sigma$ ([154]):
$$\sigma_{\mathrm{opt}}^{\,d+4} = d \left( \int \bigl(\Delta \hat{f}(x)\bigr)^{2}\,dx \right)^{-1} \frac{1}{n\,(\sqrt{2})^{d}}.$$
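The minimization step can be made explicit with a short derivation. This is a sketch that assumes the random-error term scales as $\sigma^{-d}$, as in the approximation above; the symbol $\varepsilon(\sigma)$ and the shorthand $A$ are introduced here only for illustration.

```latex
\begin{align*}
\varepsilon(\sigma) &= \tfrac{1}{4}\,A\,\sigma^{4} + \frac{1}{n\,(\sqrt{2})^{d}\,\sigma^{d}},
  \qquad A := \int \bigl(\Delta \hat{f}(x)\bigr)^{2}\,dx, \\
\frac{\partial \varepsilon}{\partial \sigma}
  &= A\,\sigma^{3} - \frac{d}{n\,(\sqrt{2})^{d}\,\sigma^{d+1}} = 0
  \;\Longrightarrow\;
  \sigma_{\mathrm{opt}}^{\,d+4} = \frac{d}{A\,n\,(\sqrt{2})^{d}} .
\end{align*}
```

Under these assumptions $\sigma_{\mathrm{opt}} \propto n^{-1/(d+4)}$, so at the optimum both error terms scale as $n^{-4/(d+4)}$.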
Substituting $\sigma_{\mathrm{opt}}$ back into the original formula gives a relation between the total error and the number of samples: $\varepsilon(n) \sim n^{-4/(4+d)}$. This relation can be used to determine the minimum number of samples $m$ that does not increase the error by more than a factor $\delta$: $m/n \geq \delta^{-d/4-1}$.
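The bound can be evaluated directly. The following is a minimal sketch, assuming that δ is read as a multiplicative growth factor of the error (δ ≥ 1); the function names `min_sample_fraction` and `min_samples` are hypothetical and do not come from the original text.

```python
import math


def min_sample_fraction(delta: float, d: int) -> float:
    """Lower bound on m/n so that the error grows by at most a factor delta."""
    if delta < 1.0:
        raise ValueError("delta is read as an error growth factor and must be >= 1")
    if d < 1:
        raise ValueError("d must be a positive dimension")
    # m/n >= delta^(-d/4 - 1)
    return delta ** (-(d / 4.0) - 1.0)


def min_samples(n: int, delta: float, d: int) -> int:
    """Smallest integer m with m/n >= delta^(-d/4 - 1), capped at n."""
    return min(n, math.ceil(n * min_sample_fraction(delta, d)))


if __name__ == "__main__":
    # Doubling the error in 2-D tolerates keeping roughly 35% of the points.
    print(min_samples(10_000, delta=2.0, d=2))
```

Because the exponent grows with d, the same δ permits a much stronger reduction in high dimensions than in low dimensions.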
Figure 3.7: Comparison of different sampling strategies based on an artificial 2-D data set: (a) The complete data set. (b) The remaining points after randomly sampling 20% of the input points. (c) The remaining points after additionally sampling only those points that have a density greater than 20% of the maximum density. (d) The relative neighborhood graph after reinsertion of all non-samples.
We use two strategies: random sampling, which is very fast and indiscriminate, and density-based sampling, which removes samples whose density is below a certain threshold. The threshold may be a fixed percentage of the maximum density or a value determined automatically. Figure 3.7 provides an example of both sampling strategies. Figure 3.7b shows the result of randomly sampling 20% of the artificial 2-D point set in Figure 3.7a, along with the density function's merge tree based on the relative neighborhood graph. In Figure 3.7c, an additional density-based sampling step further removes samples with a density lower than 20% of the maximum density.
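The two sampling steps can be sketched as follows. This is an illustration under stated assumptions, not the implementation used here: it assumes NumPy, a Gaussian filter kernel with a hand-picked σ, a synthetic 2-D data set, and that the density is evaluated on the randomly sampled subset. All function names are hypothetical.

```python
import numpy as np


def gaussian_density(points, sigma):
    """Unnormalized Gaussian kernel density estimate at every point (O(n^2))."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.einsum("ijk,ijk->ij", diff, diff)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)).sum(axis=1)


def random_sample(n, fraction, rng):
    """Sorted indices of a uniformly random subset with round(fraction * n) points."""
    m = max(1, int(round(fraction * n)))
    return np.sort(rng.choice(n, size=m, replace=False))


def density_sample(density, rel_threshold):
    """Indices of points whose density exceeds rel_threshold * maximum density."""
    return np.flatnonzero(density > rel_threshold * density.max())


rng = np.random.default_rng(0)

# Synthetic stand-in for the data set: four Gaussian clusters plus uniform noise.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
clusters = np.concatenate([c + 0.3 * rng.standard_normal((200, 2)) for c in centers])
noise = rng.uniform(-2.0, 6.0, size=(200, 2))
points = np.concatenate([clusters, noise])

# Step 1: keep a random 20% of the input points.
keep_random = random_sample(len(points), 0.2, rng)

# Step 2: among those, keep points denser than 20% of the maximum density.
density = gaussian_density(points[keep_random], sigma=0.5)
keep_dense = keep_random[density_sample(density, 0.2)]

print(len(points), len(keep_random), len(keep_dense))
```

Because the threshold is relative to the maximum density, the kernel's normalization constant does not affect which points are kept, and the densest point always survives the second step.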
The persistence diagram in Figure 3.6b illustrates the topological effects for this scenario. Because the maximum density typically changes after sampling, density values are normalized in the diagram. The relevant observation is that all four clusters are detected reliably after random sampling (green), after density-based sampling (purple), and after applying both (blue). The four clusters can also be isolated from noise with a 20% persistence threshold. Note that density-based sampling does not change the maxima. This ensures that prominent features can still be compared in the reduced point set. However, since scattered and vague features in low-density areas can be eliminated by sampling, this step is closely related to simplification. Moreover, for noisy data, density-based sampling can increase the persistence of features that are surrounded by noise. As a consequence, sampling can break subcluster relationships and overemphasize a region's persistence because saddles can be lower than before sampling was applied. Therefore, sampling with high thresholds should be applied with caution, as it decreases the signal-to-noise ratio and can eliminate existing structure or distort relationships among the features.
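Isolating features with a relative persistence threshold can be illustrated with a small sketch. It assumes that each maximum is described by a (birth, death) pair of density values, where the death value is the density of the saddle at which the maximum merges into a higher one, and that the global maximum is assigned a death value of 0. The function name and the numbers are made up for illustration.

```python
def prominent_maxima(pairs, threshold=0.2):
    """Keep maxima whose normalized persistence reaches the threshold.

    pairs: list of (birth, death) density values, one pair per maximum.
    """
    max_density = max(birth for birth, _ in pairs)
    kept = []
    for birth, death in pairs:
        # Normalize by the maximum density so that results remain comparable
        # between the full and the sampled point set.
        persistence = (birth - death) / max_density
        if persistence >= threshold:
            kept.append((birth, death))
    return kept


# Four pronounced maxima and two shallow, noise-like maxima (made-up values).
pairs = [(10.0, 0.0), (8.0, 2.0), (7.0, 1.0), (6.0, 3.0), (3.0, 2.5), (2.2, 2.0)]
clusters = prominent_maxima(pairs, threshold=0.2)
print(len(clusters))
```

Lowering a saddle, for example by removing low-density points around a feature, reduces the death value and therefore raises the persistence of that feature, which is the overemphasis effect described above.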
Although sampling accelerates the topological analysis and still preserves the main features, the final clustering description is incomplete. This is because non-sampled points are excluded from the remaining analysis process once they have been discarded. If the analyst needs a visualization of the complete data set later on, non-sampled data points have to be reinserted into the neighborhood graph prior to the merge tree computation. If both strategies were applied, sampling is reversed by reinserting non-samples in the inverse order of their removal. Let $S$ denote the set of sampled points and $\bar{S}$ the set of non-samples. We approximate the neighborhood of each point $\bar{p} \in \bar{S}$ by graph edges to its sampled neighbors. Depending on the neighborhood graph used, $\bar{p}$ may have several valid neighbors. However, because the reinsertion step is not intended to improve the approximation of the density function, but only aims to connect non-samples to their corresponding regions, we add a single edge from $\bar{p}$ to its nearest neighbor $NN(\bar{p}) \in S$ to the graph of $S$. It is also necessary to test whether the newly inserted edge needs to be upsampled; otherwise, noise points could be attached to a nearby cluster merely because it contains their nearest neighbor. A consequence of connecting $\bar{p}$ to the neighborhood graph of $S$ with only one edge is that star-like structures develop at some sampled points $p \in S$ (cf. Figure 3.7d). If the non-samples are reinserted for both sampling strategies, these structures develop recursively. Again, the purpose of reinserting non-samples is to ensure that the merge tree contains all input data points on its superarcs. Therefore, if the analyst is only interested in a structural overview rather than in visualizing the data points, skipping the optional reinsertion provides a faster structural preview.
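The reinsertion step can be sketched as follows. This is a simplified illustration that assumes NumPy, Euclidean distances, a brute-force nearest-neighbor search, and an edge list of index pairs; it omits the upsampling test for new edges. The function name `reinsert_all` and the toy data are hypothetical.

```python
import numpy as np


def reinsert_all(points, sample_idx, removed_stages, edges):
    """Attach every non-sample to its nearest already-present point by one edge.

    removed_stages lists the index sets discarded by each sampling step in the
    order of removal; they are processed in inverse order.
    """
    current = np.asarray(sample_idx, dtype=int)
    new_edges = list(edges)
    for removed in reversed(removed_stages):
        removed = np.asarray(removed, dtype=int)
        for q in removed:
            # Brute-force nearest neighbor among the points present so far.
            dist = np.linalg.norm(points[current] - points[q], axis=1)
            new_edges.append((int(q), int(current[np.argmin(dist)])))
        # Points of this stage become valid neighbors for the next stage,
        # which is how star-like structures can develop recursively.
        current = np.concatenate([current, removed])
    return new_edges


# Toy example: points 0 and 1 form the final sample with one graph edge.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.1, 0.0], [0.9, 0.0], [0.2, 0.0]])
stage_random = [4]      # removed first, by random sampling
stage_density = [2, 3]  # removed second, by density-based sampling
graph = reinsert_all(points, [0, 1], [stage_random, stage_density], [(0, 1)])
print(graph)
```

In the toy example, point 4 is attached to point 2 rather than to a member of the final sample, because point 2 was reinserted before it.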
Table 3.1: Parameters of the algorithmic core of the topological analysis.
Part of the Algorithm   Requirement   Type          Number
Similarity measure      mandatory     metric        1
Sampling                optional      numeric       2
Filter kernel           mandatory     kernel type   1
Neighborhood graph      mandatory     graph type    1
Upsampling              recommended   boolean       1
Reinsertion             optional      boolean       2
Simplification          optional      numeric       3