Data abstraction - Making choices in multi-dimensional parameter spaces

A common theme of our use cases is that, unlike classical data visualization settings, they do not start from given data but work with a computational model whose parameter space is sampled in order to construct a set of data points for interactive analysis. Here, the notion of a data source, which was recently introduced as an abstraction of a file loader by Ingram et al. [IMI+10], could be extended to include basic information about the available variables, their types, and valid ranges. Enhanced with a capability to query new points that are not yet stored in a data table, this could be used as an interface to static data as well as dynamic computation, similar to a function call in a procedural programming language. This results in a synthetic data source, which we refer to as a compute node. The design of a basic user interface for this abstraction will be discussed in Section 6.2.1. The conceptual organization of the required tasks along with their inputs and outputs is shown in Figure 1.3. It considers user interaction and the computational pipeline separately, while all modules operate on the same data and share one flow of control.
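To make the compute node abstraction more concrete, the following is a minimal sketch of such an interface. The names `Variable`, `ComputeNode`, and `query` are hypothetical and not taken from this thesis or from Ingram et al.; the sketch only assumes that a node knows its input variables with types and valid ranges, and that it answers a query either from the stored table or by invoking the computation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class Variable:
    """Basic information about one input variable (hypothetical layout)."""
    name: str
    kind: str      # e.g. "continuous" or "discrete"
    lower: float   # valid range, inclusive
    upper: float


@dataclass
class ComputeNode:
    """A data source that can also compute points not yet in its table."""
    inputs: List[Variable]
    compute: Callable[[Point], Sequence[float]]
    table: Dict[Point, Point] = field(default_factory=dict)

    def query(self, point: Sequence[float]) -> Point:
        key = tuple(float(v) for v in point)
        if len(key) != len(self.inputs):
            raise ValueError("wrong number of coordinates")
        for var, value in zip(self.inputs, key):
            if not var.lower <= value <= var.upper:
                raise ValueError(f"{var.name}={value} is outside its valid range")
        if key not in self.table:          # dynamic computation on demand
            self.table[key] = tuple(self.compute(key))
        return self.table[key]             # static lookup otherwise


# Toy usage: a model with two inputs and one response (their sum).
node = ComputeNode(
    inputs=[Variable("a", "continuous", 0.0, 1.0),
            Variable("b", "continuous", 0.0, 10.0)],
    compute=lambda x: (x[0] + x[1],),
)
response = node.query([0.5, 2.0])
```

From the caller's point of view, a stored point and a freshly computed one are indistinguishable, which is the sense in which the node behaves like a function call.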

A numerical model in our cases consists of relations among a set of variables or dimensions. Some background on relevant mathematical concepts is summarized in Appendix A. Variables may be inherent to the problem domain or internal to the model. Further distinctions can be made by considering the distribution of their values, which can be continuous or discrete, (un-)known, (un-)observable, (in-)expensive to sample from, determined empirically from data, or structurally inferred.

[Figure 1.3 diagram. User interaction: set up compute node (run default point, show derived variables, file IO); group variables/dims; specify ROI; view data (sub-)space (overview, bi-variate view, histogram, detail view); assign variables (assign manually, trigger computation). Computation: sample inputs; derive variables; compute outputs. Data items: points and resolution, #dims, region of interest, distance metric, features, objectives, embedding coordinates, cluster membership.]

Figure 1.3: Abstraction of data, interaction, and computational components. Lines indicate shared data among processing steps and arrows prescribe an order of execution. On a more detailed level, red is required input and blue denotes information that is available after a processing step.

The overall model is represented by a function $f : \mathbb{R}^n \to \mathbb{R}^r$. It is parametrized over a multi-variate Euclidean domain, in which a point is denoted in vector notation as $x = (x_1, x_2, \dots, x_n)$. A point in the multi-field range of $f$ can be computed as $f(x) = y = (y_1, y_2, \dots, y_r)$. The combination domain × range of $f$ gives its data space. Occa- [...] However, this notion does not apply in general, since the presence of constraints may introduce dependencies among the $x_i$. Alternative terms for the inputs and outputs of $f$ are factors and responses, or parameters and derived variables, respectively.⁵

We assume that code to compute $f$ is given as a black box and can be invoked for a set of points $X = \{x_k\} \subset \mathbb{R}^n$ of finite size $m = |X|$. This set $X$ is referred to as a design or a sample [SWN03, p. 15]. With a prescribed ordering this amounts to the construction of a design matrix $X \in \mathbb{R}^{m \times n}$ containing the points in its $m$ rows (R2a). The mapping $f$ gives a set of responses $Y = \{f(x_k)\}$. By concatenating these values as $[X\ Y] \in \mathbb{R}^{m \times (n+r)}$ the data table is obtained, giving the main input to further processing or visualization. By applying the concept of a function $f$, we impose that the output of the code is deterministic. Uncertainties of the system can still be modelled by specifying probability distributions for additional environmental variables $x_i$ [SWN03, p. 121]. Even for non-deterministic code, such an additional variable could simply index the order or the time of a particular invocation.

Using the conceptual ingredients of Figure 1.3, $f$ is formed by composing a potentially costly image computation $h$ (R1) and variable derivation $g$ (R4) to give $f = g \circ h$. The image of $h$ is meant in a mathematical sense, but can represent an actual picture or a disk image that captures the result of the computation for a particular configuration $x$. This indirection should emphasize the possibility to cache output images $\tilde{y} \in D$, but in simple cases the derived response variables $y_i$ are computed directly and $g$ is just the identity $D = \mathbb{R}^r \to \mathbb{R}^r$.
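The construction of the data table and the composition of f from g and h can be illustrated with a short sketch. It assumes NumPy; the toy model inside `h` (sum and product of the coordinates) and the names `image_cache` and `data_table` are placeholders chosen for illustration, not part of the thesis.

```python
import numpy as np

image_cache = {}  # maps a configuration x (as a tuple) to its cached image


def h(x):
    """Stand-in for the costly image computation (hypothetical toy model)."""
    return np.array([x.sum(), x.prod()])


def g(image):
    """Variable derivation; the identity in the simple case D = R^r."""
    return image


def f(x):
    """f = g o h, reusing the cached image if x was computed before."""
    key = tuple(x.tolist())
    if key not in image_cache:
        image_cache[key] = h(x)
    return g(image_cache[key])


def data_table(X):
    """Evaluate f on the m rows of the design matrix X and return [X Y]."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.vstack([f(x) for x in X])
    return np.hstack([X, Y])


# Design with m = 3 points in n = 2 dimensions; the third repeats the first.
X = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]])
table = data_table(X)  # shape (m, n + r) = (3, 4)
```

Because the images are cached separately from the derived variables, replacing `g` by a different derivation would not require running `h` again for points that are already stored.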

Depending on which derived variable $y_i = g_i(\tilde{y})$ is specified (R4), its information may be interpreted as a feature, embedding coordinate, cluster membership label, likelihood, distance from a template point, or objective measure (R6), to give a few practical examples. In each case it may be possible to compute $g_i$ or to assign values manually, depending on whether a function definition or a user’s concept is available.

Some processing steps (R5+7) require a notion of distance or similarity among points (R4a). Technical background on distances and norms is provided in Appendix A.1.1. One method considered here is the Euclidean distance $d_r$ between feature vectors in $\mathbb{R}^r$. Beyond that, distance $d_c$ combines all information about each configuration point, including its parameter coordinates $x \in \mathbb{R}^n$ or a domain specific function operating on the disk images.

In order to accommodate simulations with a large number of variables $n + r$, an early step of the interaction allows the user to divide variables into groups of smaller sizes $n_l$. This way one can separate input and output, indicate other semantic information inherent to the simulation model, and produce more focussed multi-variable views (R3).
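Variable groups and the distances mentioned above fit together naturally if a distance is evaluated on a chosen group of columns of the data table. The sketch below is one possible reading under that assumption: `pairwise_euclidean` and the `groups` dictionary are hypothetical names, the distance over the output group plays the role of $d_r$, and the unweighted distance over all columns is only a simple stand-in for a combined distance such as $d_c$, whose exact definition is not given here.

```python
import numpy as np


def pairwise_euclidean(table, columns):
    """Euclidean distance matrix between rows, restricted to some columns."""
    Z = np.asarray(table, dtype=float)[:, columns]
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Data table [X Y] with n = 2 inputs and r = 2 outputs (toy values).
table = np.array([[0.0, 0.0, 0.0, 0.0],
                  [3.0, 4.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])

# Hypothetical grouping of the n + r columns into smaller groups.
groups = {"inputs": [0, 1], "outputs": [2, 3]}

d_inputs = pairwise_euclidean(table, groups["inputs"])
d_r = pairwise_euclidean(table, groups["outputs"])       # feature distance
d_all = pairwise_euclidean(table, [0, 1, 2, 3])          # all columns combined
```

In practice the columns would usually be rescaled before combining inputs and outputs, since their units and ranges may differ.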

An important aspect of Figure 1.3 is that the sample creation is split into two steps. First, a region of interest $M \subset \mathbb{R}^{n+r}$ is specified by outlining areas in the input space (R2), see also Section 6.2.3. Second, this region is input to an automatic method that generates a point distribution of good numerical quality within it while meeting a given budget $m$. What good quality means in the context of multi-dimensional point distributions is the subject of the following chapter.
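The separation between specifying a region and generating points in it can be sketched as follows. This is only a stand-in, assuming NumPy: the region is given as a membership test plus a bounding box, and plain uniform rejection sampling replaces the quality-driven method discussed in the following chapter. The function name `sample_roi` and its parameters are hypothetical.

```python
import numpy as np


def sample_roi(inside, lower, upper, m, seed=0, max_batches=1000):
    """Draw m points uniformly from {x in [lower, upper] : inside(x)}.

    inside: function mapping an array of shape (k, n) to k booleans.
    Uniform rejection sampling is an assumption made for illustration;
    it makes no claim about the numerical quality of the point set.
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    accepted, count = [], 0
    for _ in range(max_batches):
        candidates = rng.uniform(lower, upper, size=(m, lower.size))
        keep = candidates[inside(candidates)]
        accepted.append(keep)
        count += len(keep)
        if count >= m:
            break
    else:
        raise RuntimeError("region too small relative to its bounding box")
    return np.vstack(accepted)[:m]


def in_unit_3_ball(P):
    """Example region: the unit sphere in the 3-norm."""
    return (np.abs(P) ** 3).sum(axis=1) <= 1.0


points = sample_roi(in_unit_3_ball, [-1, -1, -1], [1, 1, 1], m=50)
```

The budget m appears only as the number of returned points, so a better point generator could be substituted without changing how the region is specified.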

Acquisition and visualization of multi-variate data

Enabling people to inspect and understand complex data sets is a core objective of computational visualization. While algorithms may apply to general data types, staying aware of the original problem domain is crucial to allow for meaningful interpretation of a visualization. Aside from this cognitive motivation, there are also computational reasons to maintain the connection to the data generating source. In particular, if a computational model is used to generate the data set, it may be invoked to obtain further data to refine or extend the region of interest to sample from. While data in its original Latin meaning is “something given”, its acquisition can be influenced by deliberate choices, turning it into a response or “something asked for”. This more active perspective on data explains two related threads in this chapter. The first discusses criteria and methods for sampling, to “ask good questions”, in Section 2.3. The second concerns “making sense of the answers”, which on a numerical level begins with the topic of integration in Section 2.2 and continues with reconstruction in the subsequent sections. This structure repeats in Appendix B at a different level of depth. A closing discussion in Section 2.4 of interactive visual interfaces that facilitate comprehension of the numerical data acquired in this way gives background for Chapters 4 and 6.

The following historical excursion will show that without a notion of continuity of the underlying space, a special treatment of the multi-dimensional setting is not required: all variables could be folded into one without losing any of the non-existent structure. Because a fundamental notion of continuity is readily implied in our multi-dimensional setting, the title of this thesis does not mention it explicitly.

2.1 Effects of dimensionality

An intuitive notion of dimension goes back to Euclid’s “Elements” (300 BC), in which he begins: “1. A point is what has no part. 2. A line is what has length but not width. (...) 5. A surface is what has length and width only.”¹ The dimension of these objects is determined by the number of parameters required to refer to each of their elements: line 1, surface 2, solid 3. This notion of dimension, while intuitive, has a remarkable counterexample.

In 1887 Jordan proposed a rigorous definition of a curve to be a continuous function of a single parameter, whose domain is the unit interval $[0, 1]$. Soon after, Peano and also Hilbert [Hil91] devised a continuous mapping of the unit interval onto the full unit square, creating a space-filling curve that one can follow and pass through all points of the two-dimensional square. Extensions of these mappings cover the entire unit cube $[0, 1]^n$ with a Jordan curve, still depending on a single parameter only.

All of these curves are densely self-intersecting many-to-one mappings, in which several parameter values are sent to the same point. In particular, Hilbert points out that with a slight modification of his square filling curve the number of self-intersections at a point can be reduced to three. The Lebesgue covering theorem mentioned in Appendix A.1 asserts that this number may not be reduced further.

To restore the intuition about the dimensional number as the number of parameters needed to represent each element of a set, one has to add the property of uniqueness to the continuous mapping that parameterizes the set. Such a homeomorphism maps one set onto another leaving all its topological invariants intact, such as dimensional number, number of connected components, or genus (number of holes). To return the focus to data analytic aspects, the definitions of a topology, continuous functions, and manifolds are deferred to Appendix A. After this brief excursion into topological topics involving basic notions of neighbourhood, the following discussion is again of geometric nature.

Volume of the hyper-sphere in Minkowski p-norm: A basic mathematical object is the n-dimensional p-norm sphere

$$S^n_p = \{ x \in \mathbb{R}^n : |x_1|^p + |x_2|^p + \dots + |x_n|^p \le r^p \} \tag{2.1}$$

of radius $r$ with $p \in [1, \infty]$, defined here to contain its boundary. It is of relevance in numerous theoretical and practical geometric settings and arises often in the context of metrics and norms discussed in Appendix A.1.1. In Euclidean space $\mathbb{R}^n$ with Lebesgue measure $\mathrm{vol}_n$, the sphere is the set with the smallest surface area among all sets of equal volume [Mat02, pp. 222]. A closed form expression for the volume of $S^n_p$ is derived by Newman [New72, p. 101] as

$$\mathrm{vol}_n(S^n_p) = 2^n r^n \, \frac{\Gamma(1 + 1/p)^n}{\Gamma(1 + n/p)}, \tag{2.2}$$

using the Gamma function $\Gamma$ as a continuous extension of the factorial, giving $\Gamma(n) = (n-1)!$ for $n \in \mathbb{Z}^+$. The graph in Figure 2.1 shows its behaviour for increasing $n$ and $p$. A curious observation is that the 2-norm sphere hyper-volume reaches its peak at $n = 5$ and that the volumes of all p-norm spheres ultimately converge to 0 for growing $n$, except in the case of the hyper-cube for $p = \infty$ with a volume of $2^n$. It is somewhat misleading to perform this interpretation along the n-axis, because a 3-dimensional sphere, for instance, contains infinitely many non-intersecting 2-dimensional disks of non-zero area. However, vertical comparison for different choices of $p$ is fine and makes a striking case for non-box shaped regions, when using p-spheres to mark out regions of interest for exploration. The discussion in Section 6.2.3 on page 105 will come back to this aspect.

Figure 2.1: Semi-log plot of the volumes of p-norm unit spheres for n = 1, ..., 30 dimensions (horizontal axis: n from 0 to 30; vertical axis: volume, logarithmic from 10^-25 to 10^10; curves for p = 1, 2, 3, 4, 5, 10).

Rearranging Equation 2.2 one obtains the radius

$$r_n = \frac{\Gamma(1 + n/2)^{1/n}}{2\,\Gamma(3/2)} \tag{2.3}$$

for an n-dimensional 2-norm sphere of unit volume. This radius will be relevant in Section 2.3.4 on page 32 to provide an upper bound for the density of periodic sphere packings. The implications of this discussion are: the more variables or dimensions of a metric space are to be inspected, the larger is the volume to cover. The number of configuration points that need to be computed in order to maintain a certain density is directly proportional to this volume. In most settings, this directly corresponds to computational cost, which we would like to minimize. One way to do so is to determine first how many variables actually matter. While algorithmic development on this topic is current research not covered in this thesis, the following brief technical discussion of the issue can provide an entry point in the future.
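Equations 2.2 and 2.3 are easy to check numerically. The sketch below uses only the Python standard library and evaluates the volume through the logarithm of the Gamma function to avoid overflow for larger n; the function names are chosen here for illustration.

```python
import math


def pnorm_ball_volume(n, p, r=1.0):
    """Volume of the n-dimensional p-norm sphere of radius r (Equation 2.2)."""
    if math.isinf(p):
        return (2.0 * r) ** n  # hyper-cube with side length 2r
    log_volume = (n * math.log(2.0 * r)
                  + n * math.lgamma(1.0 + 1.0 / p)
                  - math.lgamma(1.0 + n / p))
    return math.exp(log_volume)


def unit_volume_radius(n):
    """Radius of the n-dimensional 2-norm sphere of unit volume (Equation 2.3)."""
    return math.exp(math.lgamma(1.0 + n / 2.0) / n) / (2.0 * math.gamma(1.5))


# Dimension at which the 2-norm unit sphere has the largest volume.
peak_dimension = max(range(1, 31), key=lambda n: pnorm_ball_volume(n, 2))

# Volumes of the unit spheres for the values of p shown in Figure 2.1.
volumes_n10 = {p: pnorm_ball_volume(10, p) for p in (1, 2, 3, 4, 5, 10)}
```

For n = 2 and p = 2 the first function returns the area of the unit disk, and inserting `unit_volume_radius(n)` as the radius gives a volume of one up to rounding, which confirms that the two equations are consistent with each other.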

Estimating dimensionality: Lebesgue’s covering theorem points out a connection between the dimensional number $n$ of a region $M$ and the minimum number of $n + 1$ simultaneous intersections when covering $M$ with small open neighbourhoods. In a metric space one can use interiors of spheres of radius $R$ for this purpose. Let $m_R(M)$ denote the minimum number of such neighbourhoods of diameter $< R$ needed to cover a set $M$ and note that if $M$ has an n-dimensional volume, this count fulfills $m_R(M) \sim R^{-n}$. This is the idea behind the so-called capacity dimension, also called Hausdorff- or fractal-dimension, that estimates the largest polynomial degree $n$ of the above count for arbitrarily small neighbourhoods as:²

$$\dim_{\mathrm{cap}}(M) = -\lim_{R \to 0^+} \frac{\ln m_R(M)}{\ln R}. \tag{2.4}$$
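For a finite point sample the limit in Equation 2.4 cannot be taken, but the slope of ln m_R against ln R over a range of scales gives a practical estimate. The sketch below assumes NumPy and makes two simplifications that are not part of the definition above: it counts occupied axis-aligned grid cells of side R instead of a minimal cover by spheres, and it fits the slope by least squares over user-chosen scales. The name `box_counting_dimension` is hypothetical.

```python
import numpy as np


def box_counting_dimension(points, radii):
    """Estimate the capacity dimension of a point set from occupied grid cells.

    points: array of shape (N, n); radii: decreasing cell sizes R.
    Returns the negated least-squares slope of ln(count) versus ln(R).
    """
    points = np.asarray(points, dtype=float)
    origin = points.min(axis=0)
    counts = []
    for R in radii:
        cells = np.floor((points - origin) / R).astype(np.int64)
        counts.append(len(np.unique(cells, axis=0)))
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return -slope


# Points on a flat 2-dimensional patch embedded in 3 dimensions.
rng = np.random.default_rng(1)
patch = np.column_stack([rng.random((20000, 2)), np.full(20000, 0.3)])
estimate = box_counting_dimension(patch, [1 / 4, 1 / 8, 1 / 16])
```

The scales have to stay coarse enough that every cell intersecting the set still contains sample points; otherwise the count saturates at the number of points and the estimate drops towards zero.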

A more easily computed lower bound to this number is given by the correlation dimension.³ It is also considered as a measure of intrinsic dimension by Levina and Bickel [LB05]. To allow for a statistical analysis, they construct a Poisson process for the point set that is uniformly distributed inside $M$, counting the number of points that lie within a growing radius of each other. The growth rate of this process depends on the local density, dimensionality, and implicitly the sphere volume $V_n = \mathrm{vol}_n(S^n_2)$ derived in Equation 2.2. Based on this, a closed form solution is derived for the maximum likelihood estimate of the dimensionality that is averaged over all points of $M$. Their discussion points out that dimension may vary with position $x$ and with a choice of scale, which is specified in their method by either a maximum radius or an index of the k-furthest neighbour to consider in the estimate.

² Note that this works for any non-zero volume of $M$, as it comes out of the logarithm in the numerator as an additive constant and then vanishes against the magnitude of the denominator.

³ The correlation dimension [Wei09] counts pairs of points of a set $M$ within a radius $R$ of each other as a measure of connectedness. This count rises in the order of $R^n$, with the monomial degree $n$ corresponding [...]
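The maximum likelihood estimate of Levina and Bickel in its k-nearest-neighbour form can be sketched in a few lines. With $T_j(x)$ denoting the distance from $x$ to its j-th nearest neighbour, the local estimate is $\left[\frac{1}{k-1}\sum_{j=1}^{k-1} \ln\frac{T_k(x)}{T_j(x)}\right]^{-1}$, which is then averaged over all points. The code assumes NumPy, uses brute-force distances suitable only for small samples, and assumes that the sample contains no duplicate points; the function name is chosen here for illustration.

```python
import numpy as np


def mle_intrinsic_dimension(points, k=10):
    """Average k-nearest-neighbour maximum likelihood dimension estimate.

    points: array of shape (N, n) with N > k and no duplicate rows.
    """
    X = np.asarray(points, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist.sort(axis=1)                    # column 0 is the zero self-distance
    T = dist[:, 1:k + 1]                 # T[:, j-1] = distance to j-th neighbour
    log_ratio = np.log(T[:, -1:] / T[:, :-1])
    local = (k - 1) / log_ratio.sum(axis=1)
    return local.mean()


# Points on a 2-dimensional patch embedded in 5 dimensions.
rng = np.random.default_rng(0)
patch = np.hstack([rng.random((800, 2)), np.zeros((800, 3))])
estimate = mle_intrinsic_dimension(patch, k=20)
```

The neighbour index k sets the scale of the estimate: small values react to local noise, while large values average over regions in which curvature or boundaries of M start to matter.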

Beyond these geometric perspectives on dimensionality, the approximation of functions in linear spaces also requires choices of sample points for which a discussion is provided in Appendix A.2. This setting can be interpreted as looking at a family of integrals — one member for each point to reconstruct. The following discussion more specifically considers the computation of a single integral.
