Making choices in multi-dimensional parameter spaces

(1)

PARAMETER SPACES

by

Steven Bergner

MSc, Otto-von-Guericke University of Magdeburg, 2003

a Thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in the

School of Computing Science Faculty of Applied Sciences

c

Steven Bergner 2011 SIMON FRASER UNIVERSITY

Fall 2011

However, in accordance with the Copyright Act of Canada, this work may be reproduced, without authorization, under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in

(2)

Name: Steven Bergner

Degree: Doctor of Philosophy

Title of Thesis: Making choices in multi-dimensional parameter spaces

Examining Committee: Dr. Mark Drew Chair

Dr. Torsten M¨oller, Senior Supervisor Professor of Computing Science

Dr. Derek Bingham, Supervisor

Associate Professor — Industrial Statistics

Dr. Steven J. Ruuth, Internal Examiner

Professor of Applied and Computational Mathematics

Dr. Min Chen, External Examiner Professor of Scientific Visualization University of Oxford

Date Approved:

ii

(3)

Last revision: Spring 09

Declaration of

Partial Copyright Licence

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.

The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.

It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.

The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.

Simon Fraser University Library Burnaby, BC, Canada

(4)

STATEMENT OF

ETHICS APPROVAL

The author, whose name appears on the title page of this work, has

obtained, for the research described in this work, either:

(a) Human research ethics approval from the Simon Fraser University

Office of Research Ethics,

or

(b) Advance approval of the animal care protocol from the University

Animal Care Committee of Simon Fraser University;

or has conducted the research

(c) as a co-investigator, collaborator or research assistant in a

research project approved in advance,

or

(d) as a member of a course approved in advance for minimal risk

human research, by the Office of Research Ethics.

A copy of the approval letter has been filed at the Theses Office of the

University Library at the time of submission of this thesis or project.

The original application for approval and letter of approval are filed with

the relevant offices. Inquiries may be directed to those authorities.

Simon Fraser University Library

Simon Fraser University

Burnaby, BC, Canada

(5)

Visualization techniques are key to leveraging human experience, knowledge, and intuition when establishing a connection between computational models and real world systems. At this interface my dissertation enables effective choices of parameter configurations for dif-ferent levels of user involvement.

Based on a characterization of several domains of computer experimentation that include a model of biological aggregations, image segmentation methods, and rendering algorithms, I derive a set of requirements to propose paraglide — a framework for user-driven analysis of parameter effects. One outcome of the workflow I suggest is a partitioning of the continuous space of model configurations into distinct regions of homogenous system behaviour.

To facilitate progressive exploration of a parameter region, I develop a space-filling sam-pling method by constructing point lattices that contain rotated and scaled versions of themselves. All levels of resolution share a single type of Voronoi polytope, whose volume grows independently of the dimensionality by a chosen integer factor as low as 2.

To optimize rendering time while ensuring image quality when viewing data in a 3-dimensional volume, I perform a Fourier domain analysis of the effect of composing two functions. Based on this, I relax a previous lower bound for a sufficient sampling frequency and apply it to adaptively choose the numerical integration step size in raycasting.

By assigning optical properties to data using a spectral light model, it becomes possible to improve physical realism and to create colour effects that scale the level of distinguishable detail in a visualization. To help modellers to cope with the freedom in a large design space of synthetic lights and materials, I devise a method that generates a palette of presets that globally optimize user-specified criteria and regularization. This is augmented with two alternative user interfaces to unobtrusively choose a desired mixture.

(6)

On my way through graduate school I had the luck to cross paths with many remarkable people. I am grateful to my supervisors Torsten M¨oller and Derek Bingham for inspiring discussions, constructive input, and ongoing support. Min Chen and Steve Ruuth deserve credit for their willingness to examine my thesis on rather short notice and still providing very good feedback.

The folks at the Graphics, Usability, and Visualization laboratory (GrUVi) at Simon Fraser University made this an inspiring and fun place to work. While I had many inspiring discussions throughout the years in the lab, I would in particular like to acknowledge Ramsay Dyer, Alireza Entezari, Ahmed Saad, Tai Meng, Zahid Hossain, and Tom Torsney-Weir for their input and collaborations. In particular, I would like to thank Usman Alim, Niklas R¨ober, and Nhi Nguyen for proof reading parts of my thesis.

There are also several people beyond the lab that I had the honour to learn from during different stages of my research: Mark Drew, Dave Muraki, Thierry Blu, Dimitri Van De Ville, Melanie Tory, Tamara Munzner, Michael Sedlmair, and Stephen Ingram.

Also, I would like to thank past students for their great work that they did under my (co-)supervision (or despite of it): Vincent Cua, Vladimir Kim, Rishabh Iyer, Matt Crider, Theresa Sanchez, and Sareh Nabi Abdolyousefi.

Apart from proof reading, I am indebted to Nhi Nguyen for always having an open ear and for being and amazing friend and partner in all regards beyond grad school.

Last, but not least, none of this journey would have been possible without the encour-agement and support of my parents and my whole family to whom I would like to extend my deepest gratitude.

(7)

Approval ii Abstract iii Acknowledgments iv Contents v List of Tables ix List of Figures x

1 Connecting formal and real systems 1

1.1 Contributions of this dissertation . . . 3

1.2 Problem domains that require parameter tuning . . . 4

1.2.1 Mathematical modelling: Collective behaviour in biological aggregations 5 1.2.2 Bio-medical imaging: Segmentation algorithm . . . 7

1.2.3 Engineering: Fuel cells . . . 7

1.2.4 Visualization: Scene setup and rendering algorithm configuration . . . 9

1.3 Task structure . . . 12

1.4 Data abstraction . . . 15

2 Acquisition and visualization of multi-variate data 19 2.1 Effects of dimensionality . . . 20

2.2 Quadrature error . . . 23

2.3 Discretization of multi-dimensional functions . . . 25

2.3.1 Useful concepts for metric data representation . . . 25

(8)

2.3.4 Optimal packing and bounds by Minkowski and Zador . . . 32

2.4 Visual interfaces for multi-variate computer model data . . . 34

2.4.1 Computational modelling tasks . . . 36

2.4.2 Reconstruction and refinement of spatial data . . . 39

2.4.3 Direct volume rendering . . . 40

2.4.4 Visualization systems for discrete multi-variate data . . . 41

3 Sampling lattices with low-rate refinement 43 3.1 Change of lattice basis and similarity . . . 44

3.2 Construction of sampling lattices with low-rate rotational dilation . . . 47

3.2.1 Constructing rotational dilation matrices . . . 49

3.2.2 Characteristic polynomial of a scaled rotation matrix in Rn . . . 50

3.2.3 Construction algorithm . . . 52

3.3 Further dimensions and subsampling ratios . . . 53

4 A sampling bound for composed functions 57 4.1 Frequency domain analysis . . . 59

4.1.1 Visual inspection of the frequency transfer kernel ˆK(ω, ν) . . . 60

4.1.2 Determining the boundary of the cone . . . 62

4.1.3 Error analysis . . . 64

4.1.4 Limits of the model . . . 65

4.1.5 Relationship to Carson’s rule . . . 65

4.2 Application to volume rendering . . . 66

4.3 Discussion and outlook . . . 69

5 Designing a palette for spectral lighting 71 5.1 Related work . . . 74

5.1.1 Previous approaches to constructing spectra . . . 75

5.1.2 Linear light models . . . 76

5.1.3 Accuracy . . . 78

5.2 Designing spectra . . . 79

5.3 Matrix formulation . . . 80

(9)

5.4 Evaluation and visual results . . . 84

5.4.1 Example palette design . . . 85

5.4.2 Design error with respect to number of constraints . . . 85

5.4.3 Spectral surface graphics . . . 89

5.4.4 Rendering volumes interactively . . . 90

5.5 Discussion and conclusions . . . 91

5.5.1 Future directions . . . 93

5.5.2 Conclusions . . . 94

6 Interactive parameter space partitioning 95 6.1 Background . . . 96

6.1.1 Interactive parameter adjustment in computer experiments . . . 97

6.1.2 Parameter space partitioning . . . 98

6.1.3 Unfulfilled design requirements . . . 99

6.2 Design of the paraglide system . . . 99

6.2.1 System components . . . 100

6.2.2 Browsing computed data . . . 102

6.2.3 Representing a region of interest . . . 105

6.2.4 Non-linear screen mappings . . . 107

6.3 Excursion: Steering a multi-dimensional cursor . . . 109

6.3.1 A light dial to control additive mixtures . . . 109

6.3.2 Enabling simultaneous parameter adjustments using a mixing board . 111 6.4 Validation of paraglide in different use cases . . . 118

6.4.1 Movement patterns of biological aggregations . . . 118

6.4.2 Bio-medical imaging: Tuning image segmentation parameters . . . . 120

6.4.3 Fuel cell stack prototyping . . . 123

6.5 Discussion and future work . . . 124

7 Discussion and conclusion 128 7.1 Conclusion . . . 131

(10)

(11)

1.1 Summary of the requirement analysis. . . 12

7.1 Summary of thesis contributions where the level of user involvement decreases and the inclusion of theoretical analysis increases from top to bottom, with rows corresponding to different chapters in the order 6, 5, 4, and 3. . . 130

(12)

1.1 Classification of a slice of d-PET data using two different parameter config-urations. The classes are 1: background (BG), 2: skull (SK), 3: grey matter (GM), 4: white matter (WM), 5: cerebellum (CM), and 6: putamen (PN). . 8 1.2 (a) SPloM view evaluation of the effect of value reconstruction filter (top

row) and sampling distance (bottom row). The columns show different image comparison metrics by comparing with a best possible ground ’truth’ image (b) showing the 643 hipiph data set using a transfer function that includes smooth and sharp opacity transitions. Value filters are sorted by order of approximation (ef:1,2,3,4) and, within each order, by degree of smoothness (c:1,2,3) [MMMY97]. The sampling step size along a ray is colour coded from bottom to top / blue to orange for [1/200, 1/32, 1/16, 1/8, 1/4, 1/2, 1] grid spacing units. . . 10 1.3 Abstraction of data, interaction, and computational components. Lines

in-dicate shared data among processing steps and arrows prescribe an order of execution. On a more detailed level,Red is required input and bluedenotes information that is available after a processing step. . . 16

2.1 Semi-log plot of volumes of n = 1 . . . 30 – dimensional p-norm unit spheres. . 21 2.2 Set of 20 randomly distributed sample points in [0, 1]2 with Voronoi regions

outlined in blue and Voronoi relevant neighbours connected by grey lines. Notice how the convex hull and the minimum spanning tree are part of the grey Delaunay graph connecting Voronoi relevant neighbours. For points in general position the Delaunay graph turns out to be a triangulation or a simplicial complex in higher-dimensional spaces. . . 26

(13)

between the densities given at the named abscissae may be in disagreement with the actual best packing. The packing radius of the Cartesian lattice is 0.5 in any dimension, which is represented in the figure by the horizontal axis. 33 2.4 Zador’s bounds for the mean squared quantization error of optimal quantizers

inRn. . . 34 2.5 Schematic overview of tasks related to studying effects in model parameter

spaces. Thebluecoordinate axes symbolize the construction of the parame-terization with one dependent response variable indicated by iso-lines in the background. The two-sided blue arrow in the center represents the task of fitting a model to observed field data. The blue chip along this line repre-sents the possiblity to fit in a digital substitute model that can include model assumptions to make up for missing data or is simply more efficient to eval-uate than a more complex model or direct field measurements. The green

itinerary or schedule of parameter adjustments could be provided by com-putational steering interfaces for a time-evolving simulation model. Thered

target indicates the goal of a search for an optimal configuration. The region labels and outlines in black illustrate a partitioning of the parameter space into regions of homgenous behaviour of selected responses. An important part of this picture, but not part of the drawing, is the human observer who is responsible for interpretation of the analysis in the context of a particular purpose. . . 37

3.1 2D lattice with basis vectors and subsampling as given by G and K in the diagram title. The spiral points correspond to a sequence of fractional sub-samplings GKsfor s = 0..1 with the notable feature that for s = 1 one obtains a subset Λ(GK) (thick dots) of the original lattice sites Λ(G) (small black dots). This repeats for any further integer power of K, each time reducing the sampling density by |det K| = 2. . . 44 3.2 The best 3D lattice obtained for a design with dilation matrices having

|det K| = 2. The letters f and v in the title line indicate faces and vertices, respectively. The different colours encode the different zones. . . 54

(14)

sampling with a rotation of θ = 45◦. The other two are new schemes with different rotation angles. The thick dots show the sample positions that are retained after subsampling by K. The second row shows the same lattice at twice the density, with more iteration levels of similarity transformed Voronoi cells. . . 55 3.4 Four non-equivalent 2D lattices obtained for a design with dilation matrices

having |det K| = 3. The lattice on the left is the well known hexagonal lattice with a θ = 30◦ rotation. The other three are new schemes with different rotation angles. . . 55 3.5 Comparison of packing radii of best known packings, Cartesian packing, and

the upper bound as in Figure 2.3 on page 33. In addition, some designs of this chapter are shown that enjoy the low-rate rotational reduction property of Equation 3.2 for rates β ∈ {2, 3}. If proceeding directly from a companion K as provided by Algorithm 2 unoptimized constructions can be generated instantly for any n. The optimized designs are obtained by maximization of the packing radius over choices of S in Equation 3.9. The depicted results beat Cartesian packing in all cases except for n ∈ {1, 3}. . . 56

4.1 Sampling comparison. The data y = f (x) (a) is composed with a transfer function g(y) (b). Figures (c) and (d) show sinc-interpolated samplings of g(f (x)). The tighter bounding frequency (d) suggested in this chapter re-sults in 5 times fewer samples for these particular f and g, still truthfully representing the composite signal. . . 58 4.2 The frequency map ˆK(ω, ν) for a function f (x) determines how much a

fre-quency ν of ˆg contributes to a frequency ω of the spectrum of the composed function g(f (x)). The examples are (a) single and (c) mixed non-normalized Gaussians, using ϕ(µ, σ, x) = exp −(x − µ)2/(2σ2) and their correspond-ing ˆK(ω, ν) in (b) and (d), respectively. The upper and lower slopes of the low-valued cones (black) are given by the reciprocal of the maximum and the minimum values of f0, respectively, as shown in Section 4.1.2. . . 61

(15)

The value for t = 0 in Equation 4.18 is attained at the band edge θ = θe. . . . 64

4.4 Same sampling rates are suggested by both estimates if a single sinusoidal signal is composed, using the lower frequency of the example in Figure 4.1a, with the mapping in Figure 4.1b. Both estimates have been 2× over-sampled, using a sampling frequency that is four times the respective limit frequency. 65 4.5 Examples of the hipiph data set sampled at a fixed rate (0.5) (a) and sampled

with adaptive stepping (b). The adaptive method in (b) uses about 25% fewer samples than (a) only measuring in areas of non-zero opacity to not account for effects of empty-space skipping. The similarity of both images indicates that visual quality is preserved in the adaptive, reduced sampling. . . 67 4.6 Quality vs. performance . . . 68 4.7 Visual comparison . . . 69

5.1 Spectral design of two material reflectances shown on the left of their repre-sentative rows. The colors formed under two different illumination spectra are shown in the squares in the respective columns where D65 (right column) produces a metameric appearance. . . 80 5.2 The reflectance spectra on the left of each row are designed to be metameric

under daylight (colours column 1) and to gradually split off into 3 and 5 distinguishable colours under two artificial ‘split light’ sources. The resulting reflectance spectra are given below the figure. . . 86 5.3 Each graph shows the average L∗a∗b∗ error in the design process for palettes

of given sizes, constraining all light-reflectance combination colours for several palettes of different sizes. Changing spectral models and constraints results in different design error: a) the positivity constrained 31D model, b) the positivity constrained 100D colour model, c) 31D without positivity constraint. 87

(16)

along the horizontal axis indicates the L∗a∗b∗ distance between the desired colours and the vertical position indicates distance of the resulting colour pair after the design. A position close to the diagonal indicates how well the distance within a pair was preserved in the design. a) 31-D spectra, uncon-strained; b) and c) positivity constrained spectra with 31 and 100 dimensions, respectively. . . 88 5.5 Car model rendered with PBRT. (a) The spectral materials used in the

tex-ture are metameric under daylight D65, resulting in a monochrome appear-ance. (b) Changing the illumination spectrum to that of a high pressure sodium lamp, as used in street lighting, breaks apart the metamerism and reveals additional visual information. . . 89 5.6 Engine block rendered using metamers and colour constancy. The three

im-ages in the figure are re-illuminated without repeating the raycasting. . . 90

6.1 Paraglide GUI running inside a MATLAB session to investigate the animal movement model of Section 1.2.1. Initially, deliberately chosen parameter combinations are imported from a switch/case script (a) by sampling the case selection variable of that script and recording the variables it sets. An overview (b) of the data is given in form of a scatter plot matrix (SPloM) for a chosen dimension group (h). Jython commands can be issued inside the command window (c) demonstrating the plug-in functionality of the system by manually importing the experiment module, which adds a new item to the menu bar (d). This allows to create a set of new sample points inside the region that is selected for parameters qaand qal (e). The configuration dialog

for theMATLAB compute node (f) sets up a show command that produces a detail view of the spatio-temporal pattern (1D+time) (g). For the configu-ration point highlighted in yellow in the SPloM, this results in a pattern of two groups that merge and then progress upwards in a ’zigzag’ movement. . 100 6.2 Dialog to set up aMATLAB compute node . . . 102

(17)

(bulb icons) (see Eq. 6.1). . . 109 6.4 (a) Graphical user interface (GUI) for the BCF2000 mixing board (b) showing

an experimental trial in progress. . . 112 6.5 Recorded slider motion paths for item 6: [ . . . 110 110]+/-1 shown in

front of non-manipulation intervals (gray) and mistake intervals (red). The top row is mixer interaction of a participant performing item ID 6 of forthcoming Figure 6.6 in three trial blocks for each of the two input methods in Figure 6.4. An error moving irrelevant slider 6 is indicated in the third block. The path correlation matrix in the right column is discussed in Section 6.3.2.1. . . 113 6.6 Simultaneous manipulation of sliders as indicated by the maximum (in

abso-lute value) of the normalized slider-slider velocity cross-correlations of Equa-tion 6.2. Simultaneity varies for different items and participants. . . 115 6.7 Illustrating the sample creation in a sub-region of the parameter space

it-erating from coarse to finer sampling. (Un-)filled circles indicate parameter configurations that lead to an (un-)stable steady state. . . 119 6.8 Scatter plot matrix view that compares the point embedding (lower left) with

the objective measures that went into computing its underlying similarity measure. The numbering of the responses corresponds to the class labels of Figure 1.1 (ID 5: cerebellum and 6: putamen). . . 121 6.9 Scatter plot matrix view of the good cluster (yellow) identified in Figure 6.8

viewed in the subspace of input parameters. In this view sigma and alpha3 indicate clear thresholds beyond which the good configurations are found. . 122 6.10 Two layouts for 204 example experiments. a) input space showing variation

in current and input temperature, b) embedding of the same samples where spatial proximity reflects plot similarity for cell current density, c) similarity embedding for membrane electrode assembly (MEA) water content using the same clusters as assigned in (b). Cluster representatives are shown in Fig-ure 6.11 and FigFig-ure 6.12. These screenshots are from the 2007 C++ version ofparaglide, and are also attainable in the currently discussed Java implemen-tation. . . 124 6.11 Cell current density plots for the clusters in Figure 6.10b . . . 125

(18)

A.1 Illustration of the concept of forward cones F of Equation A.1 in a di-rection u for the vertices of the white quadrangle, similar to Leydold and H¨ormann [LH98, Fig. 3]. F (v1) begins in white, F (v2) and F (v3) in light

red and F (v4) in dark red. All cones extend infinitely towards the right, but

are artificially cut of in the picture. Even in practical computations this step could be required to ensure that the valuations in Equation A.2 stay finite. . 135

B.1 (a) The sample density inside the unit cube jumps when scaling Cartesian lattices in Rn for n = 2..4 using a scaled identity generating matrix R = αI. (b) Discrepancy of various 3D point sets including some randomly rotated regular lattices at differently scaled density. . . 141

(19)

Connecting formal and real systems

To record observations that are either made directly or are acquired with the aid of carefully crafted devices is the starting point of the scientific method that prescribes how to construct models that represent real-world systems. Here a system is understood as a collection of things or concepts that exist by certain mechanisms or rules that in a formal system can be laid out and verified rigorously. A model denotes a system that is intended to mimic certain aspects of another system. A computational model is a formal system, where an algorithm can be run for a given input configuration to compute an output description of the simulated system. While such an algorithm is assumed to be deterministic, additional input may introduce randomized behaviour. Having a model whose behaviour fits with available observations, it becomes possible to diagnose or predict phenomena that occur at different levels of structural organization, allowing to choose actions that influence development of the system, or to utilize the observed effects in artificial constructions1 that, for instance, improve quality of life.

Within this broad picture, the focus of this dissertation is to facillitate development, analysis, and interpretation of computational models through efficient human interfaces. This objective is approached at different stages of the modelling process through the lens of parameter adjustment. Typical tasks and research challenges that arise when working with computer model parameters are discussed in Section 2.4.1. The different approaches pursued in the following chapters include a) methods for direct parameter choices, b) specification of criteria that characterize good choices, and c) theoretical study of model properties that

1

Note that the meaning of artificial as in human-made is a special case of natural.

(20)

guide general parameter choice. The work of Chapter 6 enables manipulation of control parameters in an interactive manner (a). A less direct path is taken in the computer graphics setting of Chapter 5, where the user specifies criteria (b), such that globally optimal parameter choices can be made automatically. Pursuing the adjustment of internal model parameters, Chapter 4 performs a theoretical study (c) of properties of a fundamental data processing step and uses the findings to determine a volume rendering parameter. On a similarly theoretical level, Chapter 3 uses general criteria of uniformity in the design of a space-filling sampling pattern where user input is merely required in form of a constraining region and a budget for the number of sample points.

The underlying theme of human guidance is somewhat contrary to typical objectives of algorithm development that seek to eliminate the need for human input as much as possible in order to obtain consistent results efficiently. When asking someone on the hallway of the computing science department how this is supposed to be done, the typical answers boil down to: “Think harder about the problem, incorporate into your algorithm what is known and improve the theory where it is insufficient.” While it is imaginable to eventually automate information collection [PBMW99] and inference mechanisms [BWB09], thinking about a problem in its original domain and improving a model abstraction of it are still very much human-centric tasks. Also, when a ready-made model is employed that influences choices that affect people, e.g., regarding the health of a human being, the livability of a city in planning, or if it informs decisions regarding the economy of a country or our environment — why would we want to give up the ability a) to convince ourselves that the model does the right thing (is valid); b) to give hints that make up for missing information from our experience, knowledge, or intuition; or c) to make human inventiveness available to get a better2 model more quickly?

The specific examples that I will give in Section 1.2 are a bit smaller scale than what was just mentioned. However, these computational modelling settings also share a need for human input. This need provides a basic motivation for visualization3 as a discipline at the core of scientific study. The particular focus on visual methods to map properties of data to the screen is justified due to our well developed visual perception. The simultaneity

2

Better here means a model with a strong linkage to the real-world system that it is intended to repre-sent. [SRA+08, Ch. 1]

3_{Beyond the technical scope, the term visualization is also understood as a method of thinking,}

learn-ing, and communication that works with mental imagery and its transformation, e.g., to improve athletic performance [GC08]. However, in our research context the meaning is confined to the computational domain.

(21)

and spatial resolution of visual processing are unparalleled by other senses, such as touch or hearing [Gol10, pp. 888]. It enables us to integrate a massive stream of cues for location, shape, motion, colour, or textural properties that are presented by a still image or an ani-mated sequence. Visualization research seeks to utilize this natural processing apparatus by devising effective visual encodings and interaction metaphors that work with data [Mun12]. However, the quality of any data display hinges on the relevance of the given data for the driving questions. Hence, in extension of the classical scope of visualization research, my dissertation is looking at user involvement in the acquisition of data. Specifically, this concerns sampling of continuous phenomena, as well as adjustments in parameter spaces of computational models.

1.1 Contributions of this dissertation

The remainder of this first chapter will perform a characterization of several domains of computer experimentation to derive a set of requirements for human-guided study of models in Section 1.3. Building on this analysis towards the end of this thesis, Chapter 6 then proposes paraglide — a framework for user-driven analysis of parameter effects. Its main point is to facilitate a workflow that results in a partitioning of the continuous space of model configurations into regions of distinct system behaviour.

Establishing a model often involves the adjustment of multiple parameters. The contri-bution of the theoretical survey4 in Chapter 2 is to show effects that arise when discretizing multi-dimensional Euclidean space with a minimal budget of points. This also prepares the background for the remaining chapters.

One feature of paraglide is to elevate user interaction from specifying point locations in parameter searches to a higher level of outlining constraints of a region that can then be filled with a given budget of points (see Section 6.2.3). To be able to fill this space efficiently is the motivation behind Chapter 3, where a class of point sets is designed to directly fulfill a number of quality criteria. To facilitate progressive regional exploration, point lattices are constructed that contain rotated and scaled versions of themselves. Their distinguishing property is that different levels of resolution share a single type of Voronoi polytope, whose volume grows independently of the dimensionality by a chosen integer factor as low as 2.

4

(22)

Many real world phenomena can be viewed in 3-dimensional Euclidean space together with a time coordinate. Hence, a common type of data in numerical visualization are grids that represent some physical quantity for each point in a bounded volume. In order to reduce sampling costs while maintaining image quality when rendering such data, Chapter 4 presents a Fourier domain analysis of the effect of composing a data signal with a function that assigns visibility to relevant values. Based on this, a previous lower bound for a sufficient sampling frequency is relaxed and applied to adaptively choose the step size in raycasting.

Using a spectral light model, it is possible to improve physical realism and to create colour effects that scale the level of distinguishable detail in a visualization. To help mod-ellers to make useful decisions in a high-dimensional design space of palettes of lights and materials, Chapter 5 devises a method that generates a palette of presets that optimally fulfill a set of design criteria. A brief excursion in Section 6.3 discusses two alternative user interfaces for steering a multi-dimensional cursor that can also be applied to unobtrusively search for a suitable mixture.

A word on the narrator’s perspective: While I take care to not claim contributions in this thesis that are not mine, I recognize that ideas are not born in a vacuum and will therefore in the following adapt the pronoun we to refer to me as a researcher, the group I was working with, or to engage the readership, which should be clear from context. I will explicitly state external contributions where sources are otherwise not clear.

1.2 Problem domains that require parameter tuning

In order to get a more detailed understanding of needs and requirements for user-driven parameter space navigation, we engaged in a problem characterization phase by conducting contextual interviews with six experts from three different domains: engineering, mathe-matical modelling, and segmentation algorithm development. Based on that, we summarize design requirements in Section 1.3 that are more general yet grounded in real-world appli-cation areas.

Our methodology corresponds to Munzner’s nested model [Mun09], which casts good practices of software engineering into the realm of visualization research validation. With this background, we prepare the discussion of effective visual encodings and interaction techniques in Chapter 6, where our toolparaglideis presented and validated. This motivates the remaining chapters that discuss efficient algorithms for related purposes.

(23)

1.2.1 Mathematical modelling: Collective behaviour in biological aggre-gations

Our first target group are two researchers studying properties of a mathematical model that describes spatial and spatio-temporal patterns of biological aggregations. Furthering the understanding of such patterns helps to predict animal migration behaviour, e.g., to better understand how, where, and when fish aggregations form to suggest more efficient fishing strategies [Par99]. Modelled patterns can also inform measures to contain plagues of locusts and positively affect quality of life in developing countries [BSC+06].

To study those spatio-temporal patterns, our participants developed a mathematical model [FE10, LS02] consisting of a system of partial differential equations (PDEs) that express in one spatial dimension how left and right travelling densities of individuals move and turn over time, with details provided at the end of this section. The basic idea is to take three kinds of social forces into account — namely attraction, repulsion, and alignment — that act globally among the densities of individuals. Attraction is the tendency between distant individuals to get closer to each other, repulsion is the social force that causes indi-viduals in close proximity to repel from each other, and alignment represents the tendency to sync the direction of motion with neighbours. Solving the model for different choices of coefficients produces many complex spatial and spatio-temporal patterns observed in na-ture, such as stationary aggregations formed by resting animals, zigzagging flocks of birds, milling schools of fish, and rippling behaviour observed in Myxobacteria.

Our use case is part of a Master’s thesis on this subject with a focus on comparing two versions of their model [Abd11]. In the first one the velocity is constant. In the second one the individuals speed up or slow down as a response to their social interactions with neighbours. Comparing these models requires to solve them numerically for several different configurations. Each instance of the model corresponds to one specific choice of the 14 model parameters that include the coefficients for the three postulated social forces. The output of the simulation is a spatio-temporal pattern of population densities. The number of basis functions is an internal parameter that gives the resolution in space and time and influences the trade-off for accuracy vs. runtime, which for reasonable output lies between 2 minutes and half an hour. With 5 minutes each, one can perform a full computation of close to 300 sample points in the duration of a single day.

(24)

the parameter combinations of their model and demonstrated its capability to reproduce a variety of complex patterns. While it is difficult to classify all possible patterns, there are a few standard solutions among them, for which established analysis techniques exist. In particular, they focus on the solutions of the system that do not change over time and space — so called spatially homogeneous steady states. A linear stability analysis of these steady states results in negative or positive growth rates for different perturbation frequencies, which respectively indicate stable and unstable solutions.

There is a hypothesized relationship between the stability of steady states and the po-tential for pattern formation. This leads to a derived, more specific goal of the study. In particular, it enables comparison of constant and non-constant velocity models by inspect-ing the change in shape of the parameter regions that lead to (un-)stable steady states. A discussion on how paraglideaffects our participant’s workflow involving, for instance, quick prototypic of new dependent feature variables, is given in Section 6.4.1. An overview of the parameter space is part of Figure 6.1.

Details about the mathematical model of biological aggregations: Eftimie, de Vries, Lewis, and Lutscher [FE10, LS02] introduced the one-dimensional model discussed above to describe animal aggregations as

∂tu+(x, t) + ∂x(Γ+u+(x, t)) = −λ+u+(x, t) + λ−u−(x, t)

∂tu−(x, t) − ∂x(Γ−u−(x, t)) = λ+u+(x, t) − λ−u−(x, t)

u±(x, 0) = u±₀, x ∈R,

where u+(x, t) and u−(x, t) represent the density of the right and left moving individuals at position x and time t. Γ represents the velocity. Eftimie et al. considered two different cases for Γ: constant and density dependent velocities. The functionals λ+_{and λ}−_{amount to the}

turning rates for initially right or left moving individuals that turn left or right respectively. Attraction, repulsion, and alignment interactions are modeled in these operators via convo-lutions with three different kernels that represent each type of response with respect to the current environment of densities u±. The amount of individuals that turn to left/right, but were initially moving to the right/left is given by the terms λ+u+ and λ−u−, respectively.

(25)

1.2.2 Bio-medical imaging: Segmentation algorithm

Saad et al. [SHMS08] use a kinetic model to devise a multi-class, seed-initialized, iterative segmentation algorithm for molecular image quantification. Due to the low signal-to-noise ratio and partial volume effect present in dynamic-positron emission tomography (d-PET) data, their segmentation method has to incorporate prior knowledge. In this noisy setting, the segmentation of a basic random walker [Gra06] would just result in Voronoi regions around the seed points. An extension by Saad et al. makes this method usable for noisy data by adding energy terms that account for desirable criteria, such as data fidelity, shape prior, intensity prior, and regularization.

In order to attain the superior segmentation quality of the algorithm, a proper choice of weights for the energy mixture is crucial. To facilitate this choice of weight parameters, their code provides numerical performance measures that assess the quality of each segment. One such measure is the Dice coefficient [Dic45], which gives a ratio of overlap with labelled training data. A second measure expresses an error of the quality of the kinetic modelling. Overall, the algorithm is influenced by eight factors (parameters). Ten response variables provide two quality measures per segment, disregarding background.

The model calibration could proceed by numerical optimization of the performance. However, for that a choice of importance weights has to be given, which again is a step where in the general setting human input is required. However, for instance the Dice coefficients that indicate agreement of the segmented shape with given training data for putamen, using the two configurations of Figure 1.1(c) and (d), are both above the 90th percentile of the sampled configurations and less than 0.003 standard deviations apart. Numerically, this means that both segmentations are of the same, near optimal quality. Yet by visual inspection, it is possible to tell that the putamen (PN) shape in (d) is favourable over the one obtained in (c). Hence, guidance of a domain expert is desirable to sort among several candidate solutions in order to find an improved segmentation, which is hard or impossible to choose automatically. A workflow for such a procedure is subject of Section 6.4.2.

1.2.3 Engineering: Fuel cells

A fuel cell takes hydrogen and oxygen as gaseous input and converts them into water and heat, while generating an electric current. Affordable, high-performance fuel cells have the

(26)

(a) raw data (b) ground truth

(c) config# 13, dice6= .8136 (d) config# 44, dice6= .8128

Figure 1.1: Classification of a slice of d-PET data using two different parameter configurations. The classes are 1: background (BG), 2: skull (SK), 3: grey matter (GM), 4: white matter (WM), 5: cerebellum (CM), and 6: putamen (PN).

potential to enable more environmentally friendly means of transport without any CO2

emis-sion. To manufacture a prototypical cell stack costs tens of thousands of dollars. Hence, a reliable synthetic model can greatly bring down the price of finding an optimal configuration for production.

The example investigated here is a simulation of a fuel cell stack developed by Chang et al. [CKPW07]. Their stack model is a system of coupled one-dimensional PDEs de-scribing the individual cells in the stack. It can be adjusted with about 100 parameters, where suitable choices of values are known for most of these parameters from fitting to available measurements. Computing a simulation run outputs 43 different plots that show how certain physical quantities, such as current density, temperature, or relative humidity vary across the geometry of the cell stack. The computer model can be rerun for differ-ent configurations and, thus, allows for much broader exploration of design options than real prototyping. In particular, engineers are interested in studying failure mechanisms and to optimize performance under varying conditions for different groups of parameters

(27)

that represent the geometry of the assembly (size and number of cells in a stack), material properties (permeability), or running conditions (temperature, pressure, concentration for cathode and anode). Experiments demonstrating the use of the suggested interactions are given in Section 6.4.3.

1.2.4 Visualization: Scene setup and rendering algorithm configuration

The following two use cases are located in a research laboratory, which is a natural envi-ronment for model development. A topic of particular interest at the Graphics, Usability, and Visualization (GrUVi)-Lab at Simon Fraser University is to improve techniques for the visualization of volumetric data. This kind of data can capture physical quantities, such as density, flow, or tensor information in a bounded spatial domain and is used to answer diagnostic and predictive questions in fields as diverse as health care, environmental studies, or engineering. In both cases, I was part of the group, but the objectives and developed tool chain apply to different follow-up projects, beyond the initial one described here.

1.2.4.1 Rendering algorithm parameters

The goal of this project in 2005 was to make headway into the open problem of providing guidelines to choose optimal configuration parameters of a volume rendering algorithm. Ideally, this should lead to informative views of the studied data that are of sufficient quality, do not miss any important features, and that can be computed at minimal cost.

While the previous cases where suggesting interactive experimentation, such a proce-dure amounted to current practices in this particular use case that is described in the following. To facillitate a pilot study that would allow first statements about the rendering problem, I developed an object oriented implementation of a raycaster based on C++ tem-plate mechanisms and integrated it into the lab software vuVolume, which is published on sourceforge.net.

The customizable raycaster could be interacted with at runtime by writing and read-ing a hierarchical XML description of its state. This way, computation for many different configurations could be done offline with bash scripted for-loops that would combine pre-setXML snippets to set data, transfer function, lighting model, camera, and reconstruction parameters [MMMY97]. Different configurations influence the resulting image quality and processing costs for time and memory. For each setup we generated one image with a

(28)

best possible and expensive choice of reconstruction parameters, which in a slight abuse of terminology was referred to as ground truth.

(a) scatter plot matrix (SPloM)

(b) ground truth image using best parameter settings

Figure 1.2: (a) SPloM view evaluation of the effect of value reconstruction filter (top row) and sampling distance (bottom row). The columns show different image comparison metrics by compar-ing with a best possible ground ’truth’ image (b) showcompar-ing the 643 _{hipiph data set using a transfer}

function that includes smooth and sharp opacity transitions. Value filters are sorted by order of approximation (ef:1,2,3,4) and, within each order, by degree of smoothness (c:1,2,3) [MMMY97]. The sampling step size along a ray is colour coded from bottom to top / blue to orange for [1/200, 1/32, 1/16, 1/8, 1/4, 1/2, 1] grid spacing units.

It was then possible to experimentally validate the quality of the resulting renditions with a number of numerical and perceptually based image distance metrics (e.g., [Dal93]), whose implementation is due to Alireza Entezari. The resulting table of factor choices vs. truth distance responses was then visualized using a scatter plot matrix (SPloM) view pro-vided by the wekatoolkit. An example for one particular volumetric scene setup is shown in Figure 1.2a. This is part of a SPloM that focusses on the effects of two parameters, namely the type of basis function that is used for value reconstruction, and the sampling

(29)

distance that is used when numerically accumulating radiance along rays through the re-constructed volume. When combined for all wavelengths this gives the colour information that is displayed in each pixel of the rendered image as shown in Figure 1.2b.

While these graphs do not lend themselves to derive concise, generalizable statements about the rendering problem, they do show significant effects of sampling distance, and of the order of approximation of the reconstruction basis. This pilot study also showed that a satisfactory exploration of all rendering parameter effects using nested for-loops is infeasible.

1.2.4.2 Scene parameters

In order to render a scene it has to be constructed first. As will be explained in Section 5.1, modelling techniques for geometry have already received much attention in graphics re-search, whereas the construction of synthetic materials still has important open questions to offer.

An aspect of particular interest in our lab, was the support of more complete light models considering full spectral power distributions (SPDs) instead of tri-chromatic (red,green,blue)-tuples In our previous work [BMDF02], we outlined the usefulness of a more complete light-ing model, by pointlight-ing out that it enables two complementary directions: improved physical realism, as well as increased design freedom for artificial effects.

Materials and light sources: In order to find SPDs to configure light sources and mate-rials in a scene, one can either acquire them using a measurement device, obtain data bases of spectra from the web, or construct new ones by mixing combinations of a smaller set.

Note that the inclusion of lights in this construction is already a step beyond classical computer graphics, where one simply sets up (red, green, blue)-triples for each surface point and illuminates with a white light. Designers of a scene that care about correct colour reproduction when illuminating with something else than just white light need to find spectra that produce specific colours for particular combinations of illuminant and surface. Alternatively, one could also ask for perceived colour differences between certain materials to be zero. The task of a designer who constructs a scene is to come up with these criteria and to find materials that fulfill them.

Considering the huge size of the design space in relation to the small size of desirable solutions, this is a very difficult, if not infeasible adjustment problem. This poses a research problem that is addressed in Chapter 5, where also the lack of current practices for this

(30)

purpose and directions for improvement are elaborated further.

Light mixtures: In the scenes we consider the superposition principle of light applies. So, given a designed palette with light sources, a remaining task is to define a linear mixture of these lights or a transition of mixtures that produce a convincing visualization. For a complete setup one would adjust the geometry of the lights and move them around to achieve the desired effects. More abstractly, this amounts to choosing a vector of weights that results in a set of desired colours for all surfaces in view. Due to the central role of human judgement in this setting, this poses an interactive adjustment task for which Section 6.3 investigates two different practical solutions.

1.3 Task structure

All previous use cases are motivated by questions about a real-world system. In each setting, domain knowledge about the problem has been expressed in form of a computational model and the parameter space of the model is explored in order to relate observations about its properties to the corresponding real setting. Please refer to the previous subsections for details about the parameter spaces in the respective settings. The studies share a set of requirements as summarized in Table 1.1.

Table 1.1: Summary of the requirement analysis.

R1: integrate with existing practices and code R2: specify parameter region of interest (ROI) R2a: sample ROI and compute data set

R3: browse data providing overview (R3a) and detail (R3b) R4: construct feature variables (manually labelled or computed) R4a: combine features to derive a distance metric

R5: identify region(s) of similar outcome in parameter space

R6: find optimal point for a particular dependent variable or user notion R7: analyse sensitivity of feature values to change in input

R8: save state of the project for later reproduction

(31)

Biological aggregation patterns: During our interviews, it became clear that it is impor-tant for these target users to inspect the behaviour of an existingMATLABimplementation for their PDE system (R1). This allows to invoke the simulation for different combinations of parameter values. Since multiple different parameter combinations have to be explored, it is necessary to narrow down the computations to suitable regions in parameter space (R2) and to assist the choice of sample points in these regions (R2a). Visual judgement of the computed solutions (R3) is one method to enable a qualitative distinction among different patterns of movement (R3b) and to determine, which different sets of parameter choices produce a given behaviour (R3aand R5). The growth rate of a linear perturbation can be included as a feature variable (R4). Due to the size of the space of possible solutions, com-putational help to generate an overview of all possible behaviours is desirable. To ensure findings are reproducible, it should be possible to save the state of the project (R8). Bio-medical imaging: In this setting, assistance in choosing a suitable parameter region to sample is again an important task, where a visual approach is required (R2). At this early stage of model development the chosen law-driven approach [SRA+08, p. 5] is most effective. In order to determine plausibility of a solution the ability to use visual judgement is crucial (R3). While initially mathematical modellers perform this task, input from biology experts is imaginable at a later stage. This could also yield ideas for primary sources of field data that could feasibly be included. Also, dependent feature variables (objective measures) are already constructed for segmentation quality (Dice coefficients) and kinetic modelling error. RequirementsR1-4 apply here, except for the sample creation R2a. The complexity of the segmentation problem can only be captured by multiple performance measures. The main goal is to find a robust parameter configuration that leads to good performance and is robust under a number of varying factors. In particular, performance should be invariant for different noise levels or patient scans. Automatic optimization (R6) is challenging with multiple competing quality measures. One step towards that goal is to produce a weighted sum of performance measures (R4). For the algorithm to work robustly under different factors, it is important that the segmentation quality does not decay too quickly for slight changes to the chosen input parameter configuration (R7). To enable the user to assess robustness of a performance optimum, it is helpful to identify the region of parameter configurations that lead to ’good’ segmentation results (R3a+5), in order to make a robust choice within it.

(32)

Fuel cell stack design: This case of simulation model inspection invokes basic require-ments R1-3. The goal of constructing a high-performance cell stack is akin to R4 and R6. Also, fitting the model can be seen as an inverse problem of finding parameter settings that match measured output behaviour. Instead of introducing a new requirement, we recognize that this can be achieved through an optimization (R6). Related to that is the need to have a reliable and trusted model requires to identify parameter region boundaries that indicate transitions in stack behaviour. Such a decomposition (R5) can greatly support reasoning about plausibility of the model.

Renderer configuration: The use case in Section 1.2.4.2 is different from the previous ones in that interactive experimentation has already been done using a tool chain involving vuVolume, bash, MATLAB, and weka. The outcome of a pilot study in Figure 1.2a gave first insight into the complexity of the rendering problem, indicating the limits of an exhausive exploration via a factorial design that combines all possible design options.

However, the overview in Figure 1.2 is not the solution, but rather the initial problem setting in this case. Open questions arising from this study are to find theoretical approaches that address suitable choices of reconstruction kernel and sampling distance. The PhD thesis of Entezari [Ent07] made significant contributions to the former aspect that were also applied in the volume renderer of the tool chain developed in this pilot study. The latter aspect regarding the choice of sampling distance gave inspiration for the theoretical analysis (R9) provided in Chapter 4.

Setting up spectral illumination for a scene: In order to see how the problem of find-ing spectral power distributions for lights and materials fits in the frame of multi-parameter optimization, consider a palette for a linear light model (see Section 5.1.2) to be a point in a Euclidean space with a dimensionality of [(#materials + #lights) × #lightmodel com-ponents]. Some cases that we encoutered involved as many as 248 dimensions, such as the example of Figure 5.2 on page 86 with 5 materials under 3 lights using a 31-dimensional spectral model.

The design criteria in this setting are specified by combination colours for each light vs. each material. For human tri-chromatic perception it is sufficient to specify a perceived colour with 3 components. In the 5 × 3 example this amounts to 45 dependent colour components for that both, a function to compute them and a desired target (R,G,B)-tuple are given (R4).

(33)

The task of finding suitable spectra in this setting is an inverse problem where one has to choose factors (light and material SPDs) that produce given combination colour responses. Rather than defining a new requirement item, this can be framed as an optimization prob-lem (R6), for which manual search through this space is virtually impossible and, hence, automatic assistance would be desirable. Rather than analysing sensitivity (R7) in this case it actually needs to be controlled by introducing regularization terms that are decribed further in Section 5.2.

Some of the spectra may be required to match real-world measurements and all of them should be positive in order to be physically plausible, restricting the search to the positive orthant. Constraining the solutions in this way can be seen as non-interactive construction of a feasibility region (R2).

1.4 Data abstraction

A common theme of our use cases is that, unlike classical data visualization settings, they do not start from given data, but work with a computational model, whose parameter space is sampled in order to construct a set of data points for interactive analysis. Here, the notion of a data source, which was recently introduced as abstraction of a file loader by Ingram et al. [IMI+10], could be extended to include basic information about the available variables, their types, and valid ranges. Enhanced with a capability to query new points that are not yet stored in a data table, this could be used as an interface to static data as well as dynamic computation, similar to a function call in a procedural programming language. This results in a synthetic data source, which we refer to as compute node. The design of a basic user interface for this abstraction will be discussed in Section 6.2.1. The conceptual organization of the required tasks along with their inputs and outputs is shown in Figure 1.3. It separately considers user interaction and computational pipeline, where all modules operate on the same data and share one flow of control.

A numerical model in our cases consists of relations among a set of variables or dimen-sions. Some background on relevant mathematical concepts is summarized in Appendix A. Variables may be inherent to the problem domain or internal to the model. Further possible distinction can be made considering the distribution of their values, which can be contin-uous or discrete, (un-)known, (un-)observable, (in-)expensive to sample from, determined empirically from data, or structurally inferred.

(34)

Set up compute node run default point show derived variables file IO Group variables/dims Specify ROI

View data (sub-)space

overview bi-variate view histogram detail view Assign variables assign manually trigger computation for points and resolution

Sample inputs

Derive variables Compute outputs

User Interaction Computation

Distance metric features objectives embedding coordinates cluster membership #dims region of interest restrict to ROI #dims 2 1 0 for resolution

Figure 1.3: Abstraction of data, interaction, and computational components. Lines indicate shared data among processing steps and arrows prescribe an order of execution. On a more detailed level,

Redis required input andbluedenotes information that is available after a processing step.

The overall model is represented by a function f : Rn → Rr_. _{It is parametrized}

over a multi-variate Euclidean domain, in which a point is denoted in vector notation as x = (x1, x2, . . . , xn). A point in the multi-field range of f , can be computed as f (x) =

y = (y1, y2, . . . , yr). The combination of domain × range of f gives its data space.

(35)

However, this notion does not apply in general, since the presence of constraints may intro-duce dependencies among the xi. Alternative terminology for inputs and outputs of f are

factors and responses, or parameters and derived variables, respectively5.

We assume that code to compute f is given as a black box and can be invoked for a set of points X = {xk} ⊂Rn of finite size m = |X|. This set X is referred to as a design or a

sample [SWN03, p. 15]. With a prescribed ordering this amounts to the construction of a design matrix X ∈Rm×n containing the points in its m rows (R2a). The mapping f gives a set of responses Y = {f (xk)}. By concatenating these values as [X Y] ∈Rm×(n+r)the data

table is obtained, giving the main input to further processing or visualization. Applying the concept of a function f , we impose that the output of the code is deterministic. Uncertainties of the system can still be modelled by specifying probability distributions for additional environmental variables xi [SWN03, p. 121]. Even for non-deterministic code, such an

additional variable could simply index the order or the time of a particular invocation. Using the conceptual ingredients of Figure 1.3, f is formed by composing a potentially costly image computation h (R1) and variable derivation g (R4) to give f = g ◦ h. The image of h is meant in a mathematical sense, but can represent an actual picture or a disk image that captures the result of the computation for a particular configuration x. This indirection should emphasize the possibility to cache output images ˜y ∈ D, but in simple cases the derived response variables yi are computed directly and g is just the identity

D =Rr→Rr_.

Depending on what derived variable yi = gi(˜y) is specified (R4), its information may be

interpreted as a feature, embedding coordinate, cluster membership label, likelihood, dis-tance from a template point, or objective measure (R6) — to give a few practical examples. In each case it may be possible to compute gi or to assign values manually, depending on

whether a function definition or a user’s concept is available.

Some processing steps (R5+7) require a notion of distance or similarity among points (R4a). Technical background on distances and norms is provided in Appendix A.1.1.

One method considered here is the Euclidean distance dr between feature vectors in Rr.

Beyond that, distance dccombines all information about each configuration point, including

its parameter coordinates x ∈Rnor a domain specific function operating on the disk images. In order to accommodate simulations with a large number of variables n + r, an early

5

(36)

step of the interaction allows the user to divide variables into groups of smaller sizes nl.

This way one can separate input and output, indicate other semantic information inherent to the simulation model, and produce more focussed multi-variable views (R3).

An important aspect of Figure 1.3 is that the sample creation is split into a specification of a region of interest M ⊂Rn+r, where areas in the input space are outlined (R2), see also Section 6.2.3. This is input to an automatic method that generates a point distribution of good numerical quality in that region that also fulfills a given budget m. What good quality means in the context of multi-dimensional point distribution is subject of the following chapter.

(37)

Acquisition and visualization of

multi-variate data

Enabling people to inspect and understand complex data sets is a core objective of computa-tional visualization. While algorithms may apply to general data types, staying aware of the original problem domain is crucial to allow for meaningful interpretation of a visualization. Aside from this cognitive motivation there are also computational reasons to maintain the connection to the data generating source. In particular, if a computational model is used to generate the data set, it may be invoked to obtain further data to refine or extend the region of interest to sample from. While data in its original Latin meaning is “something given” its acquisition can be influenced by deliberate choices, turning it into a response or “something asked for”. This more active perspective on data explains two related threads in this chapter, namely to discuss criteria and methods for sampling to “ask good questions” in Section 2.3 that leads into a discussion of “making sense of the answers”, which on a numerical level begins with the topic of integration in Section 2.2 and reconstruction in the subsequent sections. This structure repeats in Appendix B at a different level of depth. A closing discussion of interactive visual interfaces in Section 2.4 that facilitate comprehension of the so-acquired numerical data gives background for Chapters 4 and 6.

The following historical excursion will show that without a notion of continuity of the underlying space, a special treatment of the multi-dimensional setting is not required — all variables could be folded into one without loosing any of the non-existent structure. Because a fundamental notion of continuity is readily implied in our multi-dimensional setting, the title of this thesis does not mention it explicitly.

(38)

2.1 Effects of dimensionality

An intuitive notion of dimension goes back to Euclid’s “Elements” (300 BC), in which he begins: “1. A point is what has no part. 2. A line is what has lengths but not width. (...) 5. A surface is what has length and width only.”1 The dimension of these objects is determined by the number of parameters required to refer to each of their elements: line 1, surface 2, solid 3. This notion of dimension, while intuitive, has a remarkable counter example.

In 1887 Jordan proposed a rigorous definition of a curve to be a continuous function of a single parameter, whose domain is the unit interval [0, 1]. Soon after, Peano and also Hilbert [Hil91] devised a continuous mapping of the unit interval onto the full unit square creating a space-filling curve that one can follow and pass through all points of the two-dimensional square. Extensions of these mappings cover the entire unit cube [0, 1]n with a Jordan curve, still depending on a single parameter only.

All of these curves are densely self-intersecting one-to-many mappings. In particular, Hilbert points out that with a slight modification of his square filling curve the number of self-intersections at a point can be reduced to three. The Lebesgue covering theorem mentioned in Appendix A.1 asserts that this number may not be reduced further.

To restore the intuition about the dimensional number as the number of parameters needed to represent each element of a set, one has to add the property of uniqueness to the continuous mapping that parameterizes the set. Such a homeomorphism maps one set into another leaving all its topological invariants intact, such as dimensional number, number of connected components, or genus (number of holes). To return the focus to data analytic aspects, the definition of a topology, continuous functions, and manifolds are deferred to Appendix A. After briefly leading into topological topics involving basic notions of neighbourhood, the following discussion is again of geometric nature.

Volume of the hyper-sphere in Minkowski p-norm: A basic mathematical object is the n-dimensional p-norm sphere

S_np = {x ∈Rn: xp₁+ xp₂+ · · · + xp_n≤ rp_} _(2.1)

of radius r with p ∈ [1, ∞], defined here to contain its boundary. It is of relevance in

1

(39)

numerous theoretical and practical geometric settings and arises often in the context of metrics and norms discussed in Appendix A.1.1. In Euclidean space Rn with Lebesgue measure voln the sphere is the set with the smallest surface area [Mat02, pp. 222]. A closed

form expression for its volume is derived by Newman [New72, p. 101] as

voln(Snp) = 2nrn

Γ(1 + 1/p)n

Γ(1 + n/p), (2.2)

using the Gamma function Γ as a continuous extension of the factorial, giving Γ(n) = (n−1)! for n ∈Z+. The graph in Figure 2.1 shows its behaviour for increasing n and p. A curious

0 5 10 15 20 25 30 10−25 10−20 10−15 10−10 10−5 100 105 1010

volume of n−sphere for p−norm

n volume p = 1 p = 2 p = 3 p = 4 p = 5 p = 10

Figure 2.1: Semi-log plot of volumes of n = 1 . . . 30 – dimensional p-norm unit spheres.

observation is that the 2-norm sphere hyper-volume reaches its peak at n = 5 and volumes of all p-norm spheres ultimately converge to 0 for growing n except in the case of the hyper-cube for p = ∞ with a volume of 2n. It is somewhat misleading to perform this interpretation along the n-axis, because a 3-dimensional sphere, for instance, contains infinitely many non-intersecting 2-dimensional disks of non-zero area. However, vertical comparison for different choices of p is fine and makes a striking case for non-box shaped regions, when using p-spheres to mark out regions of interest for exploration. The discussion in Section 6.2.3 on page 105

(40)

will come back to this aspect. Rearranging Equation 2.2 one obtains the radius

rn=

Γ(1 + n/2)1/n

2Γ(3/2) (2.3)

for an n-dimensional 2-norm sphere of unit volume. This radius will be relevant in Sec-tion 2.3.4 on page 32 to provide an upper bound for the density of periodic sphere packings. The implications of this discussion are: the more variables or dimensions in a metric space are to be inspected, the larger is the volume to cover. Volume is directly proportional to the number of configuration points that need to be computed in order to maintain a certain density. In most settings, this directly corresponds to computational cost, which we would like to minimize. One way to do so, is to determine first, how many variables actually matter. While algorithmic development on this topic is current research not covered in this thesis, the following brief technical discussion of the issue can provide an entry point in the future.

Estimating dimensionality: Lebesgue’s covering theorem points out a connection be-tween the dimensional number n of a region M and the minimum number of n + 1 simulta-neous intersections when covering M with small open neighbourhoods. In a metric space one can use interiors of spheres of radius R for this purpose. Let mR(M ) denote the minimum

number of such neighbourhoods of diameter < R needed to cover a set M and note that if M has an n-dimensional volume, this count fulfills mR(M ) ∼ R−n. This is the idea behind

the so-called capacity dimension, also called Hausdorff- or fractal-dimension that estimates the largest polynomial degree n of the above count for arbitrarily small neighbourhoods as:2

dimcap(M ) = − lim R→0+

ln mR(M )

ln R . (2.4)

A more easily computed lower bound to this number is given by the correlation dimension.3 It is also considered as a measure of intrinsic dimension by Levina and Bickel [LB05]. To allow for a statistical analysis, they construct a Poisson process for the point set that is uni-formly distributed inside M counting the number of points that lie within a growing radius

2_{Note that this works for any non-zero volume of M , as it comes out of the logarithm in the enumerator}

as an additive constant and then vanishes against the magnitude of the denominator.

3

The correlation dimension [Wei09] is counting pairs of points of a set M within a radius R of each other as a measure of connectedness. This count rises in the order of Rn_{, with the monomial degree n corresponding}