Extending Haber–McNabb Dataflow Reference Model

In this section we revisit the Haber–McNabb dataflow model: we elaborate the data enrichment step so that it can better describe higher dimensional visualization problems; and then we show how this same model can effectively describe both multivariate and multidimensional problems from information visualization and scientific visualization by

In this new version there are five data stages (Problem Data, Visualization Data, Reduced Data, Abstract Visualization Object, and Displayable Image); four data transformation processes (Data Analysis, Data Picturing, Visualization Mapping and Rendering); and one control element (Data Interaction). Some of the new elements of the suggested model have been taken from the TSV ontology presented previously in Chapter 2 (cf. Section 2.3). They are the data transformation processes Data Analysis and Data Picturing, and the control element Data Interaction. Figure 3.7 shows all the elements of the model and how they interconnect.

[Diagram: Problem Data → Data Analysis → Visualization Data → Data Picturing → Reduced Data → Visualization Mapping → Abstract Visualization Object → Rendering → Displayable Image, with Data Interaction between the User and the processes. Legend: Data, Process, Control.]

Figure 3.7: Proposed high-dimensional visualization reference model. The new elements of the model are coloured as follows: the categories of the TSV ontology are in yellow, and the three new data stages are in blue. The other modules are the original elements of the Haber–McNabb model (shown in Figure 3.3).

In our extended model, we replace the data enhancement process with two separate processes: ‘Data Analysis’ and ‘Data Picturing’. In the ‘Data Analysis’ step, the raw data undergo one of the data analysis tasks belonging to the Data Analysis Stage of the TSV ontology (cf. Section 2.3.1, Chapter 2). The aim here is to employ a pre-processing procedure to reduce the dimensionality of the data prior to the visualization itself. This could be, for instance, a multidimensional scaling algorithm, or a clustering analysis. Although the data analysis step is a pre-processing step, it is possible to return to it and alter the controlling parameters of the task. The manipulation of the parameters that control a data analysis task is described in the model by the control element called ‘Data Interaction’. However, changing the control parameters of a data analysis task is the exception rather than the rule. Since there is little interaction with the user, one can see this as a ‘computer-centred’ operation.

In the ‘Data Picturing’ step, we apply one of the four strategies described in the Data Picturing Stage: filtering, embedding, mapping, or projection. For example, if the filtering approach is adopted, this step would involve the extraction of the portion of the data we wish to visualize; while if the embedding approach is chosen, this would generate a hierarchical structure to accommodate the original data. In other words, the ‘Data Picturing’ transformation step creates a new data domain that constitutes the basis for the graphical representation to be produced in the next steps. In contrast with the ‘Data Analysis’ step, the ‘Data Picturing’ step is mostly interactive – the interaction again is accounted for by the ‘Data Interaction’ control element. The user will typically interact with the system at this level, experimenting with various configurations – thus the data picturing step can be seen as ‘human-centred’.

Therefore our suggested model works as follows:

1. We start with our Problem Data – this is the original data or it may be the data after a basic transformation whose goal is not to reduce the data dimensionality but simply to prepare the data for the visualization process.

If we consider the earlier example in which one wishes to visualize a collection of documents, this data stage corresponds to the list of documents stored as a set of n-dimensional vectors, after being converted from an array of strings via vector space analysis.

2. The Problem Data might undergo any of the pre-processing methods described in the Data Analysis Stage of the TSV ontology, such as PCA, MDS, or clustering, to become the Visualization Data.

In the collection of documents example this corresponds to applying an MDS algorithm to the set of normalized vectors in order to map them to a set of triplets (X, Y, Z) corresponding to locations in 3D.

3. The next step takes in the Visualization Data and applies any of the data picturing approaches described in the TSV ontology (i.e. filtering, embedding, mapping, or projection). The result of the transformation is the Reduced Data: the data transformed into a new data domain whose dimensionality is usually lower than that of the original data and which, therefore, can be rendered more easily.

This is equivalent to representing the documents as spheres in a 3D scatterplot visualization, using the generated triplets as coordinates for the spheres in the scatterplot.

4. The third and fourth steps correspond to the mapping and rendering processes of the original Haber–McNabb model. The mapping step takes the Reduced Data and creates some geometrical representation, thus generating the Abstract Visualization Object. The ‘Rendering’ step creates the Displayable Image for display on a monitor. At this stage the spheres representing the documents are created and receive a position in the geometric space of a window where the final image is shown.
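The four steps above can be sketched as a chain of functions, one per transformation process. This is only an illustrative sketch: the function names, the sample vectors, and the variance-based coordinate selection (a crude stand-in for PCA or MDS) are assumptions, and the Data Picturing step here uses the filtering strategy rather than the scatterplot mapping of the document example.

```python
from statistics import pvariance


def data_analysis(problem_data, keep=3):
    """Problem Data -> Visualization Data.

    Stand-in for PCA/MDS (an assumption, not the thesis method): keep the
    `keep` coordinates with the largest variance across the records.
    """
    n_dims = len(problem_data[0])
    variances = [pvariance([row[d] for row in problem_data]) for d in range(n_dims)]
    ranked = sorted(range(n_dims), key=lambda d: variances[d], reverse=True)
    chosen = sorted(ranked[:keep])
    return [tuple(row[d] for d in chosen) for row in problem_data]


def data_picturing(visualization_data, lower, upper):
    """Visualization Data -> Reduced Data (filtering strategy):
    keep only the points that lie inside an axis-aligned window."""
    return [p for p in visualization_data
            if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))]


def visualization_mapping(reduced_data, radius=0.05):
    """Reduced Data -> Abstract Visualization Object (one sphere per point)."""
    return [{"shape": "sphere", "centre": p, "radius": radius} for p in reduced_data]


def rendering(avo):
    """Abstract Visualization Object -> Displayable Image.
    A text listing stands in for a real renderer."""
    return "\n".join(f"{g['shape']} at {g['centre']} r={g['radius']}" for g in avo)


# Hypothetical Problem Data: four documents as 5-dimensional term vectors.
problem_data = [
    (0.9, 0.1, 0.0, 0.5, 0.2),
    (0.8, 0.2, 0.0, 0.4, 0.9),
    (0.1, 0.9, 0.0, 0.5, 0.1),
    (0.2, 0.8, 0.0, 0.6, 0.8),
]

visualization_data = data_analysis(problem_data, keep=3)
reduced_data = data_picturing(visualization_data,
                              lower=(0.0, 0.0, 0.0), upper=(0.5, 1.0, 1.0))
image = rendering(visualization_mapping(reduced_data))
print(image)
```

Each function consumes exactly one data stage and produces the next, which mirrors the dataflow reading of the model: any stage can be swapped for another technique without touching its neighbours.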

The last new element of this model is the presence of the ‘Data Interaction’ control, which works as a controlling layer between the user (and their task) and the ‘Data Analysis’, ‘Data Picturing’, and ‘Rendering’ processes. The ‘Data Interaction’ element describes the several types of interaction that the user can apply to control the different transformation processes. This could be, for example, zooming in or out in a cone tree representation (‘Rendering’ level); setting up and controlling a brushing and linking element in a scatterplot matrix representation (‘Data Picturing’ level); or changing the parameters of an MDS algorithm to be applied to the data (‘Data Analysis’ level) or changing the parameters of an interpolant process used, say, in a multidimensional application.
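One way to picture the ‘Data Interaction’ control is as a layer that updates the parameters of a process and forces only that process and its successors to run again. The sketch below assumes a hypothetical `Pipeline` class with toy stages; none of these names come from the thesis or from any visualization system.

```python
class Pipeline:
    """Dataflow pipeline whose stages are re-run only from the level
    at which the user changed a parameter."""

    LEVELS = ["data_analysis", "data_picturing", "visualization_mapping", "rendering"]

    def __init__(self, problem_data, stages, params):
        self.problem_data = problem_data
        self.stages = stages      # level name -> function(data, **params)
        self.params = params      # level name -> dict of parameters
        self.cache = {}           # level name -> cached output of that stage
        self.log = []             # names of the stages that actually ran

    def interact(self, level, **changes):
        """'Data Interaction': update parameters, invalidate downstream results."""
        self.params[level].update(changes)
        for name in self.LEVELS[self.LEVELS.index(level):]:
            self.cache.pop(name, None)

    def run(self):
        data = self.problem_data
        for name in self.LEVELS:
            if name not in self.cache:
                self.cache[name] = self.stages[name](data, **self.params[name])
                self.log.append(name)
            data = self.cache[name]
        return data


# Toy stages (assumptions chosen only to make the flow visible).
stages = {
    "data_analysis": lambda data, scale: [x * scale for x in data],
    "data_picturing": lambda data, lo, hi: [x for x in data if lo <= x <= hi],
    "visualization_mapping": lambda data: [("point", x) for x in data],
    "rendering": lambda data, zoom: [(kind, x * zoom) for kind, x in data],
}
params = {
    "data_analysis": {"scale": 1},
    "data_picturing": {"lo": 2, "hi": 4},
    "visualization_mapping": {},
    "rendering": {"zoom": 1},
}

pipeline = Pipeline([1, 2, 3, 4, 5], stages, params)
first = pipeline.run()

pipeline.log.clear()
pipeline.interact("rendering", zoom=2)     # a 'Rendering'-level interaction
second = pipeline.run()
rerun_after_zoom = list(pipeline.log)

pipeline.log.clear()
pipeline.interact("data_picturing", hi=5)  # a 'Data Picturing'-level interaction
third = pipeline.run()
rerun_after_filter = list(pipeline.log)
```

In this sketch a zoom re-runs only the rendering stage, whereas a change to the filter window re-runs data picturing and everything after it; the cheaper the level, the more freely the user can experiment.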

We believe that the introduction of an element in the model representing the interaction activities that may occur in a visualization is an important addition because, as mentioned before, interaction plays an important role in the visualization of high-dimensional data and, as such, deserves a separate treatment. Indeed, as observed by Ma in [128], “a good visualization comes from experimenting with visualization, rendering, and viewing parameters to bring out the most relevant information in the data.” Therefore a data visualization system should allow users to explore the parameter space experimentally, using their experience to achieve the visualization goal.

Finally, we think that this new version of a reference model can now better describe a visualization process for high-dimensional data with all its idiosyncrasies, such as the data analysis task commonly used to reduce the data dimensionality. Furthermore this reference model is not restricted to a particular visualization strategy but instead can describe any visualization technique in terms of its three major processes: data analysis, data picturing, and data interaction.

Next we show an instance of this model adapted to describe a novel visualization technique, namely HyperCell, that follows the filtering approach and can be used to tackle both multidimensional and multivariate data.

3.2.2.1 Filtering visualization for multidimensional data

We begin with the case of multidimensional data, that is, data sampled from a function $F(X)$, where $X = (x_1, x_2, \ldots, x_n)$. The visualization mapping and rendering processes are now well understood, but rather less attention has been paid to the data enhancement process. The original intent was that it should be an interpolation process, for example generating a regular grid of data from a given set of scattered data. In reality it has often been interpreted as a transformation process that selects data of interest from a larger initial set.

In the ‘Data Analysis’ step of our extended model, the raw data would have associated with it an interpolation function, with the ability to recreate, throughout the domain, an estimate of the underlying entity being visualized. One can view this interpolation function as being tagged to the data as it passes along the pipeline. It is possible to return to alter the interpolation, but this is the exception rather than the rule.

In the ‘Data Picturing’ step, we have adopted the filtering approach. Therefore at this stage we extract the portion of the data we wish to visualize and generate the ‘Reduced Data’ (in this particular case it has been called ‘Focus Data’). This involves placing bounds on the domain D. We have found it convenient to see this as a pair of distinct operations: the definition of an n-dimensional window with upper and lower bounds, and an n-dimensional focus point within these bounds; together with a constraint term which controls the parameter values within the window – for example, we can reduce the dimension by fixing certain parameters at their focus point values. Thus a slice operation would be seen as both defining a window of interest, and also applying a constraint to specify the slice through the window. The interpolation function created in the data analysis step is used to provide the values of the function on the slice. The filtering process is interactive, thus the user will typically apply a number of filters in a particular session.
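The window, focus point, and constraint described above can be illustrated with a short sketch. It assumes inverse-distance weighting as the interpolation function (the text does not prescribe an interpolant) and uses hypothetical names (`idw_interpolant`, `slice_points`) and made-up sample values.

```python
import math


def idw_interpolant(samples, power=2.0):
    """Return F(x) estimated from scattered samples by inverse-distance
    weighting. IDW is an illustrative choice, not the thesis interpolant."""
    def F(x):
        num = den = 0.0
        for point, value in samples:
            d = math.dist(point, x)
            if d == 0.0:
                return value          # exactly on a sample point
            w = 1.0 / d ** power
            num += w * value
            den += w
        return num / den
    return F


def slice_points(lower, upper, focus, free_dims, steps=3):
    """Sample the window on a regular grid along `free_dims`; every other
    dimension is fixed at its focus-point value (the constraint)."""
    points = [list(focus)]
    for d in free_dims:
        ticks = [lower[d] + i * (upper[d] - lower[d]) / (steps - 1)
                 for i in range(steps)]
        points = [p[:d] + [t] + p[d + 1:] for p in points for t in ticks]
    return [tuple(p) for p in points]


# Hypothetical scattered samples of a function of four variables.
samples = [
    ((0, 0, 0, 0), 0.0),
    ((1, 0, 0, 0), 1.0),
    ((0, 1, 0, 0), 1.0),
    ((1, 1, 1, 1), 4.0),
]
F = idw_interpolant(samples)

# Window [0, 1]^4, focus point inside it, and a 2D slice over x1 and x2.
lower, upper = (0, 0, 0, 0), (1, 1, 1, 1)
focus = (0.5, 0.5, 0.0, 0.0)
grid = slice_points(lower, upper, focus, free_dims=(0, 1), steps=3)
focus_data = [(p, F(p)) for p in grid]
```

The resulting `focus_data` is a 3 × 3 grid of values over the first two dimensions, with the remaining two held at the focus point; moving the focus point or choosing other free dimensions yields a different slice through the same window.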

The adaptation of the extended reference model to the filtering strategy is shown in Figure 3.8. Again we have an overall view as a dataflow pipeline in which one process receives data, operates on it, and passes on the result to another process.

3.2.2.2 Filtering visualization for multivariate data

We now revisit this model from a multivariate data viewpoint. Encouragingly, we find that it describes this case quite effectively. The ‘Problem Data’ now consists of raw multivariate data $F^i = (f^i_1, f^i_2, \ldots, f^i_k)$, $i = 1, 2, \ldots, S$. The ‘Data Analysis’ step is again computer-centred and consists of some analysis technique. Two popular ones are Principal Component Analysis (PCA), which projects the data into a lower-dimensional – i.e. lower number of variates – subspace that accounts for most of the variance in the data [99], and Multidimensional Scaling (MDS), which uses nonlinear optimization to lay out the observations in a lower-dimensional subspace, in such a way that their separation corresponds as closely as possible to their separation in the original higher-dimensional space [138]. Although these techniques are not general means for clustering, their outcome can sometimes be useful in identifying clusters and trends in the data.

[Diagram: Problem Data → Data Analysis → Visualization Data → Filtering → Focus Data → Visualization Mapping → Abstract Visualization Object → Rendering → Displayable Image, with Data Interaction between the User and the processes. The blocks are labelled as the Multivariate/Multidimensional Extended Model and the Haber–McNabb Model.]

Figure 3.8: The suggested high-dimensional reference model adapted to the filtering strategy. The darker blocks on the left-hand side of the dashed line replace the first three components of the original Haber–McNabb dataflow model (see Figure 3.3). The modules in green are the adaptation of the high-dimensional visualization reference model for the filtering strategy.
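The MDS criterion mentioned above, that separations in the layout should match separations in the original space as closely as possible, is often written as a stress function to be minimized. The form below is the common raw stress, given here as an illustration; the symbols $y^i$, $\delta_{ij}$, $d_{ij}$, and $m$ are introduced for this sketch, and the method cited in [138] may use a different normalization or weighting.

```latex
% Original separations between observations F^i in k-variate space
\delta_{ij} = \lVert F^i - F^j \rVert , \qquad 1 \le i < j \le S

% Separations between their images y^i in the m-dimensional layout, m < k
d_{ij}(Y) = \lVert y^i - y^j \rVert , \qquad Y = (y^1, \ldots, y^S)

% Raw stress: the layout Y is chosen to minimize the total mismatch
\sigma(Y) = \sum_{i < j} \bigl( d_{ij}(Y) - \delta_{ij} \bigr)^2
```

Because $\sigma$ depends on the coordinates of $Y$ only through the distances $d_{ij}$, the axes of the resulting layout carry no direct meaning in terms of the original variates.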

Both PCA and MDS have the disadvantage, however, that the original set of variates is no longer retained. That is, the data analysis step produces ‘Visualization Data’ whose variates are not easily interpreted in terms of the variates of the ‘Problem Data’. Moreover, in extreme cases clusters could be lost by the dimension reduction process. As an alternative approach, aiming to retain the original variates, Yang et al. [218] proposed the Visual Hierarchical Dimension Reduction (VHDR) approach. Here the variates are placed into clusters and a representative variate is selected (either the ‘centre’ dimension of the cluster, or a new variate which is an average of those in the cluster). This reduces the complexity of the final display, without destroying the meaning of the variates.
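The averaging variant of the representative variate can be sketched in a few lines. The sketch assumes the clusters of variates are already given (VHDR itself derives them, which is not reproduced here); the function name, the cluster labels, and the records are hypothetical.

```python
from statistics import fmean


def representative_variates(records, clusters):
    """Replace each cluster of variates by one averaged variate.

    `records` is a list of k-variate tuples; `clusters` maps a label to the
    indices of the variates grouped under it (the grouping is assumed given).
    """
    labels = list(clusters)
    reduced = [tuple(fmean(rec[j] for j in clusters[label]) for label in labels)
               for rec in records]
    return labels, reduced


# Hypothetical data: five variates grouped into three clusters.
records = [
    (1.0, 3.0, 10.0, 20.0, 5.0),
    (2.0, 4.0, 30.0, 50.0, 7.0),
]
clusters = {"size": [0, 1], "cost": [2, 3], "age": [4]}

labels, reduced = representative_variates(records, clusters)
```

Each output variate still refers to a named group of original variates, which is the sense in which this kind of reduction keeps the display interpretable.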

The filtering step takes the multivariate ‘Visualization Data’, however produced, and applies an operation very similar to filtering in the multidimensional case. Again we can see the filter as a pair of operations. We define a window in the value space of the k variates, which we can, as before, interpret geometrically as a k-dimensional region. This specifies the bounds of interest on the values of the variates. In addition we apply a constraint, which in this case is a selection from the k variates (similar to the multidimensional case where we used constraints to identify dimensions of interest). In multivariate data visualization, this filtering step of identifying data of interest is often called brushing.
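As a minimal sketch of this pair of operations, the function below applies a window on variate values and then a selection of variates. The function name `brush`, the variate names, and the records are assumptions made for illustration.

```python
def brush(records, names, window, selected):
    """Filter multivariate records.

    window   : {variate name: (low, high)}, bounds of interest on the values
    selected : variate names kept in the output (the constraint)
    """
    index = {name: i for i, name in enumerate(names)}
    inside = [rec for rec in records
              if all(lo <= rec[index[v]] <= hi for v, (lo, hi) in window.items())]
    return [tuple(rec[index[v]] for v in selected) for rec in inside]


# Hypothetical four-variate observations.
names = ("mpg", "hp", "weight", "year")
records = [
    (30, 70, 2000, 80),
    (15, 150, 3500, 72),
    (25, 90, 2500, 78),
    (12, 200, 4200, 70),
]

focus_data = brush(
    records,
    names,
    window={"mpg": (20, 35), "weight": (1500, 2600)},
    selected=("mpg", "hp"),
)
```

Variates without an entry in `window` are unbounded, so the window can be tightened one variate at a time during an interactive session.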

The resulting ‘Focus Data’ (which is the filtering ‘version’ of the ‘Reduced Data’) then passes to the Visualization Mapping step, which applies a suitable technique for multivariate visualization such as those described in Chapter 2 (see for example Section 2.3). Note that in the case of techniques such as scatterplot matrices, we can see these as requiring (for each scatter plot) a filter which extracts a given pair of variates (i.e. a slice) from the set of k. The final Rendering step is as before.
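Reading a scatterplot matrix as a family of filters can be sketched as follows: one filter per unordered pair of variates, each extracting a two-variate slice. The function name and the data are hypothetical, and only the k(k−1)/2 distinct pairs are generated rather than the full k × k layout.

```python
from itertools import combinations


def scatterplot_matrix_panels(records, names):
    """Return {(name_a, name_b): [(a, b), ...]} for every pair of variates."""
    panels = {}
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        panels[(a, b)] = [(rec[i], rec[j]) for rec in records]
    return panels


names = ("x", "y", "z")
records = [(1, 2, 3), (4, 5, 6)]
panels = scatterplot_matrix_panels(records, names)
```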

For data which is both multidimensional and multivariate, we can use exactly the same model. The filtering step now applies a filter first to the multidimensional aspect of the data, and then to the multivariate aspect, using the approaches described above. Indeed the filters can be applied in either order. Please refer to Table 3.3 for a summary of how these two operations relate to multidimensional and multivariate data.

Data type          Data Analysis     Data Picturing (Filtering)
Multidimensional   Interpolation     Window on domain D, selection of dimensions
Multivariate       PCA, MDS, VHDR    Window on variate space, selection of variates

Table 3.3: Listing some techniques associated with the Data Analysis and filtering steps for multidimensional and multivariate cases.

3.3 Summary

In this chapter we revisited the main research problem, focusing on several related issues. We have introduced a reference model to describe the high-dimensional data visualization process. We have shown that this model relates to the TSV ontology introduced earlier in Chapter 2. These two elements – the TSV ontology and the suggested reference model – comprise the formal basis for a framework that describes the visualization of high-dimensional data on a common foundation.

We have also described in more detail how the model can be used, for example, to describe a visualization technique based on the filtering approach. This particular case is further explored in the next two chapters. Chapter 4 deals with the first step of our proposed visualization technique, HyperCell, which involves setting up the filtering parameters and extracting subspaces of a high-dimensional dataset. Chapter 5 deals with the task of organizing the cells into workspaces, in an attempt to build a mental model of the data.

Implementing the Framework: The HyperCell Visualization Technique

We introduce a novel visualization technique called HyperCell in this chapter.

This technique has been designed to address the problem of visualizing multivariate and multidimensional data. Its design is based on the filtering approach, which tries to reduce the dimensionality of a dataset (consequently its complexity) by providing tools for the extraction of subspaces – called cells – from the original dataset.

Firstly we review some major design requirements that guided the implementation of HyperCell. Then we proceed to describe the three core tools that are responsible for the creation of the subspaces needed for the exploration of a high-dimensional data space. Their interface is described and some examples are given to illustrate their functionality. Finally we discuss three enhancements to the filtering process: the incorporation of a fourth dimension, time, in the animation of a 3D cell; the Splitting Cell mechanism; and the use of linking and brushing.

All these tools have been implemented as modules in IRIS Explorer [203]. By doing so we gain access to the data analysis, visualization mapping, and rendering facilities already developed for that environment. Further motivation for this together with implementation details are presented and discussed in Chapter 6. In the next chapter we address the problem of organizing these generated subspaces into a meaningful structure, describing other tools designed for this task.

4.1 HyperCell’s Design Guidelines

The major design guidelines considered in the development of this method were listed previously in Section 3.1.4, Chapter 3, and are summarized below:

1. The rationale for the HyperCell technique is the filtering philosophy, which has been considered an appropriate strategy to tackle the numerical multivariate and multidimensional data.

The reasons for this are: (1) the human mind, when dealing with complex information, prefers to simplify it into small patterns or configurations rather than trying to grasp it as a whole – hence the advantage of presenting the visualization as a series of low-dimensional ‘filtered’ subsets; (2) low-dimensional subsets are easier to visualize because there already exist standard, well established visualization methods for such cases; and (3) the filtering approach preserves the original relation between dimensions or variates, as opposed to the other approaches, i.e. mapping, embedding, and projection. (Detailed discussion on this matter was presented in Section 2.3, Chapter 2, and Section 3.1.1, Chapter 3.)

2. The ‘filtered’ subspaces are to be organized into workspaces, which are designed in such a way as to represent the region in the n-dimensional space being explored. This type of organization supports the process of exploring a high-dimensional space because a workspace can be thought of as a metaphor for a location in the n-space. Therefore changing the parameters that define a window in n-dimensional space, such as focus point coordinates and window ranges, automatically affects all subspaces (i.e. cells) stored in the workspace associated with that window.

3. The visualization of a high-dimensional dataset is realized through a set of dynamic low-dimensional subspaces, which aims to reduce the overall complexity of the task of visualizing a high-dimensional dataset.

By low-dimensional subspaces we mean subspaces with up to three spatial dimensions selected from the original set of variables. By dynamic subspaces we mean that the user can, interactively, change a subspace by increasing or decreasing its dimensionality or simply changing the choice of variables that compose a subspace.

4. Supplying the user with intuitive tools with tight coupling interfaces [1] enables and encourages the application of the ‘filtering’ of subspaces.

a subspace – this is the essence of the filtering process. Tight coupling interface in this context denotes an interface that consistently reflects the current status of the filter environment, which may dynamically change as the user explores the n-space.

5. To develop the elements that comprise the HyperCell technique in such a way that they comply with the suggested dataflow reference model for high-dimensional visualization presented in the previous chapter.

To achieve that it is necessary to separate the core elements of the filtering process into independent modules, which, in turn, are to be implemented in a modular visualization environment.

The ‘realization’ of the high-dimensional entity we wish to visualize arises from the inspection of several subspaces, grouped into workspaces that reflect location in the n-space.
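The relationship between a workspace, its window, and its cells (guidelines 2 and 3) can be sketched with three small classes. The class and method names are hypothetical and are not the IRIS Explorer modules of the actual implementation; the sketch only shows how one shared window makes every cell follow a change of focus point.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    lower: list   # lower bounds, one per dimension of the n-space
    upper: list   # upper bounds
    focus: list   # focus point inside the bounds


@dataclass
class Cell:
    free_dims: tuple   # the one to three dimensions shown in this cell

    def __post_init__(self):
        if not 1 <= len(self.free_dims) <= 3:
            raise ValueError("a cell shows between one and three dimensions")

    def describe(self, window):
        """Return the free ranges and the fixed (focus) values of the subspace."""
        free = {d: (window.lower[d], window.upper[d]) for d in self.free_dims}
        fixed = {d: v for d, v in enumerate(window.focus)
                 if d not in self.free_dims}
        return free, fixed


@dataclass
class Workspace:
    window: Window
    cells: list = field(default_factory=list)

    def move_focus(self, dim, value):
        """Changing the shared window is seen by every cell in the workspace."""
        if not self.window.lower[dim] <= value <= self.window.upper[dim]:
            raise ValueError("focus must stay inside the window")
        self.window.focus[dim] = value

    def describe_cells(self):
        return [cell.describe(self.window) for cell in self.cells]


# A workspace over a 4-dimensional space with a 2D cell and a 1D cell.
workspace = Workspace(
    Window(lower=[0, 0, 0, 0], upper=[10, 10, 10, 10], focus=[5, 5, 5, 5]),
    cells=[Cell((0, 1)), Cell((2,))],
)
workspace.move_focus(3, 7)
descriptions = workspace.describe_cells()
```

Replacing a cell's `free_dims` with a different tuple corresponds to the dynamic behaviour of guideline 3: the same window, viewed through a subspace of different dimensionality or different variables.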
