Conclusions - Visualisation Studio for the analysis of massive datasets

12 Conclusions

This chapter reviews the research and the contributions it has made, discusses

future work and, finally, assesses the achievements in the context of the original

research question.

12.1 Contributions

This section reviews the various contributions made by the thesis’s software

project. The software draws from the three primary fields of neuroscience, software

engineering and data visualisation, uniting aspects of all three to produce an

extensible software tool for the analysis of neural spike train recordings.

12.1.1 The Visualisation Studio (i-Pipeline)

The visualisation studio exploits the “pipelining” introduced by dataflow / visual

programming languages to create pipelines of data processing activities. These

pipelines process raw data into a form suitable for visualisation. The modern

computer increasingly delivers its computational power through multi-core

processors. To fully utilise this power, a modern software application must be written

in a parallel form, with units of work being performed by multiple compute cores. The

pipelines produced by dataflow programming are amenable to parallelisation, permitting

efficient execution of data processing activities in a multi-core environment.
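
The idea of a pipeline whose work can be spread across cores can be illustrated with a minimal Java sketch. It is not the Visualisation Studio's execution engine: the class name PipelineSketch and its methods are hypothetical, and the sketch assumes a purely linear pipeline whose stages are applied to independent data items.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch: a linear pipeline of stages applied to independent data items. */
final class PipelineSketch {

    /** Chains the stages in order, so the output of one stage feeds the next. */
    static <T> Function<T, T> compose(List<Function<T, T>> stages) {
        Function<T, T> pipeline = Function.identity();
        for (Function<T, T> stage : stages) {
            pipeline = pipeline.andThen(stage);
        }
        return pipeline;
    }

    /** Runs the whole pipeline over every item; the items are independent, so a parallel stream may use several cores. */
    static <T> List<T> run(List<T> items, List<Function<T, T>> stages) {
        Function<T, T> pipeline = compose(stages);
        return items.parallelStream().map(pipeline).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two toy stages standing in for real data processing activities.
        List<Function<Integer, Integer>> stages = List.of(x -> x + 1, x -> x * 2);
        System.out.println(run(List.of(1, 2, 3), stages)); // prints [4, 6, 8]
    }
}
```

Because the stages are held in an ordinary list, introducing, removing or re-ordering a stage only changes the list and leaves the execution code untouched.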

Visual programming of a data processing pipeline also permits a researcher to

rapidly introduce, remove or re-order data processing activities to make the

processed data set more amenable to analysis. Visual analysis of the final data set is

the preferred approach and the visualisation studio provides tools to facilitate big

data analytics.

While this research has demonstrated the effectiveness of the visualisation

studio using neuroscience data, the framework itself is generic and can be applied to

almost any field of research.

This research has also served to demonstrate that the researcher’s typical

desktop / laptop computer has the potential to be utilised far more effectively. A great

deal of its computational power is wasted when traditionally developed software is

used for data processing and analysis. This is most clearly demonstrated by the

visualisations produced, which now handle thousands rather than hundreds of spike

trains while remaining highly interactive.

12.1.2 The Neural Science Problem Domain Library (i-Pipeline)

The creation of the Visualisation studio’s problem domain layer demonstrates

how developers and domain experts working together can create libraries of

algorithms and visualisations. The library’s creator has been left almost completely

free to create a data representation for their problem domain. Any data analysis

algorithm in the problem domain can be “wrapped” into a Visualisation studio

process. The process wrapper ensures the simple deployment and incorporation of

the algorithm into a Visualisation studio dataflow pipeline.

Recognition of the modern trend of delivering increased computational power

through a multi-core architecture allows algorithms and visualisations to fully exploit

available compute resources. The implemented problem domain layer demonstrates

this by writing its most computationally expensive algorithm (cross-correlation) in a

way that adapts to available resources. Resources may range from a limited two

core laptop, through a 4-6 core desktop system to a full HPC compute cluster.
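
A minimal Java sketch of work that adapts to the available cores is given below. It rests on several assumptions: it uses a plain thread pool rather than MPJ, it computes a zero-lag Pearson correlation of binned spike counts rather than the thesis's cross-correlation algorithm, and the class name PairwiseCorrelationSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: pairwise zero-lag correlation of binned spike trains, spread over the available cores. */
final class PairwiseCorrelationSketch {

    /** Pearson correlation at zero lag between two equally long arrays of binned spike counts. */
    static double correlate(int[] a, int[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n;
        meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return (varA == 0 || varB == 0) ? 0.0 : cov / Math.sqrt(varA * varB);
    }

    /** Fills a symmetric correlation matrix, one task per row, on a pool sized to the machine. */
    static double[][] correlateAll(int[][] trains) {
        int n = trains.length;
        double[][] result = new double[n][n];
        // The pool size follows the hardware: 2 on a small laptop, more on a desktop.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int row = i;
                // Each task owns the pairs (row, j) with j >= row, so no two tasks write the same cell.
                pending.add(pool.submit(() -> {
                    for (int j = row; j < n; j++) {
                        double c = correlate(trains[row], trains[j]);
                        result[row][j] = c;
                        result[j][row] = c;
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any task failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while correlating", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("correlation task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] trains = { {1, 0, 1, 0}, {1, 0, 1, 0}, {0, 1, 0, 1} };
        double[][] c = correlateAll(trains);
        System.out.println(c[0][1] + " " + c[0][2]);
    }
}
```

The pairs are independent of one another, which is why the same decomposition can in principle be carried over to a message-passing setting on a cluster.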

Visualisations that apply the techniques of visual analytics have been re-

engineered to also exploit the delivery of increased compute power through multiple

cores. This has allowed visualisations previously limited to presenting hundreds of

spike trains to present thousands while remaining highly interactive. To permit visual

exploration of these large data sets Ben Shneiderman’s Visual Information-Seeking

Mantra of “Overview first, zoom and filter, then details-on-demand” has been applied.

The final result is a significant improvement in the amount of data that can be

processed and effectively visualised on the typical researcher’s computer.

12.1.3 The i-Raster Visualisation

Somerville’s i-Raster visualisation (Somerville et al., 2011) has been re-

engineered. Its many spike train sorting algorithms are now available not only within

the visualisation but as data pre-processing operations in the pipeline. Parallelisation

of the visualisation’s code has been the key to expanding its ability to present

thousands rather than hundreds of spike trains. Grouping of spike train data has

been used to provide an overview of the larger data set, and a burst sort and

grouping algorithm has been introduced. Interactive zoom and filtering tools permit the visual

examination of detail in the spike train data set. Time filtering permits the detailed

examination of a large data set and the generation of multiple (smaller) data sets.
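
Time filtering of this kind can be sketched in a few lines of Java. The representation is an assumption made only for illustration: each spike train is taken to be an array of spike times, and the class name TimeFilterSketch and the half-open window [start, end) are hypothetical rather than taken from i-Raster.

```java
import java.util.Arrays;

/** Hypothetical sketch: restrict every spike train to a time window, giving a smaller data set. */
final class TimeFilterSketch {

    /** Returns, for each spike train, only the spike times t with start <= t < end. */
    static double[][] filter(double[][] trains, double start, double end) {
        double[][] out = new double[trains.length][];
        for (int i = 0; i < trains.length; i++) {
            out[i] = Arrays.stream(trains[i])
                           .filter(t -> t >= start && t < end)
                           .toArray();
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] trains = { {0.1, 0.5, 1.2}, {0.7, 2.0} };
        System.out.println(Arrays.deepToString(filter(trains, 0.5, 1.5)));
    }
}
```

Applying the filter with several different windows yields several smaller data sets from one large recording.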

12.1.4 The i-Grid Visualisation

Stuart’s i-Grid visualisation (Stuart, Walter & Borisyuk, 2005) was originally

used to assist researchers in visually identifying clusters of inter-connected neurons

from their spike train recordings. The clustering technique of the original i-Grid has

been expanded to become the foundation of a new “overview” – the cluster

dendrogram. This overview serves to provide a means to zoom and filter i-Grid’s

display, while maintaining a complete overview of the data set.

The computationally expensive component of creating an i-Grid visualisation

is the cross-correlation process. This algorithm has been re-coded to fully exploit the

delivery of compute power through multi-core systems. In addition it has been re-

written to use Message Passing Java (MPJ) allowing it to effectively utilise high

performance compute (HPC) clusters. All of this is wrapped into a Visualisation

studio pipeline process that allows even this complex algorithm to be rapidly used in

any data processing pipeline.

Chapter 12: Evaluation and the Way Forward

12.1.5 The i-Animate Visualisation

I-Animate is a new visualisation that creates a representation of the modern

(large) multi-electrode array used to simultaneously record spike trains. An animation

of the recorded neuron spiking events over time is provided, allowing the researcher

to visually identify potentially connected neurons. Overall and time-filtered views of

the electrodes’ recorded activity levels are available through a selection of heat maps.

Available heat maps range from the classic (but visually ineffective) rainbow heat

map to Moreland’s “Diverging Colour Map for Scientific Visualization”.

12.2 Future Development

The Visualisation Studio and the implemented neuroscience problem domain

library represent a significant change in the way data is processed and visualised.

However, modern technology still provides avenues through which greater gains can

be realised. This section examines some of these avenues to advance the data

processing elements and performance of the Visualisation Studio.

12.2.1 Exploitation of high performance GPU hardware

As described in chapter 4, techniques to code applications in a manner that

makes full use of the modern computer’s multiple compute cores have lagged far

behind the hardware’s ability to deliver increased performance. This research has

insisted that the application generated should make full use of multiple compute

cores as a means of increased performance delivery. This has been taken to its

furthest extreme with the computationally intensive pairwise cross-correlation

calculations on which the i-Grid visualisation is based. The algorithm as implemented

can be run on any computer supporting an MPJ installation from the humble laptop

to the HPC cluster without modification. The algorithm will determine the available

compute cores, assign cross-correlation operations and process results completely

independently of the user. Most users will not have the benefit of access to an HPC

cluster at will. The modern graphics processing unit (GPU) has already

revolutionised the field of interactive data visualisation. Now the computing power of

the graphics card is being opened up to accelerate scientific, analytics, engineering,

consumer, and enterprise applications. This provides even a simple laptop with

access to what is effectively a mini-HPC cluster. Programmatic access to GPUs is

now available through application programming interfaces (APIs) such as:

• NVIDIA’s Compute Unified Device Architecture (CUDA),

• Khronos Group’s Open Computing Language (OpenCL), and

• the OpenMP Architecture Review Board’s Open Multi-Processing (OpenMP) API.

The Visualisation Studio is a cross-platform, Java-based application entirely

capable of using these tools either through Java bindings or by invoking native code

from Java using the Java Native Interface (JNI). This opens the possibility of using

these mini-HPC clusters, present in almost all modern computers, to perform

complex and computationally intensive data analysis in addition to their more

traditional roles in producing interactive displays. The Visualisation Studio could also

optionally implement core components to run in such an environment. The parallel

execution engine would be a prime candidate for such a conversion.

12.2.2 Distributed Computing and Cloud Computing

This research has focused on the production of an application deployed to the

researcher’s desktop. However, modern technology affords other options such as

distributed or cloud computing as a means to bring increased compute power to bear

on a problem. The Apache Hadoop framework is an open source Java application to

facilitate the distributed processing of very large data sets. Interfacing the

Visualisation Studio with the Hadoop framework would offer access to HPC scale

compute power even in the absence of an HPC or GPU option.

12.2.3 Application to other problem domains

Finally, while the field of neuroscience is used in this research as the primary

problem domain, the developed Visualisation Studio application has been designed

from the ground up as a general solution that can be applied to many different

problems. Science and nature are replete with problems that are “embarrassingly

parallel”. Such problems are well suited to the hardware and software tools (such as

the Visualisation Studio) that are now emerging. Alternative problem domains and

computationally intensive tasks which could be the subject of an implementation of

the Visualisation Studio’s problem domain layer might include:

• Financial analysis and reporting.

• Event simulation in particle physics.

• Ensemble calculations of numerical weather prediction.

• Genetic algorithms and many other evolutionary computing techniques.

• Brute-force searches in cryptography.

• Computer simulations comparing many independent scenarios, such as climate models.

• Fluid mechanics.

12.3 Conclusion

This research began by asking the question “How can Software Engineering

and Visual Analytics be applied to aid the general analysis of scientific data and

specifically current neural spike train data?” The development and testing of the

Visualisation Studio application has shown that:

• Software tools that rely on compute power delivered by a single, ever faster CPU can provide only limited interactive data visualisations.

• Software tools that embrace the delivery of compute power through multi-core hardware and a parallel programming model can offer greatly enhanced interactive data visualisations.

• Practical parallel programming models can emerge from dataflow programming’s pipeline model and the creation of a visual programming language (VPL) for the problem domain under study.

• These VPL programs can tackle complex data analysis tasks while being efficiently executed on modern multi-core computer systems.

• Interactive data visualisations can manage the resulting “big data sets” even on limited desktop hardware. This is achieved by combining Software Engineering with the techniques of Visual Analytics. In the case of this research, the amount of data visualised increased by a factor of 10, from hundreds to thousands of spike trains.

• In future, the most effective research will combine the efforts of a multi-disciplinary team to produce valuable results. At the core of these teams will be a problem domain expert and a software engineer.
