6. Technological Challenges

6.3 Data Tools Challenges

The data tools currently available to most scientific disciplines are inadequate, and better tools are essential to making scientists more productive. In particular, better computational tools are needed to visualize, analyze, and catalog the enormous research datasets now available, and so to enable data-driven research.

Scientists will need better analysis algorithms that can handle extremely large datasets, including approximate algorithms (ones with near-linear execution time). They will need parallel algorithms that can apply many processors and many disks to a problem to meet CPU-density and bandwidth-density demands. They will also need the ability to “steer” long-running computations in order to prioritize the production of data that is more likely to be of interest [28].
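As one illustration of an approximate algorithm with linear execution time, the sketch below estimates the median of a dataset in a single pass using reservoir sampling. This technique is chosen here purely as an example and is not taken from the report; the function names and the sample size `k` are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = item
    return sample


def approximate_median(stream, k=1000, seed=0):
    """Estimate the median from a fixed-size sample instead of sorting everything."""
    sample = sorted(reservoir_sample(stream, k, seed))
    if not sample:
        raise ValueError("stream is empty")
    return sample[len(sample) // 2]


if __name__ == "__main__":
    # The full range is never held in memory; only k values are sorted.
    print(approximate_median(range(1_000_001), k=1000))
```

The memory used depends only on `k`, not on the size of the dataset, and the accuracy of the estimate improves as `k` grows. That is the kind of trade-off an approximate algorithm makes.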

Scientists will need better data mining algorithms to automatically extract valid, authentic and actionable patterns, trends and knowledge from large data sets. Data mining algorithms such as automatic decision tree classifiers, data clusters, Bayesian predictions, association discovery, sequence clustering, time series, neural networks and logistic regression, when integrated directly into database engines, will increase scientists’ ability to discover interesting patterns in their observations and experiments [28].

Large observational data sets, the results of massive numerical computations, and high-dimensional theoretical work all share one need: visualization. Observational data sets such as astronomical surveys, seismic sensor output, tectonic drift data, ephemeris data, protein shapes, and so on, cannot feasibly be comprehended without exploiting the human visual system [28].

In essence, scientists need advanced tools that enable them to follow new paths, try new techniques, build new models and test them in new ways that facilitate innovative multidisciplinary/interdisciplinary activities and support the whole research cycle.

6.3.1 Data Visualization

Visual data analysis, facilitated by interactive interfaces, enables the detection and validation of expected results while also enabling unexpected discoveries in science. It allows the validation of new theoretical models, provides comparison between models and datasets, enables quantitative and qualitative querying, improves interpretation of data and facilitates decision making. Scientists can use visual data analysis systems to explore “what if” scenarios, define hypotheses, and examine data using multiple perspectives and assumptions. They can identify connections among large numbers of attributes and quantitatively assess the reliability of hypotheses. In essence, visual data analysis is an integral part of scientific discovery and is far from a solved problem. Many avenues for future research remain open [29].

Fundamental advances in visualization techniques must be made to extract meaning from large and complex datasets derived from experiments and from upcoming petascale and exascale simulation systems. Effective data analysis and visualization tools in support of predictive simulations and scientific knowledge discovery must be based on strong algorithmic and mathematical foundations and must allow scientists to reliably characterize salient features in their data. New mathematical methods in areas such as topology, high-order tensor analysis and statistics will constitute the core of feature extraction and uncertainty modelling using formal definitions of complex shapes, patterns, and space-time distributions.

New visual data analysis techniques will need to dynamically consider high-dimensional probability distributions of quantities of interest.

New approaches to visual data analysis and knowledge discovery are needed to enable researchers to gain insight into the emerging forms of scientific data. Such approaches must take into account the multi-model nature of the data and provide the means for scientists to easily transition views from global to local model data. Tools that leverage semantic information and hide details of dataset formats will be critical to enabling visualization and analysis experts to concentrate on the design of these approaches rather than becoming mired in the trivialities of particular data representations [28].

6.3.2 Massive Data Mining

Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help researchers conduct many of their research activities. Data mining tools predict future trends and behaviours, allowing researchers to make proactive, knowledge-driven decisions [30].

Data mining techniques are the result of a long process of research and product development. Data mining moves beyond retrospective data access and navigation to allow prospective and proactive information delivery. It is supported by three technologies that are now sufficiently mature: massive data collection, powerful multiprocessor computers and data mining algorithms.

The most commonly used techniques in data mining are: artificial neural networks, decision trees, genetic algorithms, nearest neighbour method and rule induction. Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with large data warehouses.
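To make one of the listed techniques concrete, the following is a minimal sketch of the nearest neighbour method: a new observation receives the majority label of the `k` closest labelled observations. The function name, the toy data and the labels are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def knn_predict(training_set, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Sort labelled points by Euclidean distance to the query and keep the k nearest.
    nearest = sorted(training_set, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature vector, label) pairs.
observations = [
    ((1.0, 1.1), "quiet"),
    ((0.9, 1.3), "quiet"),
    ((1.2, 0.8), "quiet"),
    ((4.0, 4.2), "active"),
    ((4.3, 3.9), "active"),
    ((3.8, 4.1), "active"),
]

if __name__ == "__main__":
    print(knn_predict(observations, (1.0, 1.0)))
```

This brute-force version compares the query against every stored point, which is workable for the relatively small volumes of data mentioned above. Warehouse-scale data would typically call for an index or an approximate search instead.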

Researchers need timely and sophisticated analysis of an integrated view of data stored in huge warehouses. However, there is a growing gap between increasingly powerful storage and retrieval systems and users’ ability to effectively analyse and act on the information those systems contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute-force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap.

Massive data mining: In many scientific disciplines, data now arrives faster than we are able to mine it. To avoid wasting this data, we must switch from the traditional “one-shot” data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive.
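The contrast with “one-shot” processing can be sketched with a small incremental summary that is updated as each value arrives and never revisits earlier data. The example uses Welford's online algorithm for mean and variance, chosen here only for illustration; the class name `RunningStats` is hypothetical, and a real stream miner would maintain far richer models.

```python
class RunningStats:
    """Running mean and variance, updated one value at a time (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one newly arrived value into the summary in constant time."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of all values seen so far."""
        return self._m2 / self.count if self.count else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an open-ended stream
        stats.update(reading)
    print(stats.count, stats.mean, stats.variance)
```

Because each update uses constant time and memory, the summary is always current, however long the stream runs.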

In document GRDI2020 Final Roadmap Report (Page 31-34)