Computer-Driven Data Analysis - A Quantitative Investigation into the Design Trade-offs in Deci

This section reviews work on the interactive data analysis process. Theories, definitions, and requirements specifications from visual analytics are presented.

2.2.1 Data, Information, Knowledge, and Wisdom

The abstract idea of data and its representation and analysis has been extensively studied in information theory, database management, scientific computing, perception, and many other fields which are not strictly concerned with computation, most notably mathematics and statistics. Perhaps the most useful modern definition of data falls into Russell Ackoff’s Data/Information/Knowledge/Understanding or Wisdom (DIKW) framework [9], which has been widely adapted in many fields, including information visualization [40] and artificial intelligence [41]. Despite this, the definitions of the knowledge and understanding/wisdom portions of the DIKW framework remain nebulous and may differ by field of study. The more straightforward concepts, data and information, have more agreement in definition - most sources will at least agree that data is not information. More specifically, data is often defined as simply a raw number or symbol with no significance attached (e.g. the binary string 01101110 or the tuple [32,5]), and data which has been given meaning by way of a model or human interpretation has become information. For instance, the familiar relational model specifies data as a series of rows where each column and row has some significance that gives the data structure and meaning. Definitions beyond these two simple concepts diverge.
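As a minimal sketch of the data/information distinction above, the following Python snippet takes the tuple [32,5] from the text and attaches a schema in the spirit of the relational model. The `Row` type and the column names `age` and `visits` are hypothetical, chosen only for illustration; the source does not say what the two values mean.

```python
from typing import NamedTuple

# Data: a raw tuple with no significance attached.
raw = (32, 5)


class Row(NamedTuple):
    """Hypothetical relational schema; the column names are assumptions."""
    age: int     # assumed meaning of the first value
    visits: int  # assumed meaning of the second value


# Information: the same values, now interpreted through a model (the schema).
row = Row(*raw)


def describe(r: Row) -> str:
    """Render a row using the meaning that the schema assigns to each column."""
    return f"age={r.age}, visits={r.visits}"
```

The values themselves never change; only the schema decides whether 32 is read as an age, a temperature, or something else entirely.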

Bellinger has defined knowledge as the process whereby information is amassed or accumulated, synonymous with the idea of memorization, and understanding has been defined as the knowledge of rules that can explain the ’why’ of data and information (for instance, the function that generates a series of random values). Knowledge is distinct from understanding in the sense that knowledge has no appreciation of ’why’ or ’how’, and Bellinger asserts that understanding is the essential catalyst by which an analyst can move up through the hierarchy from data to knowledge. Finally, Bellinger defines wisdom to be on a higher order than knowledge, and argues that we must move beyond understanding patterns to understanding principles to achieve this level of data comprehension. Bellinger asserts that it is not possible for a machine to obtain wisdom, implying that the human is an indispensable part of the data analysis process.

Chen also created a useful dichotomy of the DIKW framework for analysis, specifically visualization [40]. Chen shares the perceptual definitions of data and information, but is more specific about these definitions in the computational space. Chen defines computational information as data that represents the results of a computation, such as a statistical analysis, that assigns meaning to the data. Knowledge is thus data that represents the result of a computer-simulated cognitive process, such as the rules formulated by a decision tree or the deductive reasoning applied by intelligent systems and case-based reasoning. Chen does not explicitly bring wisdom into computational space, stating that knowledge is sufficient to capture other high levels of understanding as far as computation is concerned. In Chen’s framework, data, information, and knowledge can serve as both input and output to a computational system, and the analysis ends when a sufficient amount of knowledge has been amassed in the user. In other words, the computational tool assists in the process of transferring information or knowledge in the computational space to the user’s perceptual space.
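The three computational levels in Chen’s framework can be illustrated with a small sketch. Everything below is an assumption made for illustration: the observations are invented, the function names `summarize` and `learn_rule` are hypothetical, and the single-threshold rule is only the simplest stand-in for the decision-tree rules mentioned above, not a method taken from [40].

```python
from statistics import mean

# Data: raw (value, label) observations; the values are made up for illustration.
data = [
    (1.0, "low"), (2.0, "low"), (3.0, "low"),
    (7.0, "high"), (8.0, "high"), (9.0, "high"),
]


def summarize(observations):
    """Information: the result of a computation (here, the mean value per label)."""
    groups = {}
    for value, label in observations:
        groups.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in groups.items()}


def learn_rule(info):
    """Knowledge: a one-split rule derived from the information.

    The threshold is the midpoint between the two group means, an assumption
    of this sketch rather than a procedure described in the text.
    """
    threshold = (info["low"] + info["high"]) / 2
    return lambda value: "high" if value > threshold else "low"


info = summarize(data)   # information computed from data
rule = learn_rule(info)  # knowledge computed from information
```

Each level serves as input to the next, which mirrors the point that data, information, and knowledge can act as both input and output of a computational system.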

2.2.2 Insight and Exploratory Data Analysis

Insight-based evaluations of visualization and data analytics systems have recently appeared in visualization research. While not all authors agree on an exact definition, Saraiya et al. write that insight is an individual observation about the data by the participant, or “a unit of discovery” [42]. This definition was leveraged in [7]. Unfortunately, this definition is of limited use in practice. North et al. offer a compelling characterization of insight [6]: insight is complex, deep, qualitative, unexpected, and relevant. To elaborate:

• Complex: Insight involves all or most of the data and is not concerned with individual values. This means that insights that involve more of the observed data are more meaningful.

• Deep: Insight builds up over time, and some insights are dependent on others. This means that multiple passes over the data might be necessary to generate a complete understanding.

• Qualitative: Insight is not exact, and can be subjective or uncertain. This means that some insights might only be able to be captured by text descriptions or probabilistic models.

• Unexpected: Insight is unpredictable, serendipitous, and creative. This implies that analysis systems need to be designed to support exploratory analysis, rather than fixed pipelines. It also implies that automated algorithms that ignore domain semantics to find patterns can significantly contribute to this process, since their data search is not biased by prior theory.

• Relevant: Insight is deeply rooted in the data domain, meaning that generalized analysis of the raw variables is not enough to generate an insight. This implies that patterns discovered by automated approaches must be related back to the theory of the source domain before they can become useful.

Furthermore, insight has been contextualized in a three-stage cyclical framework of hypothesis, exploration, and insight. This is one proposed model of exploratory data analysis (EDA). Supporting and developing more exploratory systems that trigger insight has become an important end-goal of visualization tools, and the necessity of measuring such a quantity has been emphasized [43]. This has caused some shift in the design of visualization systems from rather straightforward tools that create visuals to exploratory toolboxes with a large suite of built-in analyses and programmability [44]. North’s characterization has gained some traction among researchers [45][46]. The theory, while primarily for evaluating the effectiveness of visualization systems, can be reasonably applied to any human-machine analytic system as a metric for success. Thus questionnaires that attempt to measure insight are dependent on their ability to represent North’s five characteristics. Plaisant later used these guidelines for a visualization contest where open-ended search goals had participants submitting subjective evaluations of data [47].

In the visual analytics community, EDA has been defined as the extraction of meaningful insights from large, noisy sets of data [48], [49]. The primary approach is to use hybrid data mining and visualization systems and draw on the flexibility, creativity, and background knowledge of human analysts to improve the knowledge base from which decisions are made. Creating flexible and scalable systems that employ complex models yet remain usable to domain experts is still an open challenge.

The most recent model of exploratory data analysis is found in [7]. This work describes a high-level model for the visual analytics process. Sacha notes that computers lack the creativity that allows human analysts to create unexpected connections between data and the problem domain, but that humans are not able to deal efficiently and effectively with large amounts of information. For this reason, “models” need to be employed, which can be as simple as descriptive statistics or as complex as a data mining algorithm. Sacha also differentiates insight from knowledge, since weak evidence might lead to an insight that still needs to be validated to become knowledge. Sacha then provides a looping model of interactive data analysis, where users choose between various system actions (such as model usage, visualization interaction, etc.) based on their current internal state, which can be “exploration,” “verification,” or “knowledge generation.” The authors compare their model to others that have been developed previously (e.g. Green’s human cognition model [50]), and note that the analysis of real-world problems requires expertise in both the analysis and the domain, and thus domain experts and analytics experts will need to continue to collaborate.

2.2.3 Evaluation of Interactive Interfaces

The visual analytics community has begun to favor open-ended protocols over benchmark tasks for the evaluation of interactive interfaces [25][6][24]. Researchers recommend that participants be allowed to explore the data in any way they choose and generate as many insights as possible, with their insight then measured through a think-aloud protocol or qualitative measures, such as quantity estimation or distribution characterization. This contrasts starkly with typically well-defined benchmark tasks, which usually ask users to find minimum or maximum values, find an item that meets a specific criterion, and so on. North [6] cautions that most benchmark tasks may only evaluate an interface or visualization along a very narrow axis of functionality. North seems to be striving towards measuring a latent variable (insight) with a battery of questions that could be understood as indicator variables, but statistical theory such as structural equation modeling [51] has not been applied in that work.
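To make the latent-variable reading concrete, a standard reflective measurement model from structural equation modeling would treat insight as a single latent variable measured by a battery of p questions. The notation below is the conventional one and is an assumption of this sketch; it does not appear in North’s work or in the source text.

```latex
x_i = \lambda_i \, \xi + \delta_i, \qquad i = 1, \dots, p
```

Here \xi is the latent insight score, x_i is the observed response to question i, \lambda_i is the loading of that question on insight, and \delta_i is its measurement error. Under this reading, a questionnaire measures insight well only to the extent that its items load on \xi rather than on unrelated factors.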

2.2.4 Joining Information Visualization and DSS

At its core, visual analytics aims at employing more intelligent methods in the analysis process [49]. Keim writes that for informed decisions, it is indispensable to include humans in the data analysis process to combine flexibility, creativity, and domain knowledge with the computational power of modern computers. Complex computational capabilities should augment the discovery process, but the ultimate goal is to gain insight into the dataset from the human perspective. Keim goes on to identify several challenges in visual data analytics, one of which is the creation of visual analytics methods for the field of problem solving and decision science. Decision-support systems already exist to reproduce expert knowledge; results from experimentation with these kinds of systems will be discussed in the section on expert systems below. Another problem that was identified is user acceptability: users are very resistant to changing their working routines, so new automated methods for extracting information from complex data sets need to communicate their goals and abilities more clearly to users.

In document A Quantitative Investigation into the Design Trade-offs in Decision Support Systems (Page 31-37)