DATA VISUALIZATION – THINKING WITH YOUR EYES

Kavyashree H V

Software Senior Engineer Dell EMC

[email protected]

Ramachandra Thejasvi

Software Principal Engineer

Dell EMC

[email protected]

Shwetabh Srijan

Associate Software Principal Engineer Dell EMC

[email protected]


2017 Dell EMC Proven Professional Knowledge Sharing 2

What Is Data Visualization?

Why Data Visualization?
ROLE OF VISION IN DATA VISUALIZATION
ROLE OF MEMORY IN DATA VISUALIZATION
VISUAL BATTLE – TABLES VS. GRAPHS
Tables
Graphs

How to Create Visualization
PRE-ATTENTIVE ATTRIBUTES AND THEIR ROLE IN VISUALIZATION
ANALYTICAL INTERACTION WITH THE VISUALIZATION
FINDING PATTERNS
VARIOUS STAGES IN DATA VISUALIZATION PROCESS
THINGS TO AVOID IN DATA VISUALIZATION

Types of Graphs – What to Choose, When
TYPES OF DATA
Categorical/Qualitative
Numeric/Quantitative
VARIOUS METHODS OF ANALYZING DATA WITH VISUALIZATIONS
Time Series Analysis
Single-Series
Multi-Series
Pie Chart
Treemap
Distribution analysis
Histogram
Frequency polygons
Box Plot
Strip Plots
Stem and leaf plots
Correlation Analysis
Scatter Plot
Scatterplot matrix
Table lens
Multivariate Analysis
Heatmap
TYPES OF CHARTS – SUMMARIZED

Data Visualization Tools
D3.JS
CHART.JS
GOOGLE CHARTS
HIGHCHARTS
TABLEAU
MICROSOFT POWER BI
TIBCO

Role of Visualization in Business Growth

From Theory to Practice
IMPACT ON BIG DATA
BIG DATA CHALLENGES
Internet of Things
Network Theory
Complexity Theory
Multidimensional Visualization

Conclusion

References
RESEARCH MATERIALS AND LEARNING RESOURCES
Books
Whitepapers
TOOLS USED FOR GENERATING THE CHARTS IN THIS DOCUMENT

Disclaimer: The views, processes or methodologies published in this article are those of the authors. They do not necessarily reflect Dell EMC’s views, processes or methodologies.


What Is Data Visualization?

The term ‘visualization’ applies to the visual representation of data and information, broadly called ‘data visualization’. The process includes exploring data, making sense of it, and presenting it in a form that users can easily interpret, ultimately leading them to make the right decisions.

The origin of data visualization dates back to the 17th century, with the invention of two-dimensional graphs. Until then, the table, a simple arrangement of data in columns and rows, was the way of representing data. The power of visualization as a means of exploring and understanding data was highlighted by John Tukey in 1977. Since then, many researchers and scientists have contributed to making data visualization what it is today.

Why Data Visualization?

It is hard to grasp something immediately if it is described only in words. Our brain takes some time to process text, whereas it can process pictures or images extremely efficiently. Also, a visual representation of data communicates a significant amount of information compared to text. These two ideas are best explained by describing how human vision and the brain perceive information.

Role of Vision in Data Visualization

Vision is the single most important mode of communicating information. It is intimately connected with cognition (the process of understanding something through the senses) and collaborates with the brain to make sense of what is seen.

Data visualization is about external cognition, which means how resources outside the brain can be utilized to improve the cognitive capabilities of the brain.

What our eyes perceive as an object is built up in our brain as a composition of several visual properties, which are the building blocks of vision. Even though we perceive an object as a whole, we can still distinguish the properties composing it, since vision is specially tuned to sense them. Some examples of these properties are length, location, width, area, shape, color, and orientation. By using these easily perceived attributes to represent data visually, we can shift much of the work required to make sense of data to the visual system. Data visualization works on the same concept. Visual displays like graphs help us understand values at once by combining values into patterns (e.g., a line) that can be perceived as a whole (by perceiving line length) rather than reading individual values one at a time (the way we read tables of text).

Choosing the best visualization for representing the data requires an understanding of perception. There are a few principles of perception to consider before designing a visualization:

• Visual perception does not measure absolute values but instead registers differences between values. E.g., we perceive the difference between two lines as a ratio (percentage) rather than as an absolute value; we perceive a color not in absolute terms but as a difference between it and the colors surrounding it.

• Our ability to perceive expressions of an attribute as distinct decreases to the extent that distractions clutter our field of vision.

Colin Ware, an expert in the field of visual perception, says:

“The eye and the visual cortex of the brain form a massively parallel processor that provides the highest-bandwidth channel into human cognitive centers. At higher levels of processing, perception and cognition are closely interrelated. If we can understand how perception works, our knowledge can be translated into rules for displaying information. Following perception-based rules, we can present our data in such a way that the important and informative patterns stand out.”

Role of Memory in Data Visualization

Two types of memory come into the picture when processing visual information:

1. Long-term memory, where information is stored permanently (until brain cells die)

2. Working memory, where information is stored briefly while we are thinking about it

Working memory consists of different stores for various data types. Visual working memory is what is used for data visualization, and it is limited to about three chunks (storage units) of information at a time. How much data makes up a chunk depends on how we represent it. If represented as a table of numbers, each number could be stored as a separate chunk, whereas if represented as a line graph, each line (the pattern formed by an entire line) can be a single chunk. This is the biggest advantage of using visualization for data analysis: a huge amount of information is chunked together in visual images exhibiting meaningful patterns, compared to a table of numbers.

Thus, effective information visualization is built on an understanding of how we see and think.

Visual Battle – Tables vs. Graphs

Who wins the data visualization battle between table and graph? Answer: It depends.

Both play a vital part in business interaction, mainly because they are the primary ways of communicating quantifiable information.


Tables

With tables, we mostly interact with the verbal system. We normally scan the rows and columns of a table to compare related values. Tables suit a diverse audience in which each person wants to look up their own piece of data. The tabular format works best when the data is used to look up or compare individual values, or when it involves various units of measurement.

As an example, here is a monthly survey of the real estate sales of a major city:

Market Matrix                   Current Month   % Change   Previous Month   % Change
Average Sales Price             $1,439,909      5.1%       $1,369,486       17.6%
Average Price per Square Foot   $1,180          3.1%       $1,144           18.2%
Median Sales Price              $850,000        -1.7%      $864,397         6.4%
Number of Sales                 2,518           -28%       3,499            3.2%

Another example is the oxygen requirement for species of Streptomyces.

Organism               Optimal Growth Condition
Streptomyces hardaur   Aerobic
S. huesis              Anaerobic
S. rainiour            Anaerobic

In the above example, both the independent and dependent variables are qualitative. As a result, there is no appropriate chart format for this data.

Graphs

Unlike tables, graphs are images that engage the human visual system and show variations in data. A large amount of data can be represented quickly and in an easy-to-consume manner. Graphs work best when the presentation is meant to convey a message contained in the shape of the data or to reveal relationships among the values.

Here is sample data showing the variation of a company’s share value over a month.

All forms of tables and graphs allow the audience to consume and analyze data. When debating table vs. graph, the key is to analyze the data and choose one of the two based on how the data will be used and which one the audience will find easier to understand.

How to Create Visualization

Ultimately, the desired result of data visualization is to guide the audience toward the right decisions. This involves choosing the right graph and giving the user appropriate means of interacting with it for a better understanding of the information represented. As mentioned earlier, understanding perception is crucial in the process of creating visualizations. The perceptual building blocks of information visualization consist of objects and properties, such as pre-attentive attributes (explained in the sections below), that can visually represent quantitative data. The process of interacting with a visualization by making comparisons and examining quantitative relationships is called ‘quantitative reasoning’. These quantitative relationships are displayed as visual patterns, trends, and exceptions in the visualization, leading the user to understand the information and make the right decisions. These terms are explained in detail in the sections below.

Pre-attentive attributes and their role in Visualization

For any information, there are effective ways to encode its meaning visually. This can be achieved by using certain basic visual attributes that are perceived pre-attentively, that is, without the need for conscious awareness. These basic building blocks of the visualization process are called ‘pre-attentive attributes’. Here is a list of pre-attentive attributes:


Color (Intensity and Hue)

Spatial Position

These attributes are significant when visualizing any data. Most quantitative data analysis can be performed well with graphs using only four types of objects:

• Points: 2D positions of simple objects like dots, squares, etc.

• Lines: connections of points to represent a series of values

• Bars: heights and lengths of rectangular bars

• Boxes: similar to bars, but they also display other meaningful points such as medians

Of the pre-attentive attributes, position and length can be used to perceive quantitative data with precision. Other attributes such as width, size, and intensity are less precise and are useful for perceiving categorical or relational data. Other graphs such as treemaps and heatmaps, which use the pre-attentive attributes of color and space, are used to encode values when more accurate visualizations cannot be used or are not necessary.

If these attributes are used interchangeably, a visualization may not serve the purpose it was designed for. Thus, considering pre-attentive attributes helps when deciding which chart type to use for our data. Building an effective visualization means presenting data in ways that let us see what is meaningful and that augment our cognitive abilities so we can make sense of what is perceived.


Analytical Interaction with the visualization

A visualization, no matter how rich and elegant, is not useful unless we can interact with the data. The effectiveness of information depends on two things:

1. Representing information clearly and accurately

2. Providing the ability to interact with it to understand what the information means

Interaction helps to discover relationships hidden in the wealth of data. Finding these relationships is crucial for making the right decisions.

Several ways of interacting with data and how visualization should support interaction are described below:

Comparing

• Support all commonly needed comparisons, e.g. accurate values, relative comparisons, and so on.

• Support easy comparison of values and relevant patterns without distraction.
   o e.g., relative comparisons in bar graphs are better represented when the base of the bars begins at zero rather than at a different value.

• Display the entire spectrum of information to be compared on the screen at the same time (avoid switching between screens).
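To see why a zero baseline matters for relative comparison, the short sketch below computes how long one bar is drawn relative to another when the axis starts at a given value. The function name and the numbers are illustrative assumptions, not data from this article.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values a and b
    when the value axis starts at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical values: 105 is only 5% larger than 100.
print(apparent_ratio(105, 100))               # 1.05 with a zero baseline
print(apparent_ratio(105, 100, baseline=95))  # 2.0: the bar is drawn twice as long
```

With the axis starting at 95, a 5% difference is drawn as a 100% difference in bar length, which is the distortion the guideline above is meant to avoid.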

Sorting

• Provide means to sort items based on various values.

• Provide easy ways to re-sort data in a different order (i.e. with just a click of a button).

• Link and sort the data in multiple graphs if they share a common categorical value.

Adding Variables

It is not always practical to know in advance every element required to analyze a data set. Thus, a visualization should provide convenient means to add the variables needed for analysis and remove those that are not, and this should be as simple as grabbing a variable and placing or removing it.

Filtering

• Provide easy filtering on any connected data, irrespective of whether it is currently displayed or not.

• Provide quick filtering by using controls such as checkboxes, radio buttons, sliders, etc.

• Provide ways to define complex filter logic by combining different conditions.

Highlighting

Highlighting is to allow focusing on a subset of data while still keeping it in the context of the whole. Graphs (Visualization) should provide an easy means to highlight a subset of data by directly selecting it on a chart, e.g. using the mouse to draw a rectangle around the items to be highlighted.

Aggregating

Provide a quick way to aggregate values, such as summing, counting, or calculating averages.

Re-expressing

Provide an easy way to re-express values by switching the unit of measure.

Re-visualizing

• Provide an easy switch from one graph type to another.

• Limit the list of selectable graph types to those appropriate for the data represented.

Zooming and Panning

Provide zooming and panning with just a mouse click.

Re-scaling

Provide means to change the base and the quantitative scale quickly.

Accessing details on demand

Display extra information to the user on demand, e.g. a tooltip that appears when hovering over a point in the graph.

Annotating

Provide a means of writing and saving notes on the chart that are repositioned automatically along with the graph.

Bookmarking

Provide an easy means of saving the current analysis state for future reference and also maintain the steps history to review later.

Finding Patterns

Certain graphs display certain patterns and trends that can be readily identified. Such graphs help in making sense of the data to take the right decisions. Understanding them can assist in choosing the right graph for the data to be presented. Below are the charts and the patterns they display:

Chart                   Patterns
Bar                     High, low, and in between
Line                    Going up, going down, steep, gradual, steady, fluctuating
Points (scatter plot)   Clusters, gaps, tightly/loosely distributed


Various stages in Data Visualization Process

The whole data visualization process can be divided into six broad stages as shown below:

Information visualization process diagram, illustrating the steps necessary to generate a graph (Reference – “Applied Security Visualization” by Raffael Marty)

1. Define the problem. Identify what the user is interested in. Determine the questions to be answered by the graph.

2. Assess available data. Identify what data is available.

3. Process information. Parse and filter to extract the necessary information.

4. Visual transformation. Identify the graph properties needed. Determine how pre-attentive attributes (color, size, and shape) can be used in the graph.

5. Interaction/View transformation. The generated graph can be viewed in different ways. Provide ways to scale, translate, zoom, or clip the graph to focus on the important parts.

6. Interpret and decide. Determine whether the graph answers the initial problem and addresses all the objectives.

Things to avoid in Data Visualization

Besides knowing the right things to do, it is equally important to understand what is not suited for visualization, so that the user can focus on what is critical. Data must be presented in ways that make what is interesting and meaningful stand out from what is not. Inappropriate use of graphs and poorly designed displays, such as noisy fill patterns or saturated colors, leave the visualization as merely a colorful voice that is often heard but not understood.

Below are some points to keep in mind while designing a visualization:

• Reduce non-data ink, i.e. ink used to paint non-data elements in the graph (equivalently, increase the data-ink ratio). Three-dimensional bars and background images are examples of non-data ink. They add nothing to make a graph more legible and do not help communicate information more clearly.

• Never show graphs without legends, axis labels, or units on the scale. Such graphs are not very useful.

• If multiple data dimensions are to be displayed on the same graph, it is ideal not to exceed five distinct attributes to encode them. Though the human visual system can identify many different attributes, short-term memory cannot retain more than about eight of them in a simple image.


Types of Graphs – What to Choose, When

There are various types of graphs available to visualize data. The choice of graph depends on the type of data being displayed, and the kind of analysis a user needs to do on the data.

Types of Data

Data can be broadly classified into two categories - Qualitative and Quantitative.

Categorical/Qualitative

Nominal

When two values differ only in name (=, !=), e.g. usernames, area codes, etc.

Ordinal

When values can be ordered (=, !=, <, >), e.g. risk (low, medium, high), hardness (good, better, best), grades, street numbers

Numeric/Quantitative

Interval

When differences between values are meaningful (=, !=, <, >, +, -), e.g. months, dates, temperature (in Celsius), latitude/longitude

Ratio

When the data has a true zero point, so both differences and ratios are meaningful (=, !=, <, >, +, -, *, /), e.g. monetary quantities, counts, age, length
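The difference between interval and ratio data can be checked with a small calculation. The sketch below uses temperature as the example: differences in Celsius survive a change to Kelvin, but ratios do not, because zero degrees Celsius is not a true zero. The helper name and the sample temperatures are illustrative assumptions.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin (a scale with a true zero)."""
    return c + 273.15


a_c, b_c = 10.0, 20.0  # hypothetical readings in Celsius

# Differences are meaningful on an interval scale: same gap in both units.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios are not: "20 C is twice 10 C" disappears on the Kelvin scale.
ratio_c = b_c / a_c
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(f"difference: {diff_c:.1f} C vs {diff_k:.1f} K")
print(f"ratio: {ratio_c:.3f} in Celsius vs {ratio_k:.3f} in Kelvin")
```

This is why a bar chart, which invites ratio comparisons of bar lengths, is better suited to ratio data such as counts or money than to interval data such as Celsius temperatures.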

Various methods of analyzing data with visualizations

Time Series Analysis

A time series is a set of observations collected over a series of equally spaced time intervals. The analysis begins with a plot of the data points in the series, in which each point represents the value at a particular time; the plot indicates how each point relates to earlier values in the series. This is useful for exploring data, analyzing trends, and forecasting. Time series plots are typically drawn as line graphs or area graphs, although some cases call for column graphs as well.

Some examples where time series is used are to analyze stock prices, sales volumes, interest rates, quality measurements, etc.

Time series analysis can be broadly classified into two categories: Single-Series and Multi-Series.

Single-Series

Single time series is when one type of data is collected over time. It consists of a single line connecting all the collected values (data points).


Chart 1 - Single series

The picture above shows a single time series represented by a line chart. This example shows the variation of the Net Asset Value (NAV) of the mutual fund of a hypothetical company. The NAV is collected at the end of each day and plotted as a point in the chart, and all these points are then joined with a single line. However, this helps us analyze the NAV trend of just one mutual fund. Sometimes it is helpful to compare two or more mutual funds. For such cases, multi time-series representations are more suitable.

Multi-Series

A multi time series is when two or more types of data of the same category are collected over time. The time instances of one type of data may or may not coincide with those of the other. This type of time series helps in comparing different data of the same category. For example, the trends of two or more stocks can be compared and analyzed.


Chart 2 – Multi series

The example above shows a representation of multi time-series. There are 3 different mutual funds shown with NAV collected for each of them per day. This helps the audience easily grasp the performance of one over the rest without even thinking too much about the numerical values of the NAV.

Time-series Smoothing

There are many instances in which time series may have a lot of irregular components or noise. Sometimes, even when the time series is stable on a long term consideration, there may be high-frequency variations which may hinder the readability of the chart.

The chart above shows the atmospheric pollen level measured at different times of the day. The pollen level is not at all uniform; there are high-frequency variations in the series. To overcome the noise, various smoothing methods may be applied to average out the irregular components of the time series. The most common smoothing methods are:

• Moving averages

• Weighted moving averages

• Centered moving averages

• Exponential smoothing

The above chart shows an example of atmospheric pollen level measurements taken at each hour of the day, with exponential smoothing applied to the time series to generate a second, smoothed line that is easier to read and analyze (the smoothed line is shown in red).
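The smoothing methods listed above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the chart in this article; the function names, the sample readings, the weights, and the smoothing factor `alpha` are all assumptions chosen for the example. A centered moving average uses the same averages as the simple moving average, only aligned to the middle of each window instead of its end.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def weighted_moving_average(values, weights):
    """Weighted moving average; weights[-1] applies to the newest value
    in each window."""
    n, total = len(weights), sum(weights)
    return [sum(v * w for v, w in zip(values[i:i + n], weights)) / total
            for i in range(len(values) - n + 1)]


def exponential_smoothing(values, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1],
    with 0 < alpha <= 1. A smaller alpha gives a smoother line."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


readings = [4, 8, 6, 10, 8]  # hypothetical hourly measurements
print(moving_average(readings, 3))           # [6.0, 8.0, 8.0]
print(weighted_moving_average(readings, [1, 2, 3]))
print(exponential_smoothing(readings, 0.5))  # [4, 6.0, 6.0, 8.0, 8.0]
```

Note that the moving averages return fewer points than the input, while exponential smoothing returns one smoothed value per observation, which makes it convenient for overlaying a second line on the original series.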

Forecasting

A major application of time series analysis, forecasting (sometimes called projection) is the mathematical prediction of future behavior of the time series based on past/historical data. A typical example where forecasting is used is projecting stock performance.

The pale blue stroke in the above chart is plotted using values forecasted from the historical data already available in the system. This projection has been calculated using the linear regression method.

Forecasting has proven to be extremely helpful in analyzing the behavior of any stable system. However, there are a few important assumptions to be aware of when forecasting.

1. A forecasted time series is a prediction and hence is never perfect. The aim should not be to be perfect but to be less inaccurate.

2. Forecasting assumes that the time series to date is reasonably stable and that the future tends to be like the past.

3. The longer the forecast horizon, the less accurate the forecast gets.
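A linear regression projection of the kind mentioned above can be sketched as follows. This is a minimal ordinary least-squares fit of a straight line to equally spaced observations, extended past the last observation; it is an illustration under that assumption, not the exact calculation behind the chart. The function names and the sample history are hypothetical.

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys observed at t = 0, 1, 2, ...
    Requires at least two observations."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def forecast(ys, steps):
    """Project the fitted line `steps` periods beyond the last observation."""
    slope, intercept = fit_line(ys)
    n = len(ys)
    return [intercept + slope * (n + k) for k in range(steps)]


history = [10.0, 12.0, 14.0, 16.0]  # hypothetical daily values
print(forecast(history, 2))         # [18.0, 20.0]
```

The sketch makes the assumptions listed above concrete: the projection simply continues the past trend, so it is only as good as the stability of the series, and errors grow as `steps` increases.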

Part-to-whole and Ranking analysis

Many frequently performed types of analysis involve comparing parts of a whole and ranking them by value. For example, when trying to make sense of the total expenses incurred towards something (the whole), we usually aggregate them by type of expense (the part) to see how much each type adds to the total. Part-to-whole relationships are an important component of several different visualization types.

Mathematically, this relationship is equivalent to percentages. The most straightforward (though not the only) type of visualization for representing a part-to-whole relationship is the pie chart.
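The part-to-whole and ranking calculations described above amount to a percentage and a sort. The sketch below shows both on a small expense breakdown; the category names, amounts, and function names are illustrative assumptions.

```python
def shares(parts):
    """Map each category to its percentage of the total."""
    total = sum(parts.values())
    return {name: 100 * value / total for name, value in parts.items()}


def rank(parts):
    """Category names ordered from the largest value to the smallest."""
    return sorted(parts, key=parts.get, reverse=True)


expenses = {"food": 300, "rent": 500, "travel": 200}  # hypothetical amounts
print(shares(expenses))  # {'food': 30.0, 'rent': 50.0, 'travel': 20.0}
print(rank(expenses))    # ['rent', 'food', 'travel']
```

The percentages are what a part-to-whole chart encodes, and the ranked order is what a sorted chart makes visible at a glance.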

Pie Chart

A pie chart has a full circle that represents the whole (100%), and this circle is divided into slices, each proportional to the part it represents in comparison to the whole. This is easy to understand, which has made the pie chart very popular.

Chart 4 – Pie Chart

Shown above is a pie chart representing sales volumes of different types of businesses. Each slice indicates the proportion of each part in comparison with the other parts or with the whole. Pie charts are best used to compare values of a dimension as proportions or percentages of a whole. The data should be categorical. Pie charts may sometimes be used for continuous values, but only when the number of values to be represented is relatively small. As the number of slices increases, the chart becomes increasingly unreadable.

Visual representation of data with a pie chart can remove quantitative information. Percentages do not show the total; they only show the proportion of the parts to the whole. But sometimes there are data sets which have a hierarchy of part-to-whole relationships and these demand quantitative data to be visualized. These may be represented using Treemap charts.

Treemap

A treemap chart shows a hierarchical representation of data and makes it simple to spot patterns. In a treemap chart, the branches are represented by rectangles, and each sub-branch is shown as a smaller rectangle. Treemap charts are best suited for comparing proportions within the hierarchy.

Chart 5 – Treemap with hierarchical data (all data hypothetical)

This chart represents sales volumes of different categories of automobiles. Notice how easily the hierarchical data can be visualized without too much clutter. Cars are the most sold, and ATVs are the least sold. Among cars, Honda has the highest market share whereas, among motorbikes, Suzuki leads the chart. This type of visualization can be a challenge to represent just by using pie charts.
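The areas a treemap assigns to each rectangle follow directly from the hierarchy: every branch gets a share of its parent's area in proportion to its subtotal. The sketch below computes these area fractions for a small nested dictionary. It covers only the proportions, not the placement of rectangles, which requires a separate tiling algorithm. The data structure, names, and figures are hypothetical and are not the values behind Chart 5.

```python
def subtree_total(node):
    """A node is either a number (leaf) or a dict of named child nodes."""
    if isinstance(node, dict):
        return sum(subtree_total(child) for child in node.values())
    return node


def area_fractions(node, fraction=1.0, path=()):
    """Yield (path, fraction of the whole area) for every leaf."""
    if not isinstance(node, dict):
        yield path, fraction
        return
    total = subtree_total(node)
    for name, child in node.items():
        child_fraction = fraction * subtree_total(child) / total
        yield from area_fractions(child, child_fraction, path + (name,))


# Hypothetical unit sales by vehicle category and brand.
sales = {
    "cars": {"brand_a": 30, "brand_b": 20},
    "motorbikes": {"brand_c": 40, "brand_d": 10},
}
for path, fraction in area_fractions(sales):
    print(" / ".join(path), f"{fraction:.0%}")
```

Because each leaf keeps its full path, the same output supports comparisons both within a branch and against the whole.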

Pie charts are so commonly used in many industries that they are sometimes applied where they are not suitable. There are instances where pie charts are used even when there is no part-to-whole relationship. One has to take care not to overuse this visualization. There are also times when the user needs to rank the items in a dataset. Consider the case below of a pie chart that represents the population of different states.


Chart 6 – Pie chart with many slices

Imagine how much time a user would take if required to sort these states by order of their population. The user would rely on estimations, which unfortunately are almost always prone to error.

Therefore, the downside of this visualization is that it is readable only when there is a significant difference between the parts of the whole. This is because such charts force us to compare either the 2D areas formed by each section or, in the case of a pie chart, the angles formed where the slices meet in the center. Visual perception does not handle these comparisons easily or accurately.

In these cases, it makes more sense to use Bar Charts. Look at the below representation of the same data using Bar Chart.

Chart 7 – Bar chart

Since bar charts represent values using rectangles of different lengths but the same width, the user can compare them visually much more quickly than areas.


Distribution analysis

A distribution graph represents the range or distribution of a data set rather than just its average (mean). It is not always enough to calculate the average of the data. The range or distribution of values in a data set gives more insight into the important stories in the data, which could otherwise be missed.

For example, the sales statistics distribution shown over the week gives more insight compared to the average sales per week of a supermarket. By looking into the range of sales, it is easy to identify which days of the week record the highest and the lowest sales and further analyze what contributes to the trend.

Histogram

If bars are used to visualize a distribution, they form a histogram. The bars in a histogram touch each other to show the connection between sequential intervals, thereby displaying the overall shape of the distribution. This graph is used when the overall distribution trend is to be identified along with comparing the magnitudes of individual intervals.

A distribution graph displays certain characteristics:

• Spread: A measure of the distribution of values, that is, the full range from the lowest to the highest value.

• Center: A measure of the middle of the values in a data set, i.e. the most typical values; median/mean. These are the values around which other values are distributed.

• Shape: Displays where the values are located or packed tightly throughout the spread. The shape of the graph could be curved or flat. The patterns to identify are curves (upward or downward), peaks, concentrations, and gaps. Flat distributions show that the values are relatively uniformly distributed, i.e. the value plotted on the x-axis has minimal effect on the value plotted on the y-axis. Any peak in the graph represents the effect of that particular x-value on the y-value. More concentrated regions in the graph show the most common behavior exhibited. A gap represents an exception from the usual trend and may require further investigation.

• Outliers: These are values that fall outside the statistical norm, mostly found near the ends of a graph. They can be considered values that fall outside the thresholds.

To illustrate, suppose an organization wants to come up with a benefit scheme for its employees. To design the best scheme, an analysis is made to understand the number of employees grouped by age. In this case, it is important to see the whole spread of the employees’ ages and the range where the headcount is at its maximum. The histogram below represents this distribution well. It can be clearly seen that most of the employees fall into the age group of 25-30, whereas the fewest are in the age group of 45-50, where the distribution ends.


Chart 8 – Histogram

That said, although the histogram is the most widely used chart for displaying distributions, the ‘Center’ (median/mean) and ‘Spread’ characteristics described above are not very precisely visible in it.
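The counting step behind a histogram such as the age-group example can be sketched as follows: values are assigned to equal-width intervals and counted. The ages below are hypothetical and are not the data behind Chart 8; the function name and the interval settings are likewise assumptions.

```python
def histogram_counts(values, start, width, bins):
    """Count values falling in `bins` equal-width intervals
    [start + i * width, start + (i + 1) * width)."""
    counts = [0] * bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < bins:  # values outside the covered range are ignored
            counts[i] += 1
    return counts


ages = [22, 26, 27, 29, 31, 34, 38, 41, 47]  # hypothetical employee ages
counts = histogram_counts(ages, start=20, width=5, bins=6)
print(counts)  # [1, 3, 2, 1, 1, 1]

# A plain-text rendering of the bars, one row per age group.
for i, count in enumerate(counts):
    low = 20 + 5 * i
    print(f"{low}-{low + 5}: {'#' * count}")
```

Each interval includes its lower bound and excludes its upper bound, so an employee aged exactly 30 is counted in the 30-35 group; a real chart should state which convention it uses.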

Frequency polygons

If a line graph is used to visualize a distribution, it forms a frequency polygon. This graph is best suited for focusing on the distribution’s shape, since it uses only a line to trace the shape, with nothing else to distract. It is not ideal for comparing magnitudes between intervals. The same data displayed above (age group vs. number of employees) is plotted as a frequency polygon below. As can be seen, the shape is very clearly displayed: it peaks in the 25-30 age range and keeps decreasing towards the end.


Chart 9 – Frequency Polygons

Box Plot

Box plot is a rich extension of range bars (bars whose base does not start from zero). It uses a line that encodes the top and bottom ranges of values and a rectangle (box) in the middle to encode the midspread. Box plots are mostly used to display multiple distributions.

Chart 10 – Box Plot

The above graph shows the distribution of grades of students in different Majors. Median is the mid-point of the data and is shown by the line that divides the rectangle (box) into two parts.

(23)

2017 Dell EMC Proven Professional Knowledge Sharing 23 The top 50% of the grades scored by the students are represented by everything above the median and the rest is represented below it. Small circles represent those grades which are a lot more than normal or a lot less than normal (outliers). For example, if you consider computer science, the entire distribution ranges from grades 67 to 87, very few people have scored more than 78 grade, 78 being marked by the median (approximately), and the majority of them fall in the range 71 to 78. The one score that is represented by the circle outside reads 65 and is the outlier for this box plot.

The box plot (also called a box-and-whisker plot) shows which way the data is skewed. For example, if more students score higher grades, the median will be higher, or the top whisker will be longer than the bottom one. Thus, box plots provide good insight into the distribution of the data.
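The quantities a box plot encodes can be sketched in a few lines of Python. This is a minimal illustration, not the method used to draw Chart 10: the function name `box_plot_summary` and the grades are made up, the quartiles use the "inclusive" method of the standard `statistics` module, and outliers are flagged with the common 1.5 × midspread rule, which the paper does not specify.

```python
import statistics

def box_plot_summary(values):
    """Return the quantities a box plot encodes for one distribution."""
    data = sorted(values)
    # Quartiles: the box runs from q1 to q3 and is split by the median.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    midspread = q3 - q1
    # Assumption: values beyond 1.5 x midspread from the box count as outliers.
    low_fence = q1 - 1.5 * midspread
    high_fence = q3 + 1.5 * midspread
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],    # bottom end of the line
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],  # top end of the line
        "outliers": outliers,        # drawn as small circles
    }

# Made-up grades for a single major.
grades = [55, 67, 70, 71, 72, 74, 75, 76, 78, 78, 79, 80, 87]
summary = box_plot_summary(grades)
print(summary)
```

Running the same function once per major would give the numbers needed to draw several box plots side by side for comparison.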

Some of the best practices that can be followed while creating a distribution graph are:

• Keep intervals consistent – Interval sizes should be kept equal unless a majority of the values fall within a particular range.

• Select the best interval – Choose a balanced number of intervals (neither too few nor too many) that gives the best picture of the overall shape of the distribution without losing useful detail.
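The effect of the interval choice can be tried out with a small sketch. The helper `histogram_counts` and the ages below are hypothetical and only illustrate equal-width intervals; they are not the data behind Chart 8.

```python
from collections import Counter

def histogram_counts(values, start, width, bins):
    """Count values in `bins` equal-width intervals [low, low + width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are ignored
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i])
            for i in range(bins)]

# Made-up employee ages.
ages = [22, 24, 26, 27, 28, 29, 31, 33, 34, 38, 41, 47, 52]

narrow = histogram_counts(ages, start=20, width=5, bins=7)
wide = histogram_counts(ages, start=20, width=10, bins=4)

for low, high, count in narrow:
    print(f"{low}-{high}: {'#' * count}")
```

With 5-year intervals the peak in one interval stands out; with 10-year intervals the same data collapses into four bars and some of the detail in the shape is lost.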

Some other graph types used in distribution analysis are:

Strip Plots

When dots are used to visualize a distribution, they form a strip plot. This is useful for laying out the distribution of a small set of values so that each value is shown individually. However, this graph does not give a clear overview of the shape of the distribution.

Stem and leaf plots

This uses a column of numbers to form something similar to the bars in a histogram. They work well for small data sets.
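As a rough illustration of how the column of numbers is built, the sketch below splits each value into a stem (the tens) and a leaf (the units). It assumes non-negative whole numbers and, for brevity, skips stems that have no leaves; the function name and the data are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text row per stem, with the leaves listed in order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)  # stem = tens digit(s), leaf = units digit
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(stems.items())]

# Made-up employee ages.
ages = [22, 24, 26, 27, 28, 29, 31, 33, 34, 38, 41, 47, 52]
rows = stem_and_leaf(ages)
for row in rows:
    print(row)
```

The length of each row plays the role of a histogram bar, while the individual values stay readable.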

The types of graphs described above can be used to view distributions of a single set of values. They can as well be used for viewing/comparing multiple distributions with multiple instances. For example, when we want to compare the shapes of multiple distributions, line graphs can be utilized with multiple lines in a single graph.


Correlation Analysis

A graph that represents the relationship between quantitative variables, and how one may affect another, is termed a correlation graph. Two variables may also be correlated without one causing the other, when both are caused by one or more other variables. For example, if we want to compare education with income to see whether they are correlated, we can use a scatter plot (explained below). The result could be that the more education one has, the more money one earns; in that case the data show that education and income are strongly positively correlated.

Some of the characteristics that a correlation graph displays are:

Direction: This can be positive or negative. If the value of one variable increases when the value of the other increases, the correlation is said to be positive; if the value of one decreases when the other increases, the correlation is said to be negative. For example, a scatter plot sloping upwards from left to right represents a positive correlation, and one sloping downwards from left to right represents a negative correlation. Correlations can also be more complex when they change direction.

Strength: Tightly grouped values show a stronger correlation, and a scattered set of values shows a weaker correlation between the variables on the graph.

Shape: A correlation graph’s shape can be either straight (linear) or curved (curvilinear). A linear shape says that changes along the x- and y-axes are proportional, whereas a curve shows that the relationship between the values along the axes is not constant. Trend lines curved in a single direction normally form a shape that is either exponential or logarithmic. In a logarithmic shape, values go up or down at an ever-decreasing rate: the change starts out large, then steadily decreases and finally levels off. An example is adding people to a project, which initially increases productivity drastically, but eventually the gain from each added person levels off.

In an exponential shape, values go up or down at an ever-increasing rate. An example is the compound interest that we earn on our financial deposits.

Apart from these characteristics, concentrations, gaps and outliers are also displayed in correlation graphs. Any values that are very far off from the trend line that we identify in the graph are called Outliers. These are the values that do not fit the correlation.

Identifying the above characteristics in a correlation graph helps better understand the relationship between the variables.

Illustrations: negative exponential correlation and positive logarithmic correlation

The most commonly used and most suitable graph type for displaying correlation is the scatter plot. This graph helps identify a possible correlation between two variables only. To show correlations among more than two variables, a table lens or a scatterplot matrix can be used.

Scatter Plot

A scatter plot uses individual data points to represent the relation between two variables by placing them on a plane defined by two axes, each with its own scale. It displays all the characteristics of a correlation graph very well.

Chart 11 – Scatter Plot

The above graph displays the height and weight of individuals by gender. With the scatter plot, it is easy to view each data point value for both male and female. Also, it clearly illustrates that there is a cluster of “Female” data points (displaying the Strength characteristic) in the weight range 55-65 kg which shows a majority of women weigh in the range 55 to 65.


Chart 12 – Scatter Plot with trend

Another example, showing a strong correlation between two variables, appears above. The graph displays Internet use (hours per week) by individuals of different ages. It shows a negative correlation between age and Internet use: as age increases, the amount of time spent on the Internet decreases, and vice versa. In the upper right corner of the graph, the text r = -.82 is displayed, where ‘r’ is the ‘Coefficient of Correlation’. This is the numerical measure of the strength of the correlation between the two variables, and it can range from -1 to +1. Instead of using the terms ‘strongly’ or ‘weakly’ correlated, ‘r’ gives a precise value for the correlation’s strength. The ‘-’ (minus) sign represents negative correlation, which means that age and Internet use are negatively correlated, and the magnitude 0.82 shows that the correlation is very strong.

But one should keep in mind that this does not mean aging causes the decrease in Internet use. Correlation does not mean causation. A correlation says that the two variables are related in some way, but it does not say one causes the other.
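The coefficient ‘r’ discussed above is commonly computed as the Pearson correlation coefficient. The sketch below assumes that definition (the paper does not say how the value in Chart 12 was obtained), and the ages and hours are invented, so the result will not match the chart.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

# Made-up data: age in years and Internet use in hours per week.
ages = [18, 25, 32, 40, 48, 55, 63, 70]
hours = [30, 26, 27, 20, 18, 15, 14, 8]

r = pearson_r(ages, hours)
print(f"r = {r:.2f}")  # the sign gives the direction, the magnitude the strength
```

A value close to -1 or +1 indicates tightly grouped points around a straight line; it still says nothing about which variable, if any, causes the other.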

Scatterplot matrix

If more than two variables are to be compared, this can be done by combining multiple scatter plots. This type of graph is called a Scatterplot matrix. This allows us to explore how several variables interact with each other.

Table lens

This is a graph used to display correlations among several variables all at once. It uses horizontal bars arranged like a table, with a separate column for each variable and a separate row for each item to be measured.

This graph does not display correlations as precisely as the scatter plot, but it is useful for identifying correlations among several variables all at once. A software product called ‘Inxight’ supports expanding particular rows at the click of a button to focus on particular patterns without distractions.

Some best practices that can be followed while creating a Correlation graph are:

• Remove the fill color from data points, leaving only their outlines, when the points are close together and overlap one another.

• Use reference lines, which make it much easier to compare data to a reference set of values such as the average.

• When a scatter plot displays distinct groups of values, distinguish the groups visually by assigning different shapes or colors to their data points.

• Use trend lines; they can help identify the outliers in the graph. Trend lines can be drawn as a straight line or as a curve (using software that supports polynomial trend lines).

• Use multiple trend lines when the graph displays values that form more than one trend with different slopes.


• Use crosstabs – Crosstabs are used to remedy over-plotting, which occurs when a large number of values appear in a single scatter plot. The same information can be displayed with less complexity by dividing it into multiple graphs, for example separating regions and product categories, to help compare the correlation patterns more easily.
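The trend-line practice above can be illustrated with a short sketch. It assumes an ordinary least-squares straight line (charting tools may fit trend lines differently), and the function names and data points are hypothetical.

```python
def fit_trend_line(xs, ys):
    """Least-squares straight line through the points: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def farthest_from_trend(xs, ys, slope, intercept):
    """Index and residual of the point lying farthest from the trend line."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    index = max(range(len(xs)), key=lambda i: abs(residuals[i]))
    return index, residuals[index]

# Made-up points that follow a rising trend, except for the one at x = 5.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 15.0, 12.1]

slope, intercept = fit_trend_line(xs, ys)
index, residual = farthest_from_trend(xs, ys, slope, intercept)
print(f"candidate outlier: x={xs[index]}, y={ys[index]}, residual={residual:.2f}")
```

The point with the largest distance from the line is only a candidate outlier; whether it really does not fit the correlation remains a judgment for the analyst.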

Multivariate Analysis

The fundamental goal of any visual analysis technique is to make comparisons. In many visualization methods, comparisons are made between single instances of variables, or between two variables (bivariate data). But some cases demand comparisons between multiple instances of several different variables at the same time; such data is called high-dimensional (Hi-D) or n-dimensional (n-D) data, because the minimum number of coordinates needed to specify any point within the mathematical space is more than two. In such cases, Multivariate Analysis is used. Its purpose is to identify differences and similarities among items, each characterized by a common set of variables. Several visualization methods are used for Multivariate Analysis; a couple of them are described below.

Heatmap

Heatmap is a visual display that encodes quantities as variations in color. It is typically a matrix of rows and columns similar to a tabular structure that encodes quantitative values as colors rather than text.

Chart 13 – Heatmap

This chart shows a Heatmap of a customer support center, representing the number of calls received at different time windows of each day of the week. There are three variables to be analyzed here – day of the week, time of the day and number of calls. The combination of colors across a row shows the multivariate profile of the call density of the customer support center. The values here are shown as different shades of tangerine color. Patches of darker shade indicate a higher density of calls in that timeframe.
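The encoding step behind such a heatmap can be sketched as follows. The call counts are invented, and text characters stand in for the shades of color; the helper names `to_intensity` and `render` are assumptions made for this illustration.

```python
def to_intensity(matrix):
    """Scale every cell to the range 0..1 (0 = lightest shade, 1 = darkest)."""
    cells = [value for row in matrix for value in row]
    low, high = min(cells), max(cells)
    span = (high - low) or 1  # avoid dividing by zero when all cells are equal
    return [[(value - low) / span for value in row] for row in matrix]

SHADES = " .:*#"  # text stand-in for light-to-dark color shades

def render(matrix, row_labels):
    """Return one text line per row, with one shade character per cell."""
    lines = []
    for label, row in zip(row_labels, to_intensity(matrix)):
        cells = "".join(SHADES[round(x * (len(SHADES) - 1))] for x in row)
        lines.append(f"{label} {cells}")
    return lines

# Made-up call counts: rows are days, columns are four time windows.
days = ["Mon", "Tue", "Sat"]
calls = [
    [5, 20, 45, 30],
    [8, 25, 50, 28],
    [2, 6, 10, 4],
]

intensity = to_intensity(calls)
for line in render(calls, days):
    print(line)
```

Replacing the characters with a single-hue color ramp gives the familiar picture in which darker patches mark the busiest time windows.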

Heatmaps are best suited for revealing exceptions, but they can be hard to use on their own when searching for similar profiles. They add little or no value to the analysis when the colors have not been chosen appropriately. Choosing better colors improves the readability of a heatmap, but its usefulness for multivariate analysis is still limited, possibly because it is not easy to see the combination of colors that makes up a particular profile. The Parallel Coordinates plot is a better choice for representing complex multivariate profiles.

Parallel Coordinates

This type of visualization has proved to be one of the most efficient representations of high-dimensional multivariate profiles. It consists of an array of equally spaced vertical lines parallel to the Y-axis, each representing one of the dimensions. A point (or value) in n dimensions is represented as a polyline with vertices on the parallel axes, where the position of the vertex on the i-th axis corresponds to the i-th coordinate of that point.
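The mapping from a data point to its polyline can be sketched as below. The scaling of each axis to the range 0..1 is an assumption made for this illustration (plotting tools may scale axes differently), and the car records and helper names are made up.

```python
def axis_ranges(records, dimensions):
    """Minimum and maximum of every dimension, used to scale each axis."""
    return {d: (min(r[d] for r in records), max(r[d] for r in records))
            for d in dimensions}

def to_polyline(record, dimensions, ranges):
    """One (axis_index, height) vertex per dimension, with height in 0..1."""
    vertices = []
    for i, d in enumerate(dimensions):
        low, high = ranges[d]
        # A dimension with a single distinct value is drawn at mid-height.
        height = (record[d] - low) / (high - low) if high > low else 0.5
        vertices.append((i, height))
    return vertices

# Made-up car records.
cars = [
    {"cylinders": 4, "mpg": 32.0, "seconds_to_60": 16.5},
    {"cylinders": 8, "mpg": 15.0, "seconds_to_60": 11.0},
    {"cylinders": 6, "mpg": 21.0, "seconds_to_60": 14.0},
]
dimensions = ["cylinders", "mpg", "seconds_to_60"]

ranges = axis_ranges(cars, dimensions)
polyline = to_polyline(cars[1], dimensions, ranges)
print(polyline)
```

Each record yields one polyline; drawing all of them across the same axes produces the plot, and reordering `dimensions` reorders the axes without changing the data.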

A Parallel Coordinates plot closely resembles a time series, and an untrained eye may even mistake it for a cluttered, unreadable time series plot. The difference is that it is applied to data where the axes do not correspond to points in time, so the order of the parallel axes has no significance and may be changed to assist readability. Note, however, that a parallel coordinates plot is helpful only when the chart is interactive, meaning the user must be able to apply various filters to the data presented.

Chart 14 – Parallel coordinates showing multivariate data

Consider a use case of comparing different models of cars. There are many variables to compare, such as the number of cylinders, engine characteristics, fuel economy and acceleration. The above chart shows a parallel coordinates plot of different variables of cars. At first, the chart may look intimidating, and users may not be able to make much sense of it. This is when interactive charts help. The chart becomes much more useful if there is an option to apply a range mask on any or all of the variable scales. For example, to isolate the cars which have the best acceleration, we can apply a range mask on the “time to 60 mph” axis.

Chart 15 – Parallel Coordinates plot with filter applied

The above chart filters the data to show lines only for cars which take 10–12 seconds to reach a speed of 60 mph. Now the chart starts making more sense. Observe that all these cars with high acceleration have a low fuel economy of just ~15 mpg, engines with more than 300cc of displacement and a higher number of cylinders.
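A range mask of this kind amounts to a simple filter on the underlying records. The sketch below uses invented car data and a hypothetical helper, `apply_range_mask`; an interactive chart would redraw only the polylines of the records that pass the mask.

```python
def apply_range_mask(records, masks):
    """Keep the records whose value lies inside every (low, high) mask, inclusive."""
    return [r for r in records
            if all(low <= r[axis] <= high for axis, (low, high) in masks.items())]

# Made-up car records.
cars = [
    {"name": "A", "cylinders": 8, "mpg": 15.0, "seconds_to_60": 11.0},
    {"name": "B", "cylinders": 4, "mpg": 32.0, "seconds_to_60": 16.5},
    {"name": "C", "cylinders": 8, "mpg": 14.0, "seconds_to_60": 10.5},
    {"name": "D", "cylinders": 6, "mpg": 21.0, "seconds_to_60": 14.0},
]

# Mask on a single axis: cars that reach 60 mph in 10 to 12 seconds.
quick = apply_range_mask(cars, {"seconds_to_60": (10, 12)})

# Masks on two axes at once narrow the selection further.
quick_and_frugal = apply_range_mask(
    cars, {"seconds_to_60": (10, 12), "mpg": (14.5, 40)})

print([car["name"] for car in quick])
```

Because masks on several axes combine, the same function covers the case of applying a range mask on any or all of the variable scales.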

Rather than relying only on our eyes, multivariate analysis using parallel coordinates can be made more efficient by augmentation from computer algorithms to automate certain tasks such as ranking and/or grouping items by similarity. Such algorithms rely mostly on pattern


Types of Charts – Summarized

| Type of visualization | Number of dimensions | Average number of data values | Type of data | Typical use case |
|---|---|---|---|---|
| Bar Chart | 1 | ~50 | Categorical | Represent frequency of the values of a dimension/aggregation function output. The height of the bars represents the frequency count of the values. |
| Line Chart | 1 | ~50 | Ordinal, Interval | Represent the frequency of the values of a dimension/aggregation function output. Data points are connected by lines to help display patterns or trends. |
| Pie Chart | 1 | ~10 | Categorical | Compare values of a single dimension as proportions/percentages of the whole. |
| Histogram | 1 | ~50 | Ordinal or continuous | Indicate the shape of the distribution of values. |
| Box Plot | 2 | ~10 | Continuous, Categorical | Show the distribution of values. The categorical dimension can be used to split into multiple box plots for comparison. |
| Scatter plot | 2 or 3 | ~1000 per dimension | Continuous, continuous | Examine how two data dimensions relate or to detect clusters and trends in the data. |
| Parallel Coordinates | n | ~1000 per dimension | Any | Visualize multidimensional data in a single chart. |


Data Visualization Tools

Data visualization technology is offered by many traditional business intelligence and analytics vendors, as well as by newer market entrants. The list below is divided into two categories: the first covers tools meant for software developers, and the second covers data visualization software products meant for end users.

For Software Developers

D3.js

D3.js, short for ‘Data-Driven Documents’, is the first name that comes to mind when we think of data visualization libraries. It uses HTML, CSS, and SVG. It is free and open source, with support for rich, interactive and attractive charts. Overall, it is a feature-packed platform. Even though it does not come with charts out of the box, it has a helpful gallery which showcases what is possible with D3.js. On the flip side, there are two major concerns with D3.js. First, it has a steep learning curve: one has to study and understand the complex D3 library before being able to generate charts. Second, it is compatible only with modern browsers which support SVG. Thus, it is suitable only when the developers are fairly knowledgeable in the D3 library and support for older browsers is not a concern.

Chart.js

Chart.js is a small open source library that supports a limited set of chart types: line, bar, radar, polar, pie and doughnut. It is a good fit for small projects which do not need many complex graphs; sometimes these are all the charts an application needs. It uses the HTML5 canvas element for rendering charts, and all the charts are interactive. It is one of the most popular open-source charting libraries of recent times.

Google Charts

Google Charts renders graphs using HTML5/SVG interchangeably. This ensures cross-browser compatibility and portability to iOS and Android devices. It also includes VML for supporting older versions of Internet Explorer. It offers the most commonly used chart types like bar, area, pie, and gauges. It is flexible and developer-friendly without a steep learning curve.

Highcharts

This is another big player in the charting space. Highcharts offers a diverse range of charts and maps right out of the box, very similar to FusionCharts. Besides basic charts, it also offers a separate package for stock market charts, called Highstock. Highcharts is free to try and to use in personal applications, but a license is required for commercial use.

There are many other interactive charting libraries such as Leaflet, which offers mobile-friendly interactive maps and Dygraphs, which is useful for handling huge data sets.


For End Users

Tableau

Considered the most popular visualization tool, Tableau supports a wide variety of charts, graphs, maps and other graphics. Its main selling point is that it is available as a free tool, and the charts created can be quickly embedded in web pages. It offers a nice gallery of all the supported types of visualizations. There is also a paid version that offers more capabilities.

Microsoft Power BI

Microsoft Power BI is a suite of business analytics tools designed to enable business users to analyze data and share insights. The components include Power BI dashboards, which offer customizable views for business users for all their important metrics in real-time, accessible from any device.

Tibco

While Tibco is still making the transition from a desktop to a cloud software vendor, its self-service business intelligence tool “Tibco Spotfire Desktop” is an excellent way to start visualizing spreadsheet data.


Role of Visualization in Business Growth

George from sas.com says, “Inherently, business intelligence enables getting the right information to the right person at the right time and in the right format so that action can be taken. Data visualization backed by analytics enables business intelligence solutions to empower better decisions faster.”

The only dependable way to survive in business these days is to keep investing resources in collecting data, learning from it, and aligning the business roadmap with historical performance records. For a lot of enterprises, interpreting data can be a real challenge. In such cases, data visualization becomes crucial.

Some of the ways data visualizations can impact business growth:

Understand information quickly

By presenting business information graphically, many businesses are able to analyze large amounts of data in clear, interconnected ways and draw inferences from that information. And since it is significantly quicker to investigate information in graphical form than in spreadsheets, businesses can tackle problems or answer questions in a more timely manner.

Uncover relationships and patterns between business and operational activities

Large amounts of complicated data start making sense when presented properly with graphs. Businesses can recognize factors that are highly interrelated. Some of the correlations will be easy to spot, but others will not. Identifying those relationships helps organizations focus on the areas most likely to influence their most important goals.

Spot evolving trends swiftly

Using data visualization techniques, evolving trends can be identified very quickly. This can give businesses an advantage over the competition and ultimately affect the outcome. In the past, large enterprises usually hosted consumer panels that helped them learn what consumers thought about upcoming product features. Today, data visualization techniques have made it possible for organizations to make business decisions more quickly by interpreting large datasets in a relatively short span of time.

Leverage from open data

One of the advantages of data visualization is that it enables analysis of complex data extracted from various sources, including product communities. Businesses can leverage this by:

1. Collecting product ratings from the Internet that can feed into product development and/or support processes.

2. Monitoring competitor pricing available from online catalogs and aligning to it.

3. Analyzing where to market a product, or where to place a product for the best sales.

Creating dashboards

A dashboard is a visual display of crucial information consolidated on a single screen so that it can be monitored at a glance. Dashboards are a part of almost all products these days, and they use visualizations to achieve their objective.

From Theory to Practice

Impact on Big Data

According to IDC, from the beginning of recorded time until 2003, humans had created 5 exabytes (5 billion gigabytes) of data. In 2011, the same amount of data was generated every two days. By 2020, the number of digital bits will be comparable to the number of stars in the universe. With the volume of data doubling every two years, global data will grow from 4.4 zettabytes in 2013 to 44 zettabytes in 2020.

“Information, no matter how important, cannot speak for itself.”

Information depends on us to tell its story. Data, with its unique blend of complexity and subtlety, is presented better through a well-crafted visualization. Data visualization will continue to be the most fundamental tool for any company dealing with Big Data. Many organizations have come very far in that pursuit, and yet they have farther to go. Large, well-capitalized firms in banking, insurance, retail, real estate, health services, government, human resources, security, etc. are gathering and preserving highly granular data comprising billions of pieces of information.

Big Data Challenges

Big Data has some challenges and complications that can be primarily divided into three groups according to Akerkar et al.

Data Challenges

These are the difficulties related to volume, variety, velocity and veracity. Volume refers to the large amount of data generated and the problems of storing and analyzing it. Variety refers to the different types of data sources, such as financial data, social media, etc. Velocity refers to the real-time processing needed for streaming data analysis. Veracity refers to the complex nature of data, which results in a lack of accuracy.

Processing Challenges

These are the challenges related to the lack of processing power for such huge collections of data, whose processing may take a long time. By the time the data is processed, it may have become stale and no longer useful.

Management Challenges

This usually refers to secured data storage, its processing, and collection. Data privacy and its ethical security along with its governance are the main focus. Information Security institutes compile the regulations which eventually control the data and its use.

Modern-day methods, practices, and tools for data analysis are still not flexible enough to discover valuable information in the most efficient way. As a result, the problem of data perception and presentation remains open, as does the task of uniting the intangible world of data and the material world through visual representation. Scientists around the world are still working to narrow this gap.

The Future of Data Visualization

Internet of Things

Over the next decade, billions of devices will be connected to each other. The Internet of Things provides insight into what's happening around us, from smart watches and wearables to sensors and monitors.

Network Theory

Graph theory is the basis of Network Theory, which provides algorithms for understanding the relationships between objects. It is important because it examines those relationships to predict the expected spread of information. It is also important for finding the shortest path between two points and for finding target objects based on their behavior.

Complexity Theory

Systems across the globe are characterized by complex, non-linear interactions. They evolve dynamically and often unpredictably. This is referred to as the “butterfly effect”: small disturbances in one state can result in massive consequences in an unrelated state. The principal insight of Complexity theory is that it is not possible to predict a future state with confidence, but it is possible to understand the structure and potential states of complex systems.

Multidimensional Visualization

The saying “a picture is worth a thousand words” gained credence from our ability to process visuals more readily than text. Recent progress in computer graphics is making possible visualizations that enable the combination, manipulation, and investigation of dynamic multidimensional data sets. Multidimensional visualizations allow users to interact with data more effectively.

Data visualization will be profoundly affected by the advancement of the Internet of Things. The interactions between humans and machines can be properly reflected through visualizations. Also, with the evolution of frameworks like Network and Complexity theories, visualizations will help us better reflect the structural dependencies of the environment. Consequently, improvements in multidimensional visualization will allow us to fuse and explore multidimensional data sets more effectively.


Conclusion

“There is magic in graphs. The profile of a curve reveals in a flash a whole situation – the life history of an epidemic, a panic, or an era of prosperity. The curve informs the mind, awakens the imagination, convinces” – Henry D. Hubbard

Through this paper, we have tried to explore the role of vision and memory in data visualization, and the role of pre-attentive attributes and analytical interactions in finding patterns and relationships among data. We have covered the “What,” “Why” and “How” of Data Visualization.

“The main purpose of Data visualization is not just pictures but insight.”

We discussed different types of analysis and the graphs suitable for each, helping readers understand and identify what graphs to choose and when.

We explored a few areas related to the application and importance of data visualization in the current market and its impact on Big Data and analytics. Over the course of our research, we have realized that visualization plays a pivotal role in any field of data investigation. Over the decades, visualization has evolved so much, with intuitive graphs and easy correlation techniques, that it has helped not only large enterprises with professional analysts and scientists but also smaller firms with less expertise. A visualization should be designed so that even employees who are not data scientists or analysts, such as those in marketing, finance or supply chain operations, can quickly and easily explore data, spot irregularities based on their own skills, and even get answers to questions they have not yet thought to ask.

“Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers – even a large set – is to look at the pictures”.

Data Visualization not only saves time through improved decision-making but also allows an analyst to perform better ad-hoc data analysis. As the world becomes progressively more interconnected and interdependent, we see many opportunities to generate value through data visualization and believe that they will only increase. The world has responded, with many organizations effectively implementing Data Visualization solutions. According to a survey by IDG Research, “nearly fifty percent of the IT professionals say that both the business and IT organizations are driving business intelligence and/or data analytics at their enterprise.” That makes it increasingly important to understand Data Visualization and its techniques.

References

Research materials and learning resources

Books

1. Now You See It – By Stephen Few

2. Applied Security Visualization – By Raffael Marty

Whitepapers

1. Data Visualization: Making Big Data Approachable and Valuable – SAS

2. Beginner’s Guide to Effective Business Storytelling with Data Visualizations – reportplus.com

3. Big Data Visualization: Turning Big Data Into Big Insights – Intel

4. Principles of Data Visualization - What We See in a Visual – FusionCharts

5. Data Visualization - Past, Present, and Future – Stephen Few

Tools used for generating the charts in this document

1. Microsoft Office Excel

2. Highcharts charting library – www.highcharts.com

3. D3.js charting library – www.d3js.org

Dell EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Use, copying and distribution of any Dell EMC software described in this publication requires an applicable software license.
