The second category of related work comprises a wide range of tools that promote aggregate data transparency, an approach that makes data more accessible and understandable. Aggregate data transparency addresses the critical data literacy aspect of the information literacy curriculum. It is especially relevant when data about an individual is part of the data being aggregated and mined. Decisions made on a collective level with aggregated datasets are sometimes biased, even discriminatory, and affect certain groups of people. A typical example is when banks use data mining models to help decide whether to approve or reject a loan application. Trained on historical records in which female applicants were often rejected, a model may tune its parameters to discriminate against women in general. Another example is when data about OSN users is aggregated and different services, such as advertisements and recommended news, are tailored to targeted user groups. People have a right to know how their data can be used collectively and what consequences may result from inference or mining on the aggregated data. They should also be able to play a constructive role in such decision processes. Data transparency tools, including FA tools, can help them.
There is a spectrum of FA tools that can be differentiated by the extent to which they offer data visualization. Recall from the previous section that some tools offer little visualization, e.g. Privacy Check. Such tools rely primarily on text and numbers to convey information rather than on the visual positioning, coloring, transitions, etc. that are typical of data visualization. Other tools incorporate static data visualization, i.e. users cannot interact with the visualizations, such as Personal Analytics for Facebook. Then there are tools that incorporate basic interactive visualization components, such as PViz, in which the user has limited interaction with the visual objects for one or two variables in a single view. Finally, there are tools that rely on visualizations with rich interactivity, letting users explore and understand data for several or more variables across multiple views. We call them exploratory visualization tools.
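To make the loan example above concrete, the following minimal sketch compares rejection rates between two groups in a toy set of historical decisions. The records, the function name `rejection_rates` and the simple rate difference are assumptions made for illustration; they are not the discrimination measures or data used in this thesis.

```python
from collections import Counter

# Hypothetical historical loan decisions: (applicant gender, decision).
# These records are invented for illustration only.
records = [
    ("female", "rejected"), ("female", "rejected"), ("female", "approved"),
    ("female", "rejected"), ("male", "approved"), ("male", "approved"),
    ("male", "rejected"), ("male", "approved"),
]


def rejection_rates(rows):
    """Return the fraction of rejected applications per group."""
    totals = Counter(group for group, _ in rows)
    rejected = Counter(group for group, decision in rows if decision == "rejected")
    return {group: rejected[group] / totals[group] for group in totals}


rates = rejection_rates(records)

# Difference in rejection rates between the two groups. A large gap hints that
# a model fitted to these records could learn to treat gender as a predictor.
gap = rates["female"] - rates["male"]
```

A gap of this kind is only a first indication; it does not by itself show that gender caused the decisions, which is one reason why tools for exploring such patterns in more depth are useful.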
In this thesis, we focus on two use cases: discrimination-aware visual mining and OSN user sentiment comparison. More specifically, in the first use case, we developed an online exploratory visualization tool named D-explorer that helps users explore mined patterns on potential discrimination, as elaborated in Chapters 8 and 10. In the second use case, we tested social hypotheses about sentiment expression on OSN data and developed algorithms to extract comparisons of OSN-user subgroups that differ in sentiment expression. This work serves as a first step towards building exploratory visualization tools that help OSN users gain insights into their networks on a macro-level, as elaborated in Chapter 7. Next, we review the related work on visualization tools for data analytics and exploration.
2.2.1 Visualization Tools for Data Analytics and Exploration
All the tools mentioned in this subsection provide standard data reporting utilities, including tabular and geographical data processing and conventional visualizations such as bar charts, treemaps and geo-charts. They do not address our use cases, which require parsing classification rules, associating rule items, measuring meta-level characteristics of discrimination rules, comparing OSN-user groups by sentiment, and visualizing the resulting patterns with tailored visualizations (as detailed in Chapters 7, 8 and 10). However, this survey gives us a clearer view of the current generation of exploratory visualization tools and lets us make more informed decisions in developing our own tools.
Commercial Business-intelligence Applications
Business Intelligence (BI) is a term frequently used in industry to refer to the techniques or tools that “[transform] raw data into meaningful and useful information for business analysis purposes”24. There exist plenty of
APPROACHES TOWARDS AGGREGATE DATA TRANSPARENCY 27
BI applications, such as Tableau25, QlikView26, BIME27, Jaspersoft28 and Metric Insights29. These tools require a paid license for unrestricted use, and most are desktop tools that must be installed.
Visualization Packages in Desktop Computing Environments
There are also visualization packages that come with computing environments/languages such as R30, MATLAB31 and Python32. However, to process and visualize data, users have to install the corresponding environment and learn a full-fledged language. Data exploration relies on the user typing commands or writing complex programs. The variety and interactivity of the visualizations provided by these packages are also more limited than those of BI applications.
Online Public Data Explorers
Online public data explorers are freely accessible websites, often equipped with rich visualization templates that are usually more intuitive to use than command lines. Users can explore public datasets and/or upload their own to inspect. Typical examples of general-purpose public data explorers are Google Public Data Explorer33 and RAW34. There are also special-purpose ones such as We Feel Fine35. We consider these tools good examples to follow in building our own tools.
Google Public Data Explorer provides a series of interactive data visualizations for people to explore a wide range of public datasets from various international organizations and academic institutions. These datasets mostly contain tabular data, sometimes geographical data. Google created a new data format, DSPL (Dataset Publishing Language36), so that anyone can upload, visualize and share their own datasets using Google Public Data Explorer. A screenshot of the tool is shown in Figure 2.7.
25 http://www.tableau.com/
26 http://www.qlik.com/
27 https://www.bimeanalytics.com/
28 https://www.jaspersoft.com/
29 http://www.metricinsights.com/about/
30 https://www.r-project.org/
31 http://nl.mathworks.com/products/matlab/
32 https://www.python.org/
33 http://www.google.com/publicdata/directory
34 http://raw.densitydesign.org/
35 http://wefeelfine.org/
36 https://developers.google.com/public-data/?hl=en
Figure 2.7: A screenshot of Google Public Data Explorer
The user can select a dataset from its data repertoire in the data-selection panel on the left, for example, a dataset on birth rate. The user can then filter the regions or countries shown in the visualization by checking the checkboxes on the left. The user can perform more detailed filtering and highlighting by interacting with the visualization on the right. Different views are available as well: Figure 2.7 shows a multiple-line chart, and the bar chart and bubble chart views are available at the top right corner of the visualization. This tool is valuable in the sense that:
1. It has a rich data repertoire that documents some of the key human developments.
2. It provides multiple visualization views with rich interactive functions.
3. It is a generic data visualization tool that allows rendering custom data.
4. It is freely and openly available.
We Feel Fine [100] is described on the website of the We Feel Fine project37 as follows: “We Feel Fine has been harvesting human feelings from a large number of weblogs. At the core, We Feel Fine is a data collection engine that automatically scours the Internet every ten minutes, harvesting human feelings from a large number of blogs. Blog data comes from a variety of online sources, including LiveJournal, MSN Spaces, MySpace, Blogger, Flickr, Technorati, Feedster, Ice Rocket, and Google”. The system searches the world’s newly posted blog entries for occurrences of the phrases “I feel” and “I am feeling”. When it finds such a phrase, it records the full sentence, up to the period, and identifies the “feeling” expressed in that sentence (e.g. sad, happy, depressed, etc.). Because blogs are structured in largely standard ways, the age, gender, and geographical location of the author can often be extracted and saved along with the sentence, as can the local weather conditions at the time the sentence was written. All of this information is saved. We Feel Fine only collects and displays data that was already posted publicly on the World Wide Web.
Figure 2.8: Screenshots from We Feel Fine
We Feel Fine gives users an up-to-the-moment view of the feelings expressed around the world. Users can look up the emotional status of a particular group of people via the “filters”.
Figure 2.8 shows six screenshots of We Feel Fine. In Figure 2.8(1), each bubble represents a sentence or a picture that has recently been posted, and users can click on a bubble to reveal the “I feel” content. Figure 2.8(2) shows a user interface that helps the user filter feelings based on feeling keywords and on the blogger’s gender, age, location, local weather and year. Figure 2.8(3) and Figure 2.8(4) show different presentations of the top feeling keywords within the last few hours. Figure 2.8(5) and Figure 2.8(6) show that feelings can be mapped according to gender and location.
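The sentence-harvesting step described above for We Feel Fine can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the feeling vocabulary `FEELINGS`, the regular expression and the function name `harvest_feelings` are hypothetical, and the feeling is taken to be the first vocabulary word found in the sentence.

```python
import re

# Assumed, tiny feeling vocabulary; the real system's list is not given here.
FEELINGS = {"sad", "happy", "depressed"}

# A sentence containing "I feel" or "I am feeling", recorded up to the period.
PATTERN = re.compile(r"[^.]*\bI (?:feel|am feeling)\b[^.]*\.", re.IGNORECASE)


def harvest_feelings(text):
    """Return (sentence, feeling) pairs for sentences with a trigger phrase.

    The feeling is None when no word from FEELINGS occurs in the sentence.
    """
    results = []
    for match in PATTERN.finditer(text):
        sentence = match.group(0).strip()
        words = re.findall(r"[a-z]+", sentence.lower())
        feeling = next((word for word in words if word in FEELINGS), None)
        results.append((sentence, feeling))
    return results


# Hypothetical blog post used only to exercise the sketch.
post = "Long day at work. Today I feel happy about the results. I am feeling a bit tired."
pairs = harvest_feelings(post)
```

In this sketch, sentences without a trigger phrase are skipped, and a matching sentence whose feeling is not in the vocabulary is still recorded, with `None` as its feeling. Attaching author metadata such as age, gender, location or weather would require additional data sources that are outside the scope of this illustration.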
2.3 Approaches towards Data-visualization Library Design
To develop exploratory visualizations, developers need to rely on data-visualization libraries, which reduce repetitive coding and promote modularized tool design. Related work on visualization library design includes design patterns for reusable object-oriented software [64] and data-visualization taxonomies [154, 92, 31]. However, to the best of our knowledge, there has been no study on how to design data-visualization libraries. In this thesis, we fill this gap by making connections between data-visualization taxonomies, software design patterns and the current generation of libraries for online data-visualization development. We discuss related work in more detail in Chapter 9.