
Q and then generating the aggregate views. In particular, for aggregate views, the target views are generated on a subset of data defined by a refined query Qj ∈ Q, and the comparison views are generated on another subset of data defined by another refined query Qk ∈ Q, such that Qj and Qk have a contextual reference to each other.

Recall from Section 4.3.2 that a query is defined by its predicates (Qj = Pj0, Pj1, ..., Pjn). We define a metric to determine the contextual reference between two queries Qj and Qk in terms of the difference in predicates between the two queries. In particular, a context is set if the two queries do not differ from each other by more than one predicate. Note that, in contrast to input query refinement, here we are not interested in the exact amount by which the predicate values differ. To establish that the queries are contextually related, it is enough to know that the two queries differ from each other by only one predicate.

C(Qj, Qk) = |Qj ∩ Qk|

We define an aggregate view Vi(Qj, Qk) as the ith aggregate view from the two subsets defined by queries Qj and Qk. Specifically, the target view of Vi(Qj, Qk) is defined on the subset of data selected by Qj, while the comparison view is defined on the subset selected by Qk. We define C as a constraint in our problem setting. The utility of a view, U, is still the deviation-based utility defined in Chapter 4. Formally, the problem of reference view refinement is stated as:

Definition: Reference View Refinement for View Recommendation: Given a user-specified query Q on a database DB, a set of refined queries Q, and a multi-objective utility function U, find the k locally interesting aggregate views Vi(Qj, Qk) that have the highest utility values, from all of the refined queries Qj ∈ Q, such that C(Qj, Qk) = 1.

All combinations of Qj and Qk form a combinatorial problem. In particular, let the number of refined target views be N; then, considering the same views for comparison, the number of possible aggregate views is N × (N − 1). However, this is not a performance bottleneck for us: the subsets of data defined by the refined queries and the aggregate queries for the target views are already retrieved from the database, so the already retrieved data can be reused for reference dataset refinement and for the comparison views. Calculating the constraint C for a huge number of view combinations, on the other hand, is computationally expensive. A simple baseline solution is to consider all combinations of Qj and Qk and to use the ones that pass the constraint to generate and recommend the top-k views.
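The baseline can be illustrated with a minimal Python sketch. It follows the prose reading of the constraint (two queries are contextually related when they differ in exactly one predicate) and counts differing predicate positions; this reading, the representation of a query as a tuple with one predicate per attribute slot, and the names `predicate_difference`, `contextual_pairs`, `top_k_views`, `view_specs` and `utility` are all assumptions made for illustration, not part of the system described here.

```python
import heapq
from itertools import permutations


def predicate_difference(qj, qk):
    """Count the positions at which two queries carry different predicates.

    Assumption: both queries are tuples with one predicate per attribute
    slot, in the same attribute order, so positions are comparable.
    """
    if len(qj) != len(qk):
        raise ValueError("queries must have the same number of predicate slots")
    return sum(1 for pj, pk in zip(qj, qk) if pj != pk)


def contextual_pairs(refined_queries):
    """Baseline: test all N * (N - 1) ordered pairs (Qj, Qk) and keep
    those that differ in exactly one predicate."""
    return [
        (qj, qk)
        for qj, qk in permutations(refined_queries, 2)
        if predicate_difference(qj, qk) == 1
    ]


def top_k_views(refined_queries, view_specs, utility, k):
    """Score every view spec on every contextually related pair and
    return the k candidates with the highest utility.

    `utility` is a caller-supplied function (spec, qj, qk) -> float; it
    stands in for the deviation-based utility and is hypothetical here.
    """
    candidates = [
        (spec, qj, qk)
        for qj, qk in contextual_pairs(refined_queries)
        for spec in view_specs
    ]
    return heapq.nlargest(k, candidates, key=lambda c: utility(*c))


if __name__ == "__main__":
    queries = [
        ("sex = 'F'", "edu = 'BSc'"),
        ("sex = 'M'", "edu = 'BSc'"),
        ("sex = 'M'", "edu = 'MSc'"),
    ]
    for qj, qk in contextual_pairs(queries):
        print(qj, "<->", qk)
```

In this toy input, the first and third queries differ in two predicates, so that pair is discarded in both directions, and four of the six ordered pairs remain.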

5.3 View-360

We present a prototype of a visual data exploration tool, titled View-360, that recommends interesting visualizations by addressing the following four aspects: i) automatic binning of numerical dimensions, ii) automatic refinement of the input query and reference dataset, iii) ensuring the statistical significance of recommended views through hypothesis testing, and iv) ensuring that the recommended aggregate views have semantic value by carefully selecting the target and comparison views for each visualization.

Moreover, it is unclear how to decide which attributes to use for predicates, dimensions and measures. Therefore, to keep the system flexible and to maximize the discovery of insights, we allow the sets P, A and M to overlap.

98 CHAPTER 5. VIEW-360: A PROTOTYPE SYSTEM FOR VIEW RECOMMENDATION

However, within a single view, an attribute can be used in only one role at a time (as a measure, as a dimension, or in a predicate) to keep the semantics of the visualization intact. For instance, assume that the user marks the attribute edu for all of the sets P, A and M. When edu is used in refinement, it does not make sense to also use it as a dimension and perform binning on it, or to use it as a measure and apply an aggregate function to it; therefore, for views that have edu as a predicate, it is removed from the sets A and M. Later, it is included in set A and consequently removed from the other two sets, and so on. In this way it is treated, in turn, as a dimension, a measure and a predicate for the discovery of insights.
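The one-role-per-view rule can be sketched as a filter over attribute combinations. The following Python sketch is illustrative only: it assumes, for simplicity, that each view uses a single predicate attribute, a single dimension and a single measure, and the function name `role_assignments` and the attributes other than edu are hypothetical.

```python
def role_assignments(P, A, M):
    """Yield (predicate_attr, dimension_attr, measure_attr) triples in
    which no attribute plays more than one role within the same view.

    P, A and M may overlap; the overlap is resolved per view, not globally.
    """
    for p in P:
        for a in A:
            for m in M:
                # Three distinct names means each attribute has one role.
                if len({p, a, m}) == 3:
                    yield (p, a, m)


if __name__ == "__main__":
    # edu is marked for all three sets by the (hypothetical) user.
    P = ["edu", "sex"]
    A = ["edu", "age"]
    M = ["edu", "income"]
    for triple in role_assignments(P, A, M):
        print(triple)
```

With these inputs, edu appears as a predicate in one view, as a dimension in another and as a measure in a third, but never in two roles within the same view.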

We have also explored the possibilities of diversifying or unifying results, and found that unifying results, i.e., putting related views together, helps the user draw insights from the data. Therefore, after generating the top-k views, View-360 provides three options for viewing them: 1) in order of their ranking, 2) as diversified top-k views, and 3) grouped, so that the results from the same target query, or the results having the same dimension attribute, are put together. The analyst can perform the analysis using one or all of the three options mentioned above. This helps in synthesizing insights into the data. We discuss this in detail in our analysis of various datasets.
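Options 1 and 3 can be illustrated with a short Python sketch; option 2 is omitted because the diversification method is not specified in this section. The dictionary keys `target`, `dimension` and `utility`, and the function names, are assumptions made for illustration.

```python
from collections import defaultdict


def by_rank(views):
    """Option 1: order views by decreasing utility."""
    return sorted(views, key=lambda v: v["utility"], reverse=True)


def grouped(views, key):
    """Option 3: put views that share `key` together.

    `key` is "target" (same target query) or "dimension" (same dimension
    attribute). Groups are listed in the order of their best-ranked
    member, and views inside a group stay in rank order.
    """
    groups = defaultdict(list)
    for view in by_rank(views):
        groups[view[key]].append(view)
    return dict(groups)


if __name__ == "__main__":
    views = [
        {"target": "Q1", "dimension": "age", "utility": 0.4},
        {"target": "Q2", "dimension": "age", "utility": 0.9},
        {"target": "Q1", "dimension": "edu", "utility": 0.7},
    ]
    print(grouped(views, "target"))
    print(grouped(views, "dimension"))
```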

Moreover, View-360 also provides a feature to explore a view further by plotting the scatter plot or the frequency distribution of the attributes involved in that view. This feature is extremely useful in many cases, as shown in the detailed discussion of datasets in the next sections. In particular, when a recommended view has SUM as its aggregate function, the view is not as straightforward to explain as views with other aggregate functions. For instance, a high SUM of a measure attribute for a particular category of the dimension attribute can be due to two reasons: 1) the values of the measure itself are high, or 2) the values of the measure attribute are small, but the number of instances in that category is large, so that they add up to a large value under the SUM function. Moreover, if the high value is due to (2), then the same view with COUNT as the aggregate function should also have a ranking close to that of the view with the SUM function. Therefore, to draw an insight from a view that has the SUM aggregate function, the same view with the COUNT aggregate function should also be examined. View-360's further-exploration feature provides this facility to quickly look at the required related view or other visualizations that help explain the insight.
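The two explanations for a high SUM can be told apart by placing SUM next to COUNT for each category. The sketch below does this on made-up rows; the per-category average is an added convenience and not something the text above prescribes, and the function name `sum_count_avg` and the data are hypothetical.

```python
from collections import defaultdict


def sum_count_avg(rows):
    """rows: iterable of (category, measure_value) pairs.

    Returns {category: (sum, count, average)} so that a large SUM can be
    attributed to large values, to many instances, or to both.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, value in rows:
        totals[category] += value
        counts[category] += 1
    return {c: (totals[c], counts[c], totals[c] / counts[c]) for c in totals}


if __name__ == "__main__":
    # Category A: few instances with large values (reason 1).
    # Category B: many instances with small values (reason 2).
    rows = [("A", 500.0)] * 2 + [("B", 10.0)] * 100
    for category, (total, count, avg) in sum_count_avg(rows).items():
        print(category, total, count, avg)
```

Both categories have the same SUM here, but only category B would also stand out in the COUNT view, which is the case in which the SUM and COUNT views are expected to rank closely.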

5.3.1 General Settings

In the default settings, we assume that there is no input query; therefore, all possible input queries (subsets of data) are considered for recommendation. The user specifies attributes from the dataset for the following parameters:

Ac - Categorical dimension attributes

An - Numerical dimension attributes

M - Measure attributes

Pc - Categorical attributes for predicates