Similarity-aware query refinement for data exploration

(1)

for Data Exploration

Abdullah Mohammed Albarrak

Master of Computer Science

A thesis submitted for the degree of Doctor of Philosophy at

The University of Queensland in 2017

(2)

Database users are easily overwhelmed by the sheer size of data found in large-scale scientific and financial databases. Exploring these databases to make sense of the explored data and to discover interesting insights (i.e., data exploration) has been, and still is, a hideous and labour-intensive task, especially for non-expert users with no solid background of the underlying data. Some three decades ago, the database research community noticed the limitation of traditional DBMS in supporting users for data exploration tasks. Since then, the research community has proposed and designed various effective and efficient data exploration techniques to assist users in extracting interesting insights from their data. An instance of these techniques is the Query Refinement technique.

In query refinement techniques, users’ queries are assumed to be imprecise, i.e., the returned result does not meet some user-defined constraints. Accordingly, the goal of query refinement techniques is to automatically refine these imprecise queries to maximize users’ satisfaction with the results. In particular, the predicates of the queries are carefully modified so that the returned results satisfy the user-defined constraints. Since users’ constraints on the queries results are diverse and miscellaneous, this thesis focuses on two specific forms of constraints in exploring relational and sequential data, namely, 1) user-defined aggregate constraints on the result, and 2) user-defined correlation constraints of time series data. These constraints are common in real world applications because they represent an upper level view of the result that is easier to understand and digest than the raw result itself.

This thesis addresses the limitations of current query refinement techniques that are oblivious to the similarity of the refined queries to the users’ initial queries. Specifically, users’ initial (and imprecise) queries are defined as anchor points for which the similarity of its corresponding refined queries are computed over the whole refinement space. Consequently, the similarity-aware query refinement problem is formulated as a search problem, which aims to balance the trade-off between minimizing the deviation from satisfying a constraint on the query result, and maximizing the similarity of the refined query to the initial one. Searching for a trade-off between satisfying a constraint on the result of a query and maximizing the similarity introduces various challenges. A common challenge shared by many query refinement problems is that finding an optimal trade-off

(3)

whether they are optimal or not incurs expensive computational and I/O costs. Hence, simply applying exhaustive solutions is not adequate since they hinder users’ exploration tasks and worsen the response time. In this thesis, we discuss in detail our three key contributions, which address the challenges above in the context of query refinement for aggregate and correlation constraints.

Firstly, we formally define the Similarity-aware, Aggregate-based Query Refinement problem, in which users specify aggregate constraints on the result and prefer refined queries that are similar to their initial ones. Then, we consider the special case of aggregate constraints, in which users specify cardinality constraints on their queries results. For that special case, we propose innovative Similarity-aware Query Refinement schemes (SAQR) which employ pruning techniques to avoid unnecessary evaluations of candidate refined queries that are considered unpromising. We also show the applicability of SAQR in a web-based application (ORange) which utilizes SAQR schemes for refining selected areas based on cardinality constraints.

Secondly, we address the general case of aggregate constraints, in which multiple constraints can be defined using SQL standard aggregate operatorssum, avg, min, max. We present EAGER schemes for this general case and propose efficient approximation and optimization techniques to elevate the shortcomings of aggregates loose bounds that are used in pruning unpromising candidate queries. Moreover, by comparison with related algorithms using real world datasets, we show the efficiency gains of our schemes under different experimental parameters.

Thirdly, we formulate the Similarity-aware, Correlation-based Query Refinement problem, in which users’ queries are refined to satisfy their pairwise correlation constraints of time series data. We show the computational hardness of this problem, and propose the RELATE scheme to address the associated challenges by utilizing the incremental property of correlation. Further, we propose two-level pruning techniques for the RELATE scheme to minimize the associated computational and I/O costs. These two techniques enable RELATE to avoid exhaustively traversing the search space by pruning unqualified candidate queries, and avoid computing pairwise correlation of every time series pair wherever possible. We demonstrate by experiments the performance gains of RELATE against state-of-the-art algorithm with real and synthetic datasets.

(4)

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.

I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my research higher degree candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.

I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.

I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis.

(5)

Peer Reviewed Journal Articles

• A. Albarrak and M. A. Sharaf. Efficient schemes for similarity-aware refinement of aggregation queries. World Wide Web, pages 1-31, 2017.

• I. A. Ibrahim, A. M. Albarrak, and X. Li. Constrained recommendations for query visualizations. Knowl. Inf. Syst., 51(2):499-529, 2017.

Peer Reviewed Conference Papers

• A. M. Albarrak and M. A. Sharaf. Query refinement for correlation-based time series exploration. In Databases Theory and Applications - 28th Australasian Database Conference, ADC 2017, Brisbane, Australia, September 25-28, 2017, Proceedings, pages 45-58, 2017. • A. Albarrak, T. Noboa, H. A. Khan, M. A. Sharaf, X. Zhou, and S. W. Sadiq. Orange:

Objective-aware range query refinement. In IEEE 15th International Conference on Mobile Data Management, MDM 2014, Brisbane, Australia, July 14-18, 2014 - Volume 1, pages 333-336, 2014.

• H. A. Khan, M. A. Sharaf, and A. Albarrak. Divide: efficient diversification for interactive data exploration. In Conference on Scientific and Statistical Database Management, SSDBM’14, Aalborg, Denmark, June 30 - July 02, 2014, pages 15:1-15:12, 2014.

• A. Albarrak, M. A. Sharaf, and X. Zhou. SAQR: an efficient scheme for similarity-aware query refinement. In Database Systems for Advanced Applications - 19th International Conference, DASFAA 2014, Bali, Indonesia, April 21-24, 2014. Proceedings, Part I, pages 110-125, 2014.

(6)

• A. Albarrak, M. A. Sharaf, and X. Zhou. SAQR: an efficient scheme for similarity-aware query refinement. In Database Systems for Advanced Applications - 19th International Conference, DASFAA 2014, Bali, Indonesia, April 21-24, 2014. Proceedings, Part I, pages 110-125, 2014. Incorporated in Chapter 3, Section 3.3.

Contributor Statement of contribution Abdullah Albarrak Conception and design (50%)

Analysis and interpretation (60%) Drafting and production (60%) Mohamed Sharaf Conception and design (40%)

Analysis and interpretation (30%) Drafting and production (40%) Xiaofang Zhou Conception and design (10%)

Analysis and interpretation (10%)

• A. Albarrak and M. A. Sharaf. Efficient schemes for similarity-aware refinement of aggregation queries. World Wide Web, pages 1-31, 2017. Incorporated in Chapter 3, Section 3.4.

Analysis and interpretation (10%) Drafting and production (20%)

(7)

Objective-aware range query refinement. In IEEE 15th International Conference on Mobile Data Management, MDM 2014, Brisbane, Australia, July 14-18, 2014 - Volume 1, pages 333-336, 2014. Incorporated in Chapter 3, Section 3.5.

Analysis and interpretation (60%) Drafting and production (80%) Tatiana Noboa Conception and design (15%)

Analysis and interpretation (10%) Hina Khan Analysis and interpretation (10%)

Drafting and production (15%) Mohamed Sharaf Conception and design (10%)

Analysis and interpretation (10%) Xiaofang Zhou Conception and design (5%)

Analysis and interpretation (5%) Shazia Sadiq Analysis and interpretation (5%)

Drafting and production (5%)

• A. M. Albarrak and M. A. Sharaf. Query refinement for correlation-based time series exploration. In Databases Theory and Applications - 28th Australasian Database Conference, ADC 2017, Brisbane, Australia, September 25-28, 2017, Proceedings, pages 45-58, 2017. Incorporated in Chapter 4.

Analysis and interpretation (10%) Drafting and production (10%)

(8)

My principle advisor, Dr. Mohamed Sharaf, has largely contributed towards the research problems presented in this thesis. Dr. Sharaf assisted me by providing guidance and feedback on formulating the problems and solutions in this thesis. He also reviewed, polished and assisted with the published papers included as part of this thesis.

Statement of parts of the thesis submitted to qualify for the award

of another degree

(9)

It is with mixed feelings that I write this part of the thesis. The vicissitudes throughout my PhD journey have made me a better person, both academically and personally, for which I shall be forever grateful. Without the help and support of others, I could not have persevered and completed this journey. I would, therefore, like to sincerely thank all those who supported me along the way.

Foremost, I would like to thank my principal advisor, Dr. Mohamed Sharaf. His patience and persistent guidance with my research problems, and his compassion with my personal life situation, helped to make this journey possible. I shall always be grateful that our paths have crossed. My genuine thanks extend to my advisory committee, Prof. Shazia Sadiq and Prof. Xiaofang Zhou, for their insightful feedback and support in each step of my PhD.

I shall be forever grateful to my dear wife who stood by my side, despite being away from her parents who are in need of her. I will always be thankful for her spiritual and emotional support throughout this journey.

I would like to thank my beloved parents for their love and constant encouragement. I hope God gives me the strength to repay their love until my soul rest.

Thank you, also, to all my colleagues in the DKE group who shared my life in Australia. Special thanks to Ibrahim A. Ibrahim, Sanad Al-Maskari, Hina Khan, Saeid Hosseini and Bolong Zheng.

Lastly, I would like to formally thank my sponsor, Al-Imam Muhammad Ibn Saud Islamic University, for providing the financial support which made this journey possible.

(10)

data exploration, query refinement, aggregate constraints, pairwise correlation constraints

Australian and New Zealand Standard Research Classifications

(ANZSRC)

ANZSRC code: 080604, Database Management, 100%

Fields of Research (FoR) Classification

FoR code: 0806, Information Systems, 100%

(11)

(12)

Abstract i

Acknowledgements viii

List of Figures xvi

List of Tables xvii

1 Introduction 1

1.1 Overview . . . 1

1.2 Data Exploration . . . 2

1.3 Query Refinement . . . 3

1.3.1 Similarity-aware Aggregate-based Query Refinement . . . 4

1.3.2 Similarity-aware Correlation-based Query Refinement . . . 8

1.3.3 Challenges . . . 13

1.4 Contributions . . . 14

1.5 Thesis Outline . . . 15

2 Literature Review 16 2.1 Data Exploration Techniques . . . 16

2.2 Query Refinement Techniques . . . 18

2.2.1 Query Refinement Techniques -Various Constraints . . . 19

2.2.2 Aggregate-based Query Refinement Techniques . . . 28

2.2.3 Correlation-based Query Refinement Techniques . . . 33

3 Similarity-aware Aggregate-based Query Refinement 37 3.1 Overview . . . 37

(13)

3.2.1 Cost Model . . . 41

3.2.2 Query Similarity Measures . . . 42

3.2.3 Problem Definition . . . 44

3.2.4 Declarative Query Model . . . 47

3.3 SAQR Schemes . . . 49

3.3.1 Problem Statement . . . 50

3.3.2 SAQR-S . . . 50

3.3.3 SAQR-CS . . . 55

3.3.4 The Monotonicity Property . . . 56

3.3.5 Cardinality-based Pruning . . . 57

3.3.6 Hierarchical Representation of the Search Space . . . 58

3.3.7 Experiments . . . 59

3.4 EAGER Schemes . . . 68

3.4.1 Aggregates Constrains Bounds . . . 68

3.4.2 Optimization Techniques . . . 69

3.4.3 Approximation Techniques for EAGER-GS . . . 70

3.4.4 Experiments . . . 74

3.5 Objective-aware Range Query Refinement . . . 89

3.5.1 ORange Architecture . . . 90

3.5.2 Application Setup . . . 90

3.5.3 Step-by-step Example . . . 91

3.6 Summary . . . 92

4 Similarity-aware Correlation-based Query Refinement 94 4.1 Overview . . . 94

4.2 Preliminaries . . . 95

4.2.1 Problem Definition . . . 96

4.2.2 Refining a Sub-Interval . . . 98

4.3 RELATE Scheme . . . 100

4.3.1 Cost Model Analysis . . . 101

4.3.2 Caching Essential Arrays . . . 102

4.3.3 Reusing Essential Arrays . . . 103

4.3.4 Breadth-First and Depth-First Search Strategies . . . 103

(14)

4.3.6 Similarity-aware Pruning Technique . . . 107

4.3.7 Pairwise Correlation Pruning Technique . . . 108

4.3.8 Paris Ordering . . . 109 4.4 Experiments . . . 111 4.4.1 Setup . . . 111 4.4.2 Results . . . 112 4.5 Summary . . . 120 5 Conclusions 122 5.1 Summary of Contributions . . . 122 5.2 Future Work . . . 124

5.2.1 Query Refinement for Aggregate Constraints . . . 124

5.2.2 Query Refinement for Correlation Constraints . . . 124

(15)

1.1 Rectangular query (or Box query) is one of the common and basic queries used to

explore SDSS [112] . . . 5

1.2 Two refined queries R∗and R0satisfy the constraint at almost the same level, but their similarities to the user’s initial region of interest I are quite different . . . 7

1.3 Relation R stores hourly CPU load readings of three connected servers T1, T2 and T3 in a hypothetical data centre . . . 9

1.4 2-D visualization of R. The abnormal behaviour is between 9 AM and 12 PM: T1’s load increases but T2’s or T3’s load does not increase (i.e., correlation close from zero) 9 1.5 Correlation matrix of Q1, Q2and Mabn. Mabnrepresents an abnormal behaviour where T1, T2and T3have negative correlation . . . 11

1.6 A subset of candidate queries and their normalized sum of absolute difference from the abnormal behaviour Mabn. Q2 = Q[9, 12] has the minimum difference to the abnormal behaviour . . . 11

3.1 Example - refining an input query I in a two-dimensional space . . . 39

3.2 A categorical attribute location represented as a three-levels hierarchy to enable refinement. Each value inlocation is mapped to a level, i.e., city, state, country 40 3.3 The first three steps of TA-Algorithm under the SR Model with two list: access sorted list(left) and random access list (right) . . . 53

3.4 Estimating upper and lower bounds of∆Ri by using probed queries Rland Ru. . . 57

3.5 2-Dimensional search space is decomposed into H levels, where the resolution of the top level δ = 1, and the resolution of the bottom level H is δ = ₂1H . . . 58

3.6 Average deviation while varying similarity weight α . . . 61

3.7 Average cost while varying similarity weight α . . . 61

3.8 Average deviation while varying grid resolution δ . . . 62

(16)

3.10 Average deviation while varying number of dimensions d . . . 64

3.11 Average cost while varying number of dimensions d . . . 64

3.12 Average cost for TPC-D database in different scales . . . 65

3.13 Average cost while varying z-value . . . 67

3.14 Average deviation while varying z-value . . . 67

3.15 Average deviation while varying similarity weight α forcount . . . 76

3.16 Average deviation while varying similarity weight α formax . . . 76

3.17 Average deviation while varying similarity weight α foravg . . . 77

3.18 Average cost while varying similarity weight α forcount . . . 77

3.19 Average cost while varying similarity weight α formax . . . 78

3.20 Average cost while varying similarity weight α foravg . . . 78

3.21 Average deviation while varying number of dimensions d forcount . . . 79

3.22 Average deviation while varying number of dimensions d formax . . . 79

3.23 Average deviation while varying number of dimensions d foravg . . . 79

3.24 Average cost while varying number of dimensions d forcount . . . 80

3.25 Average cost while varying number of dimensions d formax . . . 80

3.26 Average cost while varying number of dimensions d foravg . . . 80

3.27 Average deviation while varying grid resolution δ . . . 81

3.28 Average cost while varying grid resolution δ . . . 81

3.29 Average cost for SDSS database in different sizes . . . 82

3.30 Average cost while varying number of materialized queries . . . 83

3.31 Average cost while varying number of materialized queries . . . 83

3.32 Average cost while varying number of submitted queries . . . 84

3.33 Average cost while varying threshold λ . . . 84

3.34 Average deviation while varying threshold λ . . . 85

3.35 Average deviation while varying topb forcount . . . 85

3.36 Average deviation while varying topb formax . . . 86

3.37 Average deviation while varying topb foravg . . . 86

3.38 Average cost while varying topb forcount . . . 87

3.39 Average cost while varying topb formax . . . 87

3.40 Average cost while varying topb foravg . . . 88

3.41 Average pruning power with different Top-K . . . 88

(17)

4.1 The similarity S(Q_I, Q∗)decreases very quickly when distance d(Q_I, Q∗)increases . 98 4.2 Refining a query Q[s, e]implies refining its time sub-interval. Four candidate queries

are generated by applying LC, LE, RC and RE on QI’s sub-interval . . . 99

4.3 Computational time (CPU) to compute M dominates I/O time when there is a large number of series n ≥ 1000 . . . 100

4.4 The classical traversal strategies: Breadth First (BFS) and Depth First (DFS) to decide the visiting order of the candidate queries in the search space starting from the input query QI . . . 101

4.5 Essential arrays of a query Q[s, e]: Caching ∑ x, ∑ x2and ∑ xy of Q enables RELATE to incrementally compute the pairwise correlation of any pair in M for Q’s offspring . 102 4.6 Ordering of pairs in a correlation matrix for a given candidate query. REF-DY is faster to arrive to f_b than SYS and REF, hence, enabling RELATE to reduce the computational cost . . . 109

4.7 Average OP while varying time series length . . . 113

4.8 Average OP while varying time series length . . . 113

4.9 Average MaxMemory while varying time series length . . . 114

4.10 Average MaxMemory while varying time series length . . . 114

4.11 Average KBs while varying time series length . . . 115

4.12 Average KBs while varying time series length . . . 115

4.13 Average OP while varying number of time series . . . 116

4.14 Average MaxMemory while varying number of time series . . . 116

4.15 Average KBs while varying number of time series . . . 117

4.16 Average OP while varying similarity weight λ . . . 117

4.17 Average MaxMemory while varying similarity weight λ . . . 118

4.18 Average KBs while varying similarity weight λ . . . 118

4.19 Computational cost across different ordering methods . . . 119

4.20 Number of probed pairs across different ordering methods . . . 119

(18)

2.1 Summary of some QR techniques based on refinement constraints . . . 18

3.1 Summary of Symbols . . . 38

3.2 Examples from the literature for box query similarity measures . . . 42

3.3 Evaluation Parameters. . . 60

3.4 Time per probe in milliseconds for TPC-D database in different scales . . . 65

3.5 A histogram of the data distribution of different z-value forquantity attribute in the lineitem table. z=0 represents a uniform distribution, while z=3 represents a highly skewed distribution . . . 66

3.6 Evaluation Parameters . . . 74

3.7 Time per probe (ms) for SDSS database in different sizes . . . 82

3.8 Schema of used dataset SD_incidents_100k. . . 91

4.1 Summary of Symbols . . . 95

4.2 Variants of RELATE: INC: incremental computation of correlation. SMP: Similarity-aware pruning. PWC: Pairwise correlation pruning . . . 112

(19)

Introduction

1.1 Overview

Users are easily overwhelmed by the sheer size of data in today’s large-scale databases found in scientific and financial domains. Exploring these databases to make sense of the explored data and to discover interesting insights (i.e., Data Exploration tasks) has been, and still is, a hideous and labour intensive task for users [59, 52, 18, 63]. This is particularly true for non-expert users lacking a solid background of the explored database [54, 113, 81, 42, 86].

Some three decades ago [23], the database research community noticed the limitation of traditional DBMS in supporting users with their Data Exploration (DE) tasks. Since then, the community has proposed and designed various solutions and techniques to assist users with their diverse exploration tasks, with utmost effectiveness and efficiency, e.g., [25, 46, 15, 18, 127, 63, 58]. An instance of these techniques is the Query Refinement (QR) technique [17].

Given a user’s query that returns unexpected results, QR techniques aim to refine this query so that its results meet a user’s expectation. Specifically, the ultimate goal of QR techniques is to automatically modify (refine) predicates of a query so that the results of the modified (refined) query optimally meet the user’s expectation. Achieving the goal of QR techniques entails massive computational and I/O costs, because finding an optimal refined query requires searching a huge search space of possible refined queries. Consequently, many optimization algorithms have been proposed to address the efficiency and the effectiveness aspects in achieving this goal, e.g., [78, 57, 118, 49, 83, 125, 39].

The aim of this thesis is to address the limitations of current QR techniques that are oblivious to the similarity of the refined queries to users’ initial queries. In particular, we propose to define a user’s initial query as an anchor point for which the similarity of its corresponding refined query is computed over the whole refinement space. Consequently, we include the similarity as an objective

(20)

when searching for a refined query to ultimately increase the user’s satisfaction with the refined query. However, including this objective introduces multiple challenges (as shown in Section 1.3) and requires new, innovative algorithms with efficient optimization techniques.

The rest of this chapter is organized as follows: Section 1.2 briefly introduces the general theme of this thesis: Data Exploration. Then Section 1.3 discusses in detail the core topic of the thesis: Query Refinement. Section 1.4 discusses the key contributions of this thesis, and Section 1.5 lists the outline of the thesis.

1.2 Data Exploration

Improving the cycle of DE at different levels has recently become a major research direction for the databases research community [63]. This subsection discusses the grounds for that direction and illustrates the close links between QR and DE.

Informally, a data exploration task is a collection of ad-hoc, data-driven steps. The purpose of these steps is to make sense of the explored database and to gain interesting insights that the user would not otherwise know they exist in the first place [46, 15, 18, 127, 45]. Typically, users keep repeating these data-driven steps until they are satisfied by what they have seen, or they run out of resources (e.g., time) [15].

Unsurprisingly, traditional DBMS were not well designed for performing such DE tasks [63, 46, 128], as they were designed to provide well-structured storage for data and efficient data retrieval for well formulated and DBMS optimized queries. In [52], it is noted that this problem has been specifically identified over some three decades ago. [23] argued that DE (i.e., database usability) is not well supported by traditional DBMS because it simply was not among the concerns of the market at that time.

With recent advancements in data acquisition and storage technologies, today’s databases are larger, more complex, and more difficult to explore. It is a fact that human perception of data remains constant against the exponential increase of data’s volume [64], creating a large gap to be filled with efficient and effective DE techniques. Further, the parallel increase of non-experts users (e.g., journalists who want to validate politicians’ claims through databases [126]), and web-based query interfaces to public and scientific databases such as the ones hosted by Google’s BigQuery platform and Sloan Digital Sky Surveys (SDSS), created an urgent need for innovative DE techniques.

Accordingly, researchers have proposed highly specialized and optimized DE techniques to support users with their diverse exploration tasks. For example, some of these tasks are to recommend

(21)

data or a reference [124], to explain why outliers show up in the results [104, 129], to summarize and present representative sets of the potentially huge result sets [28, 65], to formulate or refine queries based on user-defined constraints [33, 119, 58, 125, 2].

1.3 Query Refinement

When querying a database, users often have some expectations of the queries results [16, 17]. For instance, a user might expects her query result to contain specific tuples [118, 49], or her query returns a non-empty result [90, 78]. When they formulate their queries and submit them to the DBMS, however, it is highly unlikely they will be satisfied with the results. That is, users oftentimes formulate wrong and imprecise queries due to their lack of a comprehensive knowledge of the data [80, 94], giving them results which do not meet their expectations.

As a result of these imprecise queries, users often enter a laborious trial-and-error process where they manually modify some predicates in their imprecise queries in the hope that these modifications will render the result to meet their expectations. In some cases, it might be impossible to perform this laborious manual process when there are usage limits on the DBMS (e.g., a maximum number of queries per session or per user). Hence, Query Refinement (QR) techniques have been proposed to address this problem by assisting users in refining their queries automatically so that the returned results meet their expectations.

Informally, QR is "the process of refining a query when the answer to the query does not meet the expectations of the user"[17]. In particular, the predicates of the query are automatically modified so that the refined query result reflects what the user expected, i.e., her constraints. A constraint over a query result is defined as the user’s expectation of her query result, and it can take different forms. For example, a user might be expecting to see a specific tuple in the result but it was not among the returned results [118, 49]. Hence, her constraint is for the query to return a result that contains this specific tuple. Another common example is a user who expects her query to return any result, but the query result is empty [85, 91]. Thus, her constraint is for the query to return a non-empty result. A more common example is a user who expects her query to return a result that meets a certain aggregateor correlation value, but the returned result fails to meet that value [125, 78]. Accordingly, her constraint is for the query to return a result that meets this specific aggregate or correlation value. While these examples refer to a constraint as a singular entity (e.g., one tuple, one aggregate value), a constraint can contain multiple expectations of the same form, such as multiple tuples or multiple aggregate values.

(22)

boosts users’ understanding of the raw result which contains individual tuples. That is, for database users, aggregated data are easier to understand and digest than the raw data itself, and are favored in DE tasks [37, 99, 104]. Accordingly, this thesis focuses on two specific forms of constraints defined over a query result in the context of relational and time series data. Namely:

1. Aggregate Constraints (Section 1.3.1) 2. Correlation Constraints (Section 1.3.2)

Exploring the pervasive relational and time series data using the two aforementioned constraints augments users’ understanding of their queries result and ultimately empowers them when exploring unfamiliar data spaces.

In the following subsections, we see that automatically refining an imprecise query to meet one of the above constraints is challenging because it requires examining a huge search space of refined queries, possibly exponential. Exhaustively examining this search space is not practical and incurs enormous CPU and I/O costs. Hence, a handful of techniques, e.g., [14, 79, 123], have been proposed to efficiently navigate this exponential space to meet users’ constraints.

In light of these techniques and the two constraints mentioned above, we introduce two problems and formally address them in this thesis. Specifically, in Section 1.3.1 we show the first problem addressed by this thesis, in which users define aggregate constraints over the queries results. We show the usefulness and the applications of this problem based on a real world dataset. Then, in Section 1.3.2 we formally introduce the second problem addressed by this thesis in which users specify correlation constraints for time series pairs. Similarly, we demonstrate the applicability of this problem by an example. We also touch on the existing techniques proposed to address these two problems and their limitations.

1.3.1 Similarity-aware Aggregate-based Query Refinement

Techniques for automatically refining a query to satisfy certain aggregate constraints provide effective and efficient solutions to various problems. For instance, they can be used to address the too many/few answers problem [3, 78], to enable richer expressions in querying a database [125, 17], and to generate test queries for the purpose of database testing [79, 14].

An aggregate constraint G over a query result can be specified using an aggregate operator and an attribute, i.e., agg(a), where agg() belongs to one of the standard SQL aggregation functions count, sum, avg, min and max. Further, users can define multiple aggregate constraints, i.e., G= {g , g , ..., g } such that each g ∈ G is a single aggregate constraint over the query result.

(23)

Figure 1.1: Rectangular query (or Box query) is one of the common and basic queries used to explore SDSS [112]

Refining a query is equivalent to applying modification operations on its predicates. These modification operations can be adding or removing predicates, relaxing or contracting predicates, replacing constants with other constants, joining with auxiliary tables through foreign keys, etc. As an example, a single sided range predicate on a numerical attribute ai of this form Pi: ai≤ xican be

relaxed (respectively, contracted) as follows: ai ≤ xi0, where xi0> xi (xi0< xi). Hence, it is easy to

observe the huge number of possible refined queries (i.e., candidate queries) that can be generated by refining the predicates of an input query.

Let us consider the following example where specifying aggregate constraints can be effective in exploring a real-world database: the widely known Sloan Digital Sky Server (SDSS) scientific database1. This database is the largest map of the Universe ever made that stores details of one third of the stars and galaxies we see in the sky, and it is publicly available for anyone to explore using different interfaces, one of which is the traditional SQL query language. However, exploring this large-scale database might be an overwhelming obstacle for users, especially for those with no solid background of the database [80].

Example 1.1. Using the SDSS database, a scientist wants to conduct a study of a particular rectangular region in the sky by retrieving astronomical objects (e.g., stars) enclosed in that region

to study their properties. This type of query is one of the commonly submitted queries according to

the SDSS website (Figure 1.1). We assume that the scientist has limited resources to conduct this

study, e.g., time and energy, and at the same time the study has to be performed on at least 1000

(24)

astronomical objects to be genuine and valid.

The scientist, with her limited background of SDSS, formulates a Box query to select a rectangular

region2in the sky and submit it to the DBMS. Her constraint on the result is for the number of returned objects to be large enough for a valid study, yet small enough for a feasible study. Namely, her

constraint iscount(*)=1000 objects. An example of this query is as follows:

I:SELECT * FROM SDSS.PhotoPrimary

WHERE ( ra ≥ 179.5 and ra ≤ 182.3 ) AND ( dec ≥ 1.24 and dec ≤ 1.86 ); Since it is very difficult to precisely set the values of the ra and dec predicates in query I that guarantee a desired number of objects (because of user’s limited knowledge of SDSS database),

the returned result might not satisfy the scientist’s constraint. That is, the result might contain too

few objects thereby rendering the study unreliable3, or too many objects which will make the study unachievable with the limited resources available to the scientist.

In Example 1.1 the scientist has no choice but to iteratively try different queries and manually adjust the values for the coordinates ra and dec in her input query I, until reaching a result which satisfies her constraint. This particular special case is the cardinality-based query refinement problem, in which the aggregate constraint is equivalent to the cardinality of the result that can be computed using thecount() aggregation function.

In [14], it was formally proven that this problem is NP-Hard. By relaxing the constraint, i.e., accepting approximate solutions that are close to the cardinality constraint, a heuristic Hill Climbing (HC) approach was developed by [14] to provide a refined query that minimizes the average relative error of the cardinality constraint in an efficient manner. Similarly, [79] addressed a similar settings of this problem where multiple cardinality constraints are defined for a query with m sub-expressions. A practical solution called TQGen has been proposed in [79] to quickly generate a refined query that optimally minimizes the sum squared logarithmic relative error of these constraints.

The interactive Semantic Windows approach [58] addresses a general case of this problem, in which users can define any aggregate constraint over a query result. We refer to this general case as the aggregate-based query refinement problem (AQR). Based on a cost-benefit model, a heuristic best-first algorithm is presented in [58] and further optimized by an adaptive prefetching technique to provide swift online results. Along the same lines, [12] presented the Package Query approach which assists users to find a set of tuples that satisfy global constraints (i.e., aggregate constraints) defined

(25)

dec ra _{I: user’s initial region of interest} _{R*: a refined query far from the}

user’s region of interest

R’: a refined query close from the user’s region of interest

1.24 1.86

179.5 182.3

Figure 1.2: Two refined queries R∗ and R0 satisfy the constraint at almost the same level, but their similarities to the user’s initial region of interest I are quite different

by the user. The proposed baseline algorithm in [12] translates these constraints into an Integer Linear Programming (ILP) problem and utilizes off-the-shelf ILP solvers to find a solution. A more scalable and efficient version is also proposed which relies on applying the ILP solvers on a representative set of the data to find approximated solutions, such that these solutions are guaranteed to be close by a factor to the baseline solutions.

Although the above techniques can be applied to solve the AQR problem efficiently, they exhibit various limitations. Foremost, similarity to users’ input queries is completely neglected in the techniques proposed in [14, 58, 12, 79]. Going back to Example 1.1, applying any of the above techniques will help the scientist to achieve her goal, i.e., an optimal refined query R∗which optimally satisfies the cardinality constraint. However, there is no consideration given to the user’s preference expressed in her input query I. That is, while R∗optimally satisfies her constraint, it might be very far from her initial region of interest, i.e., query I. At the same time, as shown in Figure 1.2 above, there might be another refined query R0that also satisfies her constraint (almost at the same level as R∗) but is very close to her initial region of interest, i.e., input query I. Therefore, suggesting R0instead of R∗ as the refined query increases the users’ satisfaction with the results because R0 is more similar to I than R∗.

Extending these techniques to include similarity is not practicable since it introduces multiple challenges. For instance, the HC approach in [14] can be modified to choose a refinement step that reduces the relative error and dissimilarity to the input query at the same time. However, this straightforward modification will render HC vulnerable to floundering about a local minima. The Semantic Windows approach in [58] can also be modified to include similarity to an input user’s query. Specifically, the dissimilarity of a cell to a query can be included in the cost-benefit model. Nonetheless, [58] represents the search space as one flat grid with a fixed granularity. Hence, the

(26)

number of cells at this approximated search space might be extremely huge when using a small value for granularity. Adding these cells to the queue and examining them exhaustively (as SW does) incurs a lot of I/O and CPU costs.

The techniques proposed in [78, 123] address the AQR problem and provide refined queries close from users’ input queries, but they also have their own limitations. For instance, the SnS framework [78] provides a manual algorithm for refining a query. It is manual in the sense that users are iteratively asked to select a predicate value that is most preferred by them, hence capturing similarity manually. Other approaches such as the ACQUIRE and SAUNA [123, 57] include similarity to users’ input queries in the refinement process and provide fully automatic algorithms. However, they ignore the I/O cost involved in refining a query and focus on the computational costs. For example, ACQUIRE focuses only on efficiently computing the aggregate value for a candidate query by utilizing the additive property of the aggregate. It also shares the limitation found in the SW approach: the search space is represented as one flat grid by dividing each dimension into a fixed number of partitions, resulting in a huge number of candidate queries that are exhaustively probed during the refinement process. Similarly, SAUNA requires each candidate query to be probed in order to compute its dissimilarity to the input query.

In Chapter 3 we formally address the inclusion of similarity to users’ input queries in the AQR problem. Including similarity requires new methods to represent the search space, and new pruning techniques to efficiently navigate this space and find the optimal refined query. For that, we propose a suite of algorithms with optimization techniques to overcome the efficiency challenges involved in finding an optimal query. We also compare our algorithms with the HC and TQGen approaches to validate our algorithms efficiency and effectiveness.

Next, we introduce the second type of constraints which users can define to explore time series data.

1.3.2 Similarity-aware Correlation-based Query Refinement

A common thread of data exploration is querying sequences of values (e.g., time series data) to perform various tasks [97, 76, 72]. For instance, a user can query sub-intervals of time series data then compute the pairwise correlation of all pairs of time series to find correlated pairs [70, 71, 89, 100, 39] or detect patterns and anomalies [87, 102].

In some cases, users select time series data within a specific time sub-interval to compute the pairwise correlation values for all time series pairs, e.g., [70, 71, 89]. We propose to define the pairwise correlation values as pairwise correlation constraints, such that a user’s query has to satisfy

(27)

timestamp

6:00 7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00

T1 0.2 0.3 0.3 0.5 0.8 0.9 0.8 0.9 0.7 0.8 0.4 0.3 0.3 0.2 0.1

T2 0.15 0.2 0.15 0.22 0.41 0.54 0.48 0.49 0.36 0.43 0.19 0.17 0.16 0.11 0.06

T3 0.05 0.1 0.15 0.28 0.21 0.13 0.32 0.41 0.34 0.37 0.21 0.13 0.14 0.09 0.04

Figure 1.3: Relation R stores hourly CPU load readings of three connected servers T1, T2and T3 in a

hypothetical data centre

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 6: 00 7: 00 8: 00 9: 00 10 :00 11 :00 12 :00 13 :00 14 :00 15 :00 16 :00 17 :00 18 :00 19 :00 20 :00 C PU Loa d (%) timestamp T1 T2 T3 Abnormal behaviour between 9 AM and 12 PM

Figure 1.4: 2-D visualization of R. The abnormal behaviour is between 9 AM and 12 PM: T1’s load

increases but T2’s or T3’s load does not increase (i.e., correlation close from zero)

these constraints. Specifically, in the Correlation-based Query Refinement problem (CQR), a user’s input query is refined (i.e., its sub-interval is refined) such that the result of the refined query satisfies user-supplied pairwise correlation constraints for all time series pairs.

The following example illustrates how the CQR problem can be of use for users to automatically refine their queries to satisfy correlation constraints. This example is fairly prevalent in data centre management systems [73] where users analyze servers loads collectively (e.g., Queries 3 and 4 in [102]) using the correlation coefficient [114, 87, 102]. Moreover, the example will help in understanding the associated challenges for achieving the CQR problem goal.

Example 1.2. Assume a hypothetical data centre with three connected servers T1, T2, T3, where T1is

responsible for forwarding incoming requests to T₂ and T₃ as evenly as possible. The hourly CPU load readings of these servers are stored in a database relation R, as shown in Figure 1.3.

An admin wants to analyze the loads of these servers to identify any abnormal behaviour based on

the pairwise correlation of the servers loads. Let this abnormal behaviour be: T1’s load increases but

T₂’s or T₃’s load simultaneously does not increase (i.e., negative correlation). Conversely, the normal behaviour of these servers is for T₂ and T₃ loads to increase simultaneously as T₁’s load increases (i.e., positive correlation).

(28)

one shown in Figure 1.4. While the process of visually identifying abnormal behaviours is somehow

easy in this toy example as there are three time series in R, it should be clear that this process becomes

more challenging when there are more series.

Consequently, the admin executes a selection query over R and then computes the pairwise

correlation of the pairs (T₁, T2), (T1, T3) and (T2, T3) from the query output. These pairs and their

correlation values are collectively called a correlation matrix M.

Let the admin’s initial query be:

Q₁:SELECT * FROM R

WHERE timestamp ≤ 20 and timestamp ≥ 6;

The correlation matrix M_Q₁ for Q₁is shown in Figure 1.5.

It appears that no abnormal behaviour exists within the results of Q₁: the loads of T₂ and T₃ follow the same pattern as their parent T₁, which the high pairwise correlation values in M_Q₁ confirm. More precisely, the normalized sum of absolute difference (L₁-norm) between M_Q₁ and M_abn is high (≈0.96), where M_abnrepresents the abnormal behaviour in pairwise correlation values, as illustrated in Figure 1.5.

Nonetheless, between 9 and 12 there is evidence of an abnormal behaviour: T₃’s load breaks the pattern and decreases while T₁’s load increases. This abnormal pattern is captured by the following query:

Q₂:SELECT * FROM R

WHERE timestamp ≤ 12 and timestamp ≥ 9;

Its matrix M_Q₂ is shown in Figure 1.5 as well. Figure 1.6 confirms that Q₂’s correlation matrix is the closest to the user-defined abnormal behaviour, i.e., among all queries, M_Q₂ has the minimum sum of absolute difference to M_abn ≈0.74 .

In Example 1.2, it is assumed that the abnormal behaviour is well-known by the admin, e.g., pairwise correlation values close from -1 indicate an abnormal behaviour for the servers, as shown in M_abn. However, the time sub-interval (i.e., Q2) that is the closest to this abnormal behaviour is not

known to the admin. Further, it is assumed that Q1in Example 1.2 represents the admin’s initial guess

of where that abnormal behaviour is located.

To find Q2though, the admin is required to manually refine her input query (i.e., Q1) by modifying

(29)

T₁ T₂ T₃ T₁ 1 0.8 T2 0.7 T₃ T₁ T₂ T₃ T₁ 0.98 -0.55 T2 -0.47 T₃

M

_Q1

M

_Q2 T₁ T₂ T₃ T1 -1 -1 T₂ -1 T₃

M

_abn

Figure 1.5: Correlation matrix of Q1, Q2 and Mabn. Mabn represents an abnormal behaviour where

T1, T2and T3have negative correlation

0 0.2 0.4 0.6 0.8 1 1.2 S u m o f a b so lut e d if fe ren ce

Figure 1.6: A subset of candidate queries and their normalized sum of absolute difference from the abnormal behaviour Mabn. Q2=Q[9, 12]has the minimum difference to the abnormal behaviour

the results of all possible refined candidate queries.

Manually examining all possible candidate queries and computing the correlation matrices out of their results to find Q2 is obviously not a practical solution. There is a total of m(m−1₂ ) candidate

queries, and this number increases quadratically with the length of the time series m. Manually examining these queries is a labour intensive task and incurs tremendous I/O and CPU costs. For instance, it takes almost two hours to compute a single matrix for 10k time series [87] using a traditional PC.

A more suitable alternative is for the user to specify the target correlation matrix (e.g., Mabn in

Example 1.2) that represents an abnormal behaviour, and an efficient solution automatically finds the query (i.e., sub-interval) that is the closest to the target. This partially resembles the Query Reverse Engineering problem [119, 120, 115], but the pairwise correlation constraints and the time series data introduce unique challenges which cannot be addressed by the techniques proposed there. For instance, [119] requires that the query result must not contain an arithmetic expression such

(30)

as correlation, whereas [115] optimization techniques are based on series of rules using aggregates properties that are not applicable to correlation.

There is a large body of work [70, 107, 87, 39] which focuses on efficiently reporting time series pairs that are above a certain correlation threshold. This body of work is tightly related to CQR problem because reporting correlated pairs requires computing correlation efficiently, which is also a core requirement in the CQR problem. For example, the techniques in [70, 71] build on the observation of [107] where correlation can be computed incrementally to report the longest sub-interval of correlated pairs. These techniques however are a special case of CQR, because there is only one correlation constraint for all time series pairs. Moreover, these techniques aim to report all time series that are correlated with a given time series, i.e., bi-variant analysis, whereas the CQR problem falls into the multi-variant analysis category, in which the correlation of all pairs of time series are reported and compared [56, 41].

In the multi-variant analysis category, the AEGIS framework [39] has been proposed for fast computation of correlation in a distributed environment. This framework goal is to report all pairs of all time series that are correlated above a certain correlation constraint, i.e., AEGIS assumes only one correlation constraint for all pairs. Further, AEGIS assumes the length of the sub-interval is known in advance, e.g., given by a user. Based on these two assumptions, the framework partitions the time series such that the pairs that are potentially correlated are contained in a single partition, to control the communication cost in a distributed environment. Extending this framework to address the general case in CQR where neither of the two assumptions are enforced is not applicable.

A similar work is presented in [87] for computing pairwise correlation of a very large number of series to automatically discover anomalies. However, it also enforces the same assumptions as in [39]: a single correlation constraint for all pairs and the sub-interval is known in advance as well.

A more recent work [128] proposes a general framework to accelerate the computation of a number of statistics such as correlation. This framework utilizes the natural overlap in users’ exploratory queries to speedup the computation of statistics by synthesizing cached statistics that were previously computed. This is quite similar to the incremental computation of correlation method proposed in [107].

In Chapter 4, we address the Similarity-aware, Correlation-based Query Refinement problem, where users can specify similarity and correlation constraints. Given these constraints, we propose efficient algorithms and pruning techniques that overcome the efficiency challenges accompanied in solving this problem. We also compare the results of our algorithms to state-of-the-art algorithm using real and synthetic datasets and discuss the results under different parameters settings.

(31)

1.3.3 Challenges

Examples 1.1 and 1.2 briefly introduce the challenges related to the Similarity-aware, Aggregate-based and Similarity-aware, Correlation-based Query Refinement problems. Challenges in these problems can be classified as efficiency challenges, or effectiveness challenges.

Efficiency challenges in these two problems originate from navigating a huge search space to find an optimal refined query that satisfies the user-defined constraints. This is a common challenge shared by many data exploration problems where finding an optimal solution involves inspecting and examining a huge search space, possibly exponential, in a brute-force approach. Further, evaluating candidate solutions in these exponential spaces to decide whether they are optimal or not, incurs expensive computational and I/O costs. Hence, simply applying exhaustive solutions is not adequate since they hinder users’ exploration tasks and lengthen the response time. Accordingly, the efficiency challenges of these two problems necessitate the design of innovative algorithms that can utilize properties of the constraints and the search space to provide efficient solutions. For instance, the monotonic property of aggregate operators enables an algorithm to prune unqualified queries and avoid evaluating them using aggregate bounds that are computed from previously evaluated candidate queries. Moreover, the similarity constraint enables an algorithm to early abandon evaluating a candidate query, leading to signification savings of I/O costs.

For some tasks (e.g., algorithms that translate from English to Chinese) humans can easily decide which algorithm produced the most effective output. However, modeling the process of evaluating the effectiveness of such algorithms is quite challenging. In particular, the effectiveness challenges in the two problems addressed in this thesis relate to measuring the similarity of a solution to the user’s input query and user’s satisfaction of the refined query. That is, how to model users’ preferences so that an algorithm can automatically consider the dissimilarity of a candidate query to the input query in the refinement process. One extreme approach is to fully incorporate the user in the refinement process by asking her to label each candidate refined query based on her satisfaction of the refined query result. Although this can guarantee maximum user satisfaction with the end result, it adds a whole new substantial burden on users which can be avoided without sacrificing the solution effectiveness. Another approach which does not require users’ feedback is to automatically infer the similarity of a candidate query. This can be done systematically by comparing the results of a candidate query to the input query results. However, this approach implies retrieving the results of each candidate query, which could entail high CPU and I/O costs since the number of candidate queries can be huge.

Chapters 3 and 4 present our proposed innovative techniques and algorithms which address the efficiency and effectiveness challenges involved in achieving the goals of the Similarity-aware,

(32)

Aggregate-based and Similarity-aware, Correlation-based Query Refinement problems. The proposed algorithms utilize specific properties of the constraints to prune unqualified candidate queries and employ various techniques to optimize the search process. Further, these algorithms use simple, automatic and effective methods to swiftly infer the similarity of a candidate query while searching for the optimal solution.

1.4 Contributions

Motivated by the efficiency and effectiveness challenges mentioned above that are manifested in the exploration of relational and sequential data, we have proposed a suite of optimized schemes and algorithms which guide users in refining their imprecise queries based on aggregate and correlation constraints. Specifically, this thesis makes the following key contributions:

• For a special case of aggregate constraints, i.e., cardinality of the answer, Chapter 3, Section 3.3 proposes the SAQR schemes. SAQR schemes partition the search space using a space-based partitioning method to transform it into a multi-level grid with equal-width, non-overlapping cells. Each intersection in this grid represents a candidate refined query, and one of these candidates optimally minimizes the relative error in terms of the cardinality constraint and the dissimilarity to the input query. Hence, SAQR follows a Top-K approach but at the query-level, not the tuple-level, to find this optimal candidate query, with respect of the new partitioned search space. In a nutshell, SAQR schemes utilize similarity-based and cardinality-based properties to prune and avoid evaluating unqualified candidates when searching for the optimally refined query, without any approximation.

• The special case of cardinality was then extended to allow the constraints to be any multiple aggregate constraints, i.e., count, sum, avg, min, max for which the EAGER scheme has been proposed with its approximation techniques in Chapter 3, Section 3.4. EAGER schemes address the limitation of loose bounds for some aggregates, the case of having multiple constraints in one query, and reduce the cost incurred in the search by strategically materializing parts of the search space.

• The applicability of SAQR schemes are demonstrated by a web-based application (ORange) presented in Chapter 3, Section 3.5. ORange was designed as a tool to efficiently guide planners in allocating services zones that optimally satisfy a certain cardinality constraint.

(33)

Chapter 4, Section 4.2.1 which generalizes on previous problems addressed in literature in two ways: by specifying a correlation constraint for each pair of time series, and by inclusion of similarity to users’ input queries in refinement.

• Then in Chapter 4, Section 4.3 the RELATE schemes are introduced as efficient solutions to this refinement problem. RELATE schemes extend state-of-the-art algorithms to incrementally compute the correlation for all pairs of time series. To achieve that incremental computation, RELATE applies two classical tree traversal methods, BFS and DFS, to visit the candidate queries in a specific order that enables incremental computation of correlation.

• Further, RELATE optimizes the search process by pruning unqualified candidate queries via monotonic properties at two levels, query similarity level, and pairwise correlation level, as explained in Chapter 4, Section 4.3.5.

1.5 Thesis Outline

This thesis is organized as follows: in Chapter 2 we introduce the preliminaries of query refinement and elaborate more on the related work. Chapter 3 presents our first contribution for the Similarity-aware, Aggregate-based Query Refinement problem. Specifically, Section 3.3 provides innovative schemes called SAQR to efficiently refine queries based on a special case of aggregate constraints, i.e., cardinality constraints on the result. Then, in Section 3.4, SAQR Schemes are extended to address the general case of the Similarity-aware, Aggregate-based Query Refinement problem where SQL standard aggregate operatorssum, avg, min, max can be defined as constraints, and we propose efficient approximation and optimization techniques and compare them to related algorithms. Section 3.5 presents a web-based application ORange which employs SAQR schemes for refining selected areas based on cardinality constraints.

In Chapter 4, we formulate the Similarity-aware, Correlation-based Query Refinement problem and address its computational hardness by proposing the RELATE schemes. Section 4.3 shows the optimization techniques for RELATE schemes which include incremental computations of correlation, pruning candidate queries based on similarity constraints and pairwise correlation pruning. Finally, Chapter 5 concludes this thesis and provides suggestions for future studies in the area of query refinement.

(34)

Literature Review

This chapter starts with a broad overview of the data exploration (DE) techniques discussed in recent literature, which were proposed to facilitate efficient and effective knowledge extraction. Then, in Section 2.2 it narrows down the discussion to one area of these techniques, Query Refinement (QR) techniques, which is the scope of this thesis.

Section 2.2 broadly divides the discussion on QR techniques based on the refinement constraints. Initially, Section 2.2.1 reviews several QR techniques that address various refinement constraints. Then, Sections 2.2.2 and 2.2.3 explore in detail the techniques proposed to address similar constraints and problems to those in this thesis: Similarity-aware, Aggregate-based Query Refinement and Similarity-aware, Correlation-based Query Refinement problems. We investigate the shortcomings of these techniques in terms of efficiency and effectiveness, and their limitations when including similarity in the refinement process.

2.1 Data Exploration Techniques

DE is central for data-driven applications in which users interact and explore data through a sequence of related queries to gain deep insights [15, 18, 127]. This new form of interaction has resulted in the design of various DE techniques capable of guiding users through the data space with utmost efficiency and effectiveness [46].

Because "one size does not fit all", these techniques are application-oriented. That is, they are highly specialized and optimized for certain data exploration objectives. For example, some of these objectives are to recommend relevant data [30, 29], to identify interesting subspaces of data that are highly deviated from the rest of data or a reference [124], to explain why outliers show up in the results [104, 129], to summarize and present representative sets of the potentially huge result sets [28, 65], and to formulate or refine queries based on user-defined constraints [33, 119, 58, 125, 2].

(35)

Although these techniques have different objectives, they share commonality in their assumptions and optimization methods. One common assumption is that users are unfamiliar with the data space and aim to efficiently extract deep and interesting insights from their data with no prior assumptions. There are several optimization methods that are common among these techniques as well, such as incremental computations, materialization, sharing of computations, pruning based on properties and bounds, caching, etc. Next, we expatiate on these techniques then dwell on Query Refinement techniques as this is the scope of this thesis.

Data Recommendation: Recommending relevant data based on users’ queries helps users to understand their queries better. The YMALDB framework [30] and the AIMQ approach [92, 93], for instance, allow users to discover highly correlated and similar tuples to the original query’s results, although these discovered tuples are not among the original query’s results. At the other extreme, SeeDB [124] presents and recommends alternative queries (i.e., views) that are highly deviated from a reference query’s result. Such alternative queries are considered interesting and insightful for users. Explaining Outliers in Queries’ Results: Explaining outliers in queries results is another application-oriented DE technique that aims to give meaningful explanations which cause outliers to appear in the results. These explanations can be a set of tuples that must be removed from the result or modifications to the original query’s predicates [129, 104].

Query Results Summarization: Summarizing queries’ results assists users when exploring large-scale data, such as scientific and financial data. Specifically, because users’ exploratory queries will most likely return large results, deriving insights from these large results becomes an overwhelming obstacle. Techniques which summarize and select small representatives out of the raw and large results elevate this obstacle and efficiently enhance users understanding of their queries’ results. Two well-known techniques which follow this direction are the traditional Skyline [9] and Top-K [47] techniques. Other recent emerging techniques such as Regret Minimization [95, 96] and result Diversification [28, 65] have also been shown to be effective in promoting users’ understanding of their queries’ results. Another body of work [55] enhances the well-known drill-down operation in OLAP by showing rules along side the output. These rules represent the interesting aspects of the explored data.

Query Formulation: Query formulation techniques are orthogonal to QR techniques. Their goal is to efficiently and effectively guide users in locating their interests within a large data space by formulating a query from scratch, i.e., users do not provide input queries as traditionally assumed in QR techniques. That goal can be achieved by following different approaches. For example, in some cases users provide a set of tuples (i.e., result of an arbitrary query) for which they want a set of queries that return these exact tuples [119]. Discovering these queries increases users’ understanding

(36)

No Constraint Informal Definition Related Works 1 Query’s result size ≥ K, ≤ K,

or > 0

Given that Q returns too-few, too-many or an empty result, minimally refine Q to Q0 to satisfy the constraint

[90, 66, 83, 85, 53, 106]

2 Query’s result contains a set of tuples T

Formulate a query Q based on given example tuples T

[120, 119, 25, 69, 26, 110]

3 Remove outliers O in query’s result

Given Q has outliers O in its results, refine Q to Q0 to remove these outliers O

[104, 105, 129, 109]

4 Remove unexpected tuples U from query’s result

Refine Q to Q0 so that Q0 removes unexpected tuples U

[118, 49, 48, 16]

5 Query’s result satisfies aggregate constraint G

Refine query Q to Q0 such that Q0satisfies constraint G

[78, 17, 2, 123, 125, 10, 11, 12, 57, 79, 14, 58]

Table 2.1: Summary of some QR techniques based on refinement constraints

of their databases’ schema, since they provide alternative paths that are equivalent to that arbitrary query [120].

In other cases where users are unable to provide such tuples of interests, techniques such as AIDE [25] help in this regard by interactively steering users towards interesting data within a massive data space. To enable such interactive steering, AIDE carefully selects sample tuples for users to label as relevant or irrelevant. Then AIDE updates a classification model to further select sample tuples for users to label. Once users end this interactive steering process, AIDE formulates a query out of the classification model which captures users’ interests.

2.2 Query Refinement Techniques

Informally, QR is "the process of refining a query when the answer to the query does not meet the expectations of the user"[17]. In particular, the predicates of the query are automatically refined so that the query’s result reflects what the user expected, i.e., her constraints.

Refining predicates is tantamount to applying a set of modification operations on the predicates. These modification operations can be for example adding or dropping existing predicates, relaxing constants into ranges, narrowing ranges into constants, joining with auxiliary tables through foreign keys, etc.

Next, we review the literature of QR techniques based on the refinement constraints, which are certainly proposed to address a specific need or to solve a problem faced by users in DE tasks. We also

(37)

review the search and optimizations methods in these techniques. Table 2.1 above lists a summary of the discussed query refinement techniques broadly classified by the refinement constraints.

2.2.1 Query Refinement Techniques -Various Constraints

Too-few, Too-many, and Empty Answers Problems

In data exploration settings, it is not uncommon for users to experience the too-many, too-few or empty answers problems with their exploratory queries. These queries are often restrictive and narrow (too-few answers problem), liberal and under-specified (too-many answers problem), or they are unsuccessful in returning any answers at all (empty answers problem).

The empty answer problem is a special case of the too-few answers problem. These two problems aim to refine the original query into a new one, by applying modification operations on the predicates, so that the answer of the new query is likely to contain the tuples that interest the user. In [85, 83] they followed an interactive approach to solve this problem, i.e., users are asked for feedback on possible relaxation proposals for their queries. They proposed a probabilistic framework with exact and approximate algorithms which aim to refine a conjunctive query with atomic predicates by dropping some of these predicates (i.e., relaxation sequences) to achieve a non-empty-answer. To do that, all relaxation sequences are represented in a tree, rooted by the empty-answer query, and scored by the probability that a user accepts a proposed relaxation. Then, to find the best relaxation sequence, the FullTree [85] algorithm constructs this tree in full and recursively traverses it in a depth-first mode. Since this algorithm is computationally expensive and inappropriate for exploration settings, an optimized algorithm called FastOpt was proposed to minimize the construction cost of this tree by pruning branches of it using lower and upper bounds of the probability scores. These algorithms though were proposed to work with Boolean databases, i.e., attributes with 0s or 1s. They cannot be applied on databases that contain categorical and numerical attributes directly.

Using Machine Learning algorithms, [90, 91] developed LOQR: an online algorithm to relax a failing query (i.e., a query with an empty answer) with disjunctive predicates. The algorithm learns a decision rule for each attribute used in a predicate, then converts the learned rules into statements. The statements are then scored based on their similarity to the disjunctive predicates of the original query, and the most similar ones are used by LOQR to relax the predicates of the failing query.

The techniques in [106] also use Machine Leaning algorithms to address the too-many answers problem. They adopted the principles of the faceted search paradigm on structured databases, and proposed to take advantage of associated rich meta-data that comes in the form of tables, attributes, value ranges, etc. As an alternative method to ranked retrieval, faceted search was proposed to drill