Processing skyline queries in centralised and distributed incomplete databases

Full text

(1)UNIVERSITI PUTRA MALAYSIA PROCESSING SKYLINE QUERIES IN CENTRALISED AND DISTRIBUTED INCOMPLETE DATABASES. ALI AMER ALWAN. FSKTM 2013 7.

(2) PM. R. IG. H. T. U. PROCESSING SKYLINE QUERIES IN CENTRALISED AND DISTRIBUTED INCOMPLETE DATABASES. ©. C. O. PY. ALI AMER ALWAN. DOCTOR OF PHILOSOPHY UNIVERSITI PUTRA MALAYSIA 2013.

(3) PM. IG. H. T. U. PROCESSING SKYLINE QUERIES IN CENTRALISED AND DISTRIBUTED INCOMPLETE DATABASES. R. By. ©. C. O. PY. ALI AMER ALWAN. Thesis Submitted to the School of Graduate Studies, Universiti Putra Malaysia, in Fulfilment of the Requirements for the Degree of Doctor of Philosophy. June 2013.

(4) U. PM. DEDICATION. IG. H. T. To my Dearest and First Teachers: My Father and Mother To my lovely wife “Zubaidah” I will always be grateful for your endless love, unlimited support and deep faith in me To my beloved son Amer. ©. C. O. PY. R. To my lovely brother and sisters. Ali. Abstract of thesis presented to Senate of Universiti Putra Malaysia in fulfilment of the requirement for the degree of Doctor of Philosophy ii.

(5) PROCESSING SKYLINE QUERIES IN CENTRALISED AND DISTRIBUTED INCOMPLETE DATABASES. ALI AMER ALWAN June 2013. PM. By. :. Professor Hamidah Ibrahim, PhD. Faculty. :. Computer Science and Information Technology. U. Chairman. T. Skyline queries incorporate and provide a flexible query operator that returns data items. H. (skylines) which are not being dominated by other data items in all dimensions. IG. (attributes) of the database. Most of the existing skyline techniques determine the skylines by assuming that the values of dimensions for every data item are available However,. this. assumption. is. not. always. true. particularly for. R. (complete).. PY. multidimensional database as some values may be missing. The incompleteness of data leads to the loss of the transitivity property of skyline technique and results into failure in test dominance as some data items are incomparable to each other. Furthermore,. O. missing values will influence negatively on the process of finding skylines, leading to. C. high overhead, due to exhaustive pairwise comparisons between the data items. This problem becomes more complicated when multiple tables with incomplete data need to. ©. be accessed in determining the skylines. Since in distributed database tables are spread over various locations, therefore, join operation is needed in identifying the skylines. Joining these dimensions without any filteration will result in a huge amount of data to be joined. Furthermore, most of the previous works in the area of skyline queries in incomplete database emphasized only on retrieving skylines without estimating the iii.

(6) missing values. In other words, the derived skylines have some missing values in one or more dimensions. However, in many cases users are concerned about the values in these. PM. dimensions.. This thesis aims at proposing an efficient approach which is able to identify skylines in. incomplete database. The approach employs the concepts of clustering data to partition. U. the initial database into a set of distinct clusters. Then, the derived clusters are further. divided into smaller groups and local skylines of each cluster are then identified. Next, a. T. set of virtual skylines called k-dom that is derived from the local skylines are merged to. H. derive a global k-dom skyline which is inserted at the top of each cluster to identify the. IG. candidate skylines. The final skylines are retrieved after conducting pairwise comparisons among the candidate skylines. The approach is extended to process skyline. R. queries in incomplete distributed databases by pruning the input relations before. PY. conducting the join and skyline operations.. The thesis also proposes an approach to estimate the missing values in the skylines. The. O. approach utilizes the concept of mining attribute correlations to generate approximate. C. functional dependencies (AFDs) that capture the relationships between dimensions. Besides, the strength of probability correlations between dimensions is computed in. ©. order to estimate the values. Then, the skylines are ranked according to the confidence of the generated AFD and the strength of probability correlations of the dimensions.. Several experiments on synthetic and real datasets have been conducted. The results showed that our proposed approach for processing skyline queries in incomplete iv.

(7) database has reduced the number of pairwise comparisons in the range of 75%-93% and the processing time in the range of 50%-89% compared to the previous approach. While the approach for processing skyline queries in incomplete distributed databases achieved. PM. between 56% to 88% reduction in the processing time and 84% to 90% for network cost compared to the previous approach. Lastly, the results for imputing the missing values of. ©. C. O. PY. R. IG. H. T. missing values and the estimated values of the skylines.. U. the skylines have shown that our approach achieved 25% error rate between the real. v.

(8) Abstrak tesis yang dikemukakan kepada senat Uinversiti Putra Malaysia sebagai memenuhi keperluan untuk ijazah Doktor Falsafah MEMPROSES PERTANYAAN SKYLINE DI DALAM PANGKALAN DATA TIDAK LENGKAP BERPUSAT DAN TERAGIH. ALI AMER ALWAN Jun 2013. PM. Oleh. :. Profesor Hamidah Ibrahim, PhD. Fakulti. :. Sains Komputer dan Teknologi Maklumat. T. U. Pengerusi. H. Pertanyaan Skyline menggabungkan dan menyediakan operator pertanyaan fleksibel. IG. yang memulangkan item data (skyline) yang tidak didominasi oleh item data lain dalam semua dimensi (atribut) pangkalan data. Kebanyakan teknik skyline yang sedia ada. R. menentukan skyline dengan mengandaikan bahawa nilai dimensi bagi setiap item data adalah tersedia (lengkap). Walau bagaimanapun, andaian ini tidak selalunya benar. PY. khususnya bagi pangkalan data pelbagai dimensi memandangkan beberapa nilai mungkin tiada. Ketidaklengkapan data menyebabkan kehilangan sifat transitiviti teknik. O. skyline dan mengakibatkan kegagalan dalam ujian penguasaan kerana beberapa item data tidak sepadan antara satu sama lain. Tambahan pula, nilai yang hilang akan. C. memberi pengaruh negatif terhadap proses mendapatkan item data skyline,. ©. menyebabkan atas yang tinggi, akibat perbandingan berpasangan yang tuntas. Masalah ini menjadi lebih rumit bila pelbagai jadual yang mempunyai data tak lengkap perlu dinilai untuk menentukan skyline. Memandangkan dalam pangkalan data teragih jadual disebarkan kepada berbagai lokasi, maka operasi sambungan diperlukan untuk mendapatkan skyline. Penggabungan dimensi-dimensi ini tanpa sebarang tapisan akan. vi.

(9) mengakibatkan penggabungan sejumlah data yang besar. Lagipun, kebanyakan kajian terdahulu dalam bidang pertanyaan skyline dalam pangakalan data tak lengkap hanya menekankan kepada mendapatkan semula skyline tanpa mengambil kira nilai-nilai yang. PM. tiada. Dalam kata lain Skyline yang diperolehi tidak mengandungi nilai dalam satu atau lebih dimensi. Walau bagaimanapun, dalam banyak kes, pengguna lebih menitik. U. beratkan nilai-nilai dalam dimensi-dimensi ini.. Tesis ini bertujuan mencadangkan satu pendekatan cekap yang boleh mengenal pasti. T. skyline dalam pangkalan data tak lengkap. Pendekatan ini menggunakan konsep. H. pengelompokan data untuk membahagi pangkalan data asal kepada satu set kelompok. IG. yang berbeza. Kemudian, kelompok yang diterbitkan dibahagikan lagi kepada kumpulan lebih kecil dan skyline setempat bagi setiap kelompok kemudiannya dikenal pasti.. R. Selepas itu, satu set skyline maya yang dipanggil k-dom yang diterbitkan daripada. PY. skyline setempat digabungkan untuk menerbitkan satu skyline k-dom sejagat yang diselitkan di atas setiap kelompok untuk mengecam calon-calon skyline. Skyline terakhir didapatkan semula selepas dilaksanakan perbandingan berpasangan dalam kalangan. O. calon skyline. Pendekatan ini dilanjutkan untuk memproses pertanyaan skyline dalam dengan memangkas hubungan input sebelum. C. pangkalan data teragih tak lengkap. ©. dilakukan operasi cantuman dan skyline.. Tesis ini juga mencadangkan satu pendekatan untuk menyenggara nilai-nilai yang tiada dalam skyline melalui anggaran nilai hampir. Pendekatan ini menggunakan konsep. perlombongan korelasi atribut untuk menjana satu Kebergantungan Fungsian Hampiran (AFD) yang menangkap hubungan antara dimensi. Di samping itu, kekuatan korelasi vii.

(10) kebarangkalian antara dimensi dikira untuk menganggarkan nilai tersebut. Kemudian, skyline tersebut disusun mengikut kekuatan AFD yang dijana dan kekuatan korelasi. PM. kebarangkalian antara dimensi.. Beberapa eksperimen ke atas set data sintetik dan sebenar telah dijalankan. Keputusan menunjukkan pendekatan yang dicadangkan untuk memproses pertanyaan skyline dalam. U. pangkalaan data tak lengkap telah menurunkan bilangan perbandingan berpasangan. dalam julat 75%-93% dan masa pemprosesan dalam julat 50%-89% berbanding dengan. T. pendekatan-pendekatan terdahulu. Manakala pendekatan untuk memproses pertanyaan. H. skyline dalam pangkalaan data teragih tak lengkap telah mencapai penurunan 56%. IG. hingga 88% dalam masa pemprosesan dan 84% hingga 90% bagi kos rangkaian. Akhir sekali, keputusan bagi menggantikan nilai skyline yang tiada telah menunjukkan bahawa. R. pendekatan ini menghasilkan kadar ralat 25% di antara nilai sebenar skyline yang tiada. ©. C. O. PY. dengan nilai yang dianggarkan.. viii.

(11) ACKNOWLEDGEMENTS. In the name of ALLAH, the most merciful and most compassionate. Praise to ALLAH S.. PM. W. T. who granted me strength, courage, patience and inspirations to complete this work.. U. I do not think that words can express the gratitude that I have for my supervisor Professor Dr. Hamidah Ibrahim. As a top researcher in database system, her smart. T. intuition on computer science, stimulating suggestions and encouragement helped me in. IG. H. all aspects of study.. First and foremost, I heartiest would to sincere thanks my supervisor Prof. Dr. Hamidah. R. Ibrahim, for her incredible guidance, continuous support, and encouragement. Always. PY. having time for me and readily providing her technical expertise throughout the period of my study. I owe more than I can ever repay. Only has the successful completion of this work become possible due to her supervision, she is the first person to thank for. O. making my Ph.D. at the Universiti Putra Malaysia a very enjoyable experience. Her high. C. stand of diplomatic power and professionalism set a great model for me to follow.. ©. To my thesis committee members, Assoc. Prof. Dr. Nor Izura Udzir, and Dr. Fatimah Sidi, I would like to express appreciation for their insightful comments, questions, criticisms, and suggestions on the work. Their critical appraisals of my papers and presentations are extremely valuable for the improvement in my thinking also her kindness and willingness to help is unforgettable. ix.

(12) I delighted to gratefully acknowledge the Universiti Putra Malaysia for giving me the opportunity to complete my study and their financial support during my Ph.D journey. Thanks also are due to other members of the academic, and the technical staff in the. PM. Faculty of Computer Science and Information Technology for their help and effort to provide facilities, equipments, and an excellent environment to accomplish this research.. T. help, enjoyable discussions and some good times.. U. I also I would like to thank many people I have met during my stay in Malaysia for their. H. Finally, I would like to express my love and deepest thanks to my noblest father Amer. IG. Alwan Al-Jubouri and my great mother Ebtisam Mohammad were the reason of my success, I’m indebted to them. To my brother, and sisters, who have loved and support. ©. C. O. PY. R. me throughout my life, thank you.. Ali Amer Alwan June 2013. x.

(13) Hamidah Ibrahim, PhD Professor Faculty of Computer Science and Information Technology Universiti Putra Malaysia (Chairman). T. U. Nur Izura Udzir, PhD Associate Professor Faculty of Computer Science and Information Technology Universiti Putra Malaysia (Member). PM. This thesis was submitted to the Senate of Universiti Putra Malaysia and has been accepted as fulfillment of the requirement for the degree of Doctor of Philosophy. The members of the Supervisory Committee were as follows:. O. PY. R. IG. H. Fatimah Sidi, PhD Associate Professor Faculty of Computer Science and Information Technology Universiti Putra Malaysia (Member). ©. C. ____________________________ BUJANG BIN KIM HUAT, PhD Professor and Dean School of Graduate Studies Universiti Putra Malaysia. Date:. xi.

(14) DECLARATION. R. IG. H. T. U. PM. I hereby declare that the thesis is based on my original work except for quotations and citations which have been duly acknowledged. I also declare that it has not been previously or concurrently submitted for any other degree at Universiti Putra Malaysia or other institutions.. ALI AMER ALWAN. ©. C. O. PY. Date: 19 June 2013. xii.

(15) TABLE OF CONTENTS Page ii iii vi ix x xii xvi xvii xx. U. INTRODUCATION 1.1 Overview 1.2 Problem Statement 1.3 Objective of the Research 1.4 Research Scope 1.5 Contributions 1.6 Organization of the Thesis. PREFERENCE EVALUATION TECHNIQUES AND INCOMPLETE DATABASES 2.1 Introduction 2.2 Types of Queries in Database System 2.3 Preference Evaluation Techniques 2.3.1 Top-k Technique 2.3.2 Skyline Technique 2.3.3 Top-k Dominating Technique 2.3.4 K-dominance Technique 2.3.5 K-frequency Technique 2.4 Incomplete Database 2.5 Approaches of Query Processing in Incomplete Database 2.5.1 Statistical Methods 2.5.2 Machine Learning Methods 2.6 Summary. C. O. PY. R. 2. IG. H. T. CHAPTER 1. PM. DEDICATION ABSTRACT ABSTRAK ACKNOWLEDGEMENTS APPROVAL DECLARATION LIST OF TABLES LIST OF FIGURES LIST OF ABBREVIATIONS. ©. 3. LITERATURE REVIEW 3.1 Introduction 3.2 Preference Queries in Complete Database Systems 3.2.1 Top-k Preference Queries 3.2.2 Skyline Preference Queries 3.2.3 Top-k Dominating Preference Queries 3.2.4 K-dominance Preference Queries 3.2.5 K-frequency Preference Queries 3.3 Preference Queries in Incomplete Database Systems xiii. 1 2 5 6 7 8. 11 11 13 14 16 18 20 21 23 25 28 29 30. 31 31 34 38 45 47 48 50.

(16) 3.4. Summary. 54. RESEARCH METHODOLOGY 4.1 Introduction 55 4.2 Methodology of Research 56 4.3 The Structure of the Incomplete Database 60 4.4 The Proposed Framework of Skyline Queries in Incomplete Database 62 4.4.1 Components of the Global Incomplete Skylines 63 4.4.2 Components of the Global Complete Skylines 65 4.5 Performance Measurement 66 4.6 Datasets 68 4.6.1 Synthetic Datasets 69 4.6.2 Real Datasets 69 4.7 Summary 72. 5. SKYLINE QUERIES IN INCOMPLETE DATABASE 5.1 Introduction 5.2 Preliminaries 5.3 Our Proposed Approach 5.3.1 Clustering Data 5.3.2 Grouping and Finding Local Skylines 5.3.3 Deriving k-dom Skylines 5.3.4 Retrieving Skylines 5.4 Summary SKYLINE QUERIES IN DISTRBUTED INCOMPLETE DATABASES 6.1 Introduction 6.2 Our Proposed Approach 6.2.1 Identifying the Skylines in each Relation 6.2.1.1 Clustering Data 6.2.1.2 Identifying Local Skylines 6.2.1.3 Deriving k-dom Skylines 6.2.2 Joining the Skylines of the Relations 6.2.3 Determining the Final Skylines 6.3 Summary. 95 96 97 99 101 102 105 107 108. ESTIMATING MISSING VALUES OF THE SKYLINES IN INCOMPLETE DATA 7.1 Introduction 7.2 The Phases of Estimating Missing Values of the Skylines 7.2.1 Generating Approximate Functional Dependencies 7.2.2 Identifying the Strength of the Probability Correlations 7.2.3 Imputing the Missing Values of the Skylines 7.2.4 Ranking the Final Skylines 7.3 Summary. 109 110 111 116 117 119 120. C. O. PY. 6. ©. 7. 73 74 76 77 80 88 93 93. R. IG. H. T. U. PM. 4. xiv.

(17) RESULTS AND DISCUSSION 8.1 Introduction 8.2 Experiment Settings 8.3 Experiment Results of Skylines in Incomplete Database 8.3.1 Effect of Data Dimensionality 8.3.2 Effect of Dataset Size 8.4 Experiment Results of Skyline Queries in Distributed Incomplete Databases 8.4.1 Effect of Data Dimensionality 8.4.2 Effect of Join Selectivity 8.4.3 Effect of Dataset Size 8.4.4 Network Cost 8.5 Experiment Results of Estimating the Missing Values of Skylines 8.6 Summary. 122 122 127 127 131 136 136 138 140 142. U. PM. 8. Introduction Conclusion of Research Future Work Recommendations. IG. 9.1 9.2 9.3. T. CONCLUSIONS AND FUTURE WORK RECOMMENDATIONS. H. 9. 145 148. ©. C. O. PY. R. REFERENCES BIODATA OF STUDENT LIST OF PUBLICATIONS. xv. 150 150 152. 154 163 164.

(18)

No results found