Future work - Compact data structures for large and complex datasets

As future work, we will extend the proposed structure to additional dimensions, for instance to handle spatio-temporal or 3D datasets. We will also study how to adapt our data structure to distributed or dynamic environments.

Another interesting line of future research would be to integrate our structure into semantic web tools in order to query spatial data using, for example, the GeoSPARQL standard. The current tools for this type of query have several drawbacks: either they do not implement all the functionality or their query performance is very poor. We believe that our structure could improve on both fronts.

Choosing the R-tree to index the vector dataset is a pragmatic decision, since it is the de facto standard for this type of data. However, as future work we will consider replacing the R-tree with modern compact data structures.

Part III

Scientific data

Chapter 13

Introduction

This part presents two main contributions. The first is a compact representation of huge sets of functional data or trajectories of continuous-time stochastic processes, which keeps the data compressed at all times, even while it is being processed in main memory. It is designed to support the efficient computation of the sample autocovariance function without first decompressing the dataset, using only partial local decoding. This structure, which we call Compact representation of Brownian Motion (CBM), is presented in Section 14.1. The second contribution is a new memory-efficient algorithm to compute the sample autocovariance function, which is described in detail in Section 14.2.
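For reference, the quantity being computed can be sketched on plain, uncompressed data. The code below is a minimal illustration and not the algorithm of Section 14.2 nor the CBM structure. It assumes that the dataset consists of n trajectories observed at the same m time points, and that the estimator uses the usual 1/(n-1) normalization. The type `Trajectories` and the functions `sample_mean` and `sample_autocovariance` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: data[i][j] holds the value of trajectory i at time index j.
using Trajectories = std::vector<std::vector<double>>;

// Pointwise sample mean over all trajectories, one value per time index.
std::vector<double> sample_mean(const Trajectories& data) {
    assert(!data.empty());
    const std::size_t n = data.size();
    const std::size_t m = data[0].size();
    std::vector<double> mean(m, 0.0);
    for (const auto& x : data) {
        assert(x.size() == m);  // all trajectories share the same time grid
        for (std::size_t j = 0; j < m; ++j) mean[j] += x[j];
    }
    for (double& v : mean) v /= static_cast<double>(n);
    return mean;
}

// Sample autocovariance between time indices s and t,
// normalized by n - 1 (an assumption; other normalizations exist).
double sample_autocovariance(const Trajectories& data,
                             const std::vector<double>& mean,
                             std::size_t s, std::size_t t) {
    const std::size_t n = data.size();
    assert(n > 1 && s < mean.size() && t < mean.size());
    double acc = 0.0;
    for (const auto& x : data) {
        acc += (x[s] - mean[s]) * (x[t] - mean[t]);
    }
    return acc / static_cast<double>(n - 1);
}
```

Evaluating `sample_autocovariance` for every pair (s, t) reads each value of every trajectory many times, which is why keeping the whole dataset uncompressed in main memory becomes the limiting factor for large n and m.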

We compare our C++ implementation, which takes CBM-compressed data as input, with two baselines: i) the R implementation (in fact a C program), and ii) our own C implementation, both operating on plain data. The results of our empirical evaluation are shown in Chapter 15. Finally, Chapter 16 presents our conclusions and directions for future work.

The outline of this chapter is as follows. Section 13.1 presents the motivation for the use of compression in the context of empirical autocovariance computation for Brownian motion trajectories. More details about trajectories of Brownian motion are described in Section 13.2. Finally, Section 13.3 shows some related work in the compression field.

13.1 Introduction

In the last decade, we have witnessed an exceptionally fast-growing demand for large-scale data analysis, which is linked to the new field called Big Data. The need to process huge collections of data poses several challenges. On the one hand, the statistics and artificial intelligence communities continue to develop new methods and techniques to analyze data. On the other hand, computer scientists have to adapt analytical algorithms to datasets whose volume is too large, whose rate is too fast, and whose content is too heterogeneous and too uncertain: the so-called Volume, Velocity, Variability, and Veracity.

Researchers and professionals working in Big Data must master many different techniques and skills. To facilitate their work, several packages have appeared, mainly SAS¹, MATLAB², and R³. These packages are very useful, but they have scalability problems [KEW13]. For example, the installation and administration manual of R recommends loading into main memory only datasets that occupy 10–20% of the available RAM, and warns that if the dataset exceeds 50% of the available RAM, the system will be unusable because of the overhead of even the simplest operations. The solution to these problems is, in most cases, the use of parallel processing [DG08, KEW13, DXS+15, SETM13]. Parallel processing is a straightforward solution, probably due to the existence of a good set of available tools. However, by putting most of the effort into this strategy, one misses chances to improve scalability by other means. More sophisticated data structures and algorithms are losing the role they had in the past, when hardware technology was more limited.

Compression of floating-point numbers has proven difficult, mainly because the datasets usually contain many distinct values with few repetitions. These two features give sequences of floating-point numbers a nearly uniform (poorly skewed) frequency distribution and, as a consequence, the entropy of those sequences is high, making them virtually incompressible with statistical compressors. Therefore, general-purpose compressors may not succeed on sequences of floating-point numbers. Instead, there are compressors that take advantage of properties of the data domain. Thus, there are compressors specially designed for images, video, or sound [Wal91, MPFL96, LR04, LI06], for general scientific data [EFF00, RKB06], or for more specific domains [YS08, MHP+11].
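The entropy argument above can be made concrete with a small sketch. It computes the zero-order empirical entropy of a sequence of doubles, treating each distinct 64-bit pattern as one symbol. The function name `empirical_entropy_bits` is hypothetical, and the code assumes that `double` is 64 bits wide, as on common platforms.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes 64-bit doubles");

// Zero-order empirical entropy, in bits per value.
// Two values are the same symbol only if their bit patterns are identical
// (so 0.0 and -0.0 count as different symbols).
double empirical_entropy_bits(const std::vector<double>& values) {
    assert(!values.empty());
    std::unordered_map<std::uint64_t, std::size_t> counts;
    for (double v : values) {
        std::uint64_t bits;
        std::memcpy(&bits, &v, sizeof bits);  // well-defined way to read the bit pattern
        ++counts[bits];
    }
    const double n = static_cast<double>(values.size());
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

When all n values of a sequence are distinct, this function returns log2(n), the maximum possible for a sequence of that length, so a statistical coder driven by symbol frequencies has almost nothing to exploit. A sequence with many repeated values yields a much lower figure.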

Although our method can be used for trajectories of any continuous-time stochastic process, in this thesis we rely on the characteristics of Brownian motion to develop data structures and algorithms especially suited to these data. Since the seminal work by Einstein [Ein06], Brownian motion has been extensively used to model the movements of particles subject to the instantaneous, unbalanced combined forces exerted by collisions. Brownian motion and related stochastic processes have been successfully used to model the movement of colloidal particles or the trajectory of pollen grains suspended in water. Over the past forty years, starting with [BS73], Brownian motion and related processes have been used to model option pricing and many financial time series (see, for instance, [Hul09]). Our method is designed to efficiently compute empirical moments from a sample

¹ http://www.sas.com/

² http://www.mathworks.com/products/matlab
