
where users are overwhelmed with machine-generated data and spend days preparing data only to discard it shortly afterwards, upon realizing that it contains nothing useful or is the product of "noisy" activity from a device [8,111,162,193]. The work presented in this chapter is particularly attractive for scientific domains, in which the lack of proper tools for data analysis is evident [111,162]. The ideas presented in this chapter could thus be thought of as a step toward, as Jim Gray stated [111], putting the scientist back in control of his data, by unlocking the data and facilitating its analysis.

Furthermore, loading data into a DBMS creates a second copy of the data, which increases the storage cost for the terabytes and petabytes of data routinely gathered in scientific domains [47,130,173,220]. This copy, however, can be stored in an optimized manner depending on the database schema: e.g., integers stored in binary in a database page likely take less space than in ASCII. Nonetheless, there are cases where the second copy is not smaller than the original. For instance, variable-sized data stored in fixed-size fields usually takes more space in a database page than in its raw form, hence more than doubling the data set size. Moreover, DBMS store data in database pages using proprietary and vendor-specific formats. The DBMS has complete ownership over the data, which is a cause of concern for some users. The NoDB paradigm, however, achieves database format independence, since the raw data files remain intact as the main data repository.
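The space trade-off described above can be illustrated with a small sketch. The sample values, the comma-separated raw layout, the 32-bit integer width and the 32-byte fixed field are all assumptions chosen for illustration; they do not describe the page layout of any particular DBMS.

```python
import struct

# Hypothetical sample values; any integers that fit in 32 bits would do.
values = [7, 1234567, 2147483647, 42]

# Raw form: ASCII digits separated by commas, terminated by a newline.
ascii_bytes = (",".join(str(v) for v in values) + "\n").encode("ascii")

# Binary form: four fixed-width 32-bit little-endian integers.
binary_bytes = struct.pack("<4i", *values)

# Variable-sized strings: raw length versus an assumed fixed 32-byte field.
names = ["a", "bc", "def"]
FIXED_FIELD_WIDTH = 32  # assumption, stands in for a fixed-size character column
raw_name_bytes = sum(len(n.encode("ascii")) for n in names)
fixed_name_bytes = len(names) * FIXED_FIELD_WIDTH

print(len(ascii_bytes), len(binary_bytes))
print(raw_name_bytes, fixed_name_bytes)
```

In this sketch the binary integers are smaller than their ASCII form, while the padded strings are much larger than the raw ones. Note that very small integers can be shorter in ASCII than in a fixed 32-bit field, which is why the saving is likely rather than guaranteed.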

In addition, DBMS are designed to be the main repository for the data, which makes the integration of DBMS data with external tools inherently hard. Techniques such as ODBC, stored procedures and user-defined functions aim to facilitate the interaction with data stored in the DBMS. Nonetheless, none of these techniques is fully satisfactory; in fact, this is a common complaint of scientific users, who have large repositories of legacy code that operates on raw data files. Migrating and reimplementing these tools in a DBMS would be difficult and would likely require vendor-specific hooks. NoDB significantly facilitates such data integration, since users may continue to rely on their legacy code in parallel with systems such as PostgresRaw.

Another major opportunity coming with the NoDB vision is the potential to query multiple different data sources and formats. NoDB systems can adopt format-specific plugins to handle different raw data file formats [158]. Implementing these plugins in a reusable manner requires applying data integration techniques, but may also require the development of new techniques, so that commonalities between formats are identified and reused. Additionally, supporting different file formats requires the development of hybrid query processing techniques, or even adding support for multiple data models (e.g., for hierarchical data) [159].

7.6 Related work

The ideas presented in this chapter draw inspiration from several decades of research into database technology and are related to a plethora of research topics. In this section, we discuss the topics most closely related to the work presented in this chapter.

Chapter 7. Timely and Interactive Data Analytics

Automated physical design. The NoDB paradigm advocates for minimizing the data-to-insight time, which is also the goal of automated physical design and auto-tuning tools (automated physical designers). Every major database vendor offers offline indexing features, where an auto-tuning tool performs offline analysis to determine the proper physical design, including sets of indexes, statistics and views to use for a specific workload [5,6,7,41,53,72,73,193,244,263]. More recently, these ideas have been extended to support online indexing [42,214], hence removing the need to know the workload in advance. The workload is discovered on-the-fly, with periodic reevaluations of the physical design. In this chapter, we have considered the hard case of zero a priori idle time or workload knowledge, enabling instantaneous querying while using each query as advice on how to tune the system.

Adaptive indexing. NoDB brings new opportunities toward achieving fully autonomous database systems, i.e., systems that require zero initialization and administration. Recent efforts in database cracking and adaptive indexing [102,103,107,133,134,135,137] demonstrate the potential for incrementally building and refining indexes without requiring an administrator to tune the system or knowing the workload in advance. Still, all data has to be loaded up front, which breaks the adaptation properties and forces a significant delay in the data-to-insight time. We envision that adaptive indexing can be exploited and enhanced for NoDB systems. A NoDB-like system with adaptive indexing can avoid both index creation and loading costs, while providing full-featured database functionality. The major challenge is the design of adaptive indexing techniques directly on raw files.
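To make the idea of incrementally refined indexes concrete, the following is a minimal sketch of cracking on an in-memory column. It is a simplified illustration rather than the algorithm of any of the cited systems: the class name `CrackedColumn`, its methods and the list-based partitioning are hypothetical, and the sketch operates on loaded values, not on raw files.

```python
import bisect


class CrackedColumn:
    """Toy cracker index over a copy of a column (illustrative only)."""

    def __init__(self, values):
        self.data = list(values)
        # Invariant: every value at a position < bounds[i] is < pivots[i],
        # and every value at a position >= bounds[i] is >= pivots[i].
        self.pivots = []
        self.bounds = []

    def _crack(self, pivot):
        """Partition one piece around pivot; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.bounds[i]  # already cracked on this pivot
        lo = self.bounds[i - 1] if i > 0 else 0
        hi = self.bounds[i] if i < len(self.bounds) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        pos = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.bounds.insert(i, pos)
        return pos

    def range_query(self, low, high):
        """Return values v with low <= v < high (assumes low <= high)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]


col = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
first = col.range_query(5, 13)   # cracks the whole column twice
second = col.range_query(1, 5)   # only touches the piece below 5
```

Each query reorganizes only the pieces its bounds fall into, so the column becomes progressively more ordered in the value ranges that queries actually touch, without any up-front index build.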

External files. Some DBMS offer the ability to query raw data files directly with SQL, i.e., without loading data, similarly to our approach. External files, however, can only access raw data with no support for advanced database features such as indexes or statistics. Therefore, external files require every query to access the entire raw data file, as if no other query had done so in the past. In fact, this functionality is provided mainly to facilitate data loading tasks and not for regular querying. NoDB systems, however, provide on-the-fly index creation and incremental data loading through caching to assist future queries and improve performance.

Raw query processing. Several researchers have already identified the need to reduce data analysis time for very large data processing tasks [8,66,111,136,162,172,228]. Multiple systems following the MapReduce paradigm [75] are nowadays used to perform data analysis on raw data files stored in the Hadoop Distributed File System (HDFS) [118]. Hadoop [118] provides capabilities to query raw data; however, it is better suited to batch analytics than to interactive data exploration, due to its long job initialization process. Approaches such as Pig [190] and Hive [236] expose a declarative query language similar to SQL to launch queries that are then transformed internally into MapReduce jobs. Similar to NoDB, they have zero initialization overhead. Nonetheless, their performance remains flat (similar to external table approaches), while NoDB adapts to the query characteristics and improves performance over time. Hence, the techniques presented in this chapter could complement MapReduce solutions to save on parsing and tokenizing time.


Information extraction. Information extraction techniques have been extended to provide direct access to raw text data [153], similarly to external files. The difference from external files is that raw data access relies on information extraction techniques instead of directly parsing raw data files. These efforts are motivated by the need to bridge multiple different data formats and make them accessible via SQL, usually by relying on wrappers [207].

Data management of raw files. DataLinks [30] was developed as a tool that integrates the DBMS with the file system, by enabling a DBMS to manage files stored in the file system. This project, however, focuses more on providing management capabilities (e.g., consistency, integrity) over files, unlike PostgresRaw, which focuses on efficient query processing to enable efficient exploration. Data Vault [147] shares the same motivation as the NoDB project, providing an instant gateway to the data. Nevertheless, a major role of a data vault is to locate the files of interest for the given task, and then use existing external tools (should they exist) or load the data just-in-time into a DBMS. The DBMS is employed in a similar role in [157], where metadata information (i.e., semantic chunks) is used to find files of interest. Hence, the techniques proposed in this chapter could be applied to provide efficient query processing over files of interest once they are located with the above-mentioned approaches.

Data loading. Loading has recently gained more attention from the database community, which has realized that loading is becoming a bottleneck for modern data applications characterized by frequent arrivals of fresh data. Existing efforts toward reducing this overhead can be categorized into approaches that: a) amortize the loading cost by doing it lazily and incrementally, and b) accelerate the loading procedure.

Adaptive and lazy loading was first presented in [136] as an alternative to full data loading. In this approach, loading happens incrementally during query processing and is driven by the workload, similar to the way PostgresRaw builds its caches. Invisible loading [3] further applied this idea to the context of MapReduce jobs, incrementally building tuples by using the parsing and tuple extraction operations of MapReduce and storing them in MonetDB, a modern column-store [181].
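The workload-driven behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [136], invisible loading or PostgresRaw: the class `LazyColumnCache`, its methods, the integer-only CSV input and the in-memory string standing in for a raw file are all hypothetical.

```python
import csv
import io


class LazyColumnCache:
    """Loads columns of a raw CSV source only when a query needs them."""

    def __init__(self, raw_text, column_names):
        self.raw_text = raw_text  # stands in for a raw data file on disk
        self.index = {name: i for i, name in enumerate(column_names)}
        self.cache = {}           # column name -> list of parsed integers
        self.parses = 0           # number of passes over the raw data

    def column(self, name):
        """Return a column, parsing the raw data only on first access."""
        if name not in self.cache:
            self.parses += 1
            pos = self.index[name]
            # csv.reader still tokenizes whole rows; only the requested
            # field is converted to an integer and cached.
            reader = csv.reader(io.StringIO(self.raw_text))
            self.cache[name] = [int(row[pos]) for row in reader]
        return self.cache[name]

    def sum_where(self, target, filter_col, threshold):
        """SUM(target) WHERE filter_col > threshold."""
        t = self.column(target)
        f = self.column(filter_col)
        return sum(tv for tv, fv in zip(t, f) if fv > threshold)


raw = "1,10,100\n2,20,200\n3,30,300\n"
table = LazyColumnCache(raw, ["a", "b", "c"])
q1 = table.sum_where("c", "a", 1)  # parses columns c and a
q2 = table.sum_where("c", "a", 2)  # served entirely from the cache
```

Column `b` is never requested, so it is never converted or cached; the loading cost is paid only for the data the workload touches, and repeated queries avoid going back to the raw source.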

An orthogonal approach, instant loading [182] in HyPer [161], focuses on improving the performance of the loading procedure by using vectorization primitives (SIMD instructions) and exploiting modern hardware. Modern hardware is exploited in speculative loading as well [62], where a new operator, SCANRAW, extends the external tables functionality with a parallel implementation that takes advantage of modern multi-cores to improve performance. Instant and speculative loading are orthogonal to the techniques presented in this chapter and could be used to enhance the performance of adaptive loading (i.e., caching) in PostgresRaw.
