Chapter 2: Background and Technology Trends
2.3 Storage Interfaces
2.4.3 Data Mining
In data mining, basic tasks such as association discovery, classification, regression, clustering, and segmentation are all data-intensive. The data sets being processed are often quite large and it is not known a priori where in a particular data set the “nuggets” of knowledge may be found [Agrawal96, Fayyad98]. Point-of-sale data in retail organizations is collected over many months and years and grows continually [Agarwal95]. Telecommunication companies maintain tens of terabytes of historical call data that they wish to search for patterns and trends. Financial companies maintain decades of data for risk analysis and fraud detection [Senator95]. Airlines and hotels maintain historical data as input for various types of yield management and targeted marketing [Sun99a]. The potential list of data sources and potential uses is endless.
Many of the statistical and pattern recognition algorithms used in data mining have been developed with small data sets in mind and depend on the ability to operate on the data in memory, often requiring multiple passes over the entire data set to do their work. This means that users must either limit themselves to using only a subset of their data, or must have more efficient ways of operating on them out-of-core.
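As a small illustration of what operating out-of-core can look like, the sketch below computes a mean and variance in one sequential pass over a data stream, keeping only three scalars in memory. It uses Welford's online method, which is not named in the text and is chosen here only as a familiar example; the function name streaming_mean_variance is hypothetical.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) using a single pass.

    Only three scalars are kept in memory, so the input can be any
    iterable, including a generator that reads records from disk.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance
```

Because the function accepts any iterable, it can be fed a generator that yields one record at a time from a file, so the data set never has to fit in memory. Algorithms that genuinely need several passes would repeat such a scan once per pass.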
Figure 2-7 Trends in transaction processing performance and cost. The chart shows price/performance ratios for TPC-C machines from the introduction of the TPC-C benchmark in 1993 to 1999. There are two sets of plots, one for enterprise class and one for commodity or workgroup class machines across three different manufacturers. Note the log scale.
[Plot: $/tpmC (log scale, 10^1 to 10^3) versus year (1992 to 2000). Series: IBM enterprise, HP enterprise, IBM commodity, HP commodity, Dell commodity. Annotated values: $2000/tpmC, $100 per tpmC, $17 per tpmC.]
The primary difference between “traditional” database operations and data mining or data warehousing is that on-line transaction processing (OLTP) systems were designed to automate simple and highly structured tasks that are repeated all day long - e.g. credit card sales or ATM credits and debits. In this case the reliability of the system and consistency of the data are the primary performance factors [Chaudhuri97]. The historical data sets that form the basis for data mining are the accumulated transactions of (often) several OLTP systems that an organization collects from different portions of its daily business. The goal of data mining is to combine these datasets and look for patterns and trends both within and among the different original databases (imagine, for example, combining grocery store receipts with customer demographics and weather reports to determine that few people in Minneapolis buy charcoal briquets in December). As a result, data mining systems will often contain orders of magnitude more data than transaction processing systems. In addition, the queries generated in a data mining system are much more ad-hoc than those in an OLTP system. There are only so many ways to debit a bank account when a withdrawal is made, while there are an infinite number of ways to summarize a large collection of these transactions based on branch location, age of the customer, time of day, and so on. This means that the use of static indices is significantly less effective than in OLTP workloads, particularly as the number and variety of attributes and dimensions in the data increases (due to the curse of dimensionality, discussed in more detail in the next section). The use of materialized views and pre-computed data cubes [Gray95, Harinarayan96] allows portions of the solution space to be pre-computed in order to answer frequently-asked queries without requiring complete scans, although these mechanisms will only benefit the set of queries and aggregates chosen a priori and must still be computed (using scans) in the first place.
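The trade-off behind pre-computed data cubes can be sketched in a few lines. The example below makes one scan over a handful of transactions and pre-computes SUM(amount) for every subset of the chosen dimensions. The function name build_cube, the field names, and the sample rows are hypothetical and serve only as an illustration; this is not the algorithm of [Gray95] or [Harinarayan96].

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-compute SUM(measure) for every subset of `dimensions` in one scan."""
    # Every subset of the dimensions is one group-by: 2**len(dimensions) in total.
    subsets = [combo
               for size in range(len(dimensions) + 1)
               for combo in combinations(dimensions, size)]
    cube = {subset: defaultdict(float) for subset in subsets}
    for row in rows:  # the single (expensive) scan over the base data
        for subset in subsets:
            key = tuple(row[d] for d in subset)
            cube[subset][key] += row[measure]
    return cube


# Hypothetical sample transactions.
sales = [
    {"branch": "north", "age_group": "18-34", "amount": 20.0},
    {"branch": "north", "age_group": "35-64", "amount": 30.0},
    {"branch": "south", "age_group": "18-34", "amount": 50.0},
]
cube = build_cube(sales, ("branch", "age_group"), "amount")

# Answered from the pre-computed aggregates, with no further scan.
north_total = cube[("branch",)][("north",)]
grand_total = cube[()][()]
```

A query that groups by an attribute left out of the chosen dimensions (time of day, say) gets no help from the cube and falls back to a scan, and the number of group-bys doubles with each dimension added, which is why only a chosen subset is usually materialized.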
These characteristics mean that data mining tasks are not well-supported by the database systems that have been optimized for OLTP workloads over many years [Fayyad98]. There are several efforts underway to identify a basic set of data mining primitives that might be added as extensions to SQL and used as the basis of these more complex queries. The field is relatively new, so many of the basic tasks are still being identified and debated within the community. One of the major factors in determining what these primitives should be is how they can be mapped efficiently to the underlying system architectures, and in particular which sorts of operations will be quick and which more cumbersome. There is as yet no “standard” way to do data mining, and there is great variance across disciplines and data sets. This means there is room for novel architectures that provide significant advantages over existing systems to make inroads before particular ways of doing things are set in stone (or code¹).
2.4.4 Multimedia
In multimedia, applications such as searching by content [Flickner95, Virage98] place large demands on both storage and database systems. In a typical search, the user might provide a single “desirable” image and request a set of “similar” images from the database. The general approach to such a search is to extract a set of feature vectors from every image, and then search these feature vectors for nearest neighbors in response to a query [Faloutsos96]. Both the extraction and the search are data-intensive operations.
1. To further illustrate this point, there are over 200 companies of varying sizes currently developing and providing data mining software [Fayyad99], whereas the number of companies that provide “standard” OLTP database management software can be counted on the fingers of one hand. This means there is much more scope for novel architectures or ways of developing code than there might be in the “traditional” database systems.
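The search step of a content-based image query can be sketched as a plain sequential scan over pre-extracted feature vectors. The function name nearest_neighbors, the image identifiers, and the two-dimensional vectors below are hypothetical, and Euclidean distance is an assumption; real systems use many more dimensions and may use other similarity measures.

```python
import heapq
import math


def nearest_neighbors(query, features, k=3):
    """Return the ids of the k feature vectors closest to `query`.

    `features` maps an image id to its feature vector. Every vector is
    visited once (a sequential scan), and only the k best are kept.
    """
    return heapq.nsmallest(
        k,
        features,
        key=lambda image_id: math.dist(query, features[image_id]),
    )


# Hypothetical feature vectors extracted from four images.
features = {
    "img_a": [0.0, 0.0],
    "img_b": [1.0, 1.0],
    "img_c": [5.0, 5.0],
    "img_d": [0.1, 0.0],
}
closest = nearest_neighbors([0.0, 0.0], features, k=2)
```

The work done is proportional to the number of stored vectors times their dimensionality, which is what makes the search data-intensive for large collections.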
Extracting features requires a range of image processing algorithms. The algorithms used and features extracted are also constantly changing with improvements in processing, or as the understanding of how users classify “similarity” in multimedia content such as images, video, or audio changes. The state of the art is constantly evolving, so workloads will require repeated scans of the entire data sets to re-extract new features. Since the extraction of features represents a lossy “compression” of the data in the original images, it is often necessary to resort to the original images for re-processing. This is true in data sets of static images that may be available for searching on the web [Flickner95], as well as in image databases used to find patterns in the physical world [Szalay99].
Once a fixed set of features has been identified and extracted from an image database, it is no longer necessary to resort to the original images, which may be measured in terabytes or more, for most queries, but the extracted data is still large. The Sloan Digital Sky Survey, for instance, will eventually contain records for several hundred million celestial objects [Szalay99]. The Corbis archive maintains over 2 million online images [Corbis99]. The dimensionality of these vectors will often be high (e.g. moments of inertia for shapes [Faloutsos94] in the tens, colors in histograms for color matching in the hundreds, or Fourier coefficients in the thousands). It is well-known [Yao85], but only recently highlighted in the database literature [Berchtold97], that for high dimensionalities, sequential scanning is competitive with indexing methods because of the curse of dimensionality. Conventional database wisdom is that indices always improve performance over scanning. This is true for low dimensionalities, or for queries on only a few attributes. However, in high dimensionality data and with nearest neighbor queries, there is a lot of “room” in the address space and the desired data points are far from each other. The two major indexing methods for this type of data, grid-based and tree-based, both suffer in high dimensionality data. Grid-based methods require exponentially many cells and tree-based methods tend to group similar points close together, resulting in groups with highly overlapping bounds. One way or another, a nearest neighbor query will have to visit a large percentage of the database, effectively reducing the problem to sequential scanning.
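The claim that grid-based methods require exponentially many cells can be checked with simple arithmetic. The sketch below assumes the coarsest useful grid, two bins per dimension, and takes the 2 million image count from the Corbis figure quoted above. The function names are hypothetical.

```python
def grid_cells(bins_per_dim: int, dims: int) -> int:
    """Number of cells in a regular grid with `bins_per_dim` bins on each axis."""
    return bins_per_dim ** dims


def avg_points_per_cell(n_points: int, bins_per_dim: int, dims: int) -> float:
    """Average occupancy if `n_points` were spread evenly over the grid."""
    return n_points / grid_cells(bins_per_dim, dims)


# With 2 million points and only 2 bins per dimension, the grid has more
# cells than points once the dimensionality reaches 21.
for dims in (10, 20, 21, 100):
    print(dims, grid_cells(2, dims), avg_points_per_cell(2_000_000, 2, dims))
```

With feature vectors in the tens to thousands of dimensions, almost every cell is empty, so the grid no longer narrows down where the neighbors of a query point can be.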
There is a good deal of ongoing work in this area to address indexing for this type of data, including X-trees [Berchtold96], but some recent theoretical results indicate that this is actually a structural problem with these types of data and queries, rather than simply due to the fact that no one has found the right indexing scheme “yet”.
In addition to requiring support for complex, data-intensive queries, the sheer size of these databases can be daunting. One hour of video requires approximately one gigabyte of storage, and storing video databases such as daily news broadcasts can quickly require many terabytes of storage [Wactlar96]. Increasingly, users are maintaining such databases that can be searched by content (whether as video, as text, or as audio), using many of the methods discussed above to find a particular piece of old footage or information. Medical image databases also impose similarly heavy data requirements [Arya94].
2.4.5 Scientific
Large scientific databases often include image data, which has already been mentioned, as well as time series data and other forms of sensor data that require extensive and repeated post-processing. These data sets are characterized by huge volumes of data and huge numbers of individual objects or observations. The Sloan Digital Sky Survey project will collect 40 TB of raw data on several hundred million celestial objects to be processed into several different data products totalling over 3 TB. This data will be made available for scientific use to a large number of organizations, as well as to the public via a web-accessible database [Szalay99]. The dozens of satellites that form NASA’s Earth Observing System will generate more than a terabyte of data per day when they become fully operational [NASA99].