
2.3 Analysis of Temporal Data

2.3.2 Application Ranges for Threshold Queries

The novel concept of threshold queries is an important technique, useful for many practical application areas.

Application 1

For the pharmaceutical industry, it is interesting to know which drugs cause similar effects on the blood values of a patient. Effects such as a certain blood parameter exceeding a critical level τ are of particular interest. We assume that, after a certain drug treatment, the heart rate and systolic blood pressure of several patients are measured for one minute, as shown in Figure 2.11, and that the data are stored in a database. In our example, the recorded data of patient A show an immediate effect of the drug, which differs significantly from the effect on patient B. For a given patient, a threshold query could return all other patients in the database whose heart rates and blood pressures show a similar temporal reaction to the medical treatment with respect to certain thresholds that are significant for the observed attributes.

[Figure 2.12 panels: Measurement A and Measurement B; curves: Temperature and Ozone (O3)]

Figure 2.12: Detection of associations between different environmental and climatic attributes.

Application 2

The amount of time series data derived, for example, from environmental observation centers has increased drastically. Furthermore, modern sensor techniques enable the user to record many attributes of the observed objects or scenes simultaneously. For instance, the analysis of environmental air pollution has been the focus of many European research projects in recent years. Many sensor stations have been installed at different locations in European cities and in rural areas. Each sensor station is equipped with several types of sensors that measure multiple air pollution attributes (e.g. SO2, NO, NO2, CO, BTX, O3, H2S and CmHn−O) as well as meteorological parameters such as wind direction, wind speed and temperature. As a result, German state offices for environmental protection maintain about 127 million time series, each representing the daily course of an air pollution parameter. The gathered data are stored as time series, which have to be analyzed. Geo- and environmental scientists could be interested in the dependencies that exist between meteorological attributes, e.g. humidity, and environmental attributes, e.g. particulate matter (PM10). Discovering which attributes exceed their legal thresholds nearly simultaneously could help to find such dependencies. Hence, effective and efficient processing of queries like “return all ozone time series that exceed the threshold τ1 = 75 µg/m³ when the temperature reaches the threshold τ2 = 25 °C” could be very valuable. An example is depicted in Figure 2.12, which shows two pairs of temperature-ozone curves where the course of the ozone concentration (lower time series) is very similar to that of the corresponding temperature (upper time series) with respect to τ1 and τ2, respectively. Threshold queries support analysis based on this kind of similarity. The increasing amount of data to be analyzed poses a major challenge for methods that support efficient threshold queries.
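The example query above can be illustrated with a deliberately naive sketch. It is not the query processing method developed in this work; it only reads “nearly simultaneously” literally as “the first time slots at which both series reach their thresholds differ by at most a small lag”. The function names, the `max_lag` parameter and the sample readings are assumptions made for illustration.

```python
def first_reach(series, tau):
    """Return the index of the first value >= tau, or None if tau is never reached."""
    for i, value in enumerate(series):
        if value >= tau:
            return i
    return None


def reach_nearly_simultaneously(temperature, ozone,
                                tau_temp=25.0, tau_ozone=75.0, max_lag=1):
    """True if both series reach their thresholds within max_lag time slots.

    max_lag is a hypothetical tolerance, not a parameter from the text.
    """
    t = first_reach(temperature, tau_temp)
    o = first_reach(ozone, tau_ozone)
    return t is not None and o is not None and abs(t - o) <= max_lag


# Hypothetical daily readings: (temperature in °C, ozone in µg/m³).
days = {
    "day_a": ([18, 22, 25, 27, 24], [40, 60, 80, 90, 70]),
    "day_b": ([19, 23, 26, 28, 22], [30, 45, 55, 60, 78]),
    "day_c": ([15, 17, 19, 20, 18], [50, 80, 85, 60, 40]),
}

hits = [name for name, (temp, o3) in days.items()
        if reach_nearly_simultaneously(temp, o3)]
print(hits)
```

A linear scan like this touches every value of every series, which is exactly the cost that becomes prohibitive for databases with millions of time series.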

Application 3

The analysis of gene expression data is important for understanding gene regulation and cellular mechanisms in molecular biology. Gene expression data contain the expression levels of thousands of genes, indicating how active each gene is over a set of time slots. The expression level of a gene can be “up” (indicated by a positive value) or “down” (negative value). From a biologist’s point of view, it is interesting to find genes that have a similar up and down pattern because this indicates a functional relationship among the particular genes. Since the absolute up/down value is irrelevant, this problem can be solved by means of threshold queries with a threshold of τ = 0. Each gene provides its own interval sequence, indicating the time slots classified as “up”. Genes with a similar interval sequence have a similar “up” and “down” pattern.
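As a minimal sketch of the interval sequences mentioned above, the following code extracts the maximal runs of time slots in which a series lies above a threshold τ. The function name `up_intervals` and the expression values are hypothetical; the sketch only derives the intervals and does not implement a similarity measure on them.

```python
def up_intervals(series, tau=0.0):
    """Return maximal runs of consecutive slots with value > tau.

    Each run is an inclusive (start, end) pair of time-slot indices.
    """
    intervals = []
    start = None
    for i, value in enumerate(series):
        if value > tau:
            if start is None:
                start = i          # a new "up" run begins
        elif start is not None:
            intervals.append((start, i - 1))
            start = None           # the current run has ended
    if start is not None:          # the series ends while still "up"
        intervals.append((start, len(series) - 1))
    return intervals


# Hypothetical expression levels over six time slots.
gene_a = [0.5, 1.2, -0.3, -0.8, 0.4, 0.9]
gene_b = [2.0, 0.1, -1.5, -0.2, 3.1, 0.7]
gene_c = [-0.4, -0.1, 0.6, 0.2, -0.9, -0.3]

print(up_intervals(gene_a))  # [(0, 1), (4, 5)]
print(up_intervals(gene_b))  # [(0, 1), (4, 5)]
print(up_intervals(gene_c))  # [(2, 3)]
```

Genes `gene_a` and `gene_b` differ considerably in their absolute values but yield the same interval sequence, which is why the absolute level can be ignored once τ is fixed.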

2.3.3 Test Datasets

Our experimental evaluation of the threshold similarity is based on a wide variety of test datasets. In order to guarantee the reproducibility of the experiments and to compare the results to other approaches, we used several publicly available datasets, mostly from the UCI KDD Archive¹, which we describe in the following.

The AUDIO dataset contains time sequences expressing the temporal behavior of the energy, the dynamics and the strongest peak in pieces of music. The three representations are computed 25 times per second for 6 octaves by using a cascade of bandpass filters. The resulting time series are then cut into pieces of length 300, resulting in an overall database of 700,000 time series. Unless otherwise stated, the database size was set to 50,000 objects and the length of the objects was set to 50. This dataset is used to evaluate the performance of our approach (cf. Section 13.3).

The SCIENTIFIC datasets are derived from two different applications:

• the analysis of environmental air pollution (SCIEN ENV) and

• gene expression data analysis (SCIEN GEX).

The data on environmental air pollution are derived from the Bavarian State Office for Environmental Protection, Augsburg, Germany² and contain the daily measurements of 8 sensor stations distributed in and around the city of Munich, Germany, from the year 2000 to 2004. One time series represents the measurements of one station on a given day and contains 48 values for one of 10 different parameters such as temperature, ozone concentration etc. The gene expression data from [SSZ+98] contain the expression levels of approximately 6,000 genes measured at 24 different time slots.

The STANDARD datasets are derived from diverse fields and cover the complete spectrum of data characteristics, including stationary/non-stationary, noisy/smooth, cyclical/non-cyclical, symmetric/asymmetric etc. They are available from the UCR Time Series Data Mining Archive [KF02]. Due to their variety, they are often used as a benchmark for novel approaches in the field of similarity search in time series databases. We used the following four datasets: GUN/POINT (GunX), TRACE (Trace), CYLINDER-BELL-FUNNEL (CBF) and CONTROL CHART (SynCtrl).

¹ kdd.ics.uci.edu/
² www.bayern.de/lfu

Figure 2.13: Example time series taken from the GunX dataset.

The GunX dataset is a two-class dataset which comes from the video surveillance domain [KR04]. It has two classes, each containing 100 instances. All instances were created by using one female actor and one male actor in a single session. The two classes are:

• Gun-Draw: The actors have their hands by their sides. They draw a replica of a gun from a hip-mounted holster, point it at a target for approximately one second, then return the gun to the holster, and their hands to their sides.

• Point: The actors have their hands by their sides. They point with their index fingers to a target for approximately one second, and then return their hands to their sides.

For both classes, the centroid of the right hand was tracked along the X-axis. Each instance has a length of 150 data points and is z-normalized (i.e., µ = 0, σ = 1).
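The z-normalization mentioned above can be sketched as follows. The text does not state whether the population or the sample standard deviation is used; this sketch assumes the population standard deviation, and the function name `z_normalize` is chosen for illustration.

```python
import statistics


def z_normalize(series):
    """Shift and scale a series so that its mean is 0 and its standard deviation is 1.

    Assumption: the population standard deviation (statistics.pstdev) is used.
    """
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    return [(value - mu) / sigma for value in series]


normalized = z_normalize([2.0, 4.0, 6.0, 8.0])
print(statistics.fmean(normalized), statistics.pstdev(normalized))
```

After normalization, a fixed threshold τ refers to a number of standard deviations above the mean rather than to an absolute measurement value, which makes thresholds comparable across instances.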

The examples depicted in Figure 2.13 illustrate the classification problem posed by this dataset: the actual time spent pointing at the target varies greatly within the allowed time frame of one second. This inconsistency poses a challenge for classification, because a classifier often ends up relying on the length of the pointing interval rather than on the slight irregularities that occur when drawing the gun.

The Trace dataset is a four-class dataset which is a subset of the Transient Classification Benchmark (trace project) used in [Rov02] for nuclear power plant malfunction diagnostics. It is a synthetic dataset designed by Davide Roverso to simulate instrumentation failures in a nuclear power plant. The full dataset consists of 16 classes with 50 instances in each class. Each instance has 4 features. The Trace subset only uses the second feature of classes 2 and 6, and the third feature of classes 3 and 7. Hence, this dataset contains 200 instances, 50 for each class. All instances are linearly interpolated to have the same length of 275 data points, and are z-normalized.

Figure 2.14: Example time series taken from the Trace dataset.

Figure 2.14 depicts some examples from this dataset. It is clearly visible that the time series within a class are relatively similar, but are heavily shifted along the time axis.
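The linear interpolation to a common length of 275 data points, as described for the Trace instances, could look like the following sketch. It assumes equally spaced samples and maps the first and last sample of the input to the first and last sample of the output; the function name `resample_linear` is hypothetical, and the original preprocessing may differ in detail.

```python
def resample_linear(series, target_len=275):
    """Resample a series to target_len points by linear interpolation.

    Assumes equally spaced samples; both end points are preserved.
    """
    n = len(series)
    if n < 2 or target_len < 2:
        raise ValueError("need at least two input and two output points")
    resampled = []
    for k in range(target_len):
        # Fractional position of output point k on the input index axis.
        pos = k * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        resampled.append(series[lo] * (1 - frac) + series[hi] * frac)
    return resampled


print(resample_linear([0.0, 10.0], target_len=5))  # [0.0, 2.5, 5.0, 7.5, 10.0]
print(len(resample_linear([1.0, 3.0, 2.0])))       # 275
```

Resampling equalizes the lengths but does not align the series in time, so the shifts along the time axis noted above remain in the data.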

Figure 2.15: Example time series taken from the Cylinder-Bell-Funnel dataset.

The CBF dataset is an artificial dataset that was defined by Saito in [Sai94] and later used in several other publications (cf. [Geu01]). It consists of the three classes cylinder, bell and funnel, which are defined by the following functions:

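$$
\begin{aligned}
c(t) &= (6 + \eta) \cdot \chi_{[a,b]}(t) + \varepsilon(t) \\
b(t) &= (6 + \eta) \cdot \chi_{[a,b]}(t) \cdot \frac{t - a}{b - a} + \varepsilon(t) \\
f(t) &= (6 + \eta) \cdot \chi_{[a,b]}(t) \cdot \frac{b - t}{b - a} + \varepsilon(t)
\end{aligned}
$$

with

$$
\chi_{[a,b]}(t) =
\begin{cases}
1 & \text{if } a \le t \le b \\
0 & \text{otherwise}
\end{cases}
$$

The values for η and ε(t) are standard normal variates; a and b are uniformly distributed integers in the ranges [16, 32] and [64, 128], respectively. For our experiments, we generated a CBF dataset containing 50 time series of each class.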
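A generator following the cylinder, bell and funnel definitions above might be sketched as shown below. The series length of 128 and the time axis t = 1, ..., 128 are assumptions, since the text does not state them; the function names and the seed are likewise chosen for illustration. As in the definitions, η is drawn once per series, whereas ε(t) is drawn anew for every time slot.

```python
import random


def cbf_series(kind, length=128, rng=random):
    """Generate one cylinder, bell or funnel series.

    Assumptions: length 128 and t = 1..length are not given in the text.
    """
    a = rng.randint(16, 32)      # uniform integer in [16, 32]
    b = rng.randint(64, 128)     # uniform integer in [64, 128], so b - a >= 32
    eta = rng.gauss(0.0, 1.0)    # one standard normal draw per series
    series = []
    for t in range(1, length + 1):
        chi = 1.0 if a <= t <= b else 0.0
        if kind == "cylinder":
            shape = 1.0
        elif kind == "bell":
            shape = (t - a) / (b - a)
        elif kind == "funnel":
            shape = (b - t) / (b - a)
        else:
            raise ValueError(f"unknown class: {kind}")
        noise = rng.gauss(0.0, 1.0)  # epsilon(t), drawn per time slot
        series.append((6.0 + eta) * chi * shape + noise)
    return series


def cbf_dataset(per_class=50, seed=0):
    """Return a list of (label, series) pairs with per_class series of each class."""
    rng = random.Random(seed)
    return [(kind, cbf_series(kind, rng=rng))
            for kind in ("cylinder", "bell", "funnel")
            for _ in range(per_class)]


dataset = cbf_dataset()
print(len(dataset))  # 150
```

Because a and b are drawn independently for each series, the position and width of the characteristic shape vary widely between instances of the same class.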

The biggest problems when classifying this dataset are the rather strong noise added to the time series and the large window in which the characteristic feature of each class can be located (cf. Figure 2.15).

Figure 2.16: Example time series taken from the Synthetic Control dataset.

The SynCtrl dataset was created by Alcock and Manolopoulos for [AM99] and contains 600 examples of synthetically generated control charts. It consists of the cyclic pattern subset of the control chart data from the UCI KDD archive. The data is effectively a sine wave with noise consisting of 6,000 data points. There are six different classes (100 instances per class) of control charts: normal, cyclic, increasing trend, decreasing trend, upward