
Temporal Patterns for Time Point Data

In this section, we review the major approaches for mining symbolic sequences, where the events are instantaneous and do not have time durations.

4.3.1 Substring Patterns

The simplest patterns that can be extracted from time point symbolic sequences are substring patterns [Manber and Myers, 1990, Fischer et al., 2008], which are subsequences of symbols that appear consecutively in a sequence (without gaps). For example, Figure 21 shows the occurrences of substring pattern < B, C, D > in a symbolic sequence. Discovering such patterns is mostly used in bioinformatics and computational biology for matching sequences of nucleotides or amino acids. Substring pattern mining is applied to univariate symbolic sequences that are regularly sampled in time.

Figure 21: An example showing the occurrences of substring pattern < B, C, D > in a symbolic sequence.
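To make the notion of a gap-free occurrence concrete, the following minimal sketch scans a sequence for every position where a substring pattern starts. The function name `find_substring_occurrences` and the example sequence are hypothetical and are not taken from Figure 21. The naive scan is only an illustration; for long sequences, index structures such as suffix arrays are the usual choice.

```python
def find_substring_occurrences(sequence, pattern):
    """Return the start indices at which `pattern` occurs consecutively in `sequence`."""
    n, m = len(sequence), len(pattern)
    pattern = list(pattern)
    # Compare the pattern against every window of length m (no gaps allowed).
    return [i for i in range(n - m + 1) if list(sequence[i:i + m]) == pattern]


# Hypothetical univariate symbolic sequence (not the one shown in Figure 21).
sequence = list("ABCDABBCDA")
occurrences = find_substring_occurrences(sequence, ["B", "C", "D"])
print(occurrences)
```

Each returned index marks the first symbol of one occurrence of < B, C, D >.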

4.3.2 Sequential Patterns

Sequential patterns are more general than substring patterns because their events do not have to be consecutive in the sequence (gaps are allowed). Similar to itemset mining, sequential pattern mining was initially proposed for analyzing market basket data and customer shopping behavior [Agrawal and Srikant, 1995]. An example of a sequential pattern is “customers who buy a Canon digital camera are likely to buy an HP color printer within a month”.

In the following example, we illustrate the main concepts of sequential pattern mining in the setting of market basket data. However, the presented concepts generalize to other domains, such as telecommunication data, machine logs, web-click data and more.

Example 13. Consider the data in Table 9, where the alphabet of items is Σ = {A, B, C, D} and there are 5 transactions T1 to T5. Each transaction is a sequence of events (customer visits to the supermarket) and each event can be a single item or a set of items (items bought by the customer on the same trip to the supermarket). For example, customer T1 first bought item A, then item C, then items B and A together. We can see that sequential pattern P1 = < C, B > (C followed by B) appears in transactions T1, T2, T3 and T4, hence sup(P1) = 4, while sequential pattern P2 = < B, C > appears only in transaction T4 and hence sup(P2) = 1.

Transaction    Sequence of items
T1             < A, C, (B, A) >
T2             < C, (D, A), B >
T3             < A, C, B, A >
T4             < C, (B, D), (A, C) >
T5             < B, D >

Table 9: An example of sequence data.
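The support values in Example 13 can be checked with a short sketch. It encodes each transaction of Table 9 as a list of item sets and counts the transactions that contain a pattern, where a pattern event matches a transaction event if it is a subset of it. The names `contains`, `support` and `table9` are illustrative choices, not part of the original text.

```python
def contains(sequence, pattern):
    """True if the events of `pattern` can be matched, in order and with gaps
    allowed, to events of `sequence` such that each pattern event is a subset
    of the event it is matched to."""
    pos = 0
    for event in sequence:
        if pos < len(pattern) and pattern[pos] <= event:  # subset test
            pos += 1
    return pos == len(pattern)


def support(pattern, database):
    """Number of transactions in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database.values())


# Table 9: every event is a set of items bought on the same trip.
table9 = {
    "T1": [{"A"}, {"C"}, {"B", "A"}],
    "T2": [{"C"}, {"D", "A"}, {"B"}],
    "T3": [{"A"}, {"C"}, {"B"}, {"A"}],
    "T4": [{"C"}, {"B", "D"}, {"A", "C"}],
    "T5": [{"B"}, {"D"}],
}

sup_p1 = support([{"C"}, {"B"}], table9)  # P1 = < C, B >
sup_p2 = support([{"B"}, {"C"}], table9)  # P2 = < B, C >
print(sup_p1, sup_p2)  # 4 1
```

Matching each pattern event to the earliest possible transaction event is sufficient here, because an earlier match never rules out a later one when only the order of events matters.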

The standard sequential pattern mining framework considers only the order of events, not their exact times. Therefore, sequential pattern mining does not require the original sequences to be regularly sampled in time. Note that the application of sequential pattern mining goes far beyond the market basket analysis task. It can be applied to any kind of univariate or multivariate symbolic sequences.

In the following, we first outline the most common sequential pattern mining algorithms and then discuss how to reduce the number of sequential patterns using temporal constraints.

Mining algorithms: The first algorithm for mining frequent sequential patterns was proposed by [Agrawal and Srikant, 1995], which is based on the Apriori approach (see Section 2.2.1). PrefixSpan [Pei et al., 2001] mines sequential patterns using the pattern growth approach (see Section 2.2.2) and SPADE [Zaki, 2001] uses the vertical data approach (see Section 2.2.3). CloSpan [Yan et al., 2003] and BIDE [Wang and Han, 2004] are two efficient methods for mining closed sequential patterns (see Section 2.3.1 for the definition of closed patterns).
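As a rough illustration of the level-wise idea behind the Apriori approach, the sketch below grows patterns one item at a time and extends only those that are already frequent. It is a simplified sketch, not the published algorithm of [Agrawal and Srikant, 1995]: it assumes patterns whose events are single items, and the names `support` and `mine_frequent_sequences` are hypothetical.

```python
def support(pattern, database):
    """Number of sequences that contain the items of `pattern` in order (gaps allowed)."""
    def contains(sequence):
        pos = 0
        for event in sequence:
            if pos < len(pattern) and pattern[pos] in event:
                pos += 1
        return pos == len(pattern)
    return sum(contains(seq) for seq in database)


def mine_frequent_sequences(database, min_sup):
    """Level-wise search: a pattern of length k+1 is generated only from a
    frequent pattern of length k, since support cannot grow when a pattern is extended."""
    if min_sup < 1:
        raise ValueError("min_sup must be at least 1")
    items = sorted({item for seq in database for event in seq for item in event})
    frequent = {}
    level = [(item,) for item in items]
    while level:
        survivors = []
        for candidate in level:
            sup = support(candidate, database)
            if sup >= min_sup:
                frequent[candidate] = sup
                survivors.append(candidate)
        frequent_items = [p[0] for p in frequent if len(p) == 1]
        # Extend each surviving pattern by one frequent item.
        level = [p + (item,) for p in survivors for item in frequent_items]
    return frequent


# The sequences of Table 9, each event written as a set of items.
database = [
    [{"A"}, {"C"}, {"B", "A"}],
    [{"C"}, {"D", "A"}, {"B"}],
    [{"A"}, {"C"}, {"B"}, {"A"}],
    [{"C"}, {"B", "D"}, {"A", "C"}],
    [{"B"}, {"D"}],
]
frequent = mine_frequent_sequences(database, min_sup=4)
print(frequent)
```

With a minimum support of 4, the only frequent patterns of length two in this data are < C, A > and < C, B >, and no pattern of length three survives.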

Temporal Constraints: Mining the complete set or even the closed set of frequent sequential patterns can yield a very large number of patterns. Many of the concise representations described in Section 2.3 can be applied for compressing sequential patterns. Another way to reduce the number of sequential patterns is to impose temporal constraints on the patterns. One common temporal constraint is to restrict the total duration of the pattern. For example, we may specify that the total pattern duration must not exceed w time units (e.g., 6 months). This constraint translates to defining a sliding window of width w and mining only sequential patterns that can be observed within this window. Another common temporal constraint is to define the maximum gap that is allowed between consecutive events in a pattern. For example, we may specify that the difference between consecutive events should not be more than g time units (e.g., 2 weeks). Incorporating temporal constraints is described in [Srikant and Agrawal, 1996] for the Apriori approach and in [Pei et al., 2007] for the pattern growth approach.
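The two constraints can be illustrated with a small sketch that tests whether a pattern occurs in one timestamped sequence subject to a maximum gap g and a maximum total duration w. It is an assumption-laden illustration, not the method of the cited papers: events are taken to be (timestamp, item) pairs sorted by distinct timestamps, and the name `occurs_with_constraints` and the example data are hypothetical. Backtracking is used because, once a gap constraint is present, matching each item to its earliest occurrence can miss valid occurrences.

```python
def occurs_with_constraints(events, pattern, max_gap=None, max_window=None):
    """True if `pattern` occurs in `events` (a time-sorted list of (time, item))
    with consecutive matches at most `max_gap` apart and the whole occurrence
    spanning at most `max_window` time units."""
    def extend(pos, start_time, prev_time, from_index):
        if pos == len(pattern):
            return True
        for i in range(from_index, len(events)):
            t, item = events[i]
            if item != pattern[pos]:
                continue
            if pos > 0:
                # Events are sorted, so later candidates only violate more.
                if max_gap is not None and t - prev_time > max_gap:
                    break
                if max_window is not None and t - start_time > max_window:
                    break
            first = t if pos == 0 else start_time
            if extend(pos + 1, first, t, i + 1):
                return True
        return False

    return extend(0, None, None, 0)


# Hypothetical sequence; timestamps could be read as days.
events = [(0, "C"), (1, "A"), (10, "C"), (12, "B"), (30, "B")]

print(occurs_with_constraints(events, ["C", "B"], max_gap=5))          # True
print(occurs_with_constraints(events, ["C", "B"], max_gap=1))          # False
print(occurs_with_constraints(events, ["C", "A", "B"], max_window=10)) # False
```

In the first call the occurrence starting at the first C violates the gap, but the one starting at the C with timestamp 10 satisfies it, which is why the search must try later starting points.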

4.3.3 Episode Patterns

We saw that sequential patterns are used to express order among events. Episode patterns [Mannila et al., 1997, Méger and Rigotti, 2004] are more general than sequential patterns because they can also express the concept of concurrency. [Mannila et al., 1997] defined two special types of episodes:

1. Serial episodes: express order of events (equivalent to a sequential pattern with a maximum pattern duration constraint).

2. Parallel episodes: express concurrency of events (the order does not matter).

In general, an episode is a combination of serial and parallel episodes. For example, an episode can be a sequence of parallel episodes or a parallel combination of serial episodes. Episodes are usually represented as a directed acyclic graph of events whose edges specify the temporal order of events. On the right of Figure 22, we show an example of an episode pattern that is a parallel combination of two serial episodes < F2= C, F1= A > and < F3= D, F3= B >. Note that this episode represents only a partial order because the relation between F2= C and F3= D is unspecified. On the left of Figure 22, we show the occurrences of this episode in a multivariate symbolic sequence using a sliding window.

Figure 22: An example showing the occurrences of episode pattern (< F2= C, F1= A >, < F3= D, F3= B >) in a multivariate symbolic sequence using a sliding window.
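The sliding-window view of episode occurrences can be sketched as follows. The code counts the windows in which a parallel combination of serial episodes occurs, checking each serial episode independently inside the window. This is a simplification under stated assumptions: events are (timestamp, label) pairs with distinct integer timestamps, the serial episodes do not share event labels, and only windows that start at a timestamp between the first and last event are considered, which is not necessarily the window definition used in [Mannila et al., 1997]. All function names and the example sequence are hypothetical and do not reproduce Figure 22.

```python
def occurs_serial(window_events, episode):
    """True if the labels of `episode` appear in this order within the window."""
    pos = 0
    for _, label in window_events:
        if pos < len(episode) and label == episode[pos]:
            pos += 1
    return pos == len(episode)


def occurs_parallel_of_serials(window_events, serial_episodes):
    """Parallel combination: every serial episode must occur in the window,
    with no order imposed between different serial episodes."""
    return all(occurs_serial(window_events, ep) for ep in serial_episodes)


def count_windows(events, serial_episodes, width):
    """Return (number of windows containing the episode, number of windows)."""
    times = [t for t, _ in events]
    starts = range(min(times), max(times) + 1)
    hits = 0
    for s in starts:
        window = [(t, label) for t, label in events if s <= t < s + width]
        hits += occurs_parallel_of_serials(window, serial_episodes)
    return hits, len(starts)


# Hypothetical multivariate sequence; labels encode variable=value events.
events = [(1, "F2=C"), (2, "F3=D"), (3, "F1=A"), (4, "F3=B"), (6, "F1=A"), (7, "F2=C")]
episode = [("F2=C", "F1=A"), ("F3=D", "F3=B")]

print(count_windows(events, episode, width=4))  # (1, 7)
print(count_windows(events, episode, width=3))  # (0, 7)
```

Narrowing the window from 4 to 3 time units removes the only occurrence, because F3=D and F3=B no longer fall into a window that also contains F2=C followed by F1=A.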

Although these episodes are able to represent partial order patterns, not every partial order pattern can be represented as a combination of serial and parallel episodes. [Casas-Garriga, 2005] proposed an algorithm for mining all partial order patterns. However, this algorithm is computationally very expensive and does not scale to large data sets.