On Recursive Edit Distance Kernels with Application to Time Series Classification



Pierre-François Marteau, Sylvie Gibet

To cite this version:

Pierre-François Marteau, Sylvie Gibet. On Recursive Edit Distance Kernels with Application to Time Series Classification. IEEE Transactions on Neural Networks and Learning Systems, IEEE, 2014, 26 (6), pp.1121-1133. <10.1109/TNNLS.2014.2333876>. <hal-00486916v12>

HAL Id: hal-00486916

https://hal.archives-ouvertes.fr/hal-00486916v12

Submitted on 25 May 2014



Pierre-François Marteau, Member, IEEE, and Sylvie Gibet, Member, IEEE

Abstract—This paper proposes some extensions to the work on kernels dedicated to string or time series global alignment based on the aggregation of scores obtained by local alignments. The extensions that we propose make it possible to construct, from the classical recursive definition of elastic distances, recursive edit distance (or time-warp) kernels that are positive definite if some sufficient conditions are satisfied. The sufficient conditions we end up with are original and weaker than those proposed in earlier works, although a recursive regularizing term is required to obtain the proof of positive definiteness as a direct consequence of Haussler's convolution theorem. Furthermore, positive definiteness is maintained when a symmetric corridor is used to reduce the search space, and thus the algorithmic complexity, which is quadratic in the worst case. The classification experiment we conducted on three classical time-warp distances (two of which are metrics), using a Support Vector Machine classifier, leads to the conclusion that, when the pairwise distance matrix obtained from the training data is far from definiteness, the positive definite recursive elastic kernels generally outperform the distance substituting kernels for several classical elastic distances we have tested.

Index Terms—Edit distance, Dynamic Time Warping, Recursive kernel, Time series classification, Support Vector Machine, Definiteness.


1 INTRODUCTION

Elastic similarity measures such as Dynamic Time Warping (DTW) or Edit Distances have proved to be quite efficient compared to non-elastic similarity measures such as Euclidean measures or LP norms when addressing tasks that require the matching of time series data, in particular time series clustering and classification. A wide scope of applications in different domains, such as physics, chemistry, finance, bio-informatics, network monitoring, etc., have demonstrated the benefits of using elastic measures. A natural question to ask at this point is whether we can develop and use elastic kernel methods based on such elastic distances. This requires addressing the existence and construction of Reproducing Edit Distance Hilbert Spaces for a given elastic measure, i.e. a functional vector space endowed with a dot product in which edit distances or time warping algorithms can be defined. Unfortunately, it seems that common elastic measures that are derived from DTW or, more generally, from dynamic programming recursive algorithms are not directly induced by an inner product of any sort, even when such measures are metrics. One can conjecture that it is not possible to embed time series in a Hilbert space having a time-warp capability using these classical elastic measures; nevertheless, we can propose (close) regularized variants for which such a kernel construction is possible.

• P.F. Marteau and Sylvie Gibet are with the Univ. Bretagne Sud, UMR 6074, IRISA, F-56000 Vannes, France. E-mail: {Pierre-Francois.Marteau, Sylvie.Gibet}(AT)univ-ubs.fr

This paper aims at (re-)exploring this issue and, following earlier works ([1], [2], [3], [4], [5]), proposes Recursive Edit Distance Kernels (REDK) constructions that try to preserve the properties of the elastic measures from which they are derived, while offering the possibility of embedding time series in Time Warped Hilbert Spaces. The main contributions of the paper are as follows:

1) we verify the indefiniteness of the main time-warp measures used in the literature,

2) we propose a new method to construct positive definite kernels from classical time-warp or edit measures. This method is quite general and less restrictive than previous ones, although it requires the introduction of a recursive regularizing term that allows us to prove the definiteness of our kernels as a direct consequence of Haussler's convolution theorem,

3) we show that the regularizing approach we develop also applies when a symmetric corridor is used to limit the search space used to evaluate the elastic measure,

4) we experiment with and compare the proposed kernels on some time series classification tasks, using a large variety of time series data sets to estimate in practice the benefit we can expect from such kernels.

The paper is organized as follows: the second section synthesizes the related works; the third section introduces the notation and mathematical background that is used throughout the paper. The fourth section develops the construction of a general REDK and details some instantiations derived from classical elastic measures. The fifth section gathers classification experiments on a wide range of time series data and compares REDK accuracies with classical elastic measures. The sixth section proposes a conclusion and further research perspectives. Appendix A states the indefiniteness of classical elastic measures, and appendix B presents the proof of our main results.

2 RELATED WORKS

During the last decades, the use of kernel-based methods in pattern analysis has provided numerous results and fruitful applications in various domains such as biology, statistics, networking, signal processing, etc. Some of these domains, such as bioinformatics, or more generally domains that rely on sequence or time series models, require the analysis and processing of variable length vectors, sequences or timestamped data. Various methods and algorithms have been developed to quantify the similarity of such objects. From the original dynamic programming [6] implementation of the symbolic edit distance [7] by Wagner and Fischer [8], the Smith and Waterman (SW) algorithm [9] has been designed to evaluate the similarity between two symbolic sequences by means of a local gap alignment. More efficient local heuristics have since been proposed to meet the massive symbolic data challenge, such as BLAST [10] or FASTA [11]. Similarly, dynamic time warping measures have been developed to evaluate the similarity between digital time series or timestamped data [12], [13], and more recently [14], [15] have proposed elastic metrics dedicated to such digital data.

The ability to construct, from such elastic distances, kernels with elastic or time-warp properties that allow time series to be embedded into vector spaces (Euclidean or not) has attracted attention (e.g. [2][16][17]). Indeed, significant benefits can be expected from applications of kernel-based machine learning algorithms to variable length data, or more generally data for which some elastic matching has a meaning. Among the kernel machine algorithms applicable to discrimination or regression tasks, Support Vector Machines (SVM) [18], [19], [20] are still reported to yield state-of-the-art performances, although their accuracy greatly depends on the exploited kernel.

The definition of good kernels from known elastic or time-warp distances applicable to data of variable lengths has been a major challenge since the 1990s. The notion of 'goodness' has rapidly been associated with the concept of definiteness. Basically, SVM algorithms involve an optimization process whose solution is proved to be uniquely defined if and only if the kernel is positive definite: in that case the objective function to optimize is quadratic and the optimization problem convex. Nevertheless, if the definiteness of kernels is an issue, in practice, many situations exist where definite kernels are not applicable. This seems to be the case for the main elastic measures traditionally used to estimate the similarity of objects of variable lengths. A pragmatic approach consists of using indefinite kernels [2][17][21], although contradictory results have been reported about the impact of definiteness or indefiniteness of kernels on the empirical performances of SVMs. The sub-optimality of the non-convex optimization process is possibly one of the causes leading to these unguaranteed performances [22], [2]. Regularization procedures have been proposed to locally approximate indefinite kernel functions by definite ones with some benefits. Among others, some approaches apply direct spectral transformations to indefinite kernels. These methods [23] [24] consist in i) flipping the negative eigenvalues or shifting the eigenvalues using the minimal shift value required to make the spectrum of eigenvalues positive, and ii) reconstructing the kernel with the original eigenvectors in order to produce a positive semidefinite kernel. Yet, in general, 'convexification' procedures are difficult to interpret geometrically and the expected effect of the original indefinite kernel may be lost. Some theoretical highlights have been provided through approaches that consist in embedding the data into a pseudo-Euclidean (pE) space and in formulating the classification problem with an indefinite kernel, such as minimizing the distance between convex hulls formed from the two categories of data embedded in the pE space [17]. The geometric interpretation results in a constructive method allowing for the understanding, and in some cases the prediction, of the classification behavior of an indefinite kernel SVM in the corresponding pE space.

Some other works, like [25], [26], address the construction of elastic kernels for time series analysis through a decomposition of time series as a sum of local low degree polynomials; using a resampling process, the piecewise approximations of the time series are embedded into a proper so-called Reproducing Kernel Hilbert Space in which the SVM is learned.

Other approaches try to use the elastic distance directly in the kernel construction, without any approximation or resampling process. These works are based on the work of Haussler on convolution kernels [1] defined on a set of discrete structures such as strings, trees, or graphs. The iterative method that is developed is generative, as it allows for the building of complex kernels from the convolution of simple local kernels. Following the work of Haussler [1], Saigo et al. [27] define, from the Smith and Waterman algorithm [9], a kernel to detect local alignment between strings by convolving simpler kernels. These authors show that the Smith and Waterman distance measure, dedicated to determining similar regions between two nucleotide or protein sequences, is not definite, but is nevertheless connected to the logarithm of a point-wise limit of a series of definite convolution kernels. Cuturi et al. [5] have adapted this approach to time series alignments covering the DTW elastic distance. In fact, these previous studies have very general implications, the first one being that classical elastic measures can also be understood as the limit of a series of definite convolution kernels.

In this paper we tackle a new approach to the positive definiteness regularization of elastic kernels that directly follows the lines of Haussler's theorem on convolution kernels. The condition to obtain positive definiteness is weaker than the one proposed in previous works [5], although it requires the addition of an explicit recursive regularizing term. This term is easy to evaluate and has the same complexity as the recursive equation that defines the elastic distance. Our regularization strategy is thus quite general, independent of the data (although a parameter can be optimized), and can be applied to a very large family of editing or dynamic time warping like distances. It also applies to some of their variants which exploit a symmetric corridor to speed up the computation.

3 NOTATIONS AND MATHEMATICAL BACKGROUNDS

We give in this section commonly used definitions, with few details, for metric, kernel and definiteness, sequence set, and classical elastic measures.

3.1 Kernel and definiteness

A very large literature exists on kernels, among which [28], [20] and [29] present a large synthesis of major results. We give hereinafter some basic definitions and some results derived from these previous references.

Definition 3.1: A kernel on a non-empty set $U$ refers to a complex (or real) valued symmetric function $\varphi(x, y) : U \times U \to \mathbb{C}$ (or $\mathbb{R}$).

Definition 3.2: Let $U$ be a non-empty set. A function $\varphi : U \times U \to \mathbb{C}$ is called a positive (resp. negative) definite kernel if and only if it is Hermitian (i.e. $\varphi(x, y) = \overline{\varphi(y, x)}$ for all $x$ and $y$ in $U$, where the overline stands for the conjugate number) and $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \ge 0$ (resp. $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \le 0$), for all $n$ in $\mathbb{N}$, $(x_1, x_2, ..., x_n) \in U^n$ and $(c_1, c_2, ..., c_n) \in \mathbb{C}^n$.

Definition 3.3: Let $U$ be a non-empty set. A function $\varphi : U \times U \to \mathbb{C}$ is called a conditionally positive (resp. conditionally negative) definite kernel if and only if it is Hermitian (i.e. $\varphi(x, y) = \overline{\varphi(y, x)}$ for all $x$ and $y$ in $U$) and $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \ge 0$ (resp. $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \le 0$), for all $n \ge 2$ in $\mathbb{N}$, $(x_1, x_2, ..., x_n) \in U^n$ and $(c_1, c_2, ..., c_n) \in \mathbb{C}^n$ with $\sum_{i=1}^{n} c_i = 0$.

In the last two definitions above, it is easy to show that it is sufficient to consider mutually different elements in $U$, i.e. collections of distinct elements $x_1, x_2, ..., x_n$. This is what we will consider for the remainder of the paper.

Definition 3.4: A positive (resp. negative) definite kernel defined on a finite set $U$ is also called a positive (resp. negative) semidefinite matrix. Similarly, a positive (resp. negative) conditionally definite kernel defined on a finite set is also called a positive (resp. negative) conditionally semidefinite matrix.

For convenience, we will use p.d. and c.p.d. for positive definite and conditionally positive definite to characterize either a kernel or a matrix having these properties.

Constructing p.d. kernels from c.p.d. kernels is quite straightforward. For instance, if $-\varphi$ is a c.p.d. kernel on a set $U$ and $x_0 \in U$ then, according to [28], $\psi(x, y) = \varphi(x, x_0) + \varphi(y, x_0) - \varphi(x, y) - \varphi(x_0, x_0)$ is a p.d. kernel, and so are $e^{\psi(x,y)}$ and $e^{-\varphi(x,y)}$. The converse is also true. Furthermore, $e^{-t\varphi(x,y)}$ is p.d. for $t > 0$ if $-\varphi$ is c.p.d. We will precisely use this last result to construct p.d. kernels from classical elastic distances.
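As an illustrative check of this last result (not part of the original paper), the sketch below builds the Gram matrix of $e^{-t\varphi}$ for $\varphi(x, y) = |x - y|$ on a few real numbers and tests it with a plain Cholesky factorization. The choice of $\varphi$ is an assumption made here because $-|x - y|$ is a standard c.p.d. kernel on the real line; the helper names `gram` and `is_positive_definite` are hypothetical.

```python
import math

def gram(points, t=1.0):
    """Gram matrix of exp(-t * |x - y|) over a list of real numbers."""
    return [[math.exp(-t * abs(x - y)) for y in points] for x in points]

def is_positive_definite(m, tol=1e-12):
    """Try a Cholesky factorization of a symmetric matrix.

    Returns True only when every pivot is larger than tol, i.e. when the
    matrix is strictly positive definite up to that tolerance.
    """
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False  # non-positive pivot: not p.d.
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True

points = [0.0, 0.5, 1.3, 2.0, 3.7]  # distinct sample points (arbitrary)
print(is_positive_definite(gram(points, t=0.8)))
```

The same test applied to a Gram matrix built from an elastic distance is one simple way to observe indefiniteness on a finite sample.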

3.2 Sequence set

Definition 3.5: Let $U$ be the set of finite sequences (symbolic sequences or time series): $U = \{A_1^p \mid p \in \mathbb{N}\}$. $A_1^p$ is a sequence with discrete index varying between $1$ and $p$. We note $\Omega$ the empty sequence (with null length) and by convention $A_1^0 = \Omega$, so that $\Omega$ is a member of set $U$. $|A|$ denotes the length of the sequence $A$. Let $U_p = \{A \in U \mid |A| \le p\}$ be the set of sequences whose length is shorter than or equal to $p$.

Definition 3.6: Let $A$ be a finite sequence. Let $A(i)$ be the $i$th element (symbol or sample) of sequence $A$. We will consider that $A(i) \in S \times T$, where $S$ embeds the multidimensional space variables (either symbolic or numeric) and $T \subset \mathbb{R}$ embeds the time stamp variable, so that we can write $A(i) = (a(i), t(i))$ where $a(i) \in S$ and $t(i) \in T$, with the condition that $t(i) > t(j)$ whenever $i > j$ (time stamps strictly increase in the sequence of samples). $A_i^j$ with $i \le j$ is the subsequence consisting of the $i$th through the $j$th element (inclusive) of $A$, so $A_i^j = A(i)A(i+1)...A(j)$. $A_i^j$ with $i > j$ is the null time series, i.e. $\Omega$. Furthermore, if $i > |A|$ then, by convention, $A(i) = \Omega$.

3.3 General Edit/Elastic distance on a sequence set

Definition 3.7: An edit operation is a pair $(a, b) \in ((S \times T) \cup \{\Omega\})^2$ written $a \to b$. The sequence $B$ results from the application of the edit operation $a \to b$ to sequence $A$, written $A \Rightarrow B$ via $a \to b$, if $A = \sigma a \tau$ and $B = \sigma b \tau$ for some sub-sequences $\sigma$ and $\tau$. We call $a \to b$ a substitution operation if $a \ne \Omega$ and $b \ne \Omega$, a delete operation if $b = \Omega$, and an insert operation if $a = \Omega$.

For any pair of sequences $A_1^p, B_1^q$, we consider the extensions $A_0^p, B_0^q$ whose first element is $\Omega$. Each elementary edit operation related to position $0 \le i \le p$ in sequence $A$ and to position $0 \le j \le q$ in sequence $B$ is associated with a cost value $\Gamma_{A(i) \to B(j)}(A_1^p, B_1^q)$, $\Gamma_{A(i) \to \Omega, B, j}(A_1^p, B_1^q)$ or $\Gamma_{\Omega, A, i \to B(j)}(A_1^p, B_1^q) \in \mathbb{R}$. To simplify this notation, we will simply write $\Gamma(A(i) \to B(j))$, $\Gamma(A(i) \to \Omega_{B(j)})$ or $\Gamma(\Omega_{A(i)} \to B(j))$, with the understanding that the suppression cost $\Gamma(A(i) \to \Omega_{B(j)})$ (respectively the insertion cost $\Gamma(\Omega_{A(i)} \to B(j))$) may depend on the current location $j$ in sequence $B$ (respectively on the location $i$ in sequence $A$).

Hence we consider that $\Omega_{X(k)}$ is a function from $U \times \mathbb{N}$ to $(S \times T) \cup \{\Omega\}$.

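Definition 3.8: A function $\delta : U \times U \to \mathbb{R}$ is called an edit distance defined on $U$ if, for any pair of sequences $A_1^p, B_1^q$, the following recursive equation is satisfied:

$$
\delta(A_1^p, B_1^q) = \mathrm{Min}
\begin{cases}
\delta(A_1^{p-1}, B_1^q) + \Gamma(A(p) \to \Omega_{B(q)}) & \text{del} \\
\delta(A_1^{p-1}, B_1^{q-1}) + \Gamma(A(p) \to B(q)) & \text{sub} \\
\delta(A_1^p, B_1^{q-1}) + \Gamma(\Omega_{A(p)} \to B(q)) & \text{ins}
\end{cases}
$$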

Note that not all edit/elastic distances are metric. In particular, the dynamic time warping distance does not satisfy the triangle inequality.

3.3.1 Levenshtein distance

The Levenshtein distance $\delta_{lev}(x, y)$ has been defined for string matching. For this edit distance, the delete and insert operations induce unitary costs, i.e. $\Gamma(A(p) \to \Omega_{B(q)}) = \Gamma(\Omega_{A(p)} \to B(q)) = 1$, while the substitution cost is null if $A(p) = B(q)$ and $1$ otherwise. Thus, for this distance, we consider that the functional term $\Omega_{X(k)} = \Omega$ for all $X$ and $k$, since the suppression and insertion costs do not depend on the suppressed or inserted element respectively.
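The recursion of Definition 3.8 with these unit costs can be sketched as a standard dynamic programming table. This is a minimal illustration, assuming the usual initialization in which aligning a prefix with the empty sequence costs its length; the function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein distance between two sequences, computed bottom-up."""
    p, q = len(a), len(b)
    # d[i][j] is the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        d[i][0] = i  # i deletions to reach the empty sequence
    for j in range(1, q + 1):
        d[0][j] = j  # j insertions from the empty sequence
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            sub_cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,             # del
                d[i - 1][j - 1] + sub_cost,  # sub
                d[i][j - 1] + 1,             # ins
            )
    return d[p][q]

print(levenshtein("kitten", "sitting"))
```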

3.3.2 Dynamic time warping

The DTW similarity measure $\delta_{dtw}$ [12][13] is defined according to the previous notations as:

$$
\delta_{dtw}(A_1^p, B_1^q) = d_{LP}(a(p), b(q)) + \mathrm{Min}
\begin{cases}
\delta_{dtw}(A_1^{p-1}, B_1^q) & \text{del} \\
\delta_{dtw}(A_1^{p-1}, B_1^{q-1}) & \text{sub} \\
\delta_{dtw}(A_1^p, B_1^{q-1}) & \text{ins}
\end{cases}
$$

where $d_{LP}(x, y)$ is the $L_P$ norm of vector $(x - y)$ in $S$, and so for DTW, $\Gamma(A(p) \to \Omega_{B(q)}) = \Gamma(A(p) \to B(q)) = \Gamma(\Omega_{A(p)} \to B(q)) = d_{LP}(a(p), b(q))$. Thus, for this distance, $\Omega_{X(k)} = X(k)$ for all $X$ and $k$. Let us note that the time stamp values are not used, therefore the costs of each edit operation involve vectors $a$ and $b$ in $S$ instead of vectors $(a, t_a)$ and $(b, t_b)$ in $S \times T$. Furthermore, $\delta_{dtw}$ does not comply with the triangle inequality, as shown in [14].
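A minimal sketch of this recursion is given below, under assumptions that are not stated in the text: samples are scalars, the local cost is the absolute difference, and aligning a non-empty prefix with the empty sequence has infinite cost (the usual DTW boundary condition). The function name `dtw` is hypothetical. The small example at the end illustrates a violation of the triangle inequality.

```python
import math

def dtw(a, b):
    """DTW measure between two scalar series with |x - y| as local cost."""
    p, q = len(a), len(b)
    # d[i][j] is the DTW value between the prefixes a[:i] and b[:j].
    d = [[math.inf] * (q + 1) for _ in range(p + 1)]
    d[0][0] = 0.0  # two empty series align at no cost (assumed boundary)
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # del
                                 d[i - 1][j - 1],  # sub
                                 d[i][j - 1])      # ins
    return d[p][q]

# Triangle inequality check on three short series.
x, y, z = [1, 1, 1, 1], [1], [2]
print(dtw(x, z), dtw(x, y) + dtw(y, z))
```

Here the direct value between `x` and `z` exceeds the sum of the values through `y`, because every sample of `x` must be matched against the single sample of `z`.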

3.3.3 Edit Distance with real penalty

$$
\delta_{erp}(A_1^p, B_1^q) = \mathrm{Min}
\begin{cases}
\delta_{erp}(A_1^{p-1}, B_1^q) + \Gamma(A(p) \to \Omega_{B(q)}) & \text{del} \\
\delta_{erp}(A_1^{p-1}, B_1^{q-1}) + \Gamma(A(p) \to B(q)) & \text{sub} \\
\delta_{erp}(A_1^p, B_1^{q-1}) + \Gamma(\Omega_{A(p)} \to B(q)) & \text{ins}
\end{cases}
$$

with

$$
\begin{aligned}
\Gamma(A(p) \to \Omega_{B(q)}) &= d_{LP}(a(p), g) \\
\Gamma(A(p) \to B(q)) &= d_{LP}(a(p), b(q)) \\
\Gamma(\Omega_{A(p)} \to B(q)) &= d_{LP}(g, b(q))
\end{aligned}
$$

where $g$ is a constant in $S$. For this distance, $\Omega_{X(k)} = g$ for all $X$ and $k$.

Note that the time stamp coordinate is not taken into account, therefore $\delta_{erp}$ is a distance on $S$ but not on $S \times T$. Thus, the cost of each edit operation involves vectors $a$ and $b$ in $\mathbb{R}^k$ instead of vectors $(a, t_a)$ and $(b, t_b)$ in $\mathbb{R}^{k+1}$. According to the authors of ERP [14], the constant $g$ should be set to $0$ for some intuitive geometric interpretation and in order to preserve the mean value of the transformed time series when adding gap samples.
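The ERP recursion can be sketched in the same way. The sketch assumes scalar samples, the absolute difference as local cost, and a boundary condition in which aligning a prefix with the empty sequence costs the sum of the gap costs of its samples; that boundary is an assumption of this example, as is the function name `erp`.

```python
def erp(a, b, g=0.0):
    """Edit distance with real penalty between two scalar series."""
    p, q = len(a), len(b)
    # d[i][j] is the ERP value between the prefixes a[:i] and b[:j].
    d = [[0.0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        d[i][0] = d[i - 1][0] + abs(a[i - 1] - g)  # delete every sample of a
    for j in range(1, q + 1):
        d[0][j] = d[0][j - 1] + abs(g - b[j - 1])  # insert every sample of b
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            d[i][j] = min(
                d[i - 1][j] + abs(a[i - 1] - g),             # del
                d[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # sub
                d[i][j - 1] + abs(g - b[j - 1]),             # ins
            )
    return d[p][q]

print(erp([1, 2, 3], [1, 3]))
```

With `g = 0`, the cheapest alignment in this example deletes the sample `2`, at a cost equal to its distance to the gap value.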

3.3.4 Time warp edit distance

Time Warp Edit Distance (TWED) [30], [15] is defined similarly to the edit distance defined for strings [7][8]. The similarity between any two time series $A$ and $B$ of finite length, respectively $p$ and $q$, is defined as:

$$
\delta_{twed}(A_1^p, B_1^q) = \mathrm{Min}
\begin{cases}
\delta_{twed}(A_1^{p-1}, B_1^q) + \Gamma(A(p) \to \Omega_{B(q)}) & \text{del}_A \\
\delta_{twed}(A_1^{p-1}, B_1^{q-1}) + \Gamma(A(p) \to B(q)) & \text{subs} \\
\delta_{twed}(A_1^p, B_1^{q-1}) + \Gamma(\Omega_{A(p)} \to B(q)) & \text{del}_B
\end{cases}
$$

with

$$
\begin{aligned}
\Gamma(A(p) \to \Omega_{B(q)}) &= d(A(p), A(p-1)) + \lambda \\
\Gamma(A(p) \to B(q)) &= d(A(p), B(q)) + d(A(p-1), B(q-1)) \\
\Gamma(\Omega_{A(p)} \to B(q)) &= d(B(q), B(q-1)) + \lambda
\end{aligned}
$$

where $\lambda$ is a positive constant that represents a gap penalty. The time stamps are exploited to evaluate $d(A(p), B(q))$; in practice, $d(A(p), B(q)) = d_{LP}(a(p), b(q)) + \nu\, d_{LP}(t(p), t(q))$, where $\nu$ is a non-negative constant which characterizes the stiffness of the $\delta_{twed}$ elastic measure along the time axis. Furthermore, for this distance, we consider that the functional term $\Omega_{X(k)} = \Omega$ for all $X$ and $k$, since the suppression and insertion costs do not depend on the suppressed or inserted element respectively.
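The TWED recursion can be sketched as follows. Several choices here are assumptions of this example rather than statements of the text: samples are scalars, the local cost uses absolute differences, each series is padded with an initial sample of value 0 at time 0 to give meaning to $A(p-1)$ when $p = 1$, and aligning a non-empty prefix with the empty sequence has infinite cost. The function name `twed` and its parameter names are hypothetical.

```python
import math

def twed(a, ta, b, tb, nu=1.0, lam=1.0):
    """Time warp edit distance between two time-stamped scalar series."""
    # Pad both series with an assumed initial sample (value 0, time 0).
    a, ta = [0.0] + list(a), [0.0] + list(ta)
    b, tb = [0.0] + list(b), [0.0] + list(tb)
    p, q = len(a) - 1, len(b) - 1

    def dist(x, tx, y, ty):
        # Local cost on samples plus stiffness-weighted cost on time stamps.
        return abs(x - y) + nu * abs(tx - ty)

    d = [[math.inf] * (q + 1) for _ in range(p + 1)]
    d[0][0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            del_a = d[i - 1][j] + dist(a[i], ta[i], a[i - 1], ta[i - 1]) + lam
            subs = (d[i - 1][j - 1]
                    + dist(a[i], ta[i], b[j], tb[j])
                    + dist(a[i - 1], ta[i - 1], b[j - 1], tb[j - 1]))
            del_b = d[i][j - 1] + dist(b[j], tb[j], b[j - 1], tb[j - 1]) + lam
            d[i][j] = min(del_a, subs, del_b)
    return d[p][q]

print(twed([1, 1], [1, 2], [1], [1], nu=1.0, lam=1.0))
```

In the printed example the second sample of the first series must be deleted, which costs the time gap weighted by `nu` plus the penalty `lam`.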

3.4 Indefiniteness of elastic distance kernels

In appendix A, we give counterexamples, one for each previously defined elastic distance, showing that these distances do not lead to definite kernels. This demonstrates that the metric properties of a distance defined on $U$, in particular the triangle inequality, are not sufficient conditions to establish definiteness (conditionally or not) of the associated distance kernel. One could conjecture that elastic distances cannot be definite (conditionally or not), possibly because of the presence of the max or min operators in the recursive equation. In the following sections, we will see that replacing these min or max operators by a sum operator makes it possible, under some conditions, to construct series of positive definite kernels whose limit is quite directly connected to the previously addressed elastic distance kernels.


4 CONSTRUCTING POSITIVE DEFINITE KERNELS FROM ELASTIC DISTANCE

The main idea leading to the construction of positive definite kernels from a given elastic distance defined on $U$ is to replace the min or max operator in the recursive equation defining the elastic distance by a summation ($\sum$) operator. Instead of keeping only one of the best alignment paths, the new kernel will sum up the costs of all the existing sub-sequence alignment paths with some weighting factor that will favor good alignments while penalizing bad alignments. In addition, this weighting factor can be optimized. This principle has been applied successfully to the Smith and Waterman symbolic distance, which is also known to be indefinite [27], and more recently to dynamic time warping kernels for time series alignment [5]. Although we basically follow the same objective, the approach that we propose is quite different, since it is based on a regularization principle which is simply and directly expressed in the recursive equation defining the elastic distance. First we replace the min (or max) operator by a summation operator and introduce a symmetric corridor function to optionally limit the summation. Then we add a regularizing recursive term such that the proof of the positive definiteness property can be understood as a direct consequence of Haussler's convolution theorem, as shown in appendix B. Basically, everything reduces to the proper inventory of pairs of symmetric alignment paths.

4.1 Recursive accumulation of f-cost products

Definition 4.1: A function $C(., .) : U \times U \to \mathbb{R}$ is called a Recursive Accumulation of f-cost Products (RAfP) if, for any pair of sequences $A_1^p, B_1^q$, there exist a function $f : \mathbb{R} \to \mathbb{R}$ and three indicator functions $h_d$, $h_s$ and $h_i : \mathbb{N}^2 \to \{0, 1\}$ such that the following recursive equation is satisfied:

$$
C(A_1^p, B_1^q) = \frac{1}{c} \sum
\begin{cases}
h_d(p, q)\, C(A_1^{p-1}, B_1^q)\, f(\Gamma_d(A, p, B, q)) & \text{del} \\
h_s(p, q)\, C(A_1^{p-1}, B_1^{q-1})\, f(\Gamma_s(A, p, B, q)) & \text{sub} \\
h_i(p, q)\, C(A_1^p, B_1^{q-1})\, f(\Gamma_i(A, p, B, q)) & \text{ins}
\end{cases}
\qquad (1)
$$

where $c$ is a positive constant. This recursive definition relies on an initialization. To that end we set $C(\Omega, \Omega) = \xi$, where $\xi$ is a real non-negative constant, typically $\xi = 1$, and $\Omega$ is the null sequence. Note that if no path satisfying the indicator functions exists when aligning time series $A$ and $B$, then $C(A, B) = 0$.

Furthermore, according to the indicator functions $h_d$, $h_s$ and $h_i$, this type of construction sums up the products of the local quantities $f(\Gamma(A(i) \to B(j)))$, which we call f-costs, for all or for some of the possible alignment paths that exist between the two time series, the concept of alignment path being precisely defined in appendix B (see definition B.3).

In this paper we are specifically concerned with the following two symmetric RAfP families. The first one (Eq. 2), characterized by $\forall (p, q) \in \mathbb{N}^2$, $h_d(p, q) = h(p-1, q)$, $h_i(p, q) = h(p, q-1)$ and $h_s(p, q) = h(p-1, q-1)$, accumulates the f-cost products along all the alignment paths between the two time series that exist inside the symmetric corridor defined by $h(p, q)$. For the case depicted in Figure 1, $h(p, q) = 1$ whenever $|p - q| \le 3$, and $h(p, q) = 0$ otherwise. We denote this RAfP function $C_h$.

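$$
C_h(A_1^p, B_1^q) = \frac{1}{c} \sum
\begin{cases}
h(p-1, q)\, C_h(A_1^{p-1}, B_1^q)\, f(\Gamma(A(p) \to \Omega_{B(q)})) \\
h(p-1, q-1)\, C_h(A_1^{p-1}, B_1^{q-1})\, f(\Gamma(A(p) \to B(q))) \\
h(p, q-1)\, C_h(A_1^p, B_1^{q-1})\, f(\Gamma(\Omega_{A(p)} \to B(q)))
\end{cases}
\qquad (2)
$$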

The second RAfP function (Eq. 3) accumulates the f-cost products of all mappings characterized by simultaneous insertions, deletions and substitutions. Hence the substitution, deletion and insertion operations are simultaneously performed on the same index value. This function is thus characterized by $\forall (p, q) \in \mathbb{N}^2$, $h_d(p, q) = h(p-1, q)$ and $h_i(p, q) = h(p, q-1)$, where $h$ is a symmetric function, and $h_s(p, q) = \delta_{p,q}\, h(p-1, q-1)$, where $\delta$ is the Kronecker symbol. We denote this second RAfP function $C_{h,\delta}$.

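$$
C_{h,\delta}(A_1^p, B_1^q) = \frac{1}{c} \sum
\begin{cases}
h(p-1, q)\, C_{h,\delta}(A_1^{p-1}, B_1^q)\, f(\Gamma(\Omega_{A(p)} \to \Omega_{B(p)})) \\
\delta_{p,q}\, h(p-1, q-1)\, C_{h,\delta}(A_1^{p-1}, B_1^{q-1})\, f(\Gamma(A(p) \to B(q))) \\
h(p, q-1)\, C_{h,\delta}(A_1^p, B_1^{q-1})\, f(\Gamma(A(q) \to B(q)))
\end{cases}
\qquad (3)
$$

4.2 Recursive Edit Distance Kernel - Main result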

Definition 4.2: We call Recursive Edit Distance Kernel (REDK) the function $K(., .) = C_h(., .) + C_{h,\delta}(., .) : U \times U \to \mathbb{R}$.

The following theorem, which relates to the ones presented in [5] and [31], establishes sufficient conditions on $f(\Gamma(a \to b))$ for a REDK function to be positive definite, and thus is a basis for the construction of definite REDK. The proof of this theorem, which is given in Appendix B, is a direct consequence of Haussler's R-convolution theorem [1]. The sufficient condition that is proposed here is quite general and less restrictive than those proposed by [5] and [31]. The avenue that is taken here only requires that the local kernel $k(x, y)$ is PD. In previous works, the condition for positive definiteness is that the local kernel $k(x, y)$ and the ratio $k(x, y)/(k(x, y) + 1)$ are jointly PD. Our new condition is obviously weaker than the one proposed in [9]. Furthermore, our REDK construction applies to a wider range of edit or time warp distances developed either for symbolic sequence or time series matching.


Theorem 4.3: REDK is a positive definite kernel on $U \times U$ iff the local kernel $k(x, y) = f(\Gamma(x \to y))$ is a positive definite kernel on $((S \times T) \cup \{\Omega\})^2$.

A sketch of proof for theorem 4.3 is given in the appendix B.

As the cost function $\Gamma$ is, in general, conditionally negative definite, choosing the exponential for $f(.)$ ensures that $f(\Gamma(x \to y))$ is a positive definite kernel [32].

Other functions can be used, such as the Inverse Multi Quadric kernel $k(x, y) = \frac{1}{\sqrt{(\Gamma(x \to y))^2 + \theta^2}}$. As with the exponential (Gaussian or Laplace) kernel, the Inverse Multi Quadric kernel results in a positive definite matrix with full rank [33] and thus forms an infinite-dimensional feature space. Note here that Cuturi et al. [5] state that, to ensure the definiteness of the DTW-REDK kernel, not only $f(\Gamma(x \to y))$ but also $f(\Gamma(x \to y))/(1 + f(\Gamma(x \to y)))$ needs to be a PD kernel, which forms a stronger condition than the one we propose.

4.2.1 Computational cost of REDK

Although the number of paths that are summed up increases exponentially with the lengths $|A|$ and $|B|$ of the time series that are evaluated, the recursive computation of any RAfP function, and therefore of $K(A, B)$, leads to a quadratic computational cost $O(|A||B|)$, e.g. $O(N^2)$ if $N$ is the average length of the time series that are considered. This quadratic complexity can be reduced to a linear complexity by limiting the number of alignment paths to consider in the recursion. This can be achieved when using a search corridor [13], as long as the kernel remains symmetric, which is the case when processing time series of equal lengths and restraining the search space using, for instance, a fixed size corridor symmetrically displayed around the diagonal, as shown in Fig. 1. Note that any kind of corridor can be specified using the indicator functions $h_d$, $h_s$ and $h_i$.
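To give an order of magnitude for the growth of the number of paths, the sketch below counts the alignment paths between two prefixes with the same del/sub/ins recursion, optionally restricted to a corridor $|i - j| \le$ `width`. It assumes that every step is admissible along the borders of the alignment grid, which may differ from the boundary conventions of a given elastic measure; the names `count_paths` and `width` are hypothetical.

```python
def count_paths(p, q, width=None):
    """Count del/sub/ins alignment paths from cell (0, 0) to cell (p, q)."""
    def inside(i, j):
        # Symmetric corridor around the diagonal; None means no restriction.
        return width is None or abs(i - j) <= width

    n = [[0] * (q + 1) for _ in range(p + 1)]
    n[0][0] = 1
    for i in range(p + 1):
        for j in range(q + 1):
            if (i == 0 and j == 0) or not inside(i, j):
                continue  # cells outside the corridor keep a zero count
            total = 0
            if i > 0:
                total += n[i - 1][j]      # del
            if i > 0 and j > 0:
                total += n[i - 1][j - 1]  # sub
            if j > 0:
                total += n[i][j - 1]      # ins
            n[i][j] = total
    return n[p][q]

for size in (1, 2, 3, 10, 20):
    print(size, count_paths(size, size), count_paths(size, size, width=1))
```

The table itself only has $(p+1)(q+1)$ cells, and a corridor of fixed width leaves a number of non-zero cells that grows linearly with the length, which is the source of the complexity reduction mentioned above.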

4.3 Exponentiated REDK

Definition 4.4: The following equations define the two RAfP function instances entering into the construction of the exponentiated REDK, considering $f(\Gamma(. \to .)) = \frac{1}{c} e^{-\nu' \Gamma(. \to .)}$ with $c > 0$:

$$
C_{e,h}(A_1^p, B_1^q) = \frac{1}{c} \sum
\begin{cases}
h(p-1, q)\, C_{e,h}(A_1^{p-1}, B_1^q)\, e^{-\nu' \Gamma(A(p) \to \Omega_{B(q)})} \\
h(p-1, q-1)\, C_{e,h}(A_1^{p-1}, B_1^{q-1})\, e^{-\nu' \Gamma(A(p) \to B(q))} \\
h(p, q-1)\, C_{e,h}(A_1^p, B_1^{q-1})\, e^{-\nu' \Gamma(\Omega_{A(p)} \to B(q))}
\end{cases}
\qquad (4)
$$

$$
C_{e,h,\delta}(A_1^p, B_1^q) = \frac{1}{c} \sum
\begin{cases}
h(p-1, q)\, C_{e,h,\delta}(A_1^{p-1}, B_1^q)\, e^{-\nu' \Gamma(\Omega_{A(p)} \to \Omega_{B(p)})} \\
\delta_{pq}\, h(p-1, q-1)\, C_{e,h,\delta}(A_1^{p-1}, B_1^{q-1})\, e^{-\nu' \Gamma(A(p) \to B(q))} \\
h(p, q-1)\, C_{e,h,\delta}(A_1^p, B_1^{q-1})\, e^{-\nu' \Gamma(A(q) \to B(q))}
\end{cases}
\qquad (5)
$$

where $\nu'$ is a stiffness parameter that weights the contribution of the local elementary costs. The larger $\nu'$ is, the more selective the kernel is around the optimal paths. At the limit, when $\nu' \to \infty$, only the optimal path costs are summed up by the kernel. Note that, as is generally seen, several optimal paths leading to the same global cost exist, therefore $\lim_{\nu' \to +\infty} -\frac{1}{\nu'} \log(K_e(A, B))$ does not coincide with the elastic distance $\delta$ that involves the same corresponding elementary costs, although we expect it to be close.

Fig. 1. Example of alignment paths existing when matching time series $A$ and $B$. In red dotted line, a path whose editing f-cost product is accumulated by $C_h$ and $C_{h,\delta}$; in blue plain line, a path whose editing cost product is accumulated by $C_h$ but not $C_{h,\delta}$. A symmetric corridor of any form can be used to reduce the number of admissible alignment paths that are inventoried by a RAfP.

The constant $1/c$ (typically $1/3$) is used to maintain the global accumulation functions $C_{e,h}$ and $C_{e,h,\delta}$ in a range that is computationally acceptable.

The alignment cost of two null time series being $0$, we set $C_{e,h}(\Omega, \Omega) = C_{e,h,\delta}(\Omega, \Omega) = 1$ in this exponentiation context.
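As a rough illustration of the first accumulation function only (Eq. 4), the sketch below instantiates it with DTW-like costs, for which the three local costs of a cell are all equal to the distance between the two current samples. It is a sketch under assumptions that go beyond the text: samples are scalars, the local cost is the absolute difference, the corridor is $|i - j| \le$ `width`, and the accumulation is zero when a non-empty prefix is aligned with the empty sequence. It does not compute the regularizing term $C_{e,h,\delta}$, so its value alone is not the REDK. The names `c_eh` and `width` are hypothetical.

```python
import math

def c_eh(a, b, nu=1.0, c=3.0, width=None):
    """Accumulate exponentiated f-cost products over alignment paths."""
    p, q = len(a), len(b)

    def h(i, j):
        # Symmetric corridor indicator, evaluated at the predecessor cell.
        return 1.0 if width is None or abs(i - j) <= width else 0.0

    acc = [[0.0] * (q + 1) for _ in range(p + 1)]
    acc[0][0] = 1.0  # initialization on the pair of null sequences
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            # With DTW-like costs the three f-costs of a cell coincide.
            f = math.exp(-nu * abs(a[i - 1] - b[j - 1]))
            acc[i][j] = (f / c) * (h(i - 1, j) * acc[i - 1][j]
                                   + h(i - 1, j - 1) * acc[i - 1][j - 1]
                                   + h(i, j - 1) * acc[i][j - 1])
    return acc[p][q]

x = [0.0, 1.0, 2.0, 1.0]
y = [0.0, 2.0, 1.0, 1.0]
print(c_eh(x, y), c_eh(x, y, width=1), c_eh(x, x))
```

Increasing `nu` concentrates the accumulated value on the cheapest paths, while `width` discards the paths that leave the corridor.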

Proposition 4.5: Definiteness of the exponentiated REDK instance, $K_e$: $K_e(., .)$ is positive definite for the cost functions $\Gamma(A(p) \to \Omega_{B(q)})$, $\Gamma(A(p) \to B(q))$ and $\Gamma(\Omega_{A(p)} \to B(q))$ involved in the computation of the $\delta_{lev}$, $\delta_{dtw}$ and $\delta_{erp}$ distances.

The proof of proposition 4.5 is straightforward and is omitted.

The local cost function $\Gamma(\Omega_{A(p)} \to B(q))$ involved in the $\delta_{twed}$ distance is not established to meet the condition of theorem 4.3, and thus the local kernel is not yet guaranteed to be p.d. for this distance.

The REDKs constructed from these distances are referred to as $REDK_{lev}$, $REDK_{erp}$, $REDK_{dtw}$ and $REDK_{twed}$, respectively, in the rest of the paper.

4.3.1 Interpretation of the exponentiated REDK

In a REDK kernel, each alignment path is assigned a product of f-costs, that is, the product of the local f-cost functions attached to each edge of the path. For the exponentiated REDK, the local f-cost function, e.g.

$e^{-\nu' \Gamma(A(p) \to B(q))}$,

can be interpreted, up to a normalizing constant, as the probability of aligning symbol A(p) with symbol B(q), and the value assigned to each path can be interpreted as the probability of a specific alignment between two sequences. In this case the REDK, which sums the probabilities of all possible alignment paths between two sequences, can be interpreted as a matching probability between the two sequences. This probabilistic interpretation suggests an analogy between REDK and the alpha-beta algorithm designed to learn HMM models: while the Viterbi algorithm, which uses a max operator in a dynamic programming implementation (just like the DTW algorithm), evaluates only the probability of the best alignment path, the alpha-beta algorithm is based on the summation of the probabilities of all possible alignment paths. As reported in [27], the main drawback of this kind of kernel is the vanishing of the product of local cost functions (which are lower than one) when comparing long sequences. When considering distance matrices (obtained from pairwise distances on finite sets), this leads to a matrix that suffers from the diagonal dominance problem, i.e. the fact that the kernel value decreases extremely fast when the similarity slightly decreases.
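The contrast between summing over all alignment paths and keeping only the best one can be illustrated with a small dynamic program. The sketch below is a simplified illustration, not the REDK recursion of this paper: the three edit moves, the absolute-difference match cost, the |x| gap cost and the 1/3 weight per step are assumptions chosen for brevity, and the function name `path_scores` is hypothetical.

```python
import math

def path_scores(a, b, nu=1.0):
    """Return (sum over all alignment paths, best single path) of the
    product of local factors exp(-nu * cost) / 3 for two numeric sequences.

    Assumed toy costs: |x - y| for a match, |x| for an insertion or deletion.
    """
    n, m = len(a), len(b)
    total = [[0.0] * (m + 1) for _ in range(n + 1)]  # sum operator (alpha-beta like)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]   # max operator (Viterbi/DTW like)
    total[0][0] = best[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            moves = []  # (local factor, previous i, previous j)
            if i > 0 and j > 0:
                moves.append((math.exp(-nu * abs(a[i - 1] - b[j - 1])) / 3.0, i - 1, j - 1))
            if i > 0:
                moves.append((math.exp(-nu * abs(a[i - 1])) / 3.0, i - 1, j))
            if j > 0:
                moves.append((math.exp(-nu * abs(b[j - 1])) / 3.0, i, j - 1))
            total[i][j] = sum(f * total[pi][pj] for f, pi, pj in moves)
            best[i][j] = max(f * best[pi][pj] for f, pi, pj in moves)
    return total[n][m], best[n][m]

summed, single = path_scores([0, 1, 2], [0, 1, 2])
print(summed >= single)  # the sum includes the best path plus all the others
```

Because every local factor is lower than one, the best-path product shrinks geometrically with the sequence length, which is the vanishing effect mentioned above.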

5 CLASSIFICATION EXPERIMENTS

The purpose of this experiment is to estimate the benefit one can expect from using PD elastic kernels instead of indefinite ones in SVM classification. The Sequential Minimal Optimization (SMO) involved in the SVM learning procedure [34] [35] is able to handle indefinite kernels, but, as convergence toward a global extremum is not guaranteed (the optimization problem is not convex), we can expect better accuracy for definite kernels, especially when the pairwise distance matrix derived from the training data set is far from definiteness. This is precisely what our experiment aims to demonstrate.

To that end, we empirically evaluate the effectiveness of some REDK kernels compared with Gaussian Radial Basis Function (RBF) kernels or elastic distance substituting kernels [2], using classification tasks on a set of time series coming from quite different application fields. The classification task we have considered consists of assigning one of the possible categories to an unknown time series for the 20 data sets available at the UCR repository [36]. As time is not explicitly given for these datasets, we used the index value of the samples as the time stamps for the whole experiment.

For each dataset, a training subset (TRAIN) is defined as well as an independent testing subset (TEST). We use the training sets to train two kinds of classifiers:

• the first one is a first near neighbor (1-NN) classifier: first we select a training data set containing time series for which the correct category is known. To assign a category to an unknown time series selected from a testing data set (different from the training set), we select its nearest neighbor (in the sense of a distance or similarity measure) within the training data set, then assign the category of this nearest neighbor to the unknown time series. For this experiment, a leave-one-out procedure is performed on the training dataset to optimize the meta parameters of the considered comparability measure.

• the second one is an SVM classifier [19], [37] configured with a Gaussian RBF kernel whose parameters are C > 0, a trade-off between regularization and constraint violation, and σ, which determines the width of the Gaussian function. To determine the C and σ hyper parameter values, we adopt a 10-fold cross-validation method on each training subset. According to this procedure, given a predefined training set TRAIN and a test set TEST, we adapt the meta parameters based on the training set TRAIN: we first divide TRAIN into 10 stratified subsets TRAIN_1, TRAIN_2, ..., TRAIN_10; then each subset TRAIN_i is used in turn as a new test set, with (TRAIN − TRAIN_i) as the new training set. Based on the average error rate obtained on the 10 classification tasks, the optimal values of the meta parameters are selected as the ones leading to the minimal average error rate.
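The leave-one-out procedure used for the 1-NN classifier can be sketched as follows. This is a minimal illustration under stated assumptions: the pairwise distance matrix is assumed to be precomputed on TRAIN, ties are broken by taking the first nearest neighbor found, and the function name `loo_1nn_error` is hypothetical.

```python
def loo_1nn_error(dist, labels):
    """Leave-one-out error rate of a 1-NN classifier.

    dist   -- square list of lists of pairwise distances on the training set
    labels -- category of each training series
    """
    n = len(labels)
    errors = 0
    for i in range(n):
        # nearest neighbor of series i among all the other training series
        j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
        errors += labels[j] != labels[i]
    return errors / n

# Toy example: four "series" summarized by a position on a line.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
print(loo_1nn_error(dist, ["a", "a", "b", "b"]))
```

In the meta parameter search, this error rate would be evaluated once per candidate parameter value, each time on the distance matrix computed with that value.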

We have used the LIBSVM library [38] to implement the SVM classifiers.

5.1 Experimenting with REDK

We tested the exponentiated REDK kernel based on the δ_erp, δ_dtw and δ_twed distance costs. We consider respectively the positive definite REDK_erp, REDK_dtw and REDK_twed kernels. Our experiment compares classification errors on the test data for

• the first near neighbor classifiers based on the δ_erp, δ_dtw and δ_twed distance measures (1-NN δ_erp, 1-NN δ_dtw and 1-NN δ_twed),

• the SVM classifiers using Gaussian distance substituting kernels based on the same distances [21], i.e. SVM δ_erp, SVM δ_dtw and SVM δ_twed,

• and their corresponding REDK kernels, SVM REDK_erp, SVM REDK_dtw and SVM REDK_twed.

For δ_erp, δ_twed, REDK_erp and REDK_twed we used the L1-norm, while the L2-norm has been implemented for δ_dtw and REDK_dtw, a classical choice for DTW [39].

5.1.1 Meta parameters

For the δ_erp kernel, meta parameter g is optimized for each dataset on the training data by minimizing the classification error rate of a first near neighbor classifier using a Leave One Out (LOO) procedure. For this kernel, g is selected in {−3, −2.99, −2.98, ..., 2.98, 2.99, 3}. This optimized value is also used for comparison with the REDK_erp kernel.

For the δ_twed kernel, meta parameters λ and ν are optimized for each dataset on the training data by minimizing the classification error rate of a first near neighbor classifier. For our experiment, the stiffness value (ν) is selected from {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1} and λ is selected from {0, .25, .5, .75, 1.0}. If different (ν, λ) values lead to the minimal error rate estimated on the training data, then the pairs containing the highest ν value are selected first, and the pair with the highest λ value is finally selected. These optimized (λ, ν) values are also used for comparability purposes with the REDK_twed kernel.
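The tie-breaking rule for (ν, λ) can be sketched in a few lines. This is an illustrative sketch only: the dictionary of leave-one-out error rates is assumed to have been computed beforehand, and the names `select_twed_params`, `NU_GRID` and `LAMBDA_GRID` are hypothetical.

```python
NU_GRID = [10.0 ** k for k in range(-5, 1)]   # 10^-5 ... 1
LAMBDA_GRID = [0.0, 0.25, 0.5, 0.75, 1.0]

def select_twed_params(errors):
    """errors maps (nu, lam) -> error rate on the training data.

    Among the pairs reaching the minimal error, keep those with the
    highest nu, then the one with the highest lambda.
    """
    best = min(errors.values())
    tied = [pair for pair, err in errors.items() if err == best]
    # tuples compare lexicographically: nu first, then lambda
    return max(tied)

errors = {(0.01, 0.5): 0.10, (0.1, 0.0): 0.10, (0.1, 0.25): 0.10, (1.0, 1.0): 0.20}
print(select_twed_params(errors))  # (0.1, 0.25)
```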

The kernels exploited by the SVM classifiers are the Gaussian Radial Basis Function (RBF) kernels

$K(A, B) = e^{-\delta(A,B)^2/(2\sigma^2)}$

where δ stands for δ_erp, δ_dtw, δ_twed, REDK_erp, REDK_dtw or REDK_twed. To limit the search space for the SVM meta parameters, the pairwise distance or kernel values are normalized using the min and max values calculated on the TRAIN data, basically

• δ = (δ − δ_min)/(δ_max − δ_min) for δ = δ_erp, δ_dtw or δ_twed, and

• δ = exp (log(δ) − log(δ_min))/(log(δ_max) − log(δ_min)) for δ = REDK_erp, REDK_dtw or REDK_twed.
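For the distance case, the min-max normalization followed by the Gaussian substitution can be sketched as below. This is a minimal sketch under stated assumptions: the exponent is taken with the usual negative sign of a Gaussian RBF, δ_min and δ_max are taken over the whole training matrix (diagonal included), and the function names are hypothetical. The logarithmic normalization used for the REDK values is not covered here.

```python
import math

def minmax_normalize(dist):
    """Rescale a pairwise distance matrix to [0, 1] using its own min and max."""
    values = [d for row in dist for d in row]
    lo, hi = min(values), max(values)
    return [[(d - lo) / (hi - lo) for d in row] for row in dist]

def gaussian_substitution_kernel(dist, sigma2):
    """K(A, B) = exp(-delta(A, B)^2 / (2 * sigma^2)) applied entry-wise."""
    return [[math.exp(-d * d / (2.0 * sigma2)) for d in row] for row in dist]

train_dist = [[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]]
gram = gaussian_substitution_kernel(minmax_normalize(train_dist), sigma2=0.5)
print(gram[0][0], gram[0][2])  # 1.0 on the diagonal, exp(-1) for the farthest pair
```

The resulting matrix is the kind of precomputed Gram matrix that an SVM library accepting user-supplied kernels would take as input.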

Meta parameter C is selected from {2^-5, 2^-4, ..., 1, 2, ..., 2^10}, and σ^2 from {2^-5, 2^-4, ..., 1, 2, ..., 2^14}. The best values are obtained using the 10-fold cross-validation procedure.

For the REDK_erp, REDK_dtw and REDK_twed kernels, meta parameter 1/ν′ is selected from the discrete set G = {10^-5, 10^-4, ..., 1, 10, 100}.

The optimization procedure is as follows:

• for each value in G, we train an SVM REDK_∗ classifier on the training dataset using the previously described 10-fold cross-validation procedure to select the SVM meta parameters C and σ, and the average classification error is recorded.

• the best σ, C and ν′ values are the ones that lead to the minimal average error.
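The search over G, C and σ^2 amounts to an exhaustive grid search, which can be sketched as follows. This is only a skeleton: `cv_error` is a hypothetical user-supplied callable standing for the 10-fold cross-validation error of an SVM trained with the given parameters, and the grid variable names are assumptions.

```python
import math

C_GRID = [2.0 ** k for k in range(-5, 11)]        # 2^-5 ... 2^10
SIGMA2_GRID = [2.0 ** k for k in range(-5, 15)]   # 2^-5 ... 2^14
G = [10.0 ** k for k in range(-5, 3)]             # 1/nu' in 10^-5 ... 100

def grid_search(cv_error):
    """Return (error, inv_nu, C, sigma2) minimizing cv_error over the grids.

    cv_error(inv_nu, C, sigma2) is assumed to return the average
    cross-validation error rate for that parameter triple.
    """
    best = None
    for inv_nu in G:
        for c in C_GRID:
            for sigma2 in SIGMA2_GRID:
                err = cv_error(inv_nu, c, sigma2)
                if best is None or err < best[0]:
                    best = (err, inv_nu, c, sigma2)
    return best

# Stand-in error surface, used here only to exercise the loop.
def fake_cv_error(inv_nu, c, sigma2):
    return abs(math.log10(inv_nu)) + abs(math.log2(c) - 3) + abs(math.log2(sigma2) + 1)

print(grid_search(fake_cv_error))
```

With these grids the search evaluates 8 × 16 × 20 = 2560 parameter triples per dataset, each requiring a full cross-validation run, which illustrates the cost of the extra parameter ν′.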

[Figure 2: scatter plot titled "SVM δ_erp vs. SVM REDK_erp"; both axes range from 0 to 60 and are labeled SVM δ_erp and SVM REDK_erp.]

Fig. 2. Comparison of error rates (in %) between two SVM classifiers, the first one based on the δ_erp substituting kernel (SVM δ_erp), and the second one based on a REDK kernel induced by the δ_erp distance (SVM REDK_erp). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM REDK_erp has a lower (resp. higher) error rate than SVM δ_erp.

Table 1 gives, for each data set and each tested kernel (δ_erp, δ_dtw, δ_twed, REDK_erp, REDK_dtw and REDK_twed), the corresponding optimized values of the meta parameters.

5.2 Discussion

5.2.1 REDK experiment analysis

Tables 2 and 3 show the classification error rates obtained for the tested methods, i.e. the first near neighbor classifier based on the δ_erp, δ_dtw and δ_twed distances (1-NN δ_erp, 1-NN δ_dtw and 1-NN δ_twed), the Gaussian RBF kernel SVM based on the same distances (SVM δ_erp, SVM δ_dtw and SVM δ_twed) and Euclidean distance, and the Gaussian RBF kernel SVM based on the REDK kernels (SVM REDK_erp, SVM REDK_dtw and SVM REDK_twed).

TABLE 1

Meta parameters used in conjunction with the δ_erp, REDK_erp, δ_dtw, REDK_dtw, δ_twed and REDK_twed kernels

DATASET       δ_erp: g;C;σ      REDK_erp: g;ν′;C;σ       δ_dtw: C;σ   REDK_dtw: ν′;C;σ    δ_twed: λ;ν;C;σ      REDK_twed: λ;ν;ν′;C;σ
Synth. cont.  0.0;2.0;0.25      0.0;0.457;256.0;0.062    8.0;4.0      0.047;1024.0;0.062  0.75;0.01;1.0;0.25   0.75;0.01;0.685;8.0;4.0
Gun-Point     -0.35;4.0;0.031   -0.35;0.457;128.0;1.0    16.0;0.0312  0.457;64.0;2.0      0.0;0.001;8.0;1.0    0.0;0.001;0.685;32;32
CBF           -0.11;1.0;1.0     -0.11;0.203;4.0;32.0     1.0;1.0      0.457;2.0;1.0       1.0;0.001;1.0;1.0    1.0;0.00;,0.20;4.0;32.0
Face (all)    -1.96;4.0;0.5     -1.96;1.028;8.0;0.62     2.0;0.25     1.028;4.0;0.25      1.0;0.01;8.0;4.0     1.0;0.01;2.312;8.0;4.0
OSU Leaf      -2.25;2.0;0.062   -2.25;1.541;256;0.031    4.0;0.062    1.541;32.0;0.062    1.0;1e-4;8.0;0.25    1.0;1e-4;1.028;64.0;1.0
Swed. Leaf    0.3;8.0;0.125     0.3;0.203;1.0;4.0        4.0;0.031    5.202;0.062;0.5     1.0;1e-4;16.0;0.062  1.0;1e-4;0.304;32.0;1.0
50 Words      -1.39;16.0;0.25   -1.39;0.685;16;0.25      4.0;0.062    1.028;64.0;0.062    1.0;1e-3;8.0;0.5     1.0;1e-3;1.028;32.0;2.0
Trace         0.57;32;0.62      0.57;0.457;256;4.0       4;0.25       0.685;16;0.25       0.25;1e-3;8.0;0.25   0.25;1e-3;300;0.0625;0.25
Two Patt.     -0.89;0.25;0.125  -0.89;0.304;0.004;1.0    0.25,0.125   0.457;2.0;0.125     1.0;1e-3;0.25;0.125  1.0;1e-3;0.685;0.25;0.125
Wafer         1.23;2.0;0.062    1.23;0.685;4.0;0.5       1.0;0.016    1.541;1024;0.031    1.0;0.125;4.0;0.62   1.0;0.125;1.541;1.0;4.0
face (four)   1.97;64;16        1.97;0.685;32;2          16;0.5       0.457;16;2          1.0;0.01;4;2         1.0;0.01;1.027;4;2
Lighting2     -0.33;2;0.062     -0.33;2.312;128;0.062    2.0;0.031    1.541;32;0.062      0.0;1e-6;8;0.25      0.0;1e-6;1.541;8;8
Lighting7     -0.40;128;2       -0.40;0.685;32;0.25      4;0.25       0.685;32;0.062      0.25;01;4;0.5        0.25;0.1;0.685;4;8
ECG           1.75;8;0.125      1.75;0.457;16;0.5        2;0.62       1.028;32;0.062      0.5;1.0;4;0.125      0.5;1.0;5.202;8;16
Adiac         1.83;16;0.0156    1.83;2.312;4096;0.031    16;0.0039    1.028;2048;0.031    0.75;1e-4;16;0.016   0.75;1e-4;2.312;128;1
Yoga          0.77;4;0.031      0.77;11.7054096;0.031    4;0.008      26.337;1024;0.031   0.5;1e-5;2;0.125     0.5;1e-5;3.468;256;2
Fish          -0.82;64;0.25     -0.82;0.685;32;0.5       8;0.016      3.468;64;16         0.5;1e-4;4;.5        0.5;1e-4;0.457;16;16
Coffee        -3.00;16;0.062    -3.00;26.337;4096;16     8;0.062      5.202;512;4         0;0.1;16;4           0;0.1;300;1024;128
OliveOil      -3.00;8;0.5       -0.82;0.457;256;0.062    2;0.125      0.457;32;0.125      0;0.001;256;32       0;0.001;32;32
Beef          -3.00;128;0.125   -3.00;0.685;0.004;16384  16;0.016     0.457;0.004;16      0;1e-4;2;1           0;1e-4;0.135;0.004;16

TABLE 2

Comparative study using the UCR datasets: classification error rates (in %) obtained using the first near neighbor classification rule and an SVM classifier for the erp, REDK_erp, dtw and REDK_dtw kernels. Two scores are given as S1|S2: the first one, S1, is evaluated on the training data, while the second one, S2, is evaluated on the test data. Bold face indicates the lowest error rate between the pairs of SVMs (K_erp, REDK_erp) and (K_dtw, REDK_dtw).

DATASET                 1-NN δ_erp    SVM δ_erp     SVM REDK_erp  1-NN δ_dtw    SVM δ_dtw     SVM REDK_dtw
Synthetic control       0.67|3.7      0.33|1        0.33|1        0|1.33        0|1.33        0|1
Gun-Point               6.12|4        0|1.33        0|1.33        18.36|9.3     6|8           0|.67
CBF                     0|0.33        3.33|3.44     3.33|2.67     0|0.33        3.33|1.67     0|0.22
Face (all)              10.73|20.18   .71|17.04     .71|16.98     6.8|19.23     4.29|14.97    1.25|7.89
OSU Leaf                30.15|40.08   18.5|31.41    22|29.75      29|44.63      28|34.3       21|25.62
Swedish Leaf            11.02|12      9.2|7.68      6.8|6.24      24.65|20.8    20.6|17.92    6.8|6.72
50 Words                19.38|28.13   16.22|16.26   16.89|17.80   33.18|31      30.89|38.90   16.44|25.93
Trace                   10.01|17      0|0           0|1           0|0           0|0           0|0
Two Patterns            0|0           0|0           0|0           0|0           0|0           0|0
Wafer                   .1|0.9        .2|0.62       0.1|0.34      1.4|2.01      2.4|3.11      0.2|0.42
face (four)             4.35|10.2     4.17|3.41     8.33|4.55     26.09|17.05   12.5|25       8.33|3.41
Lighting2               11.86|14.75   13.33|26.23   11.67|22.95   13.56|13.1    16.67|8.2     8.33|16.39
Lighting7               23.19|30.1    22.86|17.81   22.86|17.81   33.33|27.4    30.00|26.03   12.86|17.81
ECG                     10.01|13      7|16          9|12          23.23|23      12|19         6|11
Adiac                   35.99|37.85   33.33|31.71   24.10|24.4    40.62|39.64   36.41|35.29   21.79|20.97
Yoga                    14.05|14.7    21|12.63      11.67|10.5    16.37|16.4    20.33|17.03   11.67|11.47
Fish                    16.09|12      9.14|6.86     9.14|4.57     26.44|16.57   25.71|22.29   7.43|4.57
Coffee                  25.93|25      14.29|17.86   14.29|17.86   14.81|17.86   7.14|7.14     7.14|7.14
OliveOil                17.24|16.67   13.33|16.67   13.33|16.67   13.79|13.33   10|13.33      13.33|13.33
Beef                    68.97|50      36.67|46.67   46.67|46.67   46.67|50      33.33|53.33   30|46.67
# Best Scores           -             15|10         15|17         -             5|5           19|19
# Uniquely Best Scores  -             3|4           5|10          -             1|1           15|15

In this experiment, we show that the SVM classifiers clearly outperform, in general, the 1-NN classifiers that are used as a baseline. But the interesting result reported in Tables 2 and 3 and Figures 2, 3 and 4 is that SVM REDK_erp and SVM REDK_twed perform slightly better than SVM δ_erp and SVM δ_twed respectively, and that SVM REDK_dtw is clearly much more efficient than SVM δ_dtw. This could come from the fact that δ_erp and δ_twed are metrics but not δ_dtw, but another explanation could be related to the SMO optimization involved in the SVM learning. SVM δ_dtw behaves poorly compared to the [...] optimization process does not perform well. Nevertheless, the REDK_dtw kernel based on δ_dtw greatly improves

the classification results. To further explore the potential impact of indefiniteness on classification rates, we give in Table 4 two quantified hints of deviation from conditional definiteness for the distance matrices corresponding to the δ_dtw, δ_erp and δ_twed distances. Since, to be conditionally definite (negative), a pairwise distance matrix should have a single positive eigenvalue, the first hint is the number of positive eigenvalues #Pev (we also give as a reference the total number of eigenvalues, #Ev). The second hint,

$\Delta_p = 100 \cdot \dfrac{\sum_{ev_i>0} ev_i - \max_{ev_i>0}\{ev_i\}}{\sum_{ev_i>0} ev_i}$,

where ev_i is an eigenvalue of the distance matrix, quantifies the weight of the extra positive eigenvalues relative to the total weight of the positive eigenvalues. Therefore, a conditionally definite (negative) matrix should be such that simultaneously #Pev = 1 and ∆_p = 0.
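The two hints can be computed directly from an eigenvalue spectrum, as in the following sketch. The function name `deviation_hints` and the tolerance `tol` used to discard numerically zero eigenvalues are assumptions; the example spectrum is the one reported for the TWED matrix in Appendix A.3.

```python
def deviation_hints(eigenvalues, tol=1e-12):
    """Return (#Pev, Delta_p in %) for a list of eigenvalues.

    #Pev is the number of strictly positive eigenvalues; Delta_p is the
    share of the positive spectrum carried by all positive eigenvalues
    except the largest one.
    """
    positive = [ev for ev in eigenvalues if ev > tol]
    if not positive:
        return 0, 0.0
    total = sum(positive)
    delta_p = 100.0 * (total - max(positive)) / total
    return len(positive), delta_p

twed_spectrum = [4.62, 0.04, -2.14, -0.98, -0.72, -0.37, -0.19, -0.17, -0.06, -0.03]
n_pos, delta_p = deviation_hints(twed_spectrum)
print(n_pos, round(delta_p, 2))
```

For this spectrum the two positive eigenvalues give #Pev = 2 and ∆_p = 100 · 0.04/4.66, i.e. slightly below 1%.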

TABLE 3

Comparative study using the UCR datasets: classification error rates (in %) obtained using the first near neighbor classification rule and an SVM classifier for the δ_twed and REDK_twed kernels. Two scores are given as S1|S2: the first one, S1, is evaluated on the training data, while the second one, S2, is evaluated on the test data. Bold face indicates the lowest error rate between the pair of SVMs (K_twed, REDK_twed).

DATASET                 1-NN δ_twed   SVM δ_twed    SVM REDK_twed
Synthetic control       1|2.33        0|1.33        0|1
Gun-Point               0|1.33        2|0.67        0|0.67
CBF                     0|0.67        3.33|3.33     3.33|1.77
Face (all)              1.43|18.93    0.54|17.10    0.71|16.15
OSU Leaf                17.59|24.79   15|17.36      17|20.66
Swedish Leaf            8.82|10.24    7|6.56        6.4|5.76
50 Words                18.26|18.9    16.22|14.07   15.78|16.04
Trace                   1|5           0|0           0|1
Two Patterns            0|0.12        0|0           0|0
Wafer                   0.1|.86       0.3|0.34      0.1|0.2
face (four)             8.7|3.41      8.33|2.27     8.33|3.41
Lighting2               13.56|21.31   20|32.15      10|21.31
Lighting7               24.64|24.66   27.14|21.92   27.14|21.92
ECG                     13.13|10      11|8          12|7
Adiac                   36.25|37.6    28.72|30.95   24.85|24.30
Yoga                    19.06|12.97   11.67|9.87    11.33|10.46
Fish                    12.07|5.14    6.29|3.43     6.86|3.43
Coffee                  18.52|21.43   17.86|21.43   14.29|17.86
OliveOil                11.11|16.67   13.33|13.33   13.33|16.67
Beef                    58.62|53.3    36.67|53.33   46.67|50
# Best Scores           -             12|10         15|14
# Uniquely Best Scores  -             5|5           8|10

[Figure 3: scatter plot titled "SVM δ_dtw vs. SVM REDK_dtw"; both axes range from 0 to 60 and are labeled SVM δ_dtw and SVM REDK_dtw.]

Fig. 3. Comparison of error rates (in %) between two SVM classifiers, the first one based on the δ_dtw substituting kernel (SVM δ_dtw), and the second one based on a REDK induced by the δ_dtw distance (SVM REDK_dtw). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM REDK_dtw has a lower (resp. higher) error rate than SVM δ_dtw.

[Figure 4: scatter plot titled "SVM δ_twed vs. SVM REDK_twed"; both axes range from 0 to 60 and are labeled SVM δ_twed and SVM REDK_twed.]

Fig. 4. Comparison of error rates (in %) between two SVM classifiers, the first one based on the δ_twed substituting kernel (SVM δ_twed), and the second one based on a REDK induced by the δ_twed distance (SVM REDK_twed). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM REDK_twed has a lower (resp. higher) error rate than SVM δ_twed.

TABLE 4

Analysis of the deviation from conditional definiteness for the distance matrices associated with the δ_dtw, δ_erp and δ_twed distances. We report for each dataset the number of positive eigenvalues (#Pev) relative to the total number of eigenvalues (#Ev), and the deviation from definiteness estimated as ∆_p, expressed in %. The expectation is a single positive eigenvalue, #Pev = 1, corresponding to ∆_p = 0%.

DATASET            δ_dtw: #Pev/#Ev  ∆_p      δ_erp: #Pev/#Ev  ∆_p      δ_twed: #Pev/#Ev  ∆_p
Synthetic control  110/300          15.66%   8/300            .16%     6/300             .22%
Gun-Point          23/50            2.54%    1/50             0%       1/50              0%
CBF                5/30             3.36%    1/30             0%       1/30              0%
Face (all)         242/560          26.6%    83/560           2.42%    41/560            1.89%
OSU Leaf           96/200           31.79%   29/200           2.97%    16/200            .89%
Swedish Leaf       206/500          17.04%   24/500           .68%     23/500            .41%
50 Words           218/450          34.03%   119/450          9.54%    93/450            4.85%
Trace              43/100           5.42%    1/100            0%       1/100             0%
Two Patterns       453/1000         36.7%    259/1000         13.8%    226/1000          9.85%
Wafer              497/1000         14.84%   137/1000         1.29%    39/1000           .04%
face (four)        2/24             .74%     1/24             0%       1/24              0%
Lighting2          20/60            13.44%   1/60             0%       1/60              0%
Lighting7          24/70            14.25%   1/70             0%       1/70              0%
ECG                38/100           14.7%    1/100            0%       1/100             0%
Adiac              159/390          5.54%    26/390           .82%     39/390            .69%
Yoga               142/300          23.4%    29/300           3.17%    10/300            .41%
Fish               71/175           17.57%   1/175            0%       1/175             0%
Coffee             12/28            8.83%    1/28             0%       1/28              0%
OliveOil           4/30             .24%     1/30             0%       1/30              0%
Beef               15/30            6.17%    1/30             0%       1/30              0%

By examining the distance matrices corresponding to each training dataset and to each of the distances δ_dtw(A, B), δ_erp(A, B) and δ_twed(A, B), we can show that the δ_dtw [...] definite matrix than the δ_erp and δ_twed kernels. The distance that is closest to conditional definiteness is the δ_twed distance. This is clearly quantifiable by computing the number of positive eigenvalues, as well as their amplitudes, of the pairwise distance matrices. For datasets such as FaceAll, OSULeaf, SwedishLeaf, 50words, Adiac, Two Patterns, Fish etc., all three distances lead to indefinite pairwise distance matrices. The REDK regularization brings in general a significant improvement, specifically when the number and amplitudes of the extra positive eigenvalues are high. Furthermore, for datasets of small size (such as CBF, Beef, Coffee, OliveOil, etc.), δ_erp and δ_twed produce conditionally definite matrices where δ_dtw does not. One can see that the regularization brought by REDK is much more effective on δ_dtw for these data sets. Nevertheless, some datasets are better classified by SVMs that directly use the distance substituting kernel instead of the derived REDK regularized kernel.

To conclude, our experiment clearly shows that for problems involving distance matrices far from definiteness (more than 5-10% of eigenvalues are positive, see Table 4), the REDK regularized version outperforms the original elastic distance substituting kernel. This is particularly true for δ_dtw. For the δ_twed, or even δ_erp, distances, most of the training sets lead to pairwise distance matrices that are either conditionally negative definite (CND) or close to CND matrices (with very few extra positive eigenvalues). Nevertheless, in most cases, REDK-δ_erp and REDK-δ_twed outperformed δ_erp and δ_twed respectively, showing some robustness of the REDK regularization. The main drawback of REDK is the extra parameter ν′. This extra parameter makes the search for an optimal setting on the training data more difficult and requires more learning data to converge. The trade-off between learning and generalization is therefore more complex to optimize.

6 CONCLUSION

Following the work on convolution kernels [1] and local alignment kernels defined for string processing around the Smith and Waterman algorithm [9] [27], or defined for time series on the basis of the DTW similarity measure [5], we have addressed the definiteness of elastic distances through the construction of positive definite kernels built from the recursive definitions of the elastic (editing) distances themselves, and achieved some extension and generalization of these previous results.

Contrary to other approaches that rely on the regularization of the Gram matrix [23][24], this approach does not depend at all on the data, except for the optimization of a regularizing parameter (ν′). By adding an extra recursive term we obtain some simple sufficient conditions, weaker than those proposed in [27] or [5], that allow building positive definite exponentiated REDKs. These conditions are basically satisfied by any classical elastic distance defined by a recursive equation. In particular they can be applied to the edit distance, the well known Dynamic Time Warping measure and to some variants such as the Edit Distance With Real Penalty and the Time Warp Edit Distance, the latter two being metrics, as is the symbolic edit distance. Furthermore, our results apply when a symmetric corridor is used to limit the search space required to evaluate the elastic distance, thus reducing the computation time, which, in the worst case, i.e. without corridor, remains quadratic.

The experiments conducted on a variety of time series datasets show that the positive definite REDKs outperform in general the indefinite elastic distances from which they are derived, when considering 1-NN and SVM classification tasks. This is most striking when the Gram matrix evaluated on the training data sets is far from definiteness, which is the case when DTW is used. A recent experiment in isolated gesture recognition [40] corroborates this observation.

APPENDIX A INDEFINITENESS OF SOME ELASTIC MEASURES

A.1 The Levenshtein distance

The Levenshtein distance kernel ϕ(x, y) = δ_lev(x, y) is known to be indefinite. Below, we discuss the first known counter-example produced by [41]. Let us consider the subset of sequences V = {abc, bad, dab, adc, bcd} that leads to the following distance matrix

$$M^V_{lev} = \begin{pmatrix} 0 & 3 & 2 & 1 & 2 \\ 3 & 0 & 2 & 2 & 1 \\ 2 & 2 & 0 & 3 & 3 \\ 1 & 2 & 3 & 0 & 3 \\ 2 & 1 & 3 & 3 & 0 \end{pmatrix} \qquad (6)$$

and consider coefficient vectors C and D in $\mathbb{R}^5$ such that C = [1, 1, −2/3, −2/3, −2/3] with $\sum_{i=1}^{5} c_i = 0$ and D = [1/3, 2/3, 1/3, −2/3, −2/3] with $\sum_{i=1}^{5} d_i = 0$. Clearly $C M^V_{lev} C^T = 2/3 > 0$ and $D M^V_{lev} D^T = -4/3 < 0$, showing that $M^V_{lev}$ has no definiteness.

A.2 The Dynamic Time Warping distance

The DTW kernel ϕ(x, y) = δ_dtw(x, y) is also known not to be conditionally definite. The following example demonstrates this known result. Let us consider the subset of sequences V = {01, 012, 0122, 01222}.

Then the DTW distance matrix evaluated on V is

$$M^V_{dtw} = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 0 & 0 \\ 2 & 0 & 0 & 0 \\ 3 & 0 & 0 & 0 \end{pmatrix} \qquad (7)$$

and consider coefficient vectors C and D in $\mathbb{R}^4$ such that C = [1/4, −3/8, −1/8, 1/4] with $\sum_{i=1}^{4} c_i = 0$ and D = [−1/4, −1/4, 1/4, 1/4] with $\sum_{i=1}^{4} d_i = 0$. Clearly $C M^V_{dtw} C^T = 2/32 > 0$ and $D M^V_{dtw} D^T = -1/2 < 0$, showing that $M^V_{dtw}$ has no definiteness.
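The quadratic forms quoted in A.1 and A.2 can be checked with exact rational arithmetic. The short script below is a verification sketch; the helper name `quad_form` and the variable names are hypothetical, while the matrices and coefficient vectors are those given in equations (6) and (7).

```python
from fractions import Fraction as F

def quad_form(c, m):
    """Compute c M c^T for a coefficient vector c and a square matrix m."""
    n = len(c)
    return sum(c[i] * m[i][j] * c[j] for i in range(n) for j in range(n))

M_LEV = [[0, 3, 2, 1, 2],
         [3, 0, 2, 2, 1],
         [2, 2, 0, 3, 3],
         [1, 2, 3, 0, 3],
         [2, 1, 3, 3, 0]]
C_LEV = [F(1), F(1), F(-2, 3), F(-2, 3), F(-2, 3)]
D_LEV = [F(1, 3), F(2, 3), F(1, 3), F(-2, 3), F(-2, 3)]

M_DTW = [[0, 1, 2, 3],
         [1, 0, 0, 0],
         [2, 0, 0, 0],
         [3, 0, 0, 0]]
C_DTW = [F(1, 4), F(-3, 8), F(-1, 8), F(1, 4)]
D_DTW = [F(-1, 4), F(-1, 4), F(1, 4), F(1, 4)]

# All four coefficient vectors sum to zero, as required for the
# conditional definiteness test.
print(quad_form(C_LEV, M_LEV), quad_form(D_LEV, M_LEV))  # 2/3 -4/3
print(quad_form(C_DTW, M_DTW), quad_form(D_DTW, M_DTW))  # 1/16 -1/2
```

Since each quadratic form takes both signs over zero-sum coefficient vectors, neither matrix is conditionally definite.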

A.3 The Time Warp Edit Distance

Similarly, it is easy to find simple counter-examples that show that TWED kernels are not definite.

Let us consider the subset of sequences V = {010, 012, 103, 301, 032, 123, 023, 003, 302, 321}.

For the TWED metric, with ν = 1.0 and λ = 0.0 we get the following matrix:

$$M^V_{twed} = \begin{pmatrix}
0 & 2 & 7 & 9 & 6 & 7 & 5 & 5 & 10 & 9 \\
2 & 0 & 5 & 9 & 4 & 5 & 3 & 3 & 8 & 9 \\
7 & 5 & 0 & 6 & 7 & 4 & 6 & 2 & 5 & 10 \\
9 & 9 & 6 & 0 & 13 & 10 & 12 & 8 & 1 & 4 \\
6 & 4 & 7 & 13 & 0 & 5 & 3 & 5 & 12 & 9 \\
7 & 5 & 4 & 10 & 5 & 0 & 2 & 6 & 9 & 6 \\
5 & 3 & 6 & 12 & 3 & 2 & 0 & 4 & 11 & 8 \\
5 & 3 & 2 & 8 & 5 & 6 & 4 & 0 & 7 & 10 \\
10 & 8 & 5 & 1 & 12 & 9 & 11 & 7 & 0 & 5 \\
9 & 9 & 10 & 4 & 9 & 6 & 8 & 10 & 5 & 0
\end{pmatrix} \qquad (8)$$

The eigenvalue spectrum for this matrix is the following: {4.62, 0.04, −2.14, −0.98, −0.72, −0.37, −0.19, −0.17, −0.06, −0.03}. This spectrum contains 2 strictly positive eigenvalues, showing that $M^V_{twed}$ has no definiteness.

A.4 The Edit Distance with Real Penalty

For the ERP metric, with g = 0.0 we get the following matrix:

$$M^V_{erp} = \begin{pmatrix}
0 & 2 & 3 & 3 & 4 & 5 & 4 & 2 & 4 & 5 \\
2 & 0 & 3 & 5 & 2 & 3 & 2 & 2 & 4 & 5 \\
3 & 3 & 0 & 4 & 3 & 2 & 3 & 1 & 3 & 4 \\
3 & 5 & 4 & 0 & 7 & 6 & 7 & 5 & 1 & 2 \\
4 & 2 & 3 & 7 & 0 & 3 & 2 & 2 & 6 & 5 \\
5 & 3 & 2 & 6 & 3 & 0 & 1 & 3 & 5 & 4 \\
4 & 2 & 3 & 7 & 2 & 1 & 0 & 2 & 6 & 5 \\
2 & 2 & 1 & 5 & 2 & 3 & 2 & 0 & 4 & 5 \\
4 & 4 & 3 & 1 & 6 & 5 & 6 & 4 & 0 & 1 \\
5 & 5 & 4 & 2 & 5 & 4 & 5 & 5 & 1 & 0
\end{pmatrix} \qquad (9)$$

The eigenvalue spectrum for this matrix is the following: {4.63, 0.02, 1.39e−17, −2.21, −0.97, −0.56, −0.41, −0.26, −0.17, −0.08}. This spectrum contains 3 strictly positive eigenvalues (although the third positive eigenvalue, which is very small, could be the result of the imprecision of the diagonalization algorithm used), showing that $M^V_{erp}$ has no definiteness.

APPENDIX B PROOF OF OUR MAIN RESULT

B.1 Proof of theorem 4.3

Definition B.1: Let U_n be the subset of U containing all the sequences whose lengths are lower than or equal to n.

Fig. 5. Example of an alignment path corresponding to the alignment map (0,0)(0,1)(1,2)(1,3)(2,4)(3,4)(4,5). The white squares correspond to substitution operations and black circles to either deletion or insertion operations.

Definition B.2: For all (a, b) ∈ ((S×T) ∪ {Ω})^2, let κ(a, b) be the local kernel defined as follows:

κ(a, b) = f(Γ(a → b))

where Ω stands for the null sequence.

Definition B.3: Let π be an ordered alignment map between two finite non-empty sequences of successive integers of the form 0, ..., n. Basically, π is a finite sequence of pairs of integers π(l) = (i_l, j_l) for l ∈ {0, ..., |π| − 1}, satisfying the following conditions

• 0 ≤ i_l, ∀l ∈ {0, ..., |π| − 1}

• i_l ≤ i_{l−1} + 1, ∀l ∈ {1, ..., |π| − 1}

• j_l ≤ j_{l−1} + 1, ∀l ∈ {1, ..., |π| − 1}

• i_{l−1} < i_l or j_{l−1} < j_l, ∀l ∈ {1, ..., |π| − 1}

π_x(l) = i_l and π_y(l) = j_l are the two coordinate access functions for the l-th pair of mapped integers, so that π(l) = (π_x(l), π_y(l)).

For all n ≥ 1, let M_n be the set of alignment maps π such that the two sets of integers mapped by π are {1, ..., n} × {1, ..., n}. By convention we set M_0 = ∅.

As shown in Figure 5, there exists a direct correspondence between an alignment map and an alignment path, i.e. a finite sequence of editing operations.
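The conditions of Definition B.3 translate into a simple validity check on a list of integer pairs, sketched below. The function names `is_alignment_map` and `mirror` are hypothetical, and requiring j_l ≥ 0 in addition to i_l ≥ 0 is an assumption made by symmetry. The `mirror` helper swaps the two coordinates of every pair, which gives the path symmetric with respect to the main diagonal that is used later in the proof.

```python
def is_alignment_map(pairs):
    """Check the conditions of Definition B.3 on a list of (i, j) pairs."""
    if not pairs:
        return False
    # non-negativity (the check on j is an assumption, added by symmetry)
    if any(i < 0 or j < 0 for i, j in pairs):
        return False
    for (pi, pj), (i, j) in zip(pairs, pairs[1:]):
        # each index may advance by at most one position per step
        if i > pi + 1 or j > pj + 1:
            return False
        # at least one of the two indexes must advance
        if not (pi < i or pj < j):
            return False
    return True

def mirror(pairs):
    """Return the map obtained by swapping the x and y coordinates."""
    return [(j, i) for i, j in pairs]

FIG5_MAP = [(0, 0), (0, 1), (1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(is_alignment_map(FIG5_MAP), is_alignment_map(mirror(FIG5_MAP)))
```

The example map is the one given in the caption of Figure 5; both it and its mirror satisfy the conditions.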

Definition B.4: For all n we introduce two projections, basically two vectorized representations, ϕ_{π_x} : U_n → ((S×T) ∪ {Ω})^{2n} and ϕ_{π_y} : U_n → ((S×T) ∪ {Ω})^{2n}, induced uniquely by any alignment map π ∈ M_n.

The principle for the construction of these two unique projections, given any sequence A and any alignment map π ∈ M_n, is straightforward. With the convention that if k > |A| then A(k) = Ω_A(|A|) and Ω_A(k) = Ω_A(|A|), during the course along π of length L = |π|, at step l, l ∈ {1, ..., L}, we apply the following rules:

i) if both indexes π_x(l) and π_y(l) increase, then we insert A(π_x(l)) in ϕ_{π_x}(A) and A(π_y(l)) in ϕ_{π_y}(A).

ii) if only index π_x(l) increases, then we insert A(π_x(l)) in ϕ_{π_x}(A) and Ω_A(π_y(l)) in ϕ_{π_y}(A), with the convention that, if π_x(l) > |A|, A(π_x(l)) = Ω_A(π_x(l)).

iii) if only index π_y(l) increases, then we insert Ω_A(π_x(l)) in ϕ_{π_x}(A) and A(π_y(l)) in ϕ_{π_y}(A).

iv) when we reach the end of π, if the length of ϕ_{π_x}(A) (respectively ϕ_{π_y}(A)) is shorter than 2n, then we insert Ω into the remaining dimensions.

For example, for a sequence A of length 5, the two projections in ((S×T) ∪ {Ω})^{2×5} corresponding to the alignment path depicted in Figure 5 are

ϕ_{π_x}(A) = Ω(1) A(1) Ω(2) A(2) A(3) A(4) Ω Ω Ω Ω

ϕ_{π_y}(A) = A(1) A(2) A(3) A(4) Ω(5) A(5) Ω Ω Ω Ω

where Ω(i) = Ω_A(i) is the symbol used for an insertion or deletion operation. This symbol depends on the elastic distance. For DTW, Ω(i) = A(i) if 0 < i ≤ |A|, or Ω(i) = Ω if i > |A|.

Then, for any A ∈ U_n and π ∈ M_n, we call P_π(A) = {ϕ_{π_x}(A), ϕ_{π_y}(A)} the set of projections (or parts) for sequence A induced by π. Note that all these projections are sequences whose lengths are 2n. Even [...]

Proposition B.5: If the local kernel k(x, y) = f(Γ(x → y)) is positive definite on ((S×T) ∪ {Ω})^2, then ∀n ≥ 1 and ∀π ∈ M_n,

$$K_\pi(A, B) = \sum_{\varphi(A) \in P_\pi(A)} \; \sum_{\varphi(B) \in P_\pi(B)} \; \prod_{p=1}^{2n} k(\varphi(A)_p, \varphi(B)_p) \qquad (10)$$

is a p.d. kernel on (U_n)^2.

The proof of Proposition B.5 is a direct consequence of Haussler's R-convolution kernel theorem [1]. Indeed, since k(x, y) is a p.d. kernel on ((S×T) ∪ {Ω})^2, and considering the sets of parts P_π(A) and P_π(B) associated respectively with the sequences A and B, the conditions for Haussler's R-convolution are satisfied.

Note that K_π(A, B) simply rewrites as

$$K_\pi(A, B) = \prod_{p=1}^{2n} k(\varphi_{\pi_x}(A)_p, \varphi_{\pi_y}(B)_p) + \prod_{p=1}^{2n} k(\varphi_{\pi_y}(A)_p, \varphi_{\pi_x}(B)_p) + \prod_{p=1}^{2n} k(\varphi_{\pi_x}(A)_p, \varphi_{\pi_x}(B)_p) + \prod_{p=1}^{2n} k(\varphi_{\pi_y}(A)_p, \varphi_{\pi_y}(B)_p) \qquad (11)$$

Let M_{n,h} ⊂ M_n be the subset of paths that are restricted by the symmetric corridor induced by any symmetric indicator function h entering into the construction of the RAfP functions C_h(., .) and C_{h,δ}(., .).

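[...] (A, B) ∈ (U_n)^2, we have

$$K(A, B) = \frac{1}{2} \sum_{\pi \in M_{n,h}} K_\pi(A, B) \qquad (12)$$

We state first that ∀π ∈ M_{n,h}, ∃! π̃ ∈ M_{n,h} such that π_x = π̃_y and π_y = π̃_x. (π, π̃) is a pair of symmetrical paths with respect to the main diagonal.

Thus, since C_h(., .) evaluates the sum of the cost-product along all paths in M_{n,h}, that is

$$C_h(A, B) = \sum_{\pi \in M_{n,h}} \prod_{p=1}^{2n} k(\varphi_{\pi_x}(A)_p, \varphi_{\pi_y}(B)_p) \qquad (13)$$

we can rewrite

$$C_h(A, B) = \sum_{\pi \in M_{n,h}} \left( \frac{1}{2} \prod_{p=1}^{2n} k(\varphi_{\pi_x}(A)_p, \varphi_{\pi_y}(B)_p) + \frac{1}{2} \prod_{p=1}^{2n} k(\varphi_{\tilde{\pi}_x}(A)_p, \varphi_{\tilde{\pi}_y}(B)_p) \right)$$

$$= \frac{1}{2} \sum_{\pi \in M_{n,h}} \left( \prod_{p=1}^{2n} k(\varphi_{\pi_x}(A)_p, \varphi_{\pi_y}(B)_p) + \prod_{p=1}^{2n} k(\varphi_{\pi_y}(A)_p, \varphi_{\pi_x}(B)_p) \right)$$

Similarly, for C_{h,δ}(., .) we get

$$C_{h,\delta}(A, B) = \frac{1}{2} \sum_{\pi \in M_{n,h}} \left( \prod_{p=1}^{2n} k(\varphi_{\pi_x}(A)_p, \varphi_{\pi_x}(B)_p) + \prod_{p=1}^{2n} k(\varphi_{\pi_y}(A)_p, \varphi_{\pi_y}(B)_p) \right)$$

Hence, the expected result.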

Proof of Theorem 4.3 (Definiteness of REDK)

Proposition B.6 states that K(., .) is a finite sum of p.d. kernels defined on (U_n)^2. According to the closure property of p.d. kernels under addition, K(., .) is a p.d. kernel defined on (U_n)^2.

As this result is true for all n, one can see K(., .) as the point-wise limit, as n tends toward infinity, of a p.d. kernel defined on (U_n)^2. This establishes that K(., .) is a p.d. kernel on U^2.

ACKNOWLEDGMENTS

The authors thank the French Ministry of Research, the Brittany Region, the General Council of Morbihan and the European Regional Development Fund, which partially funded this research. The authors also thank Dr. E. Keogh's team at UC Riverside for making available the time series data sets that have been used in this study.

REFERENCES

[1] D. Haussler, “Convolution kernels on discrete structures,” University of California at Santa Cruz, Santa Cruz, CA, USA, Technical Report UCS-CRL-99-10, 1999. [Online]. Available: http://citeseer.ist.psu.edu/haussler99convolution.html

[2] B. Haasdonk and C. Bahlmann, “Learning with distance substitution kernels,” in Pattern Recognition, ser. Lecture Notes in Computer Science, C. Rasmussen, H. Bülthoff, B. Schölkopf, and M. Giese, Eds. Springer Berlin Heidelberg, 2004, vol. 3175, pp. 220–227.

[3] J. P. Vert, H. Saigo, and T. Akutsu, “Local alignment kernels for biological sequences,” in Kernel Methods in Computational Biology, B. Schölkopf, K. Tsuda, and J. Vert, Eds. MIT Press, 2004, pp. 131–154.

[4] C. Cortes, P. Haffner, and M. Mohri, “Rational kernels: Theory and algorithms,” J. Mach. Learn. Res., vol. 5, pp. 1035–1062, Dec. 2004.

[5] M. Cuturi, J.-P. Vert, Ø. Birkenes, and T. Matsui, “A kernel for time series based on global alignments,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 2, April 2007, pp. II–413–II–416.

[6] R. Bellman, Dynamic Programming. Princeton, New Jersey: Princeton Univ. Press, 1957.

[7] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Doklady Akademii Nauk SSSR, 163(4):845-848, 1965 (Russian), vol. 8, pp. 707–710, 1966, English translation in Soviet Physics Doklady, 10(8).

[8] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem,” Journal of the ACM (JACM), vol. 21, pp. 168–173, 1973.

[9] T. Smith and M. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, pp. 195–197, 1981.

[10] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “A basic local alignment search tool,” Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.

[11] W. Pearson, “Rapid and sensitive sequence comparisons with fasp and fasta,” Methods Enzymol, vol. 183, pp. 63–98, 1990.

[12] V. M. Velichko and N. G. Zagoruyko, “Automatic recognition of 200 words,” International Journal of Man-Machine Studies, vol. 2, pp. 223–234, 1970.

[13] H. Sakoe and S. Chiba, “A dynamic programming approach to continuous speech recognition,” in Proceedings of the Seventh International Congress on Acoustics, Budapest, vol. 3. Budapest: Akadémiai Kiadó, 1971, pp. 65–69.

[14] L. Chen and R. Ng, “On the marriage of lp-norms and edit distance,” in Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, ser. VLDB ’04. VLDB Endowment, September 2004, pp. 792–803.

[15] P.-F. Marteau, “Time warp edit distance with stiffness adjustment for time series matching,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 2, pp. 306–318, Feb 2009.

[16] A. Hayashi, Y. Mizuhara, and N. Suematsu, “Embedding time series data for classification,” in Machine Learning and Data Mining in Pattern Recognition, ser. Lecture Notes in Computer Science, P. Perner and A. Imiya, Eds. Springer Berlin Heidelberg, 2005, vol. 3587, pp. 356–365.

[17] B. Haasdonk, “Feature space interpretation of svms with indefinite kernels,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 4, pp. 482–492, April 2005.

[18] V. Vapnik, Statistical Learning Theory. Wiley-Interscience, ISBN 0-471-03003-1, 1989.

[19] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ser. COLT ’92. New York, NY, USA: ACM, July 1992, pp. 144–152.

[20] B. Scholkopf and A. J. Smola,Learning with Kernels: Support Vector Machines, Regularizatio