Using Segmentation Constraints in an Implicit Segmentation Scheme for Online Word Recognition

(1)

1

Using Segmentation Constraints

in an Implicit Segmentation Scheme

for Online Word Recognition

Émilie CAILLAULT – LASL, ULCO/University of Littoral (Calais)

Christian VIARD-GAUDIN – IRCCyN, University of Nantes

(2)

2

Aim

• Problem :

– Training of a neuro-markovian system in the context

of cursive handwritten word recognition.

• Difficulty :

– Bootstrapping of the system allowing a training from

scratch.

• Proposed solution :

Implicit segmentation procedure

•

Definition of a segmentation control function

(3)

3

How train a neuro-markovian system?

Classifier

HMM

Teaching

Signal: character

Signal

Character level training

Classifier

HMM

Teaching signal:

word

Word level training

RN-MMC

Parole/Hors-ligne

Différents hybrides

MMC-MMC

RN-MMC

MMC - RN

SVM-MMC

Kppv-MMC

(4)

4

Global system overview : MS-TDNN

O b se rva tion w in d o w s O t of th e T D N N S a m e c o nv o lu tion a l w ind o w in sid e th e o b se rva tio n w in do w s of th e T D N N O b s1 O bs n O b s1 O bs n … … ... x r y dx dy cy cx P T D N N in pu ts : te m p o ral se q u en c e o f fe a ture s v e cto rs for th e w o rd « qu a n d »

‘z’ ‘a ’ ‘b ’ ‘c’ ... O b s1 O bs n O b sT

O b se rva tion s v e cto rs se q u en c es p ro ce s se d

by the H M M

T D N N T D N N S a m e w e ig th s

m a tric es

• Scanning of the input word by a

TDNN (temporal convolution).

• TDNN produces the sequence of

observation probabilities.

•

No explicit segmentation

is

carried out but just an overlapping

window is regularly shifted.

• Advantages of the TDNN

architecture: weight sharing

⇒

reduction in computation times

and in memory requirements.

• Reduction of the number of

computation with an adapted

topology of the TDNN due to the

sliding observation window.

1- MsTDNN-HMM 2- Implicit Segmentation Control

3- Conclusion

1- MsTDNN-HMM

P/D

1 state /

character:

66 states

3 states /

character:

198 states =

198 output neurons

(5)

5

Dynamic Programming Likelihood computation

a

b

c

d

…

.n

o

p

q

…

.u

…

.z

Ch

a

ra

c

te

r

m

o

d

e

ls

O1 O2 O3 ….. _{Ot, Observation sequences}

Best Viterbi path

bj(Ot), observation probability aij = cst transition probability

Dynamic Programming

to find the best path alignment

Word Lexicon

---un

deux

trois

…

quand

…

qs 11 uu 11 nn 11 qf qs 11 qq 11 uu 11 aa 11 nn 11 dd 11 _qf

Words HMM = concatenation of letters HMM

DP

Likelihood

Scores

• Hence, the likelihood for each word in

the lexicon is computed by

multiplying the observation

probabilities over the best path

through the graph using the Viterbi

algorithm.

• The word model with the highest

likelihood is the top one recognition

candidate.

For each entry in the

lexicon, a

pseudo-HMM-word model is constructed

dynamically by

concatenating letter HMM

3- Conclusion

1- MsTDNN-HMM

q

s

q

1

q

f

1-state letter model

q

s

q

1

q

2

q

3

q

f

3-state letter model

(6)

6

Training

Word Likelihood Computation

Recognition

List of words ranked by their WL score

Word Lexicon

Normalisation

...

Sequence of features vectors

Original data Delta x - coord y - coord x - direction Y _ direction .... y - coord x - direction Y _ direction .... 1 _N

...

Sequence of vectors of observation probabilities ,

DP

TDNN

Features Extraction

Normalized data

Character Likelihood

Delta x - coord

Global cost function

Backpropagation of the gradient

Training :

• Word level training

• Classical

back-propagation algorithm

• Cost function mixing

MLE, MMI and

character level

discrimination

[ICDAR 2005]

3- Conclusion

1- MsTDNN-HMM

« one » recognized as « over »

True Hmm: (+)

ooonneee

Best Hmm: (–)

oovveeer

Best character: (–)

aouvneel

« one »

True Hmm: (+)

ooonneee

« one » recognized as « over »

True Hmm: (+)

ooonneee

(7)

7

Illustration of the segmentation problem

• Random initialisation of the TDNN

⇒

all outputs are equally probable

Example: Possible paths for the English word

‘of’ with 10 observations (TDNN

windows)

3- Conclusion

2- Implicit Segmentation Control

ooooooooof

ooooooooff

ooooooofff

ooooooffff

ooooofffff

ooooffffff

ooofffffff

ooffffffff

offfffffff

o

f

(8)

8

How to obtain desired Likelihood distribution?

3- Conclusion

P(

Γ

)

offfffffff

1

ooffffffff

ooofffffff

ooooffffff

ooooofffff

ooooooffff

ooooooofff

ooooooooff

ooooooooof

9

Segmentation

paths

for the word

‘of’

Existing solutions:

Training with a character database

- duration model for HMM sub-unit [Penacee]

- manual segmentation constraints [Remus, Npen++]

Proposed solution:

- no modification of the HMM structure (transitions=1)

- no training with a character database

(9)

9

Weighting function

f

T

:

B = First observation position

for the concerned letter

E = Last observation position

for the concerned letter

H = Height of the function

ƒ

T

S = E-B = letter size

3- Conclusion

0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9

Observations range corresponding to one letter

W

e

ig

h

ti

n

g

o

f

T

D

N

o

u

tp

u

ts

%

H=100

H>1000

( , )

min(

NN

( , ) *

T

(

);1)

Outputs t l

=

Outputs

t l

f

H

B

E

(10)

10

Weighting function applied : case of the word ‘of’

H=100

3- Conclusion

0

50

100

1

2

3

4

5

6

7

8

9

10

observation position

w

e

ig

h

ti

n

g

c

o

e

ff

ic

ie

n

t

o

f

1

(11)

11

How to evaluate the effectiveness of this weighting function ? ASR

1- MsTDNN-HMM 2- Implicit Segmentation Control

3- Conclusion

ASR : Averaged homogenous Segmentation Rate

e:

index of the considered sample

T:

number of observations scanned by the TDNN

N:

(fixed or limited) number of training samples

NL(e):

number of letters in the word label

NLs(e): number of sequences of different letters

i:

index of a sequence of letters from 1 to NLs

NLc(i): number of the same consecutive letter

and nb_obs(i): number of observations of the letter i in the optimal Viterbi path.

( )

1

_

( )

(

)

( ) *

/

( )

NLs e

N

e

i

nb obs i

ASR

N

=

NLc i

T NL e

=

∑ ∑

0.36

0.64

0.84

0.96

1

0.96

0.84

0.64

0.36

ASR

1

2

3

4

5

6

7

8

9

NL(f)

9

8

7

6

5

4

3

2

1

NL(o)

Case: ‘of’

(12)

12

Experiments :

Training

in 2 stages :

1-

f

T

segmentation

constraint function

for 20 epochs

2- training with no

segmentation

constraint

3- Conclusion

Word Likelihood Computation

...

Sequence of vectors of observation probabilities _,

True HMM

TDNN

Character Likelihood

Normalisation

Original data Normalized data

...

Sequence of features vectors

Delta x - coord y - coord x - direction Y _ direction .... y - coord x - direction Y _ direction .... 1 _N

Features Extraction

Delta x - coord

Global cost function

Backpropagation

of the gradient

...

Constrained Sequence wtih

duration model

,

Label of the true word

Weighting

function

ƒ

_T

Weighting

function

ƒ

_T

(13)

13

3- Conclusion

0.684

0.498

11.8

17.7

Relative

improvement (%)

0.612

0.423

Average

Segmentation

Rate (ASR)

94.81

3S-TDNN

Segmentation

constraint

28.1

32.6

Relative

improvement (%)

90.51

91.31

87.09

Recognition rate

(%)

3State-TDNN

1S-TDNN

Segmentation

constraint

1State-TDNN

System

Results on online word IRONOFF database

IRONOFF word database

English / French

Lexicon : 197 words

2/3 Training : 20 898 words

(14)

14

Conclusions

• A simple and effective process to bootstrap a hybrid

system without any labeled character database

• Important increasing of the recognition rate by an

initialisation stage with a duration model apply

directly on the TDNN outputs

Weighting function ƒ

T

• Better quality of segmentation

⇒

better recognition

rates

ASR – Averaged Segmentation Rate

(15)

15

Future works : controlled segmentation stage

• Measure of the ASR for each training iteration

• Adaptation of the parameter H in the weighting

function to reduce the segmentation constraints

according to the ASR value.

– H=ƒ(ASR).

– Initialisation with H = 100

⇒

high segmentation

constraint, only balanced segmentation paths are possible

– Then relax constraint on H when ASR has reached a

reasonable value.

(16)

16

(17)

17

Le moteur de reconnaissance : le TDNN

TDNN

...

MLP

...

Global

Vision

Weight

sharing

Local

vision to

global

vision