
New features for on-line aphasia therapy

Anna Maria Pompili

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Examination Committee

Chairperson:

Prof. Pedro Manuel Moreira Vaz Antunes de Sousa

Supervisor:

Prof. Isabel Maria Martins Trancoso

Supervisor:

Dr. Alberto Abad Gareta

Member of the Committee:

Prof. Alfredo Manuel dos Santos Ferreira Júnior


To Giuseppina and Francesco.


Acknowledgments

My deepest gratitude goes to Professor Alberto Abad. He always guided and supported me in the most comprehensive and constructive way, providing brilliant ideas, showing me the right approach to address complicated problems, and readily helping me to overcome the many difficulties that I had to tackle while pursuing the objectives of this thesis. His guidance, constant encouragement and endless availability have been fundamental to the achievement of this result.

I wish to express my gratitude to Professor Isabel Trancoso, not only for her valuable guidance, but also for having welcomed me into the L2F group. During the time spent here, she always motivated me with inspiring discussions, and provided me with her full support and availability. She never missed a chance to demonstrate her trust in me, and constantly accompanied my work, enlightening my way with her unique ability to identify innovative research directions and enticing applications for the results achieved during this work.

I owe a very special acknowledgment to Isabel Pavão Martins, José Fonseca, Gabriela Leal, Luisa Farrajota, and Sofia Clérigo, from the Language Research Laboratory group (LEL - Laboratório de Estudos de Linguagem) of the Lisbon Faculty of Medicine. Their cooperation has been fundamental to allow the project VITHEA to become a reality.

I also want to thank Professor Nuno Mamede and Professor Sara Candeias from the L2F group, for having kindly provided important resources that have constituted the baseline for some of the results achieved in this work. Without these starting points, those results would not have been possible.

Thank you also to all the colleagues and roommates that I have had the pleasure to know during these years. They have supported this experience not only with their kindness and friendship, but also through their active participation in the data collection and in the user evaluation.

Finally, my special thanks go to Paolo, my companion. His advice, care, and support have been invaluable in helping me overcome the hardest difficulties.


Resumo

Afasia é um tipo particular de distúrbio da comunicação causada por lesões de uma ou mais áreas do cérebro que afectam diferentes funcionalidades da linguagem e da fala. Os acidentes vasculares cerebrais são uma das causas mais comuns dessa doença.

VITHEA (Terapeuta Virtual para o tratamento da afasia) é uma plataforma on-line desenvolvida para o tratamento de doentes afásicos, incorporando os recentes avanços das tecnologias de fala para proporcionar exercícios de nomeação a pessoas com uma reduzida capacidade de nomear objetos. O sistema, disponível ao público desde Julho de 2011, recebeu já vários prémios nacionais e internacionais e é atualmente distribuído a cerca de 160 utilizadores entre profissionais de saúde e doentes.

O foco deste trabalho é investigar a viabilidade da incorporação de funcionalidades adicionais que podem potenciar o sistema VITHEA. Essas funcionalidades visam tanto estender a usabilidade do sistema quanto reforçar o seu desempenho, considerando assim várias áreas heterogéneas do projeto. Entre estas funcionalidades destacam-se: uma nova versão do aplicativo cliente para estender a portabilidade da plataforma a dispositivos móveis, uma interface hands-free para facilitar os doentes portadores de deficiências físicas, e uma funcionalidade de pesquisa avançada para melhorar a gestão dos dados da aplicação. Foi também estudada a viabilidade de um novo tipo de exercícios e avaliado o desempenho de um novo léxico de pronúncia com o objectivo de melhorar os resultados de reconhecimento. Em geral, os resultados de questionários de satisfação dos utilizadores e as avaliações automáticas têm proporcionado feedback encorajador sobre as melhorias desenvolvidas.

Palavras-chave:

Afasia, recuperação da linguagem, terapia virtual, distúrbio da fala, nomeação oral, reconhecimento de fala


Abstract

Aphasia is a particular type of communication disorder caused by damage to one or more language areas of the brain, affecting various speech and language functions. Cerebral vascular accidents are one of its most common causes.

VITHEA (Virtual Therapist for Aphasia Treatment) is an on-line platform developed for the treatment of aphasic patients, incorporating recent advances in speech and language technologies to provide word naming exercises to individuals with lost or reduced word naming ability. The system, publicly available since July 2011, has received several national and international awards and is currently distributed to almost 160 users, among health-care professionals and patients.

The focus of this thesis is to investigate the feasibility of incorporating additional functionalities that may enhance the VITHEA system. These features aimed at both extending the usability of the system and strengthening its performance, and thus involved several heterogeneous areas of the project. The main new features were: a new version of the client application to extend the portability of the platform to mobile devices, an ad-hoc hands-free interface to facilitate patients with physical disabilities, and an advanced search capability to improve the management of the application data. This study also included the assessment of the feasibility of a new type of exercise, and the evaluation of a new pronunciation lexicon aimed at improving recognition results. Overall, the results of user interaction satisfaction questionnaires and the automatic evaluations have provided encouraging feedback on the outcome of the developed improvements.

Keywords:

Aphasia, language recovery, virtual therapy, speech disorder, word naming, speech recognition


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
List of abbreviations

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Structure of this Document

2 Related Work
2.1 Aphasia language disorder
2.1.1 Aphasia symptoms classification
2.1.2 Aphasia treatment
2.2 Automatic speech recognition
2.2.1 Brief introduction to automatic speech recognition
2.2.2 AUDIMUS speech recognizer
2.2.3 Automatic word verification
2.2.3.0.1 Word verification based on keyword spotting
2.2.3.0.2 Keyword spotting with AUDIMUS
2.3 Platform for speech therapy
2.3.1 VITHEA: An on-line system for virtual treatment of aphasia
2.3.1.1 The patient and the clinician applications
2.3.1.1.1 Patient application module
2.3.1.1.2 Virtual character animation and speech synthesis
2.3.1.1.3 Speech synthesis
2.3.1.1.4 Clinician application module
2.3.1.2 Platform architecture overview
2.4 New features for aphasia therapy: State of the art
2.4.1 Content adaptation for mobile devices
2.4.2 Hands-free speech
2.4.3 Exploiting IR for improved search functionality
2.4.4 New automatic evocation exercises for therapy treatment
2.4.5 Exploiting syllable information in word naming recognition of aphasic speech

3 Content adaptation for mobile devices
3.1 Service Oriented Architecture
3.1.1 Representational State Transfer
3.1.2 Data representation
3.1.3 Android Platform
3.2 Architectural overview of the implemented prototype
3.2.1 REST authentication
3.2.2 Implemented architecture
3.2.2.0.1 Authentication
3.2.2.0.2 Data representation
3.2.3 Client application
3.3 User experience evaluation
3.4 Discussion

4 Hands-free speech recording
4.1 Voice activity detection task
4.1.1 Algorithm
4.1.2 Architecture
4.2 Experimental evaluation
4.2.1 Speech corpus
4.2.2 Results
4.3 Discussion

5 Exploiting IR for improved search functionality
5.1 Extended search functionality
5.1.1 Methodology
5.1.1.1 Data description
5.1.1.2 Metadata generation
5.1.1.3 Indexes generation and management
5.2 Experimental evaluation
5.3 Discussion

6 New automatic evocation exercises for therapy treatment
6.1 Automatic animal naming recognition task
6.1.1 Keyword spotting
6.1.2 Keyword model generation
6.1.3 Background penalty for keyword spotting tuning
6.2 Experimental evaluation
6.2.1 Speech corpus
6.2.2 Results
6.3 Discussion

7 Exploiting syllable information in word naming recognition of aphasic speech
7.1 Syllabification task
7.1.1 Methodology
7.2 Experimental evaluation
7.2.1 Speech corpus
7.2.2 Results
7.3 Discussion

8 Conclusions
8.1 Achievements
8.2 Future Work

Bibliography


List of Tables

4.1 Baseline configuration established through exhaustive search.
4.2 Results obtained on the development test set with the baseline configuration.
4.3 Results obtained on the evaluation set with the baseline configuration.
5.1 Coverage of the additional metadata generated.
5.2 Precision and recall for each of the indexes generated.
5.3 Number of results returned by the system using the extended search feature and using a standard search functionality.
6.1 Speech corpus data, including gender, total number of words and the total number of valid words uttered.
6.2 Experiments data and resulting average WER, including file size information.
6.3 Experiments data and resulting average WER, including file size information.
6.4 Experiments data and resulting average WER.
6.5 Automatic and manual WER with the configuration 2 of the last set of experiments.
7.1 Average WVR for the APS-I and APS-II corpus with different pronunciation models.
7.2 WVR for APS-I and APS-II data sets and average WVR, using automatically calibrated background penalty term.


List of Figures

2.1 Block diagram of AUDIMUS speech recognition system.
2.2 Comprehensive overview of the VITHEA system.
2.3 Screen-shots of the VITHEA patient application.
2.4 Interface for the creation of new stimulus.
2.5 Interface for the management of multimedia resources.
3.1 Architecture.
3.2 Screen-shots of the VITHEA mobile patient application.
3.3 Results of the evaluation.
3.4 Distribution of the user grades for the questions of the third group.
4.1 Architectural implementation of the VAD algorithm.
4.2 Process of generation of the speech corpus.
5.1 Structure of the objects of the VITHEA system that are of interest for the search functionality.
5.2 Results provided for the search query “seco” (dry) on the field answer of a Question.
5.3 Results provided for the search query “alimento” (food) on the field answer of a Question.
5.4 Results provided for the search query <“harpia” (harpy), “animais” (animals)> on the fields title and category of a document.
6.1 First set of experiments using the keyword model with different values for the threshold.
6.2 Second set of experiments using the keyword model filtered with Onto.PT and different values for the threshold.
6.3 Third set of experiments, including phonetic transcription correction and filled pause models.
7.1 Results for the APS-I comparing the two pronunciation lexicons, the standard and the augmented version provided with syllable boundaries.
7.2 Results for the APS-II comparing the two pronunciation models, the standard and the augmented version provided with syllable boundaries.


List of abbreviations

ANN Artificial Neural Network

ASR Automatic Speech Recognition

CSR Continuous Speech Recognition

CVA Cerebral Vascular Accident

JSON JavaScript Object Notation

IWR Isolated Word Recognition

LVCSR Large Vocabulary Continuous Speech Recognition

MLP Multilayer Perceptron

REST Representational State Transfer

RPC Remote Procedure Call

WFST Weighted Finite State Transducer

TTS Text-To-Speech

SOA Service-Oriented Architecture

SOAP Simple Object Access Protocol

URI Uniform Resource Identifier

XML Extensible Markup Language

WVR Word Verification Rate

WER Word Error Rate

VAD Voice Activity Detection

1 Introduction

Aphasia is a particular type of communication disorder caused by damage to one or more language areas of the brain, affecting various speech and language functions. Cerebral vascular accidents are one of its most common causes. A frequent symptom is the difficulty to recall names or words. Typically, such problems can be treated through word naming therapeutic exercises. In fact, the frequency and intensity of speech therapy are a key factor in recovery, thus motivating the development of automatic therapy methods that may be used remotely.

VITHEA (Virtual Therapist for Aphasia Treatment) is an on-line platform developed for the treatment of aphasic patients, incorporating recent advances in speech and language technologies to provide word naming exercises to individuals with lost or reduced word naming ability. The project started in June 2010 and the first public prototype was released in July 2011. Since then, the system has continuously evolved, with improvements both in the speech recognition techniques used and in the functionalities provided to patients and therapists. After three years of active development, the project is now used daily by patients and speech therapists and has received awards from both the speech and the health-care communities.

The success of the system motivated research on additional features that could extend its functionality and robustness. These new features cover a heterogeneous set of enhancements that includes, among others, the evaluation of new approaches to improve recognition quality and the recording process, the development of new interfaces to improve the user experience, and the experimental implementation of new types of exercises.

The focus of the present work is, as the title states, to investigate the feasibility of incorporating additional functionalities that may improve the VITHEA system. These features aim at both extending the usability of the system and strengthening its performance. The former will be achieved by providing a new version of the client application for mobile devices, a hands-free interface for an easier recording experience, an advanced search functionality for improved management of platform data, and a new type of exercise. Concerning system performance, a new approach that considers the syllabic division of words will be studied and tested within the current speech recognition process.

The VITHEA system comprises two specific modules, dedicated respectively to the patients for carrying out the therapy sessions and to the clinicians for the administration of the functionalities related to them.

Nowadays, mobile devices and their services are more and more integrated into everyday life. In some cases, a smartphone may be cheaper than a computer, more practical, and even easier to use, as it does not require an external input device. Thus, the adaptation of the VITHEA platform to mobile devices has been considered of high importance for the diffusion of the system. However, this extension is currently limited by the recording module of the application. Here, an architecture compliant with the new requirements and a client version for mobile devices have been designed and implemented, in order to verify the performance and the level of appreciation of users for the mobile version.

Related also to the client module, it is worth noticing that most of the time aphasia is the consequence of a Cerebral Vascular Accident (CVA) and, in those cases, affected patients may also experience some sort of physical disability in arm mobility. In such situations, support for a hands-free interface notably improves the usability of the system. However, the typical extension of these interfaces for human-computer interaction consists of voice commands as an alternative input modality. In the particular case of the VITHEA project, since the user of the system is affected by a language disorder, hands-free computing cannot be interpreted as an alternative way of interaction; instead, it will be selectively applied to automate the process of recording the users' answers and thus provide additional benefits to people experiencing disabilities. The moment when the recording process should start can be efficiently determined automatically, by taking as reference the end of the description of the stimulus spoken by the virtual therapist, or the end of the subsequent reproduction of the audio/video file in the case of a multimedia stimulus. Detecting the end of the speech is a more challenging issue. Common solutions rely on Voice Activity Detection (VAD) approaches, which try to automatically determine the presence or absence of voice from features of the input signal. Depending on the implementation, the features used may vary. In this work, the energy of the speech signal has been used as a baseline to develop an algorithm that automatically detects the end of the speech.

On the other hand, the objective of the clinician module is to allow the management of patient data as well as of the collection of exercises and resources associated with them. During the last years several improvements were introduced to allow the incorporation of new exercises and the creation and management of groups of speech-language therapists and patients. However, many important functionalities that affect the overall usability of the clinician module were still missing. The management of the exercise data, which now exceeds one thousand stimuli, only provided a listing functionality, missing the option to search for a given stimulus. Considering the amount of data stored in the system, the lack of a search feature strongly affects the daily usage of the module. Besides, it should be noted that the data constituting the exercises and the stimuli is somewhat peculiar in its format. In fact, most of the time it is represented by a single keyword (e.g., the title of a document). This means that if the therapist does not remember the exact term he/she is looking for, the search will probably fail. For these reasons, it is important that the search functionality takes these constraints into account and thus provides extended search capabilities. Techniques from the area of Information Retrieval, such as Query Expansion, will be exploited to achieve this purpose.
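As a rough illustration of the intended behaviour (a sketch only: the synonym map below is invented for the example, while the actual functionality described in Chapter 5 derives related terms from ontological resources and uses a full-text index), a query term can be expanded with related terms before being matched against the stimulus titles:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Toy sketch of query expansion: a search term is expanded with a few related
     *  terms before being matched against stimulus titles. The synonym map is
     *  invented for illustration; the real system derives it from ontological resources. */
    public final class ExpandedSearchExample {

        private static final Map<String, List<String>> RELATED = new HashMap<>();
        static {
            // Hypothetical expansion terms for the query "alimento" (food).
            RELATED.put("alimento", List.of("comida", "fruta", "legume"));
        }

        /** Returns the titles that contain the query term or any of its related terms. */
        public static List<String> search(String query, List<String> stimulusTitles) {
            List<String> expanded = new ArrayList<>();
            expanded.add(query.toLowerCase());
            expanded.addAll(RELATED.getOrDefault(query.toLowerCase(), List.of()));

            List<String> hits = new ArrayList<>();
            for (String title : stimulusTitles) {
                for (String term : expanded) {
                    if (title.toLowerCase().contains(term)) {
                        hits.add(title);
                        break; // avoid adding the same title twice when several terms match
                    }
                }
            }
            return hits;
        }
    }

In this way, a search for a generic term can still return stimuli whose titles use a more specific or synonymous word, which is exactly the situation in which an exact-match search would fail.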

Concerning the therapeutic exercises, there are several naming tasks for assessing the patient's ability to provide the verbal label of objects, actions, events, attributes, and relationships. There are different types of naming tasks, such as category naming, confrontation naming, automatic closure naming, automatic serial naming, recognition naming, repetition naming, and responsive naming [Campbell 05, Murray 01]. Currently, the VITHEA system supports exercises based on visual confrontation, automatic closure naming, and responsive naming. The integration of automatic serial naming or semantic category naming exercises would be of valuable help for patients recovering from aphasia.

Finally, during preliminary experiments evaluating the performance of the word naming recognition task within the VITHEA system, an analysis of word detection errors was performed [Abad 13]. These results showed that among the characteristics of aphasic speech that sometimes cause keywords to be missed are pauses between syllables and mispronounced phonemes. Recordings have confirmed that some patients have a tendency to speak with a slow rhythm, almost as if they were dividing the word into syllables. This phenomenon, even more pronounced, was also directly observed in different sessions of experimentation with the system, performed either by a patient or by a healthy subject. In these contexts, when the system failed to recognize the user's answer, the user typically started to syllabify the word. These observations motivated the idea of investigating the integration of an external speech tool that performs the syllabification of words. The syllabified version will constitute an augmented grammar for the recognizer that will hopefully improve its performance.

All the objectives that were identified in the thesis proposal [Pompili 13] were implemented, with the exception of the "awareness and profiling" functionality. This feature has been replaced by an advanced search capability that has been integrated into the clinician module. In fact, during the evolution of the thesis work, this feature appeared more interesting and useful for the improvement of the project, to the point of justifying the introduction of this amendment. Thus, the main goals of this work now are:

• content adaptation for mobile devices,
• hands-free speech interface,
• advanced search capability,
• new naming exercises,
• syllabification tool.

The present thesis consists of 8 chapters, structured as follows:

• Chapter 2 starts by reporting on background concepts and on the state of the art of on-line platforms dedicated to speech disorders, describing the VITHEA system in further detail. It then focuses on the specification of the new features that represent the target of this work, reporting the current state of the art, where applicable.

• Chapter 3 reports on the architecture, the design, and the security pattern that have been followed to develop a new version of the system supported by mobile devices. The constraints that have guided the final prototype and the choices that have been made are justified, motivated and explained there. The chapter then concludes with a description of the results of a user experience evaluation.

• Chapter 4 describes in detail both the options chosen for the implementation of a VAD approach carried out at the same time as the recording process and the architectural updates involved in its integration within the VITHEA system. The chapter ends with the evaluation of the algorithm through automated tests carried out with the recordings of daily users of the system.

• In Chapter 5 the focus is on improving the management of the data of the system by providing an advanced search feature. This is achieved through the generation of metadata provided by ontological resources. These data are then exploited by a query expansion process and a full-text search engine to provide an extended set of results. Precision and recall measures for a given test set of queries are reported at the end of the chapter.

• Chapter 6 explains the concept of evocation exercises and how a specific subclass, animal naming, has been implemented through an iterative process of enhancements. The construction of the baseline list of admissible animals, a key component of the recognition process, is detailed together with the automated evaluation carried out through the collection of a speech corpus.

• Chapter 7 introduces the issues that surround the task of syllabic division of words and describes how an external software tool that provides orthographic syllabification has been adapted to the architecture of the speech recognition engine AUDIMUS. Then, the results of the automated tests, carried out with the corpus of aphasic patients collected during the VITHEA project, are described.

• Finally, Chapter 8 presents the conclusions and future work.

2 Related Work

This chapter aims at providing both important background knowledge that will be referred to in the rest of this document and the relevant state of the art for the new targeted features. It is divided into four main sections. First, a short background on aphasia and common therapeutic approaches is given (Section 2.1); then, an overview of an Automatic Speech Recognition (ASR) system is provided in Section 2.2, focusing on AUDIMUS, the in-house speech recognition engine used. Section 2.3 describes currently known platforms providing on-line tools for voice disorders, with a deep focus on the VITHEA system. Finally, Section 2.4 is devoted to describing the state of the art that is relevant for each of the new features that are the object of this work.

2.1 Aphasia language disorder

Aphasia is a speech disorder which comprises difficulties in both the production and comprehension of spoken or written language. It is caused by damage to one or more of the language areas of the brain, and it typically occurs after brain injuries. There are several causes of brain injuries affecting communication skills, such as brain tumours, brain infections, severe head injuries and, most commonly, cerebral vascular accidents (CVA). Among the effects of aphasia, the difficulty to recall words or names is the most common disorder presented by aphasic individuals. In fact, it has been reported in some cases as the only residual deficit after rehabilitation [Wilshire 00]. Several studies about aphasia have demonstrated the positive effect of speech-language therapy activities on the improvement of social communication abilities [Basso 92]. Moreover, it has been shown that the intensity of therapy positively influences speech and language recovery in aphasic patients [Bhogal 03].

2.1.1 Aphasia symptoms classification

We can classify the various aphasia syndromes by characterizing the speech output in two broad categories: fluent and non-fluent aphasia [Goodglass 93]. Fluent aphasic speech has normal articulation and rhythm, but is deficient in meaning. Typically, there are word-finding problems that most affect nouns and picturable action words. Non-fluent aphasic speech is slow and laboured, with short utterance length. The flow of speech is more or less impaired at the levels of speech initiation, the finding and sequencing of articulatory movements, and the production of grammatical sequences. Following the above classification, we list the major types of aphasia and their properties:

1. Fluent

(a) Wernicke's aphasia is one of the most common syndromes in fluent aphasia. People with Wernicke's aphasia may speak in long sentences that have no meaning, adding unnecessary or made-up words. Individuals with Wernicke's aphasia usually have great difficulty understanding the speech of both themselves and others and are therefore often unaware of their mistakes.

(b) Transcortical aphasia presents similar deficits as in Wernicke’s aphasia, but repetition ability remains intact.

(c) Conduction aphasia is caused by deficits in the connections between the speech-comprehension and speech-production areas. Auditory comprehension is near normal, and oral expression is fluent with occasional paraphasic errors. Repetition ability is poor.

(d) Anomic aphasia: the individual may have difficulties naming certain words, linked by their grammatical type (e.g., difficulty naming verbs and not nouns) or by their semantic category (e.g., difficulty naming words relating to photography but nothing else), or a more general naming difficulty.

2. Non-fluent

(a) Broca’s aphasia is caused by damage to the frontal lobe of the brain. People with Broca’s aphasia may speak in short phrases that make sense but are produced with great effort. People with Broca’s aphasia typically understand the speech of others fairly well. Because of this, they are often aware of their difficulties and can become easily frustrated.

(b) Global aphasia presents severe communication difficulties; individuals with global aphasia are extremely limited in their ability to speak or comprehend language. They may be totally non-verbal, and/or use only facial expressions and gestures to communicate.

(c) Transcortical Motor aphasia presents deficits similar to Broca's aphasia, except that repetition ability remains intact. Auditory comprehension is generally fine for simple conversations, but declines rapidly for more complex conversations.

2.1.2 Aphasia treatment

In some cases, a person will completely recover from aphasia without treatment. This type of spontaneous recovery usually occurs following a type of stroke in which blood flow to the brain is temporarily interrupted but quickly restored, called a transient ischemic attack. In these circumstances, language abilities may return in a few hours or a few days. For most cases, however, language recovery is not as quick or as complete. While many people with aphasia experience partial spontaneous recovery, in which some language abilities return a few days to a month after the brain injury, some residual disorders typically remain. In these instances, most clinicians would recommend speech-language therapy. The recovery process usually continues over a two-year period, although clinicians believe that the most effective treatment begins early in the recovery process.

There are multiple modalities of speech therapy [Albert 98]. The most commonly used techniques are focused on improving expressive output, such as the stimulation-response method and Melodic Intonation Therapy (MIT). MIT is a formal, hierarchically structured treatment program based on the assumption that the stress, intonation, and melodic patterns of language output are controlled primarily by the right hemisphere and, thus, are available for use by the individual with aphasia with left-hemisphere damage [Albert 94]. Other methods are linguistic-oriented learning approaches, such as lexical-semantic therapy or the mapping technique for the treatment of agrammatism. Still other techniques, such as Promoting Aphasics' Communicative Effectiveness (PACE), focus on enhancing communicative ability, non-verbal as well as verbal, in pragmatically realistic settings [Davis 85]. Several non-verbal methods for the treatment of severe global aphasics rely on computer-aided therapy; visual analogue communication, iconic communication, and visual action and drawing therapies are currently used [Sarno 81]. An example is Computerized visual communication (C-VIC), designed as an alternative communication system for patients with severe aphasia and based on the notion that people with severe aphasia can learn an alternative symbol system and can use this alternative system to communicate [Weinrich 91].

Furthermore, although there exists such an extended list of treatments, each specifically devised to recover from a different disorder caused by aphasia, one especially important class of treatment is the one devoted to improving word retrieval problems since, as noted, these are among the most common residual disorders in all aphasia syndromes. Naming problems are typically treated with semantic exercises like naming objects or naming common actions, where the patient is commonly asked to name a subject represented in a picture [Adlam 06].

2.2 Automatic speech recognition

Speech recognition is the translation, operated by a machine, of spoken words into text. It is a difficult task whose automation involves many areas of computer science, from signal processing to statistical frameworks and machine learning techniques. In the following, in order to describe the components of the ASR module that are of relevance for the project, a brief introduction to speech recognition topics is provided.

2.2.1 Brief introduction to automatic speech recognition

Speech recognition systems do not actually perform the recognition or decoding step directly on the speech signal. Rather, the speech waveform is divided into short frames of samples, which are converted to a meaningful set of features. The duration of the frames is selected so that the speech waveform can be regarded as being stationary. In addition to this transformation, some pre-processing techniques are applied to the waveform signal in order to enhance it and to better prepare it for speech recognition.

In the feature extraction step, the sampled speech signal is parametrized. The goal is to extract from each frame of the signal a number of parameters ('features') that contain the relevant speech information, are robust to acoustic variations, and are sensitive to linguistic context. In more detail, features should be robust against noise and factors that are irrelevant to the recognition process; in addition, features that are discriminant and allow distinguishing between different linguistic units (e.g., phones) are required.

Then, the next stage in the recognition process is to map the speech vectors found in the previous step to the desired underlying sequence of acoustic classes modelling concrete symbols (phonemes, letters, words, ...). Acoustic modelling is arguably the central part of any speech recognition system, and it plays a critical role in improving ASR performance. The practical challenge is how to build accurate acoustic models that can truly reflect the spoken language to be recognized. Typically, sub-word models like phonemes, diphones or triphones are used as the acoustic modelling unit more often than word models. An extended and successful statistical parametric approach to speech recognition is the Hidden Markov Model (HMM) paradigm [Rabiner 89, Rabiner 93], which supports both acoustic and temporal modelling. HMMs model the sequence of feature vectors as a piecewise stationary process. An utterance X = x1, ..., xn, ..., xN is modelled as a succession of discrete stationary states Q = q1, ..., qk, ..., qK, with K < N, and instantaneous transitions between these states. An HMM is typically defined as a stochastic finite state automaton, usually with a left-to-right topology. It is called a "hidden" Markov model because the underlying stochastic process (the sequence of states) is not directly observable, but still affects the observed sequence of acoustic features. Alternatively, Artificial Neural Networks (ANN) have been proposed as an efficient approach to acoustic modelling [Tebelskis 95]. Although ANNs have been used for difficult problems in pattern recognition for the past thirty years, more recently many researchers have shown that these nets can be used to estimate probabilities that are useful for speech recognition. Multilayer Perceptrons (MLP) are the most common ANNs used for speech recognition. Typically, MLPs have a layered feedforward architecture with an input layer, zero or more hidden layers, and one output layer. ANN-HMM hybrid systems have been the focus of research in order to combine the strengths of the two approaches [Morgan 95]. Systems based on this connectionist approach have performed very well on Large Vocabulary Continuous Speech Recognition (LVCSR).
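For reference, a standard first-order HMM formulation (not specific to this thesis) writes the joint likelihood of the observation sequence and a frame-level state path S = s1, ..., sN, each s_n taking values in the state set Q defined above, as

\[
p(X, S) = \prod_{n=1}^{N} P(s_n \mid s_{n-1})\, p(x_n \mid s_n), \qquad p(X) = \sum_{S} p(X, S),
\]

where P(s_1 | s_0) denotes the initial state probability. In hybrid ANN-HMM systems, the emission terms p(x_n | s_n) are obtained from the phone posteriors estimated by the network, scaled by the phone priors.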

Knowledge of the rules of a language, i.e., the way in which words are connected together into phrases, is expressed by the language model. It is an important building block in the recognition process, as it is used to guide the search for an interpretation of the acoustic input. There are two types of models that describe a language: grammar-based and statistically-based language models. When the range of sentences to be recognized is very small, it can be captured by a deterministic grammar that describes the set of allowed phrases. In large vocabulary applications, on the other hand, it is too difficult to write a grammar with sufficient coverage of the language, therefore a stochastic grammar, typically an n-gram model, is often used. An n-gram grammar is a representation of an (n−1)-th order Markov language model in which the probability of occurrence of a symbol is conditioned upon the prior occurrence of n−1 other symbols. When word models are used, the word model is obtained by concatenating the sub-word models according to the pronunciation transcription of the words in a dictionary or lexical model. Its purpose is to map the orthography of the words in the search vocabulary to the units that model the actual acoustic realization of the vocabulary entries. Lexicon generation may rely on manual dictionaries or on automatic grapheme-to-phoneme modules, either rule-based or data-driven learned approaches (or hybrid).
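Under an n-gram model, the probability of a word sequence factorizes in the usual way (a standard formulation, shown only for completeness):

\[
P(w_1, \dots, w_M) \approx \prod_{i=1}^{M} P\!\left(w_i \mid w_{i-n+1}, \dots, w_{i-1}\right).
\]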


The last step in the recognition process is the decoding phase, whose objective is to find the sequence of words whose corresponding acoustic and language models best match the input signal. Therefore, such a decoding process with trained acoustic and language models is often referred to as a search process. Its complexity varies according to the recognition strategy and to the size of the vocabulary. With Isolated Word Recognition (IWR) the word boundaries are known, the word with the highest forward probability is chosen as the recognized word, and the search problem becomes a simple pattern recognition problem. Search in Continuous Speech Recognition (CSR), on the other hand, is more complicated, since the search algorithm has to consider the possibility of each word starting at any arbitrary time frame. For small vocabulary tasks, it is possible to expand the whole search network defined by the language and lexical restrictions and to directly apply conventional time-synchronous Viterbi search. However, in LVCSR systems different strategies should be adopted; these span graph compaction techniques, on-the-fly expansion of the search space [Ortmanns 00], and heuristic methods.
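The search can be summarized by the usual maximum a posteriori decoding rule (a standard formulation; X denotes the sequence of acoustic feature vectors and W a candidate word sequence):

\[
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W}\, p(X \mid W)\, P(W),
\]

where the acoustic model provides p(X | W) and the language model provides P(W).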

2.2.2 AUDIMUS speech recognizer

AUDIMUS is the ASR system developed by the Spoken Language Processing Lab of INESC-ID (L2F) and integrated into the VITHEA system. It is the result of several years of research efforts dedicated to the development of ASR systems. AUDIMUS is a hybrid recognizer that follows the above-mentioned connectionist approach [Morgan 95]. It combines the temporal modelling capacity of HMMs with the discriminative pattern classification of MLPs. A Markov process is used to model the basic temporal nature of the speech signal, while an ANN is used to estimate posterior phone probabilities given the acoustic data at each frame. As shown in Fig. 2.1, the baseline system combines three MLP outputs trained with different feature sets: Perceptual Linear Prediction (PLP, 13 static + first derivative) [Hermansky 90], log-RelAtive SpecTrAl (RASTA, 13 static + first derivative) [Hermansky 92], and Modulation SpectroGram (MSG, 28 static) [Kingsbury 98]. This merged approach has proved to be more efficient and robust than using any one of the features individually [Meinedo 00]. This is explained by the combination of the advantages of these three feature sets: the inclusion of attributes of the psychological processes of human hearing into the analysis used with PLP [Jamaati 08], which makes speech perception more human-like; the compensation for linear channel distortions provided by RASTA; and the improved stability provided by MSG in the presence of acoustic interferences, such as high levels of background noise and reverberation [Koller 10]. The AUDIMUS decoder is based on a Weighted Finite State Transducer (WFST) approach to large vocabulary speech recognition [Mohri 02, Caseiro 06]. AUDIMUS integrates a rule-based grapheme-to-phone conversion module based on WFSTs for European Portuguese [Caseiro 02]. The acoustic model integrated in VITHEA was trained with 57 hours of downsampled Broadcast News data and 58 hours of mixed fixed-telephone and mobile-telephone data in European Portuguese [Abad 08].
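As a rough illustration of the multi-stream idea only (the actual combination rule used in AUDIMUS is described in the cited papers and may differ), the per-frame phone posteriors produced by the three networks can be merged, for instance by a simple average, before being passed to the decoder:

    /** Toy illustration of merging per-frame phone posteriors from several
     *  feature streams (e.g. PLP, RASTA, MSG). The simple averaging rule is an
     *  assumption made for illustration only. */
    public final class PosteriorMerger {

        /** @param streamPosteriors one posterior vector per stream, all of equal length
         *  @return the merged posterior vector (average of the streams, renormalized) */
        public static double[] merge(double[][] streamPosteriors) {
            int numPhones = streamPosteriors[0].length;
            double[] merged = new double[numPhones];
            for (double[] stream : streamPosteriors) {
                for (int p = 0; p < numPhones; p++) {
                    merged[p] += stream[p] / streamPosteriors.length;
                }
            }
            // Renormalize so that the merged vector is again a probability distribution.
            double sum = 0.0;
            for (double v : merged) {
                sum += v;
            }
            for (int p = 0; p < numPhones; p++) {
                merged[p] /= sum;
            }
            return merged;
        }
    }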


Figure 2.1: Block diagram of AUDIMUS speech recognition system.

2.2.3 Automatic word verification

The task that performs the evaluation of the utterances spoken by the patients, in a way similar to the role of the therapist in a rehabilitation session, is referred to as word verification. This task consists of deciding whether a claimed word W is uttered in a given speech segment S or not. In the simplest case, a true/false answer is provided, but a verification score may also be generated. It should be noted that the task has been called word verification, although it actually refers to term verification, since a keyword may in fact consist of more than one word (e.g., rocking chair).
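A common way to formalize this decision (a generic formulation, not necessarily the exact scoring used in VITHEA) is a log-likelihood ratio between a model of the claimed word and a competing background model, compared against a threshold:

\[
\mathrm{score}(S, W) = \log p(S \mid W) - \log p(S \mid \mathrm{background}) \ \gtrless\ \theta,
\]

with the utterance accepted as containing W when the score exceeds the threshold θ.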

2.2.3.0.1 Word verification based on keyword spotting Several approaches based on speech recognition technology exist to tackle the word verification problem. Given that the word W is known, forced alignment with an ASR system could be one of the most straightforward possibilities. However, speech from aphasic patients contains a considerable amount of hesitations, doubts, repetitions, descriptions and other speech-disturbing factors that are known to degrade ASR performance, and consequently this will further affect the alignment process. These issues led us to consider the forced alignment approach inconvenient for the word verification task. Alternatively, keyword spotting methods can better deal with unexpected speech effects. The objective of keyword spotting is to detect a certain set of words of interest in a continuous audio stream. In fact, word verification can be considered a particular case of keyword spotting (with a single search term), and similar approaches can be used.

Keyword spotting approaches can be broadly classified into two categories [Szöke 05]: based on LVCSR, or based on acoustic matching of speech with keyword models in contrast to a background model. Methods based on LVCSR search for the target keywords in the recognition results, usually in lattices, confusion networks or n-best hypothesis results, since these allow improved performance compared to searching in the 1-best raw output. The training process of an LVCSR system requires large amounts of audio and text data, which may be a limitation in some cases. Additionally, LVCSR systems make use of fixed large vocabularies (>100K words), but when a specific keyword is not included in the dictionary, it is never detected. Acoustic approaches are very closely related to IWR. They basically extend the IWR framework by incorporating an alternative competing model to the list of keywords, generally known as the background, garbage or filler speech model. A robust background speech model must be able to provide low recognition likelihoods for the keywords and high likelihoods for out-of-vocabulary words, in order to minimize false alarms and false rejections when CSR is performed. As in the IWR framework, keyword models can be word-based or phonetic-based (or sub-phonetic). The latter allows simple modification of the target keywords, since they are described by their sequence of phonetic units.

In order to choose the best approach for this task, preliminary experiments were conducted on a telephone speech corpus considering both the LVCSR and the acoustic matching approaches [Abad 13]. According to the results obtained, acoustic-based approaches were considered more adequate for the type of problem addressed in the on-line therapy system.

2.2.3.0.2 Keyword spotting with AUDIMUS To implement the technique described in the previous section and achieve a successful integration into the VITHEA system, the baseline ASR system was modified to incorporate a competing background speech model that is estimated without the need for acoustic model re-training.

While keyword models are described by their sequence of phonetic units provided by an automatic grapheme-to-phoneme module, the problem of background speech modelling must be specifically addressed. The most common approach consists of building a new phoneme classification network that, in addition to the conventional phoneme set, also models the posterior probability of a background speech unit representing "general speech". This is usually done by using all the training speech as positive examples for background modelling, and it requires re-training the acoustic networks. Alternatively, the posterior probability of the background unit can be estimated based on the posterior probabilities of the other phones [Pinto 07]. The second approach has been followed, estimating the likelihood of the background speech unit as the mean of the top-6 most likely outputs of the phonetic network at each time frame. In this way, there is no need for acoustic network re-training. The minimum duration for the background speech word is fixed to 250 msec.
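A minimal sketch of this estimation step, assuming the phonetic network outputs one posterior vector per frame (the class and method names below are illustrative, not the actual AUDIMUS code):

    import java.util.Arrays;

    /** Illustrative sketch: the posterior of the "background speech" unit is
     *  estimated as the mean of the top-K most likely phone posteriors of a
     *  frame, as described in the text (K = 6 in VITHEA). */
    public final class BackgroundUnit {

        /** @param phonePosteriors posterior probabilities of the regular phones for one frame
         *  @param k               number of top posteriors to average (6 in the text)
         *  @return the estimated background-unit posterior for this frame */
        public static double estimate(double[] phonePosteriors, int k) {
            double[] sorted = phonePosteriors.clone();
            Arrays.sort(sorted);                          // ascending order
            double sum = 0.0;
            for (int i = 0; i < k; i++) {
                sum += sorted[sorted.length - 1 - i];     // accumulate the k largest values
            }
            return sum / k;
        }
    }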

2.3 Platform for speech therapy

To the best of our knowledge, there are only a few therapeutic tools that support automatic evaluation through speech recognition. Two of the most outstanding are PEAKS (Program for Evaluation and Analysis of all Kinds of Speech disorders) and VITHEA (Virtual Therapist for Aphasia Treatment). PEAKS [Maier 09] is an on-line recording and analysis environment for the automatic or manual evaluation of voice and speech disorders. Once connected to the system, a patient may perform a standardized test, which is then analysed by automatic speech recognition and prosodic analysis. The result is presented to the user and can be compared to previous recordings of the same patient or to recordings from other patients.

VITHEA [Abad 13] is an on-line platform designed to act as a "virtual therapist" for the treatment of Portuguese-speaking aphasic patients. The system offers word naming exercises, wherein the patient is asked to recall the content presented in a photo or picture shown. By means of automatic speech recognition, the system processes what is said by the patient and decides whether it is correct or wrong. The program provides feedback both as a written solution and as a spoken message produced by an animated agent using text-to-speech synthesis.

Figure 2.2: Comprehensive overview of the VITHEA system.

The VITHEA system, the target of this work, will be described in depth in the following sections.

2.3.1 VITHEA: An on-line system for virtual treatment of aphasia

The on-line system described in [Pompili 11] is the first prototype for aphasia treatment resulting from the collaboration of the Spoken Language Processing Lab of INESC-ID (L2F) and the Language Research Laboratory of the Lisbon Faculty of Medicine (LEL), and it has been developed in the context of the activities of the Portuguese national project VITHEA. It consists of a web-based platform that permits speech-language therapists to easily create therapy exercises that can later be accessed by aphasia patients using a web browser. During the training sessions, the role of the therapist is taken by a "virtual therapist" that presents the exercises and that is able to validate the patients' answers. The overall flow of the system can be described as follows: when a therapy session starts, the virtual therapist shows to the patient, one at a time, a series of visual or auditory stimuli. The patient is then required to respond verbally to these stimuli by naming the content of the object or action that is represented. The utterance produced is recorded, encoded and sent via the network to the server side. There, a web application server receives the audio file and processes it with an ASR module, which generates a textual representation. This result is then compared with a set of predetermined textual answers (for the given question) in order to verify the correctness of the patient's input. Finally, feedback is sent back to the patient. Figure 2.2 shows a comprehensive view of this process. In practice, the platform is intended not only to serve as an alternative but, most importantly, as a complement to conventional speech-language therapy sessions, permitting intensive and inexpensive therapy for patients, besides providing therapists with a tool to assess and track the evolution of their patients.
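The validation step at the end of this flow can be pictured with a small sketch (illustrative only: the class and method names are assumptions, and the real system scores keyword-spotting hypotheses rather than comparing plain strings):

    import java.util.List;
    import java.util.Locale;

    /** Illustrative sketch of the answer-validation step: the textual hypothesis
     *  produced by the recognizer is checked against the set of accepted answers
     *  defined by the therapist for the current stimulus. */
    public final class AnswerValidator {

        /** @param recognizedText  textual output of the ASR module for the patient's utterance
         *  @param acceptedAnswers canonical answer plus admitted synonyms and diminutives
         *  @return true if any accepted answer occurs in the recognized text */
        public static boolean isCorrect(String recognizedText, List<String> acceptedAnswers) {
            String hypothesis = recognizedText.toLowerCase(Locale.ROOT);
            for (String answer : acceptedAnswers) {
                if (hypothesis.contains(answer.toLowerCase(Locale.ROOT))) {
                    return true;   // positive feedback is sent back to the patient
                }
            }
            return false;          // the patient may record a new attempt or skip the stimulus
        }
    }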

The various approaches to aphasia rehabilitation introduced in Section 2.1.2 aim at different purposes. Most of them are focused on restoring language abilities, while others are intended to compensate for language problems by learning other methods of communicating. The approach followed by the VITHEA system falls in the first category, aiming at restoring linguistic processing by means of linguistic exercises. In particular, the focus of the system is on the recovery of the word naming ability of aphasic patients. Exercises are designed for Portuguese-speaking aphasia patients.

2.3.1.1 The patient and the clinician applications

The system comprises two specific modules, dedicated respectively to the patients, for carrying out the therapy sessions, and to the clinicians, for the administration of the functionalities related to them. The two modules adhere to different requirements that have been defined for the particular class of user for which they have been developed. Nonetheless, they share the set of training exercises, which are built by the clinicians and performed by the patients.

2.3.1.1.1 Patient application module The patient module is meant to be used by aphasic individuals to perform the therapeutic exercises. Figure 2.3 illustrates some screen-shots of the patient module.

Exercise protocol Following the common therapeutic approach for the treatment of word finding difficulties, a training exercise is composed of several semantic stimuli items. Stimuli may be of several different types (text, audio, image and video) and they are classified according to themes, in order to immerse the individual in a pragmatic, familiar environment. As in ordinary speech-language therapy sessions, once the patient is logged into the system, the virtual therapist guides him/her in carrying out the training sessions, providing a list of possible exercises to be performed. When the patient chooses to start a training exercise, the system presents the target stimuli one at a time, in random order, and he/she is asked to respond to each stimulus verbally. After the evaluation of the patient's answer by the system, the patient can listen again to his/her previous answer, record a new utterance in case of an invalid answer, or skip to the next exercise.

Exercise interface The exercise interface has been designed to cope with the functionalities needed for automatic word recalling therapy exercises, which include, among others, the integration of an animated virtual character (the virtual therapist), Text-To-Speech (TTS) synthesized voice, image and video displaying, speech recording and play-back functionalities, automatic word naming recognition, and exercise validation and feedback prompting, besides conventional exercise navigation options. Additionally, the exercise interface has also been designed to maximize simplicity and accessibility. First, because most of the users for whom this application is intended have suffered a CVA and may also have some sort of physical disability. Second, because aphasia is a predominant disorder among elderly people, who are more prone to suffer from visual impairments. Thus, the graphic elements were carefully chosen, using big icons throughout the interface.

2.3.1.1.2 Virtual character animation and speech synthesis The virtual therapist's representation to the user is achieved through a three-dimensional (3D) game environment with speech synthesis capabilities. Within the context of the VITHEA application, the game environment is essentially dedicated to graphical computations, which are performed locally on the user's computer. Speech synthesis generation occurs on a remote server, thus ensuring proper hardware performance. The game environment is based on the Unity game engine (http://unity3d.com/); it contains a low-poly 3D model of a cartoon character with visemes and facial emotions, which receives and forwards text (dynamically generated according to the system's flow) to the TTS server. Upon server reply, the character's lips are synchronized with the synthesized speech.

Figure 2.3: Screen-shots of the VITHEA patient application.

2.3.1.1.3 Speech synthesis DIXI [Paulo 08] is the TTS engine developed by the Spoken Language Processing Lab of INESC-ID (L2F) and integrated into the game environment. It has been configured for unit selection synthesis with an open-domain cluster voice for European Portuguese. DIXI is used to gather SAMPA phonemes [Trancoso 03], their timings and the raw audio signal information, which is lossy-encoded for usage in the client game. The phoneme timings are essential for the visual output of the synthesized speech, since the difference between consecutive phoneme timings determines the amount of time a viseme should be animated.
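A minimal sketch of how such durations can be derived from the phoneme start times returned by the TTS engine (illustrative only; the actual client code and data structures are not shown in this document):

    /** Illustrative sketch: derive per-viseme animation durations from the start
     *  times of consecutive phonemes returned by the TTS engine. */
    public final class VisemeTiming {

        /** @param phonemeStartsMs start time of each phoneme, in milliseconds
         *  @param utteranceEndMs  end time of the whole synthesized utterance
         *  @return how long (in milliseconds) the viseme of each phoneme is animated */
        public static long[] durations(long[] phonemeStartsMs, long utteranceEndMs) {
            long[] durations = new long[phonemeStartsMs.length];
            for (int i = 0; i < phonemeStartsMs.length; i++) {
                long next = (i + 1 < phonemeStartsMs.length)
                        ? phonemeStartsMs[i + 1]   // next phoneme starts: switch viseme
                        : utteranceEndMs;          // last phoneme ends with the utterance
                durations[i] = next - phonemeStartsMs[i];
            }
            return durations;
        }
    }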

2.3.1.1.4 Clinician application module The clinician module is specifically designed to allow clinicians to manage patient data, to regulate the creation of new stimuli and the alteration of the existing ones, and to monitor user performance in terms of frequency of access to the system and user progress. The module is composed of three sub-modules:

User management This module allows the management of a knowledge base of patients that can be edited by the therapist at any time. Besides basic information related to the user's personal profile, the database also stores, for each individual, his/her type of aphasia, his/her aphasia severity (7-level subjective scale) and aphasia quotient (AQ) information from the Western Aphasia Battery.

Exercise editor This module allows the clinician to create, update, preview and delete stimuli from an exercise in an intuitive fashion, similar in style to a WYSIWYG editor. In addition to the canonical valid answer, the system accepts for each stimulus an extended word list comprising the most frequent synonyms and diminutives.

Since the stimuli are associated with a wide assortment of multimedia files, besides their management, the module also provides a rich web-based interface to manage the database of multimedia resources used within the stimuli. The system is capable of handling a wide range of multimedia encodings: audio (accepted file types: wav, mp3), video (accepted file types: wmv, avi, mov, mp4, mpe, mpeg, mpg, swf), and images (accepted file types: jpe, jpeg, jpg, png, gif, bmp, tif, tiff). Given the diversity of the file types accepted by the system, a conversion to a unique file type was needed, in order to show them all with only one external tool. Audio files are therefore converted to the mp3 file format, while video files are converted to the flv file format. Figures 2.4 and 2.5 illustrate some screen-shots of the clinician module.

Figure 2.4: Interface for the creation of new stimulus.

Patient tracking This module allows the clinician to monitor statistical information related to user-system interactions and to access the utterances produced by the patient during the therapeutic sessions. The statistical information comprises data related to the user's progress and to the frequency with which users access the system. On the one hand, all the attempts recorded by the patients are stored in order to allow re-evaluation by clinicians. This data can be used to identify possible weaknesses or errors of the recognition engine. On the other hand, monitoring the usage of the application by the patients allows the speech-language therapist to assess the effectiveness of the platform and its impact on the patients' recovery progress.

Figure 2.5: Interface for the management of multimedia resources.

2.3.1.2 Platform architecture overview

An ad-hoc multi-tier framework that adheres to the VITHEA requirements has been developed by integrating different heterogeneous technologies. The back-end of the system relies on some of the most advanced open source frameworks for the development of web applications: Apache Tiles, Apache Struts 2, Hibernate and Spring. These frameworks follow the best practices and principles of software engineering, thus guaranteeing the reliability of the system on critical tasks such as database access, security, session management, etc. The back-end side also integrates the L2F speech recognition system (AUDIMUS, [Meinedo 03, Meinedo 10]) and TTS synthesizer (DIXI, [Paulo 08]). The ASR component is the backbone of the system and is responsible for the validation or rejection of the answers provided by the user. TTS and facial animation technologies allow the virtual therapist to "speak" the text associated with a stimulus and supply positive reinforcement to the user. The client side also exploits Adobe Flash technology to support rich multimedia interaction, which includes audio and video stimuli reproduction and the recording and play-back of patients' answers. Finally, the system implements a data architecture that allows handling groups of speech-language therapists and groups of patients. Thus, a user may belong to a specific group of patients, and this group can be assigned to a therapist or to a group of therapists. Therapists who belong to the same group share the clinical information of the patients, the set of therapeutic exercises, and also the set of resources used within the various stimuli. In this way, patients with the same type and/or degree of severity of aphasia can be clustered together and take advantage of exercises and stimuli that are tailored to their specific disorder, thus improving the benefits resulting from a therapeutic training session.

This section provides the relevant state of the art for each of the new features targeted by this work.

2.4.1 Content adaptation for mobile devices

To make the VITHEA services available also from mobile devices, new client applications that adhere to the specific standards of each device have to be designed and built. This means that two separate software applications have to be built for Android-based and iOS-based devices; the target of this work addresses only the Android platform. On the other hand, the server-side services already provided by the system should preserve their original business logic, so that only the way the services are exposed is affected. These constraints point toward a Service-Oriented Architecture (SOA). SOA is a set of principles and methodologies for designing and developing software in the form of interoperable services. Here, services are well-defined business functionalities that are built as software components that can be reused for different purposes. Web services are the typical means of implementing a SOA: they allow the functional building blocks to be accessed over standard Internet protocols in a way that is independent of platforms and programming languages. In this scenario, the most widely used technologies for implementing a SOA rely on the Simple Object Access Protocol (SOAP), on Remote Procedure Calls (RPC), or on the Representational State Transfer (REST) approach.

SOAP is a message transport protocol for exchanging structured information in the implementation of web services in computer networks, and it has been accepted as the default message protocol in SOA. SOAP messages are created by wrapping application-specific XML messages within a standard XML-based envelope structure. The result is an extensible message structure which can be transported over most underlying network transports, such as SMTP and HTTP.

RPC is an inter-process communication mechanism that allows calling a procedure in another address space and exchanging data by message passing. Method stubs on the client process make the call appear as local, while the stubs take care of marshalling the request and sending it to the server process. The server process then unmarshalls the request and invokes the desired method, before replying to the client through the reverse procedure.

REST is an architectural style for distributed hypermedia systems. It describes an architecture where each resource, such as a web service, is identified by a unique Uniform Resource Identifier (URI). The principle of REST is to use the HTTP protocol as it was designed, accessing and modifying resources through the standardized HTTP methods GET, POST, PUT, and DELETE.
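As a concrete illustration of this mapping between resources, URIs and HTTP methods, the following is a minimal sketch of a REST resource written with JAX-RS-style annotations; the resource path, class names and payloads are hypothetical and are not taken from the actual VITHEA services.

```java
// Hypothetical JAX-RS resource sketch: exposes a patient's exercises as a REST resource.
// Paths, classes and fields are illustrative assumptions, not the real VITHEA API.
import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/patients/{patientId}/exercises")
public class ExerciseResource {

    // GET /patients/42/exercises -> the list of exercises assigned to patient 42
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Response listExercises(@PathParam("patientId") long patientId) {
        // A real implementation would query the exercise database here.
        return Response.ok("[]").build();
    }

    // POST /patients/42/exercises -> record a new exercise attempt for patient 42
    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    public Response addAttempt(@PathParam("patientId") long patientId, String attemptJson) {
        // The request body would carry the attempt data; here it is simply acknowledged.
        return Response.status(Response.Status.CREATED).build();
    }
}
```

The point of interest is that the client only needs to know the URI and the standard HTTP methods, so the contract between the two parties is much smaller than with RPC stubs or SOAP envelopes.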

One of the main criticisms of SOAP relates to the way SOAP messages are wrapped within an envelope. Because of the verbose XML format, SOAP can be considerably slower than competing middleware technologies. A disadvantage of RPC is that the set of legal actions that can be invoked on the server has to be explicitly defined at build time, since these actions are wrapped by the method stubs that are consumed by the client. In a REST scenario, on the other hand, the client and the server are much less tied: the obligation between the two parties is minimal, and in the case of HTTP's implementation of REST it corresponds to a single URI that can be accessed through a GET request.

Thus, within a larger context, SOAP is the de facto standard for web service message exchange; within a mobile context, however, the REST architecture is considered more lightweight [Richardson 07] than a SOAP-based web service architecture, since it avoids the heavy operations which, in the SOAP approach, are needed in order to maintain a standard format [Knutsen 09].

2.4.2 Hands-free speech

One of the main challenges in the implementation of the hands-free interface is the definition of a robust voice activity detection (VAD) algorithm. VAD aims at determining the presence or absence of speech. This technique is useful both for speech coding and for speech recognition, so it has been the object of many studies leading to several different approaches. In [Sangwan 02] the authors designed a customized algorithm for real-time speech transmission based on the energy of the input signal. This work relies on the estimation of an adaptive threshold that is representative of the background noise. Two refined strategies are defined to recover from misclassification errors that may result from the energy detector: the first is based on a feature of the signal, the zero-crossing rate, while the second relies on the autocorrelation function.
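To make the general idea of energy-based VAD concrete, the following is a minimal sketch, under simplifying assumptions, of a frame classifier that compares short-time energy against an adaptively estimated noise floor and uses the zero-crossing rate as a crude secondary check; it is an illustration only, not the algorithm of [Sangwan 02].

```java
// Minimal energy-based voice activity sketch (an illustrative assumption, not a
// published algorithm): a frame is marked as speech when its short-time energy
// exceeds an adaptive noise-floor estimate by a fixed factor.
public class SimpleEnergyVad {

    private double noiseFloor = 1e-4;            // running estimate of background energy
    private static final double ALPHA = 0.95;    // smoothing factor for the noise floor
    private static final double FACTOR = 3.0;    // speech must exceed the floor by this factor

    /** Returns true if the given frame of 16-bit PCM samples (e.g. 20-30 ms) is classified as speech. */
    public boolean isSpeech(short[] frame) {
        double energy = 0.0;
        int zeroCrossings = 0;
        for (int i = 0; i < frame.length; i++) {
            double s = frame[i] / 32768.0;        // normalise to [-1, 1]
            energy += s * s;
            if (i > 0 && (frame[i - 1] >= 0) != (frame[i] >= 0)) {
                zeroCrossings++;                  // crude voicing cue (zero-crossing rate)
            }
        }
        energy /= frame.length;

        boolean speech = energy > FACTOR * noiseFloor;
        if (!speech) {
            // Only adapt the noise floor on frames believed to be non-speech.
            noiseFloor = ALPHA * noiseFloor + (1 - ALPHA) * energy;
        }
        // A very high zero-crossing rate is treated as noise rather than speech.
        return speech && zeroCrossings < frame.length / 2;
    }
}
```

In a hands-free scenario such a classifier would typically be run over consecutive audio frames, and the transition from a run of non-speech frames to a run of speech frames (and back) would be used to decide when to start and stop recording.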

In [Chuangsuwanich 11] the authors investigate a VAD approach for real-world applications using a two-stage method based on two distinguishing features of speech, namely harmonicity and modulation frequency.

In [Ramirez 04] the authors employ long-term signal processing and maximum spectral component tracking to improve the VAD algorithm. With the introduction of a noise reduction stage before the long-term spectral tracking, the authors are able to recover from misclassification errors even in highly noisy environments. Experimental results appear to confirm the improvement with respect to VAD methods based on speech/pause discrimination.

2.4.3 Exploiting IR for improved search functionality

The intrinsic ambiguity of natural language is a well-known problem in human understanding that also affects, to an even greater extent, the computational processing of data related to human-computer interaction. Different issues influence different areas; among these, the possibility of expressing the same concept using different synonyms has a strong impact on the recall of most information retrieval systems. The methods for tackling this problem split into two major classes: global and local methods. The former includes techniques for expanding or reformulating the original query terms, so as to cause the new query to match other semantically similar terms. These techniques, known as "query expansion", may be based on a controlled vocabulary, on a manually or automatically derived thesaurus, or on log mining. Local methods, on the other hand, try to adjust a query relative to the results that initially appear to match it; the most used techniques in this context are known as "relevance feedback".

In this work, the potential of query expansion will be used to provide an enhanced search experience, which should allow a better management of the system data. To this purpose, some of the many approaches reported in the literature to address this task are briefly described, classified on the basis of the method used. Lexical resources such as WordNet or the UMLS Metathesaurus are commonly exploited for query expansion. In [Voorhees 94], lexical-semantic relations are used to improve search performance in large data collections. In [Aronson 97], the authors explore the MetaMap program for associating Metathesaurus concepts with the original query in order to retrieve MEDLINE citations. Many other approaches use corpora or lexical resources to automatically build a thesaurus. However, most of these methods are used in domain-specific search engines or applications. In [Gong 05], the authors used WordNet together with a Term Semantic Network (TSN) built from word co-occurrences in a corpus, where the TSN acts both as a filter and as a supplement for WordNet. Finally, with the increasing usage of Web search engines, it has become easy to collect and use query logs: [Cui 02] developed a system that extracts probabilistic correlations between query terms and document terms using query logs.
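The following is a minimal sketch of the global, thesaurus-based flavour of query expansion, assuming a small hand-built synonym table in place of a resource such as WordNet; the class name and the example entries are hypothetical.

```java
// Minimal query-expansion sketch under the assumption of a hand-built synonym
// table; a real system would draw the synonyms from a resource such as WordNet.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryExpander {

    private final Map<String, List<String>> synonyms = new HashMap<>();

    public QueryExpander() {
        // Illustrative entries only.
        synonyms.put("cat", List.of("feline", "kitty"));
        synonyms.put("photo", List.of("picture", "image"));
    }

    /** Expands each query term with its known synonyms (global method: query expansion). */
    public List<String> expand(String query) {
        List<String> expanded = new ArrayList<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            expanded.add(term);
            expanded.addAll(synonyms.getOrDefault(term, List.of()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Prints [cat, feline, kitty, photo, picture, image]
        System.out.println(new QueryExpander().expand("cat photo"));
    }
}
```

The expanded term list can then be submitted to the underlying search engine in place of the original query, increasing recall at the possible cost of some precision.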

2.4.4 New automatic evocation exercises for therapy treatment

There exist several naming exercises for the recovery of lost communication abilities. Among these we mention category naming, confrontation naming, automatic closure naming, automatic serial naming, recognition naming, repetition naming, and responsive naming. Some of them are already provided by the VITHEA system, namely visual confrontation, automatic closure naming, and responsive naming. Category naming is a task for assessing the ability to classify semantically related words and concepts into various categories, which may be perceptual, conceptual or semantic, and functional. Perceptual categories are defined on the basis of a relevant sensory quality of a stimulus, such as shape, size or colour. Conceptual or semantic categories are defined on the basis of a generalized idea of a class of objects. Functional categories are defined on the basis of an action or function associated with a class of objects [Campbell 05, Murray 01].

Automatic serial naming is a task for assessing the ability to produce rote or over-learned material. A patient may be asked to perform tasks such as counting from 1 to 20, naming the days of the week, writing out the letters of the alphabet, or reciting well-known prayers or nursery rhymes [Campbell 05, Murray 01].

Recognition naming is a task for assessing the ability to recognize words. It is used when patients are unable to name an item. The patient may be required to indicate the correct word from verbal or written choices. For example, for the target stimulus "elephant", the patient has to indicate the correct word from three verbal or written choices such as "giraffe", "elephant", "telephone".

Repetition naming is a task for assessing the ability to repeat or copy words, for patients who cannot verbally name or write.

Currently, with the exception of the VITHEA system, there does not appear to exist in the literature an automatic implementation, through speech recognition, of the above-mentioned exercises.

2.4.5 Exploiting syllable information in word naming recognition of aphasic speech

Syllables play an important role in speech recognition; in fact, the pronunciation of a given phoneme tends to vary depending on its location within a syllable. There is a considerable body of work in the literature on the generation of new syllable prototypes derived from several different acoustic-phonetic rules. These are often exploited to explore complementary acoustic models for speech processing. In [Hunt 04] the authors showed how a statistical approach to phonetics could complement and improve current speech recognition by taking syllable boundaries into account. In [Oliveira 05] three methods for dividing European Portuguese words into syllables are presented. Experimental results have shown a percentage of correctly recognized syllable boundaries above 99.5%, and comparable word accuracy. Also, in [Code 94] syllabification is examined with respect to non-lexical English and German aphasic speech automatisms (recurring utterances).
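As a toy illustration of what automatic syllabification involves, the following sketch splits a word into syllables using a naive onset-maximisation heuristic (a single consonant between two vowel nuclei starts a new syllable); it is an assumption-laden simplification and does not reproduce the rule sets evaluated in [Oliveira 05].

```java
// Toy syllabification heuristic for illustration only: each vowel group forms a
// nucleus, and a single consonant between two nuclei is attached to the
// following syllable (a crude form of onset maximisation).
import java.util.ArrayList;
import java.util.List;

public class ToySyllabifier {

    private static boolean isVowel(char c) {
        return "aeiouáéíóúâêôãõ".indexOf(Character.toLowerCase(c)) >= 0;
    }

    public static List<String> syllabify(String word) {
        List<String> syllables = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean nucleusSeen = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            // Start a new syllable when a consonant follows a nucleus and precedes a vowel.
            if (nucleusSeen && !isVowel(c) && i + 1 < word.length() && isVowel(word.charAt(i + 1))) {
                syllables.add(current.toString());
                current = new StringBuilder();
                nucleusSeen = false;
            }
            current.append(c);
            if (isVowel(c)) {
                nucleusSeen = true;
            }
        }
        if (current.length() > 0) {
            syllables.add(current.toString());
        }
        return syllables;
    }

    public static void main(String[] args) {
        System.out.println(syllabify("telefone"));   // prints [te, le, fo, ne]
    }
}
```

Real syllabifiers, such as those compared in [Oliveira 05], additionally handle consonant clusters, diphthongs and hiatus, which this toy version ignores.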


3

The client side of the VITHEA platform exploits Adobe Flash technology to record patients' answers. This module, unfortunately, limits the extension of the system to mobile devices, such as tablets and smartphones. This is due to a limitation in the API provided by Adobe: the microphone class, used to acquire speech input, is not supported by the Flash player running in a mobile browser.

Therefore, an ad-hoc application specifically suited for these devices has been designed and implemented. Even though this application theoretically clones the implementation logic of the web version, the underlying technology is different and raised several integration issues due to the heterogeneity of the standards used. New reusable components have been developed in order to provide the server-side services in a standardized way, accessible by heterogeneous client devices running either iOS or Android operating systems.

Although the services have been designed in a service-oriented architecture (SOA) fashion that allows easy deployment of client modules for different systems, in this work we have restricted ourselves to the development of a client application running only on Android systems. Android has been chosen as a case study because it is available as open source software, enabling developers to distribute applications to any Android device through the Android market.

In this Chapter, Section 3.1 introduces the main standards upon which the mobile version is based, while the architecture of the final solution is described in Section 3.2. The results of a user experience evaluation conducted with 16 users are described in Section 3.3, followed by the discussion in Section 3.4.

The literature review (Section 2.4.1) has already discussed the range of technologies that are available for the implementation of a SOA. The disadvantages of these standards have been analysed: the rigidity of RPC and the additional complexity of SOAP make REST the preferred candidate for the implementation of the new server-side services. In the following, the principles guiding the REST architectural style are first described, then the data representation format used for the exchange of information between client and server is explained. Finally, the Android platform, used to develop the client application, is briefly introduced.

3.1.1 Representational State Transfer

A software architecture is an abstraction of the runtime elements of a software system during some phase of its operation [Fielding 00]. Therefore, an architecture determines how system elements are
