
Psychiatric document retrieval using a discourse-aware model

Liang-Chih Yu a, Chung-Hsien Wu b,∗, Fong-Lin Jang c

a Department of Information Management, Yuan-Ze University, Chung-Li, Taiwan, ROC
b Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC
c Department of Psychiatry, Chi-Mei Medical Center, Tainan, Taiwan, ROC

∗ Corresponding author. E-mail address: chwu@csie.ncku.edu.tw (C.-H. Wu).

Article info

Article history: Received 19 June 2008; Received in revised form 22 December 2008; Accepted 23 December 2008; Available online 31 December 2008.

Keywords: Natural language processing; Information retrieval; Discourse structure; Discourse-aware model; Sequence kernel function; Discounted cumulative gain.

doi:10.1016/j.artint.2008.12.004

Abstract

With the increased incidence of depression-related disorders, many psychiatric websites have been developed to provide huge amounts of educational documents along with rich self-help information. Psychiatric document retrieval aims to assist individuals to locate documents relevant to their depressive problems efficiently and effectively. By referring to relevant documents, individuals can understand how to alleviate their depression-related symptoms according to recommendations from health professionals. This work proposes the use of high-level discourse information extracted from queries and documents to improve the precision of retrieval results. The discourse information adopted herein includes negative life events, depressive symptoms and semantic relations between symptoms, which are beneficial for better understanding of users’ queries. Experimental results show that the discourse-aware retrieval model achieves higher precision than the word-based retrieval models, namely the vector space model (VSM) and Okapi model, adopting word-level information alone.

©2008 Elsevier B.V. All rights reserved.

1. Introduction

Individuals in their daily life may suffer from negative or stressful life events, such as death of a family member, argument with a spouse and loss of a job, along with depressive symptoms, such as suicidal tendencies and anxiety. Individuals under these circumstances often search for help from psychiatric websites by describing their mental health problems using message boards and other services. Health professionals respond with suggestions as soon as possible. However, the response time is generally several days, depending on both the processing time required by health professionals and the number of problems to be processed. Such a long response time is unacceptable, especially for patients suffering from psychiatric emergencies such as suicide attempts. A potential solution considers the problems that have been processed and the corresponding suggestions, called consultation documents (problem-response pairs), as the psychiatry web resource, examples of which include online discussion boards such as WebMD (http://www.webmd.com), SA-UK (http://www.social-anxiety-community.org/db), and the John Tung Foundation (http://www.jtf.org.tw), and email databases (http://www.psychpark.org) [4,21].

Psychiatric web resources typically contain thousands of documents along with rich self-help information, making them a useful information resource for psychological treatments such as bibliotherapy, a kind of cognitive behavior therapy that helps people deal with their mental health problems through reading self-help literature [5,27] or web resources [2,45]. The process of bibliotherapy generally involves three stages: identification, catharsis, and insight; that is, readers first identify a character, situation or problem in the literature (identification); then release pent-up emotions (catharsis) by recognizing that others have similar problems to their own; and finally realize that their problems can also be addressed by learning from others' experiences (insight). Similar to the notion of bibliotherapy, individuals can read consultation documents relevant to their depressive problems and thereby recognize that they are not alone, because many people have suffered from the same or similar problems. They can then learn how to alleviate their symptoms according to the recommendations. However, browsing and searching all consultation documents to identify the relevant ones is time consuming and tends to become overwhelming. Individuals need to be able to retrieve the relevant consultation documents efficiently and effectively. Therefore, this work presents a novel mechanism to automatically retrieve the relevant consultation documents with respect to users' problems.

Traditional information retrieval systems represent queries and documents using a bag-of-words approach. Retrieval models, such as the vector space model (VSM) [3,38,42] and the Okapi model [16,30,32,33], are then adopted to estimate the relevance between queries and documents. The VSM represents each query and document as a vector of words and adopts the cosine measure to estimate their relevance. The Okapi model, which has been applied to the Text REtrieval Conference (TREC) collections, provides a family of word-weighting functions for relevance estimation. These functions consider word frequencies and document lengths for word weighting. Both the VSM and the Okapi model estimate relevance by matching the words in a query with the words in a document. Additionally, query words can be further expanded using the concept hierarchy within general-purpose ontologies such as WordNet [14] and EuroWordNet [34], or automatically constructed ontologies [29,40,52].
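As an illustration of the word-based baselines discussed here, the following is a minimal sketch of a TF-IDF vector space model with cosine matching. It is not the authors' implementation; the tokenizer, weighting details and example documents are simplifying assumptions.

```python
import math
from collections import Counter

# Minimal TF-IDF vector space model with cosine matching (illustrative sketch only).

def tokenize(text):
    return text.lower().split()

def build_idf(documents):
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # idf = log(N / n_t); terms occurring in every document get weight zero
    return {term: math.log(n_docs / freq) for term, freq in doc_freq.items()}

def tfidf_vector(text, idf):
    tf = Counter(tokenize(text))
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}

def cosine(vec_a, vec_b):
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

documents = [
    "I feel depressed and cannot sleep at night",
    "I broke up with my boyfriend and feel hopeless",
]
idf = build_idf(documents)
query_vec = tfidf_vector("feel depressed at night", idf)
ranked = sorted(documents, key=lambda d: cosine(query_vec, tfidf_vector(d, idf)), reverse=True)
```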

However, such word-based approaches only consider the word-level information in queries and documents, ignoring high-level discourse information such as Main Event, Expectation, and Consequence in newspaper texts [23], Title, Thesis, and Main Idea in student essays [7], and Purpose, Methods, Results, and Conclusion in biomedical and scientific abstracts [18,36]. Discourse information reflects the structure of texts, which can provide more precise information for many applications such as information retrieval [17,24,25], opinion analysis [39,41], and text summarization [44]. Liddy indicates that the discourse-level structure of documents, which can be identified by the Text Structurer module in DR-LINK [25], allows the retrieval process to assign a higher weight to those discourse units that satisfy users' information needs [23,24]. Hearst developed the TextTiling algorithm, which segments texts into subtopic passages so that the retrieval process can compare a query against a single passage, enabling passage-based ranking and retrieval [17]. In opinion analysis, discourse features can be combined with lexical features for the detection of opinion types, i.e., sentiment and arguing [39]. Additionally, discourse contexts can be utilized for opinion topic resolution when multiple potential topics are mentioned within a single target span of an opinion [41]. In the summarization of scientific articles, the discourse structure is useful for sentence and content selection, since it provides information about which sentences describe the authors' main contribution or previous research [44]. In the biomedical domain, there is also a trend toward the analysis of discourse structure [11,22,35] to improve the performance of biomedical information retrieval [43] and information extraction [28].

For psychiatric documents, the discourse information describes what kinds of depressive symptoms and negative life events people are experiencing and the semantic relations between symptoms, which can help improve understanding of users' queries. Consider the example consultation document in Fig. 1. A consultation document comprises two parts: the query part and the recommendation part. The query part is a natural language text containing rich discourse information related to users' depressive problems. The discourse information herein includes negative life events, depressive symptoms, and semantic relations between symptoms. As indicated in Fig. 1, the subject suffered from a love-related event and several depressive symptoms. Moreover, there is a cause-effect relation holding between <Depressed> and <Suicide>, and a temporal relation holding between <Depressed> and <Insomnia>. Different discourse information may lead to different suggestions from experts. Therefore, an ideal retrieval system for consultation documents should consider such discourse information so as to improve the retrieval precision.

Natural language processing (NLP) techniques can be used to extract more precise information from natural language texts [49–51,53]. This work adopts the methodology presented in [51] to extract depressive symptoms and their relations, and adopts the pattern-based method presented in [53] to extract negative life events from both queries and consultation documents. This work also proposes a discourse-aware model to incorporate discourse knowledge into the retrieval system. The discourse-aware model calculates the similarity between a query and a document by combining the similarities of the extracted discourse information.

The main contributions of this paper are as follows. First, we present a retrieval model that considers the discourse structure of texts for psychiatric document retrieval. The discourse information can improve understanding of users' depressive problems through an in-depth semantic analysis of both queries and documents. Second, we provide a detailed analysis of the effect of each discourse unit on retrieval precision and efficiency.

The rest of this work is organized as follows. Section 2 briefly describes the extraction of discourse information. Section 3 presents the discourse-aware retrieval model. Section 4 summarizes the experimental results. Conclusions are finally drawn in Section 5.

2. Framework of psychiatric document retrieval

Fig. 2 shows the framework of psychiatric document retrieval. The retrieval process begins with receiving a user's query describing his or her depressive problems in natural language. An example query is shown in Fig. 1. The three discourse units, i.e., events, symptoms and relations, are then extracted from the query, as shown in the center of Fig. 2. The extracted discourse units are represented by the sets of negative life events, depressive symptoms, and semantic relations. Each element in the event set and symptom set denotes an individual event and symptom, respectively, while each element in the relation set denotes a symptom chain that retains the order of symptoms. Similarly, the query parts of consultation documents are represented in the same manner. The discourse-aware model then calculates the similarity between the input query and the query part of each consultation document by combining the similarities of the sets of events, symptoms, and relations within them. Finally, a list of consultation documents ranked in descending order of similarity is returned to the user. In the following, the extraction of discourse information is described briefly. The detailed process is described in [51] for symptom and relation identification, and in [53] for event identification.

1) Symptom identification: A total of 17 symptoms are defined based on the Hamilton Depression Rating Scale (HDRS), the most prominent rating scale for assessing symptoms of depression [15]. The HDRS includes both psychological and somatic symptoms, as described in Table 1. Each item in the HDRS corresponds to a category of symptoms and also defines a set of criteria for recognizing them.

The identification of symptoms is sentence-based, which means that the symptom of a sentence is identified by assigning a label (item name) to each sentence. A label <Others> is also added to represent sentences with no defined symptoms. Fig. 3 presents the process of symptom identification based on an example sentence “I often worry about some minor matters.” The main steps are described as follows.

Fig. 2. Framework of psychiatric document retrieval. The rectangle denotes a love-related negative life event. Each circle denotes a symptom. D: Depressed, S: Suicide, I: Insomnia, A: Anxiety.


Table 1
17-item Hamilton Depression Rating Scale (HDRS). Each item is listed with examples of its criteria.

1. Depressed mood: Feelings of sadness, hopelessness, helplessness, etc.
2. Feelings of guilt: Self-reproach; feels he has let people down.
3. Suicide: Feels life is not worth living.
4. Insomnia-early: Complains of difficulty falling asleep.
5. Insomnia-middle: Complains of being restless and disturbed during the night.
6. Insomnia-late: Waking in the early hours of the morning.
7. Work and activities: Loss of interest in activities, hobbies or work.
8. Retardation: Decreased motor activity.
9. Agitation: Such as fidgetiness, playing with hands, biting of lips, etc.
10. Anxiety: Worrying about minor matters.
11. Anxiety-somatic: Such as flushing, sweating, tremor, etc.
12. Somatic symptoms (gastrointestinal): Loss of appetite.
13. Somatic symptoms: Such as backaches, headaches, muscle aches, etc.
14. Genital symptoms: Such as loss of libido and menstrual disturbances.
15. Hypochondriasis: Preoccupation with health.
16. Loss of weight: Loss of weight.
17. Insight: Acknowledges or denies being depressed.

Fig. 3. Main steps in identifying depressive symptoms.

a) Word segmentation and Part-Of-Speech (POS) tagging: An input sentence is segmented into a word sequence with part-of-speech (POS) tagging for further grammatical structure analysis.

b) Semantic dependency graph (SDG) construction: The sentence with POS tags is then passed to a probabilistic context-free grammar (PCFG) generated from the Sinica Treebank corpus developed by Academia Sinica, Taiwan [10]. The PCFG analyzes the sentence structure and determines the semantic dependencies between word tokens, as shown in the lower left part of Fig. 3. These semantic dependencies together form an SDG, denoted by $T$, which represents the grammatical structure of the sentence, as shown in the lower right part of Fig. 3. Each semantic dependency $t$ in an SDG has the format (modifier, head, rel), where modifier and head denote the word tokens that function as the modifier and head, respectively, and rel denotes the dependency relation from modifier to head. For instance, the semantic dependency (I, worry about, experiencer) means that "I" is the experiencer of the sentence head "worry about", and (matters, worry about, goal) means that "matters" is the goal of "worry about". Accordingly, the inference module can understand who worries and what is worried about from the semantic dependencies.


c) Semantic label inference: Once the SDG of a sentence is constructed, the semantic label of the sentence is inferred based on the corresponding SDG. Additionally, the previous label, i.e., the label of the sentence preceding the current sentence, may also contribute to the inference process. Therefore, the most probable label, $\hat{l}_k$, of a sentence can be inferred from its previous label, $l_{k-1}$, and its corresponding SDG, $T$, as described in (1).

$$\hat{l}_k = \arg\max_{l_k} P(l_k \mid l_{k-1}, T). \tag{1}$$

Bayes' rule gives

$$\begin{aligned}
\hat{l}_k &= \arg\max_{l_k} P(T, l_{k-1} \mid l_k)\, P(l_k) \\
          &= \arg\max_{l_k} \prod_{t \in T} P(t, l_{k-1} \mid l_k)\, P(l_k) \\
          &= \arg\max_{l_k} \prod_{t \in T} P(t \mid l_k)\, P(l_{k-1} \mid l_k)\, P(l_k) \\
          &= \arg\max_{l_k} \prod_{t \in T} P(t \mid l_k)\, P(l_{k-1}, l_k),
\end{aligned} \tag{2}$$

where $P(t \mid l_k)$ denotes the probability that a semantic dependency occurs in a label, and $P(l_{k-1}, l_k)$ denotes the probability that $l_{k-1}$ and $l_k$ co-occur. These two probabilities are estimated as

$$P(t \mid l_k) = \frac{N(t, l_k)}{N(l_k)}, \qquad P(l_{k-1}, l_k) = \frac{N(l_{k-1}, l_k)}{N(l)}, \tag{3}$$
where $N(t, l_k)$ denotes the number of training sentences with the label $l_k$ that contain $t$; $N(l_k)$ denotes the number of training sentences with the label $l_k$; $N(l_{k-1}, l_k)$ denotes the number of co-occurrences of $l_{k-1}$ and $l_k$; and $N(l)$ denotes the total number of training sentences with any label except <Others>. The training sentences were obtained by segmenting the training part of the psychiatric documents. Healthcare professionals then annotated each training sentence with a semantic label so that this statistical information could be collected.
The inference process adopts the principle that the probabilities of dependencies may differ across labels. For instance, in our training data, the probability that the dependency (matters, worry about, goal) occurs in <Anxiety>, i.e., $P((\text{matters, worry about, goal}) \mid \text{<Anxiety>})$, is much higher than the corresponding probability for any other label, revealing that this dependency is a good indicator for identifying <Anxiety>. Thus, the label of a new sentence containing this dependency is more likely to be <Anxiety> than another label. If the differences among the $P(t \mid l_k)$ values are very small, then the probability $P(l_{k-1}, l_k)$ may contribute to the inference process.
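To make Eqs. (1)-(3) concrete, the following is a minimal sketch of the count-based label inference, assuming hypothetical count tables built from an annotated training set. The class name, the data layout and the small smoothing constant are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import defaultdict

class SymptomLabeler:
    """Sketch of the semantic label inference in Eqs. (1)-(3).

    Dependencies are (modifier, head, relation) tuples; labels are HDRS item
    names such as "<Anxiety>". The smoothing constant (an assumption not in
    the paper) avoids zero probabilities for unseen counts.
    """

    def __init__(self, labels, smoothing=1e-6):
        self.labels = list(labels)
        self.smoothing = smoothing
        self.n_dep_label = defaultdict(int)    # N(t, l_k)
        self.n_label = defaultdict(int)        # N(l_k)
        self.n_label_pair = defaultdict(int)   # N(l_{k-1}, l_k)
        self.n_total = 0                       # N(l)

    def train(self, sentences):
        """sentences: iterable of (dependencies, label, previous_label) triples."""
        for deps, label, prev_label in sentences:
            self.n_label[label] += 1
            self.n_total += 1
            self.n_label_pair[(prev_label, label)] += 1
            for t in deps:
                self.n_dep_label[(t, label)] += 1

    def p_dep(self, t, label):
        # P(t | l_k) = N(t, l_k) / N(l_k), Eq. (3)
        return (self.n_dep_label[(t, label)] + self.smoothing) / (self.n_label[label] + self.smoothing)

    def p_pair(self, prev_label, label):
        # P(l_{k-1}, l_k) = N(l_{k-1}, l_k) / N(l), Eq. (3)
        return (self.n_label_pair[(prev_label, label)] + self.smoothing) / (self.n_total + self.smoothing)

    def predict(self, deps, prev_label):
        # Eq. (2), evaluated in log space for numerical stability
        def score(label):
            log_p = sum(math.log(self.p_dep(t, label)) for t in deps)
            return log_p + math.log(self.p_pair(prev_label, label))
        return max(self.labels, key=score)
```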

2) Relation identification: After the symptoms are obtained, the cause-effect and temporal relations holding between symptoms (sentences) are identified using a set of discourse markers. For instance, the discourse markers "because" and "therefore" may signal cause-effect relations, and "before" and "after" may signal temporal relations. Table 2 lists part of the discourse markers.

a) Identification of cause-effect relations: Cause-effect relations are identified from the occurrence of the discourse markers related to cause-effect relations. If such a discourse marker occurs in a symptom sentence, it is directly translated into a cause-effect relation. For instance, (1b) contains the discourse marker "so", which is related to cause-effect relations. Therefore, the discourse marker "so" can be translated into a cause-effect relation holding between the symptoms <Depressed> and <Suicide>.

(1a) During past years, I always felt very depressed. <Depressed>

(1b) So, I have tried to kill myself several times. <Suicide>

b) Identification of temporal relations: Temporal relations are identified by dividing a document (or query) into different time intervals using the temporal discourse markers. For instance, the example query shown in Fig. 1 can be represented as in Fig. 4. The temporal discourse markers "After" and "In recent months" divide the example query into three time intervals. The symptoms that occur in different time intervals thus specify the temporal relations. For instance, Fig. 4 contains the temporal relation D → I → A, indicating that the temporal order of the symptoms is <Depressed>, <Insomnia> and then <Anxiety>. A sketch of this marker-based procedure is given below, after Table 2.

Table 2
Semantic relations and their corresponding discourse markers.

Cause-effect: because, reason, therefore, hence
Temporal: before, past, recently, now, after, later
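The following is a minimal sketch of how the discourse markers in Table 2 could drive relation identification, assuming symptoms have already been assigned to sentences. The function names, the marker lists and the simple interval splitting are illustrative assumptions, not the authors' implementation.

```python
# Sketch of marker-based relation identification (illustrative assumptions only).
CAUSE_EFFECT_MARKERS = {"because", "reason", "therefore", "hence", "so"}
TEMPORAL_MARKERS = {"before", "past", "recently", "now", "after", "later"}

def identify_cause_effect(sentences):
    """sentences: list of (text, symptom_label); returns (cause, effect) symptom pairs."""
    relations = []
    for i in range(1, len(sentences)):
        text, label = sentences[i]
        prev_label = sentences[i - 1][1]
        words = set(text.lower().split())
        # A cause-effect marker in the current sentence links the previous
        # symptom (cause) to the current symptom (effect).
        if words & CAUSE_EFFECT_MARKERS and label != "<Others>" and prev_label != "<Others>":
            relations.append((prev_label, label))
    return relations

def identify_temporal(sentences):
    """Split the text into time intervals at temporal markers and chain the
    symptoms that occur in successive intervals."""
    intervals, current = [], []
    for text, label in sentences:
        words = set(text.lower().split())
        if words & TEMPORAL_MARKERS and current:
            intervals.append(current)
            current = []
        if label != "<Others>":
            current.append(label)
    if current:
        intervals.append(current)
    # The temporal chain keeps one representative symptom per interval, in order.
    return [interval[0] for interval in intervals if interval]
```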


Fig. 4. Illustrative example of temporal relations.

Table 3
Classification of negative life events.

Family: Serious illness or injury of a family member; son or daughter leaving home; lost home through a disaster.
Love: Marital divorce or separation; spouse/mate engaged in infidelity; broke up with a boyfriend or girlfriend.
School: Examination failed or grade dropped; unable to enter/stay in school; accommodation problems.
Work: Laid off or fired from a job; demotion and salary reduction; changed jobs for a worse one.
Social: Substantial conflicts with a friend or neighbor; difficulties in social activities.

3) Negative life event identification: The negative life events are defined and classified based on recent investigations of negative life events [6,31]. Table 3 shows the classification and general description of negative life events.

Negative life events are embedded in sentences, and can be characterized by patterns, i.e., semantically plausible combinations of words. For instance, (2a) contains a love-related negative life event, which is characterized by the pattern <break up, boyfriend>.

(2a) I broke up with my dear but cruel boyfriend recently. <Love>

Therefore, such patterns are significant features in identifying negative life events. The main steps of event identification are described as follows.

a) Pattern induction: Given a set of seed patterns, more relevant patterns are induced from psychiatry web corpora using an evolutionary inference algorithm [53]. The algorithm follows evolutionary computation to perform the following steps iteratively. For each seed pattern, the algorithm first creates an initial population of patterns of length k (2, 3, or 4). Each pattern of length k is created by selecting k distinct words from the vocabulary. After initialization, the similarity between each pattern in the population and the seed pattern is calculated by the fitness function. A number of patterns are then selected as parents according to their similarity scores. The variation operators, i.e., crossover and mutation, are then applied to produce the offspring. The offspring are also evaluated by the fitness function, and superior offspring replace inferior parents to form a new population. Finally, the relevant patterns generated in each iteration are collected until the termination criteria are satisfied. Table 4 shows example seed patterns. For the seed pattern <husband, argue>, for example, several relevant patterns such as <husband, fight>, <husband, yell>, <wife, argue>, <husband, fight, money> and <wife, argue, money> can be induced using the evolutionary inference algorithm.

b) SVM classification: The patterns induced by the evolutionary inference algorithm are then manually grouped into five categories to train an SVM classifier [9,46], where each pattern corresponds to one dimension of a feature vector. In the testing phase, each input sentence is transformed into its corresponding feature vector. The SVM then takes the feature vector as input and outputs one of the five categories of negative life events.
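A minimal sketch of this pattern-to-feature-vector step and the SVM classification is shown below, using scikit-learn's LinearSVC as a stand-in for the LIBSVM classifier cited as [9]. The pattern list, the matching rule and the toy training sentences are illustrative assumptions.

```python
from sklearn.svm import LinearSVC   # stand-in for the LIBSVM classifier cited as [9]

# Hypothetical induced patterns; each pattern is one dimension of the feature vector.
PATTERNS = [
    ("son", "injure"), ("husband", "argue"), ("husband", "fight"),
    ("marriage", "break"), ("spouse", "divorce"), ("break up", "boyfriend"),
    ("teacher", "blame"), ("exam", "fail"),
    ("salary", "cut"), ("work", "stop"),
    ("friend", "die"),
]

def to_feature_vector(sentence):
    """1 if every word of the pattern occurs in the sentence, else 0 (assumed rule)."""
    text = sentence.lower()
    return [1.0 if all(word in text for word in pattern) else 0.0 for pattern in PATTERNS]

# Toy annotated sentences (invented for illustration only).
train_sentences = [
    ("My husband and I argue every night", "Family"),
    ("I broke up with my boyfriend recently", "Love"),
    ("I failed the exam again this semester", "School"),
    ("My salary was cut and I may stop work", "Work"),
    ("My best friend died last month", "Social"),
]
X = [to_feature_vector(s) for s, _ in train_sentences]
y = [label for _, label in train_sentences]

classifier = LinearSVC().fit(X, y)
print(classifier.predict([to_feature_vector("My boyfriend wants to break up with me")]))
```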


Table 4
Example seed patterns.

Family: <son, injure>; <husband, argue>
Love: <marriage, break>; <spouse, divorce>
School: <teacher, blame>; <exam, fail>
Work: <salary, cut>; <work, stop>
Social: <friend, die>

3. Discourse-aware retrieval model

The discourse-aware retrieval model calculates the similarity between a query and a document, $Sim(q, d)$, by combining the similarities of the sets of events, symptoms and relations within them, as shown in (4).

$$Sim(q, d) = \alpha\, Sim_{Evn}(q, d) + \beta\, Sim_{Sym}(q, d) + (1 - \alpha - \beta)\, Sim_{Rel}(q, d), \tag{4}$$

where $Sim_{Evn}(q, d)$, $Sim_{Sym}(q, d)$ and $Sim_{Rel}(q, d)$ denote the similarities of the sets of events, symptoms and relations, respectively, between a query and a document, and $\alpha$ and $\beta$ denote the combination factors.
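Assuming the three component similarities defined in Sections 3.1 and 3.2 are available as functions, Eq. (4) amounts to the following small combination step; this is only a sketch, and the function and parameter names are placeholders.

```python
def discourse_similarity(query, document, sim_evn, sim_sym, sim_rel, alpha=0.3, beta=0.5):
    """Weighted combination of event, symptom and relation similarities, Eq. (4).

    sim_evn, sim_sym and sim_rel are the component similarity functions; the
    default weights are the optimal settings reported in Section 4.1.
    """
    return (alpha * sim_evn(query, document)
            + beta * sim_sym(query, document)
            + (1.0 - alpha - beta) * sim_rel(query, document))
```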

3.1. Similarity of events and symptoms

The similarities of the sets of events and symptoms are calculated in the same way. The similarity of the event set (or symptom set) is calculated by comparing the events (or symptoms) in a query with those in a document. Additionally, only events (or symptoms) of the same type are considered. Events (or symptoms) of different types are considered irrelevant, i.e., they have no similarity. For instance, the event <Love> is considered irrelevant to <Work>. The similarity of the event set is calculated by

$$Sim_{Evn}(q, d) = \frac{1}{N(Evn_q \cup Evn_d)} \sum_{e_q, e_d} Type(e_q, e_d) \left[ \cos(e_q, e_d) + const \right], \tag{5}$$

where $Evn_q$ and $Evn_d$ denote the event sets in a query and a document, respectively; $e_q$ and $e_d$ denote the events; $N(Evn_q \cup Evn_d)$ denotes the cardinality of the union of $Evn_q$ and $Evn_d$, used as a normalization factor; and $Type(e_q, e_d)$ denotes an identity function that checks whether two events have the same type, defined as

$$Type(e_q, e_d) = \begin{cases} 1 & Type(e_q) = Type(e_d), \\ 0 & \text{otherwise}. \end{cases} \tag{6}$$
The term $\cos(e_q, e_d)$ denotes the cosine of the angle between the two vectors of words representing $e_q$ and $e_d$, as shown below.

$$\cos(e_q, e_d) = \frac{\sum_{i=1}^{V} w_i^{e_q} w_i^{e_d}}{\sqrt{\sum_{i=1}^{V} (w_i^{e_q})^2}\, \sqrt{\sum_{i=1}^{V} (w_i^{e_d})^2}}, \tag{7}$$

where $w_i^{e_q}$ denotes the weight of word $i$ in the vector representing $e_q$, and $V$ denotes the dimensionality of the vectors. Accordingly, when two events have the same type, their similarity is given as $\cos(e_q, e_d)$ plus a constant, $const$. Additionally, $\cos(e_q, e_d)$ and $const$ can be considered as the word-level and discourse-level similarities, respectively. The optimal setting of $const$ is determined empirically.
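A minimal sketch of the event (or symptom) set similarity in Eqs. (5)-(7) is given below; the event representation (a type plus a dictionary of word weights), the function names and the union normalization are assumptions made for illustration.

```python
import math

def cosine(weights_q, weights_d):
    """Eq. (7): cosine between two word-weight dictionaries."""
    shared = set(weights_q) & set(weights_d)
    dot = sum(weights_q[w] * weights_d[w] for w in shared)
    norm_q = math.sqrt(sum(v * v for v in weights_q.values()))
    norm_d = math.sqrt(sum(v * v for v in weights_d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def sim_evn(events_q, events_d, const=0.6):
    """Eqs. (5)-(6): events are (type, word_weights) pairs; const is the
    discourse-level constant (0.6 is the optimal value reported in Section 4.1)."""
    if not events_q and not events_d:
        return 0.0
    total = 0.0
    for type_q, words_q in events_q:
        for type_d, words_d in events_d:
            if type_q == type_d:                      # Type(e_q, e_d) = 1
                total += cosine(words_q, words_d) + const
    # N(Evn_q ∪ Evn_d) is approximated here by the total number of event
    # instances on both sides, since instances from different texts are distinct.
    return total / (len(events_q) + len(events_d))
```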

3.2. Similarity of relations

When calculating the similarity of relations, only relations of the same type are considered. That is, the cause-effect (or temporal) relations in a query are only compared with the cause-effect (or temporal) relations in a document. Therefore, the similarity of the relation sets can be calculated as

$$Sim_{Rel}(q, d) = \frac{1}{Z} \sum_{r_q, r_d} Type(r_q, r_d)\, Sim(r_q, r_d), \tag{8}$$

$$Z = N_C(r_q)\, N_C(r_d) + N_T(r_q)\, N_T(r_d), \tag{9}$$

where $r_q$ and $r_d$ denote the relations in a query and a document, respectively; $Z$ denotes the normalization factor for the number of relations; $Type(r_q, r_d)$ denotes an identity function similar to (6); and $N_C(\cdot)$ and $N_T(\cdot)$ denote the numbers of cause-effect and temporal relations, respectively.
Both cause-effect and temporal relations are represented by symptom chains. Hence, the similarity of relations is measured by the similarity of symptom chains. The main characteristic of a symptom chain is that it retains the cause-effect or temporal order of the symptoms within it. Therefore, the order of the symptoms must be considered when calculating the similarity of two symptom chains. Accordingly, a sequence kernel function [8,26] is adopted to calculate the similarity of two symptom chains. A sequence kernel compares two sequences of symbols (e.g., characters, words) based on the subsequences within them rather than on individual symbols. Thereby, the order of the symptoms can be incorporated into the comparison process.

The sequence kernel calculates the similarity of two symptom chains by comparing their sub-symptom chains at different lengths. An increasing number of common sub-symptom chains indicates a greater similarity between two symptom chains. For instance, the two symptom chains $s_1 s_2 s_3 s_4$ and $s_3 s_2 s_1$ contain the same symptoms $s_1$, $s_2$ and $s_3$, but in different orders. To calculate the similarity between these two symptom chains, the sequence kernel first calculates their similarities at lengths 2 and 3, and then averages the similarities at the two lengths. To calculate the similarity at length 2, the sequence kernel compares their sub-symptom chains of length 2, i.e., $\{s_1 s_2, s_1 s_3, s_1 s_4, s_2 s_3, s_2 s_4, s_3 s_4\}$ and $\{s_3 s_2, s_3 s_1, s_2 s_1\}$. Similarly, their similarity at length 3 is calculated by comparing their sub-symptom chains of length 3, i.e., $\{s_1 s_2 s_3, s_1 s_2 s_4, s_1 s_3 s_4, s_2 s_3 s_4\}$ and $\{s_3 s_2 s_1\}$. Obviously, no similarity exists between $s_1 s_2 s_3 s_4$ and $s_3 s_2 s_1$, since no sub-symptom chains are matched at either length. In this example, the sub-symptom chains of length 1, i.e., individual symptoms, do not have to be compared because they contain no information about the order of symptoms. Additionally, sub-symptom chains of length 4 do not have to be compared, because the shorter symptom chain (of length 3) has no sub-symptom chains of that length. Hence, for any two symptom chains, the length of the sub-symptom chains to be compared ranges from two to the minimum length of the two symptom chains. The similarity of two symptom chains can be formally denoted as

$$Sim(r_q, r_d) \equiv Sim\bigl(sc_q^{N_1}, sc_d^{N_2}\bigr) = K\bigl(sc_q^{N_1}, sc_d^{N_2}\bigr) = \frac{1}{N-1} \sum_{n=2}^{N} K_n\bigl(sc_q^{N_1}, sc_d^{N_2}\bigr), \tag{10}$$

where $sc_q^{N_1}$ and $sc_d^{N_2}$ denote the symptom chains corresponding to $r_q$ and $r_d$, respectively; $N_1$ and $N_2$ denote the lengths of $sc_q^{N_1}$ and $sc_d^{N_2}$, respectively; $K(\cdot,\cdot)$ denotes the sequence kernel for calculating the similarity between two symptom chains; $K_n(\cdot,\cdot)$ denotes the sequence kernel for calculating the similarity between two symptom chains at length $n$; and $N$ is the minimum length of the two symptom chains, i.e., $N = \min(N_1, N_2)$. The sequence kernel $K_n(sc_i^{N_1}, sc_j^{N_2})$ is defined as

$$K_n\bigl(sc_i^{N_1}, sc_j^{N_2}\bigr) = \frac{\Phi_n(sc_i^{N_1}) \cdot \Phi_n(sc_j^{N_2})}{\bigl\|\Phi_n(sc_i^{N_1})\bigr\|\, \bigl\|\Phi_n(sc_j^{N_2})\bigr\|} = \frac{\sum_{u \in SC_n} \phi_u(sc_i^{N_1})\, \phi_u(sc_j^{N_2})}{\sqrt{\sum_{u \in SC_n} \phi_u(sc_i^{N_1})^2}\, \sqrt{\sum_{u \in SC_n} \phi_u(sc_j^{N_2})^2}}, \tag{11}$$

where $K_n(sc_i^{N_1}, sc_j^{N_2})$ is the normalized inner product of the vectors $\Phi_n(sc_i^{N_1})$ and $\Phi_n(sc_j^{N_2})$; $\Phi_n(\cdot)$ denotes a mapping that transforms a given symptom chain into a vector over the sub-symptom chains of length $n$; $\phi_u(\cdot)$ denotes an element of the vector, representing the weight of a sub-symptom chain $u$; and $SC_n$ denotes the set of all possible sub-symptom chains of length $n$. For instance, given the symptom chain $s_1 s_2 s_3$, $\Phi_2(s_1 s_2 s_3)$ can be represented as

$$\Phi_2(s_1 s_2 s_3) = \bigl(\phi_{s_1 s_2}(s_1 s_2 s_3),\ \phi_{s_1 s_3}(s_1 s_2 s_3),\ \ldots,\ \phi_{s_2 s_3}(s_1 s_2 s_3),\ \ldots,\ \phi_{s_{16} s_{17}}(s_1 s_2 s_3)\bigr), \tag{12}$$

where the dimensionality of the vector is the number of all possible sub-symptom chains of length 2. The weight of a sub-symptom chain, i.e., $\phi_u(\cdot)$, is defined as

$$\phi_u\bigl(sc_i^{N_1}\bigr) = \begin{cases} 1 & u \text{ is a contiguous sub-symptom chain of } sc_i^{N_1}, \\ \lambda^{\theta} & u \text{ is a non-contiguous sub-symptom chain with } \theta \text{ skipped symptoms}, \\ 0 & u \text{ does not appear in } sc_i^{N_1}, \end{cases} \tag{13}$$

where $\lambda \in [0, 1]$ denotes a decay factor that is adopted to penalize non-contiguous sub-symptom chains occurring in a symptom chain, based on the number of skipped symptoms. For instance, in (12), $\phi_{s_1 s_2}(s_1 s_2 s_3) = \phi_{s_2 s_3}(s_1 s_2 s_3) = 1$ since $s_1 s_2$ and $s_2 s_3$ are contiguous in $s_1 s_2 s_3$, and $\phi_{s_1 s_3}(s_1 s_2 s_3) = \lambda^{1}$ since $s_1 s_3$ is a non-contiguous sub-symptom chain with one skipped symptom. The decay factor is adopted because a contiguous sub-symptom chain is preferable to a non-contiguous one when comparing two symptom chains. The setting of the decay factor is domain dependent. If $\lambda = 1$, then no penalty is applied for skipping symptoms, and the cause-effect and temporal relations are treated as transitive. The optimal setting of $\lambda$ is determined empirically. Fig. 5 presents an example that summarizes the computation of the similarity between two symptom chains.
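A minimal sketch of the sequence kernel in Eqs. (10)-(13) is shown below; the chain representation (a tuple of symptom labels) and the helper names are assumptions made for illustration, with the decay factor defaulting to the optimal value reported in Section 4.1.

```python
import math
from itertools import combinations
from collections import defaultdict

def phi(chain, n, decay=0.8):
    """Eq. (13): weights of all sub-symptom chains of length n within `chain`.

    A contiguous subsequence gets weight 1; a non-contiguous one gets
    decay**(number of skipped symptoms); absent subsequences are simply
    omitted (weight 0). Chains are tuples/lists of symptom labels.
    """
    weights = defaultdict(float)
    for positions in combinations(range(len(chain)), n):
        sub = tuple(chain[i] for i in positions)
        skipped = (positions[-1] - positions[0] + 1) - n
        weights[sub] = max(weights[sub], decay ** skipped)
    return weights

def k_n(chain_a, chain_b, n, decay=0.8):
    """Eq. (11): normalized inner product of the length-n feature vectors."""
    phi_a, phi_b = phi(chain_a, n, decay), phi(chain_b, n, decay)
    dot = sum(phi_a[u] * phi_b[u] for u in set(phi_a) & set(phi_b))
    norm_a = math.sqrt(sum(v * v for v in phi_a.values()))
    norm_b = math.sqrt(sum(v * v for v in phi_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def sequence_kernel(chain_a, chain_b, decay=0.8):
    """Eq. (10): average of K_n over n = 2 .. min(len(a), len(b))."""
    n_min = min(len(chain_a), len(chain_b))
    if n_min < 2:
        return 0.0
    return sum(k_n(chain_a, chain_b, n, decay) for n in range(2, n_min + 1)) / (n_min - 1)

# The example from the text: the two chains share no sub-chains, so similarity is 0.
print(sequence_kernel(("D", "S", "I", "A"), ("I", "S", "D")))   # 0.0
```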

4. Experimental results

4.1. Experiment setup

1) Corpus: The consultation documents were collected from the mental health website of the John Tung Foundation (http://www.jtf.org.tw) and from PsychPark (http://www.psychpark.org), a virtual psychiatric clinic maintained by a group of volunteer professionals of the Taiwan Association of Mental Health Informatics [4,21]. Both websites provide various kinds of free psychiatric services and update the consultation documents periodically. For privacy reasons, all personal information has been removed. A total of 3650 consultation documents were collected for evaluating the retrieval model, of which 20 documents were randomly selected as the test query set, 100 documents were randomly selected as the tuning set for obtaining the optimal parameter settings of the retrieval models involved, and the remaining 3530 documents formed the reference set to be retrieved. Table 5 shows the average number of events, symptoms and relations in the test query set.

Fig. 5. Illustrative example of relevance computation using the sequence kernel function.

Table 5
Characteristics of the test query set (average number per query).

Negative life event: 1.45
Depressive symptom: 4.40
Semantic relation: 3.35

2) Baselines: The proposed method, denoted as DISCOURSE, was compared with two word-based retrieval models: the VSM and the Okapi BM25 model. The VSM was implemented in terms of the standard TF-IDF weighting [37]. The Okapi BM25 model is defined as

$$\sum_{t \in Q} w^{(1)}\, \frac{(k_1 + 1)\, tf}{K + tf}\, \frac{(k_3 + 1)\, qtf}{k_3 + qtf} \;+\; k_2 \cdot |Q| \cdot \frac{avdl - dl}{avdl + dl}, \tag{14}$$

where $t$ denotes a word in a query $Q$; $qtf$ and $tf$ denote the word frequencies occurring in a query and a document, respectively; and $w^{(1)}$ denotes the Robertson–Sparck Jones weight of $t$ (without relevance feedback), defined as

$$w^{(1)} = \log \frac{N - n + 0.5}{n + 0.5}, \tag{15}$$
where $N$ denotes the total number of documents, and $n$ denotes the number of documents containing $t$. In (14), $K$ is defined as

$$K = k_1 \bigl((1 - b) + b \cdot dl / avdl\bigr), \tag{16}$$
where $dl$ and $avdl$ denote the length and the average length of a document, respectively. The default values of $k_1$, $k_2$, $k_3$ and $b$ are described in [32], where $k_1$ ranges from 1.0 to 2.0, $k_2$ is set to 0, $k_3$ is set to 8, and $b$ ranges from 0.6 to 0.75. Additionally, BM25 reduces to BM15 and BM11 when $b$ is set to 1 and 0, respectively.
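The following is a minimal sketch of the BM25 baseline in Eqs. (14)-(16), using the parameter values reported in Section 4.1 (k1 = 1, b = 0.6, k2 = 0, k3 = 8); the tokenization and data structures are simplifying assumptions.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avdl,
               k1=1.0, b=0.6, k2=0.0, k3=8.0):
    """Okapi BM25, Eqs. (14)-(16). query_terms/doc_terms are token lists;
    doc_freq maps a term to the number of documents containing it."""
    qtf_counts, tf_counts = Counter(query_terms), Counter(doc_terms)
    dl = len(doc_terms)
    K = k1 * ((1.0 - b) + b * dl / avdl)                       # Eq. (16)
    score = 0.0
    for t, qtf in qtf_counts.items():
        tf = tf_counts.get(t, 0)
        if tf == 0:
            continue
        n = doc_freq.get(t, 0)
        w1 = math.log((n_docs - n + 0.5) / (n + 0.5))          # Eq. (15)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    # Document-length correction term of Eq. (14); it vanishes when k2 = 0.
    score += k2 * len(query_terms) * (avdl - dl) / (avdl + dl)
    return score
```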


Table 6
Example of computing DCG values. (The discount factor does not apply to the document at rank 1.)

Rank i   Gain G[i]   Discounted gain G[i]/log2 i   DCG[i]
1        3           3                             3
2        2           2                             5
3        3           1.89                          6.89
4        1           0.5                           7.39
5        0           0                             7.39

Table 7
Average number of relevant documents for the test query set.

Level 1: 18.50
Level 2: 9.15
Level 3: 2.20

3) Evaluation metric: To evaluate the retrieval models, a multi-level relevance criterion was adopted. The relevance criterion was divided into four levels, as described below.

Level 0: No discourse units are matched between a query and a document.

Level 1: At least one discourse unit is partially matched between a query and a document.

Level 2: All of the three discourse units are partially matched between a query and a document.

Level 3: All of the three discourse units are partially matched, and at least one discourse unit is exactly matched between a query and a document.

To deal with the multi-level relevance, the discounted cumulative gain (DCG) [13,19,20,47] was adopted as the evaluation metric. Given a ranked list of retrieved documents, the DCG at rank $i$ is defined as

$$DCG[i] = \begin{cases} G[1], & \text{if } i = 1, \\ DCG[i-1] + G[i] / \log_c i, & \text{otherwise}, \end{cases} \tag{17}$$
where $G[i]$ denotes the gain value, i.e., the relevance level, of the document retrieved at rank $i$; $\log_c i$ denotes a rank-based discount factor that penalizes a document retrieved at a lower rank; and $G[i] / \log_c i$ can thus be considered the discounted gain value of the document retrieved at rank $i$. According to this formula, the DCG simultaneously considers the relevance levels and the ranks in the retrieved list to measure the retrieval precision. Table 6 shows the computation of DCG values for a retrieved list of five documents with relevance levels 3, 2, 3, 1, 0.

In this example, $c = 2$, which indicates that documents retrieved at ranks lower than two are penalized. Additionally, the penalty increases as the rank increases. For instance, the discounted value of the document at rank 4 is $G[4]/\log_2 4 = 1/2 = 0.5$. Finally, the DCG is accumulated over ranks, producing $DCG[5] = 7.39$.

The relevance judgment was performed by three experienced physicians. First, the pooling method [1,12,48] was adopted to generate the candidate relevant documents for each test query by taking the top 50 ranked documents retrieved by each of the retrieval models involved, namely the VSM, BM25 and DISCOURSE. Two physicians then assigned level 1, 2, 3, or 0 (irrelevant) to each candidate document based on the multi-level relevance criterion. Finally, the documents on which the two physicians disagreed were judged by the third physician. Table 7 shows the average number of relevant documents for the test query set.
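Returning to the evaluation metric, the following minimal sketch computes the DCG of Eq. (17) and reproduces the worked example in Table 6; the function name is a placeholder and the gain list is taken from that example.

```python
import math

def dcg(gains, c=2):
    """Eq. (17): discounted cumulative gain with log base c; no discount at rank 1."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log(rank, c)
    return total

# Relevance levels of the retrieved list in Table 6, ranks 1..5.
print(round(dcg([3, 2, 3, 1, 0]), 2))   # 7.39
```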

4) Optimal parameter setting: The parameter settings of BM25 and DISCOURSE were evaluated using the tuning set. The optimal settings of BM25 were $k_1 = 1$ and $b = 0.6$. The other two parameters were set to their default values, i.e., $k_2 = 0$ and $k_3 = 8$. For the discourse-aware model, the parameters to be evaluated include the combination factors $\alpha$ and $\beta$ described in (4), the constant $const$ described in (5), and the decay factor $\lambda$ described in (13). The optimal settings were $\alpha = 0.3$, $\beta = 0.5$, $const = 0.6$ and $\lambda = 0.8$.

4.2. Evaluation on retrieval precision

The experiments were divided into two groups: precision and efficiency. The retrieval precision was measured by DCG values. Additionally, the Wilcoxon test was used to determine whether the performance differences were statistically significant. The efficiency was measured by the query processing time.

1) Results of using different discourse units: This experiment evaluated the retrieval precision using different discourse units, namely events, symptoms, and relations. The discourse-aware model was implemented in terms of each single discourse unit, denoted as EVN, SYM, and REL, respectively, and in terms of all discourse units, denoted as DISCOURSE (with the optimal parameter settings described in the previous section). Table 8 shows the results. The results indicate that using only the discourse unit SYM achieved higher precision than using only EVN or REL. Additionally, combining all discourse units further improves the performance.

Table 8
DCG values of using different discourse units. (a: DISCOURSE vs. SYM significantly different, p < 0.05.)

Model      DCG(5)   DCG(10)  DCG(20)  DCG(50)  DCG(100)
DISCOURSE  4.7516a  6.9298a  7.6040a  8.3606a  9.3974a
SYM        4.3224   6.4123   6.9456   7.6294   8.4597
EVN        4.1536   5.9768   6.5479   7.3143   8.1087
REL        3.9105   5.6185   6.0891   6.9622   7.6915

Table 9
DCG values of different retrieval models. (a: DISCOURSE vs. BM25 significantly different, p < 0.05.)

Model      DCG(5)   DCG(10)  DCG(20)  DCG(50)  DCG(100)
DISCOURSE  4.7516a  6.9298   7.6040a  8.3606a  9.3974a
BM25       4.4624   6.7023   7.1156   7.8129   8.6597
BM11       3.8877   4.9328   5.9589   6.9703   7.7057
VSM        2.3454   3.3195   4.4609   5.8179   6.6945
BM15       2.1362   2.6120   3.4487   4.5452   5.7020

Table 10
Average query processing time of different retrieval models.

DISCOURSE: 17.13 seconds
VSM: 0.68 seconds
BM25: 0.48 seconds

Table 11
Average time for each step in the discourse-aware model (total 17.13 seconds).

Sentence parsing: 12.11 seconds (71%)
Similarity computation for SYM and EVN (cosine measure): 1.75 seconds (10%)
Similarity computation for REL (kernel function): 3.27 seconds (19%)

2) Comparative results of different retrieval models: To compare the performance of the discourse-aware model with the baseline systems, the best system from Table 8, DISCOURSE, was chosen for the comparison. Table 9 shows the comparative results for retrieval precision. The two variants of BM25, namely BM11 and BM15, were also included in the comparison. Among the word-based retrieval models, both BM25 and BM11 outperformed the VSM, and BM15 performed worst. DISCOURSE achieved higher DCG values than both the BM-series models and the VSM. The reasons are three-fold. First, a negative life event and a symptom can each be expressed by different words with the same or similar meaning. Therefore, the word-based models often failed to retrieve the relevant documents when different words were used in the input query. Second, a word may relate to different events and symptoms. For instance, the term "worry about" is a good indicator for both the symptoms <Anxiety> and <Hypochondriasis>. This may result in ambiguity for the word-based models. Third, the word-based models cannot capture the semantic relations between symptoms. The discourse-aware model incorporates not only the word-level information but also more useful discourse information about depressive problems, thus improving the retrieval results.

4.3. Evaluation on retrieval efficiency

The retrieval efficiency was measured by the query processing time, i.e., the time for processing all test queries. The query processing time was measured on a personal computer with the Windows XP operating system, a 2.4 GHz Pentium IV processor and 512 MB RAM. Table 10 shows the results. DISCOURSE required more processing time than both the VSM and BM25, since the identification of discourse information involves more detailed analysis steps for query processing, including sentence parsing and similarity computation for events, symptoms and relations. This finding indicates that although the discourse information can improve the retrieval precision, incorporating such high-precision features reduces the retrieval efficiency. Table 11 shows the average time spent on each step in DISCOURSE. The results indicate that sentence parsing required about 70% of the query processing time, and that the similarity computation for relations, i.e., the kernel function, is more complex than that for events and symptoms, i.e., the cosine measure.


5. Conclusions

This work has presented the use of discourse information for retrieving psychiatric consultation documents. The main contributions of this work can be divided into two aspects: technical and practical. On the technical side, the discourse information provides more precise information about users' depressive problems, thus improving the retrieval precision. Additionally, the proposed framework can be applied to other domains as long as the domain-specific discourse information is identified. On the practical side, psychiatric document retrieval can support psychological treatments such as bibliotherapy by retrieving consultation documents relevant to users' depressive problems, so that they can learn self-help skills to alleviate their symptoms.

Future work will focus on three directions. First, more effective information fusion methods will be investigated to combine the discourse information. Second, shallow parsing techniques will be adopted to improve the retrieval efficiency. Finally, more diverse discourse information will be incorporated to evaluate the portability of the framework.

Acknowledgement

This work was supported by the National Science Council, Taiwan, ROC, under Grant No. NSC 97-2218-E-155-011. The authors would like to thank the anonymous reviewers and the associate editor for their constructive comments.

References

[1] P. Ahlgrena, L. Grönqvist, Evaluation of retrieval effectiveness with incomplete relevance data: Theoretical and experimental comparison of three measures, Inf. Process. Manage. 44 (1) (2008) 212–225.
[2] N. Alireza, Webotherapy: Reading web resources for problem solving, The Electronic Library 25 (6) (2007) 741–756.
[3] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Reading, MA, 1999.
[4] Y.M. Bai, C.C. Lin, J.Y. Chen, W.C. Liu, Virtual psychiatric clinics, Am. J. Psychiat. 158 (7) (2001) 1160–1161.
[5] G. Bohning, Bibliotherapy: Fitting the resources together, Elementary School J. 82 (2) (1981) 166–170.
[6] E.M. Brostedt, N.L. Pedersen, Stressful life events and affective illness, Acta Psychiat. Scand. 107 (3) (2003) 208–215.
[7] J. Burstein, D. Marcu, K. Knight, Finding the WRITE stuff: Automatic identification of discourse structure in student essays, IEEE Intell. Syst. 18 (1) (2003) 32–39.
[8] N. Cancedda, E. Gaussier, C. Goutte, J.M. Renders, Word-sequence kernels, J. Mach. Learn. Res. 3 (6) (2003) 1059–1082.
[9] C.C. Chang, C.J. Lin, LIBSVM: A library for support vector machines, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[10] K.J. Chen, C.R. Huang, F.Y. Chen, C.C. Luo, M.C. Chang, C.J. Chen, Sinica treebank: Design criteria, representational issues and implementation, in: A. Abeille (Ed.), Building and Using Syntactically Annotated Corpora, Kluwer, 2001, pp. 29–37.
[11] G.Y. Chung, E. Coiera, A study of structured clinical abstracts and the semantic classification of sentences, in: Proc. of the BioNLP Workshop on Biological, Translational, and Clinical Language Processing at ACL-07, Prague, Czech Republic, 2007, pp. 121–128.
[12] T.A.S. Coelho, P. Calado, L.V. Souza, B. Ribeiro-Neto, R. Muntz, Image retrieval using multiple evidence ranking, IEEE Trans. Knowl. Data Eng. 16 (4) (2004) 408–417.
[13] W.B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Pearson Education, 2009.
[14] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
[15] M. Hamilton, A rating scale for depression, J. Neurol. Neurosur. Ps. 23 (1960) 56–62.
[16] B. He, I. Ounis, Combining fields for query expansion and adaptive query expansion, Inf. Process. Manage. 43 (5) (2007) 1294–1307.
[17] M.A. Hearst, TextTiling: Segmenting text into multi-paragraph subtopic passages, Comput. Linguist. 23 (1) (1997) 33–64.
[18] F. Ibekwe-SanJuan, S. Fernandez, E. SanJuan, E. Charton, Annotation of scientific summaries for information retrieval, in: Proc. of the Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR '08), Glasgow, Scotland, 2008.
[19] K. Jarvelin, J. Kekalainen, IR evaluation methods for retrieving highly relevant documents, in: Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 41–48.
[20] K. Jarvelin, J. Kekalainen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst. 20 (4) (2002) 422–446.
[21] C.C. Lin, Y.M. Bai, J.Y. Chen, Reliability of information provided by patients of a virtual psychiatric clinic, Psychiat. Serv. 54 (8) (2003) 1167–1168.
[22] J. Lin, D. Karakos, D. Demner-Fushman, S. Khudanpur, Generative content models for structural analysis of medical abstracts, in: Proc. of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT/NAACL-06, Brooklyn, NY, 2006, pp. 65–72.
[23] E.D. Liddy, W. Paik, M. McKenna, Development and implementation of a discourse model for newspaper texts, in: Proc. of the AAAI Symposium on Empirical Methods in Discourse Interpretation and Generation, Stanford, CA, 1995.
[24] E.D. Liddy, Enhanced text retrieval using natural language processing, Bulletin of the American Society for Information Science 24 (4) (1998) 14–16.
[25] E.D. Liddy, T. Diamond, M. McKenna, DR-LINK in TIPSTER III, Inform. Retrieval 3 (4) (2000) 291–311.
[26] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, J. Mach. Learn. Res. 2 (3) (2002) 419–444.
[27] R.W. Marrs, A meta-analysis of bibliotherapy studies, Am. J. Community Psychol. 23 (6) (1995) 843–870.
[28] Y. Mizuta, A. Korhonen, T. Mullen, N. Collier, Zone analysis in biology articles as a basis for information extraction, Int. J. Med. Inform. 75 (6) (2006) 468–487.
[29] R. Navigli, P. Velardi, A. Gangemi, Ontology learning and its application to automated terminology translation, IEEE Intell. Syst. 18 (1) (2003) 22–31.
[30] M. Okabe, K. Umemura, S. Yamada, Query expansion with the minimum user feedback by transductive learning, in: Proc. of HLT/EMNLP-05, Vancouver, Canada, 2005, pp. 963–970.
[31] M.E. Pagano, A.E. Skodol, R.L. Stout, M.T. Shea, S. Yen, C.M. Grilo, C.A. Sanislow, D.S. Bender, T.H. McGlashan, M.C. Zanarini, J.G. Gunderson, Stressful life events as predictors of functioning: Findings from the collaborative longitudinal personality disorders study, Acta Psychiat. Scand. 110 (2004) 421–429.
[32] S.E. Robertson, S. Walker, M.M. Beaulieu, M. Gatford, Okapi at TREC-4, in: Proc. of the Fourth Text REtrieval Conference (TREC-4), NIST, 1996.
[33] S.E. Robertson, S. Walker, S. Jones, M.M. Hancock-Beaulieu, M. Gatford, Okapi at TREC-3, in: Proc. of the Third Text REtrieval Conference (TREC-3), NIST, 1995.
[34] H. Rodríguez, S. Climent, P. Vossen, L. Bloksma, W. Peters, A. Alonge, F. Bertagna, A. Roventint, The top-down strategy for building EuroWordNet: Vocabulary coverage, base concepts and top ontology, Comput. Humanities 32 (1998) 117–159.
[35] B. Rosario, M.A. Hearst, Classifying semantic relations in bioscience texts, in: Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), Barcelona, Spain, 2004, pp. 430–437.
[36] P. Ruch, C. Boyer, C. Chichester, I. Tbahriti, A. Geissbuhler, P. Fabry, J. Gobeill, V. Pillet, D. Rebholz-Schuhmann, C. Lovis, A.L. Veuthey, Using argumentation to extract key sentences from biomedical abstracts, Int. J. Med. Inform. 76 (2–3) (2007) 195–200.
[37] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process. Manage. 24 (5) (1988) 513–523.
[38] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[39] S. Somasundaran, J. Ruppenhofer, J. Wiebe, Detecting arguing and sentiment in meetings, in: Proc. of the SIGdial Workshop on Discourse and Dialogue, Antwerp, Belgium, 2007, pp. 26–34.
[40] R. Stevens, C. Goble, I. Horrocks, S. Bechhofer, Building a bioinformatics ontology using OIL, IEEE Trans. Inf. Technol. Biomed. 6 (2) (2002) 135–141.
[41] V. Stoyanov, C. Cardie, Topic identification for fine-grained opinion analysis, in: Proc. of the 22nd International Conference on Computational Linguistics (COLING '08), Manchester, United Kingdom, 2008, pp. 817–824.
[42] K. Sparck Jones, Information retrieval and artificial intelligence, Artif. Intell. 114 (1–2) (1999) 257–281.
[43] I. Tbahriti, C. Chichester, F. Lisacek, P. Ruch, Using argumentation to retrieve articles with similar citations: An inquiry into improving related articles search in the MEDLINE Digital Library, Int. J. Med. Inform. 75 (6) (2006) 488–495.
[44] S. Teufel, M. Moens, Summarising scientific articles: Experiments with relevance and rhetorical status, Comput. Linguist. 28 (4) (2002) 409–445.
[45] A. van Straten, P. Cuijpers, N. Smits, Effectiveness of a web-based self-help intervention for symptoms of depression, anxiety, and stress: Randomized controlled trial, J. Med. Internet Res. 10 (1) (2008).
[46] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[47] E.M. Voorhees, Evaluation by highly relevant documents, in: Proc. of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 74–82.
[48] E.M. Voorhees, D.K. Harman, Overview of the Sixth Text REtrieval Conference (TREC-6), Inf. Process. Manage. 36 (1) (2000) 3–35.
[49] C.H. Wu, J.F. Yeh, M.J. Chen, Domain-specific FAQ retrieval using independent aspects, ACM Trans. on Asian Language Information Processing 4 (1) (2005) 1–17.
[50] C.H. Wu, J.F. Yeh, Y.S. Lai, Semantic segment extraction and matching for internet FAQ retrieval, IEEE Trans. Knowl. Data Eng. 18 (7) (2006) 930–940.
[51] C.H. Wu, L.C. Yu, F.L. Jang, Using semantic dependencies to mine depressive symptoms from consultation records, IEEE Intell. Syst. 20 (6) (2005) 50–58.
[52] J.F. Yeh, C.H. Wu, M.J. Chen, Ontology-based speech act identification in a bilingual dialog system using partial pattern trees, J. Am. Soc. Inf. Sci. Technol. 59 (5) (2008) 684–694.
[53] L.C. Yu, C.H. Wu, J.F. Yeh, F.L. Jang, HAL-based evolutionary inference for pattern induction from psychiatry web resources, IEEE Trans. Evol. Comput. 12 (2) (2008) 160–170.
