Multi-Task Unsupervised Contextual Learning for Behavioral Annotation

Shao-Yen Tseng and Panayiotis Georgiou, Senior Member, IEEE

Abstract—Unsupervised learning has been an attractive method for easily deriving meaningful data representations from vast amounts of unlabeled data. These representations or, in the context of deep learning, embeddings, often yield superior results in many tasks, whether used directly or as features in subsequent training stages. However, the quality of the embeddings is highly dependent on the assumed knowledge in the unlabeled data and how the system extracts information without supervision. Domain portability is also very limited in unsupervised learning, often requiring re-training on other large-scale corpora to achieve robustness. In this work we present a paradigm for unsupervised contextual learning for behavioral interactions which addresses unsupervised domain adaptation. We introduce a multitask objective into unsupervised learning and show that embeddings generated through this process increase performance on behavior-related tasks.

I. INTRODUCTION

Representation learning has become a crucial tool for obtaining superior results in many machine learning tasks [1]. The process of transforming input into more informative non-linear abstractions has been especially beneficial for natural language processing (NLP), with a notable example being word embeddings [2] and their proven effectiveness in many tasks [3]. Word embeddings exploit the use of language by learning semantic regularities based on a context of neighboring words. This form of contextual learning is unsupervised, which allows learning from large-scale corpora and is the main reason for the strength and popularity of word embeddings.

Recently, vector representations were obtained from entire sentences to allow the capture of longer context and higher level of concepts in the text. For example, [4] obtained sentence embeddings, which they referred to as skip-thoughts, by training models to generate the surrounding sentences of extracts from contiguous pieces of text from novels. The authors showed that the embeddings were adept at representing the semantic and syntactic properties of sentences through evaluation on various semantic related tasks. In [5] the authors extracted sentence embeddings from an LSTM-RNN which was trained using user click-through data logged from a web search engine. They then showed that embeddings generated by their models were especially useful for web document retrieval tasks. Later, [6] extracted sentence embeddings from a conversation model and showed the richness of semantic content by applying an additional weakly-supervised architecture to estimate the behavioral ratings of couples therapy sessions. Many other works have studied the idea of general purpose representations [7], [8], [9].

The benefit of the methods in the aforementioned works is that the embedding transformation is learned using large amounts of unlabeled data. This greatly reduces the effort of data annotation while increasing the breadth of language understanding. However, it has also been noted that the quality of embeddings is often highly dependent on the training dataset [5], [6]. So much so that the use of embeddings trained on small domain-relevant datasets could yield results better than those trained on larger generic unsupervised datasets [4].

Unsupervised training has been used to learn embedding transformations that improve performance on many tasks. The use of unlabeled data is attractive because it requires the least annotation effort and makes it possible to incorporate vast amounts of both semantic and syntactic information. Through this method we are able to obtain representations of words or sentences that encode concepts in some form or other.

However, a common issue with unsupervised training is the unpredictability of the embedding transformation. In other words, the embedding manifold is highly random and may contain redundant and useless information. In addition, depending on the data used, it might not capture all concepts or even the ones relevant to the final task.

In this work we propose a multi-task learning (MTL) framework which guides unsupervised sentence embeddings into a space that is more discriminative in identifying human behaviors. We analyze the performance of the sentence embeddings in annotating behaviors in Couples Therapy and show that MTL improves the potency of unsupervised contextual learning.

II. RELATED WORK

Many works have focused on leveraging multi-task learning to enhance the informational content of sentence embeddings. These methods can generally be categorized into task-specific or general-purpose applications.

In task-specific implementations a multi-task function is often added to a primary objective. For example, [10] jointly learned sentence embeddings with an additional pivot prediction task in conjunction with sentiment classification. [11] predicted neighboring words as a secondary objective to improve accuracy of various sequence labeling tasks.

On the other hand, general-purpose sentence embeddings aim to encode enough of the input content to excel in a number of unrelated tasks. [12] achieved this by combining various tasks such as machine translation, constituency parsing, and image caption generation, which improved the translation quality between English and German as well as constituency parsing. Recently, [13] presented a large-scale multi-task framework for learning general-purpose sentence embeddings by training with a multitude of NLP tasks.

Our work differs in that we build on unsupervised sentence embeddings and attempt to guide learning through a task-related multi-task objective. We argue that multi-task sentence embeddings can be more efficiently obtained for single tasks by joint training on unlabeled data and task-related objectives. Labels for the multi-task objective are generated automatically from unlabeled training data, which both removes the requirement for labeled data and decreases the complexity of training. This approach is well-suited for low-resource settings as the sentence embeddings learn task-related concepts using out-of-domain data, as long as a related multi-task objective is available.

In this paper we evaluate the performance of the unsupervised multi-task sentence embeddings by identifying various human behaviors in conversational dialogue. We then provide an analysis of the results to give insight on the benefits of multi-task sentence embeddings.

III. LEARNING CONTEXTUAL EMBEDDINGS

A. Sequence-to-sequence sentence embeddings

The sequence-to-sequence model (seq2seq) [14] maps input sequences to output sequences using an encoder-decoder architecture. Given an input sentence $x = (x_0, x_1, \ldots, x_T)$ and output sentence $y = (y_0, y_1, \ldots, y_{T'})$, where $x_t$ and $y_t$ represent individual words, the standard sequence model can be expressed as computing the conditional probability

$$P(y \mid x) = \prod_{t=0}^{T'} P(y_t \mid y_{i<t}, s, h) \tag{1}$$

where $s$ is the sequence of outputs $s_t$ from the encoder and $h$ is the internal representation of the input given by the last hidden state of the encoder. For a given dataset $D = \{(x^n, y^n)\}_{n=1}^{N}$, the internal representation $h$ can be expressed as

$$h_\theta \equiv f(x \mid D) = f(x \mid \theta) \tag{2}$$

where $f(\cdot)$ is the encoder function and $\theta$ is the set of parameters resulting from $D$.
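As a small illustration of the factorization in Eq. (1), the sketch below accumulates per-word log-probabilities, which is how such a product is usually evaluated in practice to avoid numerical underflow. The probabilities are made-up values standing in for decoder outputs, and the function name is hypothetical.

```python
import math

def sequence_log_prob(step_probs):
    """Return log P(y | x) from the per-step probabilities P(y_t | y_<t, s, h)."""
    # Summing logs is equivalent to taking the log of the product in Eq. (1)
    return sum(math.log(p) for p in step_probs)

# Hypothetical decoder outputs: probability assigned to each correct word y_t
step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(step_probs)

# Exponentiating recovers the product 0.5 * 0.25 * 0.8, up to floating-point error
sequence_prob = math.exp(log_p)
```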

The encoder maps the input $x$ to an internal representation $h_\theta$ that allows the decoder to generate the best estimate of $y$. In cases where $D$ contains semantically-related data pairs, it has been theorized that $h_\theta$ is a semantic vector representation of the input, or sentence embedding, which can be useful for subsequent NLP tasks. For example, [4] obtained embeddings, which they referred to as skip-thoughts, by training models to generate the surrounding sentences of extracts from contiguous pieces of text from novels. The authors showed that the embeddings were adept at representing the semantic and syntactic properties of sentences through evaluation using various semantic related tasks. In [5] the authors extracted sentence embeddings from an LSTM-RNN which was trained using user click-through data logged from a web search engine. They then showed that embeddings generated by their models were especially useful for web document retrieval tasks. Later, [6] extracted sentence embeddings from a conversation model and showed the richness of semantic content by applying an additional weakly-supervised architecture to estimate the behavioral ratings of couples therapy sessions.

The benefit of the methods in the aforementioned works is that the embedding transformation is learned using large amounts of unlabeled data. This greatly reduces the effort of data annotation while increasing the breadth of language understanding. However, it has also been noted that the quality of embeddings is highly dependent on the training dataset $D$ [5], [6]. So much so that the use of embeddings trained on small domain-relevant datasets could yield results better than those trained on larger generic unsupervised datasets [4].

B. Multitask embeddings

Building on our previous work [6], we propose to enhance the quality of sentence embeddings through multi-task learning. We hypothesize that the addition of a multi-task objective can guide the embeddings into a space that is more discriminative in our target application.

Assuming an additional task with label $b$, we can augment the dataset to yield $D_{aug} = \{(x^n, y^n, b^n)\}_{n=1}^{N}$. We then aim to predict this new label $b$ in conjunction with the original output sequence $y$. This is implemented by applying a graph structure after the internal representation $h$, as shown in Figure 1. The model now also estimates the conditional probability

$$P(b \mid x) = g(h \mid D_{aug}) = g(h_{\theta_{aug}}) \tag{3}$$

where $g(\cdot)$ is the multi-task function and $h_{\theta_{aug}}$ is the new internal representation given by $D_{aug}$. In our work, the multi-task function $g(\cdot)$ is implemented with a multilayer perceptron.

The additional task attempts to predict the positive or negative state of the input sentences. By jointly optimizing on the multiple tasks we hypothesize that the embeddings will become more discriminative in behavioral signal processing while still preserving high level sentence concepts.
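One way to picture joint optimization over the two tasks is as a single loss that adds the sequence generation term to the affect prediction term. The sketch below is only an illustration: the weighted sum, the `weight` parameter, the toy three-word vocabulary and the ordering of the affect classes are all assumptions, since the text does not state how the two objectives are combined.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])

def multitask_loss(decoder_step_probs, target_words, affect_probs, affect_label, weight=1.0):
    # Sequence term: sum of per-word negative log-likelihoods (negative log of Eq. (1))
    seq_loss = sum(cross_entropy(p, w) for p, w in zip(decoder_step_probs, target_words))
    # Auxiliary term: negative log-likelihood of the affect label b under g(h)
    aux_loss = cross_entropy(affect_probs, affect_label)
    # ASSUMPTION: a weighted sum; the combination rule is not given in the text
    return seq_loss + weight * aux_loss

loss = multitask_loss(
    decoder_step_probs=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],  # toy 3-word vocabulary
    target_words=[0, 1],
    affect_probs=[0.25, 0.75],  # [P(negative), P(positive)]; ordering is an assumption
    affect_label=1,
)
```

With `weight=0.0` the expression reduces to the plain sequence-to-sequence loss, which makes the contribution of the auxiliary objective easy to isolate in such a sketch.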

Multitask training can be implemented by applying a graph structure to the sequence-to-sequence model. So now, in addition to generating consecutive sentences in a conversation, the model also predicts the affective state of the original sentence.

Fig. 1. Bi-directional sequence-to-sequence conversation model with multitask objective. The GRU blocks represent multi-layered RNNs using GRU units and C is the concatenation function.

The outputs of the encoder are used to predict the sentiment of the input in addition to generating the following sentence.

The prediction is given through additional dense layers.

IV. BEHAVIOR IDENTIFICATION USING EMBEDDINGS

After unsupervised multitask training of the sequence-to-sequence model the encoder is used to extract embeddings for use as features in behavior annotation. We define sentence embeddings to be the concatenation of the final output states of both the forward and backward RNNs in the encoder. We also concatenated the output states from all the intermediate layers of the encoder. This is an extension of history-of-word embeddings [15] and is motivated by the intuition that intermediate layers represent different levels of concept. By utilizing all possible concept representations of the sentence, we hypothesize that subsequent systems will be more adept at extracting information related to human behavior.
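The construction of the sentence embedding can be sketched as a plain concatenation over layers and directions. The per-layer ordering (forward state followed by backward state) is an assumption made for illustration, and the sizes follow the 3-layer, 300-dimensional encoder reported in the experimental setup, which under this reading would give a 1800-dimensional embedding.

```python
def sentence_embedding(forward_states, backward_states):
    """Concatenate the final states of every encoder layer in both directions.

    forward_states / backward_states: one list of floats per encoder layer.
    """
    embedding = []
    for fwd, bwd in zip(forward_states, backward_states):
        # ASSUMPTION: forward state first, then backward state, layer by layer
        embedding.extend(fwd)
        embedding.extend(bwd)
    return embedding

# 3 layers x 300 units per direction; the values are placeholders
num_layers, hidden = 3, 300
fwd = [[0.0] * hidden for _ in range(num_layers)]
bwd = [[1.0] * hidden for _ in range(num_layers)]
emb = sentence_embedding(fwd, bwd)
embedding_size = len(emb)  # 3 layers * 2 directions * 300 units
```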

Annotation of human behavior using sentence embeddings can be applied using many different methods. In this work we explored various supervised and unsupervised methods.

A. Unsupervised clustering

An intuitive first step in analyzing the embeddings is to see how well they perform on a classification task without any supervision. To do this, we applied a simple k-means clustering method on individual sentence embeddings to generate multiple clusters. We then labeled the clusters by randomly selecting a single seed article and assigning its label to the cluster center that the majority of its sentence embeddings were closest to. During testing, the label of an article was predicted from the cluster that the majority of its sentence embeddings were closest to.
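The seed-based labeling and majority-vote prediction described above can be sketched as follows. The cluster centers are taken as already computed by k-means; the two-dimensional vectors, the label names and the rule that the second of two clusters receives the opposite label are illustrative assumptions rather than details from the experiments.

```python
from collections import Counter

def nearest_center(vec, centers):
    """Index of the center with the smallest squared Euclidean distance to vec."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centers]
    return dists.index(min(dists))

def majority_cluster(article, centers):
    """Cluster index that most sentence embeddings in the article fall closest to."""
    votes = Counter(nearest_center(v, centers) for v in article)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D embeddings and two precomputed k-means centers
centers = [(0.0, 0.0), (5.0, 5.0)]
seed_article, seed_label = [(4.8, 5.1), (5.2, 4.9), (0.1, 0.2)], "high"

# Label the cluster the seed article mostly falls into. ASSUMPTION: with two
# clusters and binary labels, the other cluster receives the opposite label.
seed_cluster = majority_cluster(seed_article, centers)
cluster_labels = {seed_cluster: seed_label, 1 - seed_cluster: "low"}

test_article = [(0.3, -0.1), (0.2, 0.4), (4.9, 5.0)]
predicted = cluster_labels[majority_cluster(test_article, centers)]
```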

B. k-Nearest neighbors

Supervised classification can also be achieved using a simple k-nearest neighbor (kNN) approach. In kNN, an embedding is labeled according to its k-nearest neighbors in the training set. The final label can be obtained using either a majority vote of the neighbors or as a probability based on the ratio of labels.

C. Neural networks

Finally, we applied neural networks on top of the embeddings for behavior annotation. Through the use of sentence embeddings we segment long pieces of text into sentences and represent them as a sequence of embeddings. In applications where the target label is identifiable at finer resolutions, the article can be split into sentence segments, in which case we can represent the article as multiple sequences of N embeddings. For the prediction and annotation architecture we use the framework proposed in [6].

V. EXPERIMENTAL SETUP

A. Datasets

1) Unsupervised learning on movie subtitles: We use separate datasets to train the unsupervised and supervised portions of our proposed method. Since our final task is behavior annotation of human interaction, we wish to use a dataset that contains conversational speech when learning the unsupervised sentence embedding. A natural choice for a source rich in dialogue is movie subtitles. To this end we used the OpenSubtitles Corpus [16]. This corpus was generated using data from the website opensubtitles.org and contains user-submitted subtitles of movies and TV shows.

We applied further pre-processing in addition to the steps already taken in [16]. Mainly, we attempted to generate a back-and-forth conversation by taking consecutive lines in the subtitles and assigning them as utterances and replies in an interaction. As there is no speaker information in the corpus, it is hard to distinguish between dialogues and monologues without the use of advanced content analysis methods. However, we assume that this difference in conversational continuity will be dampened by the greater amount of data. We also assume that monologues represent some form of internal dialogue which closely ties the concepts between sentences.

Finally, we applied standard text processing techniques to clean up the text further. These included auto-correction of commonly misspelled words, contraction removal, and replacement of proper nouns through parts-of-speech tagging. The final unsupervised training set consists of 30 million sentence pairs. The objective of the sequence-to-sequence model is then to predict one sentence given the other of these sentence pairs.
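The pairing of consecutive subtitle lines into utterance and reply can be sketched in a few lines. Whether the pairs overlap (each line serving as a reply and then as the next utterance) is not stated in the text; the overlapping variant below is an assumption, and the example lines are invented.

```python
def make_pairs(lines):
    """Pair each subtitle line with the line that follows it."""
    # ASSUMPTION: overlapping pairs, so every line except the last is an utterance
    return list(zip(lines, lines[1:]))

subtitles = ["where are you going", "to the store", "can i come with you"]
pairs = make_pairs(subtitles)
# pairs[0] is ("where are you going", "to the store")
```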

2) Semi-supervised multitask labels: As described in the previous section, a multi-task objective is introduced to the sequence-to-sequence model to guide the embeddings into a behavior-rich manifold. The labels for this multi-task objective were generated in a semi-supervised manner by automatically labeling the subtitles dataset using a simple look-up table of affective words [17]. Each input sentence in the subtitles dataset was assigned a negative or positive label based on the count of words corresponding to each affective state. Examples of negative or positive words are {hate, hurt, nasty} and {love, sweet, cute}, respectively. Although this labeling approach is very naive and has a high chance of mis-classification, we hypothesize that under multi-task learning this enriches embeddings with behavioral information while maintaining sentence concepts. Examples of affective words in the look-up table we used are shown in Table I.

TABLE I

EXAMPLES OF AFFECTIVE WORDS

Affective State
Positive    Negative
love        hate
nice        hurt
sweet       ugly
cute        nasty
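A minimal sketch of the look-up labeling, using only the example words from Table I, is given below. The whitespace tokenization and the handling of ties or sentences without any affective word (returned as `None` here) are assumptions; the text only states that labels follow the count of words for each affective state.

```python
POSITIVE = {"love", "nice", "sweet", "cute"}   # example words from Table I
NEGATIVE = {"hate", "hurt", "ugly", "nasty"}

def affect_label(sentence):
    """Return 'positive' or 'negative' by counting lexicon hits, or None on a tie."""
    words = sentence.lower().split()  # ASSUMPTION: simple whitespace tokenization
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos == neg:
        return None  # ASSUMPTION: ties and zero-hit sentences are left unlabeled
    return "positive" if pos > neg else "negative"

labels = [affect_label(s) for s in (
    "I love this sweet dog",
    "that was a nasty thing to say",
    "see you tomorrow",
)]
```

A sentence such as "I love how nasty this is" would tie under this rule, which illustrates the kind of mislabeling the simple count-based scheme is prone to.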

3) Behavior dataset: To train behavior models we use data from the UCLA/UW Couple Therapy Research Project [18] which contains audio and video recordings of 134 couples with real marital issues interacting over multiple sessions. In each session, couples discuss a specific topic chosen in turn for around 10 minutes. The behaviors of each speaker are then rated by multiple annotators based on the Couples Interaction [19] and Social Support [20] Rating Systems. The rating system contains 33 behavioral codes such as “Acceptance”, “Blame” and “Negativity”. Each annotator provides ratings for these codes on a Likert scale of 1 to 9 for every session, where 1 indicates strong absence and 9 indicates strong presence of the behavior. There are 2 to 12 annotators per session with the majority of sessions (∼ 90%) having 3 to 4 annotators. Finally, these ratings are averaged to obtain a 33-dimensional vector of behavior ratings per interlocutor for every session.

B. Model architectures and training methods

1) Sentence embeddings: The sequence-to-sequence model with multitask objective, shown in Figure 1, can be described as three sections: the encoder, the decoder, and the multitask network. The encoder was constructed using a multi-layered bi-directional RNN using GRU units. We performed a grid search on the architecture hyper-parameters and used a final setting of 3 layers and 300 dimensions in each direction. For the decoder a multi-layered uni-directional RNN using GRU units was used. The number of layers was the same as in the encoder while the dimension size was doubled to account for the concatenation of states and outputs from both directions. The multitask network was built using four fully-connected layers with hidden sizes 512, 512, 256, and 128 respectively. The activation function used in the hidden layers was the ReLU function and the final output was a 2-dimensional softmax output. This model was trained with the OpenSubtitles dataset for 5 epochs using SGD with momentum. The learning rate was set to 0.05 with a momentum of 0.9. We also reduced the learning rate by a factor of 10 every epoch.
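To make the decay schedule concrete, the sketch below lists the learning rate for each of the 5 epochs. It assumes the first epoch runs at the initial rate of 0.05 and that the reduction is applied once at the end of every epoch; the exact timing of the reduction is not specified in the text.

```python
def learning_rate(epoch, base_lr=0.05, decay=10.0):
    """Learning rate for a 0-indexed epoch when divided by `decay` after every epoch."""
    # ASSUMPTION: epoch 0 uses base_lr and the decay is applied at each epoch boundary
    return base_lr / (decay ** epoch)

# One value per training epoch, falling from 0.05 to roughly 5e-6
schedule = [learning_rate(e) for e in range(5)]
```

Under this reading the final epoch trains with a rate four orders of magnitude below the initial one, so most of the parameter movement would happen in the first two epochs.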

2) Supervised behavior annotation: The Couples Therapy Dataset was split into train and test sets using a leave-one-couple-out scheme. That is, for each fold, one couple was used as the test set and the remaining as the train set. This resulted in 85-fold cross-validation. During training, a separate couple was randomly selected from the train set as a validation set and the best model was selected based on this set.
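The fold construction can be sketched as a generator over couples. The integer couple identifiers, the fixed random seed and the choice to remove the validation couple from the training portion are assumptions made for the illustration.

```python
import random

def leave_one_couple_out(couple_ids, seed=0):
    """Yield (train, validation, test) couple IDs for each fold."""
    rng = random.Random(seed)
    for test_couple in couple_ids:
        remaining = [c for c in couple_ids if c != test_couple]
        # Validation couple drawn at random from the training portion
        val_couple = rng.choice(remaining)
        # ASSUMPTION: the validation couple is held out of the training data
        train = [c for c in remaining if c != val_couple]
        yield train, val_couple, test_couple

# 85 hypothetical couple IDs, giving one fold per couple
folds = list(leave_one_couple_out(list(range(85))))
```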

VI. RESULTS

We compared the performance of identifying various behaviors using sentence embeddings trained with and without multi-task learning, using unsupervised and supervised methods. With unsupervised clustering, the multi-task embeddings give an average absolute improvement of 1.1% accuracy in classifying behaviors in interactions. When multi-task embeddings are used with supervised methods we see an average absolute improvement of 3.5% accuracy across the different behaviors. The results of behavior annotation using multi-task sentence embeddings for different behaviors are shown in Table II.
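The two averages quoted above can be recomputed from the per-behavior accuracies in Table II. The short check below uses the table values as given; the dictionary keys and helper names are only for illustration. The differences come out at roughly 1.07 and 3.45 percentage points, in line with the reported 1.1% and 3.5%.

```python
# Accuracy (%) per behavior, in the column order of Table II:
# Acceptance, Blame, Negativity, Positivity, Sadness, Humor
results = {
    "k-Means":       [61.9, 65.4, 64.6, 65.7, 57.9, 59.1],
    "k-Means + MTL": [64.0, 66.4, 65.0, 62.1, 61.4, 62.1],
    "k-NN":          [79.6, 80.0, 85.7, 82.5, 64.6, 59.6],
    "k-NN + MTL":    [85.0, 85.4, 87.6, 86.8, 67.9, 60.0],
}

def mean(values):
    return sum(values) / len(values)

def average_gain(baseline, with_mtl):
    """Difference of mean accuracies (percentage points) across the six behaviors."""
    return mean(results[with_mtl]) - mean(results[baseline])

kmeans_gain = average_gain("k-Means", "k-Means + MTL")
knn_gain = average_gain("k-NN", "k-NN + MTL")
```

The averages hide per-behavior variation: for k-means, Positivity drops from 65.7 to 62.1 with MTL even though the mean improves.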

The t-SNE plot of multi-task embeddings from a sample session is shown in Figure 2.

VII. DISCUSSION

In this work we explored the benefits of introducing additional objectives to unsupervised contextual learning of sentence embeddings. We found empirical evidence that supports the theory that multi-task learning can increase behavioral concepts in unsupervised sentence embeddings, even when the objective labels are semi-supervised and extremely unreliable. While we do expect that further improvements can be obtained by using better labels for the multi-task objective, that would entail additional effort in system design and label generation. In addition, we also expect that multi-task labels that are too domain-specific (e.g. focusing on a specific way or definition of affective expression) may actually hinder the performance of unsupervised embeddings. However, we do not verify this claim and leave it to future work.

We analyzed the performance of various models in classi-fying Negativity over multiple checkpoints of training. From Figure 3 we can see that depending on model hyper-parameters, there is a slight overlap in performance between models with and without MTL. However, a trend of a collective increase in performance can still be observed from the introduction of the multi-task objective.

TABLE II

RESULTS OF BEHAVIOR IDENTIFICATION ACCURACY (%) USING MULTI-TASK SENTENCE EMBEDDINGS

Model      Acceptance  Blame  Negativity  Positivity  Sadness  Humor
k-Means    61.9        65.4   64.6        65.7        57.9     59.1
  + MTL    64.0        66.4   65.0        62.1        61.4     62.1
k-NN       79.6        80.0   85.7        82.5        64.6     59.6
  + MTL    85.0        85.4   87.6        86.8        67.9     60.0

Fig. 2. t-SNE analysis of the embeddings annotated with husband versus wife and negative versus non-negative labels.

Fig. 3. Comparison of classification accuracy of various models and multi-task unsupervised learning.

VIII. CONCLUSION

In this work we proposed the use of semi-supervised multi-task learning to enrich the behavioral content of unsupervised sentence embeddings. The sentence embeddings were learned using an unsupervised contextual learning framework. A multi-task objective was then added to the sequence-to-sequence model and trained with multi-task labels which were dynamically generated from unlabeled data using a lookup table. We showed that sentence embeddings extracted from this model improved performance in a final task of classifying and annotating human behaviors.

The proposed model has the benefit of not requiring additional effort in generating suitable multi-task labels. This also allows learning from large-scale corpora in an unsupervised manner, which increases robustness and makes transfer to new domains easier.

REFERENCES

[1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.

[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proceedings of Workshop at ICLR, 2013.

[3] C. dos Santos and M. Gatti, “Deep convolutional neural networks for sentiment analysis of short texts,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78.

[4] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3294–3302. [Online]. Available: http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf

[5] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward, “Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 24, no. 4, pp. 694–707, Apr. 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=2992449.2992457

[6] S.-Y. Tseng, B. Baucom, and P. Georgiou, “Approaching human performance in behavior estimation in couples therapy using deep sentence embeddings,” in Proceedings of Interspeech, August 2017.

[7] F. Hill, K. Cho, and A. Korhonen, “Learning distributed representations of sentences from unlabelled data,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1367–1377.

[8] J. Wieting, J. Mallinson, and K. Gimpel, “Learning paraphrastic sentence embeddings from back-translated bitext,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 274–285.

[9] M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised learning of sentence embeddings using compositional n-gram features,” arXiv preprint arXiv:1703.02507, 2017.

[10] J. Yu and J. Jiang, “Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, November 2016, pp. 236–246. [Online]. Available: https://aclweb.org/anthology/D16-1023

[11] M. Rei, “Semi-supervised multitask learning for sequence labeling,” arXiv preprint arXiv:1704.07156, 2017.

[12] M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, “Multi-task sequence to sequence learning,” in International Conference on Learning Representations, 2016.

[13] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal, “Learning general purpose distributed sentence representations via large scale multi-task learning,” in International Conference on Learning Representations, 2018.

[14] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3104–3112. [Online]. Available: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

[15] H.-Y. Huang, C. Zhu, Y. Shen, and W. Chen, “Fusionnet: Fusing via fully-aware attention with application to machine comprehension,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=BJIgi_eCZ

[16] J. Tiedemann, “News from OPUS - A collection of multilingual parallel corpora with tools and interfaces,” in Recent Advances in Natural Language Processing, N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, Eds. Borovets, Bulgaria: John Benjamins, Amsterdam/Philadelphia, 2009, vol. V, pp. 237–248.

[17] Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: LIWC and computerized text analysis methods,” Journal of Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010.

[18] A. Christensen, D. Atkins, S. Berns, J. Wheeler, D. Baucom, and L. Simpson, “Traditional versus integrative behavioral couple therapy for significantly and chronically distressed married couples,” Journal of Consulting and Clinical Psychology, vol. 72, no. 2, pp. 176–191, 2004.

[19] C. Heavey, D. Gill, and A. Christensen, “Couples interaction rating system 2 (CIRS2),” University of California, Los Angeles, vol. 7, 2002.

[20] J. Jones and A. Christensen, “Couples interaction study: Social support interaction rating system,” University of California, Los Angeles, Technical manual, 1998.