Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization

(1)

1569

Learning Representations from Imperfect Time Series Data

via Tensor Rank Regularization

Paul Pu Liang♣∗_{, Zhun Liu}♢∗_{, Yao-Hung Hubert Tsai}♣

Qibin Zhao♡_{, Ruslan Salakhutdinov}♣_{, Louis-Philippe Morency}♢

♣_{Machine Learning Department, Carnegie Mellon University, USA}

♢_{Language Technologies Institute, Carnegie Mellon University, USA}

♡_{Tensor Learning Unit, RIKEN Center for Artificial Intelligence Project, Japan}

{pliang,zhunl,yaohungt,rsalakhu,morency}@cs.cmu.edu

[email protected]

Abstract

There has been an increased interest in mul-timodal language processing including multi-modal dialog, question answering, sentiment analysis, and speech recognition. However, naturally occurring multimodal data is often

imperfectas a result of imperfect modalities,

missing entries or noise corruption. To ad-dress these concerns, we present a regulariza-tion method based ontensor rank minimiza-tion. Our method is based on the observation that high-dimensional multimodal time series data often exhibit correlations across time and modalities which leads to low-rank tensor rep-resentations. However, the presence of noise or incomplete values breaks these correlations and results in tensor representations of higher rank. We design a model to learn such ten-sor representations and effectively regularize their rank. Experiments on multimodal lan-guage data show that our model achieves good results across various levels of imperfection.

1 Introduction

Analyzing multimodal language sequences spans various fields including multimodal dialog (Das

et al., 2017; Rudnicky, 2005), question

answer-ing (Antol et al., 2015; Tapaswi et al., 2015;

Das et al., 2018), sentiment analysis (Morency

et al., 2011), and speech recognition (Palaskar

et al., 2018). Generally, these multimodal

se-quences contain heterogeneous sources of infor-mation across the language, visual and

acous-tic modalities. For example, when instructing

robots, these machines have to comprehend our verbal instructions and interpret our nonverbal be-haviors while grounding these inputs in their

vi-sual sensors (Schmerling et al., 2017; Iba et al.,

2005). Likewise, comprehending human inten-tions requires integrating human language, speech,

∗

first two authors contributed equally

clean multimodal data

<latexit sha1_base64="h1pzHCMMRR0RKSBQVEr4zlmGh8c=">AAACBXicbVA9SwNBEN3z2/gVtdRiMQhW4U4FLYM2lgomCskR5vY2yeLu3rE7J4YjjY1/xcZCEVv/g53/xk1yhSY+GHi8N8PMvCiVwqLvf3szs3PzC4tLy6WV1bX1jfLmVsMmmWG8zhKZmNsILJdC8zoKlPw2NRxUJPlNdHc+9G/uubEi0dfYT3mooKtFRzBAJ7XLuy3kD5gzyUFTlUkUKolB0hgQBu1yxa/6I9BpEhSkQgpctstfrThhmeIamQRrm4GfYpiDQeE2DEqtzPIU2B10edNRDYrbMB99MaD7TolpJzGuNNKR+nsiB2VtX0WuUwH27KQ3FP/zmhl2TsNc6DRDrtl4USeTFBM6jITGwnCGsu8IMCPcrZT1wABDF1zJhRBMvjxNGofV4KjqXx1XamdFHEtkh+yRAxKQE1IjF+SS1Akjj+SZvJI378l78d69j3HrjFfMbJM/8D5/AB0omPM=</latexit>

imperfect multimodal data

<latexit sha1_base64="I3wmMCkkoG+gn08puQZqeM8eRbQ=">AAACCXicbVC7SgNBFJ31GeMramkzGASrsKuClkEbywhGhSSEu7N34+DM7jJzVwxLWht/xcZCEVv/wM6/cfIo1Hhg4HDOvdw5J8yUtOT7X97M7Nz8wmJpqby8srq2XtnYvLRpbgQ2RapScx2CRSUTbJIkhdeZQdChwqvw9nToX92hsTJNLqifYUdDL5GxFEBO6lZ4m/CeCqkzNDEK4jpXJHUageIREAy6lapf80fg0ySYkCqboNGtfLajVOQaExIKrG0FfkadAgxJoXBQbucWMxC30MOWowlotJ1ilGTAd50S8Tg17iXER+rPjQK0tX0dukkNdGP/ekPxP6+VU3zcKWSS5YSJGB+Kc8Up5cNaeCSNS6/6joAw0v2VixswIMiVV3YlBH8jT5PL/VpwUPPPD6v1k0kdJbbNdtgeC9gRq7Mz1mBNJtgDe2Iv7NV79J69N+99PDrjTXa22C94H9+GKZrX</latexit>

tensor rank regularization<latexit sha1_base64="qWvflx4gMMCU5rXKIOGbeflghyQ=">AAACDHicbVC7TgJBFJ3FF+ILtbSZSEysyK6aaEm0scREHgkQMjvchQmzs5uZu0bc8AE2/oqNhcbY+gF2/o3DQqHgrU7OOTf3nuPHUhh03W8nt7S8srqWXy9sbG5t7xR39+omSjSHGo9kpJs+MyCFghoKlNCMNbDQl9Dwh1cTvXEH2ohI3eIohk7I+koEgjO0VLdYaiPcox+kCMpEmmqmhlRDP5FMi4fMNLYut+xmQxeBNwMlMptqt/jV7kU8CUEhl8yYlufG2EmZRsEljAvtxEDM+JD1oWWhYiGYTpqFGdMjy/RoYH8JIoU0Y39vpCw0ZhT61hkyHJh5bUL+p7USDC46qVBxYrPy6aEgkRQjOmmG9oQGjnJkAeNa2F8pHzDNONr+CrYEbz7yIqiflL3TsntzVqpczurIkwNySI6JR85JhVyTKqkRTh7JM3klb86T8+K8Ox9Ta86Z7eyTP+N8/gBHLZxj</latexit>

clean entries<latexit sha1_base64="wuMzvz1f+N4dxLXkrZwJSF97FMk=">AAAB/XicbVDLSgMxFM3UV62v+ti5CRbBVZlRQZdFNy4r2Ae0Q8mkt21oJjMkd8Q6FH/FjQtF3Pof7vwb03YW2nogcDjnvnKCWAqDrvvt5JaWV1bX8uuFjc2t7Z3i7l7dRInmUOORjHQzYAakUFBDgRKasQYWBhIawfB64jfuQRsRqTscxeCHrK9ET3CGVuoUD9oID5hyCUxRUKgFmHGnWHLL7hR0kXgZKZEM1U7xq92NeBLaAVwyY1qeG6OfMo3CTh4X2omBmPEh60PLUsVCMH46vX5Mj63Spb1I26eQTtXfHSkLjRmFga0MGQ7MvDcR//NaCfYu/VSoOEFQfLaol0iKEZ1EQbtCA0c5soRxLeytlA+YZhxtYAUbgjf/5UVSPy17Z2X39rxUucriyJNDckROiEcuSIXckCqpEU4eyTN5JW/Ok/PivDsfs9Kck/Xskz9wPn8AKAWVqw==</latexit> imperfect entries

[image:1.595.315.522.230.366.2]

<latexit sha1_base64="uEy9xlvD1dXmU2/rxG3w4dCph+Q=">AAACAXicbVDLSgMxFM3UV62vUTeCm2ARXJUZFXRZdOOygn1AO5RMeqcNzTxI7ohlqBt/xY0LRdz6F+78G9N2Ftp6IHA4594k5/iJFBod59sqLC2vrK4V10sbm1vbO/buXkPHqeJQ57GMVctnGqSIoI4CJbQSBSz0JTT94fXEb96D0iKO7nCUgBeyfiQCwRkaqWsfdBAeMBNhAioAjhQiVAL0uGuXnYozBV0kbk7KJEeta391ejFPQ3MBl0zrtusk6GVMoeASxqVOqiFhfMj60DY0YiFoL5smGNNjo/RoECtzIqRT9fdGxkKtR6FvJkOGAz3vTcT/vHaKwaWXiShJESI+eyhIJcWYTuqgPaFMajkyhHElzF8pHzDFOJrSSqYEdz7yImmcVtyzinN7Xq5e5XUUySE5IifEJRekSm5IjdQJJ4/kmbySN+vJerHerY/ZaMHKd/bJH1ifP4Lml48=</latexit>

Figure 1: Clean multimodal time series data (in shades of green) exhibits correlations across time and across modalities, leading to redundancy in low rank tensor representations. On the other hand, the presence of imperfect entries (in gray, blue, and red) breaks these correlations and leads to higher rank tensors. In these scenarios, we usetensor rank regularization to learn tensors that more accurately represent the true correla-tions and latent structures in multimodal data.

facial behaviors, and body postures (Mihalcea,

2012;Rossiter,2011). However, as much as more

modalities are required for improved performance,

we now face a challenge ofimperfectdata where

data might be 1) incomplete due to mismatched modalities or sensor failure, or 2) corrupted with random or structured noise. As a result, an im-portant research question involves learning robust representations from imperfect multimodal data.

Recent research in both unimodal and multi-modal learning has investigated the use of tensors for representation learning (Anandkumar et al.,

2014). Given representationsh1, ...,hM fromM

modalities, the order-M outer product tensorT =

h1⊗h2⊗...⊗hM is a natural representation for

all possible interactions between the modality

di-mensions (Liu et al., 2018). In this paper, we

(2)

learns a tensor representation that captures mul-timodal interactions across time. A key

observa-tion is that clean data exhibits tensors that are

low-ranksince high-dimensional real-world data is

of-ten generated from lower dimensional laof-tent

struc-tures (Lakshmanan et al., 2015). Furthermore,

clean multimodal time series data exhibits corre-lations across time and across modalities (Yang

et al., 2017; Hidaka and Yu, 2010). This leads

to redundancy in these overparametrized tensors

which explains their low rank (Figure1). On the

other hand, the presence of noise or incomplete values breaks these natural correlations and leads to higher rank tensor representations. As a

re-sult, we can usetensor rank minimizationto learn

tensors that more accurately represent the true correlations and latent structures in multimodal data, thereby alleviating imperfection in the input. With these insights, we show how to integrate ten-sor rank minimization as a simple regularizer for training in the presence of imperfect data. As com-pared to previous work on imperfect data (Sohn

et al., 2014;Srivastava and Salakhutdinov,2014;

Pham et al., 2019), our model does not need to

know which of the entries or modalities are imper-fect beforehand. Our model combines the strength of temporal non-linear transformations of multi-modal data with a simple regularization technique on tensor structures. We perform experiments on multimodal video data consisting of humans ex-pressing their opinions using a combination of lan-guage and nonverbal behaviors. Our results back up our intuitions that imperfect data increases ten-sor rank. Finally, we show that our model achieves good results across various levels of imperfection.

2 Related Work

Tensor Methods: Tensor representations have been used for learning discriminative representa-tions in unimodal and multimodal tasks. Tensors are powerful because they can capture important higher order interactions across time, feature di-mensions, and multiple modalities (Kossaifi et al., 2017). For unimodal tasks, tensors have been used for part-of-speech tagging (Srikumar and

Man-ning,2014), dependency parsing (Lei et al.,2014),

word segmentation (Pei et al., 2014), question

answering (Qiu and Huang, 2015), and machine

translation (Setiawan et al.,2015). For multimodal

tasks,Huang et al.(2017) used tensor products

be-tween images and text features for image caption-ing. A similar approach was proposed to learn

representations across text, visual, and acoustic features to infer speaker sentiment (Liu et al.,

2018;Zadeh et al.,2017). Other applications

in-clude multimodal machine translation (Delbrouck

and Dupont, 2017), audio-visual speech

recogni-tion (Zhang et al.,2017), and video semantic

anal-ysis (Wu et al.,2009;Gao et al.,2009).

Imperfect Data: In order to account for imper-fect data, several works have proposed generative

approaches for multimodal data (Sohn et al.,2014;

Srivastava and Salakhutdinov, 2014). Recently,

neural models such as cascaded residual

autoen-coders (Tran et al.,2017), deep adversarial

learn-ing (Cai et al., 2018), or translation-based

learn-ing (Pham et al.,2019) have also been proposed.

However, these methods often require knowing which of the entries or modalities are imperfect beforehand. While there has been some work on using low-rank tensor representations for

imper-fect data (Chang et al., 2017; Fan et al., 2017;

Chen et al.,2017;Long et al.,2018;Nimishakavi

et al., 2018), our approach is the first to

inte-grate rank minimization with neural networks for multimodal language data, thereby combining the strength of non-linear transformations with the mathematical foundations of tensor structures.

3 Proposed Method

In this section, we present our method for learning representations from imperfect human language across the language, visual, and acoustic modal-ities. In §3.1, we discuss some background on tensor ranks. We outline our method for learn-ing tensor representations via a model called Tem-poral Tensor Fusion Network (T2FN) in §3.2. In §3.3, we investigate the relationship between ten-sor rank and imperfect data. Finally, in §3.4, we show how to regularize our model using tensor rank minimization.

We use lowercase letters x ∈ R to denote

scalars, boldface lowercase letters x ∈ Rd to

de-note vectors, and boldface capital letters X ∈

Rd1×d2 to denote matrices. Tensors, which we

de-note by calligraphic lettersX, are generalizations

of matrices to multidimensional arrays. An

order-M tensor hasM dimensions,X ∈Rd1×...×dM. We

use⊗to denote outer product between vectors.

3.1 Background: Tensor Rank

(3)

h1

ℓ h2ℓ hTℓ

h1 a h2 a hT a . . . .. . h3 ℓ h3 a .. . h1 v h2 v h3 v hT v ...

[x1 ℓ, . . .xTℓ]

[

x

1,...a

x

T a]

[

x

1,...v

x

T v]

y Mt₌_ht

`⌦htv⌦hta

[image:3.595.73.292.59.279.2]

<latexit sha1_base64="c6ZUwHfEASXn8HdFB5sLByT5Xx8=">AAACPnicdVC7SgNBFJ2Nrxhfq5Y2g0GwCrsqaCMEbWyECOYB2RhmJ7PJkNkHM3cDYdkvs/Eb7CxtLBSxtXR2k8IkemDgzDn3cu89biS4Ast6MQpLyyura8X10sbm1vaOubvXUGEsKavTUISy5RLFBA9YHTgI1ookI74rWNMdXmd+c8Sk4mFwD+OIdXzSD7jHKQEtdc264xMYUCKS2/QB8CXO/66XDNJu4jAhMtUJgftMzXij/wyija5ZtipWDrxI7CkpoylqXfPZ6YU09lkAVBCl2rYVQSchEjgVLC05sWIRoUPSZ21NA6KndpL8/BQfaaWHvVDqFwDO1d8dCfGVGvuursz2VPNeJv7ltWPwLjoJD6IYWEAng7xYYAhxliXucckoiLEmhEqud8V0QCShoBMv6RDs+ZMXSeOkYp9WrLuzcvVqGkcRHaBDdIxsdI6q6AbVUB1R9Ihe0Tv6MJ6MN+PT+JqUFoxpzz6agfH9A0UUsZg=</latexit> M= T X t=1 Mt <latexit sha1_base64="a1jY5th1Xy+fYanMukUdv4ksWM4=">AAACEXicbVC7SgNBFJ2Nrxhfq5Y2g0FIFXZV0CYQtLERIuQF2U2YnUySIbMPZu4KYckv2PgrNhaK2NrZ+TfOJlvExAMXDufcy733eJHgCizrx8itrW9sbuW3Czu7e/sH5uFRU4WxpKxBQxHKtkcUEzxgDeAgWDuSjPieYC1vfJv6rUcmFQ+DOkwi5vpkGPABpwS01DNLjk9gRIlI7qe4gh0V+70EKva0W8cLVhd6ZtEqWzPgVWJnpIgy1Hrmt9MPaeyzAKggSnVsKwI3IRI4FWxacGLFIkLHZMg6mgbEZ8pNZh9N8ZlW+ngQSl0B4Jm6OJEQX6mJ7+nO9Ei17KXif14nhsG1m/AgioEFdL5oEAsMIU7jwX0uGQUx0YRQyfWtmI6IJBR0iAUdgr388ippnpfti7L1cFms3mRx5NEJOkUlZKMrVEV3qIYaiKIn9ILe0LvxbLwaH8bnvDVnZDPH6A+Mr18w0J0+</latexit> M <latexit sha1_base64="uZvzERQPKMMR9NiLGAcS/T5qG+0=">AAAB8nicbVBNS8NAFHypX7V+VT16WSyCp5KooMeiFy9CBWsLaSib7bZdutmE3RehhP4MLx4U8eqv8ea/cdPmoK0DC8PMe+y8CRMpDLrut1NaWV1b3yhvVra2d3b3qvsHjyZONeMtFstYd0JquBSKt1Cg5J1EcxqFkrfD8U3ut5+4NiJWDzhJeBDRoRIDwShaye9GFEeMyuxu2qvW3Lo7A1kmXkFqUKDZq351+zFLI66QSWqM77kJBhnVKJjk00o3NTyhbEyH3LdU0YibIJtFnpITq/TJINb2KSQz9fdGRiNjJlFoJ/OIZtHLxf88P8XBVZAJlaTIFZt/NEglwZjk95O+0JyhnFhCmRY2K2EjqilD21LFluAtnrxMHs/q3nndvb+oNa6LOspwBMdwCh5cQgNuoQktYBDDM7zCm4POi/PufMxHS06xcwh/4Hz+AIM0kWU=</latexit> MT <latexit sha1_base64="8QgwEIjtHbIpE/ny7x/TC5TA1HM=">AAAB9HicbVDLSgMxFL3xWeur6tJNsAiuyowKuiy6cSNU6AvasWTSTBuayYxJplCGfocbF4q49WPc+Tdm2llo64HA4Zx7uSfHjwXXxnG+0crq2vrGZmGruL2zu7dfOjhs6ihRlDVoJCLV9olmgkvWMNwI1o4VI6EvWMsf3WZ+a8yU5pGsm0nMvJAMJA84JcZKXjckZkiJSO+nj/VeqexUnBnwMnFzUoYctV7pq9uPaBIyaaggWndcJzZeSpThVLBpsZtoFhM6IgPWsVSSkGkvnYWe4lOr9HEQKfukwTP190ZKQq0noW8ns5B60cvE/7xOYoJrL+UyTgyTdH4oSAQ2Ec4awH2uGDViYgmhitusmA6JItTYnoq2BHfxy8ukeV5xLyrOw2W5epPXUYBjOIEzcOEKqnAHNWgAhSd4hld4Q2P0gt7Rx3x0BeU7R/AH6PMH5FGSKw==</latexit> M1 <latexit sha1_base64="kEkwVroTad49Yl6qxE3XJxZ+nes=">AAAB9HicbVDLSgMxFL2pr1pfVZdugkVwVWZU0GXRjRuhgn1AO5ZMmmlDM5kxyRTK0O9w40IRt36MO//GTDsLbT0QOJxzL/fk+LHg2jjONyqsrK6tbxQ3S1vbO7t75f2Dpo4SRVmDRiJSbZ9oJrhkDcONYO1YMRL6grX80U3mt8ZMaR7JBzOJmReSgeQBp8RYyeuGxAwpEend9NHtlStO1ZkBLxM3JxXIUe+Vv7r9iCYhk4YKonXHdWLjpUQZTgWblrqJZjGhIzJgHUslCZn20lnoKT6xSh8HkbJPGjxTf2+kJNR6Evp2MgupF71M/M/rJCa48lIu48QwSeeHgkRgE+GsAdznilEjJpYQqrjNiumQKEKN7alkS3AXv7xMmmdV97zq3F9Uatd5HUU4gmM4BRcuoQa3UIcGUHiCZ3iFNzRGL+gdfcxHCyjfOYQ/QJ8/r0WSCA==</latexit> M2 <latexit sha1_base64="zcVGz7+t4ZIbw5VqiTiAzOeOQvE=">AAAB9HicbVDLSgMxFL3js9ZX1aWbYBFclZkq6LLoxo1QwT6gHUsmzbShmWRMMoUy9DvcuFDErR/jzr8x085CWw8EDufcyz05QcyZNq777aysrq1vbBa2its7u3v7pYPDppaJIrRBJJeqHWBNORO0YZjhtB0riqOA01Ywusn81pgqzaR4MJOY+hEeCBYygo2V/G6EzZBgnt5NH6u9UtmtuDOgZeLlpAw56r3SV7cvSRJRYQjHWnc8NzZ+ipVhhNNpsZtoGmMywgPasVTgiGo/nYWeolOr9FEolX3CoJn6eyPFkdaTKLCTWUi96GXif14nMeGVnzIRJ4YKMj8UJhwZibIGUJ8pSgyfWIKJYjYrIkOsMDG2p6ItwVv88jJpViveecW9vyjXrvM6CnAMJ3AGHlxCDW6hDg0g8ATP8Apvzth5cd6dj/noipPvHMEfOJ8/sMmSCQ==</latexit> M3 <latexit sha1_base64="UTWzPLBL6vIDwYDXR4ODZOrB8sY=">AAAB9HicbVDLSgMxFL1TX7W+qi7dBIvgqsyooMuiGzdCBfuAdiyZNNOGZpIxyRTK0O9w40IRt36MO//GTDsLbT0QOJxzL/fkBDFn2rjut1NYWV1b3yhulra2d3b3yvsHTS0TRWiDSC5VO8CaciZowzDDaTtWFEcBp61gdJP5rTFVmknxYCYx9SM8ECxkBBsr+d0ImyHBPL2bPp73yhW36s6AlomXkwrkqPfKX92+JElEhSEca93x3Nj4KVaGEU6npW6iaYzJCA9ox1KBI6r9dBZ6ik6s0kehVPYJg2bq740UR1pPosBOZiH1opeJ/3mdxIRXfspEnBgqyPxQmHBkJMoaQH2mKDF8YgkmitmsiAyxwsTYnkq2BG/xy8ukeVb1zqvu/UWldp3XUYQjOIZT8OASanALdWgAgSd4hld4c8bOi/PufMxHC06+cwh/4Hz+ALJNkgo=</latexit> language LSTM <latexit sha1_base64="4wOX6NYqLG9nll+8bFLf3vGko/o=">AAAB/XicbVDLSsNAFJ3UV62v+Ni5GSyCq5KooMuiGxcKFfuCNpTJdNIOnUzCzI1YQ/FX3LhQxK3/4c6/cdpmoa0HLhzOuZd77/FjwTU4zreVW1hcWl7JrxbW1jc2t+ztnbqOEkVZjUYiUk2faCa4ZDXgIFgzVoyEvmANf3A59hv3TGkeySoMY+aFpCd5wCkBI3XsvTawB0gFkb2E9Bi+vqvejDp20Sk5E+B54makiDJUOvZXuxvRJGQSqCBat1wnBi8lCjgVbFRoJ5rFhA7MhpahkoRMe+nk+hE+NEoXB5EyJQFP1N8TKQm1Hoa+6QwJ9PWsNxb/81oJBOdeymWcAJN0uihIBIYIj6PAXa4YBTE0hFDFza2Y9okiFExgBROCO/vyPKkfl9yTknN7WixfZHHk0T46QEfIRWeojK5QBdUQRY/oGb2iN+vJerHerY9pa87KZnbRH1ifP3B7lTI=</latexit> v is u al LS T M <latexit sha1_base64="PkdTAP5iIoRE3tOjUNsjVhu5mek=">AAAB+3icbVBNS8NAEN3Ur1q/Yj16WSyCp5KooMeiFw8KFfsFbSib7bZdutmE3UlpCfkrXjwo4tU/4s1/47bNQVsfDDzem2Fmnh8JrsFxvq3c2vrG5lZ+u7Czu7d/YB8WGzqMFWV1GopQtXyimeCS1YGDYK1IMRL4gjX90e3Mb46Z0jyUNZhGzAvIQPI+pwSM1LWLHWATSMZcx0Tg+6faQ9q1S07ZmQOvEjcjJZSh2rW/Or2QxgGTQAXRuu06EXgJUcCpYGmhE2sWEToiA9Y2VJKAaS+Z357iU6P0cD9UpiTgufp7IiGB1tPAN50BgaFe9mbif147hv61l3AZxcAkXSzqxwJDiGdB4B5XjIKYGkKo4uZWTIdEEQomroIJwV1+eZU0zsvuRdl5vCxVbrI48ugYnaAz5KIrVEF3qIrqiKIJekav6M1KrRfr3fpYtOasbOYI/YH1+QMNGpRu</latexit> ac ou st ic LS T M <latexit sha1_base64="x1Fk1eUX+Il9eXi9bofgql+EJEs=">AAAB/XicbVDLSgMxFM3UV62v8bFzEyyCqzKjgi6LblwoVOwL2qFk0kwbmskMyR2xDsVfceNCEbf+hzv/xkzbhbYeuHA4597k3uPHgmtwnG8rt7C4tLySXy2srW9sbtnbO3UdJYqyGo1EpJo+0UxwyWrAQbBmrBgJfcEa/uAy8xv3TGkeySoMY+aFpCd5wCkBI3XsvTawB0gJjRINnOLru+rNqGMXnZIzBp4n7pQU0RSVjv3V7kY0CZkEKojWLdeJwUuJMk8KNiq0E81iQgekx1qGShIy7aXj7Uf40ChdHETKlAQ8Vn9PpCTUehj6pjMk0NezXib+57USCM69lMs4ASbp5KMgERginEWBu1wxCmJoCKGKZ+fTPlGEggmsYEJwZ0+eJ/XjkntScm5Pi+WLaRx5tI8O0BFy0RkqoytUQTVE0SN6Rq/ozXqyXqx362PSmrOmM7voD6zPH5P/lUk=</latexit>

Figure 2: The Temporal Tensor Fusion Network (T2FN) creates a tensorMfrom temporal data. The rank ofMincreases with imperfection in data so we regularize our model by minimizing an upper bound on the rank ofM.

vectors have lower rank, while complex tensors have higher rank. To be more precise, we de-fine the rank of a tensor using Canonical Polyadic

(CP)-decomposition (Carroll and Chang, 1970).

For an order-MtensorX ∈Rd1×...×dM, there exists

an exact decomposition into vectorsw:

X =∑r

i=1

M

⊗

m=1

w_mi . (1)

The minimalr for exact decomposition is called

therankof the tensor. The vectors{{w_mi }M_m=1}

r i=1

are called the rankrdecomposition factors ofX.

3.2 Multimodal Tensor Representations

Our model for creating tensor representations is called the Temporal Tensor Fusion Network

(T2FN), which extends the model inZadeh et al.

(2017) to include a temporal component. We

show that T2FN increases the capacity of TFN to capture high-rank tensor representations, which itself leads to improved prediction performance. More importantly, our knowledge about tensor rank properties allows us to regularize our model effectively for imperfect data.

We begin with time series data from the lan-guage, visual and acoustic modalities, denoted

as [x1

`, ...,xT`], [x1v, ...,xTv], and [x1a, ...,xTa]

re-spectively. We first use Long Short-term Mem-ory (LSTM) networks (Hochreiter and

Schmid-huber, 1997) to encode the temporal information

from each modality, resulting in a sequence of

hid-den representations[h1_`, ...,hT_`],[h1_v, ...,hT_v], and

[h1

a, ...,hTa]. Similar to prior work which found

tensor representations to capture higher-order

in-teractions from multimodal data (Liu et al.,2018;

Zadeh et al., 2017; Fukui et al., 2016), we form

tensors via outer products of the individual

repre-sentations through time (as shown in Figure2):

M =∑T

t=1

[ht`

1] ⊗ [

ht v

1] ⊗ [

ht a

1] (2)

where we append a 1 so that unimodal, bimodal, and trimodal interactions are all captured as

de-scribed in Zadeh et al.(2017). M is our

multi-modal representation which can then be used to

predict the labely using a fully connected layer.

Observe how the construction of M closely

re-sembles equation (1) as the sum of vector outer products. As compared to TFN which uses a sin-gle outer product to obtain a multimodal tensor of rank one, T2FN creates a tensor of high rank

(up-per bounded byT). As a result, the notion ofrank

naturally emerges when we reason about the

prop-erties ofM.

3.3 How Does Imperfection Affect Rank?

We first state several observations about the rank

of multimodal representationM:

1)rnoisy: The rank ofMis maximized when data

entries are sampled from i.i.d. noise (e.g. Gaus-sian distributions). This is because this setting leads to no redundancy at all between the feature dimensions across time steps.

2) rclean < rnoisy: Clean real-world data is

of-ten generated from lower dimensional laof-tent

struc-tures (Lakshmanan et al., 2015). Furthermore,

multimodal time series data exhibits correlations across time and across modalities (Yang et al.,

2017; Hidaka and Yu, 2010). This redundancy

leads to low-rank tensor representations.

3)rclean <rimperf ect < rnoisy: If the data is

im-perfect, the presence of noise or incomplete val-ues breaks these natural correlations and leads to higher rank tensor representations.

These intuitions are also backed up by several experimental results which are presented in §4.2.

3.4 Tensor Rank Regularization

Given our intuitions above, it would then seem natural to augment the discriminative objective

(4)

In practice, the rank of an order-Mtensor is

com-puted using the nuclear norm∥X ∥∗ which is

de-fined as (Friedland and Lim,2014),

∥X ∥∗=inf{

r

∑

i=1

∣λi∣ ∶ X =

r

∑

i=1

λi(

M

⊗

m=1

wim),∥w_mi ∥ =1, r∈_N}.

(3)

When M = 2, this reduces to the matrix nuclear

norm (sum of singular values). However, com-puting the rank of a tensor or its nuclear norm is

NP-hard for tensors of order ≥ 3 (Friedland and

Lim, 2014). Fortunately, there exist efficiently

computable upper bounds on the nuclear norm and minimizing these upper bounds would also

mini-mize the nuclear norm∥M∥∗. We choose the

up-per bound as presented inHu(2014), which upper

bounds the nuclear norm with the tensor Frobenius norm scaled by the tensor dimensions:

∥M∥∗≤

¿ Á Á

À ∏M

i=1di

max{d1, ..., dM}

∥M∥F, (4)

where the Frobenius norm∥M∥F is defined as the

sum of squared entries inMwhich is easily

com-putable and convex. Since ∥M∥F is easily

com-putable and convex, including this term adds neg-ligible computational cost to the model. We will use this upper bound as a surrogate for the nu-clear norm in our objective function. Our objec-tive function is therefore a weighted combination of the prediction loss and the tensor rank regular-izer in equation (4).

4 Experiments

Our experiments are designed with two research questions in mind: 1) What is the effect of various levels of imperfect data on tensor rank in T2FN? 2) Does T2FN with rank regularization perform well on prediction with imperfect data? We

an-swer these questions in §4.2and §4.3respectively.

4.1 Datasets

We experiment with real video data consisting of humans expressing their opinions using a com-bination of language and nonverbal behaviors. We use the CMU-MOSI dataset which contains 2199 videos annotated for sentiment in the range

[−3,+3] (Zadeh et al., 2016). CMU-MOSI and

related multimodal language datasets have been

studied in the NLP community (Gu et al., 2018;

Liu et al., 2018; Liang et al., 2018) from fully

supervised settings but not from the perspective of supervised learning with imperfect data. We

use 52 segments for training, 10 for validation and 31 for testing. GloVe word embeddings

(Pen-nington et al.,2014), Facet (iMotions,2017), and

COVAREP (Degottex et al., 2014) features are

extracted for the language, visual and acoustic modalities respectively. Forced alignment is

per-formed using P2FA (Yuan and Liberman,2008) to

align visual and acoustic features to each word, re-sulting in a multimodal sequence. Our data splits, features, alignment, and preprocessing steps are consistent with prior work on the CMU-MOSI

dataset (Liu et al.,2018).

4.2 Rank Analysis

We first study the effect of imperfect data on the

rank of tensor M. We introduce the following

types of noises parametrized bynoise level=

[0.0,0.1, ...,1.0]. Higher noise levels implies

more imperfection: 1) clean: no imperfection,

2) random drop: each entry is dropped

inde-pendently with probability p ∈ noise level,

and 3) structured drop: independently for each

modality, each time step is chosen with probabil-ity p ∈ noise level. If a time step is cho-sen, all feature dimensions at that time step are dropped. For all imperfect settings, features are dropped during both training and testing.

We would like to show how the tensor ranks vary under different imperfection settings. How-ever, as is mentioned above, determining the exact rank of a tensor is an NP-hard problem (Friedland

and Lim,2014). In order to analyze the effect of

imperfections on tensor rank, we perform CP de-composition (equation (5)) on the tensor

represen-tations under different rank settingsrand measure

the reconstruction error,

=min

wi m

∥(∑r

i=1

M

⊗

m=1

wi_m) − X ∥ F

. (5)

Given the true rank r∗_, _{will be high at ranks}

r<r∗_{, while}_{will be approximately zero at ranks}

r≥r∗

(for example, a rank3tensor would display

a large reconstruction error with CP

decomposi-tion at rank 1, but would show almost zero error

with CP decomposition at rank3). By analyzing

the effect ofr on, we are then able to derive a

surrogater˜to the true rankr∗_.

Using this approach, we experimented on CMU-MOSI and the results are shown in

Fig-ure3(a). We observe that imperfection leads to an

(5)

1 2 3 4 5 6 7 8 9 1011121314151617181920212223 Decomposition rank

0.0 0.1 0.2 0.3 0.4 0.5

Decomposition loss

clean random drop structured drop

(a) CP decomposition error ofMunder

random and structured dropping of fea-tures. Imperfect data leads to an increase in decomposition error and an increase in (approximate) tensor rank.

(b) Sentiment classification accuracy

un-der random drop (i.e. dropping

en-tries randomly with probability p ∈

noise level). T2FN with rank reg-ularization (green) performs well.

(c) Sentiment classification accuracy un-der structured drop (dropping entire time

steps randomly with probability p ∈

noise level). T2FN with rank reg-ularization (green) performs well.

Figure 3: (a) Effect of imperfect data on tensor rank. (b) and (c): CMU-MOSI test accuracy under imperfect data.

by reconstruction error (the graph shifts outwards and to the right), supporting our hypothesis that imperfect data increases tensor rank (§3.3).

4.3 Prediction Results

Our next experiment tests the ability of our model to learn robust representations despite data

im-perfections. We use the tensor M for

predic-tion and report binary classificapredic-tion accuracy on

CMU-MOSI test set. We compare to several

baselines: Early Fusion (EF)-LSTM, Late Fusion

(LF)-LSTM, TFN, and T2FNwithout rank

regu-larization. These results are shown in Figure3(b)

for random drop and Figure 3(c) for structured

drop. T2FN with rank regularization maintains

good performance despite imperfections in data. We also observe that our model’s improvement is more significant on random drop settings, which results in a higher tensor rank as compared to

structured drop settings (from Figure3(a)). This

supports our hypothesis that our model learns ro-bust representations when imperfections that in-crease tensor rank are introduced. On the other hand, the existing baselines suffer in the presence of imperfect data.

5 Discussion and Future Work

We acknowledge that there are other alternative methods to upper bound the true rank of a

ten-sor (Alexeev et al., 2011; Atkinson and Lloyd,

1980;Ballico,2014). From a theoretical

perspec-tive, there exists a trade-off between the cost of computation and the tightness of approximation. In addition, the tensor rank can (far) exceed the maximum dimension, and a low-rank approxima-tion for tensors may not even exist (de Silva and

Lim, 2008). While our tensor rank

regulariza-tion method seems to work well empirically, there

is definitely room for a more thorough theoreti-cal analysis of constructing and regularizing ten-sor representations for multimodal learning.

6 Conclusion

This paper presented a regularization method

based on tensor rank minimization. We observe

that clean multimodal sequences often exhibit cor-relations across time and modalities which leads to low-rank tensors, while the presence of imper-fect data breaks these correlations and results in tensors of higher rank. We designed a model, the Temporal Tensor Fusion Network, to learn such tensor representations and effectively regularize their rank. Experiments on multimodal language data show that our model achieves good results across various levels of imperfections. We hope to inspire future work on regularizing tensor rep-resentations of multimodal data for robust predic-tion in the presence of imperfect data.

Acknowledgements

PPL, ZL, and LM are partially supported by the National Science Foundation (Award #1750439 and #1722822) and Samsung. Any opinions, find-ings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Samsung and NSF, and no official endorsement should be

in-ferred. YHT and RS are supported in part by

DARPA HR00111990016, AFRL FA8750-18-C-0014, NSF IIS1763562, Apple, and Google

fo-cused award. QZ is supported by JSPS

[image:5.595.81.520.65.235.2]

(6)

References

Boris Alexeev, Michael A. Forbes, and Jacob Tsimer-man. 2011. Tensor rank: Some lower and upper bounds.CoRR, abs/1102.0072.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Ten-sor decompositions for learning latent variable

mod-els.J. Mach. Learn. Res., 15(1):2773–2832.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-garet Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question An-swering. InInternational Conference on Computer

Vision (ICCV).

M.D. Atkinson and S. Lloyd. 1980. Bounds on the ranks of some 3-tensors.Linear Algebra and its

Ap-plications, 31:19 – 31.

E. Ballico. 2014. An upper bound for the real tensor rank and the real symmetric tensor rank in terms of the complex ranks. Linear and Multilinear Algebra, 62(11):1546–1552.

Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. 2018. Deep adversarial learning for multi-modality missing data comple-tion. InKDD ’18, pages 1158–1166.

J. Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scal-ing via an n-way generalization of “eckart-young” decomposition.Psychometrika, 35(3):283–319.

Yi Chang, Luxin Yan, Houzhang Fang, Sheng Zhong, and Zhijun Zhang. 2017. Weighted low-rank tensor recovery for hyperspectral image restoration.CoRR, abs/1709.00192.

Xiai Chen, Zhi Han, Yao Wang, Qian Zhao, Deyu Meng, Lin Lin, and Yandong Tang. 2017. A general model for robust tensor factorization with unknown noise.CoRR, abs/1705.06755.

Abhishek Das, Samyak Datta, Georgia Gkioxari, Ste-fan Lee, Devi Parikh, and Dhruv Batra. 2018. Em-bodied Question Answering. InProceedings of the IEEE Conference on Computer Vision and Pattern

Recognition (CVPR).

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos´e M.F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In

Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition (CVPR).

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. Covarep - a col-laborative voice analysis repository for speech tech-nologies. InICASSP. IEEE.

Jean-Benoit Delbrouck and St´ephane Dupont. 2017. Multimodal compact bilinear pooling for mul-timodal neural machine translation. CoRR, abs/1703.08084.

Haiyan Fan, Yunjin Chen, Yulan Guo, Hongyan Zhang, and Gangyao Kuang. 2017. Hyperspectral image restoration using low-rank tensor recovery. IEEE Journal of Selected Topics in Applied Earth

Obser-vations and Remote Sensing, PP:1–16.

Shmuel Friedland and Lek-Heng Lim. 2014. Compu-tational complexity of tensor nuclear norm. CoRR, abs/1410.6072.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding.

arXiv preprint arXiv:1606.01847.

Xinbo Gao, Yimin Yang, Dacheng Tao, and Xuelong Li. 2009. Discriminative optical flow tensor for video semantic analysis. Computer Vision and

Im-age Understanding, 113(3):372 – 383. Special Issue

on Video Analysis.

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018. Multimodal af-fective analysis using hierarchical attention strategy with word-level alignment. InACL.

Shohei Hidaka and Chen Yu. 2010. Analyzing multi-modal time series as dynamical systems. In Interna-tional Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal

In-teraction, ICMI-MLMI ’10, pages 53:1–53:8, New

York, NY, USA. ACM.

Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Shenglong Hu. 2014. Relations of the Nuclear Norms of a Tensor and its Matrix Flattenings. arXiv

e-prints, page arXiv:1412.2443.

Qiuyuan Huang, Paul Smolensky, Xiaodong He, Li Deng, and Dapeng Oliver Wu. 2017. Tensor product generation networks. CoRR, abs/1709.09118.

Soshi Iba, Christiaan J. J. Paredis, and Pradeep K. Khosla. 2005. Interactive multimodal robot pro-gramming. The International Journal of Robotics

Research, 24(1):83–104.

iMotions. 2017.Facial expression analysis.

Jean Kossaifi, Zachary C. Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. 2017. Tensor regression networks. CoRR, abs/1707.08308.

Karthik Lakshmanan, Patrick T. Sadtler, Elizabeth C. Tyler-Kabara, Aaron P. Batista, and Byron M. Yu. 2015. Extracting low-dimensional latent structure from time series in the presence of delays. Neural

(7)

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scor-ing dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for

Compu-tational Linguistics (Volume 1: Long Papers),

vol-ume 1, pages 1381–1391.

Paul Pu Liang, Ziyin Liu, Amir Zadeh, and Louis-Philippe Morency. 2018. Multimodal language analysis with recurrent multistage fusion. EMNLP.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshmi-narasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific fac-tors. InACL.

Zhen Long, Yipeng Liu, Longxi Chen, and Ce Zhu. 2018. Low rank tensor completion for multiway vi-sual data.CoRR, abs/1805.03967.

Rada Mihalcea. 2012. Multimodal sentiment analy-sis. InProceedings of the 3rd Workshop in Com-putational Approaches to Subjectivity and Sentiment

Analysis, WASSA ’12, pages 1–1, Stroudsburg, PA,

USA. Association for Computational Linguistics.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analy-sis: Harvesting opinions from the web. In Proceed-ings of the 13th international conference on

multi-modal interfaces, pages 169–176. ACM.

Madhav Nimishakavi, Pratik Kumar Jawanpuria, and Bamdev Mishra. 2018. A dual framework for low-rank tensor completion. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information

Processing Systems 31, pages 5484–5495. Curran

Associates, Inc.

Shruti Palaskar, Ramon Sanabria, and Florian Metze. 2018. End-to-end multimodal speech recognition.

In Proceedings of the IEEE International

Confer-ence on Acoustics, Speech, and Signal Processing

(ICASSP).

Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-margin tensor neural network for chinese word seg-mentation. InProceedings of the 52nd Annual Meet-ing of the Association for Computational LMeet-inguistics

(Volume 1: Long Papers), pages 293–303.

Associa-tion for ComputaAssocia-tional Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. InEMNLP.

Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabas Poczos. 2019. Found in translation: Learning robust joint repre-sentations by cyclic translations between modalities.

AAAI.

Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of the 24th International Conference on Artificial

Intelli-gence, IJCAI’15, pages 1305–1311. AAAI Press.

James Rossiter. 2011. Multimodal intent recognition for natural human-robotic interaction.

Alexander I. Rudnicky. 2005. Multimodal Dialogue Systems, pages 3–11. Springer Netherlands, Dor-drecht.

Edward Schmerling, Karen Leung, Wolf Vollprecht, and Marco Pavone. 2017. Multimodal probabilistic model-based planning for human-robot interaction.

CoRR, abs/1710.09483.

Hendra Setiawan, Zhongqiang Huang, Jacob Devlin, Thomas Lamar, Rabih Zbib, Richard Schwartz, and John Makhoul. 2015. Statistical machine translation features with multitask tensor networks. In Proceed-ings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Interna-tional Joint Conference on Natural Language

Pro-cessing (Volume 1: Long Papers), pages 31–41.

As-sociation for Computational Linguistics.

Vin de Silva and Lek-Heng Lim. 2008.Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl., 30(3):1084– 1127.

Kihyuk Sohn, Wenling Shang, and Honglak Lee. 2014. Improved multimodal deep learning with variation of information. InNIPS.

Vivek Srikumar and Christopher D Manning. 2014. Learning distributed representations for structured output prediction. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors,Advances in Neural Information Processing

Systems 27, pages 3266–3274. Curran Associates,

Inc.

Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal learning with deep boltzmann ma-chines. JMLR, 15.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelha-gen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Movieqa: Understanding stories in movies through question-answering. CoRR, abs/1512.02902.

Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. 2017. Missing modalities imputation via cascaded residual autoencoder. InCVPR.

F. Wu, Y. Liu, and Y. Zhuang. 2009. Tensor-based transductive learning for multimodality video se-mantic concept detection. IEEE Transactions on

(8)

Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. 2017. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In

The IEEE Conference on Computer Vision and

Pat-tern Recognition (CVPR).

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the scotus corpus. Journal of the

Acoustical Society of America.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Ten-sor fusion network for multimodal sentiment analy-sis. InEMNLP, pages 1114–1125.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. Mosi: Multimodal cor-pus of sentiment intensity and subjectivity anal-ysis in online opinion videos. arXiv preprint

arXiv:1606.06259.

Qingchen Zhang, Laurence T. Yang, Xingang Liu, Zhikui Chen, and Peng Li. 2017. A tucker deep computation model for mobile multimedia feature learning. ACM Trans. Multimedia Comput.