Multi-Scale 3D Convolution Network for Video Based Person Re-Identification

(1)

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Multi-Scale 3D Convolution Network for Video Based Person Re-Identification

Jianing Li, Shiliang Zhang, Tiejun Huang

School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

{ljn-vmc, slzhang.jdl, tjhuang}@pku.edu.cn

Abstract

This paper proposes a two-stream convolution network to extract spatial and temporal cues for video based person Re-Identification (ReID). A temporal stream in this network is constructed by inserting several Multi-scale 3D (M3D) con-volution layers into a 2D CNN network. The resulting M3D convolution network introduces a fraction of parameters into the 2D CNN, but gains the ability of multi-scale temporal feature learning. With this compact architecture, M3D convo-lution network is also more efficient and easier to optimize than existing 3D convolution networks. The temporal stream further involves Residual Attention Layers (RAL) to refine the temporal features. By jointly learning spatial-temporal atten-tion masks in a residual manner, RAL identifies the discrim-inative spatial regions and temporal cues. The other stream in our network is implemented with a 2D CNN for spatial feature extraction. The spatial and temporal features from two streams are finally fused for the video based person ReID. Evaluations on three widely used benchmarks datasets,i.e.,

MARS,PRID2011, andiLIDS-VIDdemonstrate the substan-tial advantages of our method over existing 3D convolution networks and state-of-art methods.

Introduction

Current researches on person Re-Identification (ReID) mainly focus on two lines of tasks depending on still images and video sequences, respectively. Recent years have witnessed the impressive progresses in image based person ReID,

e.g., deep visual representations have significantly boosted the ReID performance on image based ReID datasets (Li, Zhu, and Gong 2018b; Xu et al. 2018; Liu et al. 2018b; Su et al. 2016; 2015). Being able to explore plenty of spa-tial and temporal cues, video based person ReID has better potentials to address some challenges in image based per-son ReID. Fig. 1 shows several sampled frames from perper-son tracklets. As shown in Fig. 1(a), solely relying on visual cues is hard to identify those two persons wearing visually similar clothes. However, they can be easily distinguished by gait cues. Meanwhile, video based person ReID could also lever-age the latest progresses in imlever-age based person ReID. The two persons in Fig. 1(b) show similar gait cues, but can be easily distinguished by their spatial and appearance cues. It is

960 1430

100

time

(a)

(b)

(a)

(b)

time

160

time/s

(a) (b)

0 0.5 1 1.5 2 0 0.5 1 1.5 2

time/s

(a) (b)

0 0.5 1 1.5 2 0 0.5 1 1.5 2

Figure 1: Illustration of video frames sampled from person tracklets. (a) shows two persons with similar appearance but different gaits; (b) shows two persons with similar gait but totally different appearance.

easier to infer that, extracting and fusing spatial and temporal cues is important for video based person ReID.

Existing studies on video based person ReID have signifi-cantly boosted the performance on existing datasets. Those works can be summarized into two categories,i.e., 1) ex-tracting frame-level features and generating video feature thorough pooling or weight learning (Liu et al. 2017a; Zhou et al. 2017; Li et al. 2018), and 2) extracting frame-level features then applying the Recurrent Neural Net-works (RNN) to generate video features (Yan et al. 2016; McLaughlin et al. 2016). Both those two categories of meth-ods first treat each video frame independently. The feature ge-nearted by pooling strategy are generally not affected by the order of video frames. The RNN only builds temporal connec-tions on high-level features, hence is not capable to capture the temporal cues on image local details. Therefore, more effective way of acquiring spatial-temporal feature should still be investigated.

(2)

single 3D convolution kernel can only cover short temporal cues, researcher usually stack several 3D kernels together to gain the stronger temporal cue learning ability. Although showing better performance, stacked 3D convolutions results in substantial growth of parameters,e.g., the widely used C3D (Tran et al. 2015) network reaches the model size of 321MB with only 8 3D convolution layers, almost 3 times to the 95.7MB parameters of ResNet50 (He et al. 2016). Too many parameters not only make 3D CNNs computationally expensive, but also leads to the difficulty in model training and optimization. This makes 3D CNN not readily applica-ble on video based person ReID, where the training set is commonly small and person ID annotation is expensive.

This work aims to explore the rich temporal cues for per-son ReID through applying 3D convolution, while mitigating the shortcomings in existing 3D CNN models. A Multi-scale 3D (M3D) convolution layer is proposed as a more efficient and compact alternatives to traditional 3D CNN layer. M3D layer is implemented using several parallel temporal convo-lution kernels with different temporal ranges. Several M3D layers are inserted into a 2D CNN architecture. The resulting M3D convolution network (M3D CNN) introduces marginal parameters to the 2D CNN, but gains the multi-scale tempo-ral cues modeling ability. Compared with existing 3D CNNs, M3D CNN is more compact and easier to train. To further refine the learned temporal cues by M3D convolution layer, a Residual Attention Layer (RAL) is proposed to jointly learn spatial and temporal attention masks. With RAL, more impor-tant spatial and temporal cues can be kept and the noises can be depressed, enabling M3D CNN to extract discriminative temporal feature.

We further introduce a 2D CNN to learn and extract the spatial and appearance features from video sequences. This 2D CNN and the M3D CNN compose a two-stream CNN ar-chitecture, where the extracted spatial and temporal features are fused for video based person ReID. Extensive experi-ments demonstrate that our method outperforms a wide range of state-of-art methods on three widely used benchmarks datasets,i.e.,MARS(Zheng et al. 2016),PRID2011(Hirzer et al. 2011) andiLIDS-VID(Wang and Zhao 2014). Moreover, we achieve a reasonable trade-off between ReID accuracy and model size. Introducing only about 4MB parameter over-head to the 2D CNN, M3D CNN boosts the mAP of 2D

CNN from 0.625 to 0.699 onMARS. The 3D CNN model

I3D (Carreira and Zisserman 2017) achieves mAP of 0.628 with 186MB parameters. Compared with I3D, M3D performs better and saves about 86MB of parameters, thus could be a better temporal feature learning model for video based person ReID.

The contribution of this work can be summarized into two aspects. 1) we propose a M3D convolution layer as a more compact and efficient alternative to 3D CNN layer. M3D layer makes multi-scale temporal feature learning with a compact neural network possible. To our best knowledge, this is the first attempt of introduce 3D convolution in person ReID. 2) we further propose the RAL to learn spatial-temporal attention masks, and use a 2D CNN to extract complementary spatial and appearance features. Those components further boost the video based person ReID performance.

1×1 conv

3×3 conv

1×1 conv

(a) 2D Residual Block

1×1×1 conv

3×3×3 conv

1×1×1 conv

(b) I3D

1×1×1 conv

1×3×3 conv

1×1×1 conv 3×1×1 conv

(c) P3D-A

1×1×1 conv

1×1×1 conv 1×3×3 conv 3× 1 ×1 conv

(d) P3D-B

1×1×1 conv

1×3×3 conv

1×1×1 conv

3× 1 ×1 conv

(e) P3D-C

(a) 2D Residual Block 1×1 conv

3×3 conv

1×1 conv

(b) I3D 1×1×1 conv

3×3×3 conv

1×1×1 conv

(c) P3D-A 1×1×1 conv

1×3×3 conv

3×1×1 conv

1×1×1 conv

(e) P3D-C 1×1×1 conv

1×3×3 conv

3×1×1 conv

1×1×1 conv

(d) P3D-B 1×1×1 conv

1×3×3 conv 3×1×1 conv

1×1×1 conv

Figure 2: Some widely used convolution layer in video tasks, (a) 2D residual block; (b) I3D, which inflates 2D kernels to the 3D version; (c-e) three versions of P3D, which factorizes the 3D kernels into separate spatial and temporal ones.

Related work

Existing person ReID works can be summarized into image based person ReID and video based person ReID, respec-tively. Most image based person ReID works focus on two approaches: 1) learning discriminative image features (Wang and Zhao 2014; Su et al. 2017; Wei et al. 2017) and 2) learning discriminative distance metrics for feature match-ing (Pedagadi et al. 2013; Xiong et al. 2014). Impressive progresses have been made on image person ReID in recent years.

Many works regard video based person ReID as an exten-sion of the image based one. For instance, 3D-SIFT (Scov-anner, Ali, and Shah 2007) and HOG3D (Klaser, Marszałek, and Schmid ), design hand-crafted methods to extract spa-tiotemporal cues, but present limited robustness when com-pared with deep features (Zheng et al. 2016). Some other works first extract image features from still frames, then accumulate frame features as video features. A previous work (Zheng et al. 2016) applies pooling for video fea-ture generation. (McLaughlin et al. 2016) apply RNN to model temporal cues cross frames. (Li et al. 2018) utilize part cues and learn a weighting strategy to fuse the features ex-tract from still frames. Unsupervised learning is also widely applied in video person ReID (Li, Zhu, and Gong 2018a; Ye, Lan, and Yuen 2018; Wu et al. 2018). Most of those works extract frame features independently and ignore the temporal cues among adjacent frames.

(3)

first one apply two-stream network to learn the spatial and temporal features, respectively, then fuse those features (Si-monyan and Zisserman 2014; Feichtenhofer, Pinz, and Zis-serman 2016; Feichtenhofer, Pinz, and Wildes 2017). Most of those work use still image and stacked optical flow as inputs for the two streams, respectively. The second strategy utilizes 3D CNNs to jointly explore spatiotemporal cues (Tran et al. 2015; Carreira and Zisserman 2017; Qiu, Yao, and Mei 2017; Liu et al. 2018a). Fig. 2(b-e) show 4 types of commonly used 3D CNN layer. As shown by the above works, extra temporal cues boost the performance of video tasks. However, optical flow is sensitive to the spatial misalignment between adjacent frames, which commonly exist in person ReID datasets. 3D CNNs need to stack a certain number of 3D CNN kernels to capture the long-term temporal cues. This introduces a large number of parameters and increases the difficult of 3D CNN optimization.

Our method also fuses the spatial and temporal features extracted from a two stream network. Different from previous works using stacked optical flow as input, our method directly extracts temporal feature from video sequence, hence would be more robust to the misalignment error between adjacent frames. Compared with traditional 3D CNN, our proposed M3D CNN presents better temporal cue learning ability with a more compact architecture. Those differences highlight our contribution to video based person ReID.

Two-stream M3D Convolution Network

Problem Formulation

Person ReID aims to identify a specific person from a large scale database, which can be implemented as a retrieval task. Given a query video sequenceQ = (s1, s2, ...sT), where T is the sequence length andst_{is the}_t_{-th frame at time}_t_. Video based person ReID can be tackled by ranking gallery sequences based on the video representationf, and a distance metricDcomputed betweenQand each gallery sequence. In the returned rank list, sequences containing the identical person with queryQare expected to appear on top of the list. Therefore, learning discriminative video representation f and designing the distance metricDare two critical steps for video based person ReID.

This work focuses on designing a discriminative video representation. As illustrated in Fig. 1, both the spatial and temporal cues embedded in video sequences could be impor-tant for identifying a specific person. Because the spatial and temporal cues are complementary with each other, we extract them with two modules. The video representationfstcan be formulated as

fst= [fs, ft] (1)

where fs andftdenote the spatial and temporal features, respectively, and[,]denotes feature concatenation.

Existing image based person ReID works have proposed many successful methods for spatial feature extraction. As a mainstream method, 2D CNN is commonly adopted by those works. We hence refer to existing works and utilize 2D CNN to extract the sequence spatial featurefs. Specifically, this is finished by first extracting spatial representation from

t × 128×64

t ×64×32 1×7×7 conv

Temporal Feature

RAL M3D Res block×2

M3D Res block×2

t ×64×32

𝑡 2×32×16

𝑡 4×16×8

𝑡 8×8×4

Pooling Pooling

t ×256×128

t ×64×32

t ×32×16

t ×16×8

t ×8×4 Res block×3

Res block×4

Res block×6

Res block×3 7×7 conv

Spatial Feature Pooling

Pooling

Temporal stream Spatial stream

Figure 3: Illustration of Two-Stream M3D network.

each individual video frame, then aggregating frame features through average pooling,i.e.,

fs= 1 T

T X

t=1

F2d(st) (2)

whereF2drefers to 2D CNN used to extract frame feature. As discussed in the above sections, more effective ways of acquiring temporal feature should be investigated. For the temporal representationft, we propose a Multi-scale 3D (M3D) convolution network to learn the multi-scale temporal cues,

ft=FM3D(Q), (3)

whereFM3Ddenotes the M3D network. It directly learns the temporal feature from video sequences.

The 2D CNN and M3D network compose a two stream neural network illustrated in Fig. 3. The following sections describe our design of the M3D network, which is imple-mented as the temporal stream in Fig. 3.

M3D Convolution Network

(4)

1×1×1 conv 1×3×3 conv

Temporal kernel1

1×1×1 conv Temporal kernel2

Temporal kernel3

Temporal kernel3 Temporal kernel2 Temporal kernel1

M3D layer

Temporal kernel3 Temporal kernel2 Temporal kernel1

M3D layer

1×1×1 conv 1×3×3 conv 1×1×1 conv Temporal kernel1

Temporal kernel2

Temporal kernel3

Figure 4: Illustration of M3D layer build in residual block with three temporal kernels,i.e.,n= 3.

kernel can be formulated as a 3D tensor with size oft× h×w (the channel dimension is omitted for simplicity), wheretis the temporal depth of kernel, whilehandware the spatial sizes. The 3D convolution encodes the spatial-temporal cues through sliding along both the spatial and temporal dimensions of the video clip.

3D convolution kernel only captures the short-term tem-poral cues,e.g., the 3D kernels in Fig. 2 (b-e) capture the temporal relations across 3 frames. To model longer-term temporal cues, multiple 3D convolution kernels have to be concatenated as a deep network. A deep 3D CNN involves a large amount of parameters. Moreover, 3D CNNs can not leverage the 2D images in ImageNet (Deng et al. 2009) for model pre-training, making it further difficult to be optimized. The following parts show how our M3D layer mitigates those shortcomings in 3D convolution.

Multi-scale 3D Convolution Shortcomings of 3D CNN motivate us to design a compact convolution kernel that cap-tures longer-term temporal cues. Inspired by dilated convolu-tion (Yu and Koltun 2015), we propose to capture temporal cues through parallel dilated convolutions on the temporal dimension.

A M3D layer contains a spatial convolution kernel and nparallel temporal kernels with different temporal ranges. Given an input feature mapx∈ RC×T×H×W_{, we define} the output of M3D layer as:

y=S(x) + n X

i=1

T(i)(S(x)) (4)

whereSis the spatial convolution andT(i)_{is the temporal}

convolution with dilation ratei. The computation ofS fol-lows the ones in 2D convolution. We define the computation ofT(i)_as

y=T(i)₍_x₎_, _y

t,h,w=

1 X

a=−1

xt+a×i,h,w×W(i), (5)

whereW(i)denotes thei-th temporal kernel.

Fig. 4 illustrates the detailed structure of M3D layer with n= 3in residual block. As shown in Fig. 4,ncontrols the receptive field size in time dimension. If we setn= 1, the M3D layer equals to P3D-C (Qiu, Yao, and Mei 2017),i.e., factorizing the 3D convolutional kernels into a spatial kernel

and a temporal kernel. To limit the receptive field size no larger than the temporal dimension of input signal, given an input feature map with temporal dimension ofT, we compute the number of temporal kernelsnas,

n=bT−1

2 c (6)

wherebcmeans rounded down operation.

As shown in Fig. 4, withn= 3, M3D layer has a larger temporal receptive field than 3D convolution,e.g., covering 7 time dimensions. Another advantage of M3D layer is the learning of rich long and short temporal cues through intro-ducing multiple temporal kernels. Moreover, any 2D CNN layer can become a M3D layer by inserting temporal kernels through a residual connection as shown in Fig. 4. This struc-ture allows M3D layer can be initialized with well-trained 2D CNN layers. For example, M3D layer can be initialized by setting weights of temporal kernels to 0, which equals to a 2D CNN layer. Initialized on a good 2D CNN model, M3D CNN would be easier to be optimized.

Residual Attention Layer In a long video sequence, dif-ferent frames may present difdif-ferent visual qualities. Temporal cues extracted on some consecutive frames could be more important or robust than the others. Therefore, it is not rea-sonable to treat different spatial and temporal cues equally. We hence propose attention selection mechanisms to refine spatial and temporal cues learned by M3D layer.

We propose a Residual Attention Layer (RAL) to learn the spatial-temporal attention masks. Given an input tensor x∈RC×T×H×W, the RAL computes a saliency attention maskM ∈RC×T×H×W _{of the same size as}_x_{. Traditional} attention masks are commonly multiplied on feature map to emphasize important local regions. As shown in (Li, Zhu, and Gong 2018b), solely emphasizing local regions and discard-ing the global cues may degrade the ReID performance (Li, Zhu, and Gong 2018b). To chase a more effective attention mechanism, we design the attention model with a residual manner,i.e.,

y =1

2x+M ·x, (7)

wherexandydonate the input and output 4D signals, respec-tively.Mis the 4D attention mask which has been normalized to(0,1)by sigmoid function. As shown in Eq. 7, RAL is implemented as a residual convolution layer, where the ini-tial inputxis kept, meanwhile the meaningful cues inxare emphasized by the learned maskM.

Directly learningM can be expensive because it may con-tain a large number of parameters. As shown in Fig. 5, we learnMby factorising it into three low-dimensional attention masks to decrease the number of parameters,i.e.,

M =Sigmoid(Sm×Cm×Tm) (8)

whereSm ∈ R1×1×H×W_,_Cm _∈ _RC×1×1×1 _and_Tm _∈

R1×T×1×1_{represent the spatial, channel and temporal}

at-tention masks, respectively. To learns the three masks, RAL introduces three branches, whose outputs are finally multi-plied asM.

(5)

3×3, 1 1×1, 1

1×1, c 𝑟 1×1,𝑐

1×1, 𝑡 1×1,𝑡

1×t×h×w

c×1×h×w

c × 1 × 1 1×t×1×1

𝑡 × 1 × 1 1 × ℎ × 𝑤

Sigmoid

𝑐 × 𝑡 × ℎ × 𝑤

×1 2

𝑐 × 𝑡 × ℎ × 𝑤

𝑘 × 𝑘, c Conv layer : kernel size k, output channel c

Global pooling and reshape operation Multiplication operation

Addition operation

𝑐 × 1 × 1

𝑡 × 1 × 1 𝑐 × ℎ × 𝑤

𝑐 × 𝑡 × ℎ × 𝑤

×1 2

𝑐 × 𝑡 × ℎ × 𝑤 Conv layer : kernel size k, output channel c

Global pooling and reshape operation Multiplication operation

Addition operation

1×t×1×1 3×3, 1 1×1, 1 1×t×h×w 1×1, c_𝑟 1×1, c c×1×h×w 1×1, 𝑡 1×1, t

Sigmoid 𝑘 × 𝑘, c

Figure 5: Illustration of Residual Attention Layer (RAL). RAL consists of three branches to apply spacial, temporal, and channel attentions. The ReLU and Batch Normalisation (BN) layer are applied after each convolution layer.

two convolution layers to compute Sm. Giving an input x∈RC×T×H×W_{, we define global temporal pooling as}

xs= 1 T

T X

t=1

x1:C,t,1:H,1:W. (9)

The global temporal pooling layer is designed to aggregate the information across different time dimensions. It also de-creases the number of subsequent convolution parameters. We hence compute the spatial attention mask based onxs.

A previous work (Li, Zhu, and Gong 2018b) directly av-erages feature maps across different channels as the spatial attention map. To model the difference across channels, we utilize a convolution layerconvs1to generate a one-channel

at-tention map. An1×1convolution layer is further introduced to learn a scale parameter for further fusion. The computation ofSmcan be denoted as

Sm=conv₂s(ReLU(convs₁(xs))). (10)

Channel Attention Mask Learning:The channel atten-tion branch also contains 1 pooling layer and two 1×1 convolution layers. The first global pooling operation is im-posed on spatial and temporal dimensions to aggregate the spatial and temporal cues,i.e.,

xc= 1

T ×H×W

T X

t=1

H X

h=1

W X

w=1

x1:C,t,h,w. (11)

We then follow the Squeeze-and-Excitation (SE) (Hu, Shen, and Sun 2017) and design the channel branch with bottleneck manner:

Cm=conv₂c(ReLU(convc₁(xc))), (12)

where the output channels ofconvc

1is set as

c

r,rrepresents the bottleneck reduction rate. And the output channels of convc₂is set asc. The SE design reduces the parameters of two convolution layers from(c2₊_c2₎_to1

r(c

2₊_c2_{), where}

theris set as 16 in our experiment.

Temporal Attention Mask Learning:The temporal tention branch has the same architecture as the channel at-tention branch. It first aggregates the spacial and channel dimensions through global pooling. Then the attention mask can be obtained through two convolution layers.

The outputs from three branches are combined as the fi-nal attention maskM, which is further normalized to[0,1] through sigmoid function. Through initializing all convolu-tion layers as zero, we can getM = 1₂. FinallyMis imposed to the input feature map with a residual manner as Eq. 7.

With the designed M3D layer and RAL, we build M3D convolution network based on ResNet50 as illustrated in Fig 3. More details of our network structure can be found in the following sections.

Experiment

Dataset

We use three video ReID datasets as our evaluation protocols, includingPRID-2011 (Hirzer et al. 2011),iLIDS-VID(Wang and Zhao 2014) andMARS(Zheng et al. 2016).

PRID-2011consists of 400 sequences of 200 pedestrians from two cameras. Each sequence has a length between 5 and 675 frames. Following the implementation in previous works (Wang and Zhao 2014; Li et al. 2018), we randomly split this dataset into train/test identities. This procedure is repeated 10 times for computing averaged accuracies.

iLIDS-VIDconsists of 600 sequences of 300 pedestrians from two non-overlapping cameras. Each sequence has a variable length between 23 and 192 frames. We also follow the implementation in (Wang and Zhao 2014; Li et al. 2018), randomly split this dataset into train/test identities 10 times.

MARSconsists of 1261 pedestrians and 20,715 sequences under 6 cameras. Each pedestrian is captured by at least 2 cameras. This dataset provides fixed training and testing sets, which contain 630 and 631 pedestrians, respectively.

Implementation Details

We employ ResNet50 (He et al. 2016) as a simple 2D CNN baseline. All the 3D CNN models tested in this paper are build based on ResNet50 by replacing the 2D convolution layers with corresponding 3D convolution layers. M3D CNN is constructed based on ResNet50 by replacing portions of its 2D convolution layers with M3D layer. 3 RAL are further inserted to learn attention masks as illustrated in Fig. 3. We totally replace 4 residual blocks at the beginning of each stage in ResNet50.

(6)

Table 1: Comparison between M3D convolution layer and convolution layers onMARS.

Method Input mAP r1 Speed Params

Frames

2D CNN 1 62.54 76.43 796 frame/s 95.7MB

I3D ₁₆8 62.84_61.58 76.62_75.11 81.0 clip/s_{38.7 clip/s} 186.3MB

P3D-A 8 60.69 75.08 90.1 clip/s 110.9MB 16 60.52 75.69 46.9 clip/s

P3D-B 8 67.03 79.06 93.9 clip/s 110.9MB 16 65.07 77.63 48.7 clip/s

P3D-C ₁₆8 67.06_65.17 79.08_79.44 87.6 clip/s_{45.4 clip/s} 110.9MB

M3D 8 69.90 81.01 98.3 clip/s 99.9MB

16 66.23 80.13 49.1 clip/s

12 forT = 16. The initial learning rate is set as 0.01, and is reduced ten times after 300 epoches.

During testing, we use 2D CNN to extract feature from each still frame, then fuse frame features into spatial feature through average pooling. For 3D CNN models, we sample T adjacent frames from original sequences as input. For a sequence of length L, we can get bL

Tc sampled inputs and corresponding features, respectively, wherebcrefers to rounded down operation. The sequence-level feature is finally acquired by averaging those features. All of our experiments are implemented with GTX TITAN X GPU, Intel i7 CPU, and 128GB memory.

Evaluation on Individual Components

1) M3D Convolution Layer To verify the effectiveness of M3D convolution layer, we build a M3D CNN based on ResNet50, following the structure of temporal stream illus-trated in Fig 3. To show the performance gains of M3D layers, RAL is not inserted in this experiment. We also compare sev-eral widely used temporal feature extraction methods.

I3D(Carreira and Zisserman 2017) inflates 2D kernels into 3D versions to acquire the temporal cues learning ability. Fig 2 (a-b) show the inflation process. 2D kernels are typically square, therefore they are inflated cubically,e.g.,N×Nto N×N×N, which introduces a large mount of parameters to I3D.

P3D(Qiu, Yao, and Mei 2017) factorizes the 3D kernels into separate spatial and temporal ones,e.g., factorize aN× N ×N kernel to a1×N ×N spatial kernel and aN ×

1×1temporal kernel to reduce the amount of parameters. Fig. 2(c-e) shows 3 ways of factorizations, which are named as P3D-A, P3D-B and P3D-C, respectively. The factorization substantially decreases the parameters in 3D CNNs, while still need to stack many temporal kernels to capture long-term temporal cues.

We apply ResNet50 as the 2D CNN baseline. All of the 3D CNNs are implemented based on ResNet50, by replace 2D convolution layers with corresponding 3D versions. The comparison results are shown in Table 1.

I3D shows promising performance on video action recog-nition tasks (Carreira and Zisserman 2017). However, it

Table 2: The performance of M3D CNN on three datasets by inserting RAL and fusing of spatial and temporal features.

Dataset MARS PRID iLIDS-VID

Method mAP r1 r1 r1

2D baseline 62.54 76.43 82.02 49.33

M3D 69.90 81.01 87.64 70.00

M3D+RAL(s) 71.04 82.19 89.89 71.33

M3D+RAL(t) 70.66 81.81 88.76 71.33

M3D+RAL(c) 71.30 82.13 89.89 72.00

M3D+RAL 71.76 82.79 91.03 72.67

Two-stream M3D 74.06 84.39 94.40 74.00

achieves similar performance with 2D CNN in Table 1. The reason might be because I3D model has too many parameters, making it hard to train on the relatively small person ReID training sets. The P3D-A shows poor performance compared with 2D CNN. This could be caused by the serial connection between spatial kernel and temporal kernel, which increases the nonlinearity of the CNN model and makes it hard to be optimized. The P3D-B and P3D-C connect the spatial and temporal kernels through parallel or residual connections. They get substantial performance improvements over the 2D CNN. This shows the advantages of 3D CNN over 2D CNN in person ReID.

The experimental results also show that, our M3D CNN constantly outperforms 2D CNNs and other 3D CNNs. It outperforms 2D CNN and P3D-C by about 7.4% and 2.9% in mAP, respectively. Meanwhile, M3D CNN is also more compact than the compared 3D CNNs. M3D CNN contains 4 M3D layers, which bring only4.2MB parameter overhead into 2D CNN. It is more compact than I3D (90.6MB) and P3D (15.2MB). With 8-frames clip as model input, M3D CNN achieves the speed of 98.3 clips/s (786.4 frames/s), which is also the fastest among the compared 3D CNNs. We further tested replacing all the 2D convolution layers in ResNet50 as M3D layers, but don’t get further performance improvement. This implies that a small number of M3D lay-ers already captures the long-term temporal cues in video sequences. We hence could conclude that, M3D convolu-tion layer presents promising ability in learning multi-scale temporal cues.

It is also interesting to observe that, 3D CNNs trained with 8-frame clips outperform the ones trained with 16-frame clips. The reason could be because 16-frame clips take more memory and result in smaller batch size for training. Based on the this observation, we adopt 8-frame clips for training in the following experiments.

(7)

Table 3: Comparison with recent works onMARS.

Method mAP r1 r5 r20

BoW+kissme (Zheng et al. 2016) 15.50 30.60 46.20 59.20

LOMO+XQ (Zheng et al. 2016) 16.40 30.70 46.60 60.90

IDE+XQDA (Zheng et al. 2016) 47.60 65.30 82.00 89.00

LCAR (Zhang et al. 2017) - 55.50 70.20 80.20

CDS (Tesfaye et al. 2017) - 68.20 -

-SFT (Zhou et al. 2017) 50.70 70.60 90.00 97.60

DCF (Li et al. 2017a) 56.05 71.77 86.57 93.08

SeeForest (Zhou et al. 2017) 50.70 70.60 90.00 97.60

DRSA (Li et al. 2018) 65.80 82.30 -

-DuATM (Si et al. 2018) 67.73 81.16 92.47

-LSTM (Yan et al. 2016) 61.58 76.11 85.30 92.68

A&O (Simonyan et al. 2014) 63.39 77.11 88.41 94.60

Two-stream M3D 74.06 84.39 93.84 97.74

It is clear that, any one of the 3 attention branches con-sistently improves the performance of M3D. Combining the complete RAL brings the most substantial performance gains. For example, RAL boosts the rank-1 accuracy of M3D from 87.64% to 91.03% onPRID. This demonstates the validity of our RAL in identifying discriminative spatio-temporal fea-ture. It also shows the advantages of introducing attention mechanism in video feature learning.

3) Spatial-Temporal Feature Fusion This part tests the performance of our two-stream convolution network which involves the M3D convolution layer, RAL, and the spatial-temporal feature fusion. The comparisons on three datasets are summarized in Table 2. “Two-stream M3D” refers to the complete two-stream architecture in Fig. 3.

It can be observed from Table 2 that, M3D CNN outper-forms the 2D baseline by large margins by considering extra temporal information. This shows the benefits of considering temporal cues in video based person ReID. The attention layer RAL further boosts the performance of M3D CNN. Combining the 2D CNN and M3D CNN features achieves the best performance in Table 2. This shows our two-stream architecture is effective to exploit the complementary infor-mation cross spatial and temporal domains. In the following part, we compare our two-stream M3D network with recent works on three datasets.

Comparison with Recent Work

Table 3 reports the comparison of our approach with recent works onMARS. It can be observed from Table 3 that, our method constantly outperforms all of the compared methods. Our method achieves the rank1 accuracy of 84.39% and mAP of 74.06%, outperforming two latest works DuATM (Si et al. 2018) and DRSA (Li et al. 2018) by 6.33% and 8.26% in mAP, respectively. Note that, DRSA (Li et al. 2018) extracts local part features to gain stronger discriminative power. Du-ATM (Si et al. 2018) introduces a complex frame feature matching strategy with quadratic complexity. Compared with those two works, our method is more concise and efficient,

e.g., we extract global feature and match features with simple Euclidean distance.

We further build two widely used temporal feature extrac-tion methods based on ResNet50,e.g., LSTM (Yan et al.

Table 4: Comparisons onPRIDandiLIDS-VID.

Dataset PRID iLIDS-VID

Method r1 r5 r1 r5

BoW+XQDA (Zheng et al. 2016) 31.80 58.50 14.00 32.20

DVDL (Karanam et al. 2015) 40.60 69.70 25.90 48.20

RFA-Net (Yan et al. 2016) 58.20 85.80 49.30 76.80

STFV3D (Koestinger et al. 2012) 64.10 87.30 44.30 71.70

DRCN (Wu et al. 2016) 69.00 88.40 46.10 76.80

RCN (McLaughlin et al. 2016) 70.00 90.00 58.00 84.00

IDE+XQDA (Zheng et al. 2016) 77.30 93.50 53.00 81.40

DFCP (Li et al. 2017b) 51.60 83.10 34.30 63.30

SeeForest (Zhou et al. 2017) 79.40 94.40 55.20 86.50

AMOC (Liu et al. 2017a) 83.70 98.30 68.70 94.30

QAN (Liu et al. 2017b) 90.30 98.20 68.00 86.80

DRSA (Li et al. 2018) 93.20 - 80.20

-Two-stream M3D 94.40 100.00 74.00 94.33

2016) and Appearance&Optical flow (Simonyan and Zisser-man 2014). The comparison in Table 3 clearly shows that, our method outperforms those temporal feature extraction works. For example, our method outperforms the LSTM based method by 12.48% in mAP. This significant perfor-mance boost demonstrates the advantage of our two-stream M3D network in spatial-temporal feature learning.

The comparisons on PRIDandiLIDS-VIDdatasets are shown in Table 4. As shown in the table, our proposed method presents competitive performance on rank1 accuracy. DRSA (Li et al. 2018) also gets competitive performance on both datasets, and outperforms our method on iLIDS-VID

dataset. The reason may be becauseiLIDS-VIDhas a small training set. DRSA alleviates the insufficiency of training data using multi-task learning strategy on part cues. DRSA also impose Online Instance Matching loss (OIM) loss for training, which is shown more effective than our softmax. Extracting global feature trained with basic softmax loss, our method still outperforms DRSA on the other two datasets. Our competitive performance demonstrates the advantage of learning spatial-temporal cues in person ReID.

Conclusion

This paper proposes a two-stream convolution network to explicitly leverages spatial and temporal cues for video based person ReID. A novel Multi-scale 3D (M3D) network is constructed to learn the multi-scale temporal cues in video sequences. Implemented by inserting serval M3D convolu-tion layers into 2D CNN networks, M3D network can learn robust temporal representations with a fraction of increased parameters. A Residual Attention Layer (RAL) is further designed to refine the learned temporal features by M3D in residual manner. The learned temporal representations are combined with spatial representation learned through 2D CNN for video ReID. Experimental results on three widely used video ReID datasets demonstrate the superiority of the proposed model over current state-of-the-art methods.

(8)

References

Carreira, J., and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. InCVPR.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In

CVPR.

Feichtenhofer, C.; Pinz, A.; and Wildes, R. P. 2017. Spatiotemporal multiplier networks for video action recognition. InCVPR. Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. InCVPR. He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. InCVPR.

Hirzer, M.; Beleznai, C.; Roth, P. M.; and Bischof, H. 2011. Person re-identification by descriptive and discriminative classification. In

Image Analysis.

Hu, J.; Shen, L.; and Sun, G. 2017. Squeeze-and-excitation net-works.arXiv preprint arXiv:1709.01507.

Ji, S.; Xu, W.; Yang, M.; and Yu, K. 2013. 3d convolutional neural networks for human action recognition.IEEE Trans. PAMI. Karanam, S.; Li, Y.; Radke, R. J.; and Radke, R. J. 2015. Person re-identification with discriminatively trained viewpoint invariant dictionaries. InICCV.

Klaser, A.; Marszałek, M.; and Schmid, C. A spatio-temporal descriptor based on 3d-gradients. InBMVC.

Koestinger, M.; Hirzer, M.; Wohlhart, P.; Roth, P. M.; and Bischof, H. 2012. Large scale metric learning from equivalence constraints. InCVPR.

Li, D.; Chen, X.; Zhang, Z.; and Huang, K. 2017a. Learning deep context-aware features over body and latent parts for person re-identification. InCVPR.

Li, Y.; Zhuo, L.; Li, J.; Zhang, J.; Liang, X.; and Tian, Q. 2017b. Video-based person re-identification by deep feature guided pooling. InCVPR Workshops.

Li, S.; Bak, S.; Carr, P.; and Wang, X. 2018. Diversity regularized spatiotemporal attention for video-based person re-identification. In

CVPR.

Li, M.; Zhu, X.; and Gong, S. 2018a. Unsupervised person re-identification by deep learning tracklet association. InECCV. Li, W.; Zhu, X.; and Gong, S. 2018b. Harmonious attention network for person re-identification. InCVPR.

Liu, H.; Jie, Z.; Jayashree, K.; Qi, M.; Jiang, J.; Yan, S.; and Feng, J. 2017a. Video-based person re-identification with accumulative motion context.IEEE Trans. CSVT.

Liu, Y.; Yan, J.; Ouyang, W.; and Ouyang, W. 2017b. Quality aware network for set to set recognition. InCVPR.

Liu, J.; Zha, Z.-J.; Chen, X.; Wang, Z.; and Zhang, Y. 2018a. Dense 3d-convolutional neural network for person re-identification in videos.ACM Trans. Multimed. Comput. Commun. Appl. Liu, J.; Zha, Z.-J.; Xie, H.; Xiong, Z.; and Zhang, Y. 2018b. Ca 3 net: Contextual-attentional attribute-appearance network for person re-identification. InACM MM.

McLaughlin, N.; Martinez del Rincon, J.; Miller, P.; and Miller, P. 2016. Recurrent convolutional network for video-based person re-identification. InCVPR.

Pedagadi, S.; Orwell, J.; Velastin, S.; and Boghossian, B. 2013. Local fisher discriminant analysis for pedestrian re-identification. InCVPR.

Qiu, Z.; Yao, T.; and Mei, T. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. InICCV.

Scovanner, P.; Ali, S.; and Shah, M. 2007. A 3-dimensional sift descriptor and its application to action recognition. InACM MM. Si, J.; Zhang, H.; Li, C.-G.; Kuen, J.; Kong, X.; Kot, A. C.; and Wang, G. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. arXiv preprint arXiv:1803.09937.

Simonyan, K., and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. InNIPS.

Su, C.; Yang, F.; Zhang, S.; Tian, Q.; Davis, L. S.; and Gao, W. 2015. Multi-task learning with low rank attribute embedding for person re-identification. InICCV.

Su, C.; Zhang, S.; Xing, J.; Gao, W.; and Tian, Q. 2016. Deep attributes driven multi-camera person re-identification. InECCV. Springer.

Su, C.; Li, J.; Zhang, S.; Xing, J.; Gao, W.; and Tian, Q. 2017. Pose-driven deep convolutional model for person re-identification. InICCV.

Tesfaye, Y. T.; Zemene, E.; Prati, A.; Pelillo, M.; and Shah, M. 2017. Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets.arXiv preprint arXiv:1706.06196. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. InICCV.

Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; and Paluri, M. 2018. A closer look at spatiotemporal convolutions for action recognition. InCVPR.

Wang, X., and Zhao, R. 2014. Person re-identification: System design and evaluation overview. InPerson Re-Identification. Wei, L.; Zhang, S.; Yao, H.; Gao, W.; and Tian, Q. 2017. Glad: global-local-alignment descriptor for pedestrian retrieval. InACM MM. ACM.

Wu, L.; Shen, C.; Hengel, A. v. d.; and Hengel, A. v. d. 2016. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609.

Wu, Y.; Lin, Y.; Dong, X.; Yan, Y.; Ouyang, W.; and Yang, Y. 2018. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. InCVPR.

Xiong, F.; Gou, M.; Camps, O.; and Sznaier, M. 2014. Person re-identification using kernel-based metric learning methods. In

ECCV.

Xu, J.; Zhao, R.; Zhu, F.; Wang, H.; and Ouyang, W. 2018. Attention-aware compositional network for person re-identification.CVPR. Yan, Y.; Ni, B.; Song, Z.; Ma, C.; Yan, Y.; and Yang, X. 2016. Person re-identification via recurrent feature aggregation. InECCV. Ye, M.; Lan, X.; and Yuen, P. C. 2018. Robust anchor embedding for unsupervised video person re-identification in the wild. InECCV. Yu, F., and Koltun, V. 2015. Multi-scale context aggregation by dilated convolutions.arXiv preprint arXiv:1511.07122.

Zhang, W.; Hu, S.; Liu, K.; and Liu, K. 2017. Learning compact appearance representation for video-based person re-identification.

arXiv preprint arXiv:1702.06294.

Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; and Tian, Q. 2016. Mars: A video benchmark for large-scale person re-identification. InECCV.