• No results found

ReMOTS: Refining Multi-Object Tracking and Segmentation (1 st Place Solution for MOTS 2020 Challenge 1)

N/A
N/A
Protected

Academic year: 2021

Share "ReMOTS: Refining Multi-Object Tracking and Segmentation (1 st Place Solution for MOTS 2020 Challenge 1)"

Copied!
13
0
0

Loading.... (view fulltext now)

Full text

(1)

ReMOTS: Refining Multi-Object Tracking and

Segmentation

(1

st

Place Solution for MOTS 2020 Challenge 1)

0

Fan Yang

1,2

, Xin Chang

1

, Chenyu Dang

1

, Ziqiang Zheng

3

, Sakriani Sakti

1,2

,

Satoshi Nakamura

1,2

, Yang Wu

4

1

Nara Institute of Science and Technology, Japan

2

RIKEN Center for Advanced Intelligence Project, Japan

3

UISEE Technology (Beijing) Co. Ltd., China

(2)

1

• Problem: detect, segment, and track multiple objects in videos. • Input: a video sequence contain that multiple RGB images. • Output: 2D mask and corresponding track ID at each frame. • Application: action recognition, automatic driving, and others.

frame k frame k+1 Input Video Data frame k frame k+1 Instance Segmentation detect+segment

Background of Multi-Object Tracking and Segmentation (MOTS)

frame k frame k+1 track_id 1 track_id 2 track_id 2 track_id 1 MOTS detect+segment+track

(3)

frame k frame k+1 frame k frame k+1 Instance Segmentation Input Video Data detect+segment Step 1

Our solution for MOTS

(4)

3

We take off-the-shelf models:

X-101-64x4d-FPN of MMDetection + Mask R-CNN X152 of Detectron 2, which refers to the public detection and segmentation methods.

Instance Segmentation

But, how to fuse instance masks from different models? Fusing boxes – using NMS

Fusing masks – may also using NMS – but change IoU to IoM (Intersection over Minimum).

Pixel_IoM = 0.01/0.01 = 1 Pixel_IoM = 1/2 = 0.5 Instance masks: 2 1 2 1 0.10.1 Pixel_IoU = 1/3 = 0.33 Pixel_IoU = 0.01/2 = 0.005

Acceptable for bounding box, But not for mask.

(5)

We proposed an offline method, as ReMOTS (Refining Multi-Object Tracking and Segmentation ). frame k frame k+1 frame k frame k+1 frame k frame k+1 track_id 1 track_id 2 track_id 2 track_id 1 MOTS Instance Segmentation Input Video Data detect+segment+track detect+segment

Our solution for MOTS

Step 1 Step 2

Our main contributions:

1. Refine appearance features 2. Automatically decide threshold

(6)

5

Intra-frame Training and Short-term Tracking

Long-term Tracklets !"#$%&'' t1 t2 t3 t4 t5 t6 t7 t8 Short-term Tracklets Ap pe ar an ce En co de r Co si ne Si m ila ri ty ()* ()+ (), ( )-(). ()/ ()0 ()1 inf inf 0.1 0.4 inf inf 0.5 0.2 0.1 0.5 inf inf 0.4 0.2 inf inf Cutting threshold =2 − !456%5&'' clusters Di st an ce t1 t2 t3 t4 t5 t6 t7 t8 Distance Matrix Hierarchical Clustering Wlong

if same split tracklet ID: set inf

elif temporal overlapping: set inf

else:

set cosine distance

t1 t2 t3 t4 t5 t6 t7 t8 Object-instance Segmentation Ground Truth: Hypotheses: ID2 ID3 id9 id2 9 id3 9 id5 96 Raw Short-term Tracker Appearance Encoder

Intra-frame Training Inter-short-term-tracklet Training

Intra-frame Cosine Affinity Intra-short-tracklet Cosine Affinity For masks of frame k, consider all of IoU > 0 masks of frame k+1 for matching tn Intra-frame Observations Estimated Bounding Boxes

in Test Set Intra-frame sampling Augmentation P N P NInter-tracklet sampling t1 t2 t3 t4 t5 Ground-truth Tracklets in Training Set Form a Mini Batch

Input by the Ratio 1:1

Temporal overlapped & non-overlapped tracklets Ground Truth: Hypotheses: ID7 ID8 id#2 id#3 Intra-frame Training frame k frame k+1 0.2 0.4 inf inf inf 0.3 Linear Assignment BoT-Reid (TMM 2020) Appearance Encoder Appearance Distance Matrix Embedding Appearance Features Cosine Distance !"#$%&'((

(7)

Long-term Tracklets !"#$%&'' t1 t2 t3 t4 t5 t6 t7 t8 Short-term Tracklets Ap pe ar an ce En co de r Co si ne Si m ila ri ty ()* ()+ (), ( )-(). ()/ ()0 ()1 inf inf 0.1 0.4 inf inf 0.5 0.2 0.1 0.5 inf inf 0.4 0.2 inf inf Cutting threshold =2 − !456%5&'' clusters Di st an ce t1 t2 t3 t4 t5 t6 t7 t8 Distance Matrix Hierarchical Clustering Wlong

if same split tracklet ID: set inf

elif temporal overlapping: set inf

else:

set cosine distance

t1 t2 t3 t4 t5 t6 t7 t8 Object-instance Segmentation Ground Truth: Hypotheses: ID2 ID3 id9 id2 9 id3 9 id5 96 Raw Short-term Tracker Appearance Encoder

Intra-frame Training Inter-short-term-tracklet Training

Intra-frame Cosine Affinity Intra-short-tracklet Cosine Affinity Inter-short-term-tracklet sampling P N P NInter-tracklet sampling t1 t2 t3 t4 t5

Temporal overlapped & non-overlapped tracklets

Ground-truth Tracklets in Training Set Form a Mini Batch

Input by the Ratio 1:1

t1 t2 t3 t4 t5 Temporal overlapped short-term tracklets Ground Truth: Hypotheses: ID7 ID8 id#2 id# id3 # id5 #6 Estimated Short-term-tracklets in Test Set

Inter-short-term-tracklet Training

Long-term Tracklets !"#$%&'' t1 t2 t3 t4 t5 t6 t7 t8 Short-term Tracklets Ap pe ar an ce En co de r Co si ne Si m ila ri ty ()* ()+ (), ( )-(). ()/ ()0 ()1 inf inf 0.1 0.4 inf inf 0.5 0.2 0.1 0.5 inf inf 0.4 0.2 inf inf Cutting threshold =2 − !456%5&'' clusters Di st an ce t1 t2 t3 t4 t5 t6 t7 t8 Distance Matrix Hierarchical Clustering Wlong

if same split tracklet ID: set inf

elif temporal overlapping: set inf

else:

set cosine distance

t1 t2 t3 t4 t5 t6 t7 t8 Object-instance Segmentation Ground Truth: Hypotheses: ID2 ID3 id9 id2 9 id3 9 id5 96 Raw Short-term Tracker Appearance Encoder

Intra-frame Training Inter-short-term-tracklet Training

Intra-frame Cosine Affinity Intra-short-tracklet

(8)

7

What Happened in Each Step of Appearance Training

J (H1, H2) represents Jaccard Index of two normalized histograms H1and H2.

(1) After Trained on the train

set only (2) After Intra-frame training on test set without labels

(3) After Inter-short-tracklet training on test set with pseudo labels Long-term Tracklets !"#$%&'' t1 t2 t3 t4 t5 t6 t7 t8 Short-term Tracklets Ap pe ar an ce En co de r Co si ne Si m ila ri ty ()* ()+ (), ( )-(). ()/ ()0 ()1 inf inf 0.1 0.4 inf inf 0.5 0.2 0.1 0.5 inf inf 0.4 0.2 inf inf Cutting threshold =2 − !456%5&'' clusters Di st an ce t1 t2 t3 t4 t5 t6 t7 t8 Distance Matrix Hierarchical Clustering Wlong

if same split tracklet ID: set inf

elif temporal overlapping: set inf

else:

set cosine distance

t1 t2 t3 t4 t5 t6 t7 t8 Object-instance Segmentation Ground Truth: Hypotheses: ID2 ID3 id9 id2 9 id3 9 id5 96 Raw Short-term Tracker Appearance Encoder

Intra-frame Training Inter-short-term-tracklet Training

Intra-frame Cosine Affinity Intra-short-tracklet Cosine Affinity Long-term Tracklets !"#$%&'' t1 t2 t3 t4 t5 t6 t7 t8 Short-term Tracklets Ap pe ar an ce En co de r Co si ne Si m ila ri ty ()* ()+ (), ( )-(). ()/ ()0 ()1 inf inf 0.1 0.4 inf inf 0.5 0.2 0.1 0.5 inf inf 0.4 0.2 inf inf Cutting threshold =2 − !456%5&'' clusters Di st an ce t1 t2 t3 t4 t5 t6 t7 t8 Distance Matrix Hierarchical Clustering Wlong

if same split tracklet ID: set inf

elif temporal overlapping: set inf

else:

set cosine distance

t1 t2 t3 t4 t5 t6 t7 t8 Object-instance Segmentation Ground Truth: Hypotheses: ID2 ID3 id9 id2 9 id3 9 id5 96 Raw Short-term Tracker Appearance Encoder

Intra-frame Training Inter-short-term-tracklet Training

Intra-frame Cosine Affinity Intra-short-tracklet

Cosine Affinity

(9)

Merging Short-term Tracklets

Long-term Tracklets !"#$%&'' t1 t2 t3 t4 t5 t6 t7 t8 Short-term Tracklets Ap pe ar an ce En co de r Co si ne Si m ila ri ty ()* ()+ (), ( )-(). ()/ ()0 ()1 inf inf 0.1 0.4 inf inf 0.5 0.2 0.1 0.5 inf inf 0.4 0.2 inf inf Cutting threshold =2 − !456%5&'' clusters Di st an ce t1 t2 t3 t4 t5 t6 t7 t8 Distance Matrix Hierarchical Clustering Wlong

if same split tracklet ID: set inf

elif temporal overlapping: set inf

else:

set cosine distance

t1 t2 t3 t4 t5 t6 t7 t8 Object-instance Segmentation Ground Truth: Hypotheses: ID2 ID3 id9 id2 9 id3 9 id5 96 Raw Short-term Tracker Appearance Encoder

Intra-frame Training Inter-short-term-tracklet Training

Intra-frame Cosine Affinity Intra-short-tracklet Cosine Affinity Long-term Tracklets !"#$%&'' t1 t2 t3 t4 t5 t6 t7 t8 Short-term Tracklets Ap pe ar an ce En co de r Co si ne Si m ila ri ty ()* ()+ (), ( )-(). ()/ ()0 ()1 inf inf 0.1 0.4 inf inf 0.5 0.2 0.1 0.5 inf inf 0.4 0.2 inf inf Cutting threshold =2 − !456%5&'' clusters Di st an ce t1 t2 t3 t4 t5 t6 t7 t8 Distance Matrix Hierarchical Clustering Wlong if same split tracklet ID:

set inf

elif temporal overlapping: set inf

else:

set cosine distance

t1 t2 t3 t4 t5 t6 t7 t8 Object-instance Segmentation Ground Truth: Hypotheses: ID2 ID3 id29 id9 id3 9 id5 96 Raw Short-term Tracker Appearance Encoder

Intra-frame Training Inter-short-term-tracklet Training

Intra-frame Cosine Affinity Intra-short-tracklet Cosine Affinity !"#$%&'' Intra-frame Cosine Affinity Intra-short-tracklet Cosine Affinity t1 t2 t3 t4 Temporal-overlapped Short-term Tracklets

(10)

9

Comparison with others on MOTSChallenge 1

Since our strategy can be easily adapted to others, will other methods get better

performance by applying our appearance encoder and merging?

May benefit from mask fusion May benefit from refinement

(11)

Limitations of ReMOTS

1. An offline approach.

- It worth to explore how to bring it to online approach.

2. It is challenging for ReMOTs to handle objects with similar appearance.

e.g., good for persons (wear different clothes) but not very useful for

vehicles (similar textures)

3. Trajectory is not considered in our short-term tracker. Failed to associate fast

moving objects.

Slowly moving person with diverse clothes Fast moving car with similar appearance

?

(12)

11

Conclusion

• Unlabeled target videos can be used for learning better appearance

features, but should take care of the potential of introducing noises.

• The suitable hyper parameters for data association may varies from case

to case, and the statistical information of tracklets can be used to adjust

them.

• It is preferred to accommodate some insights of ReMOTS to online

MOTS.

(13)

References

Related documents

3.4 Ga old Buck Reef Chert of the Barberton greenstone belt of South Africa, one of the best preserved, less metamorphic Archaean volcano-sedimentary succession

(a) Amplitude of blocking for 14 December 2018 predicted by forecasts from different initial times up to 15 days ahead with forecasts from 5, 7, and 9 December highlighted; box

Assignments (in writing and doing forms on the aspects of syllabus content and outside the syllabus content. Shall be individual and challenging). Student seminars (on

The original CAWAP solutions were obtained using a structured grid solver, based on a documented iges file with refinements, compared with measured flight data, and reported

This study, using cointegration and causality tests, investigates the relationship between tourism development, economic expansion and poverty reduction in Nicaragua.. The

Our hypotheses derive from the observation that the timing and pace of the fertility transition correspond broadly with changes in socio-political institutions: whether

monitoring phase to respond through further development of online tools (i.e. social media, seo, blog etc.)... © 2010 Trinet Internet

The objective of this study is to investigate the relationship between teachers and students understanding of nature of science, as predictors of achievement in Biology