
Understanding Atomic Hand-Object Interaction with Human Intention

Hehe Fan, Tao Zhuo, Xin Yu, Yi Yang, and Mohan Kankanhalli, Fellow, IEEE

Abstract—Hand-object interaction plays a very important role when humans manipulate objects. While existing methods focus on improving hand-object recognition with fully automatic methods, human intention has been largely neglected in the recognition process, thus leading to undesirable interaction descriptions. To better interpret human-object interaction in a way that is aligned with human intention, we argue that a reference specifying human intention should be taken into account. Thus, we propose a new approach to represent interactions while reflecting human purpose with three key factors, i.e., hand, object and reference. Specifically, we design a pattern of <hand-object, object-reference, hand, object, reference> (HOR) to recognize intention-based atomic hand-object interactions. This pattern aims to model interactions with the states of hand, object, reference and their relationships. Furthermore, we design a simple yet effective Spatially Part-based (3+1)D convolutional neural network, namely SP(3+1)D, which leverages 3D and 1D convolutions to model visual dynamics and object position changes based on our HOR, respectively. With the help of our SP(3+1)D network, the recognition results are able to indicate human purposes accurately. To evaluate the proposed method, we annotate a Something-1.3k dataset, which contains 10 atomic hand-object interactions and about 130 videos for each interaction. Experimental results on Something-1.3k demonstrate the effectiveness of our SP(3+1)D network.

Index Terms—Hand-object interaction reasoning, action recognition, video analysis, deep neural networks.

I. INTRODUCTION

HAND-OBJECT interaction, e.g., using a hand to move, arrange, control or operate things, is a dominant behavior when humans interact with the environment. Recognizing and understanding hand-object interactions are critical and challenging for intelligent agents. In this paper, we aim to recognize atomic interactions. Atomic hand-object interaction reasoning is a fundamental task in understanding complex activities. For example, a general interaction of “making a cup of coffee” can be divided into several atomic interactions, i.e., “putting coffee into a cup”, “pouring hot water into a cup” and “stirring coffee with a spoon”. Different from coarse-grained action recognition, such as swimming or playing football, hand-object interaction contains fine-grained human intention, e.g., “putting or dropping a book into a box” and “taking a book away from a bookshelf”. However, focusing on different objects will lead to different interaction recognition results, especially in a collaborative AI scenario, where intelligent agents are required to provide different assistance depending upon different human intentions.

H. Fan, T. Zhuo and M. Kankanhalli are with the School of Computing, National University of Singapore, Singapore. E-mail: {hehe.fan,zhuotao}@nus.edu.sg, mohan@comp.nus.edu.sg

X. Yu and Y. Yang are with the Center for Artificial Intelligence, University of Technology Sydney. E-mail: {xin.yu,yi.yang}@uts.edu.au

Manuscript received July 23, 2020.

Fig. 1. Motivation of human intention based hand-object interactions. Different references will lead to different interaction reasoning results. When the bottom bowl is selected as the reference, the interaction is recognized as “taking a berry out of the bowl”. When the left bowl is selected, the interaction is recognized as “putting a berry into the bowl”. Therefore, to precisely describe interactions and avoid ambiguity, we propose to explicitly explore human intention for atomic interaction recognition.

As shown in Figure 1, the interaction could be interpreted as “taking a berry out of a bowl” or “putting a berry into a bowl”. Such ambiguity is inherent even in atomic interaction recognition without knowing the human intention.

Conventionally, action recognition methods [1]–[5] mainly focus on classifying coarse-grained actions rather than paying attention to the interactive process between humans and objects. For instance, in Figure 1, conventional methods may classify the action as “moving a berry”, and this label/classification result cannot fully describe the interaction. Moreover, although an algorithm can provide a detailed classification, it is still difficult to distinguish whether the action should be “taking a berry out of a bowl” or “putting a berry into a bowl” without knowing human intention. Therefore, it is necessary to take human intention into account for better understanding hand-object interaction.

Recently, to interpret human-object interaction, graph-based methods [6], [7] have been proposed. These methods often separate each frame into several parts and then learn the relationships among those parts. Specifically, these methods apply object detectors [8] to localize all potential objects in each frame for recognition. Since references of interest are subjective rather than deterministic, these methods may employ unintended objects as references for recognizing interactions, especially in cluttered scenes. As a result, the predictions may not match the ground-truth ones that indicate human intention. Therefore, a better way to describe interaction is to embed human intention so as to identify key behaviors.


In this paper, we propose to understand hand-object interactions aligned with human intentions, where the intended objects (references) are specified by gaze recognition or manually. By doing so, the ambiguity in interaction recognition is significantly reduced and the results effectively reflect human intention. To the best of our knowledge, this is the first attempt to investigate how human intention influences interaction recognition.

To precisely model interactions, a natural question is: what are the key factors in depicting an interaction? Motivated by this, we examine three key factors involved in an interaction, i.e., hand, object and reference, and propose a pattern of <hand-object, object-reference, hand, object, reference> (HOR pattern). The “hand”, “object” and “reference” indicate the states (e.g., pose and appearance) of these components respectively and provide important clues for reasoning. For example, based on the state change of a book, we can deduce the interactions of “opening a book” and “closing a book”. However, existing methods [9], [10] mainly focus on investigating individual states but neglect their relationships, and thus they may suffer from ambiguity in recognition.

We therefore explore the “hand-object” and “object-reference” relationships for recognizing position-related interactions. To be specific, the “hand-object” is designed to model the relationship between a hand and an object, such as “holding” and “dropping”. The “object-reference” establishes the relationship between an object and a reference, thus providing cues of position changes such as “into” and “out of”. Meanwhile, the individual “hand”, “object” and “reference” components are exploited to recognize various hand-object interactions in a unified framework.

To recognize hand-object interactions, we propose a simple yet effective Spatially Part-based (3+1)D network, namely SP(3+1)D. In particular, SP(3+1)D treats the elements of the HOR pattern as five independent spatial parts. Then, 3D convolutional layers [1] are used to extract spatio-temporal features of each spatial part, while 1D convolutional layers are employed to capture the position changes of the hand, object and reference for “hand-object” and “object-reference” reasoning. The features of the five spatial parts are concatenated for the final interaction recognition.

Since existing benchmark datasets do not specify hands, objects or references during interactions, we annotate a dataset, namely Something-1.3k, from the 20BN-something-something [11] dataset to evaluate our proposed method. In total, the newly annotated dataset contains 10 atomic hand-object interaction classes, and there are around 130 videos for each interaction class. Moreover, hands, objects and references are annotated with bounding boxes in each video. To avoid recognizing interactions simply from scenes, the same object, reference and scene are not allowed to appear in the same interaction class more than once.

In summary, our contributions are four-fold:

• To the best of our knowledge, this work is the first attempt to leverage human intention to reduce the ambiguity in recognizing hand-object interaction.

• We propose a new pattern of <hand-object, object-reference, hand, object, reference> for atomic hand-object interaction recognition. Using this pattern, we significantly improve the recognition performance.

• We design a simple yet effective Spatially Part-based (3+1)D network to recognize interactions based on our proposed pattern.

• We annotate a Something-1.3k dataset to evaluate our proposed method. Experimental results on this challenging dataset demonstrate the superiority of our method.

II. RELATED WORK

Video analysis techniques, such as action recognition [1], [12], event detection [13], [14], activity understanding [15]–[17], video classification [18], video retrieval [19] and video prediction [20], have been studied for decades. Among them, action recognition plays a core role. Two-stream convolutional networks [21], [22] capture the complementary information of appearance and motion between frames (i.e., optical flows). Deep 3D convolutional networks [1], [2], [4], [5] directly learn spatio-temporal features from raw RGB video frames. As a video consists of a temporal sequence of frames, recurrent neural networks are also applied to video analysis. Ng et al. [23] and Fan et al. [18] used a Long Short-Term Memory (LSTM) [24] to aggregate frame-level CNN features for modeling temporal video sequences.

Similar to image relation detection and reasoning [25], [26], some methods try to leverage explainable or interpretable models for interaction or action reasoning. Shang et al. [27] proposed a video visual relation detection (VidVRD) method using a relation triplet <subject, predicate, object>. Zhuo et al. [28] proposed to integrate prior knowledge into logical reasoning to explain semantic-level observations of video state changes. Di et al. [29] proposed a Multi-Hypothesis Relational Association (MHRA) method to generate multiple hypotheses for long-term relation prediction. Liao et al. proposed a Parallel Point Detection and Matching (PPDM) method [30] for real-time human-object interaction detection. Similar to most existing methods, PPDM mainly focuses on humans and objects while neglecting references. Besides, hand-object 3D pose and shape estimation [31] has the potential to help human-object interaction reasoning.

Graph Convolutional Networks (GCN) [6], [7], [32] have recently been used to learn structured video representations. Qi et al. [32] introduced a Graph Parsing Neural Network (GPNN) to incorporate structural knowledge for detecting and recognizing hand-object interactions. Wang et al. [6] used spatio-temporal region graphs to model temporal dynamics and relationships between humans and objects. Ji et al. [33] developed Action Genome to decompose actions into spatio-temporal scene graphs. Materzynska et al. [7] presented a spatial-temporal interaction network that reasons over sparse candidate graphs built from detected objects. Different from these methods, which are designed for general action recognition, we focus on atomic hand-object interactions. By analyzing the atomic characteristics of hand-object interactions, we propose a unified pattern to describe them in a more comprehensive manner.


III. PROPOSED METHODOLOGY

In this section, we first introduce our proposed HOR pattern for atomic hand-object interaction recognition. Then we describe the proposed Spatially Part-based (3+1)D convolutional neural network in detail.

A. HOR Pattern Definition

In this paper, our goal is to infer atomic hand-object interactions by analyzing three key factors, i.e., hand, object and reference. Existing methods [10], [34] have noticed that hands and objects are important for interaction reasoning. However, they neglected another critical factor, i.e., the reference. This can cause the predicted results to fail to indicate human intentions, thus preventing human-machine interaction algorithms from being applicable in real-world scenarios. For example, a robot may fail to distinguish between “putting a wrench into a box” and “moving a wrench” without the “box” being specified as a reference. Therefore, we specify a reference to indicate human intention for interaction reasoning. Here, the intention can be obtained either from gaze-recognition algorithms or from direct human input (e.g., verbally). By treating the “box” as a reference, the robot is able to fully understand this interaction.

In general, an interaction occurs when a hand manipulates an object. As a result, the states of the three key factors and their relationships will change. Therefore, we propose an HOR (Hand-Object-Reference) pattern to model these state changes, formulated as:

<hand-object, object-reference, hand, object, reference>,

where the “hand-object” denotes the relationship between a hand and an object, the “object-reference” indicates the relationship between an object and a reference, and the “hand”, “object” and “reference” represent the states of a hand, an object and a reference, respectively. The goal of HOR is to fully capture the dynamics of atomic hand-object interactions while reducing the distraction from other unrelated parts in a scene. We illustrate the HOR pattern in Figure 2(a).
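For concreteness, the HOR pattern can be thought of as a small per-clip record of the three tracked factors, from which the two relationship components are derived. The sketch below is one possible encoding in plain Python; the class and field names are illustrative assumptions, and per-frame bounding boxes are assumed to be given as (x1, y1, x2, y2) tuples.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass
class HORFrame:
    """Per-frame states of the three key factors; a factor may be absent."""
    hand: Optional[Box]
    obj: Optional[Box]
    reference: Optional[Box]  # None for interactions such as "opening a book"


@dataclass
class HORClip:
    """A video clip annotated with the HOR pattern.

    The five components <hand-object, object-reference, hand, object,
    reference> are all derivable from the per-frame boxes: the two
    relationship components use the union boxes of the involved factors.
    """
    frames: List[HORFrame]

    def has_reference(self) -> bool:
        return any(f.reference is not None for f in self.frames)
```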

“Hand-object”: This relationship provides clues for recognizing hand actions. As illustrated in Figure 2(b), based on the distance changes between the “hand” and the “object”, we can recognize the interaction as “dropping something into something”, rather than “putting something into something”.

“Object-reference”: This component describes the changes of an object with respect to a reference. As shown in Figure 2(c), with “object-reference”, we can recognize the interaction as “moving something away from something”, rather than “moving something towards something”. Note that “object-reference” can not only describe object position changes but also imply other relationships. In the interaction “using a wrench to screw a bolt”, for example, if we consider the “wrench” as the object and the “bolt” as the reference, the action of “screwing” can be implied by “object-reference”.

“Hand”: This component describes pose changes of a hand, providing details on how a hand performs interactions. In Figure 2(d)-(e), according to the “hand” states, we can easily distinguish “picking” and “holding”.

Fig. 2. (a) Illustration of our proposed HOR pattern for atomic hand-object interaction reasoning. (b) Using “hand-object”, we distinguish whether the action is “putting” or “dropping” (i.e., when the hand leaves the object, the object is inside or outside the container). (c) With “object-reference”, we recognize the position changes of an object. (d)-(e) The “hand” states are used to distinguish “picking” and “holding”. (f) From the “object” state, we recognize that a book is being opened. (g) For the task “pouring water from a cup into a bowl”, we can recognize whether the task has been finished or not from the “reference” state. Here, we propose to employ “hand-object” and “object-reference” to recognize position-related interactions while integrating the individual states of “hand”, “object” and “reference” in a unified manner.

“Object”: This component describes the appearance changes of an object. The example in Figure 2(f) shows that from the book state, we are able to differentiate the interaction “opening a book” from “closing a book”.

“Reference”: The “reference” state describes the appearance or pose changes of a reference of user interest. With the reference, the interactions will exhibit high correlation with human intentions.

Why is there no “hand-reference” relationship in our HOR pattern? First, a hand often manipulates a target object but does not directly interact with the reference. Second, the information of “hand-reference” has already been indirectly encoded in “hand-object” and “object-reference”. To avoid information redundancy, we do not incorporate “hand-reference” into our HOR pattern.

What if some relationships in the proposed pattern are missing in an interaction? The HOR pattern aims to describe various hand-object interactions in a unified manner. If a relationship does not exist in an interaction, we directly set it to none. For example, the interaction “opening a book” only involves a “hand” and an “object”. In this case, the relationship “object-reference” and the “reference” state are set to none.

B. Spatially Part-based (3+1)D CNN

The HOR pattern is designed to capture the states and relationships among hands, objects and references for interaction recognition. To apply HOR to recognition tasks, we present a Spatially Part-based (3+1)D network, namely SP(3+1)D.


Fig. 3. Illustration of the proposed Spatially Part-based (3+1)D convolutional neural network. In particular, we design five streams to address the <hand-object, object-reference, hand, object, reference> pattern. Each stream is encoded by individual 3D convolutional layers and outputs a 1D representation. 1D convolutions, which take an 8D position vector as input, are used to explicitly learn the position changes of hand, object and reference for reasoning about the two relationships. The extracted representations from the five streams are concatenated and a fully-connected layer is then used for classification.

As shown in Figure 3, each component of the HOR pattern is regarded as a condition when an interaction happens. SP(3+1)D first extracts the features of each condition with an independent stream. Then, the descriptors of these conditions are concatenated for the final interaction recognition.

To be specific, let B^h_t, B^o_t and B^r_t be the bounding boxes of a hand, an object and a reference in the t-th frame, respectively. Then, the “hand-object” area is represented by the union bounding box of B^h_t and B^o_t. Similarly, the “object-reference” region is defined by the union bounding box of B^o_t and B^r_t. Along the time dimension, a sequence of bounding box areas forms a spatio-temporal tube to represent the dynamics of a pattern. To capture the visual dynamics of the spatio-temporal tubes, we employ a 3D convolutional neural network [5] for feature extraction.
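The following sketch shows how such a spatio-temporal tube could be assembled, assuming OpenCV and NumPy are available and frames are stored as a (T, H, W, C) array; the helper names are illustrative and not taken from the authors' code.

```python
import cv2
import numpy as np


def union_box(a, b):
    """Smallest axis-aligned box (x1, y1, x2, y2) covering both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def crop_tube(frames, boxes, size=(112, 112)):
    """Crop one box per frame and stack the resized crops into a tube.

    frames: (T, H, W, C) array; boxes: list of T (x1, y1, x2, y2) boxes.
    Returns a (T, 112, 112, C) spatio-temporal tube.
    """
    crops = []
    for img, (x1, y1, x2, y2) in zip(frames, boxes):
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(x2), img.shape[1]), min(int(y2), img.shape[0])
        crops.append(cv2.resize(img[y1:y2, x1:x2], size))
    return np.stack(crops)


# Example: the "object-reference" tube is cropped from per-frame union boxes.
# or_boxes = [union_box(o, r) for o, r in zip(obj_boxes, ref_boxes)]
# or_tube = crop_tube(video_frames, or_boxes)
```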

Different from the individual “hand”, “object” and “reference” components, the “hand-object” and “object-reference” components need to encode not only the object and reference appearance information but also their positions. Specifically, the representation of “hand-object” is expressed as:

p_ho(x, y) = (h∗, o∗, I),

and the pattern “object-reference” is represented as:

p_or(x, y) = (o∗, r∗, I),

where I denotes the color of a pixel, x and y are the coordinates of the pixel, and h∗, o∗ and r∗ represent the hand, object and reference flags. For example, when (x, y) ∈ B^h_t, the “hand” flag h∗ is set to 1; otherwise, it is set to 0. Similarly, the “object” flag o∗ is set to 1 when (x, y) ∈ B^o_t, and the “reference” flag r∗ is set to 1 when (x, y) ∈ B^r_t; otherwise, the flags are set to 0. In this manner, our model is able to capture the position changes as well as their relationships via the 3D convolutional neural networks.
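A minimal sketch of how the flag channels could be appended to the RGB frames before the 3D CNN, assuming NumPy arrays and (x1, y1, x2, y2) boxes; the function names are illustrative. The “object-reference” input is built the same way with the o∗ and r∗ flags.

```python
import numpy as np


def box_mask(box, height, width):
    """Binary mask that is 1 inside the box (x1, y1, x2, y2) and 0 elsewhere."""
    mask = np.zeros((height, width), dtype=np.float32)
    if box is not None:
        x1, y1, x2, y2 = [int(round(v)) for v in box]
        mask[max(y1, 0):max(y2, 0), max(x1, 0):max(x2, 0)] = 1.0
    return mask


def flagged_frame(rgb, hand_box, obj_box):
    """Append the h* and o* flag channels to an (H, W, 3) RGB frame.

    Returns an (H, W, 5) array: the pixel color I plus the two flags,
    i.e., p_ho(x, y) = (h*, o*, I) in the notation above.
    """
    h, w = rgb.shape[:2]
    h_flag = box_mask(hand_box, h, w)
    o_flag = box_mask(obj_box, h, w)
    return np.concatenate([rgb, h_flag[..., None], o_flag[..., None]], axis=-1)
```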

Apart from modeling the visual dynamics, we additionally use 1D CNN layers to explicitly learn the distance changes between “hand”, “object” and “reference” for the “hand-object” and “object-reference” relationships. Specifically, the input of the 1D CNNs is an 8D vector consisting of the 2D coordinates of the upper-left and lower-right corners of the two bounding boxes in the “hand-object” or “object-reference” component. We refer to this 8D vector as a position vector in this paper. As each frame in a video has an 8D vector, 1D CNNs are used to extract the position dynamics from the sequence of position vectors. By applying the 1D CNNs, position changes such as approaching, leaving and overlapping are explicitly explored for recognizing hand-object interactions.
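One possible realization of this 1D branch is sketched below in PyTorch. The layer sizes are illustrative stand-ins for the R1D-18 configuration described in Section V, and the helper names are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn


def position_vectors(boxes_a, boxes_b):
    """Per-frame 8D vectors: corners of the two boxes in a relationship.

    boxes_*: lists of T boxes (x1, y1, x2, y2). Returns an (8, T) tensor
    laid out as channels x time for Conv1d.
    """
    vecs = [list(a) + list(b) for a, b in zip(boxes_a, boxes_b)]
    return torch.tensor(vecs, dtype=torch.float32).t()  # (T, 8) -> (8, T)


class PositionEncoder(nn.Module):
    """Small 1D CNN over the temporal axis of the position vectors."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(8, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> (N, out_dim, 1)
        )

    def forward(self, x):               # x: (N, 8, T)
        return self.net(x).squeeze(-1)  # (N, out_dim)


# Usage: p = position_vectors(hand_boxes, obj_boxes).unsqueeze(0)  # (1, 8, T)
# feat = PositionEncoder()(p)                                      # (1, 128)
```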

The h∗, o∗ and r∗ flags and the 8D position vectors aim to encode position in different modalities. By appending these flags to each pixel in the frames, the “hand-object” and “object-reference” areas are marked. In this fashion, our network is able to capture the dynamic changes of “hand-object” and “object-reference” from the vision modality. Meanwhile, the 8D position vectors are responsible for explicitly encoding position changes in the coordinate modality. Note that, compared to other methods (e.g., [28] and [27]) that require additional human-annotated relationships as supervision signals, our method learns such relationships from visual and position dynamics without requiring those annotations. Moreover, it is impossible to exhaustively define all relationships in the world. By contrast, our method is not restricted by relationship annotations and can be trained with many different kinds of relationships.


The SP(3+1)D network concatenates the “hand”, “object”, “reference”, “hand-object” and “object-reference” features for the final interaction reasoning. If a relationship does not exist in an interaction, we directly set it to none. In our implementation, we simply pad zeros at the corresponding positions of missing relationships or elements in the final concatenated feature.
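Since the text only states that zeros are padded at the positions of missing components, the fusion step below is a minimal sketch of that behavior; the stream names follow the HOR pattern and the dictionary layout is an assumption.

```python
import torch


def fuse_streams(features, dims):
    """Concatenate per-stream features, zero-filling any missing stream.

    features: dict mapping stream name -> (N, d) tensor, or None if the
    stream does not exist for this interaction (e.g., "opening a book"
    has no reference). dims: dict mapping stream name -> feature size d.
    """
    order = ["hand-object", "object-reference", "hand", "object", "reference"]
    present = next(f for f in features.values() if f is not None)
    parts = []
    for name in order:
        f = features.get(name)
        if f is None:
            f = torch.zeros(present.shape[0], dims[name], device=present.device)
        parts.append(f)
    return torch.cat(parts, dim=1)  # fed to the final fully-connected layer
```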

IV. SOMETHING-1.3K DATASET

Although many datasets [34]–[36] have been developed for general action or specific interaction detection, these datasets do not specify which parts humans are interested in. Thus, deep network based systems trained on those datasets are unable to interact with humans. To enable those networks to understand interactions between humans and objects at a fine granularity, we intend to recognize atomic hand-object interaction. However, the existing datasets do not provide such information for model training. Therefore, we compiled a Something-1.3k dataset from the 20BN-something-something [11] dataset, in which humans perform pre-defined basic actions with diverse objects in various scenarios.

The original 20BN-something-something dataset contains 220,847 videos and 174 interaction classes. In our collected Something-1.3k dataset, we further add annotations of the hands, objects and references of interest for atomic hand-object interaction recognition. Specifically, we collect 1,321 videos covering 10 interaction classes. The 10 interactions (shown in Table I) are selected from the most frequent classes in 20BN-something-something, with about 130 videos for each interaction. We perform ten-fold cross-validation to evaluate the performance of different methods. For each fold, we use 1,221 videos for training and 100 videos for evaluation. Since we observe that most interactions last within 16 consecutive frames, we use 16 frames to represent each interaction video clip.
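The text does not state how the 16 frames are chosen, so the uniform sampling below is an assumption (a common choice) rather than the authors' exact protocol.

```python
import numpy as np


def sample_frame_indices(num_frames: int, clip_len: int = 16) -> np.ndarray:
    """Uniformly sample (or repeat, for short videos) clip_len frame indices."""
    if num_frames >= clip_len:
        return np.linspace(0, num_frames - 1, clip_len).round().astype(int)
    # Short video: repeat the last frame to reach clip_len indices.
    return np.concatenate([np.arange(num_frames),
                           np.full(clip_len - num_frames, num_frames - 1)])
```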

The Something-1.3k dataset is considerably challenging for atomic hand-object interaction recognition. First, the 10 interaction classes share some common factors. For example, “putting something onto something” and “putting something into something” share the same “hand-object” relationship, i.e., “putting”, while “putting something onto something” and “dropping something onto something” share the same “object-reference” relationship, i.e., “onto”.

Second, to force a model to recognize the temporal characteristics of interactions, we incorporate inverse interactions into the dataset, such as “putting something into something” and “taking something out of something”. These interactions exhibit opposite temporal orders. To distinguish them, a neural network needs to understand the temporal order within interactions.

Third, in the 20BN-something-something dataset, the same objects, references or scenes usually appear many times in the same interaction class. A model could classify interactions solely based on objects, references or scenes rather than the spatio-temporal information in interactions. For example, in the 20BN-something-something dataset, a “wrench” appears in the videos of “putting something into something”. In this case, neural networks can predict the correct interaction based

TABLE I
10 INTERACTION CLASSES IN SOMETHING-1.3K

Action   | Object      | Relationship          | Reference
taking   | [something] | out of                | [something]
putting  | [something] | into                  | [something]
putting  | [something] | onto                  | [something]
putting  | [something] | in front of           | [something]
moving   | [something] | away from             | [something]
moving   | [something] | from left to right on | [something]
moving   | [something] | from right to left on | [something]
dropping | [something] | into                  | [something]
dropping | [something] | onto                  | [something]
dropping | [something] | in front of           | [something]

on the “wrench” rather than the interaction itself. To avoid this bias, the same object, reference or scene is only allowed to appear at most once in an interaction class in our Something-1.3k dataset. Meanwhile, the same object, reference or scene is encouraged to appear in different interaction classes. In this way, a network needs to recognize interactions using the spatio-temporal information of videos. Our Something-1.3k dataset therefore mimics an experimental setting similar to the interaction understanding tasks in practice.

V. EXPERIMENTS

A. Implementation Details

Detector and Tracker. Following the previous work [7], we employ Faster R-CNN [8], with a Feature Pyramid Network [37] and ResNet-101 [38] backbone, to detect hands and objects. The model is first pre-trained on the COCO dataset [39] and the Something-Else dataset [7], and then fine-tuned on our Something-1.3k dataset. Note that we do not distinguish objects and references when fine-tuning the detector. The detected objects are then tracked by the SORT [40] method. The same detector and tracker are also used in STRG [6] and STIN [7]. Note that other detectors and trackers [41], [42] can also be employed in our approach.

Hand, Object and Reference. In most cases, the detector produces one hand and multiple object or reference bounding boxes, and we use the specified bounding boxes to represent human intention. If there is more than one hand, we select the one whose position changes most dramatically over the video. For the detected objects, we select the bounding box whose average distance to the hand is smallest during the entire interaction.
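A sketch of these two selection heuristics, assuming each track is a list of per-frame (x1, y1, x2, y2) boxes; measuring displacement and distance between box centers is an assumption, since the text does not specify the exact metric.

```python
import numpy as np


def center(box):
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])


def select_hand(hand_tracks):
    """Pick the hand whose position changes most over the video."""
    def displacement(track):
        centers = np.stack([center(b) for b in track])
        return np.linalg.norm(np.diff(centers, axis=0), axis=1).sum()
    return max(hand_tracks, key=displacement)


def select_object(object_tracks, hand_track):
    """Pick the object whose average distance to the hand is smallest."""
    hand_centers = np.stack([center(b) for b in hand_track])

    def avg_dist(track):
        obj_centers = np.stack([center(b) for b in track])
        n = min(len(obj_centers), len(hand_centers))
        return np.linalg.norm(obj_centers[:n] - hand_centers[:n], axis=1).mean()
    return min(object_tracks, key=avg_dist)
```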

Interaction Recognition Backbone. We use R3D-18 [5] as the visual perception backbone, with five individual streams in our SP(3+1)D network. The R3D-18 backbone is not pre-trained since it is used to extract features from different targets. Each R3D-18 stream outputs a 512-dimensional feature. To perceive position changes, we use two R1D-18 streams, where R1D-18 replaces the 3D convolutions in R3D-18 with 1D convolutions. The position features are concatenated with their corresponding visual features to form relationship features. The relationship and state features are then concatenated, and the concatenated feature is used for interaction prediction. ReLU is employed as the non-linear activation function.
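A compact sketch of how the five-stream design could be assembled on top of torchvision's R3D-18 (assuming a recent torchvision). Widening the stem convolution to accept the flag channels, the small 1D position branch standing in for R1D-18, and the flat concatenation before the classifier are assumptions consistent with the description above, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


def make_r3d_stream(in_channels=3):
    """R3D-18 trained from scratch that returns 512-d clip features."""
    net = r3d_18(weights=None)
    if in_channels != 3:
        # Relationship streams take RGB plus two flag channels (assumption:
        # widen the stem convolution to accept the extra channels).
        net.stem[0] = nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                                stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
    net.fc = nn.Identity()  # expose the pooled 512-d feature instead of logits
    return net


def make_pos_stream(out_dim=128):
    """Small 1D CNN over per-frame 8D position vectors shaped (N, 8, T)."""
    return nn.Sequential(
        nn.Conv1d(8, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(64, out_dim, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(1))


class SP31D(nn.Module):
    """Five visual streams, two position streams and a linear classifier."""

    def __init__(self, num_classes=10, pos_dim=128):
        super().__init__()
        self.ho, self.orf = make_r3d_stream(5), make_r3d_stream(5)
        self.hand, self.obj, self.ref = (make_r3d_stream(3),
                                         make_r3d_stream(3), make_r3d_stream(3))
        self.pos_ho = make_pos_stream(pos_dim)
        self.pos_or = make_pos_stream(pos_dim)
        self.fc = nn.Linear(5 * 512 + 2 * pos_dim, num_classes)

    def forward(self, ho, orf, hand, obj, ref, p_ho, p_or):
        feats = [self.ho(ho), self.orf(orf), self.hand(hand), self.obj(obj),
                 self.ref(ref), self.pos_ho(p_ho), self.pos_or(p_or)]
        return self.fc(torch.cat(feats, dim=1))
```

Inputs are expected as (N, C, T, 112, 112) clips for the visual streams (C = 5 for the two relationship streams and 3 otherwise) and (N, 8, T) tensors for the position streams.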

Training. During the training stage, all frames are resized to 128×171 and then randomly cropped to 112×112 pixels. Random horizontal flipping is applied to videos for data augmentation.


During evaluation, frames are first resized and then centrally cropped. Our network is trained from scratch with the Adam optimizer [43] for 500 epochs. We set the learning rate to 0.001 and the batch size to 16.
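A minimal training-loop sketch reflecting the stated hyper-parameters (Adam, learning rate 0.001, batch size 16, 500 epochs); the cross-entropy loss and the batch layout are assumptions, as they are not spelled out in the text.

```python
import torch
import torch.nn as nn


def train(model, train_loader, epochs=500, lr=1e-3, device="cuda"):
    """Train SP(3+1)D from scratch with Adam, as described above."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # assumption: standard classification loss
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:  # assumption: each batch packs the 7 inputs
            inputs = [x.to(device) for x in batch["streams"]]
            labels = batch["label"].to(device)
            optimizer.zero_grad()
            loss = criterion(model(*inputs), labels)
            loss.backward()
            optimizer.step()
```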

B. Competing Methods

We compare our method with two types of competing methods. The first type is CNN based general action recognition methods, which take the entire video as input to learn spatio-temporal features.

• C3D [1], which has eight convolutional layers, five max-pooling layers, and two fully-connected layers, followed by a softmax layer.

• I3D [2], an inflated 3D convolutional neural network. It integrates 3D convolutions into the 2D Inception-V1 [44] to learn spatio-temporal information.

• R3D [5], a 3D ResNet [38]. It replaces 2D convolutions in the ResNet architecture with 3D convolutions.

• MC3 [5], a ResNet architecture that mixes 2D and 3D convolutions in its blocks.

• R(2+1)D [5], which explicitly factorizes the 3D convolutions of the R3D architecture into two separate operations, i.e., a 2D spatial convolution and a 1D temporal convolution.

Since these methods take entire videos as inputs, they do not benefit from object detection or human intention. We compare our method with them to show the limitation of these traditional methods in recognizing hand-object interaction.

The second type of competing methods learns structured video representations by decomposing each frame into several parts.

• STRG [6], also known as Space-Time Region Graph, which represents videos as space-time region graphs and applies graph convolutions for inference.

• STIN [7], the Spatial-Temporal Interaction Network, which operates on object-centric features. Specifically, each object feature consists of its own feature and the average of the other objects' features. STIN first performs spatial reasoning among the potential objects in each frame, and then performs temporal reasoning on top of the frame features.

The aforementioned methods usually apply object detectors [8] to generate bounding boxes of the objects of interest in each video frame. Similarly, we use the same object detector to detect the hands and objects involved in interactions.

To demonstrate the effectiveness of our method, we compare with the state-of-the-art methods in three scenarios. First, similar to STRG and STIN, we only use automatically detected hands and objects and do not use a reference (w/ hand and object). Second, on the basis of the automatically detected hands and objects, we further integrate references into interaction recognition (w/ hand, object and reference). Because the reference is subjective rather than objective, it is extremely difficult to detect it automatically from videos alone. In this scenario, we consider three methods to determine references.

• Nearest-item-based reference selection. We automatically select the nearest item to an object as the reference.

• Gazed-item-based reference selection. We use an SMI RED250, a screen-based eye tracker, to obtain the position of gaze on the reference. Specifically, as shown in Fig. 4, we first play the Something-1.3k videos on a computer and obtain the reference position by moving our sight from the top left to the bottom right of the reference. The SMI RED250 records the gaze trajectory, marked by circles. The top-left and bottom-right gaze points define a so-called gazed bounding box. Finally, the item whose detected bounding box has the largest Intersection over Union (IoU) with the gazed bounding box is selected as the reference (a sketch of both selection heuristics follows this list).

Fig. 4. Illustration of reference detection via gaze. The radius of the circles indicates gaze time. Because human intention is subjective, it is extremely difficult to automatically detect the reference from videos only. To address this problem, we leverage gaze for reference detection. Specifically, we use an eye tracker to obtain the position of gaze on a reference by moving sight three times from the top left to the bottom right of the reference. Then, the item whose detected bounding box has the largest IoU with the gazed bounding box is selected as the reference.

• Manually specified reference. In a collaborative AI environment, speech commands or manual specification can be more precise than gaze recognition for AI agents interacting with humans. Therefore, in this setting, we use manually specified, i.e., ground-truth, references.
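A sketch of the nearest-item and gaze-based selections described above; the IoU computation follows the standard definition, while using box centers for the nearest-item distance is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter + 1e-6)


def select_reference_by_gaze(detected_boxes, gaze_top_left, gaze_bottom_right):
    """Pick the detected item whose box has the largest IoU with the gazed box."""
    gazed_box = (gaze_top_left[0], gaze_top_left[1],
                 gaze_bottom_right[0], gaze_bottom_right[1])
    return max(detected_boxes, key=lambda b: iou(b, gazed_box))


def select_reference_by_distance(detected_boxes, obj_box):
    """Nearest-item selection: the box whose center is closest to the object."""
    def center(b):
        return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)
    cx, cy = center(obj_box)
    return min(detected_boxes,
               key=lambda b: (center(b)[0] - cx) ** 2 + (center(b)[1] - cy) ** 2)
```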

Finally, to provide the upper-bound performance, we also use the ground-truth bounding boxes of hands, objects and references in training and testing (w/ gt box).

C. Comparison with Competitors

We compare our SP(3+1)D with the competitors on our collected Something-1.3k dataset. Experimental results are reported in Table II. From these results, we observe:

(1) Among the convolutional neural networks that directly take entire videos as inputs, the performance of the ResNets (i.e., R3D, MC3 and R(2+1)D), I3D and C3D on hand-object interaction is consistent with their performance on action recognition [5], i.e., ResNets > I3D > C3D. This indicates that advances in traditional action recognition techniques also help hand-object interaction recognition.

Compared to the best model among the first type of competitors, i.e., R3D, SP(3+1)D with human intention increases the recognition accuracy by 10.1%, demonstrating the effectiveness of our method. When traditional action recognition methods are directly applied to atomic hand-object interaction, it is extremely challenging for them to focus on the task-related parts.

For the first group of competing methods, the accuracy across the ten interaction classes of Something-1.3k is extremely biased. For instance, C3D achieves 93% accuracy on the class “dropping something into something”, but it


TABLE II
RECOGNITION ACCURACY OF ATOMIC HAND-OBJECT INTERACTIONS ON THE SOMETHING-1.3K DATASET

Setting | Method | Take-out | Put-into | Put-onto | Put-front | Move-away | Move-left-right | Move-right-left | Drop-into | Drop-onto | Drop-front | Average
w/o structured video modeling | C3D [1] | 0.11 | 0.32 | 0.01 | 0.02 | 0.01 | 0.03 | 0.07 | 0.93 | 0.02 | 0.04 | 0.156
w/o structured video modeling | I3D [2] | 0.00 | 0.38 | 0.02 | 0.01 | 0.03 | 0.41 | 0.05 | 0.87 | 0.01 | 0.04 | 0.182
w/o structured video modeling | MC3 [5] | 0.03 | 0.33 | 0.22 | 0.00 | 0.01 | 0.15 | 0.06 | 0.85 | 0.46 | 0.07 | 0.218
w/o structured video modeling | R3D [5] | 0.09 | 0.41 | 0.68 | 0.04 | 0.02 | 0.07 | 0.03 | 0.57 | 0.81 | 0.04 | 0.276
w/o structured video modeling | R(2+1)D [5] | 0.07 | 0.29 | 0.54 | 0.05 | 0.02 | 0.09 | 0.01 | 0.28 | 0.74 | 0.02 | 0.211
w/ hand and object | STRG [6] | 0.26 | 0.18 | 0.44 | 0.20 | 0.41 | 0.25 | 0.33 | 0.51 | 0.28 | 0.12 | 0.298
w/ hand and object | STIN [7] | 0.17 | 0.30 | 0.42 | 0.33 | 0.29 | 0.19 | 0.04 | 0.32 | 0.35 | 0.31 | 0.272
w/ hand and object | SP(3+1)D (ours) | 0.28 | 0.23 | 0.31 | 0.23 | 0.37 | 0.31 | 0.49 | 0.32 | 0.35 | 0.33 | 0.322
w/ hand, object and reference | SP(3+1)D (nearest) | 0.37 | 0.30 | 0.28 | 0.26 | 0.39 | 0.27 | 0.52 | 0.34 | 0.36 | 0.29 | 0.338
w/ hand, object and reference | SP(3+1)D (gaze) | 0.38 | 0.32 | 0.31 | 0.30 | 0.47 | 0.29 | 0.54 | 0.28 | 0.36 | 0.31 | 0.356
w/ hand, object and reference | SP(3+1)D (manually) | 0.41 | 0.32 | 0.34 | 0.31 | 0.50 | 0.29 | 0.62 | 0.30 | 0.38 | 0.30 | 0.377
w/ gt box | SP(3+1)D (ours) | 0.50 | 0.41 | 0.54 | 0.34 | 0.52 | 0.40 | 0.63 | 0.62 | 0.53 | 0.33 | 0.482

fails (i.e., almost 0% accuracy) on many other interaction classes. This manifests that it is difficult to simply use entire videos for understanding interaction. Therefore, structured video representations are required. By contrast, the accuracy of STRG, STIN and our SP(3+1)D method on the ten classes is evenly distributed, demonstrating that these methods learn more representative spatio-temporal visual features.

(2) Compared with the second group of competing methods, which employ object detection results, our SP(3+1)D method (w/ detection and w/o intention) outperforms the STRG and STIN methods. In particular, our SP(3+1)D achieves an improvement of 2.4% in comparison to STRG. Note that we do not incorporate the human intention mechanism here for a fair comparison. This improvement mainly comes from the fact that our SP(3+1)D network uses the explicit and concise HOR pattern to describe hand-object interactions. Moreover, using the h∗, o∗, r∗ flags and position vectors exploited in SP(3+1)D facilitates network training and thus improves relationship reasoning.

Although STRG adopts graph convolutional networks (GCNs) to model multiple potential objects and relationships, GCNs do not distinguish hands, objects and references. Hence, our method has stronger representation ability than STRG and outperforms it by 5%. Because STIN only employs individual hand and object features, relationships are not well exploited. In contrast, both individual states and relationships are taken into consideration in SP(3+1)D. Thus, it can effectively model the interactions.

(3) Human intention, i.e., the reference, effectively improves interaction recognition. Compared to the “w/ hand and object” setting, our SP(3+1)D method achieves accuracy improvements of 1.6%, 3.4% and 5.5% with nearest, gaze and manually specified references, respectively. This demonstrates the necessity of introducing human intention into hand-object interaction recognition. Compared to STRG and STIN, our SP(3+1)D method with manually specified references improves the accuracy by 7.9% and 10.5%, respectively, showing the superiority of the proposed method.

(4) To provide the upper-bound accuracy of our method, we directly use the ground-truth hand, object and reference bounding boxes, as well as the ground-truth tracking results. Compared to the case of using human intention, where the hand and object are automatically detected and the reference

TABLE III
INFLUENCE OF THE FIVE STREAMS IN SP(3+1)D. THE SYMBOL “X” DENOTES THAT THE CORRESPONDING STREAM IS USED.

hand-object | object-reference | hand | object | reference | Accuracy
X | - | - | - | - | 0.307
- | X | - | - | - | 0.338
- | - | X | - | - | 0.154
- | - | - | X | - | 0.217
- | - | - | - | X | 0.195
X | - | X | X | - | 0.322
X | - | X | X | X | 0.358
X | X | - | - | - | 0.320
X | X | X | - | - | 0.318
X | X | - | X | - | 0.332
X | X | - | - | X | 0.361
X | X | X | X | X | 0.377

TABLE IV

INFLUENCE OF THE ADDITIONAL “HAND-REFERENCE” STREAM. THE SYMBOL “X” DENOTES THAT “HAND-REFERENCE” IS USED. FOR THE BASELINE, ALL FIVE PROPOSED STREAMS, i.e., “HAND”, “OBJECT”, “REFERENCE”, “HAND-OBJECT” AND “OBJECT-REFERENCE”, ARE USED.

hand-reference | Accuracy
- | 0.377
X | 0.372

is manually specified, exploiting the ground-truth bounding boxes of all three key factors further improves the final recognition accuracy by 11.5%. This implies that improving hand and object detection can further benefit our method. Note that our goal is to exploit human intention to reduce the ambiguity in understanding atomic hand-object interactions; providing more accurate and specific detectors for hands and objects will be our future work.

D. Ablation Study

1) Influence of the Five Streams in SP(3+1)D: The proposed SP(3+1)D consists of five streams, i.e., hand-object, object-reference, hand, object and reference. In this section, we investigate the influence of these streams on interaction recognition. For example, when we investigate the “hand-object” stream individually, the other four streams are removed from SP(3+1)D. The experimental results are reported in Table III, where the manually specified references


TABLE V
INFLUENCE OF h∗, o∗, r∗ FLAGS AND POSITION VECTORS ON RELATIONSHIP (i.e., “HAND-OBJECT” AND “OBJECT-REFERENCE”) REASONING. THE SYMBOL “X” DENOTES THAT THE CORRESPONDING INPUT IS USED.

I | h∗, o∗, r∗ flags | position vector | Accuracy
X | - | - | 0.256
X | X | - | 0.291
X | - | X | 0.286
X | X | X | 0.320

are used. As indicated by the experimental results, we observe the following phenomena:

When a single stream is employed (the top part of Table III), the “hand-object” and “object-reference” streams achieve the top-2 accuracies. This indicates the importance of these two relationships for atomic hand-object interaction recognition on the Something-1.3k dataset. Because the distance changes between an object and a reference provide critical cues for interaction recognition, the network using the position-based relationships achieves better accuracy, as shown in the top part of Table III.

Among the two relationships, the single “object-reference” stream achieves better accuracy (34%) than the single “hand-object” stream (30%). This implies that the relative position is more informative for “object-reference” than for “hand-object”, since the hand and object are often very close during interactions. Moreover, using references specified by human intention significantly reduces ambiguities in interaction recognition.

Among the cases of only using a single “hand”, “object” or “reference” stream, the “object” stream provides the most informative clue and thus obtains the highest accuracy. Since the “object” bounding boxes sometimes include parts of the “hand” and “reference” areas, richer information can be extracted, leading to better recognition performance.

In the second part of Table III, when integrating “reference”, our method achieves an improvement of 3.6%. By further integrating “object-reference”, as shown in the last part of Table III, our method achieves a further improvement of 1.9%. Compared to the case where the reference is not exploited, we achieve an improvement of 5.5%. This demonstrates the effectiveness of reference reasoning for improving recognition accuracy.

As indicated in the third part of Table III, compared to using only “hand-object” and “object-reference”, adding the “hand”, “object” or “reference” stream can further improve the recognition accuracy. This is because our network then pays more attention to the details of “hand”, “object” and “reference” and obtains complementary information, such as detailed appearance and pose changes.

2) Influence of “hand-reference”: Based on the design philosophy of our HOR pattern, the “hand-reference” relationship is unnecessary. We therefore additionally add a “hand-reference” stream to investigate the influence of this redundant stream. The proposed flag and 8D position vector techniques are also used in this stream. As shown in Table IV, the “hand-reference” relationship does not effectively help the proposed method.

3) Influence of h∗, o∗, r∗ Flags and Position Vectors on Relationship Reasoning: In the “hand-object” and “object-reference” streams, the h∗, o∗, r∗ pixel flags (note that these are different from the “hand”, “object” and “reference” streams) and position vectors are added to improve relationship reasoning. To study the impact of these flags and position vectors, we only examine the “hand-object” and “object-reference” streams, and the individual “hand”, “object” and “reference” streams are removed from SP(3+1)D. Experimental results are reported in Table V.

When only using the pixel color I, we achieve a recognition accuracy of 26%. By incorporating the h∗, o∗, r∗ flags, accuracy increases by 3%. This shows that these pixel flags facilitate relationship reasoning since more specific regions, including hands and references, are marked. By introducing the position vectors, we increase the accuracy by 2%. This confirms that position-based modeling explicitly exploits motion information for relationship reasoning. By simultaneously using the flags and position vectors, we achieve a recognition accuracy of 32%, which demonstrates the effectiveness of the h∗, o∗, r∗ flags and position vectors for “hand-object” and “object-reference” reasoning.

4) SP(3+1)D with Different Backbones: The proposed SP(3+1)D framework is based on convolutional neural networks. To explore how different backbones impact the performance of our SP(3+1)D, we conduct experiments with different backbones; the results are reported in Figure 6. The R3D-based SP(3+1)D achieves the highest recognition accuracy, which is consistent with the first part of Table II, where the R3D network achieves the highest accuracy among the first type of competitors. This observation indicates that a stronger backbone enables our SP(3+1)D to extract more robust spatio-temporal features.

E. Visualization of Recognition Results

As illustrated in Figure 5, we visualize a few interaction recognition examples to demonstrate the advantage of our proposed method. We compare our SP(3+1)D method with the best entire video representation based method among the competitors of the first type, i.e., R3D, and the best structured video representation based method among the competitors of the second type, i.e., STRG.

In the first example of Figure 5, neither R3D nor STRG correctly recognizes the interaction. Because STRG uses IoU to measure the relationship between objects, it recognizes the cell phone as being “in” the book. On the contrary, benefiting from the position vectors and state flags, our SP(3+1)D can better reason about the “object-reference” relationship. This demonstrates that, apart from explicitly modeling 2D position changes, relative-position relationships play an important role in understanding the visual dynamics of human-object interactions.

In the second example of Figure 5, R3D misunderstands the interaction and STRG fails to reason about the “object-reference” relationship. In contrast, our SP(3+1)D successfully recognizes the interaction, showing the effectiveness of our proposed method.


Fig. 5. Visualization of interaction recognition results. Each example shows the hand, object and reference bounding boxes together with the predictions of R3D, STRG and SP(3+1)D (e.g., R3D: “taking something out of something”; STRG: “moving something from left to right”; SP(3+1)D: “moving something away from something”). The green and red colors indicate that the answer is correct and incorrect, respectively.

Fig. 6. Accuracy of SP(3+1)D with different backbones (C3D, I3D, MC3, R3D and R(2+1)D).

In the third example in Figure 5, R3D incorrectly reasons about the “object-reference” relationship. This is because R3D takes the entire video as input and does not focus on the correct “reference”, i.e., the table. This demonstrates the importance and necessity of introducing human intention for interaction recognition. Using human intention as auxiliary information, we can significantly reduce the ambiguity in understanding interactions.

The fourth example in Figure 5 shows that R3D fails to understand the correct “object-reference” relationship and STRG predicts an incorrect “hand-reference” relationship. Benefiting from the proposed HOR pattern and SP(3+1)D framework, our method correctly understands the interaction.

The fifth and sixth examples in Figure 5 illustrate the influence of human intention, i.e., the reference, on hand-object recognition. In the fifth example, when we select the book as the “reference”, our SP(3+1)D predicts the “object-reference” relationship as “in front of”. When we select the book as the “reference” in the sixth example, the proposed method recognizes the “object-reference” relationship as “onto”. However, the R3D and STRG methods predict the same interactions when the reference changes. This demonstrates that our method effectively exploits human intention and adaptively predicts labels according to the intention.

VI. CONCLUSION

In this paper, we attempt to exploit human intention to better understand hand-object interactions. By treating human intention as a reference, we propose a novel pattern <hand-object, object-reference, hand, object, reference> (HOR) for atomic hand-object interaction. This pattern leverages the hand, object and reference states and their relationships to precisely describe an interaction. When applying the proposed HOR pattern to interaction recognition tasks, we significantly reduce the ambiguity in understanding hand-object interactions. Furthermore, we design a simple yet effective Spatially Part-based (3+1)D network to map HOR representations to interaction labels. Due to the lack of interaction datasets that contain annotations of hands, objects and references, we annotate a Something-1.3k dataset for evaluation. Extensive results show that our method outperforms the state-of-the-art methods, demonstrating the importance of introducing human intention in this task.


ACKNOWLEDGMENTS

This research is supported by the Agency for Science, Tech-nology and Research (A*STAR) under its AME Programmatic Funding Scheme (#A18A2b0046).

REFERENCES

[1] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in ICCV, 2015, pp. 4489–4497.

[2] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in CVPR, 2017, pp. 4724–4733.

[3] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in ICCV, 2017, pp. 5534–5542.

[4] K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?" in CVPR, 2018, pp. 6546–6555.

[5] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in CVPR, 2018, pp. 6450–6459.

[6] X. Wang and A. Gupta, "Videos as space-time region graphs," in ECCV, 2018, pp. 413–431.

[7] J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell, "Something-Else: Compositional action recognition with spatial-temporal interaction networks," in CVPR, 2020.

[8] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.

[9] A. Rosenfeld and S. Ullman, "Hand-object interaction and precise localization in transitive action recognition," in Conference on Computer and Robot Vision, 2016, pp. 148–155.

[10] B. Tekin, F. Bogo, and M. Pollefeys, "H+O: Unified egocentric recognition of 3D hand-object poses and interactions," in CVPR, 2019, pp. 4511–4520.

[11] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic, "The 'something something' video database for learning and evaluating visual common sense," in ICCV, 2017, pp. 5843–5851.

[12] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. Kankanhalli, "PSTNet: Point spatio-temporal convolution on point cloud sequences," in ICLR, 2021.

[13] X. Chang, Y. Yu, Y. Yang, and E. P. Xing, "Semantic pooling for complex event analysis in untrimmed videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1617–1632, 2017.

[14] H. Fan, X. Chang, D. Cheng, Y. Yang, D. Xu, and A. G. Hauptmann, "Complex event detection by identifying reliable shots from untrimmed videos," in ICCV, 2017, pp. 736–744.

[15] A. Shahroudy, J. Liu, T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in CVPR, 2016, pp. 1010–1019.

[16] X. Li and M. C. Chuah, "SBGAR: Semantics based group activity recognition," in ICCV, 2017, pp. 2895–2904.

[17] X. Li and M. C. Chuah, "ReHAR: Robust and efficient human activity recognition," in WACV, 2018, pp. 362–371.

[18] H. Fan, Z. Xu, L. Zhu, C. Yan, J. Ge, and Y. Yang, "Watching a small portion could be as good as watching all: Towards efficient video classification," in IJCAI, 2018, pp. 705–711.

[19] H. Fan and Y. Yang, "Person tube retrieval via language description," in AAAI, 2020, pp. 10754–10761.

[20] H. Fan, L. Zhu, and Y. Yang, "Cubic LSTMs for video prediction," in AAAI, 2019, pp. 8263–8270.

[21] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NeurIPS, 2014, pp. 568–576.

[22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool, "Temporal segment networks: Towards good practices for deep action recognition," in ECCV, 2016, pp. 20–36.

[23] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in CVPR, 2015, pp. 4694–4702.

[24] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[25] H. Zhang, Z. Kyaw, S. Chang, and T. Chua, "Visual translation embedding network for visual relation detection," in CVPR, 2017, pp. 3107–3115.

[26] H. Zhou, C. Zhang, and C. Hu, "Visual relationship detection with relative location mining," in ACM Multimedia, 2019, pp. 30–38.

[27] X. Shang, T. Ren, J. Guo, H. Zhang, and T. Chua, "Video visual relation detection," in ACM Multimedia, 2017, pp. 1300–1308.

[28] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. S. Kankanhalli, "Explainable video action reasoning via prior knowledge and state transitions," in ACM Multimedia, 2019, pp. 521–529.

[29] D. Di, X. Shang, W. Zhang, X. Yang, and T. Chua, "Multiple hypothesis video relation detection," in International Conference on Multimedia Big Data, 2019, pp. 287–291.

[30] Y. Liao, S. Liu, F. Wang, Y. Chen, C. Qian, and J. Feng, "PPDM: Parallel point detection and matching for real-time human-object interaction detection," in CVPR, 2020, pp. 479–487.

[31] M. Qi, E. Remelli, M. Salzmann, and P. Fua, "Unsupervised domain adaptation with temporal-consistent self-training for 3D hand-object joint reconstruction," CoRR, vol. abs/2012.11260, 2020.

[32] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu, "Learning human-object interactions by graph parsing neural networks," in ECCV, 2018, pp. 407–423.

[33] J. Ji, R. Krishna, F. Li, and J. C. Niebles, "Action Genome: Actions as composition of spatio-temporal scene graphs," in CVPR, 2020.

[34] S. Bambach, S. Lee, D. J. Crandall, and C. Yu, "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions," in ICCV, 2015, pp. 1949–1957.

[35] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, "The Kinetics human action video dataset," CoRR, vol. abs/1705.06950, 2017.

[36] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," IEEE Trans. Pattern Anal. Mach. Intell., 2019.

[37] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in CVPR, 2017.

[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.

[39] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.

[40] A. Bewley, Z. Ge, L. Ott, F. T. Ramos, and B. Upcroft, "Simple online and realtime tracking," in ICIP, 2016, pp. 3464–3468.

[41] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. S. Kankanhalli, "Unsupervised online video object segmentation with motion property understanding," IEEE Trans. Image Process., vol. 29, pp. 237–249, 2020.

[42] D. Yuan, X. Chang, P. Huang, Q. Liu, and Z. He, "Self-supervised deep correlation tracking," IEEE Trans. Image Process., vol. 30, pp. 976–985, 2021.

[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.

[44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015, pp. 1–9.
