Towards Accurate Oriented Object Detection in Aerial Images with Adaptive Multi-level Feature Fusion∗

PEINING ZHEN,

Shanghai Jiao Tong University, China

SHUQI WANG,

SUMING ZHANG,

Beijing Institute of Astronautical Systems Engineering, China

XIAOTAO YAN,

WEI WANG,

ZHIGANG JI,

HAI-BAO CHEN,

Detecting objects in aerial images is a long-standing and challenging problem since the objects in aerial images vary dramatically in size and orientation. Most existing neural network based methods are not robust enough to provide accurate oriented object detection results in aerial images since they do not consider the correlations between different levels and scales of features. In this paper, we propose a novel two-stage network-based detector with adaptive feature fusion towards highly accurate oriented object detection in aerial images, named AFF-Det. First, a multi-scale feature fusion module (MSFF) is built on the top layer of the extracted feature pyramids to mitigate the semantic information loss in the small-scale features. We also propose a cascaded oriented bounding box regression method to transform the horizontal proposals into oriented ones. Then the transformed proposals are assigned to all feature pyramid network (FPN) levels and aggregated by the weighted RoI feature aggregation (WRFA) module. The above modules can adaptively enhance the feature representations in different stages of the network based on the attention mechanism. Finally, a rotated decoupled-RCNN head is introduced to obtain the classification and localization results. Extensive experiments are conducted on the DOTA and HRSC2016 datasets to demonstrate the advantages of our proposed AFF-Det. The best detection results achieve 80.73% mAP and 90.48% mAP respectively on these two datasets, outperforming recent state-of-the-art methods.

CCS Concepts: • Computing methodologies → Neural networks; Object recognition.

Additional Key Words and Phrases: Remote sensing images, Aerial images, Oriented object detection, Convolutional neural network.

∗This work is supported by the National Key Research and Development Program of China under grant 2019YFB2205005. Corresponding author: Hai-Bao Chen.

Authors’ addresses: Peining Zhen, [email protected], Shanghai Jiao Tong University, Shanghai, China, 200240; Shuqi Wang, sqwang026@sjtu.edu.cn, Shanghai Jiao Tong University, Shanghai, China, 200240; Suming Zhang, [email protected], Beijing Institute of Astronautical Systems Engineering, Beijing, China, 100076; Xiaotao Yan, [email protected], Beijing Institute of Astronautical Systems Engineering, Beijing, China, 100076; Wei Wang, [email protected], Beijing Institute of Astronautical Systems Engineering, Beijing, China, 100076; Zhigang Ji, [email protected], Shanghai Jiao Tong University, Shanghai, China, 200240; Hai-Bao Chen, [email protected], Shanghai Jiao Tong University, Shanghai, China, 200240.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

1551-6857/2022/7-ART1 $15.00 https://doi.org/10.1145/3513133


1 INTRODUCTION

Recent years have witnessed great success in remote sensing technologies with the rapid development of deep learning and neural networks [31, 37, 39]. As one of the most important topics in real-world computer vision tasks, object detection in aerial images can meet a wide range of requirements for agricultural, commercial, and geological applications. Detecting objects in aerial images is more challenging than detecting them in natural images because the objects have diverse sizes and arbitrary orientations rather than horizontal shapes. Despite the difficulty, many studies tackle these challenges based on convolutional neural networks (CNN) [6, 35, 46].

Fig. 1. The representation of oriented bounding box (OBB) and horizontal bounding box (HBB). We always keep the width w as the long side and the height h as the short side of an OBB.

In the past decade, CNN based object detection methods have achieved tremendous success in natural image object detection in terms of accuracy and speed [10, 17, 26]. Nowadays, there are mainly two types of object detectors, namely one-stage detectors and two-stage detectors. One-stage detectors such as YOLO [22], SSD [16] and RetinaNet [14] get rid of the heavy region proposal network (RPN) [23] and thus achieve higher speed than two-stage detectors. However, they usually get lower detection accuracy than two-stage detectors. In general, these methods are based on the horizontal bounding boxes (HBB), which are more suitable for natural images. The objects in aerial images have arbitrary orientations, so the HBBs cannot properly regress the bounding boxes (BBoxes) of target objects. To address this problem, the proposed two-stage AFF-Det leverages the oriented bounding box (OBB) regression branch to locate the oriented objects. The two types of bounding boxes are graphically shown in Fig. 1. We introduce an additional parameter θ to represent the direction of an oriented object. Moreover, our AFF-Det proposes a cascaded OBB regression (COR) structure to transform the horizontal proposals into oriented ones. Based on the oriented proposals, the misalignment between region features and oriented targets can be eliminated. As a result, we can achieve better detection results than the HBB based detectors.

Conventional detectors extract aerial image features following a bottom-up pathway with the backbone network. After feature extraction from the backbone network, the feature maps at the top level are scaled to a very small size. Moreover, the channels of these top-level feature maps are commonly reduced from 2048 to 256 before the FPN is built in a top-down manner [13]. Consequently, the highest level features suffer from the most severe semantic information loss. To tackle this problem, we adaptively enhance the top-level feature representations with multi-scale aggregated features based on the self-attention mechanism. Then the enhanced multi-scale features are fused and propagated to build FPN. The proposed method can significantly mitigate the semantic information loss in the area of small and clustered objects.

Moreover, FPN based two-stage detectors usually predict Regions of Interest (RoIs) from different levels of feature maps in FPN, and RoIAlign [8] is adopted to extract the RoI features. In general, the RoIs will be assigned to different levels of FPN with respect to the RoI scales. Specifically, a small-scale RoI tends to be assigned to a lower FPN feature level and vice versa. However, the assignment function in [13] is obtained empirically with hard-coded hyper-parameters, which are not adaptive to the aerial images. Since the aerial images are large and have scale-varied objects, the pre-defined assignment function based on ImageNet is not suitable to train our model. In the proposed AFF-Det detector, we propose to map the RoIs to all FPN levels and aggregate the extracted features without depending on the assignment function. Furthermore, we also adaptively fuse and enhance the RoI features from different levels with different weights. The proposed methods can adaptively enhance the features in the network flow and thus we can achieve higher detection accuracy than previous methods.

In this paper, we propose a two-stage Faster-RCNN based detector for oriented object detection in aerial images. First, to adaptively enhance the backbone extracted features and mitigate the semantic information loss, a multi-scale feature fusion (MSFF) module is built on the top level of the extracted feature pyramids. After the RPN, the generated horizontal proposals are first transformed into oriented ones through the cascaded OBB regression head; then the rotated proposals are assigned to all FPN levels to extract rotation-invariant RoI features. These features are adaptively fused by the weighted RoI feature aggregation (WRFA) module. Finally, a rotated decoupled-RCNN head is introduced to classify and locate the oriented objects. Different from previous methods [8, 23, 24, 43], the rotated decoupled-RCNN head does not have shared fully-connected (FC) layers for processing the extracted features; it owns two separate network branches to obtain the classification and localization results of oriented objects. Moreover, the rotated head also has an additional parameter θ to regress the directions of oriented objects while the traditional methods only have four parameters to locate the HBBs of the targets. Experimental results on the DOTA and HRSC2016 datasets show that our proposed method can achieve state-of-the-art detection accuracy of 80.73% mAP and 90.48% mAP respectively. In addition, the results of ablation studies demonstrate the effectiveness of our proposed methods. The main contributions of this paper can be summarized as follows:

• A novel two-stage object detector is proposed towards highly accurate oriented object detection in aerial images. The cascaded OBB regression (COR) method is proposed to alleviate the mismatches between horizontal proposals and oriented ground-truth. Thus the region features can be extracted aligned with oriented targets and the results can be significantly boosted.

• We propose the MSFF and WRFA modules that can adaptively fuse the multi-level aerial image features with different weights to mitigate the semantic and context information loss.

• We introduce the rotated decoupled-RCNN head to perform the classification and localization tasks separately for the oriented object detection, which can achieve better performance compared with the coupled-RCNN head.

• A lightweight version of AFF-Det, named AFF-Det-Lite, is proposed to balance the detection performance and the model complexity. AFF-Det-Lite can achieve very competitive accuracy with significantly reduced computational cost on the DOTA benchmark.

• The experimental results demonstrate that the proposed AFF-Det can achieve state-of-the-art accuracy on the widely used DOTA and HRSC2016 datasets.

The rest of this paper is organized as follows. In Section 2, recent works about oriented object detection are reviewed. Section 3 introduces the proposed AFF-Det for oriented object detection in aerial images. Ablation studies are provided in Section 4.2 to verify the effectiveness of our proposed methods. In Section 4.3, we compare the proposed AFF-Det with state-of-the-art methods on two challenging benchmarks. Finally, conclusions are drawn in Section 5.


2 RELATED WORK

The general object detectors can be mainly divided into two types: one-stage detectors and two-stage detectors. In this section, we review the methods proposed to solve the oriented object detection problem based on these two types of detectors.

2.1 Two-stage Detectors

Many works focus on oriented object detection based on two-stage detectors. In [31], the authors first propose the Faster-RCNN OBB network that detects the oriented objects in aerial images with oriented bounding boxes. Since the two-stage detectors are based on the RPN, the RoI transformer [4] designs and predicts rotated anchors to generate rotated region proposals. Furthermore, ReDet [7] combines rotation-equivariant features and rotation-invariant RoIAlign to formulate a rotation-equivariant detector based on the RoI transformer. Inspired by Cascaded-RCNN [2], our work proposes a regression network for generating rotated proposals. The rotated proposals can eliminate the mismatch between oriented objects and extracted features, thereby significantly increasing the detection accuracy. In [39], the authors propose SCRDet for robust oriented object detection with equally weighted RoI features. In contrast, our proposed method adaptively assigns different weights to the RoI features and obtains better performance. The work [43] proposes PLCNet based on Faster-RCNN, which extracts and concatenates the pooled RoI features to compensate for the information loss. However, they treat the features from different FPN levels equally while our proposed method gives different weights to distinct levels of features.

In [5, 33], the point-based representation is proposed to encode the location of oriented objects. Different from the works [5, 33], our method leverages the angle θ to represent the orientations of the objects, which is more intuitive and easier to train. Furthermore, the works [27, 32, 44] integrate the segmentation mask prediction with the bounding box prediction to propagate and enhance the semantic information in the network. Although their results are promising, the networks in [27, 32, 44] have to predict the segmentation masks, which greatly increases the computational complexity of the models. The above-mentioned two-stage detectors are dedicated to improving the object representations (e.g. BBox representation), while our method focuses on improving the feature representations throughout the proposed detector.

2.2 One-stage Detectors

In addition, some other works explore one-stage or anchor-free methods for oriented object detection. In SCRDet++ [38], the authors propose the instance level denoising technique towards more robust detection of arbitrarily-rotated objects. However, the denoised features and horizontal anchors still suffer from the misalignment issue.

Inspired by the deformable convolution, S²A-Net [6] proposes to solve the misalignment problem between features and anchors based on the alignment convolution. In S²A-Net, high-level features are extracted aligned with rotated anchor boxes. Similarly, R³Det [36] aligns the features between rotated anchors and horizontal receptive fields via feature refinement and reconstruction. Another way to optimize the one-stage detectors is to smooth the learning objective function. In the works [34, 35], the authors propose two label smoothing methods named CSL and DCL, which treat angle prediction as a classification task to mitigate the boundary discontinuity problems. However, large backbones are still required to achieve high accuracy. In [21], the authors propose modulated rotation loss to address the loss discontinuity and regression inconsistency problems. Moreover, by converting the rotated BBoxes into 2D Gaussian distributions, the GWD [37] and KLD [40] can compute regression losses as the distance between two Gaussian distributions. GWD and KLD losses can solve the boundary discontinuity problem raised in [37].

In addition, the IENet [15] leverages the self-attention module to enhance the features in the detection head. Although the IENet achieves promising running speed, its detection accuracy is far from satisfactory. In [20], the authors improve the CenterNet [45] baseline with the feature selection module that can adjust receptive fields in accordance with shapes and orientations of the oriented targets. In [3], a pixel-wise loss (PIoU loss) is proposed to exploit both the angle and IoU for accurate oriented object detection. However, the combination of CenterNet and PIoU loss still suffers from extremely low accuracy even with a deep backbone network. To sum up, the main drawback of the one-stage detectors is that they suffer from low detection accuracy even with large and deep backbone networks. Different from the above methods, our proposed method is an anchor-based two-stage detector that can achieve state-of-the-art detection accuracy while maintaining high model efficiency.

3 THE PROPOSED METHOD

3.1 Network Pipeline

The network pipeline of the proposed AFF-Det is shown in Fig. 2. It is a two-stage detector composed of a backbone network, a feature pyramid network (FPN), the proposed MSFF and WRFA modules, and the cascaded OBB regression head with the rotated decoupled-RCNN. The widely used ResNet [9] is adopted in our AFF-Det to produce the different levels of feature maps, which are denoted by $\{C_2, C_3, C_4, C_5\}$ in Fig. 2. We then build the FPN [13] using these feature maps and represent the pyramid levels as $\{P_2, P_3, P_4, P_5\}$. All the pyramid features for RoI feature extraction have 256 channels. Our proposed MSFF module enhances the top-level feature maps $C_5$, which have the most severe context and semantic information loss. A cascaded regression structure is built after FPN to learn the rotated proposals, which can better model the oriented objects. In the WRFA module, the Rotated-RoIAlign [4] is leveraged to extract features from each level of the FPN based on the rotated RoIs. These features are fused adaptively with different weights. Then the aggregated features from the WRFA module are fed into the decoupled-RCNN head for oriented object detection. The convolution layers and FC layers are leveraged separately for object classification and localization in the decoupled-RCNN head.

Fig. 2. The overall architecture of our proposed framework. The proposed cascaded OBB regression structure is omitted in this figure for simplicity.

3.2 Multi-scale Feature Fusion Module

By fusing the features at different levels, FPN adopts both the high-resolution low-dimension and the low-resolution high-dimension features to achieve accurate predictions. To build FPN, all features $\{C_2, C_3, C_4, C_5\}$ extracted from the backbone are passed through a 1 × 1 convolution layer to reduce the feature channels to 256. FPN then fuses the features in a top-down pathway by upsampling and adding the multi-scale features element-wise from $P_5$ to $P_2$. However, this method is sub-optimal for small object detection in aerial images. There exist extremely small objects in aerial images whose corresponding features will be scaled down to very small sizes at pyramid level $C_5$. Moreover, to build FPN, the channels of $C_5$ are usually reduced from 2048 to 256 in $P_5$. For these two reasons, $P_5$ in FPN suffers significant semantic information loss, which hinders accurate detection.

Fig. 3. The network architecture of our proposed multi-scale feature fusion (MSFF) module. The ReLU activation functions after convolution layers are omitted for simplicity in the figure.

To mitigate the information loss of the single-scale feature $P_5$, we propose the multi-scale feature fusion (MSFF) module. The network architecture of the proposed MSFF module is graphically shown in Fig. 3. We first generate multi-scale features by applying adaptive average pooling (AAP) to the original single-scale feature $C_5$. Given feature maps of size $H \times W$, the generation of multi-scale features can be represented as follows:

$$H_i \times W_i = (S_i \times H) \times (S_i \times W), \tag{1}$$

where $S_i$ is the manually defined feature scale, $i \in \{1, \ldots, n\}$, and $n$ is the number of different feature scales. The adaptive average pooling can enhance and aggregate the features to different scales. Then we upsample the multi-scale enhanced features through bilinear interpolation, like FPN [13], so that they have the same shape $H \times W$ as $C_5$. The number of upsampling operations is equal to the number of feature scales $n$.
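The pooling step of Eq. (1) can be illustrated with a minimal pure-Python sketch on a single-channel map. The scale values, the rounding of $S_i \times H$ to an integer, and the bin boundaries (floor and ceiling of the proportional split, as commonly used for adaptive pooling) are assumptions made for illustration; the function names `scaled_sizes` and `adaptive_avg_pool2d` are hypothetical and not taken from the paper.

```python
def scaled_sizes(height, width, scales):
    """Eq. (1): H_i x W_i = (S_i * H) x (S_i * W).

    Rounding to an integer of at least 1 is an assumption of this sketch.
    """
    return [(max(1, round(s * height)), max(1, round(s * width))) for s in scales]


def adaptive_avg_pool2d(fmap, out_h, out_w):
    """Average-pool a 2D list `fmap` (H x W) down to out_h x out_w.

    Bin k along an axis of length n covers [floor(k*n/out), ceil((k+1)*n/out)).
    """
    in_h, in_w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = -((-(i + 1) * in_h) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = -((-(j + 1) * in_w) // out_w)
            window = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical scales for a toy 4 x 4 map (the scales used in the paper are not assumed here).
toy_map = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
multi_scale = [adaptive_avg_pool2d(toy_map, h, w)
               for h, w in scaled_sizes(4, 4, (0.25, 0.5, 1.0))]
```

Each pooled map would then be upsampled back to $H \times W$ before fusion; the interpolation step is left out of this sketch.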

In the following step, the features are concatenated and separated into two branches as shown in Fig. 3. The first branch leverages 1 × 1 and 3 × 3 convolution layers to reduce the feature channels to the number of scales $n$; then the sigmoid function is employed along the channel dimension to normalize the values of each channel into (0, 1) adaptively. The normalized values now represent the weights of each feature scale respectively. The other branch first multiplies the original features by the corresponding weights channel-wise and then aggregates them to get the final enhanced fused features. Finally, we add the multi-scale fused features into the original feature branch from $C_5$ to $P_5$ as a residual connection. The experimental results show that the MSFF module can significantly improve the detection accuracy since semantic information loss is compensated by the weighted multi-scale information. We select the number of scales $n$ experimentally and the results are given in ablation study 4.2.2.
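The weighting and residual steps described above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the per-scale weight logits are taken as given (in the module they come from the 1 × 1 and 3 × 3 convolutions, which are not modelled here), all maps are assumed to have already been upsampled to the same $C \times H \times W$ shape, and the residual is modelled as a plain addition to a base map of that shape. The name `msff_fuse` is hypothetical.

```python
import math


def msff_fuse(base, scale_feats, weight_logits):
    """Sigmoid-gated sum of n scale features added to a base map.

    base:          [C][H][W] nested lists (the original feature branch)
    scale_feats:   n maps, each [C][H][W] (already upsampled to H x W)
    weight_logits: n maps, each [H][W] (pre-sigmoid scores, one per scale)
    """
    channels, height, width = len(base), len(base[0]), len(base[0][0])
    fused = [[row[:] for row in ch] for ch in base]  # copy, start from the residual branch
    for feat, logits in zip(scale_feats, weight_logits):
        for l in range(height):
            for m in range(width):
                gate = 1.0 / (1.0 + math.exp(-logits[l][m]))  # weight in (0, 1)
                for c in range(channels):
                    fused[c][l][m] += gate * feat[c][l][m]
    return fused
```

With all logits equal to zero every scale receives the weight 0.5, so the output reduces to the base map plus half the sum of the scale features.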

3.3 Cascaded Oriented Bounding Box Regression

In general, prior two-stage detectors generate horizontal proposals for detection heads. However, relying on horizontal proposals will introduce misalignment between RoI features and oriented objects in aerial images. It is difficult for a single detection head to perform uniformly with horizontal proposals and oriented target BBoxes. Inspired by the work Cascaded-RCNN [2], we propose to decompose the difficult OBB regression task into two sequential steps as shown in Fig. 4.

In this work, we propose to regress the OBBs in a coarse-to-fine manner. In the first coarse regression step, we feed the horizontal proposals (HB0) into the first vanilla coupled-RCNN detection head H1 for regressing them with the oriented ground-truth. The first detection head will predict the angle difference between a horizontal RoI (HRoI) and an oriented ground-truth. After the first detection head, we can transform the horizontal RoIs into rotated RoIs (RRoI) with the predicted offsets. The detailed mathematical definitions of offsets are given in section 3.5.2. Compared with HRoIs, the RRoIs have fewer mismatches with the oriented ground-truth. Then in the refinement stage, the RRoIs are fed into the proposed rotated decoupled-RCNN head H2 for region feature extraction and final oriented object detection.

Fig. 4. The diagram of the proposed cascaded oriented bounding box regression (COR) method. "I" is input image and "C" is classification. "HB0" represents the horizontal proposals; "RB1" represents the rotated BBox outputs of the vanilla coupled-RCNN detection head "H1"; "RB2" denotes the refined rotated BBox outputs of the rotated decoupled-RCNN head "H2".

Since RoIs are oriented in the rotated decoupled-RCNN head H2, using the normal RoIAlign operation to extract RRoI features from FPN causes a performance drop. In the rotated decoupled-RCNN head, we adopt the Rotated-RoIAlign (R-RoIAlign) operation for rotation-invariant feature extraction. Suppose we have the feature map $F$ of size $C \times H \times W$ and an RRoI $(x_r, y_r, w_r, h_r, \theta_r)$. Following RoIAlign, to produce features $\mathcal{F}$ with unified size $C \times L \times L$, an RRoI is divided into $L \times L$ bins. Then each value in $\mathcal{F}$ can be obtained as follows:

$$\mathcal{F}_c(i, j) = \sum_{(a, b) \in B(i, j)} F_{c,i,j}(\Theta(a, b)) / n_{ij}, \tag{2}$$

where $(i, j)$ $(0 \le i, j < L)$ represents the index of each bin $B$, $\Theta$ is the transform function, and $n_{ij}$ is the number of sampling points in each bin. The bin $B(i, j)$ is the coordinate set $\{[\, i \frac{w_r}{L} + u_a \frac{w_r}{L \times (n_{ij}/2 + 1)},\; j \frac{h_r}{L} + u_b \frac{h_r}{L \times (n_{ij}/2 + 1)} \,] \mid u_a, u_b = 1, \ldots, n_{ij}/2\}$. Each point $(a, b)$ in the bin $B$ is rotated to $(a_r, b_r)$ within the RRoI area with the following transform function:

$$\begin{pmatrix} a_r \\ b_r \end{pmatrix} = \begin{pmatrix} \cos\Delta\theta & -\sin\Delta\theta \\ \sin\Delta\theta & \cos\Delta\theta \end{pmatrix} \begin{pmatrix} a - w_r/2 \\ b - h_r/2 \end{pmatrix} + \begin{pmatrix} x_r \\ y_r \end{pmatrix}. \tag{3}$$

Here $\Delta\theta$ denotes the angle difference between an RRoI and its corresponding HRoI.
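As a rough illustration of the sampling geometry, the bin coordinate set and the transform of Eq. (3) can be written in a few lines of Python. The function names are hypothetical, the interpolation of feature values at the rotated points is omitted, and $n_{ij}$ is assumed to be even so that $n_{ij}/2$ is an integer.

```python
import math


def bin_sample_points(i, j, w_r, h_r, num_bins, n_ij):
    """Local (unrotated) sampling coordinates of bin B(i, j), following the set definition in the text."""
    half = n_ij // 2          # u_a, u_b run over 1 .. n_ij / 2
    step = half + 1
    return [
        (i * w_r / num_bins + u_a * w_r / (num_bins * step),
         j * h_r / num_bins + u_b * h_r / (num_bins * step))
        for u_a in range(1, half + 1)
        for u_b in range(1, half + 1)
    ]


def rotate_into_rroi(a, b, rroi, delta_theta):
    """Eq. (3): shift (a, b) to the RRoI centre frame, rotate by delta_theta, translate to (x_r, y_r)."""
    x_r, y_r, w_r, h_r, _theta_r = rroi
    da, db = a - w_r / 2.0, b - h_r / 2.0
    cos_t, sin_t = math.cos(delta_theta), math.sin(delta_theta)
    return (cos_t * da - sin_t * db + x_r,
            sin_t * da + cos_t * db + y_r)
```

For example, with `rroi = (10.0, 20.0, 8.0, 4.0, 0.0)` and `delta_theta = 0.0`, the local centre point `(4.0, 2.0)` maps onto the RRoI centre `(10.0, 20.0)`.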

3.4 Weighted RoI Feature Aggregation Module

In AFF-Det, an FPN is introduced for building high-dimensional semantic features at different scales in a top-down pathway as shown in Fig. 2. In FPN, different feature pyramid levels have different receptive fields and semantic information, which are beneficial for handling the scale variation problems in object detection. Consequently, the FPN can significantly increase the detection accuracy of small objects such as those in aerial images. For the classic region-based detectors like Faster-RCNN [23] or Mask-RCNN [8], RoIs of different scales are assigned to different pyramid levels $k$ in FPN to extract features following Eq. (4):

$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh}/224\right) \right\rfloor, \tag{4}$$

where $k_0$ is 4, and $w$ and $h$ are the width and height of each RoI. 224 is the canonical ImageNet pre-training size. However, this assignment function is not suitable for aerial images for two reasons. First, the training aerial images are much larger than ImageNet images. Therefore the parameter 224 in Eq. (4) is not suitable for AFF-Det. Second, the parameters in Eq. (4) are hard-coded and thus not adaptive to the image or feature variations.
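For reference, the hard-coded assignment of Eq. (4) can be evaluated with the short Python sketch below. The clamping of $k$ to the available levels 2 to 5 is an assumption added here so that every RoI receives a valid level; the function name `fpn_level` is hypothetical.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Eq. (4): k = floor(k0 + log2(sqrt(w * h) / canonical)), clamped to [k_min, k_max] (clamping assumed)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(k_max, max(k_min, k))


# A 224 x 224 RoI lands on level 4, while every RoI with sqrt(w * h) below 112 falls to level 2.
levels = [fpn_level(s, s) for s in (32, 112, 224, 448, 1000)]
```

The sketch makes the dependence on the constants 224 and $k_0$ explicit: small aerial objects are all sent to the lowest level regardless of image content, which is the rigidity that assigning each RRoI to all levels is meant to avoid.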

Fig. 5. The network architecture of our proposed weighted RoI feature aggregation (WRFA) module. The ReLU activation functions after convolution layers are omitted for simplicity in the figure.

In this work, to eliminate the influence of manual and canonical design, we propose the weighted RoI feature aggregation (WRFA) module in the rotated decoupled-RCNN head. Different from previous methods [4, 8, 23], we assign the transformed RRoIs to all FPN levels rather than following the manually designed assignment function Eq. (4). The architecture of our proposed WRFA module is shown in Fig. 5. We first assign an RRoI to all pyramid levels $\{P_2, P_3, P_4, P_5\}$ for feature extraction using R-RoIAlign. Then the features $F^i \in \mathbb{R}^{C \times H \times W}$ from each level $i \in \{2, 3, 4, 5\}$ are concatenated and separated into two branches, which can be formulated as follows:

$$\bar{F} = F^2 \circ F^3 \circ F^4 \circ F^5, \tag{5}$$

where $\bar{F} \in \mathbb{R}^{4C \times H \times W}$ is the concatenated features and $\circ$ is the concatenation operation.

We introduce the channel self-attention layers to guide the different pyramid features for spotlighting meaningful channels and further repressing uninformative ones. The concatenated features are first pooled by global average pooling to generate the statistics $z_c$ of each feature channel. This procedure is formally described as follows:

$$z_c = \frac{1}{H \times W} \sum_{l=1}^{H} \sum_{m=1}^{W} \bar{F}(c, l, m), \tag{6}$$

where $c$ indexes the channels of $\bar{F}$. After embedding the global information into the channel descriptors ($z_c$), we can fully capture the dependencies between different feature channels. The features first pass through two 1 × 1 convolution layers for squeezing and extending the feature channels, which perform like a bottleneck module. The squeeze and excitation convolutions have two advantages: 1) they increase the non-linearity and aid generalization in the module compared with a single convolution layer; 2) they limit the module complexity with an acceptable parameter increase compared with no channel reduction. Then we adopt a simple gating mechanism with a sigmoid function to normalize and generate weights for each feature pyramid channel. The computation procedure is summarized as follows:

$$W(\bar{F}) = \delta(I_{1 \times 1}(P_{Avg}(\bar{F}))), \tag{7}$$

where $\delta$ represents the sigmoid function, $I_{1 \times 1}$ denotes the 1 × 1 convolution layers, and $P_{Avg}$ represents the global average pooling function for generating the statistics $z_c$. $W(\bar{F})$ with size $4 \times (C \times 1 \times 1)$ is the learnable weights for each feature pyramid channel. Each value in the weights is within (0, 1), which represents the importance of different channels.

Finally, we can adaptively fuse the different levels of features by summation to get the aggregated feature maps $F^o$. The computation process can be expressed as follows:

$$F^o = \sum_{i=2}^{5} W(\bar{F}^i) \otimes \bar{F}^i, \tag{8}$$

where $i \in \{2, 3, 4, 5\}$ is the feature pyramid level and $\bar{F}^i$ enumerates $\{F^2, F^3, F^4, F^5\}$. $\otimes$ is the element-wise multiplication. The proposed WRFA module is able to extract features of the RRoIs from each feature level adaptively and learn their correlations for context information enhancement.
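The computation in Eqs. (5) to (8) can be traced with a small pure-Python sketch. It is a simplified illustration, not the authors' implementation: the two 1 × 1 convolutions act on 1 × 1 maps and are therefore written as bias-free matrix-vector products (the omission of biases is an assumption), and `wrfa`, `w_squeeze` and `w_excite` are hypothetical names.

```python
import math


def _matvec(matrix, vector):
    """Apply a 1 x 1 convolution to a C x 1 x 1 tensor, i.e. a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def wrfa(level_feats, w_squeeze, w_excite):
    """Fuse per-level RoI features, each [C][H][W], into a single [C][H][W] map.

    w_squeeze: (C/4) x (4C) weights of the first 1 x 1 conv (followed by ReLU)
    w_excite:  (4C) x (C/4) weights of the second 1 x 1 conv (followed by sigmoid)
    """
    channels = len(level_feats[0])
    height, width = len(level_feats[0][0]), len(level_feats[0][0][0])
    # Eq. (5): concatenate the levels along the channel axis.
    concat = [ch for feat in level_feats for ch in feat]
    # Eq. (6): global average pooling gives one statistic z_c per channel.
    z = [sum(sum(row) for row in ch) / (height * width) for ch in concat]
    # Eq. (7): squeeze, ReLU, excite, sigmoid.
    hidden = [max(0.0, v) for v in _matvec(w_squeeze, z)]
    weights = [1.0 / (1.0 + math.exp(-v)) for v in _matvec(w_excite, hidden)]
    # Eq. (8): channel-wise weighting of every level followed by summation over levels.
    fused = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i, feat in enumerate(level_feats):
        for c in range(channels):
            wt = weights[i * channels + c]
            for l in range(height):
                for m in range(width):
                    fused[c][l][m] += wt * feat[c][l][m]
    return fused, weights
```

With all-zero convolution weights every gate equals 0.5, so the module degenerates to half the plain sum of the four levels; trained weights would instead emphasise the levels and channels that are informative for a given RRoI.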

3.5 Rotated Decoupled-RCNN Head

3.5.1 Network structure. Inspired by the previous research [30], we introduce the decoupled-RCNN head that has separate branches for oriented object classification and localization. In the Faster-RCNN, the coupled detection head shares several convolution layers (Conv layer) or fully-connected layers (FC layer) for classification and BBox regression. However, the Conv layers and FC layers have different sensitivity for different detection tasks [30]. The Conv layers can encode the BBoxes more precisely while the FC layers provide more accurate classification results. The comparison between coupled-RCNN head and decoupled-RCNN head is graphically shown in Fig. 6.

Fig. 6. Comparison between coupled-RCNN head and decoupled-RCNN head. (a) The coupled-RCNN head with 2 shared FC layers; (b) The decoupled-RCNN head which has an FC layer based branch for classification and a Conv layer based branch for localization.

In our rotated decoupled-RCNN head, we leverage FC layers for oriented object classification and Conv layers for oriented BBox regression. As shown in Figs. 2 and 6, the classification branch is a stack of $K_c$ FC layers, each followed by a ReLU activation function. Since FC layers significantly increase the model complexity, $K_c$ is set to 2 following the Faster-RCNN FPN implementation in [13]. In contrast, the regression branch is a stack of Conv layers. Different from vanilla convolution layers such as those in the VGG network, the proposed regression branch leverages several ResBlocks [9] as shown in Fig. 7. The regression branch has at least one stem ResBlock followed by $K_r$ stage ResBlocks. The stem ResBlock adopts a 1 × 1 Conv layer to increase the input channels from 256 to 1024 in our implementation. The dotted box in Fig. 7 shows the 1 × 1 Conv layer that only works in the stem block. In the stage ResBlocks, there is a direct shortcut connection from input to output without the 1 × 1 Conv layer. Based on our experiments, the decoupled head can provide more accurate detection results compared with the coupled head.

Fig. 7. The ResBlock used in the proposed decoupled-RCNN head for OBB regression.

3.5.2 BBox regression.In the normal CNN based detectors for natural image object detection, the detection head usually adopts the horizontal bounding boxes to encode the location of an object. The HBB of an object can be expressed as follows:

Oh = (x, y, w, h), (9)

where (x, y) indicate the coordinates of the HBB center point; w and h are the width and height of the HBB respectively. However, four parameters are not suicient to represent an oriented object. In this work, we modify the Conv detection branch with ive parameters to precisely encode the oriented bounding box of an object in aerial images. We introduce the parameter θ to represent the direction of an oriented object. The OBB of an object with ive parameters can be expressed as follows:

O^r = (x, y, w, h, θ ), (10)

where θ is the angle between the width w and positive direction of X-axis. We regress the oriented bounding boxes with the ofsets between anchors and ground-truth (or predictions). The ofset ˆt_ibetween anchors and ground-truth can be deined as follows:

ˆtx = ( ˆx− xâ) cos θa+ ( ˆy− yâ) sin θa /wa, ˆty = ( ˆy− yâ) cos θa− ( ˆx − xâ) sin θa /ha,

ˆtw = log( ˆw/wa), ˆth = log( ˆh/ha), ˆtθ = [( ˆθ− θ^a) mod 2π ]/2π ,

(11)

where ( ˆx, ˆy, ˆw, ˆh, ˆθ ) denote the center coordinates, width, height, and the angle of a ground-truth OBB; i enumerates{x,y, w, h, θ }. (x^a,ya,wa,ha,θa) represent the center coordinates, width, height, and the angle of an anchor.

As for the offset $t_i$ between predictions and anchors, we simply replace $(\hat{x}, \hat{y}, \hat{w}, \hat{h}, \hat{\theta})$ with the predicted results $(x, y, w, h, \theta)$ in Eq. (11). As mentioned in Section 3.3, the offsets between the HRoIs and the RRoIs (the outputs of the first vanilla coupled-RCNN head) also follow Eq. (11). We take the angle modulo 2π to ensure that it falls within one period. Moreover, $\hat{t}_i$ and $t_i$ are used to represent offsets in the loss functions.
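The offset encoding of Eq. (11) can be sketched in a few lines. This is a minimal illustration, assuming angles in radians and boxes given as plain tuples (x, y, w, h, θ); the helper names `encode_obb` and `decode_obb` are hypothetical, and the decoder is simply the algebraic inverse of the encoder rather than something stated in the text.

```python
import math

TWO_PI = 2.0 * math.pi


def encode_obb(box, anchor):
    """Offsets of a box (x, y, w, h, theta) relative to an anchor, following Eq. (11)."""
    x, y, w, h, theta = box
    xa, ya, wa, ha, theta_a = anchor
    dx, dy = x - xa, y - ya
    cos_a, sin_a = math.cos(theta_a), math.sin(theta_a)
    tx = (dx * cos_a + dy * sin_a) / wa
    ty = (dy * cos_a - dx * sin_a) / ha
    tw = math.log(w / wa)
    th = math.log(h / ha)
    # The modulus keeps the angle difference within one period.
    t_theta = ((theta - theta_a) % TWO_PI) / TWO_PI
    return tx, ty, tw, th, t_theta


def decode_obb(offsets, anchor):
    """Inverse of encode_obb; the returned angle lies in [0, 2*pi)."""
    tx, ty, tw, th, t_theta = offsets
    xa, ya, wa, ha, theta_a = anchor
    cos_a, sin_a = math.cos(theta_a), math.sin(theta_a)
    # Rotate the scaled offsets back from the anchor frame to image coordinates.
    x = xa + tx * wa * cos_a - ty * ha * sin_a
    y = ya + tx * wa * sin_a + ty * ha * cos_a
    w = wa * math.exp(tw)
    h = ha * math.exp(th)
    theta = (theta_a + t_theta * TWO_PI) % TWO_PI
    return x, y, w, h, theta


if __name__ == "__main__":
    horizontal_anchor = (10.0, 20.0, 40.0, 20.0, 0.0)   # an HRoI has angle 0
    target = (14.0, 17.0, 50.0, 10.0, math.pi / 6)
    print(encode_obb(target, horizontal_anchor))
```

For a horizontal anchor the rotation terms vanish, so the first four offsets reduce to the usual HBB encoding and only the angle term is new.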

3.5.3 Loss function. The training loss function basically follows the Cascaded-RCNN, except that we have the additional parameter θ to represent the direction of an oriented object. The loss function for AFF-Det can be formulated as follows:

$$L = L_{rpn} + \alpha L_{H_1} + \beta L_{H_2}, \tag{12}$$

where $L_{rpn}$ represents the training loss for the RPN, and $L_{H_1}$ and $L_{H_2}$ represent the training losses for the first vanilla coupled-RCNN head and the rotated decoupled-RCNN head, respectively. α and β are balancing parameters that are both set to 1 in our experiments. $L_{H_1}$ is the same as defined in [2].


The two separate branches in the decoupled detection head are jointly trained end to end, each with its own balancing parameter. The $L_{H_2}$ loss can be further expressed as follows:

$$L_{H_2} = \lambda_{fc} \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, \hat{p}_i) + \lambda_{conv} \frac{1}{N_{reg}} \sum_i \hat{p}_i L_{reg}(t_i, \hat{t}_i), \tag{13}$$

where $\lambda_{fc}$ and $\lambda_{conv}$ denote the balancing weights for classification and localization, respectively. Different from the losses in previous two-stage detectors, we have two balancing weights in the detection head, one for $L_{cls}$ and one for $L_{reg}$, rather than a single weight for $L_{reg}$. This modification can handle the loss changes more flexibly. $\lambda_{fc}$ and $\lambda_{conv}$ are both set to 2 in our experiments. In this loss function, $N_{cls}$ and $N_{reg}$ represent the number of training samples for classification and the number of positive samples for regression, respectively; $i$ is the index of a proposal in the sampled mini-batch. $\hat{p}_i$ and $p_i$ are the classification ground-truth and prediction for each category, respectively. Note that $\hat{p}_i$ is 1 if the proposal is assigned as positive and 0 if it is negative. $\hat{t}_i$ and $t_i$ are the OBB offsets as described in Eq. (11). The cross-entropy loss and the smooth L1 loss [23] are used as $L_{cls}$ and $L_{reg}$, respectively.
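As a rough illustration of Eq. (13), the sketch below evaluates the head loss for a small list of sampled proposals. It is written in plain Python for readability and makes several assumptions not stated in the text: the smooth L1 threshold `beta` is 1.0, the regression loss is summed over the five offset components, and each proposal is a dictionary with hypothetical keys (`probs`, `label`, `positive`, `t`, `t_hat`).

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the offset components (beta is an assumed threshold)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def cross_entropy(probs, label):
    """Cross-entropy of predicted class probabilities against an integer label."""
    return -math.log(probs[label])


def head_loss(samples, lambda_fc=2.0, lambda_conv=2.0):
    """Weighted sum of the classification and regression terms, as in Eq. (13)."""
    n_cls = len(samples)                                   # all sampled proposals
    n_reg = sum(1 for s in samples if s["positive"])       # positive proposals only
    cls_term = sum(cross_entropy(s["probs"], s["label"]) for s in samples) / n_cls
    # The indicator p_hat_i switches the regression term off for negatives.
    reg_sum = sum(smooth_l1(s["t"], s["t_hat"]) for s in samples if s["positive"])
    reg_term = reg_sum / max(n_reg, 1)
    return lambda_fc * cls_term + lambda_conv * reg_term


if __name__ == "__main__":
    batch = [
        {"probs": [0.5, 0.5], "label": 0, "positive": True,
         "t": (0.5, 0.0, 0.0, 0.0, 0.0), "t_hat": (0.0, 0.0, 0.0, 0.0, 0.0)},
        {"probs": [0.5, 0.5], "label": 1, "positive": False},
    ]
    print(head_loss(batch))
```

With both weights set to 2, as in the experiments, the two terms are scaled equally; the separate weights only matter when they are tuned independently.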

3.6 Design of the AFF-Det-Lite

Prior two-stage detectors usually leverage a large FPN and RPN with heavy detection heads. To balance model complexity and detection performance, we design a lightweight AFF-Det-Lite for efficient oriented object detection in aerial images. For computational efficiency, we first compress the common 256-channel FPN output features $\{P_2, P_3, P_4, P_5\}$ to 64 channels. Then we compress the heavy RPN by replacing the original 256-channel 3 × 3 convolution with a 3 × 3 depth-wise convolution followed by a 64-channel 1 × 1 convolution. The RoIAlign-extracted features for the vanilla RCNN head are reduced from 256 channels to 64 channels. In the WRFA module, the output features of R-RoIAlign are also reduced from 256 channels to 64 channels before being fed into the rotated decoupled-RCNN head. Moreover, to reduce the computation overhead of the proposed modules, the number of feature scales in the MSFF module and the number of stage ResBlocks $K_r$ in the decoupled-RCNN head are both fixed to 1.

In addition, the inner feature channels in the vanilla coupled-RCNN head and the rotated decoupled-RCNN head are reduced by half. The experimental results in Section 4.6 show that AFF-Det-Lite has significantly lower model complexity while maintaining competitive detection performance.
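The effect of the RPN compression can be estimated with a simple weight count. The sketch below compares a dense 256-channel 3 × 3 convolution with a 3 × 3 depth-wise convolution followed by a 64-channel 1 × 1 convolution. It assumes that the lite RPN operates on the 64-channel FPN features and ignores biases; the function names are hypothetical.

```python
def dense_conv_weights(in_ch, out_ch, kernel):
    """Weights of a standard convolution (bias ignored)."""
    return in_ch * out_ch * kernel * kernel


def depthwise_separable_weights(in_ch, out_ch, kernel):
    """Weights of a depth-wise k x k convolution followed by a 1x1 convolution."""
    depthwise = in_ch * kernel * kernel   # one k x k filter per input channel
    pointwise = in_ch * out_ch            # 1x1 convolution that mixes channels
    return depthwise + pointwise


original_rpn = dense_conv_weights(256, 256, 3)        # 256-channel 3x3 convolution
lite_rpn = depthwise_separable_weights(64, 64, 3)     # assumed 64-channel input

if __name__ == "__main__":
    print(original_rpn, lite_rpn, original_rpn / lite_rpn)
```

Under these assumptions the convolution in the lite RPN needs roughly two orders of magnitude fewer weights than the original one.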

4 EXPERIMENTS

To evaluate the proposed method for oriented object detection in aerial images, we conduct experiments on two widely used aerial image datasets: DOTA and HRSC2016. We first introduce the datasets and implementation details in Section 4.1. Then the ablation studies are given in Section 4.2. Finally, quantitative comparisons with state-of-the-art methods are shown in Section 4.3. The experimental results demonstrate the advantages of our proposed method.

4.1 Datasets and Implementation Details

4.1.1 Datasets. DOTA is a large-scale dataset for oriented object detection in aerial images. It contains 2806 images with sizes ranging from 800 × 800 to 4000 × 4000 pixels. The dataset is labeled with 188,282 instances of 15 different categories: Plane (PL), Baseball-diamond (BD), Bridge (BR), Ground-track-field (GF), Small-vehicle (SV), Large-vehicle (LV), Ship (SH), Tennis-court (TC), Basketball-court (BC), Storage-tank (ST), Soccer-ball-field (SF), Roundabout (RA), Harbor (HA), Swimming-pool (SP), and Helicopter (HE). The instances have diverse scales, shapes, and orientations, and are all annotated using quadrilaterals. The dataset is split into three subsets: 1/2 is the training set, 1/6 is the validation set, and 1/3 is the test set. This dataset has both oriented bounding box (OBB) and horizontal bounding box (HBB) localization tasks.


The high-resolution optical satellite images with complex background (HRSC) dataset is a widely employed dataset for arbitrary-oriented ship detection in aerial images. It contains 1061 images with sizes ranging from 300× 300 to 1500 × 900 pixels. The training set has 436 images, the validation set has 181 images, and the test set has 444 images.

4.1.2 Evaluation Metric. To evaluate the performance of the proposed method, mean Average Precision (mAP) is adopted as the evaluation metric on the two datasets. Unless otherwise specified, the metric for DOTA and HRSC2016 is the same as that of PASCAL VOC 2007.
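For reference, the PASCAL VOC 2007 metric uses 11-point interpolated average precision per class, and mAP is the mean over classes. The following is a minimal sketch of that computation, assuming the precision and recall values of a class have already been computed from ranked detections; the function names are hypothetical and the matching of oriented boxes to ground truth is not shown.

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated AP from paired recall/precision values of one class."""
    ap = 0.0
    for k in range(11):
        threshold = k / 10.0
        # Interpolated precision: best precision at any recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        ap += max(candidates) if candidates else 0.0
    return ap / 11.0


def mean_ap(per_class_ap):
    """Mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


if __name__ == "__main__":
    ap_a = voc07_ap([0.5, 1.0], [1.0, 0.5])
    ap_b = voc07_ap([1.0], [1.0])
    print(ap_a, ap_b, mean_ap([ap_a, ap_b]))
```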

4.1.3 Preprocessing. First, the training and validation sets are combined to train the models on the DOTA dataset. Second, since the aerial images in the DOTA dataset are very large, we crop the original images into 1024 × 1024 patches with an overlap of 200 pixels following [31]. For multi-scale training and testing, we first resize the original images by 1.5× and 0.5×; then, along with the original images (1.0×), all images are cropped into 1024 × 1024 patches with an overlap of 200 pixels. The augmented dataset is then used for multi-scale training and testing.
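The cropping step can be sketched as a sliding window with a stride of 1024 − 200 = 824 pixels. The code below is an illustration only: how the last window at the image border is handled is an assumption (here it is shifted back so that it ends at the border), images smaller than the patch are kept as a single crop, and all function names are hypothetical.

```python
def window_starts(length, patch=1024, overlap=200):
    """Start offsets of sliding windows along one image axis."""
    if length <= patch:
        return [0]
    stride = patch - overlap
    starts = list(range(0, length - patch, stride))
    # Assumed border handling: the final window ends exactly at the image border.
    starts.append(length - patch)
    return starts


def crop_boxes(width, height, patch=1024, overlap=200):
    """(x0, y0, x1, y1) crop rectangles that cover a width x height image."""
    return [
        (x, y, min(x + patch, width), min(y + patch, height))
        for y in window_starts(height, patch, overlap)
        for x in window_starts(width, patch, overlap)
    ]


def multiscale_crop_counts(width, height, scales=(0.5, 1.0, 1.5)):
    """Number of crops obtained at each resize factor."""
    return {s: len(crop_boxes(round(width * s), round(height * s))) for s in scales}


if __name__ == "__main__":
    print(window_starts(2000))
    print(multiscale_crop_counts(2000, 2000))
```

The count per scale shows why multi-scale preprocessing enlarges the training set considerably: the 1.5× copy of an image yields noticeably more patches than the original.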

On the HRSC2016 dataset, the training and validation sets are also combined to train the models. All images are resized to 800 × 512 for training and testing. The evaluation accuracy is measured on the test set.

4.1.4 Implementation Details. We adopt the Faster-RCNN OBB [31] as our baseline model. Unless otherwise specified, our backbone is the ImageNet-pretrained ResNet-50 with FPN [13]. We train the models on 4 NVIDIA GTX 1080Ti GPUs with a batch size of 8. The models are trained for 12 epochs on the DOTA dataset and for 36 epochs on the HRSC2016 dataset. Stochastic gradient descent (SGD) is adopted as the default optimizer with a starting learning rate of 0.01; the learning rate drops to 0.001 at epoch 8 and to 0.0001 at epoch 11. The weight decay is set to 0.0001 and the momentum is set to 0.9. During training on both datasets, we flip the image horizontally with a probability of 0.5 to augment the data. We obtain the evaluation accuracy on the DOTA test set through the official online evaluation server.
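The step learning-rate schedule described above can be written as a small function. This is a sketch for the 12-epoch DOTA schedule only; whether epochs are counted from 0 or 1 is not stated, so the function simply applies each drop once the epoch index reaches the milestone. The function name and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Step schedule: multiply the rate by gamma at every milestone that has been reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    for epoch in range(13):
        print(epoch, learning_rate(epoch))
```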

4.2 Ablation Study

4.2.1 Main Results. We perform ablation studies on the DOTA dataset to verify the contribution of each component in AFF-Det. The experimental results are given in Table 1. We first reproduce the baseline model ourselves, and it achieves 69.41% mAP. Then we gradually integrate the proposed methods into the baseline model to explore their contributions. As can be seen from Table 1, our proposed cascaded OBB regression (COR) method significantly improves the mAP by 4.22 percentage points, since the mismatches between proposals and ground-truth are alleviated. Then, based on the baseline + COR model, our proposed MSFF, WRFA, and D-RCNN improve the accuracy by 1.30%, 0.67%, and 0.99% mAP, respectively. The experimental results in Table 1 demonstrate that AFF-Det benefits from each proposed method.

By using multi-scale training data augmentation, the final detection accuracy can be significantly increased to 80.73% mAP. The backbone network usually generates feature maps that are dozens of times smaller than the original image, which makes it difficult for the semantic information of small objects to be captured by the detection head. Training with larger and multi-scale images can improve the robustness of the detection model to the sizes of oriented objects in aerial images. The results can thereby be improved significantly.

Table 1. Ablation studies on the DOTA test dataset. COR represents the cascaded OBB regression; D-RCNN represents the rotated decoupled-RCNN and MS denotes the multi-scale training and testing.

| Method | Improvements to the baseline model | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Baseline | √ | √ | √ | √ | √ | √ | √ | √ |
| COR | | √ | √ | √ | √ | √ | √ | √ |
| MSFF | | | √ | | | √ | √ | √ |
| WRFA | | | | √ | | √ | √ | √ |
| D-RCNN | | | | | √ | | √ | √ |
| MS | | | | | | | | √ |
| mAP | 69.41 | 73.63 | 74.93 | 74.30 | 74.62 | 75.17 | 75.72 | 80.73 |

4.2.2 Multi-scale Feature Fusion. We present the ablation study results for the MSFF module in Table 2. We first explore the influence of different numbers of feature scales and give the results in the top part of Table 2. We gradually increase the feature scales from {0.1} to {0.1, 0.2, 0.3, 0.4, 0.5}, that is, from one scale to five scales. As can be seen from Table 2, the accuracy varies with the number of feature scales. The best accuracy is obtained with four feature scales {0.1, 0.2, 0.3, 0.4}, and it is 1.30% mAP higher than the basic model. For further comparison, we measure the results using a single scale but four feature branches in the MSFF module. As shown in the middle part of Table 2, none of these results surpasses the result obtained with four feature scales, although they achieve up to 1.10% mAP improvement.

Table 2. Experimental results under different feature scales in the multi-scale feature fusion module. The basic experimental model is the baseline + COR model.

| Feature scales S | mAP | Gain |
|---|---|---|
| Baseline + COR | 73.63 | - |
| 0.1 | 74.32 | +0.69 |
| 0.1, 0.2 | 73.99 | +0.36 |
| 0.1, 0.2, 0.3 | 74.51 | +0.88 |
| 0.1, 0.2, 0.3, 0.4 | 74.93 | +1.30 |
| 0.1, 0.2, 0.3, 0.4, 0.5 | 73.67 | +0.04 |
| 0.2, 0.2, 0.2, 0.2 | 74.34 | +0.71 |
| 0.3, 0.3, 0.3, 0.3 | 74.44 | +0.81 |
| 0.4, 0.4, 0.4, 0.4 | 74.73 | +1.10 |
| 0.5, 0.5, 0.5, 0.5 | 73.56 | -0.07 |
| 0.2, 0.3, 0.4, 0.5 | 73.73 | +0.10 |

With four feature scales, we also explore the impact of different scale configurations. The result for scales {0.2, 0.3, 0.4, 0.5} is also shown in Table 2. This configuration achieves a 0.10% mAP improvement over the baseline + COR model. Moreover, we have tested other feature scale configurations, but they do not improve the accuracy as much as the ones shown in Table 2. The results demonstrate the effectiveness of using multi-scale features rather than a single feature scale, since the objects in aerial images vary dramatically in size.

4.2.3 Weighted RoI Feature Aggregation. First, we compare the proposed WRFA module with the single-feature-level RoI assignment method (basic model). As the main ablation results in Table 1 show, by fusing all levels of RoI features, the proposed WRFA module increases the accuracy by 0.67% mAP compared with the basic (baseline + COR) model.

Furthermore, we analyze the impact of different RoI feature fusion methods on the proposed AFF-Det. The results are summarized in Table 3. The first common method is sum fusion, in which we directly sum the corresponding extracted RoI features from $P_2$ to $P_5$ without adaptively giving them different weights. We find that sum fusion leads to a 0.14% mAP drop relative to the basic model. Similarly, the max and mean fusion methods cause 0.84% and 0.18% mAP drops, respectively, as shown in Table 3. The results reveal that different levels of RoI features contribute differently to the final results and that each feature level should have its own weight.

As shown in Table 3, by fusing the weighted features, we obtain a 0.81% mAP improvement compared with the vanilla sum fusion. The experimental results demonstrate that the proposed WRFA module can fully exploit the features from different FPN levels; thus it can generate better RoI features with more powerful context and semantic information.

Table 3. Comparisons of different fusion methods in the weighted RoI feature aggregation module. The basic experimental model is the baseline + COR model.

| Method | mAP | Gain |
|---|---|---|
| Baseline + COR | 73.63 | - |
| sum | 73.49 | -0.14 |
| max | 72.79 | -0.84 |
| mean | 73.45 | -0.18 |
| WRFA | 74.30 | +0.67 |
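The fusion variants compared in Table 3 differ only in how the RoI features from the four FPN levels are combined. The sketch below illustrates the element-wise sum, max, mean, and weighted fusion on small feature vectors. The per-level weights are passed in as plain numbers; how the WRFA module actually produces them is not reproduced here, so `weights` and the function name `fuse` are purely illustrative.

```python
def fuse(level_features, method, weights=None):
    """Fuse per-level RoI feature vectors (one list of floats per FPN level)."""
    columns = list(zip(*level_features))   # group values of the same feature element
    if method == "sum":
        return [sum(c) for c in columns]
    if method == "max":
        return [max(c) for c in columns]
    if method == "mean":
        return [sum(c) / len(c) for c in columns]
    if method == "weighted":
        # Each level contributes according to its own weight (assumed to be given).
        return [sum(w * v for w, v in zip(weights, c)) for c in columns]
    raise ValueError("unknown fusion method: " + method)


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 0.0], [0.0, 4.0], [2.0, 2.0]]   # levels P2..P5
    for name in ("sum", "max", "mean"):
        print(name, fuse(features, name))
    print("weighted", fuse(features, "weighted", weights=[0.5, 0.25, 0.25, 0.0]))
```

Sum and mean fusion are the special cases of weighted fusion with equal weights, which is why they cannot express that one level matters more than another for a given RoI.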

4.2.4 Rotated Decoupled-RCNN. In this section, we analyze the impact of the proposed rotated decoupled-RCNN on AFF-Det. We first analyze how the convolution ResBlocks contribute to the detection results. We measure the results with different numbers of convolution ResBlocks and show them in Table 4. The baseline + COR model adopts the vanilla coupled-RCNN head, which has shared FC layers for detection. It does not have convolution ResBlocks, which means $K_r$ is 0. We measure the results with $K_r$ from 1 to 5. As shown in Table 4, deeper networks generally lead to better results, with minor fluctuations. The best result is obtained when $K_r$ is 5, a 1.14% mAP improvement over the basic model; however, the training time and memory consumption increase significantly. Considering the trade-off between model complexity and detection accuracy, we use 3 ResBlocks as the default setting in our experiments. The 3 stage ResBlocks increase the accuracy by 0.99% mAP compared with the basic model. As for the FC layers, we set $K_c$ to 2 following the FPN implementation, considering that FC layers are memory and computation intensive. The results shown in Table 4 are also measured with 2 FC layers.

Table 4. Experimental results under different numbers of ResBlocks in the rotated decoupled-RCNN. The coupled-RCNN in baseline + COR model has 0 ResBlocks.

| No. Stages $K_r$ | mAP | Gain |
|---|---|---|
| Coupled-RCNN (0) | 73.63 | - |
| 1 | 74.46 | +0.83 |
| 2 | 74.48 | +0.85 |
| 3 | 74.62 | +0.99 |
| 4 | 74.24 | +0.61 |
| 5 | 74.77 | +1.14 |


Table 5. Comparison with state-of-the-art methods on the DOTA test dataset for the OBB task. † indicates multi-scale training and testing. The results with bold indicate the best and second-best results of each column.

| Method | Backbone | PL | BD | BR | GF | SV | LV | SH | TC | BC | ST | SF | RA | HA | SP | HE | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PIoU [3] | DLA-34 | 80.90 | 69.70 | 24.10 | 60.20 | 38.30 | 64.40 | 64.80 | 90.90 | 77.20 | 70.40 | 46.50 | 37.10 | 57.10 | 61.90 | 64.00 | 60.50 |
| RoI Trans. [4] | ResNet-101 | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56 |
| CAD-Net [43] | ResNet-101 | 87.80 | 82.40 | 49.40 | 73.50 | 71.10 | 63.50 | 76.70 | 90.90 | 79.20 | 73.30 | 48.40 | 60.90 | 62.00 | 67.00 | 62.30 | 69.90 |
| DRN [20] | Hourglass-104 | 88.91 | 80.22 | 43.52 | 63.35 | 73.48 | 70.69 | 84.94 | 90.14 | 83.85 | 84.11 | 50.12 | 58.41 | 67.62 | 68.60 | 52.50 | 70.70 |
| CenterMap [28] | ResNet-50 | 88.88 | 81.24 | 53.15 | 60.65 | 78.62 | 66.55 | 78.10 | 88.83 | 77.80 | 83.61 | 49.36 | 66.19 | 72.10 | 72.36 | 58.70 | 71.74 |
| DAL [19] | ResNet-101 | 88.61 | 79.69 | 46.27 | 70.37 | 65.89 | 76.10 | 78.53 | 90.84 | 79.98 | 78.41 | 58.71 | 62.02 | 69.23 | 71.32 | 60.65 | 71.78 |
| SCRDet [39] | ResNet-101 | 89.89 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 | 72.61 |
| RSDet [21] | ResNet-152 | 90.20 | 83.50 | 53.60 | 70.10 | 64.60 | 79.40 | 67.30 | 91.00 | 88.30 | 82.50 | 64.10 | 68.70 | 62.80 | 69.50 | 66.90 | 73.50 |
| S²A-Net [6] | ResNet-50 | 89.11 | 82.84 | 48.37 | 71.11 | 78.11 | 78.39 | 87.25 | 90.83 | 84.90 | 85.64 | 60.36 | 62.60 | 65.26 | 69.13 | 57.94 | 74.12 |
| SCRDet++ [38] | ResNet-152 | 89.20 | 83.36 | 50.92 | 68.17 | 71.61 | 80.23 | 78.53 | 90.83 | 86.09 | 84.04 | 65.93 | 60.80 | 68.83 | 71.31 | 66.24 | 74.41 |
| KLD [40] | ResNet-50 | 88.91 | 83.71 | 50.10 | 68.75 | 78.20 | 76.05 | 84.58 | 89.41 | 86.15 | 85.28 | 63.15 | 60.90 | 75.06 | 71.51 | 67.45 | 75.28 |
| R³Det-DCL [34] | ResNet-152 | 89.78 | 83.95 | 52.63 | 69.70 | 76.84 | 81.26 | 87.30 | 90.81 | 84.67 | 85.27 | 63.50 | 64.16 | 68.96 | 68.79 | 65.45 | 75.54 |
| AFF-Det (Ours) | ResNet-50 | 88.34 | 83.06 | 53.77 | 72.16 | 79.54 | 78.09 | 87.65 | 90.69 | 87.19 | 84.50 | 57.46 | 64.96 | 74.88 | 70.80 | 61.24 | 75.72 |
| AFF-Det (Ours) | ResNet-101 | 88.78 | 77.71 | 54.34 | 72.60 | 79.65 | 77.74 | 87.39 | 90.79 | 86.28 | 83.46 | 56.73 | 64.12 | 74.87 | 71.36 | 62.46 | 75.43 |
| DRN† [20] | Hourglass-104 | 89.71 | 82.34 | 47.22 | 64.10 | 76.22 | 74.43 | 85.84 | 90.57 | 86.18 | 84.89 | 57.65 | 61.93 | 69.30 | 69.63 | 58.48 | 73.23 |
| Gliding-Vertex† [33] | ResNet-101 | 89.64 | 85.00 | 52.26 | 77.34 | 73.01 | 73.14 | 86.82 | 90.74 | 79.02 | 86.81 | 59.55 | 70.91 | 72.94 | 70.86 | 57.32 | 75.02 |
| BBAVectors† [42] | ResNet-101 | 88.36 | 84.06 | 52.13 | 69.56 | 78.26 | 80.40 | 88.06 | 90.87 | 87.23 | 86.39 | 56.11 | 65.62 | 67.10 | 72.08 | 63.96 | 75.36 |
| CenterMap† [28] | ResNet-101 | 89.83 | 84.41 | 54.60 | 70.25 | 77.66 | 78.32 | 87.19 | 90.66 | 84.89 | 85.27 | 56.46 | 69.23 | 74.13 | 71.56 | 66.06 | 76.03 |
| CSL† [35] | ResNet-152 | 90.25 | 85.53 | 54.64 | 75.31 | 70.44 | 73.51 | 77.62 | 90.84 | 86.15 | 86.69 | 69.60 | 68.04 | 73.83 | 71.10 | 68.93 | 76.17 |
| R³Det† [36] | ResNet-152 | 89.80 | 83.80 | 48.10 | 66.80 | 78.80 | 83.30 | 87.80 | 90.80 | 85.40 | 85.50 | 65.70 | 62.70 | 77.50 | 78.60 | 72.60 | 76.50 |
| SCRDet++† [38] | ResNet-152 | 88.68 | 85.22 | 54.70 | 73.71 | 71.92 | 84.14 | 79.39 | 90.82 | 87.04 | 86.02 | 67.90 | 60.86 | 74.52 | 70.76 | 72.66 | 76.56 |
| R³Det-DCL† [34] | ResNet-152 | 89.26 | 83.60 | 53.54 | 72.76 | 79.04 | 82.56 | 87.31 | 90.67 | 86.59 | 86.98 | 67.49 | 66.88 | 73.29 | 70.56 | 69.99 | 77.37 |
| GWD† [37] | ResNet-152 | 89.06 | 84.32 | 55.33 | 77.53 | 76.95 | 70.28 | 83.95 | 89.75 | 84.51 | 86.06 | 73.47 | 67.77 | 72.60 | 75.76 | 74.17 | 77.43 |
| KLD† [40] | ResNet-50 | 88.91 | 85.23 | 53.64 | 81.23 | 78.20 | 76.99 | 84.58 | 89.50 | 86.84 | 86.38 | 71.69 | 68.06 | 75.95 | 72.23 | 75.42 | 78.32 |
| FR-Est-MST-DCN† [5] | ResNet-101 | 89.78 | 85.21 | 55.40 | 77.70 | 80.26 | 83.78 | 87.59 | 90.81 | 87.66 | 86.93 | 65.60 | 68.74 | 71.64 | 79.99 | 66.20 | 78.49 |
| S²A-Net† [6] | ResNet-50 | 88.89 | 83.60 | 57.74 | 81.95 | 79.94 | 83.18 | 89.11 | 90.78 | 84.87 | 87.81 | 70.30 | 68.25 | 78.30 | 77.01 | 69.58 | 79.42 |
| ReDet† [7] | ReR50 [7] | 88.81 | 82.48 | 60.83 | 80.82 | 78.34 | 86.06 | 88.31 | 90.87 | 88.77 | 87.03 | 68.65 | 66.90 | 79.26 | 79.71 | 74.67 | 80.10 |
| AFF-Det† (Ours) | ResNet-50 | 88.96 | 85.57 | 61.64 | 79.90 | 76.41 | 85.20 | 88.59 | 90.82 | 87.24 | 86.73 | 69.69 | 69.93 | 79.15 | 83.48 | 77.58 | 80.73 |
| AFF-Det† (Ours) | ResNet-101 | 88.72 | 84.96 | 62.24 | 79.42 | 76.40 | 85.43 | 88.72 | 90.68 | 88.56 | 86.98 | 69.10 | 69.88 | 79.30 | 83.54 | 72.07 | 80.50 |

4.3 Comparison with State-of-the-art Methods

4.3.1 Results on the DOTA Dataset. Table 5 shows the quantitative comparisons between our proposed method and state-of-the-art approaches on the DOTA test dataset. As can be seen from the table, our single-scale model can achieve 75.72% mAP, surpassing all the other single-scale models and even some of the multi-scale models.

Furthermore, the best detection accuracy of our proposed AFF-Det is 80.73% mAP, which outperforms all the listed state-of-the-art methods. In addition, we observe that using ResNet-50 achieves slightly better results than using ResNet-101 as the backbone in our method. The reason is that the ResNet-101 model slightly overfits during training on the DOTA dataset: we observe higher training accuracy but lower test accuracy than with the ResNet-50 model. The accuracy learning curves during training are shown in Fig. 8. Two reasons account for the overfitting problem: first, the ResNet-101 model is deeper and more complex than ResNet-50; second, the ResNet-101 model is pretrained on ImageNet, but the DOTA dataset is not large enough for fine-tuning it. Table 5 also shows that, with limited data augmentation (multi-scale training), the mAP is significantly improved by about 5 percentage points with both the ResNet-50 and ResNet-101 backbones. These results reveal the importance of adequate data for training deep neural networks.

Moreover, we show sample visual detection results for different categories produced by our proposed AFF-Det in Fig. 9. As can be seen from the figure, although the objects have a variety of sizes and orientations, our proposed method detects them accurately. We also show some negative samples for Soccer-ball-field (SF), which are mainly caused by redundant detection boxes.

[Figure 8 plot: mmAP (y-axis, 94.0 to 98.0) versus Epoch (x-axis, 2 to 12) for ResNet-50 and ResNet-101.]

Fig. 8. Accuracy learning curves during training on the DOTA dataset with ResNet-50 and ResNet-101 backbones. mmAP represents that the accuracy value is the average of all mAPs from all iterations in the corresponding epoch.

[Figure 9 images: panels "Positive Samples" and "Negative Samples"; ColorMap legend: PL, BD, BR, GF, SV, LV, HE, SP, SH, TC, BC, ST, SF, RA, HA.]

Fig. 9. Sample visual results of both positive and negative samples on the DOTA test dataset.

In addition, we provide the experimental results for the HBB task on the DOTA dataset. The HBBs in the experiments are generated by calculating the axis-aligned bounding boxes over the predicted oriented bounding boxes (minimum bounding rectangle), which is consistent with the ground-truth HBBs on the DOTA dataset. The comparisons with state-of-the-art methods are given in Table 6. As can be seen from Table 6, our AFF-Det obtains the best detection accuracy of 81.18% mAP with the ResNet-101 backbone. The experimental results show that our method achieves 1.22%, 1.19%, and 3.85% higher mAP than the two-stage methods SCRDet [39], FADet [11], and CenterMap [28], respectively. AFF-Det also surpasses the accuracy of the one-stage methods FMSSD [29], ICN [1], and SCRDet++ [38] by a large margin. The experimental results demonstrate that the proposed methods are beneficial to both the OBB and HBB tasks.
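The conversion from a predicted OBB to its enclosing HBB can be sketched as follows. The code assumes the five-parameter representation of Eq. (10) with θ in radians, measured between the width and the positive X-axis; the function name `obb_to_hbb` is hypothetical.

```python
import math


def obb_to_hbb(x, y, w, h, theta):
    """Axis-aligned box (x_min, y_min, x_max, y_max) that encloses an oriented box."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Half extents of the rotated rectangle projected onto the image axes.
    half_w = (abs(w * cos_t) + abs(h * sin_t)) / 2.0
    half_h = (abs(w * sin_t) + abs(h * cos_t)) / 2.0
    return x - half_w, y - half_h, x + half_w, y + half_h


if __name__ == "__main__":
    print(obb_to_hbb(0.0, 0.0, 4.0, 2.0, 0.0))
    print(obb_to_hbb(0.0, 0.0, 4.0, 2.0, math.pi / 2))
```

Projecting the half extents gives the same result as taking the minimum and maximum over the four rotated corner points, without computing the corners explicitly.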