Table 1 reports the remote sensing scene classification performance obtained with various pre-trained deep CNN models on different remote sensing datasets. In Table 1, Ac and SD denote accuracy and standard deviation, respectively.
In the experiment, pre-trained deep CNNs are used directly as feature extractors in an unsupervised manner. After the last fully connected layer is removed, the remaining layers of each pre-trained deep CNN extract high-dimensional feature vectors from the remote sensing images. These feature vectors serve as the final image representation and are fed to a linear SVM classifier. Table 1 shows that all transferred deep CNNs derived from AlexNet, CaffeNet, VGG-VD16, and GoogLeNet achieve state-of-the-art performance; pre-trained deep CNNs thus show strong generalization power in the transfer process. To our surprise, ResNets, the most successful deep CNNs to date, fail to obtain good results, whether they have 50, 101, or 152 layers. In ResNets, shortcut connections require fewer parameters and make the network much easier to optimize. At the same time, the direct connection between input and output leads to poor generalization ability when the networks are transferred to other tasks. On the other hand, as shown in Figure 11, the spatial information of the remote sensing images in the Brazilian coffee scene dataset is very simple. However, these remote sensing images are not optical (green-red-infrared). In Table 1, the relatively poor performance on this dataset stems from the difference in spectral information encountered when pre-trained deep CNNs are transferred to remote sensing scene classification.
Abstract: Semantic-level land-use scene classification is a challenging problem, in which deep learning methods, e.g., convolutional neural networks (CNNs), have shown remarkable capacity. However, a lack of sufficient labeled images has proved a hindrance to increasing the land-use scene classification accuracy of CNNs. Aiming at this problem, this paper proposes a CNN pre-training method under the guidance of a human visual attention mechanism. Specifically, a computational visual attention model is used to automatically extract salient regions in unlabeled images. Then, sparse filters are adopted to learn features from these salient regions, with the learnt parameters used to initialize the convolutional layers of the CNN. Finally, the CNN is further fine-tuned on labeled images. Experiments performed on the UCMerced and AID datasets show that, when combined with a demonstrative CNN, our method can achieve 2.24% higher accuracy than a plain CNN and can obtain an overall accuracy of 92.43% when combined with AlexNet. The results indicate that the proposed method can effectively improve CNN performance using easy-to-access unlabeled images and thus will enhance the performance of land-use scene classification, especially when a large-scale labeled dataset is unavailable.
Abstract: The interpretation of land use and land cover (LULC) is an important issue in the fields of high-resolution remote sensing (RS) image processing and land resource management. Fully training a new or existing convolutional neural network (CNN) architecture for LULC classification requires a large number of remote sensing images. Thus, fine-tuning a pre-trained CNN for LULC detection is required. To improve the classification accuracy for high-resolution remote sensing images, it is necessary to use another feature descriptor and to adopt a classifier for post-processing. A fully connected conditional random field (FC-CRF) that uses the fine-tuned CNN layers, spectral features, and fully connected pairwise potentials is proposed for the classification of high-resolution remote sensing images. First, an existing CNN model is adopted and its parameters are fine-tuned on the training datasets; the probability that each image pixel belongs to each class type is then calculated. Second, the spectral features and the digital surface model (DSM) are combined with a support vector machine (SVM) classifier to determine the probability of each LULC class type. These probabilities are combined with those obtained from the fine-tuned CNN to build new feature descriptors. Finally, the FC-CRF is introduced to produce the classification results, where the unary potentials are obtained from the new feature descriptors and the SVM classifier, and the pairwise potentials are obtained from the three-band RS imagery and the DSM. Experimental results show that the proposed classification scheme achieves good performance, with a total accuracy of about 85%.
It is interesting to note that our results surpass those achieved by their respective networks on the CamVid test data within a few hundred iterations, and then go on to perform significantly better. This could partly result from our data set containing fewer classes (8 vs. 11). Another factor could be our partially labelled test data, which features very few class boundary regions; however, further testing with fully labelled data shows similar performance. It is possible that partially labelled training data could lead to a better-performing classifier due to the lack of potentially confusing boundary pixels. To fully test this, we would need to compare these results to those obtained by training an identical network with a fully labelled version of the same data set, which is beyond the scope of this paper.
building blocks such as convolutional layers, pooling layers, and fully connected layers. It has been designed to automatically and adaptively learn spatial hierarchies of features, from low- to high-level patterns, through the backpropagation algorithm [15, 30]. Its attraction lies in its special architecture, which has a powerful ability to learn filters and apply them to small sub-regions of the data. This unsupervised feature learning, which is performed in the convolutional layers, allows the network to easily capture hidden local patterns and variations in the data. The resulting feature map is then passed to the fully connected layers for activity context classification. The convolutional layers are trained alongside the other layers of the network, as their outputs serve as the inputs of other CNN layers. The convolution operation effectively exploits the local temporal dependency of time series data, while the pooling operation cancels the impact of small translations of the input. With its weight-sharing feature, the convolution operation of the DCNN preserves scale invariance, which in activity context recognition helps to discriminate between two similar, or nearly identical, classes. This operation also helps to capture local dependencies of the signals. For example, it would be able to capture the dependencies between inertial sensing signals and those of nearby ambient sensors. It also lowers the computational cost by reducing the number of connections between convolutional layers [4,8,10]. Because it can be optimized using backpropagation, it is an excellent deep learning architecture that produces minimal prediction error.
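The interplay of weight sharing and pooling described above can be illustrated with a minimal sketch. The kernel values, the toy signal, and the pooling window of 4 are illustrative assumptions and are not taken from the model discussed here; the helper names `conv1d` and `max_pool1d` are hypothetical.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation) with one shared kernel."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(features, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]


# Assumed toy data: one peak, and the same peak shifted by one time step.
kernel = [1, 2, 1]
signal = [0, 0, 4, 0, 0, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 4, 0, 0, 0, 0, 0, 0]

# The same three weights are reused at every position (weight sharing),
# so the shifted input produces a shifted feature map.
feat_a = conv1d(signal, kernel)
feat_b = conv1d(shifted, kernel)

# Pooling absorbs the shift as long as the response stays in one window.
pooled_a = max_pool1d(feat_a, 4)
pooled_b = max_pool1d(feat_b, 4)
```

In this example the two feature maps differ by one position, while the pooled outputs are identical. The invariance holds only for shifts that keep the peak response inside the same pooling window; a larger shift would change the pooled output.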
Research activities in the human activity context can be broadly categorized into two groups: video-based human activity recognition and sensor-based activity recognition [4,7–18]. Sensor-based activity recognition has focused on using data generated by inertial sensors, such as accelerometers and gyroscopes, to recognize human locomotive activities, either by placing these sensors on various parts of the human body or by using smartphones [16,24–26]. Video-based human activity recognition has focused on using video surveillance data in the activity recognition process. In recent years, many research works have explored various algorithms, whilst building new ones, to automatically identify human activities. Conventional machine learning algorithms have been extensively explored and widely reported in the literature [2–4,10,16]. For example, in our previous work, we explored various traditional classification algorithms for automatic context recognition. The result was applied in the development of a context model for an intelligent context-aware recommendation system. Other works based on classical machine learning algorithms and handcrafted feature extraction processes have been extensively reported [4,24–27]. For example, one such work proposed a descriptor-based approach to human activity recognition. The authors handcrafted time- and frequency-domain features from accelerometer and gyroscope signals and then used conventional support vector machines and k-nearest neighbor algorithms. Straczkiewicz and Onnela provide a comprehensive review of several human activity recognition research works using classical machine learning algorithms. The majority of the reported works using traditional machine learning algorithms are based on handcrafted feature extraction processes. Zeng et al. report that, although these works might have demonstrated good performance in recognizing one activity, they perform poorly in recognizing others due to class imbalance. They also noted that these works cannot capture local dependencies of an activity signal and cannot preserve scale invariance. This explains why some models struggle to discriminate between jogging and running contexts.
The second dataset comprises street images generated from a 3D model of an abstract city in ESRI CityEngine, which is a parametric engine for city building (ESRI 2013). Creating synthetic image data using a 3D virtual environment can greatly enhance the efficiency of collecting urban image data. In total, 4,800 images are generated with this process. We remove invalid images, such as repeated images, images that are near an intersection and those at the end of a road. This process results in 1,029 filtered images, which are then similarly resized to a set dimension (256 x 256 pixels). The ground truth labelling is performed automatically, and three sets of images are produced: 0 - blank frontages on both sides of the street, 1 - active frontages on one side of the street and 2 - active frontages on both sides of the street (Figure 4). The non-urban frontage class is not included, as it is not realistic to synthetically create non-urban scenes using the parametric software.
To address the above-mentioned limitations in the existing approaches for RS image scene classification, we present a novel approach that benefits from Deep Neural Networks (DNNs) to perform scene classification of compressed RS images. The proposed approach aims to minimize the amount of decompression applied to RS images. We assume that images are compressed with the JPEG 2000 algorithm. To achieve an efficient scene classification at a fast computational rate, the proposed approach consists of two steps: i) approximation of the wavelet sub-bands (or the image); and ii) feature extraction and classification of the approximated wavelet sub-bands. The proposed approach begins by approximating the finer (higher-resolution) wavelet sub-bands of the reversible biorthogonal filter used in JPEG 2000 from the coarsest (lowest-resolution) wavelet sub-band. To achieve this, the proposed approach uses a series of deconvolutional layers through which the wavelet sub-bands are approximated. Then, the high-level semantic content of the approximated wavelet sub-bands is learnt through a sequence of convolutional layers, and finally image classification is performed. Accordingly, the proposed approach utilizes the multiresolution paradigm within the JPEG 2000 compression algorithm to achieve scene classification in a time-efficient manner. Experimental results obtained on a benchmark archive show the effectiveness of the proposed approach.
Classification of satellite and aerial imagery is a quintessential problem in remote sensing observation analysis. Although the variation in appearance and environmental conditions makes this a very challenging problem, in the past few years, machine learning methods have led to a dramatic leap in performance [1,2]. In supervised image classification, both for generic imagery and remote sensing scenes, the current gold standard involves designing an appropriate Convolutional Neural Network (CNN) and training the network so that the cross-entropy loss between the predicted and the ground-truth labels is minimized. With the ground-truth labels encoded as a binary vector in which all but one element are zero (one-hot encoding), the cross-entropy coincides with the KL-divergence between the ground-truth labels and the predicted class label distribution of the network, which is obtained from the last layer after appropriate scaling via a soft-max activation function. Although different loss functions have been proposed for non-classification tasks, like focal loss in object detection and mean-squared-error for image enhancement, the majority of state-of-the-art CNN architectures employ cross-entropy as the loss function to be minimized for scenarios like supervised image classification.
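The relationship between the cross-entropy loss, the one-hot encoding, and the soft-max output can be checked numerically with a minimal sketch. The logits below are made-up values for a hypothetical three-class problem, and the function names are illustrative only.

```python
import math


def softmax(logits):
    """Scale raw last-layer outputs into a probability distribution."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def kl_divergence(target, predicted):
    """KL(target || predicted), using the convention 0 * log(0) = 0."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Assumed example: three classes, the true class is the first one.
logits = [2.0, 0.5, -1.0]
one_hot = [1.0, 0.0, 0.0]

probs = softmax(logits)
ce = cross_entropy(one_hot, probs)
kl = kl_divergence(one_hot, probs)
```

Because a one-hot target has zero entropy, `ce` and `kl` agree here, and both reduce to the negative log-probability assigned to the true class. For soft (non-one-hot) targets the two quantities would differ by the entropy of the target distribution.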
Abstract: In this Letter, we propose a new approach for remote sensing scene classification by creating an ensemble of the recently introduced massively parallel deep (fuzzy) rule-based (DRB) classifiers trained separately with different levels of spatial information. Each DRB classifier consists of a massively parallel set of human-interpretable, transparent 0-order fuzzy IF…THEN… rules with a prototype-based nature. The DRB classifier can self-organize “from scratch” and self-evolve its structure. By employing a pre-trained deep convolutional neural network as the feature descriptor, the proposed DRB ensemble is able to exhibit human-level performance through a transparent and parallelizable training process. Numerical examples using benchmark datasets demonstrate the superior accuracy of the proposed approach together with human-interpretable fuzzy rules autonomously generated by the DRB classifier.
However, thus far, there has been no work on exploring rate-accuracy tradeoffs for CNN-based video classification. This is now increasingly important due to the advent of visual IoT and cloud-based platforms, where the visual sensing and processing are not co-located. Such tradeoffs are nontrivial, because they depend on the spatio-temporal information needed by the CNN performing the recognition task. For instance, one of the issues with most of the work described above is the short temporal extent of inputs; each input video segment comprises a small group of frames that represent only (approximately) one second of the recorded action or event to be classified. Hence, this cannot account for cases where temporal dependencies extend over longer durations. Feichtenhofer et al. attempted to resolve this issue by using multiple copies of their two-stream network, where the copies are spread over a coarse temporal scale, thus encompassing both coarse and fine motion information with an optical flow input. The architecture is spatially and then temporally fused using 3D convolution and pooling. Despite achieving state-of-the-art results on the UCF-101 and HMDB-51 datasets, this approach requires heavy processing for both training and testing. Alternatively, other work argues that increasing the temporal extent is simply a case of taking the optical flow component over a larger temporal extent. In order to minimize the complexity of the network, most such approaches downsize the frames, thus reducing the spatial dimensions. On the other hand, the work of Sevilla et al. shows that high-resolution optical flow can be beneficial, since deep learning methods can learn features from small details. This observation suggests that high-resolution optical flow can be leveraged to lower the temporal extent of inputs.
Understanding the trade-offs in compressed-domain spatio-temporal information and exploring the rate-accuracy characteristics of CNN-based video classification is the objective of this paper.
Equation (4) implements the normalization operation. The normalized value is then scaled and shifted by the learnable parameters $\gamma$ and $\beta$ to obtain the final result $y_i$. In the implementation, batch normalization can be inserted anywhere in the network, just like a normal computational layer, since all of its steps are based on simple differentiable operations. Batch normalization is a practical tool in training deep neural networks for the following reasons: first, it can alleviate the problems caused by improper network initialization; second, it can effectively speed up the training procedure by preventing vanishing gradients.
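The normalize-then-scale-and-shift procedure described above can be sketched as follows. Since the equations themselves are not reproduced in this passage, the sketch assumes the standard batch normalization formulation (mini-batch mean and biased variance, with a small constant `eps` for numerical stability); the function name `batch_norm` and the value of `eps` are illustrative assumptions.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation over a mini-batch, then scale and shift.

    xs    : values of a single activation across the mini-batch
    gamma : learnable scale (assumed scalar for this sketch)
    beta  : learnable shift (assumed scalar for this sketch)
    eps   : stability constant, an assumed value
    """
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m  # biased mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]  # normalization
    return [gamma * xh + beta for xh in x_hat]  # y_i = gamma * x_hat_i + beta


# Assumed toy mini-batch of four activations.
ys = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
y_mean = sum(ys) / len(ys)
y_var = sum((y - y_mean) ** 2 for y in ys) / len(ys)
```

After the transform, the outputs have mean close to `beta` and variance close to `gamma ** 2`, regardless of the scale of the inputs, which is one way to see why the layer is less sensitive to improper initialization. Every step is a differentiable arithmetic operation, so gradients with respect to `gamma`, `beta`, and the inputs are well defined.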
Figure 2. Inception module of GoogleNet: The inception module is an intrinsic component of the GoogleNet architecture. GoogleNet has 9 inception modules, named 3a, 3b, 4a, 4b, 4c, 4d, 4e, 5a, and 5b, connected one after another. The inception module has two layers and 6 convolutional blocks (green blocks), connected as shown in the figure. From an implementation perspective of our approach with GoogleNet, for a convolutional block l in layer 1, the subsequent blocks are all convolutional blocks in layer 2, irrespective of the connection pattern. This is done for ease in the computation of (2) and (3). However, for a given convolutional block l in a layer of the inception module, its previous convolutional block is considered to be only the one from which l has incoming links. The distinction is made for simplicity in computation, as the statistics of the previous layer are only required in case (b) of our approach (Section 3), for deciding whether any operation should be applied to the current block or not.
Motion-vector-based optical flow approximations have been proposed for action recognition by Kantorov and Laptev [4], albeit without the use of CNNs. In more recent work, proposals have been put forward for fast video classification based on CNNs that ingest compressed-domain motion vectors and selective RGB texture information [5, 6]. Despite their significant speed and accuracy improvements, none of these approaches considered the trade-off between rate and classification accuracy obtained from a CNN. Conversely, while rate-accuracy trade-offs have been analysed for conventional image and video feature extraction systems [7, 8], these studies do not cover deep CNNs and semantic video classification, where the different nature of the spatio-temporal classifiers can lead to different rate-accuracy trade-offs.
plied pixel-by-pixel or window-by-window. Second, the features for any given spatial window are fed to a classifier that assesses whether such a region depicts a human. Furthermore, a scale-space is typically used in order to detect pedestrians at different scales, that is, at different distances with respect to the sensing device. In 2003, Viola and Jones proposed a pedestrian detection system based on box-shaped filters that can be applied efficiently by resorting to integral images. The features, i.e., the results of the convolution of a window with a given box-shaped filter, are then fed to a classifier based on AdaBoost. Dalal and Triggs refine the process, proposing Histograms of Oriented Gradients (HOG) as local image features, to be fed to a linear Support Vector Machine aimed at identifying windows containing humans. Such features proved to be quite effective for the task at hand, representing the basis for more complex algorithms. Felzenszwalb et al. further improve the detection accuracy by combining HOG with a Deformable Part Model. In particular, this approach aims at identifying a human shape as a deformable combination of its parts, such as the trunk, the head, etc. Each body part has peculiar characteristics in terms of its appearance and can be effectively recognized by resorting to the HOG features and a properly trained classifier. Such a model proved to be more robust with respect to body shape and pose and to partial occlusions. Dollár et al. propose to use features extracted from multiple different channels. Each channel is defined as a linear or non-linear transformation of the input pixel-level representation. Channels can capture different local properties of the image, such as corners, edges, intensity, and color.
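The reason box-shaped filters can be applied efficiently with integral images, as mentioned above for the Viola and Jones detector, is that the sum over any rectangle requires only four table lookups. The following is a minimal sketch of that idea on a toy image; the function names and the 3 × 3 example are illustrative assumptions, and the sketch does not reproduce the actual feature set or the AdaBoost classifier of the detector.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] = sum of img[:y][:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows [top, bottom) and columns [left, right)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_box_feature(ii, top, left, bottom, mid, right):
    """Response of a two-rectangle filter: left box minus right box."""
    return box_sum(ii, top, left, bottom, mid) - box_sum(ii, top, mid, bottom, right)


# Assumed toy image.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)
lower_right = box_sum(ii, 1, 1, 3, 3)
feature = two_box_feature(ii, 0, 0, 3, 1, 3)
```

The integral image is built once per image in a single pass, after which each box sum costs the same regardless of the box size. This is what makes dense window-by-window evaluation at multiple scales affordable.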
There is a large body of research on sentiment analysis, or more generally on sentence classification tasks. Initial approaches followed the classical two-stage scheme of extraction of (handcrafted) features, followed by a classification stage. Typical features include bag-of-words or n-grams, and their TF-IDF. These techniques have been compared with ConvNets by (Zhang et al., 2015; Zhang and LeCun, 2015). We use the same corpora for our experiments. More recently, words or characters have been projected into a low-dimensional space, and these embeddings are combined to obtain a fixed-size representation of the input sentence, which then serves as input for the classifier. The simplest combination is the element-wise mean. This usually performs badly since all notion of token order is disregarded.
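The loss of token order under the element-wise mean can be seen in a minimal sketch. The two-dimensional embeddings below are made-up values chosen for illustration, not learned vectors, and the function name is hypothetical.

```python
def mean_embedding(tokens, embeddings):
    """Combine token embeddings into one fixed-size vector by averaging."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, value in enumerate(embeddings[tok]):
            total[i] += value
    return [t / len(tokens) for t in total]


# Assumed toy embedding table (values are arbitrary).
embeddings = {
    "dog": [1.0, 0.5],
    "bites": [-0.5, 2.0],
    "man": [0.25, -1.0],
}

a = mean_embedding(["dog", "bites", "man"], embeddings)
b = mean_embedding(["man", "bites", "dog"], embeddings)
```

Both sentences map to the same vector, although their meanings differ, so a classifier operating on this representation cannot distinguish them. The representation does have a fixed size independent of sentence length, which is the property that makes it usable as classifier input.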
Each convolutional block (see Figure 2) is a sequence of two convolutional layers, each one followed by a temporal BatchNorm (Ioffe and Szegedy, 2015) layer and a ReLU activation. The kernel size of all the temporal convolutions is 3, with padding such that the temporal resolution is preserved (or halved in the case of the convolutional pooling with stride 2, see below). Steadily increasing the depth of the network by adding more convolutional layers is feasible thanks to the limited number of parameters of very small convolutional filters in all layers. Different depths of the overall architecture are obtained by varying the number of convolutional blocks in between the pooling layers (see Table 2). Temporal batch normalization applies the same kind of regularization as batch normalization, except that the activations in a mini-batch are jointly normalized over temporal (instead of spatial) locations. So, for a mini-batch of size m and feature maps of temporal size s, the sum and the standard deviations related to the BatchNorm algorithm are taken over |B| = m · s terms.
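Two points of the description above can be made concrete in a minimal sketch: the statistics of temporal batch normalization are pooled over |B| = m · s values per feature map, and a kernel of size 3 preserves the temporal resolution with stride 1 and halves it with stride 2. The padding of 1, the constant `eps`, and the function names are assumptions for illustration; the learnable scale and shift are omitted for brevity.

```python
import math


def temporal_batch_norm(batch, eps=1e-5):
    """Normalize ONE feature map jointly over mini-batch and time.

    batch : m sequences, each a list of s activations of the same feature map
    Returns the normalized batch and the number of pooled terms |B| = m * s.
    """
    values = [v for seq in batch for v in seq]  # pool over m * s locations
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    normed = [[(v - mean) / math.sqrt(var + eps) for v in seq] for seq in batch]
    return normed, n


def conv_output_length(s, kernel=3, padding=1, stride=1):
    """Temporal length after a 1-D convolution (padding=1 is an assumption)."""
    return (s + 2 * padding - kernel) // stride + 1


# Assumed toy mini-batch: m = 2 sequences of temporal size s = 3.
normed, n_terms = temporal_batch_norm([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
flat = [v for seq in normed for v in seq]
flat_mean = sum(flat) / len(flat)
```

With `conv_output_length(8)` the temporal size stays at 8, and with `conv_output_length(8, stride=2)` it becomes 4, matching the preserved and halved resolutions mentioned in the text.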
Model adaptation: we focus on two deep CNN architectures which have been used in multi-label image classification, VGG16 and ResNet-101, to which two changes have been made in this study. First, we apply an adaptive pooling layer to the last convolutional feature maps such that different input sizes can be handled within the same architecture. Second, the final output layer for single-label classification in the original model is simply replaced with a fully connected layer in which the number of neurons is set to C (i.e., the number of concerned class labels).
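The way an adaptive pooling layer lets one architecture handle different input sizes can be sketched as follows. The text does not state whether average or max pooling is used, nor the output size, so the sketch assumes average pooling and uses one common convention for the bin boundaries (start = floor(i · in / out), end = ceil((i + 1) · in / out)); the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def adaptive_avg_pool2d(fmap, out_h, out_w):
    """Average-pool a 2-D feature map to a fixed (out_h, out_w) grid."""
    in_h, in_w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = ceil_div((j + 1) * in_w, out_w)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Assumed toy feature maps of different spatial sizes.
small = [[1, 2],
         [3, 4]]
large = [[4 * r + c for c in range(4)] for r in range(4)]
odd = [[1] * 5 for _ in range(3)]

pooled_small = adaptive_avg_pool2d(small, 1, 1)
pooled_large = adaptive_avg_pool2d(large, 2, 2)
pooled_odd = adaptive_avg_pool2d(odd, 2, 2)
```

Whatever the spatial size of the incoming feature map, the output grid has the requested size, so the fully connected layer with C neurons that follows always receives an input of the same dimension.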