2.4 Unsupervised Domain Adaptive Multi-View Learning
Labelling a large-scale visual dataset from scratch can be prohibitively expensive (Berinsky et al, 2012; Deng et al, 2009; Zheng et al, 2015). To improve model performance on an unlabelled target dataset, one learning strategy (Zhang et al, 2019; Ganin and Lempitsky, 2015; Luo et al, 2017) is to incorporate a labelled source dataset with relevant tasks into training. Each dataset is regarded as a domain, and this setting is known as unsupervised domain adaptation (UDA) (Long et al, 2015b; Ganin et al, 2016). The main purpose of UDA is to transfer knowledge from the source domain to the target domain.
In this thesis, the focus is a specific UDA setting, unsupervised domain adaptive (UDA) multi-view learning, where both domains are multi-view datasets with relevant tasks. Many multi-view visual applications (Zhang et al, 2019; Sohn et al, 2017) follow this setting, and our main focus is UDA person Re-ID (Kodirov et al, 2016; Wei et al, 2018; Wang et al, 2018). The main challenges are the disjoint label (e.g., person identity) spaces across domains and the clear domain gaps, as illustrated in Figure 1.3. The related transfer learning settings, namely conventional unsupervised domain adaptation (Section 2.4.1) and disjoint label space transfer learning (DLSTL) (Section 2.4.2), are reviewed and compared as follows.
2.4.1 Unsupervised Domain Adaptation with Domain Alignment
An underlying assumption of conventional unsupervised domain adaptation (UDA) is that the source and target domains share the same label space. Various domain alignment methods have been proposed for UDA based on this assumption. The cross-domain alignment can take place in either the raw image space (Hoffman et al, 2017; Kim et al, 2017) or a deep feature embedding space (Chen et al, 2019; Ganin et al, 2016; Gretton et al, 2009).
Although this assumption does not hold for the UDA person Re-ID problem, existing methods still adhere to the domain alignment framework. In (Wei et al, 2018; Deng et al, 2018), person style transfer GANs are trained to synthesise images of persons in the target domain style while preserving the identity information from the source dataset, so that supervised training can be carried out in the target domain. In contrast, cross camera-view image synthesis in (Zhong et al, 2018) takes place only in the target domain, for pseudo identity label generation. By performing joint feature learning using both domains, as in our model, (Zhong et al, 2018) achieves the best results thus far. However,
44 Chapter 2. Literature Review
it still cannot avoid the challenging GAN training process and requires the image synthesis and Re-ID models to be trained independently in separate stages. On the other hand, domain feature alignment techniques such as maximum mean discrepancy (MMD) (Gretton et al, 2009) have been used to drive Re-ID domain adaptation (Wang et al, 2018; Lin et al, 2018). However, unlike the conventional domain adaptation setting, the label spaces in UDA Re-ID are disjoint, so it is unclear why and how the domains should be aligned. Moreover, both (Wang et al, 2018) and (Lin et al, 2018) rely on attributes to provide a good intermediate space for alignment, but attribute annotation is not widely available, which limits their applicability.
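To make the MMD-based alignment idea above concrete, the following is a minimal sketch of a biased estimate of the squared MMD with a Gaussian kernel. It is a generic textbook estimator written in plain Python for self-containment, not the formulation used in the cited Re-ID methods; the function names, the bandwidth value and the toy Gaussian features are all illustrative assumptions.

```python
import math
import random


def rbf(x, y, bandwidth):
    """Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(a, b, bandwidth):
    """Average kernel value over all pairs (one sample from a, one from b)."""
    total = sum(rbf(x, y, bandwidth) for x in a for y in b)
    return total / (len(a) * len(b))


def mmd2_biased(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))


def sample(rng, n, dim, mean):
    """Toy stand-in for deep features: isotropic Gaussian samples (assumption)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
source_feats = sample(rng, 100, 4, 0.0)
same_domain = sample(rng, 100, 4, 0.0)      # drawn from the same distribution
shifted_domain = sample(rng, 100, 4, 1.5)   # simulated domain gap

mmd_same = mmd2_biased(source_feats, same_domain, bandwidth=2.0)
mmd_shifted = mmd2_biased(source_feats, shifted_domain, bandwidth=2.0)
print(f"same distribution: {mmd_same:.4f}, shifted: {mmd_shifted:.4f}")
```

Minimising such a quantity over the feature extractor pulls the two feature distributions together, which is why its use is questionable when the label spaces are disjoint.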
2.4.2 Disjoint Label Space Transfer Learning
Transfer learning (TL) aims to transfer knowledge from one domain or task to improve performance on another (Pan and Yang, 2009). The most widely used TL technique for deep networks is fine-tuning (Yosinski et al, 2014; Chen et al, 2018; Ren et al, 2015). Instead of training a target network from scratch, its weights are initialised from a model pre-trained on another task, such as ImageNet (Deng et al, 2009) classification. Fine-tuning requires the target dataset to be fully supervised. Another TL setting, Disjoint Label Space Transfer Learning (DLSTL), focuses on the case where the source and target label spaces are disjoint. The most studied DLSTL problems are semi-supervised DLSTL (Luo et al, 2017), where both unlabelled and a few labelled target data are available, and unsupervised DLSTL, where only unlabelled target data are available. Therefore, different UDA multi-view learning problems, e.g., UDA person Re-ID (Zheng et al, 2016a) and fine-grained sketch-based image retrieval (SBIR) (Sangkloy et al, 2016), belong to unsupervised DLSTL. In contrast, conventional UDA (Ganin et al, 2016) assumes the same label space across domains. Both settings, however, have an unlabelled target dataset and a labelled source dataset, as in unsupervised transfer learning (Zhang et al, 2019; Ganin and Lempitsky, 2015; Luo et al, 2017). In summary, the different transfer learning settings can be located along two axes, the relation between the label spaces across domains and the amount of target supervision provided, as illustrated in Figure 2.4.
2.4.3 Multi-Task Learning
Unsupervised domain adaptive (UDA) multi-view learning belongs to a specific transfer learning setting, DLSTL, as analysed above. Therefore, the multi-task learning pipeline for transfer learning (Pan and Yang, 2009) can be adopted. Its main objective is to learn a shared deep embedding space with contributions from both domains. Specifically, the space serves the source domain for supervised learning and the target domain for unsupervised learning. The unsupervised learning objectives play the key role in boosting the target performance. Two different assumptions on the target data are proposed, resulting in two multi-task learning algorithms for unsupervised domain adaptive multi-view learning.

[Figure 2.4. Axes: Source & Target Label Space (Disjoint, Aligned) and Target Supervision (Supervised, Semi-supervised, Unsupervised). Regions: Fine-tuning; Semi-supervised DLSTL; Unsupervised DLSTL (UDA MVL); Unsupervised Domain Adaptation (Conventional UDA).]
Figure 2.4: Schematic of various transfer learning problems on two criteria: the relation between source and target label space, and the amount of target problem supervision. MVL stands for Multi-View Learning.
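As a rough schematic of the multi-task pipeline described in this section, the shared embedding could be trained with an objective of the following form. The notation is an illustrative assumption rather than the thesis's own: $\theta$ denotes the shared network parameters, $\mathcal{D}_S$ the labelled source set, $\mathcal{D}_T$ the unlabelled target set, and $\lambda \ge 0$ a trade-off weight.

```latex
\min_{\theta} \; \mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{sup}}(\theta;\, \mathcal{D}_S)
  + \lambda \, \mathcal{L}_{\mathrm{unsup}}(\theta;\, \mathcal{D}_T)
```

Under this reading, the two algorithms would differ only in the choice of the unsupervised term, which encodes the assumption made about the target data.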
The first method is based on the common space factorisation assumption. Such a space is regularised to be low-entropy, i.e., near-binary. Each dimension of the space aims to capture a latent visual attribute such as colour or texture. The use of binary codes for hashing with deep networks goes back to (Salakhutdinov and Hinton, 2009). In computer vision, hashing layers have been inserted between the feature and classification layers to provide a hashing code (Lin et al, 2015a; Zhu et al, 2016). To produce a binary representation for fast retrieval, a threshold is applied to the sigmoid-activated hashing layer (Lin et al, 2015a). Entropy loss for unlabelled data is another widely used regulariser (Long et al, 2016; Zhu, 2005). It is applied at the classification layer in problems where the unlabelled and labelled data share the same label space, and reflects the inductive bias that a classification boundary should not cut through dense regions of unlabelled data. Its typical use is on softmax classifier outputs, where it encourages the classifier to pick a single label. Graph-based regularisation is popular in semi-supervised learning (SSL), which uses both labelled and unlabelled data to achieve better performance than learning with labelled data only (Zhu, 2005; Belkin et al, 2006). In SSL, graph-based regularisation is applied to make model predictions respect the feature-space manifold (Yue et al, 2017;
Nadler et al, 2009; Belkin et al, 2006). Moreover, exploiting a graph built from lower-level features to regularise higher-level features is widely adopted in other scenarios, e.g., unsupervised learning (Jia et al, 2015; Yang et al, 2017b).
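The three regularisers reviewed above can be illustrated with a short sketch: thresholding sigmoid activations into binary codes, the entropy of classifier outputs, and a graph smoothness penalty. These are generic textbook forms written in plain Python, not the losses proposed in this thesis; all function names, the 0.5 threshold and the toy inputs are illustrative assumptions.

```python
import math


def binarise(codes, threshold=0.5):
    """Threshold sigmoid activations in [0, 1] into binary hash codes."""
    return [[1 if c >= threshold else 0 for c in row] for row in codes]


def entropy_loss(probabilities):
    """Mean Shannon entropy of a batch of probability vectors.

    Low values mean each prediction is concentrated on a single label.
    """
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p in probabilities:
        total -= sum(pi * math.log(pi + eps) for pi in p)
    return total / len(probabilities)


def graph_smoothness(features, weights):
    """Graph regulariser 0.5 * sum_ij w_ij * ||f_i - f_j||^2.

    It is small when samples joined by a heavy edge have similar features.
    """
    total = 0.0
    for i, f_i in enumerate(features):
        for j, f_j in enumerate(features):
            sq_dist = sum((a - b) ** 2 for a, b in zip(f_i, f_j))
            total += weights[i][j] * sq_dist
    return 0.5 * total


# Binary codes from a sigmoid-activated layer (toy activations).
hash_codes = binarise([[0.9, 0.2, 0.5]])

# A confident prediction has lower entropy than a uniform one.
confident = [[0.98, 0.01, 0.01]]
uncertain = [[1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]]

# Three samples; the graph links either two close samples or two distant ones.
features = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
close_edge = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
far_edge = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]

smooth_penalty = graph_smoothness(features, close_edge)
rough_penalty = graph_smoothness(features, far_edge)
```

In a deep model these terms would be computed on network outputs and minimised by gradient descent alongside the supervised loss.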
The second method is based on the clustering assumption on the target samples. Taking person Re-ID as an example, the target instances form clusters, and each cluster can be interpreted as an unknown identity. It is most closely related to PUL (Fan et al, 2018b), also a clustering paradigm for deep UDA person Re-ID. PUL alternates between performing k-means clustering using fixed features and deep feature learning using fixed clusters as classification targets. Recent work shows that jointly optimising deep representations with different clustering objectives (e.g., k-means (Xie et al, 2016a; Yang et al, 2017a), agglomerative (Yang et al, 2016a), Gaussian Mixture Model (Van den Oord and Schrauwen, 2014; Jiang et al, 2017; Viroli and McLachlan, 2017), spectral (Shaham et al, 2018)) yields promising results (Aljalbout et al, 2018). Different from these clustering works, our main focus is source-to-target knowledge transfer, with the unlabelled (target) multi-view data exploited via a clustering loss. A number of recent deep clustering methods (Xie et al, 2016a; Yang et al, 2017a) also attempt to avoid the hard/deterministic assignment in k-means clustering. DEC (Xie et al, 2016a) performs soft cluster assignment using a Student-t distribution, and uses an auxiliary distribution to sharpen the initial soft assignment. Although DEC is end-to-end trainable by SGD, it fails to handle the reinforcing errors problem, since the auxiliary distribution always peaks in the same position as the initial soft assignment, but with a higher probability. DCN (Yang et al, 2017a), on the other hand, is similar in that it also uses a reconstruction loss to regularise the clustering by avoiding trivial clustering solutions (such as mapping all the input to a single point). However, like PUL (Fan et al, 2018b), DCN is based on alternating optimisation of k-means and deep feature learning (i.e., it is not end-to-end trainable). It also does not address the reinforcing errors problem.
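The DEC soft assignment and its sharpening step discussed above can be sketched as follows. This is a minimal plain-Python illustration of the Student-t assignment and the auxiliary target distribution as commonly stated for DEC, applied to fixed toy embeddings and centroids; there is no network or training loop, and the names, the value alpha = 1 and the toy data are assumptions.

```python
import math


def soft_assign(embeddings, centroids, alpha=1.0):
    """Student-t soft assignment q_ij of embedding i to centroid j."""
    q = []
    for z in embeddings:
        sims = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            sims.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        norm = sum(sims)
        q.append([s / norm for s in sims])
    return q


def target_distribution(q):
    """Auxiliary distribution p_ij proportional to q_ij^2 / f_j, f_j = sum_i q_ij."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        norm = sum(weights)
        p.append([w / norm for w in weights])
    return p


def kl_divergence(p, q):
    """Clustering loss KL(P || Q) summed over samples and clusters."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0.0
    )


# Toy 2-D embeddings forming two groups, with one centroid near each group.
embeddings = [[0.0, 0.1], [0.2, -0.1], [2.9, 3.1], [3.2, 2.8]]
centroids = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)

for q_row, p_row in zip(q, p):
    print([round(v, 3) for v in q_row], "->", [round(v, 3) for v in p_row])
```

In this toy run each target row peaks at the same cluster as the initial assignment, only more sharply, so an initially wrong assignment would be strengthened rather than corrected.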