3.2 Results and Analysis
5.1.1 Methodology
Definition and Notation. For UDA multi-view learning, there is a source (labelled) domain S and a target (unlabelled or partially labelled) domain T. The key characteristic of UDA multi-view learning is the disjoint label space, i.e., the source label space Y_S and the target label space Y_T are potentially disjoint: Y_S ∩ Y_T = ∅. Instances from source/target domains are denoted X_S and X_T.
5.1. Common Factorised Space Model 81
[Figure 5.1: architecture diagram. Recoverable labels: mini-batch sampling, common space, graph regularisation, unsupervised factorisation loss.]
Figure 5.1: Different colours correspond to different data streams. Green indicates source data. Blue is used for target data. Purple means joint data from both source and target domains. The parameters of CFSM are θ_M, θ_C and θ_S, corresponding to the feature extractor Φ_M, the CFS layer Φ_C and the source classifier χ_S. θ_S is learned with source data using the supervised loss, while θ_M and θ_C are estimated using data from both domains with all losses and regularisations.
Model Architecture. The proposed model architecture consists of three modules. The first is a feature extractor F = Φ_M(X), which can be any deep neural network and is shared between all domains. This is followed by a fully connected layer with sigmoid activation σ, which together define the Common Factorised Space (CFS) layer. This provides a representation of dimension d_C, f_C = Φ_C(·) = σ(W Φ_M(·) + b). Recall that the goal of CFS is to learn a latent factor (low-entropy) representation for both source and target domains. The sigmoid activation means that the layer's output lies in f_C ∈ (0,1)^{d_C}, so activations near 1 or 0 can be interpreted as the corresponding latent factor being present or absent, respectively. To encourage a near-binary representation, an unsupervised factorisation loss is applied. For the labelled source domain only, the pre-activation values of f_C are then classified by the third module, a softmax classifier χ_S, with cross-entropy loss. The overall architecture is illustrated in Figure 5.1.
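The CFS layer described above can be illustrated with a minimal plain-Python sketch. The function names, the toy weights and the toy feature vector are hypothetical, and a hand-written vector stands in for the output of the deep feature extractor Φ_M; this is not the thesis implementation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cfs_layer(features, W, b):
    """Return (pre_activation, f_C) for one feature vector.

    features: list of d_M floats standing in for Phi_M(x)
    W: d_C x d_M weight matrix as a list of rows; b: list of d_C biases
    """
    pre = [sum(w * f for w, f in zip(row, features)) + b_k
           for row, b_k in zip(W, b)]
    return pre, [sigmoid(z) for z in pre]

# Toy values (assumed): d_M = 3 backbone features, d_C = 2 latent factors.
features = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 1.0],
     [0.0, 2.0, -1.0]]
b = [0.0, 0.5]
pre, f_C = cfs_layer(features, W, b)
# f_C[0] is close to 1 (factor present), f_C[1] is close to 0 (factor absent).
```

The pre-activation values are returned alongside f_C because, per the text, the source classifier operates on the pre-activation values while the factorisation loss operates on the sigmoid outputs.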
Regularised Model Optimisation. The parameters of the proposed CFSM are θ := {θ_M, θ_C, θ_S}, comprising the parameters of the feature extractor Φ_M, the CFS layer Φ_C and the source classifier χ_S. Both the labelled source data {X_S, Y_S} and the unlabelled target data X_T are used in the multi-task model training procedure.
82 Chapter 5. Unsupervised Domain Adaptive Multi-View Learning
Firstly, the labelled source data {X_S, Y_S} contributes to model training as a supervised learning task with a loss ℓ_sup(X_S, Y_S; θ), which is a conventional cross-entropy. However, this loss is inapplicable to the target domain data since no supervision is provided.
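As a rough illustration of the supervised term, the following sketch computes a softmax cross-entropy averaged over a batch of labelled source samples. The function names and toy logits are assumptions; in the model the logits would come from the source classifier χ_S.

```python
import math

def softmax_cross_entropy(logits, label):
    """-log softmax(logits)[label], using the log-sum-exp trick for stability."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

def supervised_loss(batch_logits, labels):
    """Mean cross-entropy over labelled (source) samples only."""
    losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Toy source batch (assumed values): two samples, three source classes.
batch_logits = [[4.0, 0.0, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
loss = supervised_loss(batch_logits, labels)
```

Target samples never enter this function, which mirrors the statement that the loss is inapplicable without supervision.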
To adapt knowledge across domains, the unlabelled target data plays the key role in model training. Therefore, unsupervised domain adaptation losses/regularisations are required to enable multi-task learning with the target domain. As mentioned before, conventional UDA losses/regularisations are not applicable to the UDA multi-view learning problem since the label spaces of the two domains are disjoint. Therefore, we propose the Common Factorised Space (CFS), realised by the CFS layer Φ_C. A low-entropy loss is used to regularise its learning (parameter θ_C). In addition, the UDA multi-view learning problem usually relies on the feature representation for retrieval tasks, e.g., in person Re-ID. Further regularisation is thus required on the learning of the feature extractor Φ_M (with parameter θ_M). As illustrated in Figure 5.1, two unsupervised regularisations are exploited in our CFSM, based on data samples from both domains.
Low-Entropy Regularisation: Unsupervised Adaptation. Firstly, the definition of the low-entropy regulariser on the CFS is discussed. The sigmoid-activated outputs f_C from the CFS layer Φ_C can be interpreted as multi-label predictions on latent factors. The uncertainty measure for label prediction can be defined by using its entropy,

$$ -\sum_{i=1}^{N} \langle \mathbf{f}_{C,i}, \log(\mathbf{f}_{C,i}) \rangle = -\sum_{i=1}^{N} \langle \Phi_C(\mathbf{x}_i), \log(\Phi_C(\mathbf{x}_i)) \rangle \tag{5.1} $$

where f_{C,i} denotes the common factor representation Φ_C(x_i) of instance x_i ∈ X. This is applied on both source and target data, so N is the number of instances in both datasets. log(·) is applied element-wise, and <·, ·> is the vector inner product. According to the low-uncertainty criterion (Carlucci et al, 2017), imposing this prior can be achieved by minimising this uncertainty measure. Eq. 5.1 is thus the regulariser corresponding to the low-entropy prior. Specifically, this loss biases the representation F_C to contain more certain predictions, i.e., values closer to 0 or 1 for each discovered latent factor. Therefore, it is denoted as the unsupervised factorisation loss.
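The behaviour of Eq. 5.1 can be checked numerically with a small sketch. Each element contributes −p log p, which vanishes at p = 0 and p = 1 and peaks at p = 1/e, so near-binary factor vectors score lower than uncertain ones. The function name, the clipping constant eps and the toy vectors are assumptions introduced for illustration.

```python
import math

def factorisation_loss(F_C, eps=1e-12):
    """Eq. 5.1: -sum_i <f_{C,i}, log f_{C,i}> over all instances in F_C.

    F_C: list of CFS activations, one list of values in [0, 1] per instance.
    eps clips the argument of log so that an exact 0 does not raise an error.
    """
    total = 0.0
    for f in F_C:
        total -= sum(p * math.log(max(p, eps)) for p in f)
    return total

# Toy activations (assumed): two instances, two latent factors each.
near_binary = [[0.99, 0.01], [0.02, 0.98]]
uncertain = [[0.4, 0.5], [0.3, 0.6]]
loss_binary = factorisation_loss(near_binary)
loss_uncertain = factorisation_loss(uncertain)
# loss_binary is smaller than loss_uncertain, as the low-entropy prior requires.
```

In mini-batch training the same quantity would be evaluated on the source and target samples of each batch together.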
In summary, the low-entropy regulariser on CFS is built upon the assumption that the two domains share a set of latent attributes and that if a source classifier is well adapted to the target, then the presence/absence of these attributes should be certain for each instance. Therefore, it
essentially generalises the low-uncertainty principle (widely used in existing unsupervised and semi-supervised learning literature) to the disjoint label space setting.
Graph Regularisation: Robust Feature Learning. The second prior regularises the feature extractor Φ_M. The unique property of our setup so far is that knowledge transfer into the target domain occurs via the CFS layer; therefore we are interested in ensuring that the feature extractor network extracts features whose similarity structure reflects that of the latent factors in the CFS layer. Unlike conventional graph Laplacian losses that regularise higher-level features with a graph built on lower-level features (Belkin et al, 2006; Zhu, 2005), we do the reverse and regularise the feature extractor Φ_M to reflect the similarity structure in f_C. This is particularly important for applications where the target problem is retrieval, because the deep features f = Φ_M(·) are used as the image representation.
The proposed graph loss is expressed as

$$ \operatorname{Tr}(\mathbf{f}^{\top} \Delta_{\mathbf{f}_C} \mathbf{f}), \tag{5.2} $$

where Δ_{f_C} is the graph Laplacian (Cai et al, 2010b) built on the common space features f_C.
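The graph loss of Eq. 5.2 can be sketched as follows. The text does not specify how the affinity graph is built, so this sketch assumes a fully connected Gaussian affinity on f_C and the unnormalised Laplacian D − W; both choices, along with all function names and toy values, are assumptions rather than the construction of Cai et al. For a symmetric affinity W, the trace equals half the affinity-weighted sum of squared feature distances, which the second function computes as a cross-check.

```python
import math

def gaussian_affinity(F_C, sigma=1.0):
    """Assumed affinity: W_ij = exp(-||f_Ci - f_Cj||^2 / (2 sigma^2)), W_ii = 0."""
    n = len(F_C)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(F_C[i], F_C[j]))
                W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W

def laplacian(W):
    """Unnormalised graph Laplacian D - W, with D the diagonal degree matrix."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def graph_loss(F, L):
    """Tr(F^T L F), with F an n x d list of feature rows."""
    n, d = len(F), len(F[0])
    return sum(F[i][k] * L[i][j] * F[j][k]
               for i in range(n) for j in range(n) for k in range(d))

def pairwise_form(F, W):
    """Equivalent form: 0.5 * sum_ij W_ij * ||f_i - f_j||^2."""
    n = len(F)
    return 0.5 * sum(W[i][j] * sum((a - b) ** 2 for a, b in zip(F[i], F[j]))
                     for i in range(n) for j in range(n))

# Toy mini-batch (assumed): instances 0 and 1 share latent factors, 2 differs.
F_C = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
W = gaussian_affinity(F_C)
L = laplacian(W)
F_aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]     # similar where f_C is similar
F_misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # similarity structure swapped
```

Features whose neighbourhoods agree with the CFS similarity structure (F_aligned) give a smaller loss than features whose neighbourhoods disagree (F_misaligned), which is the behaviour the regulariser is meant to reward.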
Summary. We unify the proposed model, with parameters θ := {θ_M, θ_C, θ_S}, with the source data {X_S, Y_S} and target data {X_T} for UDA multi-view problems under the multi-task learning framework. The objective decomposes into a standard supervised term (with source data only) and data-driven priors for the CFS layer and the feature extraction module. These correspond to the supervised loss ℓ_sup(X_S, Y_S; θ), the unsupervised factorisation loss (Eq. 5.1) and the graph loss (Eq. 5.2) respectively. Taking all terms into account, the final optimisation objective is

$$ \ell(\theta) = \ell_{\mathrm{sup}}(X_S, Y_S; \theta) + \beta_M \operatorname{Tr}(\mathbf{f}^{\top} \Delta_{\mathbf{f}_C} \mathbf{f}) - \beta_C \frac{1}{N} \sum_{i=1}^{N} \langle \mathbf{f}_{C,i}, \log(\mathbf{f}_{C,i}) \rangle, \tag{5.3} $$

where β_C and β_M are balancing hyper-parameters. In order to select β_C and β_M, the model is first run by setting all weights to 1; after the first few iterations, the value of each loss is checked. We then set the two hyper-parameters to rescale the losses to a similar range so that all three terms contribute approximately equally to the training.
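One way to realise the weight-selection heuristic is sketched below. The text only says that the losses are rescaled to a similar range, so averaging over the warm-up iterations and matching the graph and entropy terms to the magnitude of the supervised loss are assumptions, as are the function names and the toy loss values.

```python
def mean(values):
    return sum(values) / len(values)

def balance_weights(sup_vals, graph_vals, ent_vals, eps=1e-12):
    """Pick beta_M and beta_C so the weighted terms match the supervised loss.

    Each argument is a list of per-iteration loss values recorded during a
    short warm-up run with all weights set to 1.
    """
    target = mean(sup_vals)
    beta_M = target / max(mean(graph_vals), eps)
    beta_C = target / max(mean(ent_vals), eps)
    return beta_M, beta_C

def total_loss(sup, graph, ent, beta_M, beta_C):
    """Eq. 5.3, where ent is the already negated, N-averaged entropy term."""
    return sup + beta_M * graph + beta_C * ent

# Toy warm-up log (assumed values): the graph term dominates, entropy is small.
sup_log = [2.0, 2.2]
graph_log = [40.0, 44.0]
ent_log = [0.5, 0.55]
beta_M, beta_C = balance_weights(sup_log, graph_log, ent_log)
```

With these weights, each of the three terms contributes roughly the same amount to the objective at the point where the warm-up values were recorded.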
Mini-batch Organisation. Deep Neural Networks (DNNs) are usually trained with SGD mini-batch optimisation, but Eq. 5.3 is expressed in a full-batch fashion. Converting Eq. 5.3 to mini-batch optimisation is straightforward. However, it is worth mentioning the mini-batch scheduling: each mini-batch contains samples from both source and target domains. The supervised loss
is applied only to source samples with corresponding supervision, the entropy and graph losses are applied to both, and the graph is built per mini-batch. In this work, the numbers of source and target samples in a mini-batch are equal.
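The mini-batch scheduling described above can be sketched with a small generator that yields one half-batch from each domain. The function name, the per-epoch shuffling without replacement and the dropping of leftover samples are assumptions; the text only fixes the equal balance between source and target samples.

```python
import random

def mixed_batches(source, target, batch_size, seed=0):
    """Yield (source_half, target_half) pairs with batch_size // 2 samples each.

    source: list of (x, y) labelled pairs; target: list of unlabelled x.
    Leftover samples that do not fill a complete half-batch are dropped.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even to balance the two domains")
    half = batch_size // 2
    rng = random.Random(seed)
    src, tgt = list(source), list(target)
    rng.shuffle(src)
    rng.shuffle(tgt)
    n_batches = min(len(src), len(tgt)) // half
    for k in range(n_batches):
        yield src[k * half:(k + 1) * half], tgt[k * half:(k + 1) * half]

# Toy data (assumed): 10 labelled source items and 7 unlabelled target items.
source = [("s%d" % i, i % 3) for i in range(10)]
target = ["t%d" % i for i in range(7)]
batches = list(mixed_batches(source, target, batch_size=4))
```

Within each yielded batch, the supervised loss would use only the source half, while the entropy and graph losses, and the per-batch graph itself, would use both halves together.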