Dynamic Texture CNN - Materials and Methods

4.2 Materials and Methods

4.2.2 Dynamic Texture CNN

Slicing the Dynamic Texture data

Slices of the DT sequences are extracted as illustrated in Figure 4.2 to enable the training of the networks on the three orthogonal planes.

XY plane (spatial): A sequence of DT with $d$ frames of size $h \times w$ is represented as $S \in \mathbb{R}^{h \times w \times d \times c}$, where $h$ (height), $w$ (width) and $d$ (depth) are in the $x$, $y$, and $t$ axes respectively and $c$ is the number of colour channels, i.e. three for RGB or one for greyscale. In the spatial plane, $m_d$ frames equally spaced in the temporal axis are extracted from $S$. All the frames are resized using bilinear interpolation to the size $n \times n$ to obtain a sequence $S_{xy} \in \mathbb{R}^{n \times n \times m_d \times c}$ with $m_d \le \min(d, h, w)$ and $n \le \min(d, h, w)$.
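As a minimal sketch of the slicing, covering the spatial frames described above and the temporal slices defined in the next paragraph, the following NumPy code extracts equally spaced slices from an array of shape (h, w, d, c). The function names, the use of `np.linspace` to pick indices (end points included), the convention that the first array axis indexes image rows, and the axis order of the returned arrays are all assumptions made for illustration. The bilinear resizing to $n \times n$ is omitted here.

```python
import numpy as np


def equally_spaced_indices(length, count):
    """Return `count` integer indices spread evenly over range(length).

    Assumption: both end points are included; the source does not state
    the exact spacing convention.
    """
    return np.linspace(0, length - 1, num=count).round().astype(int)


def extract_slices(S, m_d, m_h, m_w):
    """Extract unresized slices of a sequence S with shape (h, w, d, c).

    Returns three arrays whose third axis enumerates the slices:
      xy: (h, w, m_d, c)  frames sampled along the temporal axis
      xt: (w, d, m_h, c)  one row of pixels followed over time
      yt: (h, d, m_w, c)  one column of pixels followed over time
    """
    h, w, d, _ = S.shape
    t_idx = equally_spaced_indices(d, m_d)  # frame positions
    y_idx = equally_spaced_indices(h, m_h)  # row positions
    x_idx = equally_spaced_indices(w, m_w)  # column positions

    xy = S[:, :, t_idx, :]
    # Fixing a row gives a (w, d, c) image; move the slice axis to position 2.
    xt = np.transpose(S[y_idx, :, :, :], (1, 2, 0, 3))
    # Fixing a column gives an (h, d, c) image; same slice-axis convention.
    yt = np.transpose(S[:, x_idx, :, :], (0, 2, 1, 3))
    return xy, xt, yt


# Hypothetical sequence: 10 RGB frames of size 6 x 8.
S = np.arange(6 * 8 * 10 * 3, dtype=float).reshape(6, 8, 10, 3)
xy, xt, yt = extract_slices(S, m_d=4, m_h=4, m_w=4)
```

Each slice would then be resized to $n \times n$ with any bilinear interpolation routine before being passed to the network of the corresponding plane.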

XT and YT planes (temporal): From the same sequence $S$, $m_h$ and $m_w$ slices are extracted in the $xt$ and $yt$ planes, equally spaced on the $y$ and $x$ axes respectively. The slices are resized to $n \times n$, resulting in sequences $S_{xt} \in \mathbb{R}^{n \times n \times m_h \times c}$ and $S_{yt} \in \mathbb{R}^{n \times n \times m_w \times c}$. A slice in the $xt$ (or $yt$) plane reflects the evolution of a row (or a column) of pixels over time throughout the sequence. After pre-processing a sequence, three sets of slices are obtained which represent the same DT in three different planes. Examples of spatial and temporal slices are shown in Figure 4.3.

Training on three planes

In order to evaluate the developed methods, the sequences are split into training and testing sets. Details on the training and testing splits are provided in the experimental setups in Section 4.3.1. A dataset containing $M$ original sequences is split into $T$ training sequences and $(M - T)$ testing sequences. In each plane, there is a total of $T \times m$ training and $(M - T) \times m$ testing slices, where $m \in \{m_d, m_h, m_w\}$ is the number of slices per sequence. For each plane, the $T \times m$ training slices are used to finetune an independent network. In the testing phase, the slices in each plane are classified and the outputs are combined as explained in the following section.

Sum collective score

An independent network is used for each of the three orthogonal planes, thus multiple outputs must be combined in the testing phase. A collective score is implemented by summing the output predictions of the T-CNNs. Firstly, a sequence is represented in each plane by a stack of slices. Therefore, a score for a given plane is obtained by summing the outputs of all the slices in this plane, normalised by the number of slices. The score vector of a sequence in a plane $p$ with $m$ slices is computed as follows:

$$ s^p = \frac{1}{m} \sum_{i=1}^{m} s_i^p \qquad (4.1) $$

where $s_i^p \in \mathbb{R}^N$ is the output (non-normalised classification score) of the last fully-connected layer of the $i$th slice on plane $p$, with $p \in \{xy, xt, yt\}$, and $N$ is the number of classes. All the scores $s_i^p$ with $i \in \{1, \dots, m\}$ are obtained with the same finetuned network for a particular plane $p$, and each plane uses an independently finetuned network. A global score for a given sequence is then obtained by summing over the three planes as follows:

$$ s = \sum_{p \in \{xy, xt, yt\}} s^p \qquad (4.2) $$

Note that in Section 4.4.2, sums over two planes or single planes are also used to analyse their contribution and complementarity.

The collectively detected label $l$ for a sequence is the one for which the summed score is maximum:

$$ l = \arg\max_{j} \, s[j] \qquad (4.3) $$

where $j \in \{1, \dots, N\}$ enumerates the DT classes. This ensemble model approach combines three weak classifiers to create a more accurate one. A late data fusion approach is used as it requires three network classifiers to recognise three data types derived from the sequences, i.e. slices in three different planes. The late fusion adopted here is different from the one used in [155], in which the fully-connected layers combine multiple “streams” of frames analysed at multiple time steps. By using a sum collective score, the classification confidence of each slice, given by the output vector of the last fully-connected layer, is taken into account for the collective classification. Confidence in this context refers to the magnitude of the output activations of the convolutional network as each neuron gives a score similar to a non-normalised probability for the input image to belong to a certain class.
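The three steps of the sum collective score (per-plane score, global score, detected label) can be sketched in NumPy as follows. This is only an illustration under assumptions: the function names and the dictionary layout are hypothetical, the score values are invented, and class labels are 0-indexed here whereas $j$ runs from 1 to $N$ in the text.

```python
import numpy as np


def plane_score(slice_scores):
    """Per-plane score (Eq. 4.1): mean of the raw outputs of the m slices.

    `slice_scores` has shape (m, N): one row of non-normalised class
    scores per slice, taken from the last fully-connected layer.
    """
    return np.asarray(slice_scores, dtype=float).mean(axis=0)


def collective_label(scores_by_plane):
    """Global score (Eq. 4.2) and detected label (Eq. 4.3).

    `scores_by_plane` maps a plane name to its (m, N) array of slice
    scores; m may differ from plane to plane.
    """
    s = sum(plane_score(v) for v in scores_by_plane.values())
    return int(np.argmax(s)), s


# Invented scores for N = 3 classes.
scores = {
    "xy": [[1.0, 2.0, 0.0], [3.0, 0.0, 0.0]],  # two slices
    "xt": [[0.0, 5.0, 1.0]],                   # one slice
    "yt": [[1.0, 1.0, 1.0], [1.0, 1.0, 3.0]],  # two slices
}
label, s = collective_label(scores)
```

Passing only one or two entries in the dictionary gives the single-plane and two-plane variants of the sum.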

Moreover, summing the raw output of the last fully-connected layer gives better results than summing the softmax-normalised probability output. Using the raw output, a single plane can attribute a large non-normalised score to a sequence for a particular class if its confidence is high. This is similar to an automatic weighting strategy based on the detection confidence of each network. Note that in Section 3.5, the softmax outputs are summed. Although there is no large difference in using the softmax or raw outputs, the reasoning is that the subimages are from the same image and same plane. Due to the homogeneity of tissue images, all subimages are expected to contribute similarly to the collective score. On the other hand, summing across multiple planes means that one plane could be detected with more confidence than the others and this confidence should have a greater impact. It may also help by weighting slices in sequences which are not repetitive throughout the entire temporal domain.
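The weighting effect of raw scores can be illustrated with a toy example. The numbers below are invented and chosen only so that the two fusion rules disagree; they are not results from the experiments. After softmax, each plane contributes a vector that sums to one, so a very confident plane cannot outweigh two moderately confident ones, whereas with raw scores it can.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical per-plane scores for N = 2 classes (invented values).
plane_scores = np.array([
    [12.0, 0.0],  # xy: strongly favours class 0
    [0.0, 3.0],   # xt: moderately favours class 1
    [0.0, 3.0],   # yt: moderately favours class 1
])

raw_sum = plane_scores.sum(axis=0)               # the confident plane dominates
softmax_sum = softmax(plane_scores).sum(axis=0)  # each plane contributes at most 1

raw_label = int(np.argmax(raw_sum))
softmax_label = int(np.argmax(softmax_sum))
```

With these values the raw sum selects class 0 and the softmax sum selects class 1. Which decision is correct depends on the data; the example only shows how the magnitude of the raw activations acts as a weight.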

Finally, it was confirmed experimentally that a sum collective score performs better than a majority or a Borda count voting scheme.
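For reference, the two voting schemes mentioned above can be sketched as follows. The text does not detail how they were configured, so this sketch assumes that each row of the input is one voter (for example one plane or one slice), and the function names and score values are hypothetical. Ties are not handled specially and fall to the lowest class index.

```python
import numpy as np


def majority_vote(scores):
    """Each voter votes for its top class; the most voted class wins."""
    scores = np.asarray(scores, dtype=float)
    votes = np.argmax(scores, axis=1)
    return int(np.bincount(votes, minlength=scores.shape[1]).argmax())


def borda_count(scores):
    """Each voter ranks the classes; a class earns 0 points for last
    place up to N - 1 points for first place, and points are summed."""
    scores = np.asarray(scores, dtype=float)
    # argsort applied twice converts scores into ranks (0 = lowest score).
    points = np.argsort(np.argsort(scores, axis=1), axis=1)
    return int(points.sum(axis=0).argmax())


def sum_score(scores):
    """Sum of the raw scores over voters, as in the sum collective score."""
    return int(np.asarray(scores, dtype=float).sum(axis=0).argmax())


# Invented scores: 3 voters, N = 3 classes.
votes = np.array([
    [12.0, 0.0, 1.0],
    [0.0, 3.0, 2.0],
    [0.0, 3.0, 2.0],
])
labels = (majority_vote(votes), borda_count(votes), sum_score(votes))
```

Both voting schemes discard the magnitude of the scores and keep only their order, which is the information the sum collective score retains.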

In document Deep learning for texture and dynamic texture analysis (Page 92-94)