II. Background
2.3 Deep Learning
2.3.1 Convolutional Neural Networks
CNNs are useful for processing data that comes in the form of tensors with dependencies along certain tensor dimensions. Such data is often 1-d (single-channel time series), 2-d (images or spectrograms), or 3-d (videos or sequences of 2-d images). Additional tensor dimensions are often needed to define the data structures required for input into a deep learning architecture or to handle multiple channels, but the descriptions in this section ignore those added dimensions. Each convolutional layer is typically the composition of several stages: a convolution stage where the input is convolved with a set of kernels, a detector stage where the output of the convolutions is passed through a nonlinear activation function such as a Rectified Linear Unit (ReLU) to produce an activation map, and a pooling stage where statistical summaries of local data are used to downsample the activation map [17, 103]. Batch normalization is often used to enhance trainability (rate of convergence and accuracy) and to regularize the model [86]. Throughout this section, canonical image examples are often used to describe concepts because they are intuitive and simple to understand. The remainder of this section assumes the reader has at least a basic understanding of CNNs and covers conceptual design guidelines and recent developments that have drastically improved state-of-the-art results using CNNs. For a detailed review of CNN basics, see [17].
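The three stages of a convolutional layer described above can be sketched in a few lines of NumPy. This is only an illustrative 1-d example under simplifying assumptions (a single input channel, no bias, no batch normalization); the function name `conv_layer_1d`, the toy signal, and the two hand-picked kernels are hypothetical and are not taken from [17] or [103].

```python
import numpy as np

def conv_layer_1d(x, kernels, pool_size=2):
    """One convolutional layer: convolution -> ReLU detector -> max pooling.

    x       : 1-d input signal, shape (n,)
    kernels : bank of 1-d kernels, shape (k, m)
    returns : pooled activation maps, shape (k, (n - m + 1) // pool_size)
    """
    # Convolution stage: slide each kernel over the input ("valid" mode).
    # As in most deep learning libraries, this is cross-correlation (no flip).
    feature_maps = np.stack([np.correlate(x, w, mode="valid") for w in kernels])

    # Detector stage: element-wise ReLU nonlinearity gives the activation map.
    activations = np.maximum(feature_maps, 0.0)

    # Pooling stage: non-overlapping max pooling downsamples each map.
    k, length = activations.shape
    trimmed = activations[:, : (length // pool_size) * pool_size]
    return trimmed.reshape(k, -1, pool_size).max(axis=2)

# Toy signal that rises and then falls.
x = np.array([0., 1., 2., 3., 2., 1., 0., -1., -2.])
kernels = np.array([[1., 0., -1.],    # responds where the signal is falling
                    [-1., 0., 1.]])   # responds where the signal is rising
pooled = conv_layer_1d(x, kernels)
print(pooled.shape)  # (2, 3): two kernels, three pooled positions each
```

In a trained network the kernels would be learned rather than fixed by hand; they are fixed here only so that the effect of each stage is easy to trace.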
LeCun et al. describe the four concepts that give CNNs their unique properties: parameter sharing, local connections, pooling, and many-layered depth [103]. Parameter sharing leads to an efficient network parametrization and also makes CNNs work well in cases where translational symmetry is present [43]. In the context of a CNN, translational symmetry means that both the classes and the distribution of the data are invariant to location shifts: a dog is still a dog whether it is in the upper left part of an image or the center of an image [43]. The combination of local connectivity and parameter sharing gives rise to equivariance to translation [17]. In image processing terms, this means that translating an image and then doing a forward pass through a layer is equivalent to performing a forward pass on the original image and then translating the resulting feature map [43]. Translational equivariance preserves translational symmetry as a network’s depth increases, which enables the creation of deep convolutional networks [43].
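Translational equivariance can be checked numerically. The sketch below assumes a 1-d signal, a single shared kernel, and circular (wrap-around) boundaries so that the equality holds exactly; with zero padding it holds only away from the edges. The helper `circular_correlate` and the random test data are hypothetical and serve only to illustrate the property.

```python
import numpy as np

def circular_correlate(x, w):
    """Cross-correlate x with kernel w using wrap-around boundaries."""
    n = len(x)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(len(w)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # stand-in for one row of an image
w = rng.standard_normal(3)    # one shared, locally connected kernel
shift = 5

# Translate the input, then apply the layer ...
shift_then_layer = circular_correlate(np.roll(x, shift), w)
# ... versus apply the layer, then translate the feature map.
layer_then_shift = np.roll(circular_correlate(x, w), shift)

equivariant = np.allclose(shift_then_layer, layer_then_shift)
print(equivariant)
```

Because the same kernel weights are applied at every position, the two orders of operation agree; a layer with position-specific weights would not generally pass this check.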
Two types of pooling will be important for EEG processing. The first type is pooling within a feature map. In this case, pooling imparts invariance to local translation. It assumes that identifying whether a particular feature is present is more important than knowing the exact location of that feature, and trades some location specificity for more robust feature detection [17]. This type of pooling will be useful in high-density channel EEG processing because it will allow the learned representation to become invariant to small spatial or frequency translations that may differ slightly from one individual to another, or that arise from small changes in electrode placement from one session to another. The second type of pooling that will be important to EEG processing is pooling across feature maps, also known as global pooling [17, 107]. This type of pooling can be especially useful when using what is termed in this manuscript a global correlational layer: a layer which identifies correlations globally between channels and frequency bands, as illustrated in Figure 1. Each convolutional kernel will span n channels times f frequencies in height and will have a width of 1 time step. This type of convolution will search for a unique pattern of correlation between different channel-frequency combinations that corresponds to a particular type of workload. When a network is trained using examples from a variety of task environments and individuals, this should allow the network to identify specific distributed patterns of brain activity which are invariant across groups of individuals or across different tasks. The activation maps resulting from these workload-specific kernels will be pooled together using a global max-pooling function across all the kernels, thus determining the global workload response to the input stimuli. This type of feature could provide an indication of the dominant pattern of brain activity at a given time if a strictly convolutional network is used. This is because each separate fork will then output its max activation (through global pooling) for a given time step, which could then be processed sequentially with an RNN or 1-d CNN to determine the experienced workload level for a temporal segment. An example of the global max-pooling process is also illustrated in Figure 1.
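The global correlational layer and the global max-pooling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the tensor sizes, the random data and kernels, the ReLU detector, and all variable names are hypothetical. Because each kernel spans the full flattened channel-frequency axis and a single time step, convolving along time reduces to a matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_freqs, n_steps, n_kernels = 4, 5, 10, 3   # assumed toy sizes

# Hypothetical EEG input: channels x frequency bands x time steps.
eeg = rng.standard_normal((n_channels, n_freqs, n_steps))

# Flatten the channel and frequency axes into one channel-frequency axis.
x = eeg.reshape(n_channels * n_freqs, n_steps)

# k kernels, each (n_channels * n_freqs) high and 1 time step wide.
# Random here; in practice they would be learned.
kernels = rng.standard_normal((n_kernels, n_channels * n_freqs))

# Activation map: k kernels by t time steps (ReLU detector is an assumption).
activation_map = np.maximum(kernels @ x, 0.0)

# Global max-pooling across the kernels at each time step.
pooled = activation_map.max(axis=0)               # shape (t,)
dominant_kernel = activation_map.argmax(axis=0)   # most activated kernel per step

print(activation_map.shape, pooled.shape)
```

The `pooled` sequence (or the `dominant_kernel` indices) is the kind of per-time-step output that could then be passed to an RNN or 1-d CNN, as described in the text.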
Figure 1. Illustration of global correlational layers and global max-pooling. Each kernel is full length in the flattened channel-frequency dimension and of length 1 in the temporal dimension. Convolution is performed to learn kernels that correspond to channel-frequency patterns which correlate with a particular workload state, producing a k kernels by t time steps matrix. Global max-pooling is then performed across the kernels (rows) of the activation map matrix at each time step, identifying the most activated kernel for any given step.
The most significant breakthroughs using CNNs have been produced during image classification competitions, starting with the development of AlexNet in 2012 and continuing with VGGNet, GoogLeNet, and ResNet. The evolution of ideas will be studied through the lens of each of these architectures, and concepts that could be useful for EEG processing will be identified. Before delving into the details of each network, it is prudent to mention one topic that will not be discussed until a later section, but that was instrumental to each of these competition-winning CNN architectures: the prodigious use of physically plausible data augmentation. Image-specific data augmentation techniques include randomly selected fixed-size cropping, random rescaling and cropping, horizontal flipping, randomly selected rotations, and random drawing of occlusions [117]. A thorough overview of data augmentation is provided in Section 2.4. The excellent results achieved in the image processing domain would not be possible without data augmentation, as it helps reduce overfitting of very large networks [99].
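Two of the augmentations listed above, random fixed-size cropping and horizontal flipping, can be sketched as follows. This is a simplified illustration rather than the pipeline of any of the cited networks; the function name `random_crop_and_flip`, the 50% flip probability, and the small stand-in image are assumptions made for the example.

```python
import numpy as np

def random_crop_and_flip(image, crop_size, rng):
    """Return a randomly positioned square crop, mirrored half of the time."""
    height, width = image.shape[:2]
    # Upper bound is exclusive, so every crop stays inside the image.
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]   # horizontal (left-right) flip
    return crop

rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)   # stand-in for a small RGB image
augmented = [random_crop_and_flip(image, 6, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Each call yields a slightly different view of the same labeled example, which is how augmentation enlarges the effective training set.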
The discussion of these breakthrough networks begins with AlexNet. Krizhevsky’s AlexNet architecture consisted of 8 layers: 5 convolutional layers followed by 3 fully-connected layers [99]. It achieved a top-5 test error rate of 15.3% in the ILSVRC-2012 competition, a relative reduction in error of roughly 42% compared to the next nearest competitor, along with a reported top-1 error rate of 36.7% [99]. The use of ReLUs, training on multiple Graphics Processing Units (GPUs), a normalization scheme applied to the feature maps at each layer, and overlapping pooling resulted in faster training, better classification accuracy, and better generalization [99]. The first two fully-connected layers used 50% dropout as a form of regularization to reduce overfitting [99]. AlexNet was the first large-scale GPU-trained CNN and ushered in the era of deep learning for image recognition.
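The 50% dropout mentioned above can be illustrated with a short sketch. Note the assumption: this is the commonly used "inverted" dropout variant, which rescales the surviving units during training, whereas the original AlexNet formulation instead multiplied outputs by 0.5 at test time; the two are equivalent in expectation. The function name and test vector are hypothetical.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations   # inference: all units are used unchanged
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
hidden = np.ones(1000)                 # stand-in for fully-connected activations
dropped = dropout(hidden, 0.5, rng)    # about half the units are zeroed
print(np.mean(dropped == 0.0))
```

Randomly silencing units prevents the fully-connected layers from relying on any single co-adapted feature, which is the regularizing effect described in the text.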
Simonyan and Zisserman produced 16- and 19-layer versions of their deep architecture (VGGNet) for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)-2014. This architecture was significant because it was the first to feature small 3x3 filters in all convolutional layers with a stride of one [146]. Significant efficiencies were gained through the key realization that stacking several small-filter layers without pooling in between enlarges the receptive field while using fewer overall parameters [146]. For example, three stacked 3x3 convolutional layers have a receptive field of 7x7, yet use only 27 weights per input-output channel pair (3 x 9) rather than the 49 of a single 7x7 layer, resulting in a commensurate reduction in computation [146, 157]. This type of architecture has the added benefit of allowing more non-linear transforms to occur than in a single larger layer, adding to the network’s ability to build higher-order abstractions [146, 157]. The takeaway for EEG processing is that the use of stacked, small filters reduces computation and could help establish localized spatial regions or frequency representations of brain activity that are grouped together in a computationally efficient manner.
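The receptive-field and parameter arithmetic in the example above can be reproduced with two small helper functions. The function names are hypothetical, and the counts assume stride-1 convolutions with no pooling in between, a single input-output channel pair, and no bias terms.

```python
def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field (per side) of n stacked stride-1 convolutions."""
    # Each additional layer extends the field by (kernel_size - 1).
    return 1 + n_layers * (kernel_size - 1)

def weights_per_channel_pair(kernel_size, n_layers):
    """Weights for one input-output channel pair, ignoring biases."""
    return n_layers * kernel_size ** 2

stacked_field = stacked_receptive_field(3, 3)      # three 3x3 layers
stacked_weights = weights_per_channel_pair(3, 3)   # 3 * 9
single_weights = weights_per_channel_pair(7, 1)    # one 7x7 layer

print(stacked_field, stacked_weights, single_weights)  # 7 27 49
```

With C input and C output channels per layer, both counts scale by C squared, so the 27-to-49 ratio is preserved.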
Two other developments that significantly improved the state of the art are the variants of Szegedy’s GoogLeNet architecture with inception sub-networks and deep residual networks (ResNets) [157, 156, 74, 75]. These are discussed at length in Section 5.1.2; understanding them is not required for the intervening sections, as other EEG researchers have not yet leveraged these techniques. For completeness, the discussion now turns briefly to self-supervised and unsupervised deep learning techniques that have been used in EEG research.