2.6 Hyper-parameter Optimization
2.6.1 Transfer Learning of Deep Models
TL mainly focuses on learning common features that benefit multiple tasks. In Auto-ML, it is applied mostly in network architecture search; however, the knowledge transfer process from one task to another is not addressed in an automated manner. DNNs have attained tremendous success by consistently outperforming shallow learning techniques. However, solving complex tasks needs deeper and wider networks, which are considered hard to design. Transfer learning often works well on simple and more general tasks, whereas complex tasks require effort to design a customized network. The network design process requires specialized skills and numerous trials, which makes it time-consuming and computationally expensive. State-of-the-art networks also require well-tuned hyper-parameters, which often demand numerous computationally intensive trials.
Among the key developments in the field of DL, Convolutional Neural Networks (CNNs) stand out as the workhorse of Computer Vision. Training a large CNN with millions of parameters is computationally intensive and also requires a significant amount of training data. However, several state-of-the-art image classification architectures trained on large image datasets are publicly available, including Visual Geometry Group Network (VGGNet) (Simonyan and Zisserman, 2014), Inception (Szegedy et al., 2015), Residual Networks (ResNet) (He et al., 2016) and Inception-ResNet (Szegedy et al., 2017). These networks are trained on the ImageNet (Russakovsky et al., 2015) dataset, which consists of 1.2 million images and 1000 classes.
Training these types of deep networks from scratch on a huge dataset is computationally demanding. As a result, TL, i.e., reusing parts of pre-trained models either as-is or as a starting point within the training process, quickly became a de facto standard in Computer Vision tasks. The general consensus seems to be that the more data one has, the more ‘aggressive’ the re-training process can be (e.g., re-training more final layers). Conversely, the more similar the new dataset is to the one used to train the original model, the fewer layers need to be fine-tuned. Despite the wide adoption of TL in the context of CNNs, to the best of our knowledge, there is still no principled way of approaching this process. The number of layers to re-train, or even the network architectures themselves, are chosen in an ad-hoc manner and tested one after the other, which is computationally inefficient.
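The cost of this ad-hoc procedure can be made concrete with a minimal sketch. The candidate depths and the `enumerate_trials` helper are hypothetical and chosen only for illustration; each returned pair stands for one complete fine-tuning run in an exhaustive, one-after-the-other search.

```python
from itertools import product


def enumerate_trials(architectures, retrain_depths):
    """List every (architecture, number of re-trained final layers) pair.

    Each pair corresponds to one full fine-tuning run when candidates are
    simply tested one after the other.
    """
    return list(product(architectures, retrain_depths))


# Hypothetical candidate values, for illustration only.
architectures = ["VGGNet", "Inception", "ResNet", "Inception-ResNet"]
retrain_depths = [1, 2, 4, 8]

trials = enumerate_trials(architectures, retrain_depths)
print(len(trials))  # 16 separate fine-tuning runs
```

Even with only four architectures and four candidate depths, the search already requires 16 training runs, and the count grows multiplicatively with every additional choice.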
Figure 2.7: A holistic view of Automatic Machine Learning areas and systems
Recently, MLL has become a crucial component of DL for the selection of hyper-parameters of a specific architecture. Miikkulainen et al. (2017) proposed a comprehensive set of global and node-level hyper-parameters which are critical in optimizing deep learning architectures through evolution. The use of reinforcement learning to generate CNN and RNN architectures has been proposed by Baker et al. (2016) and Zoph and Le (2016). Baker et al. (2016) used Q-learning to produce new CNN architectures. Finn et al. (2017) introduced a simple but powerful approach, model-agnostic meta-learning, which learns an initialization of model parameters that leads to fast learning on new tasks.
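The idea behind model-agnostic meta-learning can be illustrated with a toy sketch. The setting below is an assumption made purely for illustration and is not the experimental setup of Finn et al. (2017): each task is a one-dimensional quadratic loss (theta - center)^2, the inner loop takes a single gradient step, and the meta-gradient is computed analytically. All function names and constants are hypothetical.

```python
def inner_adapt(theta, center, alpha):
    """One gradient step on the task loss (theta - center)^2."""
    grad = 2.0 * (theta - center)
    return theta - alpha * grad


def meta_gradient(theta, center, alpha):
    """Gradient of the post-adaptation loss with respect to the initial theta.

    The adapted parameter is theta' = theta - alpha * 2 * (theta - center),
    so d(theta')/d(theta) = 1 - 2 * alpha, and the chain rule gives the
    derivative of (theta' - center)^2 below.
    """
    adapted = inner_adapt(theta, center, alpha)
    return 2.0 * (adapted - center) * (1.0 - 2.0 * alpha)


def meta_train(centers, alpha=0.1, beta=0.1, steps=500, theta=0.0):
    """Learn an initialization that adapts well to all tasks in `centers`."""
    for _ in range(steps):
        avg_grad = sum(meta_gradient(theta, c, alpha) for c in centers) / len(centers)
        theta -= beta * avg_grad  # outer (meta) update of the initialization
    return theta


# Hypothetical tasks: quadratic losses centred at different points.
task_centers = [-1.0, 2.0, 5.0]
theta_init = meta_train(task_centers)
print(round(theta_init, 4))
```

In this toy setting the learned initialization settles at the mean of the task centres, the point from which a single gradient step reduces the loss on every task by the same factor.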
TL has been positioned as an effective way to adapt pre-trained networks to a new domain by fine-tuning their final layers. Some studies, such as Wang et al. (2017b) and Shin et al. (2016), propose re-training only the final fully-connected (FC) layers of the network, which does not guarantee state-of-the-art accuracy, particularly on relatively dissimilar tasks. In contrast, domain adaptation benefits from fine-tuning an increasing number of layers, depending on the complexity and relevance of the new task (Yosinski et al., 2014). Therefore, a question arises as to how many blocks need fine-tuning to adapt to a new domain, given its complexity, size and domain relevance.
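The difference between re-training only the final FC layer and fine-tuning several final blocks can be expressed as a simple freeze plan. The sketch below is framework-independent; the `freeze_plan` helper and the block names are hypothetical and serve only to illustrate the choice being discussed.

```python
def freeze_plan(block_names, num_finetuned):
    """Map each block to True (fine-tuned) or False (frozen).

    Only the last `num_finetuned` blocks are marked trainable; all earlier
    blocks keep their pre-trained weights.
    """
    if not 0 <= num_finetuned <= len(block_names):
        raise ValueError("num_finetuned must be between 0 and the number of blocks")
    first_trainable = len(block_names) - num_finetuned
    return {name: index >= first_trainable for index, name in enumerate(block_names)}


# Hypothetical block names for a small CNN, listed from input to output.
blocks = ["conv1", "conv2", "conv3", "conv4", "fc"]

fc_only = freeze_plan(blocks, 1)  # re-train only the final FC layer
deeper = freeze_plan(blocks, 3)   # fine-tune the last three blocks

print([name for name, trainable in deeper.items() if trainable])
# ['conv3', 'conv4', 'fc']
```

In a deep learning framework, such a plan would typically be applied by disabling gradient updates for the frozen blocks; the open question raised above is how to choose `num_finetuned` for a given target task.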
A significant breakthrough in the field of ML and computer vision came in 2012, when AlexNet surpassed all traditional approaches in image classification accuracy (Krizhevsky et al., 2012). Since then, CNN-based architectures have been consistently outperforming other approaches in end-to-end image and video recognition tasks (Krizhevsky et al., 2012). The key reasons for this success are large public image datasets, such as ImageNet (Deng et al., 2009) and CIFAR (“CIFAR-10 and CIFAR-100”), high-performance computing hardware such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), and the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014). Indeed, ILSVRC served as a platform for several state-of-the-art DL architectures which are trained on ImageNet.
Despite the proven success of CNNs, some limitations remain. They require large amounts of labeled data and massive processing to optimize millions of parameters. This limitation has been addressed by leveraging TL, which acquires knowledge on a specific problem and reuses it on a different but related task (Yosinski et al., 2014).