To solve these challenges of module form, assembly, and generality, the approach taken in this dissertation is built out from the deep multitask learning framework. First, this framework is generalized to provide a more unified perspective of how neural network modules can be trained for multiple
purposes. These multiple purposes are formalized as pseudo-tasks. Within this framework, a progression of six learning systems is presented. Through this progression, the generality of modules is expanded, leading to practical solutions to module form and assembly at each step along the way. The progression proceeds by (1) testing the inherent generality of single modules; (2) improving the generality of single modules; (3) increasing the number of
modules and breadth of generality; (4+5) scaling this generality to complex automatically-designed architectures; and (6) scaling this generality across diverse architectures and problem areas. Together, these systems constitute a comprehensive account of how deep multitask learning can be used to discover multi-purpose modules.
The first system, General Reuse of Static Modules (GRUSM), is designed to answer the question: How general are trained modules inherently, i.e., when trained for a single purpose? In this system, a module trained initially for one purpose in a single task is reused as is, i.e., without modifying its parameters, for a different purpose at a different depth in another task. A concrete implementation is developed that uses a coevolutionary flavor of neuroevolution, ESP (Gomez & Miikkulainen, 2003), to train modules and incorporate them into new locations. The implementation, GRUSM-ESP, is evaluated in general video game playing, where each task corresponds to a game. In the experiments, reused modules take the form of a single layer of weights (of adaptable dimension). The results show that sometimes modules generalize well, sometimes they do not, and that this transferability can be
predicted based on task characteristics. In particular, the modules trained in more complex tasks tend to generalize better. This makes sense, since modules that support more complex functionality will naturally contain more information that can be exploited. However, predicting module generalization is not as strong a tool as training more general modules in the first place.
With this knowledge that neural network modules have the potential to generalize across diverse purposes, the question is: How can they be forced to generalize better? The second system, Pseudo-task Augmentation (PTA), approaches this question from the foundational case where there is a single module that is simultaneously trained for many purposes. This module is a generic encoder with an arbitrary fixed topology, which is shared across all tasks and trained by gradient descent, as in classical deep multitask learning. To make this encoder more general, it is forced to solve each task in multiple distinct ways, by training it with multiple decoders for each task; each decoder defines a distinct pseudo-task. Training with additional pseudo-tasks is theoretically shown to expand the training dynamics of gradient descent. Methods are then introduced that interleave gradient descent with a coevolutionary process that controls pseudo-tasks to improve generalization. By increasing the number of ways a single core module is used, PTA is shown to improve performance across an array of deep models, including achieving state-of-the-art results on the CelebA multitask facial attribute recognition dataset.
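The idea of training one shared encoder through several decoders per task can be illustrated with a minimal sketch. It is not the PTA implementation: the linear encoder and decoders, the squared-error loss, the averaging over decoders, and all names (`pta_loss`, `matvec`, `dot`) are assumptions made purely for illustration.

```python
def matvec(weights, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pta_loss(encoder, decoders, x, y):
    """Squared error on one sample, averaged over a task's decoders.

    Each decoder defines one pseudo-task. All decoders read the same
    encoder features, so every pseudo-task contributes to the loss that
    shapes the shared module. Averaging is an assumption of this sketch.
    """
    features = matvec(encoder, x)  # shared module, reused by every decoder
    losses = [(dot(d, features) - y) ** 2 for d in decoders]
    return sum(losses) / len(losses)


# Hypothetical toy values: a 2x2 linear encoder and two linear decoders.
encoder = [[1.0, 0.0], [0.0, 1.0]]
decoders = [[1.0, 0.0], [0.0, 1.0]]
loss = pta_loss(encoder, decoders, x=[2.0, 3.0], y=2.0)
```

In this toy setting the first decoder already fits the target while the second does not, so the averaged loss still produces a training signal for the encoder. This is the sense in which additional decoders force the shared module to support the task in more than one way.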
With this knowledge that training a module in more ways can improve its generality, the question is: How far can this idea be taken? The third
system, soft ordering, takes this idea to the extreme: Each layer in a deep architecture constitutes a module, and all modules are trained simultaneously across all possible locations in all tasks in a fixed architecture. Thus, each module must support functionality everywhere to some extent. At each location, the system learns a mixture of modules used at that location, while simultaneously learning the parameters of the modules themselves, and the complete optimization is performed end-to-end using gradient descent. The method is evaluated in vision and non-spatial domains, using convolutional and fully-connected layers, and demonstrates improvements over single-task and shared feature extractor approaches, including outperforming state-of-the-art deep multitask learning approaches on Omniglot multitask character recognition. Visualizations indicate that modules are indeed learning functional primitives, whose behavior is tuned to match the needs of particular contexts. These results suggest that simultaneously training modules for many kinds of purposes across multiple tasks is a promising approach to discovering a compact set with generic functionality.
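The learned mixture of modules at each location can be sketched as a forward pass for a single task. This is a minimal illustration under stated assumptions: the softmax normalization of the mixture scores, the ReLU fully-connected modules, and the names `soft_ordered_forward` and `mix_logits` are hypothetical choices, not details confirmed by the text above.

```python
import math


def softmax(scores):
    """Turn raw mixture scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def soft_ordered_forward(x, modules, mix_logits):
    """Forward pass for one task.

    modules:    K square weight matrices, shared across every depth.
    mix_logits: one list of K scores per depth (location) for this task.
    """
    h = x
    for depth_logits in mix_logits:
        weights = softmax(depth_logits)
        # Every module is executed at every location.
        outputs = [relu(matvec(w_k, h)) for w_k in modules]
        h = [sum(w * out[i] for w, out in zip(weights, outputs))
             for i in range(len(h))]
    return h


# Hypothetical toy modules: an identity layer and a coordinate swap.
modules = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
even_mix = soft_ordered_forward([1.0, 3.0], modules, [[0.0, 0.0]])
hard_mix = soft_ordered_forward([1.0, 3.0], modules, [[100.0, -100.0]])
```

With equal scores the two module outputs are averaged. With strongly unequal scores the location effectively selects one module. In a trained system both `modules` and `mix_logits` would be learned jointly by gradient descent.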
However, the soft ordering method does not scale well, because all modules are executed at each location in the joint model during each forward and backward pass. Scaling this approach requires a way to automatically select which module to use at each location during training. In the remaining three systems, such selection methods are used to scale soft ordering in two orthogonal directions: (1) to more complex multitask architectures discovered by neural architecture search; and (2) to sharing across diverse classes of
architectures and task modalities. The fourth and fifth systems follow the first of these directions, and the sixth follows the second.
The fourth system, Coevolution of Task Routing (CTR), uses evolution to discover where each module should be applied, and uses gradient descent to train the modules' parameters. That is, evolution and gradient descent are interleaved in a manner similar to PTA. In CTR, starting from a minimal architecture for each task, evolution expands its use of modules incrementally so that the correct amount of complexity can be achieved. Modules in CTR can also take on more generic functionality than in soft ordering, since evolution discovers different kinds of architectures for different tasks, so modules are trained for more diverse pseudo-tasks. Two new key mechanisms make the system practical: (1) All candidate models for all tasks are trained jointly, so module semantics are preserved across generations; and (2) task architectures are coevolved, allowing more efficient optimization of the multitask architecture. In experiments, this system demonstrates marked improvement over soft ordering.
The fifth system, Coevolution of Modules and Task Routing (CMTR), is a direct generalization of CTR, in which modules are no longer restricted to being single neural network layers. In this system, along with the optimization of how modules are assembled for each task, the topologies of the modules themselves are optimized. Module topologies are optimized in an outer loop around CTR, using a variant of CoDeepNEAT (Miikkulainen et al., 2017), a popular evolutionary architecture search algorithm, which in turn is incremental, coevolutionary, and explicitly modular. By making module topologies more
flexible, CMTR yields significant improvements over CTR.
Though they improve performance in many settings, PTA, soft ordering, CTR, and CMTR cannot be used to share modules across modalities, i.e., when the data for different tasks have a fundamentally different structure, for example, vision vs. language. This restriction arises because, in these approaches, modules are defined as layers or graphs of layers, so they can only be applied where their input-output specification is satisfied, both technically and semantically. For example, the spatial semantics of a 2D convolutional layer are lost when applying this layer to non-spatial input.
The sixth system, Modular Universal Reparameterization (MUiR), overcomes this restriction. It supports sharing of modules across arbitrary deep architectures and task modalities, allowing regularities to be exploited across diverse problem areas. This system decomposes the parameters of the given architectures for a set of tasks into a set of equally-sized linear maps. The parameters of each map are then generated by a hypermodule, which reduces the parameters to a small number of degrees of freedom. Hypermodules generalize the modules of the other systems by allowing each module to be marginally tuned for different purposes. This flexibility can be especially valuable when applications are highly diverse. The mapping of modules to locations is optimized in a manner similar to CTR, incrementally increasing sharing by interleaving gradient descent and evolution. However, for this system, coevolution of assembly occurs not across tasks, but across all module locations, i.e., across all pseudo-tasks. Coevolving at this level yields theoretically-grounded speedups that are
especially helpful when the number of pseudo-tasks is large. Coevolution is implemented with a surrogate fitness function that uses the mixture-learning mechanism of soft ordering. Experiments demonstrate intriguing dynamics of MUiR, including positive sharing across tasks with fundamentally different modalities, and the emergence of surprising sharing behaviors. Importantly, by supporting sharing of modules across modalities, MUiR is especially valuable when a task with a new modality arises with only a small amount of data, for example, with temporal data collected from a new kind of geosensor, or a rare disease detectable by data from a new kind of medical device. In such a case, MUiR can boost models for the new modality by harnessing generic functionality discovered from vast datasets and problem repositories for more common modalities.
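The way a hypermodule can generate the parameters of several equally-sized linear maps from a few degrees of freedom per location can be sketched as follows. The specific form used here, a weighted sum of basis matrices controlled by a small per-location context vector, is an assumption chosen for illustration, as are the names `generate_linear_map`, `hypermodule`, and `context`.

```python
def generate_linear_map(hypermodule, context):
    """Generate an m x n weight matrix from a shared hypermodule.

    hypermodule: c basis matrices, each m x n, shared across locations.
    context:     c scalars specific to one location; these are the only
                 free parameters that the location itself owns.
    """
    rows = len(hypermodule[0])
    cols = len(hypermodule[0][0])
    return [
        [sum(z * basis[i][j] for z, basis in zip(context, hypermodule))
         for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical toy hypermodule with c = 2 basis matrices of size 2 x 2.
hypermodule = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

# Two locations reuse the same hypermodule with different contexts.
context_a = [1.0, 0.0]
context_b = [2.0, 0.5]
map_a = generate_linear_map(hypermodule, context_a)
map_b = generate_linear_map(hypermodule, context_b)
```

In this toy example each location is described by two context values rather than four independent weights, and the two generated maps differ even though they share one hypermodule. This corresponds to a module being marginally tuned for different purposes.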
Since modules can now be shared across diverse architectures and modalities, and improved by optimizing their topologies and the topologies in which they are applied, a natural extension would be to combine these features in an approach that optimizes cross-modal architectures to make modularity even more effective. Such an approach would be a straightforward unification of CMTR and MUiR, and is left for future work.
Overall, by progressively increasing module generality, and developing practical methods for assembling modules of various forms along the way, these six systems verify the value of the deep multitask learning approach to discovering multi-purpose modules at a level that humans cannot match. These developments provide a foundation for developing future systems that combine
their advantages towards a fully-general and robust module discovery algorithm that continuously refines itself to efficiently construct high-performing solutions to a broad range of critical real-world applications.