For hard mixtures with either a global or a local gater, suppose that we choose the number of experts K such that the number of examples per expert M = L/K stays fixed as the total number of examples L grows. Then, if the training time of one expert is polynomial of order p in the number of examples M, the time needed to train all the experts in one outer-loop iteration of the hard mixture is on the order of:
$O(K M^p) = O(L M^{p-1}) = O(L)$.
If the gater is not localized (as in Algorithm 6.1), then it may become a bottleneck. In our experiments we did not measure the cost of training the gater separately; since it was an MLP, this cost most likely grew faster than O(L), even though the measured training time of the hard mixture scaled linearly with L.
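The expert-training estimate above can be illustrated with a small numerical sketch. It is only a cost model with constants dropped, not a measurement: the values of M and p are hypothetical, and the gater's cost is deliberately left out, as in the estimate itself.

```python
def expert_training_cost(num_examples, examples_per_expert, p):
    """Cost model K * M**p for training all experts once (constants dropped)."""
    num_experts = num_examples / examples_per_expert  # K = L / M
    return num_experts * examples_per_expert ** p


def single_model_cost(num_examples, p):
    """Cost model L**p for one model trained on all L examples."""
    return num_examples ** p


M = 1_000  # hypothetical fixed number of examples per expert
p = 2      # hypothetical polynomial order of one expert's training time

for L in (10_000, 100_000, 1_000_000):
    mixture = expert_training_cost(L, M, p)
    single = single_model_cost(L, p)
    print(f"L={L:>9,}  experts={mixture:.1e}  single model={single:.1e}")
```

With M held fixed, doubling L doubles the modelled cost of the experts, whereas the modelled cost of a single model trained on all L examples is multiplied by 2^p.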
Contributions
This chapter is a synthesis of the following published papers:
• R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):1105–1114, 2002a.
In this paper, we introduced for the first time the notion of a “hard” mixture of experts, and proposed training Algorithm 6.1 with a “global” gater. At the time, the goal was to find a way to train SVMs on large databases: SVMs are a very fashionable algorithm, but they are known to be intractable on large training sets. We therefore proposed experiments with hard mixtures of SVMs.
• R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633–640. MIT Press, 2002b.
This paper is very similar to the previous one, with a few more details and with experiments on another realistic database, which validate the algorithm once more in practice.
• R. Collobert, Y. Bengio, and S. Bengio. Scaling large learning problems with hard parallel mixtures. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 17(3):349–365, 2003.
In this paper, we proposed the hard mixture of experts with a “local” gater and Algorithm 6.2. We also added experiments with MLP experts, in addition to the ones with SVM experts.
Note that in this chapter, we extended Algorithm 6.1 to a probabilistic framework. Within this framework, we also made a clear link between the hard mixture with a “global” gater and the one with a “local” gater.
Conclusion
In this chapter we have presented new divide-and-conquer, parallelizable hard mixture algorithms that reduce the training time of algorithms such as SVMs. Very good results were obtained compared to classical SVMs, both in terms of training time and in terms of generalization performance, on a real-life database. Moreover, the algorithms scale linearly with the number of examples, at least in the ranges we presented. Two algorithms were presented, one with a global gater (where the gater is trained on the whole training set) and one with local gaters (where each local gater is trained on the same subset as its corresponding expert), together with a demonstration that, in a probabilistic framework, they actually minimize a well-defined criterion.
These results are extremely encouraging and suggest that the proposed method could allow training SVM-like models on data sets with on the order of millions of examples in a reasonable time. In the experiments, two types of “gater” models were proposed, one based on a single MLP (for the mixture with a global gater), and the other based on local Gaussian Mixture Models (for the mixture with a local gater). The latter has the advantage of being trained very quickly and locally to each expert, thereby guaranteeing linear training time for the whole system (per iteration). However, the best results are obtained with the MLP gater. Surprisingly, training is even faster (with similar generalization) if the SVM experts are replaced by MLP experts. If the time needed to train the MLP gater with stochastic gradient descent grows less than quadratically (as we conjecture to be the case for very large data sets, when only a “good enough” solution is required), then the whole method is clearly sub-quadratic in training time with respect to the number of training examples.
Finally, “hard” mixtures have one significant drawback: the number of hyper-parameters is quite large. There are hyper-parameters for the experts, for the gater, and for the local and global training algorithms of the mixture. In practice, tuning a mixture of experts is a nightmare. Thus, in the next chapter, we will focus on MLPs, which are much easier to use in practice. MLPs are also less resource-consuming than the SVMs studied in Chapter 5, and are more suitable for large scale problems. However, our experiments have raised some optimization issues with MLPs. Indeed, we observed performance differences between an MLP trained with the MSE criterion, an MLP trained with the CE criterion, and mixtures of experts. We will try to explain these differences, which could help us find better ways to train MLPs.