CONCLUSION AND FUTURE WORK

5.1. Conclusion

In this thesis, we have introduced stochastic orthogonalization as a computationally simple way to regularize the distribution of the singular values of weight matrices in convolutional neural networks. The technique appears to work like singular-value bounding in that it raises low singular values and lowers high ones. Unlike other approaches, which focus on imposing orthogonality, the methods in this thesis yield weight matrices whose singular values follow a smooth distribution rather than all being equal to unity.

In addition, the methods proposed in this thesis have a low computational cost. Thus, they are well-suited to deep neural network architectures.
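The contrast drawn above, between a smooth singular value distribution and singular values that are all equal to unity, can be inspected numerically. The following is a minimal sketch, not the procedure used in this thesis. It assumes that a convolutional kernel of shape (out_channels, in_channels, kh, kw) is flattened into a matrix of shape (out_channels, in_channels*kh*kw); the helper name `conv_singular_values` and the layer sizes are hypothetical.

```python
import numpy as np


def conv_singular_values(kernel):
    """Singular values of a kernel flattened to (out_channels, in_channels*kh*kw).

    The flattening convention is an assumption made for illustration.
    """
    out_channels = kernel.shape[0]
    matrix = kernel.reshape(out_channels, -1)
    # compute_uv=False returns only the singular values, sorted in descending order.
    return np.linalg.svd(matrix, compute_uv=False)


rng = np.random.default_rng(0)

# A randomly initialized kernel: its singular values are spread over a range.
random_kernel = rng.normal(scale=0.1, size=(16, 16, 3, 3))
s_random = conv_singular_values(random_kernel)
spread = s_random.max() / s_random.min()

# A kernel whose flattened matrix has orthonormal rows: every singular value is 1.
q, _ = np.linalg.qr(rng.normal(size=(16 * 3 * 3, 16)))  # q has orthonormal columns
orthogonal_kernel = q.T.reshape(16, 16, 3, 3)
s_orthogonal = conv_singular_values(orthogonal_kernel)

print("random kernel, largest/smallest singular value:", spread)
print("orthogonal kernel, singular values all one:", np.allclose(s_orthogonal, 1.0))
```

Plotting `s_random` against `s_orthogonal` for a trained layer would show where a given regularizer places the weight matrix between these two cases.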

5.2. Future Work

Future work in this area could include the following:

The methods could be tested on a wider range of CNN architectures, such as Wide-ResNet and ResNeXt [13]. In some cases, these architectures have a large increase in channels from layer to layer, causing the weight matrices to be overcomplete when considered in traditional form. Thus, a modification to stochastic orthogonalization would be needed.

The methods for directly updating the W matrices within each structure could be used in place of the cost function approach, so that Algorithms 2 and 3 could be tested and explored for use within CNNs.

The weight decay coefficient could be optimized to improve performance further.
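The overcomplete case mentioned in the first item can be illustrated with a small numerical sketch. It assumes that "traditional form" means the kernel flattened to shape (out_channels, in_channels*kh*kw), and that "overcomplete" means this matrix has more rows than columns; both are assumptions, and the layer sizes are hypothetical. The sketch only shows why the rows of such a matrix cannot be orthonormal. It is not the modification that this item calls for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer with a large channel increase: 4 input channels,
# 64 output channels, 3x3 kernels.
kernel = rng.normal(size=(64, 4, 3, 3))
W = kernel.reshape(64, -1)  # shape (64, 36): more rows than columns
rows, cols = W.shape

# The rank cannot exceed the smaller dimension, so W @ W.T (64 x 64)
# can never equal the identity matrix.
rank = np.linalg.matrix_rank(W)

# Orthonormalize the columns of W with a reduced QR factorization.
Q, _ = np.linalg.qr(W)  # Q has shape (64, 36) with orthonormal columns

# Column-side constraint: attainable, the error is at round-off level.
col_err = np.linalg.norm(Q.T @ Q - np.eye(cols))

# Row-side constraint: Q @ Q.T is a rank-36 projector, so the Frobenius
# error stays at sqrt(rows - cols) no matter how Q is chosen.
row_err = np.linalg.norm(Q @ Q.T - np.eye(rows))

print("rank:", rank, "of", rows, "rows")
print("column-side error:", col_err)
print("row-side error:", row_err, "vs sqrt(rows - cols):", np.sqrt(rows - cols))
```

One possible design choice, under the same assumptions, is to apply the constraint to whichever Gram matrix has the smaller dimension. Whether this suits stochastic orthogonalization is left open here.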

References

[1] D. H. Hubel and T. N. Wiesel, “Receptive fields and functional architecture of monkey striate cortex,” The Journal of Physiology, vol. 195, no. 1, pp. 215–43, 1968.

[2] K. Yamaguchi, K. Sakamoto, T. Akabane, and Y. Fujimoto, A neural network for speaker-independent isolated word recognition. 1990.

[3] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, Handwritten Digit Recognition with a Back-Propagation Network. 1989.

[4] A. Khamparia, D. K. Gupta, N. G. Nguyen, A. Khanna, B. Pandey, and P. Tiwari, “Sound classification using convolutional neural network and tensor deep stacking network,” IEEE Access, vol. 7, pp. 7717–7727, 2019.

[5] Y. Kim, Convolutional Neural Networks for Sentence Classification. 2014.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks. 2012.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” 1998.

[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.

[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2014.

[10] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, 2015.

[11] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.

[12] D. Mishkin and J. E. S. Matas, “All you need is a good init,” CoRR, vol. abs/1511.06422, 2015.

[13] N. Bansal, X. Chen, and Z. Wang, “Can we gain more from orthogonality regularizations in training deep networks?,” in NeurIPS, 2018.

[14] K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science (Series 6), pp. 559–572, 1901.

[15] K. I. Diamantaras and S.-Y. Kung, “Principal component neural networks: Theory and applications,” 1996.

[16] A. Hyvärinen and E. Oja, “A fast fixed-point algorithm for independent component analysis,” Neural Computation, vol. 9, pp. 1483–1492, 1997.

[17] S. Wisdom, T. Powers, J. R. Hershey, J. L. Roux, and L. E. Atlas, “Full-capacity unitary recurrent neural networks,” in NIPS, 2016.

[18] G. H. Golub and C. V. Loan, “Matrix computations (3rd ed.),” 1996.

[19] S. Douglas, “On the singular value manifold and numerical stabilization of algorithms with orthogonality constraints,” Fourth IEEE Workshop on Sensor Array and Multichannel Processing, pp. 195–199, 2006.

[20] P. P. Vaidyanathan, “Multirate systems and filter banks,” 1992.

[21] J. Zhou, M. N. Do, and J. Kovacevic, “Special paraunitary matrices, Cayley transform, and multidimensional orthogonal filter banks,” IEEE Transactions on Image Processing, vol. 15, pp. 511–519, 2006.

[22] D. G. Manolakis, V. K. Ingle, and S. M. Kogon, “Statistical and adaptive signal processing: Spectral estimation, signal modeling, adaptive filtering and array processing,” 1999.

[23] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” CoRR, vol. abs/1312.6120, 2013.

[24] D. Xie, J. Xiong, and S. Pu, “All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5075–5084, 2017.

[25] K. Jia, D. Tao, S. Gao, and X. Xu, “Improving training of deep neural networks via singular value bounding,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3994–4002, 2016.

[26] M. Ozay and T. Okatani, “Optimization on submanifolds of convolution kernels in cnns,” ArXiv, vol. abs/1610.07008, 2016.

[27] L. Huang, X. Liu, B. Lang, A. W. Yu, and B. Li, “Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks,” ArXiv, vol. abs/1709.06079, 2017.

[28] H. Sedghi, V. Gupta, and P. M. Long, “The singular values of convolutional layers,” ArXiv, vol. abs/1805.10408, 2018.

[29] M. Harandi and B. Fernando, “Generalized backpropagation, étude de cas: Orthogonality,” ArXiv, vol. abs/1611.05927, 2016.

[30] Y. Yoshida and T. Miyato, “Spectral norm regularization for improving the generalizability of deep learning,” ArXiv, vol. abs/1705.10941, 2017.

[31] Q. V. Le, N. Jaitly, and G. E. Hinton, “A simple way to initialize recurrent networks of rectified linear units,” ArXiv, vol. abs/1504.00941, 2015.

[32] M. Henaff, A. Szlam, and Y. LeCun, “Orthogonal rnns and long-memory tasks,” ArXiv, vol. abs/1602.06662, 2016.

[33] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neural networks,” in ICML, 2015.

[34] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. J. Pal, “On orthogonality and learning recurrent networks with long term dependencies,” ArXiv, vol. abs/1702.00071, 2017.

[35] V. D. Dorobantu, P. A. Stromhaug, and J. Renteria, “Dizzyrnn: Reparameterizing recurrent neural networks for norm-preserving backpropagation,” ArXiv, vol. abs/1612.04035, 2016.

[36] K. Jia, S. Li, Y. Wen, T. Liu, and D. Tao, “Orthogonal deep neural networks,” IEEE transactions on pattern analysis and machine intelligence, 2019.

[37] A. Björck and C. Bowie, “An iterative algorithm for computing the best estimate of an orthogonal matrix,” 1971.

[38] S. Haykin, “Adaptive filter theory 5th edition,” 2005.

[39] C.-Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in AISTATS, 2015.

[40] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” ArXiv, vol. abs/1603.05027, 2016.

[41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.

[42] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ArXiv, vol. abs/1502.03167, 2015.

[43] J. C. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res., vol. 12, pp. 2121–2159, 2010.

[44] S. Ruder, “An overview of gradient descent optimization algorithms,” ArXiv, vol. abs/1609.04747, 2016.

In document Stochastic Orthogonalization and Its Application to Machine Learning (Page 43-47)