Who Breaks Early, Looses: Goal Oriented Training of Deep Neural Networks Based on Port Hamiltonian Dynamics
Authors
Julian Burghoff, Marc Heinrich Monells, Hanno Gottschalk
Abstract
The highly structured energy landscape of the loss as a function of the parameters of a deep neural network makes it necessary to use sophisticated optimization strategies in order to discover (local) minima that guarantee reasonable performance. Overcoming less suitable local minima is an important prerequisite, and momentum methods are often employed to achieve this. As in other non-local optimization procedures, this however creates the necessity to balance exploration against exploitation. In this work, we suggest an event-based control mechanism for switching from exploration to exploitation that triggers once a predefined reduction of the loss function is reached. Giving the momentum method a port-Hamiltonian interpretation, we adopt the 'heavy ball with friction' picture and trigger braking (friction) when certain goals are achieved. We benchmark our method against standard stochastic gradient descent and provide experimental evidence for improved performance of deep neural networks when our strategy is applied.
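The mechanism the abstract describes can be made concrete. In the port-Hamiltonian reading, heavy-ball training evolves the parameters θ together with a momentum variable p as a dissipative Hamiltonian system, in which the friction coefficient γ plays the role of the "brake". A minimal sketch of these dynamics, assuming the standard heavy-ball-with-friction form (notation ours, not necessarily the paper's):

```latex
% Heavy ball with friction as a dissipative (port-)Hamiltonian system,
% with Hamiltonian H(theta, p) = L(theta) + 1/2 ||p||^2 and friction gamma >= 0.
\[
\begin{aligned}
  \dot{\theta} &= \nabla_{p} H(\theta, p) = p, \\
  \dot{p}      &= -\nabla_{\theta} H(\theta, p) - \gamma p
                = -\nabla_{\theta} L(\theta) - \gamma p .
\end{aligned}
\]
```

With γ = 0 the system conserves energy and keeps exploring the loss landscape; with γ > 0 it dissipates energy and settles into a minimum. The event-based control then amounts to running momentum SGD with little friction until the loss has fallen by a predefined factor, and raising the friction from that point on. The following PyTorch-style sketch illustrates one way such a switch could look; the goal fraction, momentum values, and the function name `train_with_goal_triggered_braking` are illustrative assumptions, not the authors' published implementation or hyperparameters.

```python
import torch

def train_with_goal_triggered_braking(model, loss_fn, data_loader,
                                      epochs=10, lr=0.01,
                                      goal_fraction=0.5,     # assumed goal: halve the initial loss
                                      momentum_explore=0.99, # low friction: gamma ~ 1 - momentum
                                      momentum_exploit=0.5): # high friction once the goal is met
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum_explore)
    initial_loss, braked = None, False
    for _ in range(epochs):
        for x, y in data_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            if initial_loss is None:
                initial_loss = loss.item()
            # Event-based switch: once the loss reaches the predefined
            # fraction of its initial value, increase the friction
            # (i.e. lower the momentum) and exploit instead of explore.
            if not braked and loss.item() <= goal_fraction * initial_loss:
                for group in opt.param_groups:
                    group["momentum"] = momentum_exploit
                braked = True
    return model
```

Mutating `param_groups` in place is the standard PyTorch way to change optimizer hyperparameters mid-training, so the accumulated momentum buffers survive the switch and only the braking strength changes.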
Keywords
neural nets; momentum; goal-oriented search; port-Hamiltonian systems
Citation
- ISBN: 978-3-031-44203-2
- Publisher: Springer Nature Switzerland
- DOI: 10.1007/978-3-031-44204-9_38
- Note: Presented at the International Conference on Artificial Neural Networks (ICANN 2023)
BibTeX
@inbook{Burghoff_2023,
title={{Who Breaks Early, Looses: Goal Oriented Training of Deep Neural Networks Based on Port Hamiltonian Dynamics}},
ISBN={9783031442049},
ISSN={1611-3349},
DOI={10.1007/978-3-031-44204-9_38},
booktitle={{Artificial Neural Networks and Machine Learning – ICANN 2023}},
publisher={Springer Nature Switzerland},
author={Burghoff, Julian and Monells, Marc Heinrich and Gottschalk, Hanno},
year={2023},
pages={454--465}
}