Who Breaks Early, Looses: Goal Oriented Training of Deep Neural Networks Based on Port Hamiltonian Dynamics
Authors
Julian Burghoff, Marc Heinrich Monells, Hanno Gottschalk
Abstract
The highly structured energy landscape of the loss as a function of the parameters of a deep neural network makes it necessary to use sophisticated optimization strategies in order to discover (local) minima that guarantee reasonable performance. Overcoming less suitable local minima is an important prerequisite, and momentum methods are often employed to achieve this. As in other non-local optimization procedures, this however creates the necessity to balance exploration against exploitation. In this work, we suggest an event-based control mechanism for switching from exploration to exploitation that triggers once a predefined reduction of the loss function is reached. Giving the momentum method a port-Hamiltonian interpretation, we adopt the 'heavy ball with friction' picture and trigger braking (friction) when certain goals are achieved. We benchmark our method against standard stochastic gradient descent and provide experimental evidence for improved performance of deep neural networks when our strategy is applied.
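The mechanism the abstract describes can be made concrete. In the port-Hamiltonian reading, heavy-ball training evolves the parameters θ together with a momentum variable p as a dissipative Hamiltonian system, in which the friction coefficient γ plays the role of the "brake". A minimal sketch of these dynamics, assuming the standard heavy-ball-with-friction form (notation ours, not necessarily the paper's):

```latex
% Heavy ball with friction as a dissipative (port-)Hamiltonian system,
% with Hamiltonian H(theta, p) = L(theta) + 1/2 ||p||^2 and friction gamma >= 0.
\[
\begin{aligned}
  \dot{\theta} &= \nabla_{p} H(\theta, p) = p, \\
  \dot{p}      &= -\nabla_{\theta} H(\theta, p) - \gamma p
                = -\nabla_{\theta} L(\theta) - \gamma p .
\end{aligned}
\]
```

With γ = 0 the system conserves energy and keeps exploring the loss landscape; with γ > 0 it dissipates energy and settles into a minimum. The event-based control then amounts to running momentum SGD with little friction until the loss has fallen by a predefined factor, and raising the friction from that point on. The following PyTorch-style sketch illustrates one way such a switch could look; the goal fraction, momentum values, and the function name `train_with_goal_triggered_braking` are illustrative assumptions, not the authors' published implementation or hyperparameters.

```python
import torch

def train_with_goal_triggered_braking(model, loss_fn, data_loader,
                                      epochs=10, lr=0.01,
                                      goal_fraction=0.5,     # assumed goal: halve the initial loss
                                      momentum_explore=0.99, # low friction: gamma ~ 1 - momentum
                                      momentum_exploit=0.5): # high friction once the goal is met
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum_explore)
    initial_loss, braked = None, False
    for _ in range(epochs):
        for x, y in data_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            if initial_loss is None:
                initial_loss = loss.item()
            # Event-based switch: once the loss reaches the predefined
            # fraction of its initial value, increase the friction
            # (i.e. lower the momentum) and exploit instead of explore.
            if not braked and loss.item() <= goal_fraction * initial_loss:
                for group in opt.param_groups:
                    group["momentum"] = momentum_exploit
                braked = True
    return model
```

Mutating `param_groups` in place is the standard PyTorch way to change optimizer hyperparameters mid-training, so the accumulated momentum buffers survive the switch and only the braking strength changes.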
Keywords
neural nets; momentum; goal-oriented search; port-Hamiltonian systems
Citation
- ISBN: 978-3-031-44203-2
- Publisher: Springer Nature Switzerland
- DOI: 10.1007/978-3-031-44204-9_38
- Note: Presented at the International Conference on Artificial Neural Networks (ICANN 2023)
BibTeX
@inbook{Burghoff_2023,
title={{Who Breaks Early, Looses: Goal Oriented Training of Deep Neural Networks Based on Port Hamiltonian Dynamics}},
ISBN={9783031442049},
ISSN={1611-3349},
DOI={10.1007/978-3-031-44204-9_38},
booktitle={{Artificial Neural Networks and Machine Learning – ICANN 2023}},
publisher={Springer Nature Switzerland},
author={Burghoff, Julian and Monells, Marc Heinrich and Gottschalk, Hanno},
year={2023},
pages={454--465}
}