Among the many studies in the literature that use neural networks (NN) for control of dynamical systems, one can cite [1]-[6]. A few of them develop neural network based optimal control using an approximate dynamic programming (ADP) formulation [4], [7]-[17]. Two classes of ADP based solutions, called Heuristic Dynamic Programming (HDP) and Dual Heuristic Programming (DHP), have emerged in the literature [4]. In HDP, reinforcement learning is used to learn the cost-to-go from the current state, while in DHP the derivative of the cost function with respect to the states, i.e., the costate vector, is learned by the neural networks [7]. A convergence proof of DHP for linear systems is presented in [8], and that of HDP for the general case is presented in [9]. While [7]-[16] deal with discrete-time systems, some researchers have recently focused on continuous-time problems [18]-[20].
The mechanism for ADP learning is usually provided through a dual network architecture called the Adaptive Critics (AC) [7], [11]. In the HDP class with ACs, one network, called the ‘critic’ network, maps the input states to the cost, and another network, called the ‘action’ network, outputs the control with the states of the system as its inputs [9], [10]. In the DHP formulation, the action network remains the same as in HDP, while the critic network outputs the costates with the current states as inputs [11]-[13]. The Single Network Adaptive Critic (SNAC) architecture developed in [14] is shown to eliminate the need for the second network and to perform DHP using only one network. This considerably decreases the offline training effort, and the resulting simplicity makes it attractive for online implementation, since it requires fewer computational resources and less storage memory. Similarly, the J-SNAC eliminates the need for the action network in an HDP scheme [15]. Note that these developments in the neural network literature have mainly addressed infinite-horizon or regulator type problems.
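As a reading aid, the quantities learned by the two critic types can be written out as follows. The notation is assumed here for illustration and is not taken from the cited works: state $x_k$, control $u_k$, cost-to-go $J$, costate $\lambda$, weights $Q$ and $R$, and input-affine dynamics $x_{k+1} = f(x_k) + g(x_k)u_k$. A quadratic stage cost is used only to keep the expressions short.

```latex
% Illustrative notation (assumed, not from the cited references)
\begin{align}
  J(x_k) &= \sum_{i=k}^{\infty} \tfrac{1}{2}\left( x_i^T Q x_i + u_i^T R u_i \right)
          && \text{(quantity learned by an HDP critic)} \\
  \lambda_k &= \frac{\partial J(x_k)}{\partial x_k}
          && \text{(quantity learned by a DHP critic)} \\
  % Stationarity of  (1/2) u_k^T R u_k + J(x_{k+1})  with respect to u_k:
  0 &= R u_k + g(x_k)^T \lambda_{k+1}
  \;\Longrightarrow\;
  u_k = -R^{-1} g(x_k)^T \lambda_{k+1}
\end{align}
```

Under these assumptions the control follows algebraically from the costate, which is one way to see why a costate-producing critic can make a separate action network unnecessary.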
Finite-horizon optimal control is relatively more difficult because the Hamilton-Jacobi-Bellman (HJB) equation is time varying, which results in a time-to-go dependent optimal cost function and costates. If one were to use a shooting method, a two-point boundary value problem (TPBVP) would need to be solved for each set of initial conditions for a given final time, and it would provide only an open-loop solution. The authors of [21] developed a method which gives a closed-form solution to the problem, but only for some pre-specified initial condition and time-to-go. Ref. [22] develops a dynamics optimization scheme which gives an open-loop solution; optimal tracking is then used to reject online perturbations and deviations from the optimal trajectory.
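The time dependence mentioned above can be made explicit in a discrete-time setting. The following is a generic sketch in assumed notation (stage cost $U$, terminal cost $\psi$, final step $N$, dynamics $x_{k+1} = F(x_k, u_k)$), not the formulation of any particular reference.

```latex
% Generic finite-horizon Bellman recursion (notation assumed for illustration)
\begin{align}
  J_N(x_N) &= \psi(x_N), \\
  J_k(x_k) &= \min_{u_k} \Big[ U(x_k, u_k) + J_{k+1}\big(F(x_k, u_k)\big) \Big],
  \qquad k = N-1, \dots, 0, \\
  \lambda_k(x_k) &= \frac{\partial J_k(x_k)}{\partial x_k},
  \qquad
  \lambda_N(x_N) = \frac{\partial \psi(x_N)}{\partial x_N}.
\end{align}
```

Unlike the infinite-horizon case, where a single function $J(x)$ satisfies the recursion, here both $J_k$ and $\lambda_k$ carry the index $k$, or equivalently the time-to-go $N-k$, and the recursion is anchored by a terminal condition.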
The use of NNs for solving finite-horizon optimal control problems is considered in [16], [23]-[28]. The authors of [16] used the AC dual network scheme with time-dependent weights for solving the problem. Continuous-time problems are considered in [23] and [24], where the time-dependent weights are calculated through backward integration. The finite-horizon problem with unspecified terminal time and a fixed terminal state is considered in [25]-[28]. In these works the problem is called finite-horizon because the states are required to be brought to the origin in a finite number of steps, but the number of steps is not fixed, which differentiates these works from the fixed-final-time problem investigated in this study.
In this paper, a single neural network based solution with a single set of weights, called the Finite-horizon Single Network Adaptive Critics (Finite-SNAC), is developed which embeds solutions to the discrete-time HJB equation. Consequently, the offline-trained network can be used to generate online feedback control. Furthermore, a major advantage of the proposed technique is that this network provides optimal feedback solutions for any final time, as long as it is less than the final time for which the network is synthesized.
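The reuse of one solution for shorter final times can be illustrated on a linear-quadratic stand-in, where the optimal feedback gain is known to depend only on the time-to-go. The sketch below is not the Finite-SNAC itself: it uses the standard backward Riccati recursion, and the function name `gains_by_time_to_go`, the system matrices, and the horizons are all hypothetical choices made for illustration.

```python
import numpy as np

def gains_by_time_to_go(A, B, Q, R, S, horizon):
    """Backward Riccati recursion for x_{k+1} = A x_k + B u_k with cost
    0.5 * x_N' S x_N + 0.5 * sum_k (x_k' Q x_k + u_k' R u_k).

    Returns a list `gains` where gains[tau - 1] is the optimal gain K at
    time-to-go tau (tau = 1 .. horizon), with control law u = -K x.
    """
    P = S.copy()  # cost-to-go matrix at time-to-go 0 (terminal cost)
    gains = []
    for _ in range(horizon):
        # Gain one step earlier, computed from the next-step cost-to-go matrix P
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P <- Q + A' P A - A' P B K
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains

# Hypothetical discretized double integrator (illustrative values only)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
S = np.eye(2)

gains_long = gains_by_time_to_go(A, B, Q, R, S, horizon=50)
gains_short = gains_by_time_to_go(A, B, Q, R, S, horizon=20)

# Gains for the 20-step problem coincide with the first 20 time-to-go
# entries of the 50-step solution.
same = all(np.allclose(a, b) for a, b in zip(gains_long[:20], gains_short))
print(same)
```

Because the recursion starts from the terminal condition and proceeds backward, a solution computed for the longest horizon already contains the solution for every shorter one, indexed by time-to-go. A controller that takes the time-to-go as an input, as described above, can exploit the same structure.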
In practical engineering problems, the designer faces constraints on the control effort. To accommodate the control constraint, a non-quadratic cost function [30] is used in this study.
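For orientation, one commonly used non-quadratic control penalty in the constrained-control literature is sketched below. It is given as an assumption for illustration and may differ in detail from the function used in [30]; the symbols are assumed as well: bound $\bar{u}$ with $|u_i| \le \bar{u}$, diagonal positive definite $R$, state penalty $Q(x_k)$, input-affine dynamics $x_{k+1} = f(x_k) + g(x_k)u_k$, and costate $\lambda_{k+1}$.

```latex
% Illustrative non-quadratic control penalty (assumed form), stage cost Q(x_k) + W(u_k)
\begin{align}
  W(u) &= 2 \int_0^{u} \bar{u} \, \tanh^{-1}\!\left( v / \bar{u} \right)^T R \, dv, \\
  % Stationarity of  W(u_k) + J(x_{k+1})  with respect to u_k:
  0 &= 2 \bar{u} \, R \, \tanh^{-1}\!\left( u_k / \bar{u} \right) + g(x_k)^T \lambda_{k+1}, \\
  u_k &= -\bar{u} \, \tanh\!\left( \frac{1}{2\bar{u}} R^{-1} g(x_k)^T \lambda_{k+1} \right).
\end{align}
```

Since $\tanh$ is bounded by one in magnitude, the resulting control satisfies $|u_i| \le \bar{u}$ by construction, which is the reason such penalties are attractive for control-constrained problems.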
Among the controllers available in the literature, the closest to the one developed in this paper is that of [16]. The difference is that the Finite-SNAC uses only one network and only one set of weights. Unlike [23] and [24], the Finite-SNAC solves discrete-time problems and uses ADP to do so. Finally, [25]-[28] solve problems with unspecified terminal time, while the Finite-SNAC solves problems with a given, fixed final time.
Specifically, in this paper an ADP based controller for control-constrained finite-horizon optimal control of discrete-time input-affine nonlinear systems is developed. This is done through a SNAC scheme that uses the current states and the time-to-go as inputs. The scheme is DHP based. For the proof of convergence, a proof for HDP in the finite-horizon case is presented first. Then, it is shown that DHP has the same convergence result as HDP, and therefore DHP also converges to the optimal solution. Finally, after presenting the convergence proofs of the training error and the network weights for the selected weight update law, the performance of the controller is evaluated. The first example, with a linear system, allows easy comparison of the Finite-SNAC with known exact optimal results. The second example is a discrete-time nonlinear problem, and the third is a more complex nonlinear spacecraft application, namely a fixed-final-time attitude maneuver, which is carried out to show the applicability of the Finite-SNAC to difficult engineering applications.
The rest of the paper is organized as follows: the Finite-SNAC is developed in Section II. Relevant convergence theorems are presented in Section III. Numerical results and analysis are presented in Section IV. Conclusions are given in Section V.