Quasi-Newton methods in trajectory search and optimization

jarmonik

Continuing a discussion started in another thread: http://orbiter-forum.com/showthread.php?p=68073#post68073

I have been working with optimization concepts as well during the last two months. My intention was to calculate a TLI maneuver that would take the vessel into a user-defined fly-by of the Moon, specified by the desired fly-by altitude, inclination and date.

At first I tried to solve the problem using the quasi-Newton BFGS method with penalty functions: minimize a single function in multiple dimensions to find the parameters that lead to a transfer orbit matching the user criteria, while choosing the remaining parameters, the ones the user doesn't care about, to minimize the delta velocity.

That led to a lot of trouble. It really didn't work as I had hoped, probably because of badly designed penalty functions; also, the Hessian had a habit of becoming non-positive-definite or singular for some reason. What's so special about being positive definite anyway?

So I made another attempt using Broyden's method to search for the roots only, instead of trying to find a minimum, and so far it seems to be working well. The only problem is that if the user sets up an inclination close to 0 or 180 degrees, the root doesn't exist.
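
For reference, the core of the iteration looks roughly like this (a minimal Python/numpy sketch, not the actual MFD code; residual() is a placeholder for the trajectory propagation, returning the miss in fly-by altitude, inclination and date):

```python
import numpy as np

def broyden_solve(residual, x0, tol=1e-8, max_iter=50):
    """Find x with residual(x) = 0 using Broyden's 'good' rank-1 update."""
    x = x0.astype(float)
    r = residual(x)
    n = x.size
    # Estimate the initial Jacobian once, by forward differences.
    J = np.empty((n, n))
    h = 1e-6
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (residual(x + e) - r) / h
    for _ in range(max_iter):
        dx = np.linalg.solve(J, -r)        # Newton-like step
        x_new = x + dx
        r_new = residual(x_new)
        if np.linalg.norm(r_new) < tol:
            return x_new
        # Broyden update: J += ((dr - J*dx) dx^T) / (dx^T dx)
        dr = r_new - r
        J += np.outer(dr - J @ dx, dx) / (dx @ dx)
        x, r = x_new, r_new
    return x
```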

The process requires about 3-7 iterations of the whole system, which takes about 40-80 milliseconds to execute (using all CPU power). However, the speed is not yet fully optimized.

Maybe this should be discussed in another thread when I release the MFD that performs the calculations described above.

I suppose the main question is how to deal with a situation that requires finding the roots of several functions and minimizing a function at the same time.


-----Post Added-----


Well, positive-definiteness is a pretty important feature if you want to invert the matrix. Did you try any Levenberg-Marquardt style regularisation (i.e. add a lambda*I term to your Hessian) to ensure a positive definite matrix? Maybe you need to increase the value of lambda.

The book I have, ISBN 0387966145, says nothing about Levenberg-Marquardt. But it does describe a method of adding lambda * identity to the Hessian until it becomes positive definite. I thought this applied only to the initial Hessian, so I haven't implemented it. That still leaves the question of how to find a proper value for lambda: it must be positive, yet small enough not to perturb the Hessian too much. I have never encountered this problem before, but I suppose there is an iterative solution that differentiates lambda with respect to the most negative principal minor of the Hessian; I hope there are better ways.
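
For what it's worth, that recipe is easy to automate with a Cholesky test, since the Cholesky factorization succeeds exactly when the matrix is positive definite. A minimal numpy sketch (the starting value and growth factor for lambda are arbitrary choices):

```python
import numpy as np

def make_positive_definite(H, lam0=1e-6, factor=10.0, max_tries=40):
    """Add lam*I to H, growing lam geometrically, until H + lam*I
    is positive definite (tested via Cholesky, which fails otherwise)."""
    lam = 0.0
    for _ in range(max_tries):
        try:
            np.linalg.cholesky(H + lam * np.eye(H.shape[0]))
            return H + lam * np.eye(H.shape[0]), lam
        except np.linalg.LinAlgError:
            lam = lam0 if lam == 0.0 else lam * factor
    raise RuntimeError("could not regularise the Hessian")
```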

It looks like Broyden's method doesn't suffer from a non-positive-definite Jacobian? So, does that problem apply only to second-order derivatives?

The book also suggests replacing the initial Hessian with the identity matrix and letting the Hessian updates build it up. In that case the first step would be in the direction of steepest descent: take a full Newton step first and backtrack if it overshoots. However, the backtracking may require several evaluations of the function, and in this case that's a major problem.

A search for Levenberg-Marquardt on the Internet also gives a lot of hits, so I may spend the coming night going through them.

A question: aren't minimisation methods and root-finding methods related anyway via the derivative (a minimum of f(x) is a root of f'(x)), so shouldn't they be equivalent in their convergence behaviour?
I am not exactly sure what you mean by that, but yes, I suppose they should be. So the question is why the BFGS approach failed and Broyden's succeeded?

Noise has been a major problem. The gradient is computed using a finite difference between two solutions, g(x) = ( f(x+d) - f(x) ) / d. If 'd' is too small, this seems to give the gradient of the noise, not of the function. I suppose that was the second flaw that prevented BFGS from working. There was a noise problem in the Broyden approach as well, but there it was easier to identify and do something about.
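
A toy numpy illustration of the effect (the noise amplitude of 1e-6 is made up, not measured from the trajectory code):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2 + 1e-6 * rng.standard_normal()  # smooth function + noise

x = 1.0  # true derivative f'(1) = 2
for d in (1e-2, 1e-4, 1e-6, 1e-8):
    g = (f(x + d) - f(x)) / d                      # forward difference
    print(f"d = {d:.0e}   g = {g:.4f}")
# Once d is small enough that the noise term ~1e-6/d dominates,
# the estimate measures the noise instead of the function.
```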

How many parameters does your model actually have? If it is a single-impulse problem, then I guess the parameters are time of impulse (scalar) and impulse (vector), so the solution space would be of dimension 4

Yes, that is exactly what it was. But unfortunately a very small change in the impulse vector causes a very large change in the lunar fly-by. Usually the first step towards the minimum took the trajectory out of the valid range.

Since things were not working out very well, I decided to switch to Broyden's method and re-described the problem in a more well-behaved environment by using a lunar offset vector. A change in the lunar offset vector causes an approximately equal change in the lunar fly-by, the initial Jacobian matrix is pretty close to the identity matrix, and it doesn't change much during the iteration. Also, evaluating the true Jacobian is not nearly as much of a problem as evaluating the true Hessian.

Of course, this method does require additional work to evaluate the impulse vector from the lunar offset vector, but that's fast and accurate, not a problem.

I suppose that re-implementing the case with the BFGS method, now with more experience and knowledge about the problem, would make it converge as Broyden's does. It would be computationally a lot more expensive, but it would allow optimizing bi-elliptic transfer trajectories.

Do you run into problems with local minima? How do you deal with them (e.g. restarts with different initial parameters, genetic algorithms, simulated annealing etc.)? The problem with all of these methods is that they require many forward solutions so could be costly.
I haven't encountered that problem yet, and it shouldn't cause problems in this task. But of course I will need to face it sooner or later.


-----Post Added-----


and the initial Jacobian matrix is pretty close to the identity matrix
I'll take that one back. It's not exactly close, since the exponent was +6, not -6:

(Wed Dec 17 21:51:02 2008 )Row 0: [ -0.930364, -5.09622e-009, 8.91826e-009]
(Wed Dec 17 21:51:02 2008 )Row 1: [1.18611e+006, -0.790991, -0.0271801]
(Wed Dec 17 21:51:02 2008 )Row 2: [-1.84411e+006, 0.0458468, -0.277252]
(Wed Dec 17 21:51:02 2008 )Jacobian: Principal Minors Vector: [-0.930364, 0.741954, -0.219647]
 

martins

The book I have, ISBN 0387966145, says nothing about Levenberg-Marquardt.
Levenberg-Marquardt (LM) is a flavour of Gauss-Newton solver. It approximates the Hessian with

H = J^T J + lambda I

where J is the Jacobian (the first-order derivative of the forward operator with respect to the parameters).

The lambda*I term controls the update direction and the step size.

- lambda = 0 gives a "pure Newton" step.
- Increasing lambda turns the update towards the gradient direction and reduces the step size. (LM is a "trust region" method: making lambda sufficiently large should always guarantee a downhill step, unless you have already reached the minimum.)

LM methods usually update lambda dynamically:

At iteration k:
- if the update is successful (reduction of the cost function), accept it and reduce lambda for the next iteration
- if the update fails (no reduction in the cost function, or other problems like non-positive-definiteness), increase lambda and try again
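
In pseudocode, for a least-squares cost 0.5*||r(x)||^2 (a minimal numpy sketch; residual() and jacobian() stand in for the forward model):

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, lam=1e-3, tol=1e-10, max_iter=100):
    x = x0.astype(float)
    r = residual(x)
    for _ in range(max_iter):
        J = jacobian(x)
        g = J.T @ r                                   # gradient of 0.5*||r||^2
        H = J.T @ J                                   # Gauss-Newton Hessian
        dx = np.linalg.solve(H + lam * np.eye(x.size), -g)
        r_new = residual(x + dx)
        if r_new @ r_new < r @ r:                     # success: accept, relax lambda
            x, r = x + dx, r_new
            lam *= 0.3
            if np.linalg.norm(g) < tol:
                break
        else:                                         # failure: reject, raise lambda
            lam *= 10.0
    return x
```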


Another Gauss-Newton flavour is "damped GN", where the step size is computed by a 1-D line search in the update direction. However, this can't be combined with the LM control parameter, because the two methods fight each other in determining the step length. Therefore, in damped GN, if you have problems with the Hessian (e.g. due to noise), you need to use some other regularisation method.

Another important issue is data and solution space scaling. Scaling can influence the conditioning of the problem significantly. In your problem, for example, the solution space consists of a scalar time and the components of an impulse vector. These two parts have different dimensions, and potentially very different magnitudes. Without rescaling, your optimisation method might bias the solution by optimising predominantly one of the components. Rescaling tries to level out the relative magnitude of the different components across the solution space.
A similar argument could apply to the data space.


-----Post Added-----


However, this can't be combined with the LM control parameter, because the two methods fight each other in determining the step length. Therefore, in damped GN, if you have problems with the Hessian (e.g. due to noise), you need to use some other regularisation method.
Actually, thinking about it, this isn't quite true. You just can't combine a line search with a dynamic update of lambda, but you could theoretically use a line search with a fixed value of lambda (but then the problem is how to choose lambda).
 

jarmonik

Well-known member
Orbiter Contributor
Addon Developer
Beta Tester
Joined
Mar 28, 2008
Messages
2,668
Reaction score
796
Points
128
Another important issue is data and solution space scaling. Scaling can influence the conditioning of the problem significantly. In your problem, for example, the solution space consists of a scalar time and the components of an impulse vector. These two parts have different dimensions, and potentially very different magnitudes. Without rescaling, your optimisation method might bias the solution by optimising predominantly one of the components. Rescaling tries to level out the relative magnitude of the different components across the solution space.
That's a very good point, and it was a major problem in the first attempt to use BFGS: the algorithm was only trying to minimise one of the four available dimensions, the one with the strongest gradient, but the minimum was not there. Let's consider the function:

f(x,y) = e^(x*x) + sqrt(y*y + 1), with the initial starting point (2, 10000)

The gradient is very strong along 'x', but the minimization algorithm should take a step along 'y' in order to find the minimum. Now, if 'y' is very noisy, the progress along 'x' gets lost in the noise and the iteration doesn't converge anywhere.

Improving the linearity and scaling would help a lot. How about defining an auxiliary variable z with x = sqrt(ln(z)) and minimising the function in (z,y) space, in theory? Of course, ln(0) is not defined.
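
A quick numpy check of that example:

```python
import numpy as np

x, y = 2.0, 10_000.0
dfdx = 2.0 * x * np.exp(x * x)     # ~218.4
dfdy = y / np.sqrt(y * y + 1.0)    # ~1.0
print(dfdx, dfdy)  # steepest descent points almost entirely along x,
                   # although the minimum (0, 0) is ~10000 away along y

# With the substitution z = e^(x*x) the objective becomes z + sqrt(y*y + 1),
# so df/dz = 1, comparable to df/dy: the rescaled problem is much better behaved.
```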

Levenberg-Marquardt (LM) is a flavour of Gauss-Newton solver. It approximates the Hessian with
H = J^T J + lambda I
where J is the Jacobian (the first-order derivative of the forward operator with respect to the parameters).
The lambda*I term controls the update direction and the step size.

I just wanted to be sure: is this a method to improve the convergence of minimization methods like BFGS, and not another minimization method to replace BFGS?

And "J^T J" is a matrix-transpose-times-matrix multiplication, not an outer product of the gradient vectors?

Actually, thinking about it, this isn't quite true. You just can't combine a line search with a dynamic update of lambda, but you could theoretically use a line search with a fixed value of lambda (but then the problem is how to choose lambda).
The current implementation of Broyden's method does have a line search as a backup, but it has never needed it. I suppose that if the line-search backup must reduce the step significantly, something must be wrong with the scaling of the model. But I could be mistaken.

I should be able to restore the BFGS method in a few days and try it again.

I suppose the procedure would be (a code sketch follows the list):

1. Evaluate the true Jacobian J at x0.
2. Evaluate the gradient at the initial starting point, g0 = grad f(x0).
3. Compute the Hessian approximation H = J^T J.
4. Solve the step dx from (H + lambda*I) dx = -g0.
5. Evaluate the new gradient g = grad f(x0 + dx) and the new scalar value of the function, f(x0 + dx).
6. Is the minimum reached, i.e. is each component of the gradient within tolerance? If not, continue.
7. Verify convergence: if the cost did not decrease, increase lambda and go to 4; if it did, reduce lambda and continue.
8. Set x0 = x0 + dx and g0 = g.
9. Apply the BFGS update to the Hessian.
10. Check that (H + lambda*I) is positive definite; if not, increase lambda.
11. Go to 4.
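
Something like that loop, written for a least-squares objective f = 0.5*||r(x)||^2, might look as follows (a rough numpy sketch of the outline above, not tested against the real problem; residual() and jacobian() are placeholders):

```python
import numpy as np

def is_pos_def(A):
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

def bfgs_lm_minimise(residual, jacobian, x0, lam=1e-3, tol=1e-8, max_iter=100):
    x = x0.astype(float)
    J = jacobian(x)                                  # 1. true Jacobian at x0
    r = residual(x)
    g = J.T @ r                                      # 2. gradient of 0.5*||r||^2
    H = J.T @ J                                      # 3. Hessian approximation
    I = np.eye(x.size)
    for _ in range(max_iter):
        dx = np.linalg.solve(H + lam * I, -g)        # 4. regularised step
        r_new = residual(x + dx)                     # 5. new value and gradient
        g_new = jacobian(x + dx).T @ r_new
        if np.max(np.abs(g_new)) < tol:              # 6. gradient within tolerance?
            return x + dx
        if r_new @ r_new >= r @ r:                   # 7. cost not reduced:
            lam *= 10.0                              #    raise lambda, retry step
            continue
        lam *= 0.3
        s, y = dx, g_new - g                         # 8. accept the step
        x, r, g = x + dx, r_new, g_new
        ys = y @ s
        if ys > 1e-12:                               # 9. BFGS update (needs y.s > 0)
            Hs = H @ s
            H += np.outer(y, y) / ys - np.outer(Hs, Hs) / (s @ Hs)
        while not is_pos_def(H + lam * I):           # 10. keep (H + lambda*I) SPD
            lam *= 10.0                              # 11. loop back to the solve
    return x
```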
 

martins

I just wanted to be sure: is this a method to improve the convergence of minimization methods like BFGS, and not another minimization method to replace BFGS?
Well, LM _is_ an independent optimisation method, so it could be used instead of BFGS. I am not sure if it would work in the scheme you outlined, but you can try. I can't quite remember how BFGS approximates the Hessian, but I suspect it creates a basis representation with a given number of basis vectors, where at each new iteration a new vector is added and a previous one is thrown out. I don't know how you would combine this with the GN approximation J^T J.

And "J^T J" is a matrix-transpose-times-matrix multiplication, not an outer product of the gradient vectors?
Yes. J is the Jacobian matrix dy_i/dx_j, where y is the data vector and x is the parameter vector. It's not the gradient of the objective function (that would be J^T delta y).
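
A two-line numpy illustration (the shapes are arbitrary):

```python
import numpy as np

J  = np.random.default_rng(1).standard_normal((5, 3))  # 5 data points, 3 parameters
dy = np.ones(5)                                        # data misfit vector

H_approx = J.T @ J   # 3x3 Gauss-Newton approximation of the Hessian
grad     = J.T @ dy  # length-3 gradient of the least-squares objective
```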

If the Hessian causes problems, you could also try a first-order scheme such as nonlinear conjugate gradients. (see e.g. http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf). However, first-order methods generally converge more slowly (in particular in the vicinity of the minimum), so may require more forward calculations.
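
A bare-bones Polak-Ribière variant, as a numpy sketch with a crude backtracking line search (f and grad are placeholders for the objective and its gradient):

```python
import numpy as np

def nonlinear_cg(f, grad, x0, tol=1e-8, max_iter=200):
    """Polak-Ribiere nonlinear conjugate gradients with backtracking."""
    x = x0.astype(float)
    g = grad(x)
    d = -g                                    # first step: steepest descent
    for _ in range(max_iter):
        t, fx = 1.0, f(x)                     # backtrack until Armijo holds
        while f(x + t * d) > fx + 1e-4 * t * (g @ d):
            t *= 0.5
        x_new = x + t * d
        g_new = grad(x_new)
        if np.linalg.norm(g_new) < tol:
            return x_new
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))  # PR+ restart rule
        d = -g_new + beta * d
        x, g = x_new, g_new
    return x
```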
 

jarmonik

If the Hessian causes problems, you could also try a first-order scheme such as nonlinear conjugate gradients. (see e.g. http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf). However, first-order methods generally converge more slowly (in particular in the vicinity of the minimum), so may require more forward calculations.
I have never seen such well-explained and detailed descriptions of the Steepest Descent and Conjugate Gradient methods anywhere else. The section on eigenvalues also makes some sense. That document is brilliant. :speakcool:
 

mjessick

I love the keywords:
Keywords:
conjugate gradient method, preconditioning, convergence analysis, agonizing pain
 