How to properly specify a gradient function for use in optim() or other optimizers

I have an optimization problem that the Nelder-Mead method will solve, but that I would also like to solve using BFGS or Newton-Raphson, or something else that takes a gradient function, for more speed and, hopefully, more precise estimates. I wrote such a gradient function following (I thought) the example in the optim / optimx documentation, but when I use it with BFGS my starting values either don't move (optim()), or else the function doesn't run at all (optimx(), which returns Error: Gradient function might be wrong - check it!). I'm sorry there's a bit of code involved in reproducing this, but here goes:

This is the function that I want to get parameter estimates for (this is for smoothing old-age mortality rates, where x is age, starting at age 80):

    KannistoMu <- function(pars, x = .5:30.5){
      a <- pars["a"]
      b <- pars["b"]
      (a * exp(b * x)) / (1 + a * exp(b * x))
    }

And here's a log-likelihood function for estimating it from observed rates (defined as deaths, .Dx, over exposure, .Exp):

    KannistoLik1 <- function(pars, .Dx, .Exp, .x. = .5:30.5){
      mu <- KannistoMu(exp(pars), x = .x.)
      # take negative and minimize it (default optimizer behavior)
      -sum(.Dx * log(mu) - .Exp * mu, na.rm = TRUE) 
    }

You see exp(pars) in there because I give log(pars) to the optimizer, in order to constrain the final a and b to be positive.
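
A quick illustration of that reparameterization (this snippet is mine, not from the original post):

    # exp() of any real-valued proposal is strictly positive, so a and b can never go negative
    pars.log <- log(c(a = .1, b = .1))
    exp(pars.log)                     # recovers a = 0.1, b = 0.1
    head(KannistoMu(exp(pars.log)))   # mu is evaluated on the positive scale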

Example data (1962 Japan females, if anyone is curious):

    .Dx <- structure(c(10036.12, 9629.12, 8810.11, 8556.1, 7593.1, 6975.08, 
      6045.08, 4980.06, 4246.06, 3334.04, 2416.03, 1676.02, 1327.02, 
      980.02, 709, 432, 350, 217, 134, 56, 24, 21, 10, 8, 3, 1, 2, 
      1, 0, 0, 0), .Names = c("80", "81", "82", "83", "84", "85", "86", 
      "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", 
      "98", "99", "100", "101", "102", "103", "104", "105", "106", 
      "107", "108", "109", "110"))
    .Exp <- structure(c(85476.0333333333, 74002.0866666667, 63027.5183333333, 
      53756.8983333333, 44270.9, 36749.85, 29024.9333333333, 21811.07, 
      16912.315, 11917.9583333333, 7899.33833333333, 5417.67, 3743.67833333333, 
      2722.435, 1758.95, 1043.985, 705.49, 443.818333333333, 223.828333333333, 
      93.8233333333333, 53.1566666666667, 27.3333333333333, 16.1666666666667, 
      10.5, 4.33333333333333, 3.16666666666667, 3, 2.16666666666667, 
      1.5, 0, 1), .Names = c("80", "81", "82", "83", "84", "85", "86", 
      "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", 
      "98", "99", "100", "101", "102", "103", "104", "105", "106", 
      "107", "108", "109", "110"))

The following works for the Nelder-Mead method:

    NMab <- optim(log(c(a = .1, b = .1)), 
      fn = KannistoLik1, method = "Nelder-Mead",
      .Dx = .Dx, .Exp = .Exp)
    exp(NMab$par) 
    # these are reasonable estimates
       a         b 
    0.1243144 0.1163926
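
A quick sanity check of that claim (this snippet is mine, not from the original post): compare the observed crude rates with the fitted Kannisto schedule.

    mx.obs <- .Dx / .Exp                 # crude rates (NaN where exposure is zero)
    mx.fit <- KannistoMu(exp(NMab$par))  # fitted schedule at ages 80.5, ..., 110.5
    round(cbind(observed = mx.obs, fitted = mx.fit)[1:10, ], 4)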

This is the gradient function I came up with:

    Kannisto.gr <- function(pars, .Dx, .Exp, x = .5:30.5){
      a <- exp(pars["a"])
      b <- exp(pars["b"])
      d.a <- (a * exp(b * x) * .Exp + (-a * exp(b * x) - 1) * .Dx) /
        (a ^ 3 * exp(2 * b * x) + 2 * a ^ 2 * exp(b * x) + a)
      d.b <- (a * x * exp(b * x) * .Exp + (-a * x * exp(b * x) - x) * .Dx) /
        (a ^ 2 * exp(2 * b * x) + 2 * a * exp(b * x) + 1)
      -colSums(cbind(a = d.a, b = d.b), na.rm = TRUE)
    }

The output is a vector of length 2: the partial derivatives with respect to the parameters a and b. I also have an uglier version, arrived at by exploiting the output of deriv(), which returns the same answer and which I don't post (just to confirm that the derivatives are right).
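
For reference, a minimal sketch of such a deriv()-based gradient (the names negLL.d and Kannisto.gr.deriv are illustrative, not the version referred to above): it differentiates the per-age contribution to the negative log-likelihood with respect to the log-parameters and sums over ages.

    # let R differentiate the per-age negative log-likelihood term with respect to
    # the log-parameters la = log(a) and lb = log(b)
    negLL.d <- deriv(
      ~ -(Dx * log(exp(la) * exp(exp(lb) * x) / (1 + exp(la) * exp(exp(lb) * x))) -
            Exp * exp(la) * exp(exp(lb) * x) / (1 + exp(la) * exp(exp(lb) * x))),
      namevec = c("la", "lb"),
      function.arg = c("la", "lb", "Dx", "Exp", "x")
    )

    Kannisto.gr.deriv <- function(pars, .Dx, .Exp, x = .5:30.5){
      g <- attr(negLL.d(pars["a"], pars["b"], .Dx, .Exp, x), "gradient")
      colSums(g, na.rm = TRUE)  # gradient of the summed negative log-likelihood
    }

    # at the starting values this should agree with the numDeriv check further down:
    # Kannisto.gr.deriv(log(c(a = .1, b = .1)), .Dx, .Exp)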

If I supply it to optim() as follows, with BFGS as the method, the estimates do not move from the starting values:

    BFGSab <- optim(log(c(a = .1, b = .1)), 
      fn = KannistoLik1, gr = Kannisto.gr, method = "BFGS",
      .Dx = .Dx, .Exp = .Exp)
    # estimates do not change from starting values:
    exp(BFGSab$par) 
      a   b 
    0.1 0.1

When I look at the $counts element of the output, it says that KannistoLik1() was called 31 times and Kannisto.gr() just 1 time. $convergence is 0, so I guess it thinks it converged (if I give less reasonable starts they also stay put). I reduced the tolerance, etc., and nothing changes. When I try the same call in optimx() (not shown), I receive the warning I mentioned above, and no object is returned. I get the same results when specifying gr = Kannisto.gr with the "CG" method. With the "L-BFGS-B" method I get the same starting values back as the estimate, but it is also reported that both the function and the gradient were called 21 times, and there is an error message: "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH".
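
For reference, those diagnostics can be read straight off the returned object:

    BFGSab$counts       # function: 31, gradient: 1 (the gradient is barely touched)
    BFGSab$convergence  # 0, i.e. nominally converged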

I'm hoping there is some minor detail in the way the gradient function is written that will fix this, as the latter warning and the optimx behavior bluntly hint that the function simply isn't right (I think). I also tried the maxNR() maximizer from the maxLik package and observed similar behavior (starting values don't move). Can anyone give me a pointer? Much obliged.

[Edit] @Vincent suggested I compare with the output from a numerical approximation:

    library(numDeriv)
    grad( function(u) KannistoLik1( c(a=u[1], b=u[2]), .Dx, .Exp ), log(c(.1,.1)) )
    [1] -14477.40  -7458.34
    Kannisto.gr(log(c(a=.1,b=.1)), .Dx, .Exp)
     a        b 
    144774.0  74583.4 

So the sign is different, and it's off by a factor of 10? I changed the gradient function to follow suit:

    Kannisto.gr2 <- function(pars, .Dx, .Exp, x = .5:30.5){
      a <- exp(pars["a"])
      b <- exp(pars["b"])
      d.a <- (a * exp(b * x) * .Exp + (-a * exp(b * x) - 1) * .Dx) /
        (a ^ 3 * exp(2 * b * x) + 2 * a ^ 2 * exp(b * x) + a)
      d.b <- (a * x * exp(b * x) * .Exp + (-a * x * exp(b * x) - x) * .Dx) /
        (a ^ 2 * exp(2 * b * x) + 2 * a * exp(b * x) + 1)
      colSums(cbind(a=d.a,b=d.b), na.rm = TRUE) / 10
    }
    Kannisto.gr2(log(c(a=.1,b=.1)), .Dx, .Exp)
    # same as numerical:
      a         b 
    -14477.40  -7458.34 

Try it in the optimizer:

    BFGSab <- optim(log(c(a = .1, b = .1)), 
      fn = KannistoLik1, gr = Kannisto.gr2, method = "BFGS",
      .Dx = .Dx, .Exp = .Exp)
    # not reasonable results:
    exp(BFGSab$par) 
      a   b 
    Inf Inf 
    # and in fact, when not exp()'d, they look oddly familiar:
    BFGSab$par
      a         b 
    -14477.40  -7458.34 
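
A note on where the sign and the factor of 10 come from (my explanation, not part of the original exchange): the numerical gradient is taken with respect to log(a) and log(b), and by the chain rule d/d log(a) = a * d/da, which equals 0.1 * d/da only at these starting values, while the leading -colSums() in Kannisto.gr supplies the sign flip. So the hard-coded /10 in Kannisto.gr2 is only right at a = b = 0.1, which is presumably why it misbehaves once the optimizer moves. A quick check of that relationship:

    # undo the -colSums() sign flip and apply the chain-rule factor a = b = 0.1:
    -Kannisto.gr(log(c(a = .1, b = .1)), .Dx, .Exp) * 0.1
    #         a         b 
    # -14477.40  -7458.34   (matches the numDeriv result above)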

Following Vincent's answer, I rescaled the gradient function and used abs() instead of exp() to keep the parameters positive. The most recent, better-performing objective and gradient functions:

    KannistoLik2 <- function(pars, .Dx, .Exp, .x. = .5:30.5){
      mu <- KannistoMu(abs(pars), x = .x.)
      # take negative and minimize it (default optimizer behavior)
      -sum(.Dx * log(mu) - .Exp * mu, na.rm = TRUE) 
    }

    # gradient, to be down-scaled in `optim()` call
    Kannisto.gr3 <- function(pars, .Dx, .Exp, x = .5:30.5){
      a <- abs(pars["a"])
      b <- abs(pars["b"])
      d.a <- (a * exp(b * x) * .Exp + (-a * exp(b * x) - 1) * .Dx) /
        (a ^ 3 * exp(2 * b * x) + 2 * a ^ 2 * exp(b * x) + a)
      d.b <- (a * x * exp(b * x) * .Exp + (-a * x * exp(b * x) - x) * .Dx) /
        (a ^ 2 * exp(2 * b * x) + 2 * a * exp(b * x) + 1)
      colSums(cbind(a = d.a, b = d.b), na.rm = TRUE) 
    }

    # try it out:
    BFGSab2 <- optim(
      c(a = .1, b = .1), 
      fn = KannistoLik2, 
      gr = function(...) Kannisto.gr3(...) * 1e-7, 
      method = "BFGS",
      .Dx = .Dx, .Exp = .Exp
    )
    # reasonable:
    BFGSab2$par
            a         b 
    0.1243249 0.1163924 

    # better: the negative log-likelihood is lower than at the Nelder-Mead estimates
    KannistoLik2(exp(NMab$par), .Dx = .Dx, .Exp = .Exp) > KannistoLik2(BFGSab2$par, .Dx = .Dx, .Exp = .Exp)
    [1] TRUE

This was solved much faster than I was expecting, and I learned more than a couple tricks. Thanks Vincent!


[Vincent's answer] To check whether the gradient is correct, you can compare it with a numerical approximation:

    library(numDeriv)
    grad( function(u) KannistoLik1( c(a=u[1], b=u[2]), .Dx, .Exp ), c(1,1) )
    Kannisto.gr(c(a=1,b=1), .Dx, .Exp)

The signs are wrong: the algorithm does not see any improvement when it moves in this direction, and therefore does not move.
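
A minimal illustration of that point (not part of the original answer): a tiny step in the direction the optimizer would try, i.e. against the reported gradient, fails to decrease the objective, which is consistent with the estimates staying at the starting values.

    p0  <- log(c(a = .1, b = .1))
    g   <- Kannisto.gr(p0, .Dx, .Exp)   # the wrong-signed analytic gradient
    eps <- 1e-8
    # a descent step should lower the negative log-likelihood, but it does not:
    KannistoLik1(p0 - eps * g, .Dx, .Exp) < KannistoLik1(p0, .Dx, .Exp)
    # [1] FALSE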

You can use some computer algebra system (here, Maxima) to do the computations for you:

    display2d: false;
    f(a,b,x) := a * exp(b*x) / ( 1 + a * exp(b*x) );
    l(a,b,d,e,x) := - d * log(f(a,b,x)) + e * f(a,b,x);
    factor(diff(l(exp(a),exp(b),d,e,x),a));
    factor(diff(l(exp(a),exp(b),d,e,x),b));

I just copy and paste the result into R:

    f_gradient <- function(u, .Dx, .Exp, .x.=.5:30.5) {
      a <- u[1]
      b <- u[2]
      x <- .x.
      d <- .Dx
      e <- .Exp
      c(
        sum( (e*exp(exp(b)*x+a)-d*exp(exp(b)*x+a)-d)/(exp(exp(b)*x+a)+1)^2 ),
        sum( exp(b)*x*(e*exp(exp(b)*x+a)-d*exp(exp(b)*x+a)-d)/(exp(exp(b)*x+a)+1)^2 )
      )
    }

    library(numDeriv)
    grad( function(u) KannistoLik1( c(a=u[1], b=u[2]), .Dx, .Exp ), c(1,1) )
    f_gradient(c(a=1,b=1), .Dx, .Exp)  # Identical

If you blindly put this gradient into the optimization, there is a numerical instability problem: the solution given is (Inf, Inf)... To prevent it, you can rescale the gradient (a better workaround would be to use a less explosive transformation than the exponential to ensure that the parameters remain positive; see the sketch after the code below).

    BFGSab <- optim(
      log(c(a = .1, b = .1)), 
      fn = KannistoLik1, 
      gr = function(...) f_gradient(...) * 1e-3, 
      method = "BFGS",
      .Dx = .Dx, .Exp = .Exp
    )
    exp(BFGSab$par) # Less precise than Nelder-Mead
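
As for the less explosive transformation, here is a minimal sketch using a softplus reparameterization, a = log(1 + exp(alpha)), whose derivative plogis(alpha) is bounded by 1, so the chain-rule factor cannot blow up the way exp() does. The names KannistoLik.sp and Kannisto.gr.sp are illustrative; the sketch reuses Kannisto.gr3 from the question's edit and is untested here, so it may still benefit from the gradient rescaling used above.

    KannistoLik.sp <- function(pars, .Dx, .Exp, .x. = .5:30.5){
      # pars are unconstrained; softplus maps them to strictly positive a and b
      mu <- KannistoMu(log1p(exp(pars)), x = .x.)
      -sum(.Dx * log(mu) - .Exp * mu, na.rm = TRUE)
    }
    Kannisto.gr.sp <- function(pars, .Dx, .Exp, x = .5:30.5){
      # gradient with respect to (a, b) from Kannisto.gr3, times the softplus chain-rule factor
      Kannisto.gr3(log1p(exp(pars)), .Dx, .Exp, x) * plogis(pars)
    }

    start <- log(expm1(c(a = .1, b = .1)))  # inverse softplus of the old starting values
    SPab  <- optim(start, fn = KannistoLik.sp, gr = Kannisto.gr.sp,
                   method = "BFGS", .Dx = .Dx, .Exp = .Exp)
    log1p(exp(SPab$par))  # back on the positive (a, b) scale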