SMILE — BFGS Optimization

The smile.math.BFGS class implements three closely related quasi-Newton minimization algorithms, all accessible through overloaded minimize() static methods:

Variant      Method signature                             Memory    Constraints
────────────────────────────────────────────────────────────────────────────────
BFGS         minimize(func, x, gtol, maxIter)             O(n²)     none
L-BFGS       minimize(func, m, x, gtol, maxIter)          O(m·n)    none
L-BFGS-B     minimize(func, m, x, l, u, gtol, maxIter)    O(m·n)    box bounds

Background

BFGS

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is the gold standard for moderate-sized unconstrained smooth minimization. It belongs to the class of quasi-Newton methods: instead of computing the exact Hessian ∇²f(x) (expensive) it maintains a running approximation H of the inverse Hessian, updated cheaply at each iteration using only gradient differences.

Each BFGS iteration performs three steps:

  1. Descent direction: compute d = −H ∇f(x).
  2. Line search: find a step length α satisfying the Wolfe conditions (sufficient decrease + curvature); update x ← x + α d.
  3. Hessian update: update H using the rank-2 formula:

$$ H_{k+1} = H_k + \frac{s_k s_k^\top}{s_k^\top y_k} - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} $$

where sₖ = xₖ₊₁ − xₖ and yₖ = ∇f(xₖ₊₁) − ∇f(xₖ).

Because H is stored as a dense n × n matrix, the memory cost is O(n²), which makes BFGS impractical for problems with more than a few thousand variables.
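
The update can be carried out directly on a dense array. The sketch below applies the rank-2 formula above to a plain double[][] matrix; it is an illustration, not SMILE's internal code.

java
// Apply H ← H + s·sᵀ/(sᵀy) − (H·y)(H·y)ᵀ/(yᵀ·H·y) in place.
// H must be symmetric positive definite and sᵀy > 0 (guaranteed
// by a line search satisfying the Wolfe conditions).
static void updateInverseHessian(double[][] H, double[] s, double[] y) {
    int n = s.length;
    double sy = 0.0;                       // sᵀy
    double[] Hy = new double[n];           // H·y
    for (int i = 0; i < n; i++) {
        sy += s[i] * y[i];
        for (int j = 0; j < n; j++) Hy[i] += H[i][j] * y[j];
    }
    double yHy = 0.0;                      // yᵀ·H·y
    for (int i = 0; i < n; i++) yHy += y[i] * Hy[i];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            H[i][j] += s[i] * s[j] / sy - Hy[i] * Hy[j] / yHy;
}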

L-BFGS

Limited-memory BFGS (L-BFGS) avoids storing the full inverse Hessian approximation. Instead it keeps a rolling window of the last m pairs of (position difference s, gradient difference y) vectors and uses a two-loop recursion to compute the matrix–vector product Hg implicitly.

Memory is O(m·n), which grows only linearly with the problem dimension instead of quadratically. A history size of m = 5 (within the commonly suggested range of 3–7) suffices for most problems.
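
The two-loop recursion is compact enough to sketch. Assuming history arrays s and y that hold the last m pairs with the newest pair last (hypothetical buffers, not SMILE's internal code), the product d = H·q is computed as:

java
// Two-loop recursion: computes d = H·q implicitly from the last m
// (s, y) pairs. With q = ∇f(x), the descent direction is −d.
static double[] twoLoop(double[] q, double[][] s, double[][] y) {
    int m = s.length, n = q.length;
    double[] d = q.clone();
    double[] alpha = new double[m];
    double[] rho = new double[m];
    for (int i = m - 1; i >= 0; i--) {     // first loop: newest to oldest
        rho[i] = 1.0 / dot(y[i], s[i]);
        alpha[i] = rho[i] * dot(s[i], d);
        for (int j = 0; j < n; j++) d[j] -= alpha[i] * y[i][j];
    }
    // Initial scaling H₀ = γI with γ = sᵀy / yᵀy from the newest pair.
    double gamma = dot(s[m - 1], y[m - 1]) / dot(y[m - 1], y[m - 1]);
    for (int j = 0; j < n; j++) d[j] *= gamma;
    for (int i = 0; i < m; i++) {          // second loop: oldest to newest
        double beta = rho[i] * dot(y[i], d);
        for (int j = 0; j < n; j++) d[j] += (alpha[i] - beta) * s[i][j];
    }
    return d;
}

static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
}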

L-BFGS-B

Limited-memory BFGS with Bounds (L-BFGS-B) extends L-BFGS to handle box constraints:

lᵢ ≤ xᵢ ≤ uᵢ    for each variable xᵢ

At each iteration the algorithm:

  1. Computes the Generalized Cauchy Point (GCP) — the first local minimizer of a piecewise-linear model along the projected steepest-descent path. This identifies which variables are at their bounds ("fixed") and which are free.
  2. Solves a reduced subspace problem over the free variables using the L-BFGS inverse Hessian approximation.
  3. Performs a projected line search that keeps the iterate inside the box (the projection step is sketched below).
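
The projection itself is a per-component clamp into [l, u]; a minimal sketch, not SMILE's internal code:

java
// Project x onto the box [l, u], component by component.
static void project(double[] x, double[] l, double[] u) {
    for (int i = 0; i < x.length; i++) {
        x[i] = Math.max(l[i], Math.min(u[i], x[i]));
    }
}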

API

All variants share the same pattern: the function value at the minimum is returned, and the solution vector x is updated in-place.

Implementing the Objective Function

All three variants require a smile.util.function.DifferentiableMultivariateFunction, which has two methods:

java
/** Returns f(x). */
double f(double[] x);

/** Computes the gradient g = ∇f(x) and returns f(x). */
double g(double[] x, double[] g);

The g() method must fill the supplied array g with the partial derivatives and also return the function value. This dual-purpose design avoids redundant evaluations when the value and gradient are computed together.
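
As a minimal illustration of this contract, here is the sphere function f(x) = ‖x‖², whose g() fills the gradient and returns the value in a single pass:

java
DifferentiableMultivariateFunction sphere = new DifferentiableMultivariateFunction() {
    @Override
    public double f(double[] x) {
        double sum = 0.0;
        for (double xi : x) sum += xi * xi;
        return sum;
    }

    @Override
    public double g(double[] x, double[] g) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            g[i] = 2.0 * x[i];             // ∂f/∂xᵢ = 2xᵢ
            sum += x[i] * x[i];
        }
        return sum;                        // same value f(x) would return
    }
};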

BFGS — Unconstrained, Dense Hessian

java
double minimum = BFGS.minimize(func, x, gtol, maxIter);

Parameter    Meaning
────────────────────────────────────────────────────────────────
func         DifferentiableMultivariateFunction to minimize
x            initial guess; overwritten with the solution on return
gtol         gradient convergence tolerance (e.g. 1e-5); must be > 0
maxIter      maximum number of outer iterations

Returns the function value at the best point found.

Convergence is declared when either:

  • the scaled gradient norm falls below gtol, or
  • the scaled step length falls below TOLX = 4ε, where ε is the base epsilon (default 10⁻⁸; see Configuration below).

If maxIter is reached without convergence a warning is logged and the best value found so far is returned.

L-BFGS — Unconstrained, Limited Memory

java
double minimum = BFGS.minimize(func, m, x, gtol, maxIter);

Parameter    Meaning
────────────────────────────────────────────────────────────────
m            number of correction pairs to retain (3 ≤ m ≤ 7 recommended; m = 5 is a common choice)
(others)     same as BFGS

Use L-BFGS instead of BFGS when n > ~1000 or when memory is limited. For small n the dense BFGS method typically converges in fewer iterations.

L-BFGS-B — Bound-Constrained, Limited Memory

java
double minimum = BFGS.minimize(func, m, x, l, u, gtol, maxIter);

Parameter    Meaning
────────────────────────────────────────────────────────────────
l            lower bounds array, length n
u            upper bounds array, length n
(others)     same as L-BFGS

Both l and u must have the same length as x. The initial point x need not satisfy the bounds; x is clamped to [l, u] before the first iteration. Use Double.NEGATIVE_INFINITY / Double.POSITIVE_INFINITY to leave individual components unbounded.
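
For example, assuming func and a starting point x are already defined, a non-negativity constraint on x[0] alone (with x[1] left free) would look like:

java
double[] l = {0.0, Double.NEGATIVE_INFINITY};   // x[0] ≥ 0, x[1] unbounded below
double[] u = {Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY};
double fmin = BFGS.minimize(func, 5, x, l, u, 1e-8, 500);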


Configuration

The internal convergence constants can be tuned via JVM system properties:

Property             Default   Meaning
────────────────────────────────────────────────────────────────
smile.bfgs.epsilon   1e-8      Base epsilon used to derive TOLX and TOLF

Set on the command line with -Dsmile.bfgs.epsilon=1e-10.
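
Assuming the property is read when the BFGS class is first initialized, it can also be set programmatically, provided this happens before the first call into BFGS:

java
// Equivalent to -Dsmile.bfgs.epsilon=1e-10 on the command line.
System.setProperty("smile.bfgs.epsilon", "1e-10");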


Examples

Example 1 — BFGS on the Rosenbrock Function

The Rosenbrock banana function f(x, y) = (1-x)² + 100(y-x²)² has a narrow curved valley; its global minimum is 0 at (1, 1).

java
import smile.math.BFGS;
import smile.util.function.DifferentiableMultivariateFunction;

DifferentiableMultivariateFunction rosenbrock = new DifferentiableMultivariateFunction() {
    @Override
    public double f(double[] x) {
        double t1 = 1.0 - x[0];
        double t2 = 10.0 * (x[1] - x[0] * x[0]);
        return t1 * t1 + t2 * t2;
    }

    @Override
    public double g(double[] x, double[] g) {
        double t1 = 1.0 - x[0];
        double t2 = 10.0 * (x[1] - x[0] * x[0]);
        g[1] = 20.0 * t2;
        g[0] = -2.0 * (x[0] * g[1] + t1);
        return t1 * t1 + t2 * t2;
    }
};

double[] x = {-1.2, 1.0};             // starting point
double fmin = BFGS.minimize(rosenbrock, x, 1e-5, 500);

System.out.printf("Minimum: f = %.2e at (%.6f, %.6f)%n", fmin, x[0], x[1]);
// Minimum: f = 3.45e-10 at (1.000000, 1.000000)

Example 2 — L-BFGS for Large-Scale Problems

java
int n = 10_000;
// Quadratic bowl: f(x) = Σᵢ i·xᵢ², i = 1..n; minimum is 0 at the origin
DifferentiableMultivariateFunction bowl = new DifferentiableMultivariateFunction() {
    @Override
    public double f(double[] x) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (i + 1.0) * x[i] * x[i];
        return sum;
    }

    @Override
    public double g(double[] x, double[] g) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            g[i] = 2.0 * (i + 1.0) * x[i];
            sum += (i + 1.0) * x[i] * x[i];
        }
        return sum;
    }
};

double[] x = new double[n];
java.util.Arrays.fill(x, 1.0);        // start at all-ones

// m = 5 history pairs — dense BFGS would need a 10000×10000 matrix
double fmin = BFGS.minimize(bowl, 5, x, 1e-8, 1000);
System.out.printf("Minimum: %.6e%n", fmin);

Example 3 — L-BFGS-B with Box Constraints

Find the minimum of f(x₀, x₁) = (x₀ − 3)² + (x₁ − 4)² + 1 subject to 3.5 ≤ x₀ ≤ 5 and 3.5 ≤ x₁ ≤ 5. The unconstrained minimum (3, 4) is outside the box, so the constrained minimum is at the boundary (3.5, 4).

java
DifferentiableMultivariateFunction func = new DifferentiableMultivariateFunction() {
    @Override
    public double f(double[] x) {
        double d0 = x[0] - 3, d1 = x[1] - 4;
        return d0 * d0 + d1 * d1 + 1;
    }

    @Override
    public double g(double[] x, double[] g) {
        double d0 = x[0] - 3, d1 = x[1] - 4;
        g[0] = 2 * d0;
        g[1] = 2 * d1;
        return d0 * d0 + d1 * d1 + 1;
    }
};

double[] x = {0.0, 0.0};              // starting point (outside bounds is fine)
double[] l = {3.5, 3.5};
double[] u = {5.0, 5.0};

double fmin = BFGS.minimize(func, 5, x, l, u, 1e-8, 500);
System.out.printf("x = (%.4f, %.4f), f = %.4f%n", x[0], x[1], fmin);
// x = (3.5000, 4.0000), f = 1.2500

Choosing the Right Variant

Problem size         Constraints      Recommended variant
──────────────────────────────────────────────────────────
n ≤ 1 000            none             BFGS
n > 1 000            none             L-BFGS (m = 5)
any n                box bounds       L-BFGS-B (m = 5)

General guidelines:

  • For smooth, well-conditioned problems BFGS usually converges in O(n) iterations; L-BFGS may need slightly more but uses far less memory.
  • Prefer L-BFGS whenever n > 1000 to avoid the O(n²) Hessian storage.
  • Use L-BFGS-B any time variables have natural bounds (e.g. probabilities in [0, 1], non-negative weights, angles in [0, 2π]).
  • All three variants require the gradient to be analytically supplied. If only a function value is available, consider finite-difference approximation or automatic differentiation to construct the DifferentiableMultivariateFunction (a finite-difference wrapper is sketched below).
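
The following sketch shows one way to build such a wrapper with central differences. NumericGradient and its step size h are hypothetical, not part of SMILE:

java
import java.util.function.ToDoubleFunction;
import smile.util.function.DifferentiableMultivariateFunction;

// Wraps a value-only objective in a DifferentiableMultivariateFunction
// by approximating the gradient with central differences.
public class NumericGradient implements DifferentiableMultivariateFunction {
    private final ToDoubleFunction<double[]> func;  // value-only objective
    private static final double h = 1e-6;           // finite-difference step

    public NumericGradient(ToDoubleFunction<double[]> func) {
        this.func = func;
    }

    @Override
    public double f(double[] x) {
        return func.applyAsDouble(x);
    }

    @Override
    public double g(double[] x, double[] g) {
        for (int i = 0; i < x.length; i++) {
            double xi = x[i];
            x[i] = xi + h;
            double fp = func.applyAsDouble(x);
            x[i] = xi - h;
            double fm = func.applyAsDouble(x);
            x[i] = xi;                               // restore the coordinate
            g[i] = (fp - fm) / (2.0 * h);            // central difference
        }
        return func.applyAsDouble(x);
    }
}

Note that each gradient call then costs 2n + 1 function evaluations, so analytic gradients remain strongly preferable for large n.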

Convergence Criteria

All variants use the same two stopping tests (a code sketch follows the list):

  1. Step-length test — converged when the largest relative step component falls below TOLX = 4ε:

    max_i |Δxᵢ| / max(|xᵢ|, 1) < TOLX
    
  2. Gradient test — converged when the scaled infinity-norm of the gradient falls below gtol:

    max_i |∂f/∂xᵢ| · max(|xᵢ|, 1) / max(f, 1) < gtol
    

    L-BFGS-B uses the projected gradient norm instead, which respects the box constraints:

    max_i |proj_gradient_i| < gtol
    

    where proj_gradient_i = clip(xᵢ − gᵢ, lᵢ, uᵢ) − xᵢ.
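
A minimal sketch of the two unconstrained tests (not SMILE's code), assuming xPrev and x are successive iterates, g the gradient at x, and fx the current function value:

java
static boolean converged(double[] xPrev, double[] x, double[] g,
                         double fx, double gtol) {
    final double TOLX = 4e-8;              // 4ε with the default ε = 1e-8
    double step = 0.0, grad = 0.0;
    for (int i = 0; i < x.length; i++) {
        double scale = Math.max(Math.abs(x[i]), 1.0);
        step = Math.max(step, Math.abs(x[i] - xPrev[i]) / scale);
        grad = Math.max(grad, Math.abs(g[i]) * scale / Math.max(fx, 1.0));
    }
    return step < TOLX || grad < gtol;     // either test suffices
}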


All three variants share the same Wolfe-condition backtracking line search. Starting from a full step α = 1, the step is reduced using:

  • Quadratic interpolation on the first backtrack, then
  • Cubic interpolation on subsequent backtracks,

with each backtracked step bounded below at 0.1 × the previous trial step so it cannot shrink too fast, and the step bounded above at stpmax = 100 × max(‖x‖, n) to prevent excessively large jumps into undefined regions.

The sufficient decrease condition (Armijo) is checked with ftol = 1e-4.
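
The sufficient decrease test itself is a one-liner; a minimal sketch (not SMILE's code), where slope = ∇f(x)ᵀd is the directional derivative at the start of the search:

java
// Armijo condition: accept step α if f(x + αd) ≤ f(x) + ftol·α·∇f(x)ᵀd.
static boolean sufficientDecrease(double fNew, double f0,
                                  double slope, double alpha) {
    final double FTOL = 1e-4;
    return fNew <= f0 + FTOL * alpha * slope;   // slope < 0 for a descent direction
}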


References

  1. R. Fletcher, Practical Methods of Optimization, 2nd ed., Wiley, 1987.
  2. D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming B, 45 (1989), 503–528.
  3. R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, "A limited memory algorithm for bound constrained optimization," SIAM Journal on Scientific Computing, 16(5) (1995), 1190–1208.

SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.