
When Tiny Geometry & Noisy Hessians Stall Optimization — Causes and Quick Fixes

Tiny geometric effects, noisy curvature estimates, and unstable derivative dynamics often decide whether an iterative solver speeds up or stalls — even when everything “looks” fine. Understanding these second-order and stochastic effects explains puzzling slowdowns in optimizers and differentiation-through-optimizers, and points to simple fixes you can use in practice.

What the problem looks like in plain words

  • Plateaus that aren’t flat: an algorithm can take many tiny steps that barely reduce its error. From far away it looks stuck, but it is actually drifting slowly in certain directions determined by curvature.
  • Noisy curvature: when you can only query the objective (not gradients), your estimate of curvature (the Hessian) is noisy and biased unless you probe the function carefully.
  • Exploding derivatives: if you compute gradients by differentiating an iterative solver (“unrolling”), the derivative computation itself can blow up temporarily even though the solver’s iterates converge.

How these three behaviors are related

They all come from the same basic idea: linear (first-order) behavior can cancel or be too weak, so second-order terms, statistical noise, or the linearized dynamics of derivative iterations dominate. That makes convergence slow or unstable in subtle ways that simple monitoring (like checking whether the loss decreases) often misses.

Zooming in: ADMM on semidefinite programs (SDPs) — the “microscope” view

ADMM is a popular alternating solver for large SDPs. When the optimization problem has many equally valid primal–dual solutions, ADMM can enter long regions where the usual first-order update is effectively zero along some directions. Think of standing on a very shallow, curved ridge: your first steps cancel out, and only a tiny curvature (a second-order effect) makes you drift.

  • Closed cone of directions: near such a KKT point there is a cone of directions where the linear update vanishes. That is, first-order information gives you no push.
  • Second-order limit map: once you filter out initial transients, the algorithm’s tiny persistent drift is governed by a quadratic (second-order) map. That map tells you which directions you can move in and which directions you stay stuck in.
  • Practical signatures you can measure:
    • Consecutive step vectors point almost the same way (small angles), with occasional spikes when something changes.
    • Infeasibility measures (how much constraints are violated) stop responding to changes in the ADMM penalty parameter — adjusting the penalty barely helps.
    • Iterates can behave as if trapped in a small subspace (low-dimensional drift) for a long time.

Quick, practical actions for the ADMM / SDP case

  1. Detect the plateaus: compute angles between successive step vectors and do a PCA on recent step directions. If most variance sits in a small subspace, you’re in a slow-drift regime.
  2. Try small random perturbations or restarts to break symmetry / degeneracy.
  3. Use second-order-aware solvers or add a tiny regularizer (e.g., add εI) to break strict degeneracies. That can change the second-order map and speed escape.
  4. Be cautious about only tuning the ADMM penalty — it may rescale drift but not fix the underlying geometric cause.
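The detection step above can be sketched in a few lines. This is a minimal illustration with names of our own choosing, assuming you log the recent step vectors of the solver:

```python
import numpy as np

def plateau_signature(steps, subspace_dim=3):
    """Compute two plateau indicators from recent step vectors
    (rows of `steps`): the mean angle between consecutive steps,
    and the fraction of variance captured by the top
    `subspace_dim` principal directions.  Small angles plus a
    variance fraction near 1 suggest slow drift in a
    low-dimensional subspace."""
    steps = np.asarray(steps, dtype=float)
    # Angles between consecutive (normalized) step vectors.
    unit = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    cos = np.clip(np.sum(unit[:-1] * unit[1:], axis=1), -1.0, 1.0)
    mean_angle = float(np.mean(np.arccos(cos)))
    # PCA via SVD of the centered step matrix.
    centered = steps - steps.mean(axis=0)
    svals = np.linalg.svd(centered, compute_uv=False)
    var = svals ** 2
    var_fraction = float(var[:subspace_dim].sum() / var.sum())
    return mean_angle, var_fraction
```

Run it on a sliding window of the last 50 or so steps; a mean angle near zero together with a variance fraction near one is the low-dimensional slow-drift signature described above.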

Estimating Hessians from function values (RDSA) — poking the surface blindfolded

Sometimes you only have noisy function evaluations (no gradients). Random Direction Stochastic Approximation (RDSA) estimates gradients and Hessians by evaluating the function at perturbed points along random directions.

  • Basic idea: probe the function at points like x ± h u (u random). From these values you can estimate directional second derivatives u^T H u, then combine many directions to recover H.
  • More probes → less bias: estimators that use more function samples per iteration have lower-order bias, so they approximate the true Hessian more accurately for the same perturbation size. In the limit (with well-chosen step sizes) the estimators become asymptotically unbiased.
  • Trade-offs: more probes reduce bias/variance but cost more function evaluations. In high dimensions, full Hessian estimates are expensive — exploit low-rank structure, sketching, or block/diagonal approximations.
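The probing idea can be sketched directly. The function below is an RDSA-style illustration of our own (not a specific published estimator): each random direction gives one second difference approximating u^T H u, and the symmetric Hessian is recovered by least squares over many probes:

```python
import numpy as np
from itertools import combinations_with_replacement

def rdsa_hessian(f, x, num_dirs=100, h=1e-3, seed=0):
    """Estimate the Hessian of scalar function `f` at `x` from
    function values only.  Each Gaussian direction u yields the
    second difference (f(x+hu) - 2 f(x) + f(x-hu)) / h^2, which
    approximates u^T H u; the symmetric H is then recovered by
    least squares over all probes."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    f0 = f(x)
    pairs = list(combinations_with_replacement(range(n), 2))
    A = np.empty((num_dirs, len(pairs)))
    b = np.empty(num_dirs)
    for t in range(num_dirs):
        u = rng.standard_normal(n)  # Gaussian probe direction
        b[t] = (f(x + h * u) - 2.0 * f0 + f(x - h * u)) / h ** 2
        # u^T H u = sum_j H_jj u_j^2 + 2 * sum_{j<k} H_jk u_j u_k
        A[t] = [u[j] * u[k] * (1.0 if j == k else 2.0) for j, k in pairs]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    H = np.zeros((n, n))
    for (j, k), c in zip(pairs, coef):
        H[j, k] = H[k, j] = c
    return H
```

Each iteration costs two extra function evaluations per direction; this is exactly the probes-versus-accuracy trade-off noted above, and for large n you would estimate only a low-rank factor or block-diagonal part instead.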

How to use RDSA in practice

  1. If the noise is small and the dimension moderate, use symmetric probes per random direction (evaluate at x and at x ± h u) and average across many directions.
  2. If dimension is large, estimate Hessian-vector products or low-rank factors instead of the full Hessian.
  3. Feed these Hessian estimates into a stochastic Newton step with appropriate regularization (to keep updates stable). Use diminishing step-sizes and perturbation sizes to get asymptotically correct curvature.
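Step 3 can be as simple as a damped Newton update. The sketch below (function name ours) symmetrizes the Hessian estimate and adds a small multiple of the identity before solving, which is one common way to keep updates stable:

```python
import numpy as np

def damped_newton_step(x, grad_est, hess_est, damping=1e-2):
    """One stabilized Newton step with estimated gradient and
    Hessian: symmetrize the Hessian estimate, add damping * I to
    keep the linear system well conditioned even when the
    estimate is noisy or indefinite, then solve for the step."""
    H = 0.5 * (hess_est + hess_est.T) + damping * np.eye(x.size)
    return x - np.linalg.solve(H, grad_est)
```

In a full stochastic Newton scheme the damping and perturbation sizes would shrink over iterations, as noted above, so the curvature becomes asymptotically correct.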

Algorithm unrolling and the “curse of unrolling” — why differentiating the optimizer can misbehave

In hyperparameter tuning and meta-learning, people often compute gradients by differentiating each step of an inner iterative algorithm (unrolling). This is convenient but can cause the derivative computation to behave badly: it can blow up initially (diverge) even while the original solver converges. That’s the curse of unrolling.

  • Origin: the derivative iterates evolve according to a linearized dynamics determined by the inner solver’s Jacobians. Those linearized updates can be unstable transiently even when the solver itself is contracting overall.
  • Mitigation by truncation: stopping the derivative computation after a limited number of inner steps (truncated unrolling) reduces memory and often avoids the transient divergence. Mathematically, truncation removes the influence of unstable early iterations.
  • Warm-starting helps: initializing the inner solve from the previous outer iterate reduces differences between successive inner trajectories. That acts like implicit truncation — in practice it stabilizes derivative estimation and reduces the need for long unrolls.
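As a toy illustration of truncation (a 1-D example of our own, not from any particular paper): for the inner iteration x_{k+1} = x_k - α(x_k - θ), the derivative iterate obeys the linearized dynamics J_{k+1} = (1 - α) J_k + α, and truncated unrolling simply restarts J from zero a few steps before the end:

```python
def unrolled_derivative(theta, x0, alpha=0.3, num_steps=50, truncate=None):
    """Differentiate the inner iteration x_{k+1} = x_k - alpha*(x_k - theta)
    with respect to theta by unrolling.  The derivative iterate follows
    the linearized dynamics J_{k+1} = (1 - alpha) * J_k + alpha.  With
    truncate=m, only the last m steps propagate derivatives: J is reset
    to zero, so earlier iterations are treated as constants."""
    x, J = x0, 0.0
    for k in range(num_steps):
        if truncate is not None and k == num_steps - truncate:
            J = 0.0                      # drop influence of earlier iterations
        x = x - alpha * (x - theta)      # inner solver step
        J = (1.0 - alpha) * J + alpha    # derivative (Jacobian) iterate
    return x, J
```

Here the true sensitivity of the fixed point x* = θ is exactly 1, the full unroll reaches it geometrically, and a truncated unroll of length m already recovers 1 - (1 - α)^m of it, which is why modest truncation plus warm-starting is usually enough.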

Practical rules for differentiation-through-optimization

  1. Whenever possible, prefer implicit differentiation at a converged fixed point: it gives exact Jacobians without long unrolling, but it requires solving linear systems and access to the Jacobians of the inner update.
  2. If unrolling, pick an unroll length comparable to the inner solver’s mixing time; shorter unrolls plus warm-starting often beat long unrolls in memory and stability.
  3. Clip or damp derivative iterates and monitor the norm of the computed Jacobian; if it spikes initially, reduce unroll length or add damping.
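For contrast with unrolling, the implicit route in rule 1 needs only the Jacobians of the inner update at the converged point. A minimal dense sketch (function name ours; in practice one solves the linear system iteratively with Jacobian-vector products):

```python
import numpy as np

def implicit_diff(dT_dx, dT_dtheta):
    """Given the Jacobians of the inner update T at a converged
    fixed point x* = T(x*, theta), return dx*/dtheta by solving
    the linear system (I - dT/dx) dx*/dtheta = dT/dtheta."""
    n = dT_dx.shape[0]
    return np.linalg.solve(np.eye(n) - dT_dx, dT_dtheta)
```

For gradient descent on a quadratic, T(x, θ) = x - α(Ax - θ), this recovers dx*/dθ = A^{-1} exactly, with no unrolling at all.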

Cross-cutting, easy-to-scan checklist

  • Detect: measure KKT residuals, angles between successive steps, PCA on step directions, and norms of derivative iterates.
  • Fix cheaply: warm-start inner solves, truncate derivative unrolling, add small random perturbations, or introduce tiny regularizers.
  • Fix more robustly: use second-order-aware solvers (Newton or quasi-Newton), use better Hessian estimators with more probes or low-rank/sketching, or switch to implicit differentiation.
  • Tune wisely: don’t rely only on penalty parameters — understand whether you face a geometric degeneracy (needs structural change) or a simple scaling issue (penalty tuning helps).

Bottom line: many mysterious slowdowns or instabilities come from higher-order geometry and noise, not from simple bugs or step-size mistakes. Detect the pattern (small-angle drift, noisy Hessian estimates, unstable derivative iterates) and pick the matching remedy: perturb, restart, or regularize for geometry problems; increase the probe budget or use sketching for noisy Hessians; truncate or switch to implicit differentiation for unrolling issues.

© 2026 AI News Online