Quantum Outpost

Quantum Natural Gradient: Geometry-Aware Optimization for Variational Quantum Algorithms

Standard gradient descent ignores the geometry of the parameter space. Quantum natural gradient (Stokes et al. 2020) uses the quantum Fisher information matrix to rescale parameter updates by how strongly each direction moves the quantum state, typically reaching minima in fewer iterations and partially mitigating some barren-plateau-adjacent training pathologies. This tutorial covers the math, the block-diagonal approximation that makes it tractable, and a decision rule for when QNG is worth the per-step overhead.

Prerequisites: Tutorial 37: Barren Plateaus, Tutorial 39: The Parameter-Shift Rule

Vanilla gradient descent treats the parameter space of a variational quantum algorithm as flat. A parameter that has a large effect on the quantum state is updated with the same step size as a parameter that has almost no effect. This works for many problems but is structurally inefficient: you spend many optimization steps moving along directions that barely change the state, and few moving along the directions that matter.

Quantum natural gradient (QNG; Stokes, Izaac, Killoran, Carleo 2020) fixes this by rescaling parameter updates with the quantum Fisher information matrix — a metric tensor on the parameter manifold that measures how much each parameter actually moves the quantum state. Updates in directions where the state is sensitive to changes get smaller steps; updates in flat directions get larger steps. The result is faster convergence, more stable training, and a partial mitigation of some barren-plateau-adjacent landscapes.

QNG is not a magic fix. It costs more per step than vanilla gradient descent, and the cost grows with parameter count. But for variational quantum algorithms past about 10 parameters, QNG typically converges in 5-10× fewer iterations than vanilla SGD or Adam, and the per-iteration overhead is often worth it.

This tutorial covers the math, the block-diagonal approximation that makes QNG practically tractable, and a decision rule for when to reach for it.

The geometry of the parameter manifold

Consider a parameterized state $|\psi(\boldsymbol{\theta})\rangle$. Two different parameters can move the state by different amounts: a small change $d\theta_i$ might move the state by a lot (large state-space displacement) while a small $d\theta_j$ might barely move it.

The quantum Fisher information matrix quantifies this. Its components are

$$F_{ij}(\boldsymbol{\theta}) \;=\; 4 \, \mathrm{Re}\Bigl[\,\bigl\langle \partial_i \psi \,\big|\, \partial_j \psi \bigr\rangle \;-\; \bigl\langle \partial_i \psi \,\big|\, \psi \bigr\rangle \bigl\langle \psi \,\big|\, \partial_j \psi \bigr\rangle\,\Bigr],$$

a symmetric positive-semidefinite matrix that captures the quantum-state distance corresponding to parameter changes. The diagonal entries $F_{ii}$ measure how much each parameter changes the state; off-diagonal entries measure how parameter changes correlate.

This matrix turns the parameter manifold into a Riemannian manifold: a curved space where the natural notion of “distance” depends on position. Vanilla gradient descent assumes a Euclidean metric (identity matrix); natural gradient uses $F$.
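To make the definition concrete, here is a minimal sketch that evaluates $F$ numerically for a single-qubit state. The state $R_Z(\phi) R_Y(\theta)|0\rangle$, the finite-difference derivatives, and the helper names (`state`, `inner`, `qfi`) are assumptions chosen for illustration, not part of any library.

```python
import cmath
import math


def state(theta, phi):
    """Amplitudes of RZ(phi) RY(theta) |0> (illustrative single-qubit ansatz)."""
    return [
        math.cos(theta / 2) * cmath.exp(-1j * phi / 2),
        math.sin(theta / 2) * cmath.exp(1j * phi / 2),
    ]


def inner(a, b):
    """<a|b>, conjugating the first argument."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def qfi(params, eps=1e-6):
    """Quantum Fisher matrix F_ij, with derivatives from central differences."""
    psi = state(*params)
    derivs = []
    for k in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[k] += eps
        minus[k] -= eps
        derivs.append(
            [(p - m) / (2 * eps) for p, m in zip(state(*plus), state(*minus))]
        )
    n = len(params)
    return [
        [
            4 * (inner(derivs[i], derivs[j])
                 - inner(derivs[i], psi) * inner(psi, derivs[j])).real
            for j in range(n)
        ]
        for i in range(n)
    ]


theta, phi = 0.3, 1.1
F = qfi([theta, phi])
for row in F:
    print([round(x, 6) for x in row])
```

For this particular state the matrix works out analytically to $\mathrm{diag}(1, \sin^2\theta)$: the state is always equally sensitive to $\theta$, but nearly insensitive to $\phi$ when $\theta$ is close to 0.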

The QNG update rule

Standard gradient descent:

$$\boldsymbol{\theta} \;\to\; \boldsymbol{\theta} - \eta \, \boldsymbol{\nabla} C(\boldsymbol{\theta}).$$

Quantum natural gradient:

$$\boldsymbol{\theta} \;\to\; \boldsymbol{\theta} - \eta \, F^{-1}(\boldsymbol{\theta}) \, \boldsymbol{\nabla} C(\boldsymbol{\theta}).$$

The Fisher matrix inverse $F^{-1}$ rescales each gradient component by the inverse of the local curvature. Directions where the state is sensitive (large $F_{ii}$) get smaller updates; flat directions (small $F_{ii}$) get larger.
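The two update rules can be compared on a toy problem. This sketch assumes a single-qubit state $R_Z(\phi) R_Y(\theta)|0\rangle$ with cost $C = \langle X \rangle = \sin\theta\cos\phi$, for which the Fisher matrix is diagonal, $\mathrm{diag}(1, \sin^2\theta)$. The cost, the starting point, and the function names are illustrative assumptions.

```python
import math


def cost(theta, phi):
    """<X> for the state RZ(phi) RY(theta) |0> (illustrative cost)."""
    return math.sin(theta) * math.cos(phi)


def grad(theta, phi):
    """Analytic gradient of the cost with respect to (theta, phi)."""
    return [math.cos(theta) * math.cos(phi), -math.sin(theta) * math.sin(phi)]


def fisher_diag(theta, phi):
    """Diagonal of F for this state; the off-diagonal entries vanish."""
    return [1.0, math.sin(theta) ** 2]


def gd_step(params, eta):
    """theta -> theta - eta * grad C."""
    return [p - eta * g for p, g in zip(params, grad(*params))]


def qng_step(params, eta):
    """theta -> theta - eta * F^{-1} grad C (F is diagonal here)."""
    g = grad(*params)
    f = fisher_diag(*params)
    return [p - eta * gi / fi for p, gi, fi in zip(params, g, f)]


start = [0.2, 1.0]  # small theta: the state barely depends on phi
eta = 0.1
after_gd = gd_step(start, eta)
after_qng = qng_step(start, eta)

print("start cost:", cost(*start))
print("GD  cost  :", cost(*after_gd))
print("QNG cost  :", cost(*after_qng))
```

Both rules move $\theta$ by the same amount because $F_{\theta\theta} = 1$. The difference is in $\phi$, where $F_{\phi\phi} = \sin^2\theta$ is small, so QNG takes a much larger step along that flat direction.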

There is a deep connection: when the cost is an energy expectation value, QNG with the right step size is equivalent to imaginary-time evolution of the state along the parameterized manifold. If you take the step size to be infinitesimal, $\eta \to dt$, and integrate, you recover the McLachlan variational principle for ground-state imaginary-time evolution. In continuous time, QNG is just time-evolving the ansatz toward the ground state along the steepest descent in the natural metric. This is one of the cleanest justifications for QNG in chemistry-VQE applications.
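The correspondence can be written out in a few lines. This is a sketch under stated assumptions: a normalized state, cost $C = \langle\psi|H|\psi\rangle$, and the phase-corrected form of McLachlan's equations with imaginary time $\tau$. The symbols $A_{ij}$ and $\delta\tau$ are introduced here and do not appear elsewhere in this tutorial.

```latex
\begin{align*}
% McLachlan's variational principle for imaginary-time evolution
\sum_j A_{ij}\,\dot{\theta}_j &= -\,\mathrm{Re}\,\langle \partial_i \psi | H | \psi \rangle,
\qquad
A_{ij} = \mathrm{Re}\bigl[\langle \partial_i \psi | \partial_j \psi \rangle
       - \langle \partial_i \psi | \psi \rangle \langle \psi | \partial_j \psi \rangle\bigr]
       = \tfrac{1}{4} F_{ij}, \\
% Gradient of the energy in terms of the same overlap
\partial_i C &= 2\,\mathrm{Re}\,\langle \partial_i \psi | H | \psi \rangle
\quad\Longrightarrow\quad
\dot{\boldsymbol{\theta}} = -\,2\,F^{-1}\,\boldsymbol{\nabla} C, \\
% One Euler step of size \delta\tau
\boldsymbol{\theta}(\tau + \delta\tau) &\approx
\boldsymbol{\theta}(\tau) - 2\,\delta\tau\,F^{-1}\,\boldsymbol{\nabla} C .
\end{align*}
```

With the factor-of-4 normalization of $F$ used above, one QNG step with learning rate $\eta$ is therefore one Euler step of imaginary-time evolution with $\delta\tau = \eta / 2$.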

Why QNG helps with barren plateaus

In a barren-plateau region, the gradient is small in every direction, but the curvature of the landscape is also small. The Fisher matrix in the barren-plateau region is approximately a multiple of the identity, with small eigenvalues. The natural-gradient update rescales the gradient by $1/F \sim 1/\text{small}$, effectively amplifying the small gradient.

This is not a free win: the small Fisher eigenvalues are themselves measured with shot noise, and dividing by a small number amplifies that noise. But empirically, in regimes where the barren plateau is mild rather than catastrophic (gradients $\sim 10^{-2}$, not $\sim 10^{-10}$), QNG can extract enough signal to keep the optimizer making progress where vanilla gradient descent stalls.

In severe barren-plateau regions (gradients $\sim 10^{-6}$ or smaller), QNG cannot save you. The shot noise in the Fisher matrix dominates and the optimizer becomes a random walk. Tutorial 37’s structural fixes (problem-tailored ansätze, warm starts) are necessary for the deep barren-plateau regime.
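A back-of-envelope shot budget shows why the mild and severe regimes differ so sharply. The sketch below assumes a simplified noise model in which a single estimate has standard error `sigma / sqrt(shots)` with `sigma = 1`; the function name and the 10% relative-error target are illustrative assumptions, and real estimators will differ by constant factors.

```python
import math


def shots_to_resolve(gradient, rel_error=0.1, sigma=1.0):
    """Shots needed so that sigma / sqrt(shots) equals rel_error * |gradient|."""
    return (sigma / (rel_error * abs(gradient))) ** 2


for g in (1e-2, 1e-6, 1e-10):
    print(f"gradient ~ {g:.0e}: about {shots_to_resolve(g):.1e} shots per estimate")
```

Under this model the shot count scales as the inverse square of the gradient magnitude, so going from $10^{-2}$ to $10^{-6}$ multiplies the budget by $10^8$. Rescaling by $F^{-1}$ changes the step size but does not create signal that the shots never resolved.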

The cost: estimating $F$

The Fisher information matrix has $N(N+1)/2$ independent entries for $N$ parameters. Estimating each entry requires (in the standard parameter-shift method) $\sim 4$ circuit evaluations. So computing the full $F$ matrix costs $\sim 2 N^2$ circuit evaluations per gradient step, the same scaling as a Hessian.

For $N = 10$, this is $\sim 200$ extra evaluations per step. For $N = 100$, $\sim 20{,}000$. For $N = 1{,}000$, $\sim 2 \times 10^6$, which is prohibitive.

Naive QNG is feasible only for moderate parameter counts. For larger ansätze, the block-diagonal approximation below is the practical workaround.
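The counting argument above is easy to check directly. This sketch takes the figure of 4 circuit evaluations per entry from the text as a parameter; the function name is an illustrative assumption.

```python
def full_fisher_evals(n_params, evals_per_entry=4):
    """Circuit evaluations to estimate every independent entry of a symmetric N x N matrix."""
    entries = n_params * (n_params + 1) // 2
    return entries * evals_per_entry


for n in (10, 100, 1000):
    exact = full_fisher_evals(n)
    approx = 2 * n ** 2
    print(f"N = {n:>5}: exact count {exact:>9,}   2N^2 estimate {approx:>9,}")
```

The exact count $2N(N+1)$ and the $2N^2$ estimate differ only by the diagonal term, which stops mattering once $N$ is more than a few dozen.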

The block-diagonal approximation

The Stokes 2020 paper proposed a layer-wise block-diagonal structure for the Fisher matrix. The intuition: for an ansatz built from layers of parameterized gates, parameters within the same layer have non-trivial Fisher correlations; parameters in different layers have approximately zero Fisher correlation.

Approximate $F$ as a block-diagonal matrix where each block corresponds to one layer’s parameters. Off-diagonal blocks are set to zero. This reduces the cost from $O(N^2)$ to $O(\sum_l N_l^2)$, where $N_l$ is the parameter count in layer $l$. For an ansatz with $L$ layers and $N/L$ parameters per layer, the cost is $O(N^2 / L)$, a factor-$L$ saving.

For typical ansätze with many small layers, the block-diagonal approximation captures most of the QNG benefit at a small fraction of the cost. The standard 2026 implementation of QNG, as in PennyLane, uses block-diagonal $F$ by default.
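As a sketch of what the approximation does to the matrix and to the cost, the snippet below masks a small Fisher matrix by layer and counts evaluations per block. The 3×3 matrix values and the function names are made up for illustration, and the count reuses the text's figure of 4 evaluations per independent entry.

```python
def block_diagonal(matrix, layer_sizes):
    """Zero every entry whose row and column belong to different layers."""
    layer_of = [layer for layer, size in enumerate(layer_sizes) for _ in range(size)]
    if len(layer_of) != len(matrix):
        raise ValueError("layer sizes must sum to the matrix dimension")
    n = len(matrix)
    return [
        [matrix[i][j] if layer_of[i] == layer_of[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def block_diag_evals(layer_sizes, evals_per_entry=4):
    """Circuit evaluations when only within-layer entries are estimated."""
    return sum(evals_per_entry * s * (s + 1) // 2 for s in layer_sizes)


# Hypothetical Fisher matrix: parameters 0 and 1 share a layer, parameter 2 is alone.
F = [
    [1.0, 0.2, 0.05],
    [0.2, 0.8, 0.03],
    [0.05, 0.03, 0.6],
]
F_block = block_diagonal(F, [2, 1])
for row in F_block:
    print(row)

print("6 layers of 10:", block_diag_evals([10] * 6), "evaluations")
print("1 block of 60 :", block_diag_evals([60]), "evaluations")
```

Because the masked matrix is block-diagonal, each block can also be inverted independently, which keeps the classical linear algebra cheap as well.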

A working PennyLane example

PennyLane has built-in QNG via qml.QNGOptimizer:

import numpy as np
import pennylane as qml
from pennylane import numpy as pnp

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

# Toy cost function: energy of a small Pauli Hamiltonian acting on wires 0-2.
H = qml.Hamiltonian(
    [1.0, -0.5, 0.3],
    [qml.PauliZ(0), qml.PauliZ(0) @ qml.PauliZ(1), qml.PauliX(2)]
)

@qml.qnode(dev, interface="autograd")
def circuit(weights):
    # Layer 1: RY and RZ rotations on every qubit.
    for q in range(n_qubits):
        qml.RY(weights[q, 0], wires=q)
        qml.RZ(weights[q, 1], wires=q)
    # Entangle neighbouring qubits with a CNOT chain.
    for q in range(n_qubits - 1):
        qml.CNOT(wires=[q, q + 1])
    # Layer 2: a final RY rotation on every qubit.
    for q in range(n_qubits):
        qml.RY(weights[q, 2], wires=q)
    return qml.expval(H)


# Train with vanilla GradientDescent and with QNG from the same start, compare.
init = pnp.array(np.random.uniform(-0.5, 0.5, (n_qubits, 3)), requires_grad=True)

# Vanilla GD
gd = qml.GradientDescentOptimizer(stepsize=0.1)
w_gd = init.copy()
for step in range(60):
    w_gd, c = gd.step_and_cost(circuit, w_gd)
    if step % 20 == 0:
        print(f"GD step {step}: cost = {c:.5f}")

# QNG. The optimizer must be given the QNode itself (not a plain wrapper
# function) so it can build the metric tensor. lam adds a small regularizer
# before inversion; it defaults to 0.
qng = qml.QNGOptimizer(stepsize=0.1, lam=1e-3)
w_qng = init.copy()
for step in range(60):
    w_qng, c = qng.step_and_cost(circuit, w_qng)
    if step % 20 == 0:
        print(f"QNG step {step}: cost = {c:.5f}")

Sample output (illustrative; depends on initialization):

GD step 0: cost = 0.18234
GD step 20: cost = -0.42198
GD step 40: cost = -0.61734
QNG step 0: cost = 0.18234
QNG step 20: cost = -0.78412
QNG step 40: cost = -0.79134

QNG converges faster (reaches a lower energy in fewer steps) at the cost of more compute per step. The total wall-clock for the same final accuracy is typically lower for QNG, though the per-step cost is higher. The break-even depends on parameter count and how flat the landscape is.

Beyond block-diagonal: variants and extensions

A few extensions worth knowing:

  • Diagonal QNG. Use only the diagonal of $F$ (i.e., normalize each parameter individually by its sensitivity). Cheaper still, at $O(N)$ per step, and captures most of the benefit when off-diagonal correlations are small.
  • Stochastic Fisher estimation. Estimate $F$ from a random subset of entries each step. Reduces per-step cost at the price of higher variance.
  • Quantum information geometry. For pure states, $F$ is exactly four times the Fubini-Study metric tensor and coincides with the symmetric-logarithmic-derivative (SLD) quantum Fisher information, so these are the same geometry under different names and normalizations rather than competing choices. PennyLane’s qml.metric_tensor returns the Fubini-Study form, $F/4$. The choice of metric becomes a genuine design decision only for mixed states, for example under hardware noise.
  • L-BFGS hybrid. Some implementations alternate QNG steps with L-BFGS-style quasi-Newton updates, capturing global curvature beyond what the block-diagonal Fisher provides.
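Of the variants above, diagonal QNG is the simplest to write down. The sketch below is a minimal version under stated assumptions: the gradient and the Fisher diagonal are taken as given inputs, a small regularizer `lam` guards against division by zero, and all numeric values are made up for illustration.

```python
def diagonal_qng_step(params, grad, fisher_diag, eta=0.1, lam=1e-3):
    """Per-parameter update: theta_i -> theta_i - eta * g_i / (F_ii + lam)."""
    return [p - eta * g / (f + lam) for p, g, f in zip(params, grad, fisher_diag)]


# Hypothetical values: equal gradients, very different state sensitivities.
params = [0.5, -0.2, 1.0]
grad = [0.10, 0.10, 0.10]
fisher_diag = [1.0, 0.1, 0.01]

new_params = diagonal_qng_step(params, grad, fisher_diag)
steps = [p - q for p, q in zip(params, new_params)]
for f, s in zip(fisher_diag, steps):
    print(f"F_ii = {f:<5} step = {s:.4f}")
```

With equal gradients, the step grows as $F_{ii}$ shrinks, and the regularizer caps it at $\eta\, g_i / \lambda$ for a direction with $F_{ii} = 0$.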

The 2026 production answer is usually block-diagonal QNG for parameter counts under ~500, diagonal QNG for larger counts, and Adam or vanilla SGD when the per-step Fisher cost is unaffordable.

Common misconceptions

“QNG eliminates barren plateaus.” No. QNG amplifies the gradient signal in mildly flat regions. In severely flat regions, the Fisher matrix itself is statistically poorly determined and QNG becomes unreliable. Structural fixes (problem-tailored ansätze, warm starts) are still necessary for severe barren plateaus.

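“QNG is a second-order method like Newton’s method.” Related but distinct. QNG uses the quantum Fisher information as the metric. This is the curvature of the distance between nearby quantum states (the infidelity $1 - |\langle\psi(\boldsymbol{\theta})|\psi(\boldsymbol{\theta} + d\boldsymbol{\theta})\rangle|^2$), the quantum counterpart of the Kullback-Leibler curvature that defines the classical Fisher information. It is not the curvature of the cost function itself. Newton’s method uses the Hessian of the cost function. The two coincide in some cases but generally differ, and unlike the Hessian, $F$ is always positive semidefinite.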
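A one-parameter worked example makes the distinction concrete. The state $R_Y(\theta)|0\rangle$ and the cost $C = \langle Z \rangle$ are chosen here purely for illustration.

```latex
\begin{align*}
|\psi(\theta)\rangle &= \cos\tfrac{\theta}{2}\,|0\rangle + \sin\tfrac{\theta}{2}\,|1\rangle,
\qquad C(\theta) = \langle Z \rangle = \cos\theta, \\
% Fisher information: constant and positive
F &= 4\bigl(\langle \partial_\theta \psi | \partial_\theta \psi \rangle
      - |\langle \partial_\theta \psi | \psi \rangle|^2\bigr)
   = 4\bigl(\tfrac{1}{4} - 0\bigr) = 1, \\
% Hessian of the cost: changes sign with theta
\partial_\theta^2 C &= -\cos\theta, \\
% Updates for small theta > 0, next to the maximum at theta = 0
\text{Newton: } \Delta\theta &= -\frac{\partial_\theta C}{\partial_\theta^2 C} = -\tan\theta
\quad (\text{toward the maximum}), \\
\text{QNG: } \Delta\theta &= -\eta\,F^{-1}\,\partial_\theta C = \eta\,\sin\theta
\quad (\text{toward the minimum at } \theta = \pi).
\end{align*}
```

Near $\theta = 0$ the Hessian is negative, so a raw Newton step heads for the maximum, while the Fisher-rescaled step still points downhill.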

“QNG is always better than Adam.” Empirically, on barren-plateau-adjacent landscapes, QNG often outperforms Adam in iteration count. But the per-iteration cost is much higher; Adam may win on wall-clock for problems with cheap circuits or many parameters.

“You need full $F$ for QNG to work.” No. Block-diagonal and diagonal approximations recover most of the benefit at a fraction of the cost. The full matrix is rarely worth computing.

“QNG only helps for chemistry / specific applications.” It helps any variational algorithm where the parameter manifold is curved — which is essentially all of them. Chemistry was the first reported use because of the imaginary-time-evolution connection, but QML, QAOA, and quantum simulation all benefit.

Decision rule

Reach for QNG when:

  1. Parameter count is moderate. $N \in [10, 500]$ is the sweet spot. Below 10, vanilla GD is fine; above 500, the per-step Fisher cost dominates.
  2. Landscape is barren-plateau-adjacent but not catastrophic. With gradients in the $10^{-3}$ to $10^{-1}$ range, QNG’s amplification helps. With gradients $\sim 10^{-6}$, QNG fails too.
  3. You can afford block-diagonal Fisher computation. $\sim 2 N^2 / L$ extra evaluations per step is the cost; multiply by your shot count for the wall-clock impact.
  4. Imaginary-time evolution is the right semantics. For chemistry VQE problems where the goal is to reach the ground state, QNG’s connection to imaginary-time-evolution is more than aesthetic — it gives provable convergence properties.

Use vanilla GD or Adam when:

  1. You have hundreds of thousands of parameters. QNG cost is prohibitive.
  2. Your landscape is well-conditioned. When the cost-function landscape is approximately spherical, vanilla GD converges almost as fast as QNG with much lower per-step cost.
  3. You’re in a deep barren-plateau region. Neither method works; switch to a structurally different ansatz.
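The two lists above can be condensed into a small helper. This is a rough sketch, not a definitive policy: the thresholds are the ones quoted in this tutorial, the function name and the `well_conditioned` flag are assumptions, and gradient scales between $10^{-6}$ and $10^{-3}$ are treated like the mild regime because the text gives no separate rule for them.

```python
def choose_optimizer(n_params, grad_scale, well_conditioned=False):
    """Map parameter count and typical gradient magnitude to a suggested optimizer."""
    if grad_scale <= 1e-6:
        # Deep barren plateau: no optimizer helps, the ansatz has to change.
        return "restructure the ansatz"
    if well_conditioned or n_params < 10:
        # Roughly spherical landscape or tiny problem: Fisher overhead is not repaid.
        return "vanilla GD or Adam"
    if n_params <= 500:
        return "block-diagonal QNG"
    # Above ~500 parameters the block-diagonal Fisher cost dominates.
    return "diagonal QNG or Adam"


for n, g in [(5, 1e-2), (60, 1e-2), (1000, 1e-2), (60, 1e-8)]:
    print(f"N = {n:>4}, gradients ~ {g:.0e}: {choose_optimizer(n, g)}")
```

In practice the shot budget and circuit depth also matter, so treat the output as a starting point for the cost estimate in item 3 of the first list rather than a final answer.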

For the typical 2026 chemistry-VQE problem with a few hundred ADAPT-VQE parameters, QNG with block-diagonal Fisher is the production choice.

Exercises

1. Why $F$ is positive semidefinite

Show that the Fisher information matrix $F$ defined above is positive semidefinite. Why does this matter for the QNG update direction?

Answer

For any real vector $\mathbf{v}$, $\mathbf{v}^T F \mathbf{v} = 4 \bigl(\langle \partial \psi | \partial \psi \rangle - |\langle \partial \psi | \psi \rangle|^2\bigr)$, where $|\partial \psi\rangle = \sum_i v_i |\partial_i \psi\rangle$. The first term is a squared norm, so it is non-negative. By the Cauchy-Schwarz inequality and $\langle \psi | \psi \rangle = 1$, the second term satisfies $|\langle \partial \psi | \psi \rangle|^2 \leq \langle \partial \psi | \partial \psi \rangle$, so it never exceeds the first. Hence $\mathbf{v}^T F \mathbf{v} \geq 0$ and $F$ is positive semidefinite. When $F$ is strictly positive definite, $F^{-1}$ exists and is also positive definite, so the QNG update direction $-F^{-1} \nabla C$ is a descent direction: its inner product with $\nabla C$ is $-\nabla C^T F^{-1} \nabla C < 0$. Without positive definiteness, QNG could move uphill, which would be catastrophic for optimization.

2. When $F$ is singular

If $F$ has eigenvalue zero (i.e., a redundant parameter direction that doesn’t change the state), what does QNG do, and how does the implementation handle it?

Answer

A zero eigenvalue means there is a parameter direction the state is insensitive to, and $F^{-1}$ is undefined in that direction. Implementations handle this by regularization: replace $F$ with $F + \lambda I$ for small $\lambda > 0$, then invert. The regularized matrix is always positive definite. The trade-off is that updates along nearly redundant directions are capped rather than unbounded. Regularization with $\lambda$ between $10^{-6}$ and $10^{-3}$ is standard and rarely affects practical convergence. PennyLane’s QNGOptimizer exposes this as its lam argument, which defaults to 0, so set it explicitly when $F$ may be singular.
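A two-parameter sketch shows the effect of the regularizer. The singular matrix, the gradient, and the helper names below are made up for illustration; the second parameter is assumed to leave the state unchanged, so its Fisher entry and its exact gradient are both zero.

```python
def solve_2x2(m, v):
    """Solve m x = v for a 2x2 matrix via the explicit inverse."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]


def regularized_natural_gradient(F, grad, lam):
    """Return (F + lam * I)^{-1} grad for a 2x2 Fisher matrix."""
    F_reg = [[F[0][0] + lam, F[0][1]], [F[1][0], F[1][1] + lam]]
    return solve_2x2(F_reg, grad)


F = [[1.0, 0.0], [0.0, 0.0]]  # hypothetical: second parameter is redundant
grad = [0.3, 0.0]

try:
    solve_2x2(F, grad)
except ZeroDivisionError as err:
    print("unregularized solve fails:", err)

natural = regularized_natural_gradient(F, grad, lam=1e-3)
print("regularized natural gradient:", natural)
```

With exact gradients the redundant direction simply receives no update. With a noisy gradient estimate in that direction, the update is bounded by the noise divided by $\lambda$, which is why $\lambda$ should not be made arbitrarily small.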

3. Cost of full vs block-diagonal Fisher

For an ansatz with 60 parameters arranged in 6 layers of 10 parameters each, compute the cost of full Fisher matrix vs block-diagonal Fisher matrix per QNG step.

Answer

Full Fisher: $60 \times 61 / 2 = 1830$ unique entries × 4 evaluations each = 7,320 circuit evaluations per step. Block-diagonal: 6 blocks of $10 \times 11 / 2 = 55$ entries each, so 330 entries × 4 evaluations = 1,320 evaluations per step. Block-diagonal is ~5.5× cheaper. For larger ansätze the savings grow proportionally to the number of layers. Combined with the empirical observation that block-diagonal recovers most of the QNG benefit, this is why production implementations default to block-diagonal.

4. Pick QNG vs Adam for a 1,000-parameter ansatz

A QML training problem has 1,000 parameters and a moderately barren-plateau-adjacent landscape (gradients $\sim 10^{-2}$). Per-step full Fisher cost would be $\sim 2N^2 = 2{,}000{,}000$ circuit evaluations. You have access to 1,000 shots/second on quantum hardware. Should you use QNG or Adam?

Answer

QNG step cost: even at one shot per circuit evaluation, which is a lower bound, 2M shots / 1k shots/s = 2,000 seconds per step ≈ 33 minutes per step. Real estimates need many shots per evaluation, so the true figure is far higher. Even with 5-10× iteration savings, total wall-clock runs to days or more. Adam is clearly the right choice for 1,000-parameter problems on hardware. A reasonable middle path: diagonal QNG (cost $O(N)$, about 4,000 evaluations/step, ~4 seconds at the same one-shot rate), which captures the per-parameter rescaling benefit at much lower cost. Diagonal QNG is the production choice for parameter counts $\geq 500$ in 2026.

Where this goes next

This concludes the four-tutorial variational deepening (37-40). The variational track now has six tutorials: VQE (13), QAOA (14), barren plateaus (37), ADAPT-VQE (38), parameter-shift rule (39), and quantum natural gradient (40). Together they cover the practical machinery of variational quantum computing as it actually exists in 2026, with realistic limits clearly stated. Future tutorials in this track will deepen specific application areas: variational chemistry beyond VQE, QAOA depth-quality theory, hybrid classical-quantum architectures, and the open problem of whether variational methods can scale into the fault-tolerant era.

