variational advanced · 16 min read · By LIPAI WANG · April 29, 2026

Quantum Natural Gradient: Geometry-Aware Optimization for Variational Quantum Algorithms

Standard gradient descent ignores the geometry of the parameter space. Quantum natural gradient (Stokes et al., 2020) uses the quantum Fisher information matrix to rescale parameter updates by the local curvature, reaching minima with fewer iterations and partially mitigating some barren-plateau-adjacent training pathologies. This tutorial covers the math, the block-diagonal approximation that makes it tractable, and a decision rule for when QNG is worth the per-step overhead.

Prerequisites: Tutorial 37: Barren Plateaus, Tutorial 39: The Parameter-Shift Rule

Vanilla gradient descent treats the parameter space of a variational quantum algorithm as flat. A parameter that has a large effect on the quantum state is updated by the same step size as a parameter that has almost no effect. This works for many problems but is structurally inefficient: you spend many optimization steps moving along directions that barely change the state, and few moving along the directions that matter.

Quantum natural gradient (QNG; Stokes, Izaac, Killoran, Carleo 2020) fixes this by rescaling parameter updates with the quantum Fisher information matrix — a metric tensor on the parameter manifold that measures how much each parameter actually moves the quantum state. Updates in directions where the state is sensitive to changes get smaller steps; updates in flat directions get larger steps. The result is faster convergence, more stable training, and a partial mitigation of some barren-plateau-adjacent landscapes.

QNG is not a magic fix. It costs more per step than vanilla gradient descent, and the cost grows with parameter count. But for variational quantum algorithms past about 10 parameters, QNG typically converges in 5-10× fewer iterations than vanilla SGD or Adam, and the per-iteration overhead is often worth it.

This tutorial covers the math, the block-diagonal approximation that makes QNG practically tractable, and a decision rule for when to reach for it.

The geometry of the parameter manifold

Consider a parameterized state $|\psi(\boldsymbol{\theta})\rangle$ . Two different parameters can move the state by different amounts: a small change $d\theta_i$ might move the state by a lot (large state-space displacement) while a small $d\theta_j$ might barely move it.

The quantum Fisher information matrix quantifies this. Its components are

$$F_{ij}(\boldsymbol{\theta}) \;=\; 4 \, \mathrm{Re}\Bigl[\,\bigl\langle \partial_i \psi \,\big|\, \partial_j \psi \bigr\rangle \;-\; \bigl\langle \partial_i \psi \,\big|\, \psi \bigr\rangle \bigl\langle \psi \,\big|\, \partial_j \psi \bigr\rangle\,\Bigr],$$

a symmetric positive-semidefinite matrix that captures the quantum-state distance corresponding to parameter changes. The diagonal entries $F_{ii}$ measure how much each parameter changes the state; off-diagonal entries measure how parameter changes correlate.

This matrix turns the parameter manifold into a Riemannian manifold — a curved space where the natural notion of “distance” depends on position. Vanilla gradient descent assumes a Euclidean metric (identity matrix); natural gradient uses $F$ .
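As a minimal sketch of the definition above, the following NumPy snippet estimates $F$ by central finite differences for a one-qubit state RZ(φ)RY(θ)|0⟩. The function names `state` and `qfi` and the step size `eps` are illustrative choices, not part of any library. For this state the exact result is known to be diag(1, sin²θ), so the printed matrix should be close to that.

```python
import numpy as np

def state(theta):
    """One-qubit state RZ(phi) RY(t) |0>, with theta = (t, phi)."""
    t, phi = theta
    return np.array([
        np.exp(-1j * phi / 2) * np.cos(t / 2),
        np.exp(+1j * phi / 2) * np.sin(t / 2),
    ])

def qfi(state_fn, theta, eps=1e-6):
    """Estimate F_ij = 4 Re[<d_i psi|d_j psi> - <d_i psi|psi><psi|d_j psi>]."""
    theta = np.asarray(theta, dtype=float)
    psi = state_fn(theta)
    n = theta.size
    # Central finite differences for each partial derivative of the state.
    derivs = []
    for i in range(n):
        shift = np.zeros(n)
        shift[i] = eps
        derivs.append((state_fn(theta + shift) - state_fn(theta - shift)) / (2 * eps))
    F = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            overlap = np.vdot(derivs[i], derivs[j])  # np.vdot conjugates its first argument
            phase = np.vdot(derivs[i], psi) * np.vdot(psi, derivs[j])
            F[i, j] = 4 * (overlap - phase).real
    return F

theta = np.array([0.7, 0.3])
F = qfi(state, theta)
print(np.round(F, 6))
```

Finite differences are used here only because they make the definition transparent; on hardware the entries would be estimated from circuit measurements instead.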

The QNG update rule

Standard gradient descent:

$$\boldsymbol{\theta} \;\to\; \boldsymbol{\theta} - \eta \, \boldsymbol{\nabla} C(\boldsymbol{\theta}).$$

Quantum natural gradient:

$$\boldsymbol{\theta} \;\to\; \boldsymbol{\theta} - \eta \, F^{-1}(\boldsymbol{\theta}) \, \boldsymbol{\nabla} C(\boldsymbol{\theta}).$$

The Fisher matrix inverse $F^{-1}$ rescales each gradient component by the inverse of the local curvature. Directions where the state is sensitive (large $F_{ii}$ ) get smaller updates; flat directions (small $F_{ii}$ ) get larger.
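To make the difference between the two rules concrete, here is a small NumPy sketch that applies one step of each to the same gradient. The Fisher matrix and gradient values are made up for illustration, and `gd_step` and `qng_step` are hypothetical helper names.

```python
import numpy as np

def gd_step(theta, grad, eta):
    """Vanilla gradient descent: theta - eta * grad."""
    return theta - eta * grad

def qng_step(theta, grad, F, eta):
    """Natural-gradient step: theta - eta * F^{-1} grad, via a linear solve."""
    return theta - eta * np.linalg.solve(F, grad)

theta = np.array([0.0, 0.0])
grad = np.array([0.1, 0.1])
# Hypothetical Fisher matrix: the state is sensitive to parameter 0 and
# nearly insensitive to parameter 1.
F = np.diag([1.0, 0.01])

theta_gd = gd_step(theta, grad, eta=0.1)
theta_qng = qng_step(theta, grad, F, eta=0.1)
print("GD :", theta_gd)
print("QNG:", theta_qng)
```

Both rules move parameter 0 by the same amount, while QNG moves the flat parameter 1 a hundred times further. Solving the linear system instead of forming $F^{-1}$ explicitly is a common numerical choice and assumes $F$ is nonsingular.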

There is a deep connection: QNG with the right step size is equivalent to imaginary-time evolution of the state along the parameterized manifold. If you take the limit $\eta \to dt$ infinitesimal and integrate, you recover the McLachlan variational principle for ground-state imaginary-time evolution. QNG is, in continuous-time, just time-evolving the ansatz toward the ground state along the steepest descent in the natural metric. This is one of the cleanest justifications for QNG in chemistry-VQE applications.
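For readers who want the correspondence spelled out, here is a sketch under two assumptions: the cost is the energy $C(\boldsymbol{\theta}) = \langle \psi(\boldsymbol{\theta}) | H | \psi(\boldsymbol{\theta}) \rangle$, and McLachlan's principle is used in its global-phase-corrected form. The symbols $\tau$ (imaginary time) and $\Delta\tau$ (its discretization step) are introduced here for illustration.

```latex
% McLachlan's principle for imaginary time, written with the Fisher matrix F
% (the phase-corrected metric Re[<d_i psi|d_j psi> - <d_i psi|psi><psi|d_j psi>] equals F/4):
\sum_j \tfrac{1}{4} F_{ij}\, \dot{\theta}_j
  \;=\; -\,\mathrm{Re}\,\langle \partial_i \psi | H | \psi \rangle
  \;=\; -\,\tfrac{1}{2}\, \partial_i C ,
% which gives the flow equation and its one-step Euler discretization:
\dot{\boldsymbol{\theta}} \;=\; -\,2\, F^{-1}\, \boldsymbol{\nabla} C ,
\qquad
\boldsymbol{\theta}(\tau + \Delta\tau) \;\approx\; \boldsymbol{\theta}(\tau) \;-\; 2\,\Delta\tau\, F^{-1}\, \boldsymbol{\nabla} C .
```

With this tutorial's normalization of $F$, an Euler step of imaginary-time evolution therefore matches a QNG step with $\eta = 2\,\Delta\tau$. The factor of 2 depends on the convention chosen for the metric.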

Why QNG helps with barren plateaus

In a barren-plateau region, the gradient is small in every direction, but the curvature of the landscape is also small. The Fisher matrix in the barren-plateau region is approximately a multiple of the identity, with small eigenvalues. The natural-gradient update rescales the gradient by $1/F \sim 1/\text{small}$ , effectively amplifying the small gradient.

This is not a free win: the small Fisher eigenvalues are themselves measured with shot noise, and dividing by a small number amplifies that noise. But empirically, in regimes where the barren plateau is mild rather than catastrophic (gradients $\sim 10^{-2}$, not $\sim 10^{-10}$), QNG can extract enough signal to keep the optimizer making progress where vanilla gradient descent stalls.

In severe barren plateau regions (gradients $\sim 10^{-6}$ or smaller), QNG cannot save you. The shot noise in the Fisher matrix dominates and the optimizer becomes a random walk. Tutorial 37’s structural fixes (problem-tailored ansätze, warm starts) are necessary for the deep barren-plateau regime.
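The noise-amplification argument can be illustrated with a toy model. The sketch below treats the gradient and the Fisher information as scalars, and assumes both estimates carry independent Gaussian noise of standard deviation $1/\sqrt{\text{shots}}$. That noise model, the function name, and the specific gradient and Fisher values are assumptions chosen for illustration, not measurements.

```python
import numpy as np

def median_relative_error(grad, fisher, shots, trials=4000, seed=0):
    """Median relative error of the estimate grad_hat / fisher_hat.

    Assumption: both estimates carry independent Gaussian noise with
    standard deviation 1/sqrt(shots). Real estimators differ in detail.
    """
    rng = np.random.default_rng(seed)
    sigma = 1.0 / np.sqrt(shots)
    grad_hat = grad + sigma * rng.standard_normal(trials)
    fisher_hat = fisher + sigma * rng.standard_normal(trials)
    exact = grad / fisher
    return np.median(np.abs(grad_hat / fisher_hat - exact)) / abs(exact)

shots = 1_000_000  # noise standard deviation of 1e-3
mild = median_relative_error(grad=1e-2, fisher=1e-1, shots=shots)
severe = median_relative_error(grad=1e-6, fisher=1e-4, shots=shots)
print(f"mild plateau:   median relative error = {mild:.2f}")
print(f"severe plateau: median relative error = {severe:.2f}")
```

In the mild case the natural-gradient estimate stays within a modest fraction of its true value. In the severe case the noise is far larger than both signals and the ratio is essentially random, which is the random-walk behavior described above.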

The cost: estimating $F$

The Fisher information matrix has $N(N+1)/2$ independent entries for $N$ parameters. Estimating each entry requires (in the standard parameter-shift method) $\sim 4$ circuit evaluations. So computing the full $F$ matrix costs $\sim 2 N^2$ circuit evaluations per gradient step — the same scaling as a Hessian.

For $N = 10$ , this is $\sim 200$ extra evaluations per step. For $N = 100$ , $\sim 20{,}000$ . For $N = 1{,}000$ , $\sim 2 \times 10^6$ — prohibitive.
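The estimates above can be reproduced with a few lines of Python. This sketch compares the exact entry count $N(N+1)/2$ at 4 evaluations per entry with the $2N^2$ rule of thumb. The function names are illustrative.

```python
def fisher_entries(n_params):
    """Independent entries of a symmetric n x n matrix: n(n+1)/2."""
    return n_params * (n_params + 1) // 2

def full_fisher_evaluations(n_params, evals_per_entry=4):
    """Circuit evaluations for the full matrix, at ~4 evaluations per entry."""
    return evals_per_entry * fisher_entries(n_params)

for n in (10, 100, 1000):
    exact = full_fisher_evaluations(n)
    approx = 2 * n**2
    print(f"N = {n:>5}: exact count = {exact:>9,}, 2N^2 estimate = {approx:>9,}")
```

The two counts differ only by the $2N$ contribution of the diagonal entries, which becomes negligible as $N$ grows.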

Naive QNG is feasible only for moderate parameter counts. For larger ansätze, the block-diagonal approximation below is the practical workaround.

The block-diagonal approximation

Stokes et al. (2020) proposed a layer-wise block-diagonal structure for the Fisher matrix. The intuition: for an ansatz built from layers of parameterized gates, parameters within the same layer have non-trivial Fisher correlations, while parameters in different layers are treated as having approximately zero Fisher correlation.

Approximate $F$ as a block-diagonal matrix where each block corresponds to one layer’s parameters. Off-diagonal blocks are set to zero. This reduces the cost from $O(N^2)$ to $O(\sum_l N_l^2)$ , where $N_l$ is the parameter count in layer $l$ . For an ansatz with $L$ layers and $N/L$ parameters per layer, the cost is $O(N^2 / L)$ — a factor- $L$ savings.
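A minimal NumPy sketch of the linear algebra involved, assuming parameters are ordered layer by layer. The helper names, the layer sizes, and the random positive-definite stand-in for $F$ are all hypothetical. On hardware the saving comes from never measuring the off-block entries at all; here a full matrix is masked only to show that solving block by block gives the same step.

```python
import numpy as np

def block_diagonal(F, layer_sizes):
    """Zero every entry of F that couples parameters in different layers."""
    out = np.zeros_like(F)
    start = 0
    for size in layer_sizes:
        stop = start + size
        out[start:stop, start:stop] = F[start:stop, start:stop]
        start = stop
    return out

def blockwise_solve(F, grad, layer_sizes):
    """Solve the block-diagonal system one layer at a time."""
    x = np.empty_like(grad)
    start = 0
    for size in layer_sizes:
        stop = start + size
        x[start:stop] = np.linalg.solve(F[start:stop, start:stop], grad[start:stop])
        start = stop
    return x

rng = np.random.default_rng(1)
layer_sizes = [2, 3]                 # hypothetical two-layer ansatz
A = rng.standard_normal((5, 5))
F = A @ A.T + np.eye(5)              # random symmetric positive-definite stand-in
grad = rng.standard_normal(5)

step_block = blockwise_solve(F, grad, layer_sizes)
step_masked = np.linalg.solve(block_diagonal(F, layer_sizes), grad)
print(np.allclose(step_block, step_masked))
```

Each layer needs only its own small linear solve, so the classical post-processing also scales with $\sum_l N_l^3$ rather than $N^3$.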

For typical ansätze with many small layers, the block-diagonal approximation captures most of the QNG benefit at a small fraction of the cost. As of 2026, PennyLane and similar libraries use block-diagonal $F$ by default in their QNG implementations.

A working PennyLane example

PennyLane has built-in QNG via qml.QNGOptimizer:

import numpy as np
import pennylane as qml
from pennylane import numpy as pnp

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

# Toy cost function: energy of a small Hamiltonian acting on qubits 0, 1, and 2.
H = qml.Hamiltonian(
    [1.0, -0.5, 0.3],
    [qml.PauliZ(0), qml.PauliZ(0) @ qml.PauliZ(1), qml.PauliX(2)]
)

@qml.qnode(dev, interface="autograd")
def circuit(weights):
    for q in range(n_qubits):
        qml.RY(weights[q, 0], wires=q)
        qml.RZ(weights[q, 1], wires=q)
    for q in range(n_qubits - 1):
        qml.CNOT(wires=[q, q + 1])
    for q in range(n_qubits):
        qml.RY(weights[q, 2], wires=q)
    return qml.expval(H)


def cost_fn(w):
    return circuit(w)


# Train with vanilla GradientDescent and with QNG, compare.
init = pnp.array(np.random.uniform(-0.5, 0.5, (n_qubits, 3)), requires_grad=True)

# Vanilla GD
gd = qml.GradientDescentOptimizer(stepsize=0.1)
w_gd = init.copy()
for step in range(60):
    w_gd, c = gd.step_and_cost(cost_fn, w_gd)
    if step % 20 == 0:
        print(f"GD step {step}: cost = {c:.5f}")

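# QNG. The optimizer needs the QNode itself (not a plain wrapper function)
# so that it can construct the metric tensor from the circuit.
qng = qml.QNGOptimizer(stepsize=0.1)
w_qng = init.copy()
for step in range(60):
    w_qng, c = qng.step_and_cost(circuit, w_qng)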
    if step % 20 == 0:
        print(f"QNG step {step}: cost = {c:.5f}")

Sample output (illustrative; depends on initialization):

GD step 0: cost = 0.18234
GD step 20: cost = -0.42198
GD step 40: cost = -0.61734
QNG step 0: cost = 0.18234
QNG step 20: cost = -0.78412
QNG step 40: cost = -0.79134

In this illustrative run, QNG reaches a lower energy in fewer steps, at the cost of more compute per step. Total wall-clock time for the same final accuracy is often lower for QNG, but the break-even depends on parameter count and on how flat the landscape is.

Beyond block-diagonal: variants and extensions

A few extensions worth knowing:

Diagonal QNG. Use only the diagonal of $F$ (i.e., normalize each parameter individually by its sensitivity). Cheaper still — $O(N)$ per step — and captures most of the benefit when off-diagonal correlations are small.
Stochastic Fisher estimation. Estimate $F$ from a random subset of entries each step. Reduces per-step cost at the price of higher variance.
Quantum information geometry. For pure states, the Fubini-Study metric equals $F/4$, and the quantum Fisher information defined through the symmetric logarithmic derivative reduces to $F$, so these describe the same geometry up to a constant that the step size absorbs. They become genuinely different metric choices only for mixed states.
L-BFGS hybrid. Some implementations alternate QNG steps with L-BFGS-style quasi-Newton updates, capturing global curvature beyond what the block-diagonal Fisher provides.
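The diagonal variant from the list above is simple enough to sketch in a few lines of NumPy. The function name, the regularizer `lam`, and the sensitivity values are illustrative assumptions.

```python
import numpy as np

def diagonal_qng_step(theta, grad, fisher_diag, eta, lam=1e-6):
    """Per-parameter rescaling: theta_i - eta * grad_i / (F_ii + lam)."""
    return theta - eta * grad / (fisher_diag + lam)

theta = np.zeros(3)
grad = np.array([0.2, 0.2, 0.2])
fisher_diag = np.array([1.0, 0.25, 0.04])   # hypothetical sensitivities F_ii

new_theta = diagonal_qng_step(theta, grad, fisher_diag, eta=0.1, lam=0.0)
print(new_theta)
```

With equal gradient components, the least sensitive parameter takes the largest step. No linear solve is needed, which is why the per-step cost stays linear in the number of parameters.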

The 2026 production answer is usually block-diagonal QNG for parameter counts under ~500, diagonal QNG for larger counts, and Adam or vanilla SGD when the per-step Fisher cost is unaffordable.

Common misconceptions

“QNG eliminates barren plateaus.” No. QNG amplifies the gradient signal in mildly flat regions. In severely flat regions, the Fisher matrix itself is statistically poorly determined and QNG becomes unreliable. Structural fixes (problem-tailored ansätze, warm starts) are still necessary for severe barren plateaus.

“QNG is a second-order method like Newton’s method.” Related but distinct. QNG uses the quantum Fisher information as the metric, which is (up to a constant factor) the curvature of the infidelity between neighboring quantum states. It is the quantum analogue of the classical Fisher information, the curvature of the Kullback-Leibler divergence between probability distributions. Neither is the curvature of the cost function itself. Newton’s method uses the Hessian of the cost function. The two coincide in some cases but generally differ: the QNG metric depends only on the ansatz, not on the cost being minimized.

“QNG is always better than Adam.” Empirically, on barren-plateau-adjacent landscapes, QNG often outperforms Adam in iteration count. But the per-iteration cost is much higher; Adam may win on wall-clock for problems with cheap circuits or many parameters.

“You need full $F$ for QNG to work.” No. Block-diagonal and diagonal approximations recover most of the benefit at a fraction of the cost. The full matrix is rarely worth computing.

“QNG only helps for chemistry / specific applications.” It helps any variational algorithm where the parameter manifold is curved — which is essentially all of them. Chemistry was the first reported use because of the imaginary-time-evolution connection, but QML, QAOA, and quantum simulation all benefit.

Decision rule

Reach for QNG when:

Parameter count is moderate. $N \in [10, 500]$ is the sweet spot. Below 10, vanilla GD is fine; above 500, the per-step Fisher cost dominates.
Landscape is barren-plateau-adjacent but not catastrophic. With gradients in the $10^{-3}$ to $10^{-1}$ range, QNG’s amplification helps. With gradients $\sim 10^{-6}$, QNG fails too.
You can afford block-diagonal Fisher computation. $\sim 2 N^2 / L$ extra evaluations per step is the cost; multiply by your shot count for the wall-clock impact.
Imaginary-time evolution is the right semantics. For chemistry VQE problems where the goal is to reach the ground state, QNG’s connection to imaginary-time-evolution is more than aesthetic — it gives provable convergence properties.

Use vanilla GD or Adam when:

You have hundreds of thousands of parameters. QNG cost is prohibitive.
Your landscape is well-conditioned. When the cost-function landscape is approximately spherical, vanilla GD converges almost as fast as QNG with much lower per-step cost.
You’re in a deep barren-plateau region. Neither method works; switch to a structurally different ansatz.

For the typical 2026 chemistry-VQE problem with a few hundred ADAPT-VQE parameters, QNG with block-diagonal Fisher is the production choice.
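The rule of thumb above can be written down as a small function. This is a sketch only: the cutoffs (10 and 500 parameters, gradient scale $10^{-6}$) are read loosely from the text, and the 100,000-parameter boundary is an assumed stand-in for "hundreds of thousands of parameters". None of them are sharp thresholds.

```python
def choose_optimizer(n_params, grad_scale):
    """Return a suggested optimizer following the tutorial's rule of thumb.

    The cutoffs are taken loosely from the text (the 100_000 boundary is an
    assumption) and should be read as rough guides, not sharp boundaries.
    """
    if grad_scale <= 1e-6:
        return "change the ansatz (deep barren plateau)"
    if n_params < 10:
        return "vanilla gradient descent"
    if n_params <= 500:
        return "block-diagonal QNG"
    if n_params < 100_000:
        return "diagonal QNG"
    return "Adam or vanilla SGD"

for n, g in [(6, 1e-1), (200, 1e-2), (1_000, 1e-2), (500_000, 1e-2), (200, 1e-8)]:
    print(f"N = {n:>7,}, gradient ~ {g:.0e}: {choose_optimizer(n, g)}")
```

A real decision would also weigh circuit cost and shot budget, which this function deliberately leaves out.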

Exercises

1. Why $F$ is positive semidefinite

Show that the Fisher information matrix $F$ defined above is positive semidefinite. Why does this matter for the QNG update direction?

Answer

For any real vector $\mathbf{v}$, $\mathbf{v}^T F \mathbf{v} = 4 (\langle \partial \psi | \partial \psi \rangle - |\langle \partial \psi | \psi \rangle|^2)$, where $|\partial \psi\rangle = \sum_i v_i |\partial_i \psi\rangle$. The first term is a squared norm, and by the Cauchy-Schwarz inequality the second term satisfies $|\langle \partial \psi | \psi \rangle|^2 \leq \langle \partial \psi | \partial \psi \rangle \langle \psi | \psi \rangle = \langle \partial \psi | \partial \psi \rangle$. So $\mathbf{v}^T F \mathbf{v} \geq 0$ and $F$ is positive semidefinite. When $F$ is strictly positive definite, $F^{-1}$ exists and is also positive definite, so the QNG update direction $-F^{-1} \nabla C$ is a descent direction: its inner product with $\nabla C$ is $-\nabla C^T F^{-1} \nabla C < 0$ whenever $\nabla C \neq 0$. Without positive-definiteness, QNG could move uphill, which would be catastrophic for optimization.
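The argument can also be checked numerically. The sketch below builds $F$ from a random normalized state and random stand-ins for the derivative vectors (hypothetical data, not a real ansatz), then inspects the eigenvalues and the sign of the inner product between the gradient and the natural-gradient direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_params = 8, 5

# Random normalized state and random "derivative" vectors (hypothetical data).
psi = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
psi /= np.linalg.norm(psi)
D = rng.standard_normal((dim, n_params)) + 1j * rng.standard_normal((dim, n_params))

# F_ij = 4 Re[<d_i|d_j> - <d_i|psi><psi|d_j>] = 4 Re[D^dagger (I - |psi><psi|) D]
projector = np.eye(dim) - np.outer(psi, psi.conj())
F = 4 * (D.conj().T @ projector @ D).real

eigenvalues = np.linalg.eigvalsh(F)
grad = rng.standard_normal(n_params)
descent_inner_product = -grad @ np.linalg.solve(F, grad)
print("smallest eigenvalue:", eigenvalues.min())
print("<grad, -F^{-1} grad>:", descent_inner_product)
```

Writing $F$ as a projector sandwiched between the derivative vectors makes the positive-semidefinite structure visible directly: the projector removes the component along $|\psi\rangle$, and what remains is a Gram matrix.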

2. When $F$ is singular

If $F$ has eigenvalue zero (i.e., a redundant parameter direction that doesn’t change the state), what does QNG do, and how does the implementation handle it?

Answer

A zero eigenvalue means there is a parameter direction the state is insensitive to, and $F^{-1}$ is undefined in that direction. Implementations handle this by regularization: replace $F$ with $F + \lambda I$ for a small $\lambda > 0$, then invert. The regularized matrix is always positive definite. The price is that updates along nearly redundant directions are capped at roughly $1/\lambda$ times the gradient instead of diverging. Values of $\lambda$ between $10^{-6}$ and $10^{-3}$ are standard and rarely affect practical convergence. PennyLane’s QNGOptimizer exposes this as the lam argument, which defaults to 0, so set it explicitly when $F$ may be singular.
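A minimal NumPy sketch of the regularized solve, using a hand-built singular $F$ with one redundant parameter. The function name and the value of `lam` are illustrative.

```python
import numpy as np

def regularized_natural_gradient(F, grad, lam=1e-4):
    """Solve (F + lam * I) x = grad, which is well-posed even if F is singular."""
    return np.linalg.solve(F + lam * np.eye(F.shape[0]), grad)

# Parameter 1 is redundant: the state, and hence the cost, does not depend on it.
F = np.diag([1.0, 0.0])
grad = np.array([0.2, 0.0])

step = regularized_natural_gradient(F, grad, lam=1e-4)
print(step)
```

Because the cost does not depend on a truly redundant parameter, its exact gradient component is zero and the regularized step leaves it alone. With shot noise the estimated component is not exactly zero and gets multiplied by $1/\lambda$, which is the practical reason not to choose $\lambda$ too small.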

3. Cost of full vs block-diagonal Fisher

For an ansatz with 60 parameters arranged in 6 layers of 10 parameters each, compute the cost of full Fisher matrix vs block-diagonal Fisher matrix per QNG step.

Answer

Full Fisher: $60 \times 61 / 2 = 1{,}830$ unique entries × 4 evaluations each = 7,320 circuit evaluations per step. Block-diagonal: 6 blocks of $10 \times 11 / 2 = 55$ entries each, so 330 entries × 4 evaluations = 1,320 evaluations per step. Block-diagonal is ~5.5× cheaper. For equal-sized layers the savings grow roughly in proportion to the number of layers. Combined with the empirical observation that block-diagonal recovers most of the QNG benefit, this is why production implementations default to block-diagonal.

4. Pick QNG vs Adam for a 1,000-parameter ansatz

A QML training problem has 1,000 parameters and a moderately barren-plateau-adjacent landscape (gradients $\sim 10^{-2}$). Per-step full Fisher cost would be $\sim 2{,}000{,}000$ circuit evaluations (the $\sim 2N^2$ estimate from above). You have access to 1,000 shots/second on quantum hardware. Should you use QNG or Adam?

Answer

Even at an optimistic one shot per circuit evaluation, a QNG step costs 2M shots / 1k shots/s = 2,000 seconds, about 33 minutes per step, and realistic shot counts per evaluation multiply that further. Even with 5-10× iteration savings, a run of a few hundred steps takes days. Adam is clearly the right choice for 1,000-parameter problems on hardware. A reasonable middle path: diagonal QNG (cost $O(N)$, about 4,000 evaluations/step, ~4 seconds on the same accounting), which captures the per-parameter rescaling benefit at much lower cost. Diagonal QNG is the production choice for parameter counts $\geq 500$ in 2026.

Where this goes next

This concludes the four-tutorial variational deepening (37-40). The variational track now has six tutorials: VQE (13), QAOA (14), barren plateaus (37), ADAPT-VQE (38), parameter-shift rule (39), and quantum natural gradient (40). Together they cover the practical machinery of variational quantum computing as it actually exists in 2026, with realistic limits clearly stated. Future tutorials in this track will deepen specific application areas: variational chemistry beyond VQE, QAOA depth-quality theory, hybrid classical-quantum architectures, and the open problem of whether variational methods can scale into the fault-tolerant era.
