variational advanced · 20 min read · By LIPAI WANG · April 29, 2026

Barren Plateaus: Why Most Variational Quantum Algorithms Fail at Scale

Barren plateaus are the dominant theoretical limit on variational quantum algorithms. Past a modest qubit count, the gradients of typical parameterized quantum circuits vanish exponentially in the system size, making training infeasible. This tutorial covers the McClean 2018 result, the cost-function-dependent and noise-induced extensions, and the four mitigation strategies that have moved the field — and gives an honest verdict on whether variational quantum computing has a future at scale.

Prerequisites: Tutorial 13: Variational Quantum Eigensolver, Tutorial 17: Is QML Worth It? A Skeptic's Benchmark

The single most consequential theoretical result for variational quantum computing in the last decade is the barren plateau — the observation that for sufficiently expressive parameterized quantum circuits, the gradients of the cost function with respect to the circuit parameters vanish exponentially in the qubit count. Past about 30-50 qubits, training becomes infeasible: the optimizer sees a flat landscape and cannot find a useful direction to move.

This is not a hardware limitation. It is a structural property of the variational ansatz itself, proven for circuit ensembles that match the Haar measure on the unitary group up to second moments (unitary 2-designs), and extended to many practically used ansätze. Tutorial 17 (the QML skeptic’s benchmark) hinted at this; this tutorial makes it precise. The bottom line: most published variational quantum algorithms have not been demonstrated past the regime where barren plateaus would force them to fail, and those that have been pushed further almost always rely on a small handful of mitigation strategies.

This is not a death sentence for variational quantum computing. It is a hard constraint that the field is actively engineering around, with mixed results. This tutorial covers the core result, the extensions, the four major mitigation strategies, and a decision rule for evaluating whether your variational algorithm has a credible path to scale.

The McClean 2018 result

The original barren plateau result (McClean, Boixo, Smelyanskiy, Babbush, Neven 2018, Nature Communications) is a clean statement. Consider a parameterized quantum circuit

$$U(\boldsymbol{\theta}) \;=\; \prod_{l=1}^L V_l \, e^{-i \theta_l W_l},$$

where $V_l$ are fixed unitaries, $W_l$ are Hermitian generators, and $\boldsymbol{\theta}$ are trainable parameters. Define a cost function via the expectation of some Hermitian observable $H$:

$$C(\boldsymbol{\theta}) \;=\; \langle 0 | U^\dagger(\boldsymbol{\theta}) H U(\boldsymbol{\theta}) | 0 \rangle.$$

The barren plateau result states: under reasonable conditions on the ansatz (the unitary 2-design property, satisfied by sufficiently random circuits), the variance of any single partial derivative $\partial C / \partial \theta_k$ over the parameter space scales as

$$\mathrm{Var}\bigl[\partial_k C\bigr] \;\sim\; \frac{1}{2^n},$$

where $n$ is the qubit count. The mean of the gradient is zero by symmetry, so the gradient itself is a zero-mean random variable with exponentially small variance. At a randomly initialized parameter, the gradient is exponentially small with high probability.

The exponential dependence on qubit count is the kicker. At 10 qubits, gradients have variance $\sim 10^{-3}$ — manageable. At 50 qubits, $\sim 10^{-15}$ — drowned in shot noise on any plausible quantum hardware. At 100 qubits, $\sim 10^{-30}$ — physically unmeasurable.

Why this happens

The proof strategy: invoke the unitary 2-design property to reduce variance integrals over the parameter space to integrals over the Haar measure on the unitary group. Closed-form formulas for these “Weingarten” integrals exist and produce the $1/2^n$ scaling.

The intuition: a sufficiently expressive parameterized circuit is “essentially Haar-random” on the system Hilbert space. A typical Haar-random unitary maps $|0\rangle$ to a maximally spread state, where every observable’s expectation hovers near a flat global average. The gradient of any local observable’s expectation is the difference between two such flat averages — small.
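The concentration claim can be checked numerically without any quantum library. The following is a minimal NumPy sketch, assuming a Haar-random state can be sampled by normalizing a complex Gaussian vector (a standard construction). The helper names and the choice of $Z$ on the first qubit are illustrative, not from the original text. For a traceless Pauli observable the exact variance over Haar-random states is $1/(2^n + 1)$, which the estimate should track.

```python
import numpy as np


def haar_random_state(n_qubits: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a Haar-random pure state by normalizing a complex Gaussian vector."""
    dim = 2 ** n_qubits
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return psi / np.linalg.norm(psi)


def z0_expectation(psi: np.ndarray) -> float:
    """<Z> on the first qubit, taking the first qubit as the most significant bit."""
    probs = np.abs(psi) ** 2
    half = len(psi) // 2
    # First half of the amplitudes has qubit 0 in |0>, second half in |1>.
    return float(probs[:half].sum() - probs[half:].sum())


def z0_variance(n_qubits: int, n_samples: int = 2000, seed: int = 0) -> float:
    """Variance of <Z_0> over independently sampled Haar-random states."""
    rng = np.random.default_rng(seed)
    vals = [z0_expectation(haar_random_state(n_qubits, rng)) for _ in range(n_samples)]
    return float(np.var(vals))


for n in (2, 4, 6, 8):
    exact = 1 / (2 ** n + 1)
    print(f"n={n}  sampled Var={z0_variance(n):.3e}  exact 1/(2^n+1)={exact:.3e}")
```

The expectation value itself concentrates around zero at the same exponential rate as the gradient, which is the "flat global average" the intuition refers to.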

Three corollaries follow immediately:

1. The deeper the ansatz, the worse the barren plateau. More layers means closer to Haar-random, which means a flatter landscape.
2. Random initialization is bad. The proof applies to randomly chosen parameters. Choosing structured initializations (warm starts) is one of the standard mitigations.
3. Quantum advantage is correlated with vanishing gradients. Ansätze expressive enough to capture quantum-advantage states are exactly the ansätze where barren plateaus appear. There is a deep tension between trainability and expressivity.

That third point is the most uncomfortable. The theoretical sweet spot — an ansatz that can represent useful quantum states and has trainable gradients — is narrow.

Extensions: it gets worse

The McClean 2018 result is for unitary 2-designs. Three later results extended barren plateaus to more practical settings.

Cost-function-dependent barren plateaus (Cerezo et al. 2021)

In some QML and VQE applications, the cost function is a sum of local observables (e.g., the energy of a local Hamiltonian). One might hope local cost functions avoid barren plateaus. Sometimes they do, sometimes they do not.

Cerezo, Sone, Volkoff, Cincio, Coles 2021 (Nature Communications) showed:

For ansätze with depth $O(\log n)$, local cost functions have polynomially vanishing gradients (no barren plateau).
For ansätze with depth $O(\text{poly}(n))$, even local cost functions have exponentially vanishing gradients.
Global cost functions (e.g., $|\langle 0 | U(\boldsymbol{\theta}) | 0 \rangle|^2$) always have barren plateaus, regardless of depth.

The practical consequence: shallow ansätze with local cost functions can sometimes train. Deep ansätze with global cost functions cannot. Most published VQE and QML work falls in the borderline regime where the answer depends on the specific problem.
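A closed-form toy case makes the local-versus-global gap concrete. The sketch below assumes the simplest possible ansatz, one single-qubit rotation per qubit with no entangling gates, so that $|\langle 0 | U(\boldsymbol{\theta}) | 0 \rangle|^2 = \prod_k \cos^2(\theta_k/2)$. The cost definitions $C_G$ and $C_L$, the function names, and the uniform sampling of angles are assumptions chosen for illustration. Even at depth one, the global cost has a gradient variance that shrinks exponentially in $n$, while the local cost shrinks only as $1/n^2$.

```python
import numpy as np


def var_grad_global(n: int) -> float:
    """Var[dC_G/dtheta_0] for C_G = 1 - prod_k cos^2(theta_k / 2), angles uniform."""
    # E[sin^2(theta)] / 4 = 1/8 and E[cos^4(theta / 2)] = 3/8 for uniform angles.
    return (1 / 8) * (3 / 8) ** (n - 1)


def var_grad_local(n: int) -> float:
    """Var[dC_L/dtheta_0] for C_L = 1 - (1/n) sum_k cos^2(theta_k / 2)."""
    return 1 / (8 * n ** 2)


def monte_carlo_check(n: int, n_samples: int = 200_000, seed: int = 0):
    """Estimate both variances by sampling angles uniformly from [-pi, pi)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-np.pi, np.pi, size=(n_samples, n))
    # d/dtheta_0 of cos^2(theta_0 / 2) is -sin(theta_0) / 2.
    others = np.prod(np.cos(theta[:, 1:] / 2) ** 2, axis=1)
    grad_global = 0.5 * np.sin(theta[:, 0]) * others
    grad_local = 0.5 * np.sin(theta[:, 0]) / n
    return float(grad_global.var()), float(grad_local.var())


for n in (2, 4, 8, 16, 32):
    print(f"n={n:2d}  global={var_grad_global(n):.3e}  local={var_grad_local(n):.3e}")
```

`monte_carlo_check` is there only as a sanity check on the two closed forms at small `n`; at larger `n` the global-cost estimate becomes dominated by rare samples and needs far more draws.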

Noise-induced barren plateaus (Wang et al. 2021)

What about noise? Wang, Fontana, Cerezo, Sharma, Sone, Cincio, Coles 2021 showed that noise itself induces barren plateaus, even in ansätze that would otherwise avoid them:

$$\mathrm{Var}\bigl[\partial_k C\bigr] \;\sim\; q^{2L},$$

where $q < 1$ is the per-gate noise factor and $L$ is the circuit depth. Noise-induced barren plateaus are exponential in depth, not in qubit count. Even a 5-qubit, 50-layer noisy circuit can have unmeasurable gradients.

This is the most depressing extension because it cannot be fixed by ansatz design — it is a property of running noisy hardware. The mitigation is short circuits, which limits expressive power.
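To get a feel for the $q^{2L}$ scaling, the following sketch tabulates the suppression factor for a few depths. The noise factors 0.99 and 0.9 and the variance floor of $10^{-4}$ are assumed values picked for illustration, not hardware measurements, and the function names are hypothetical.

```python
import math


def noise_variance_scale(q: float, depth: int) -> float:
    """Suppression factor q^(2L) on the gradient variance at circuit depth L."""
    return q ** (2 * depth)


def max_depth_above_floor(q: float, floor: float) -> int:
    """Largest depth L for which q^(2L) is still at least `floor` (0 < q, floor < 1)."""
    # q^(2L) >= floor  <=>  L <= ln(floor) / (2 ln q), since ln q is negative.
    return math.floor(math.log(floor) / (2 * math.log(q)))


for q in (0.99, 0.9):  # assumed noise factors
    for depth in (10, 50, 200):
        print(f"q={q}  L={depth:3d}  q^(2L)={noise_variance_scale(q, depth):.3e}")
    print(f"q={q}  deepest L with q^(2L) >= 1e-4: {max_depth_above_floor(q, 1e-4)}")
```

Note that the qubit count never enters: under this scaling the usable depth is set by the noise factor alone, which is why short circuits are the only structural fix.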

Entanglement-induced barren plateaus (Marrero, Kieferová, Wiebe 2021)

If the ansatz produces highly entangled states (e.g., volume-law entanglement), the marginal-state landscape becomes essentially structureless, producing yet another barren plateau mechanism. This connects barren plateaus to a deep statistical-mechanics picture of quantum landscapes.

The collective verdict: barren plateaus are the rule, not the exception, in the ansatz family that has dominated NISQ-era variational research.

The four major mitigation strategies

Strategy 1: Problem-tailored ansätze

Instead of a generic ansatz (hardware-efficient, brick-wall), use one that exploits the structure of the problem. Examples:

Hamiltonian variational ansatz (HVA): for finding ground states of a Hamiltonian $H$, use an ansatz built from the gates $e^{-i \tau_j H_j}$, where $\{H_j\}$ are the summands of $H$. Less expressive than a hardware-efficient ansatz (HEA) but trainable for many physical systems.
Unitary coupled cluster with singles and doubles (UCCSD): for chemistry, an ansatz inspired by classical coupled-cluster theory. Trainable in practice for small molecules at small qubit counts; barren plateaus reappear at larger sizes, but more slowly.
Tree tensor network ansatz: a geometrically motivated ansatz with logarithmic-depth structure, for which the absence of barren plateaus has been proven for certain cost functions.

The pattern: trade generic expressivity for problem-specific trainability. Most successful 2024-2025 VQE results use problem-tailored ansätze, not hardware-efficient ones.

Strategy 2: Warm starts and structured initialization

Don’t initialize randomly. Use:

A warm-start from a classical solution (e.g., Hartree-Fock for chemistry, classical SDP relaxation for QAOA).
An identity-block initialization where the circuit starts as approximately the identity, breaking symmetry only weakly.
Pretraining on smaller-system instances and transferring parameters.

The McClean 2018 result is about random initialization. A specific structured initialization can land outside the barren-plateau region; the question is whether the optimization can escape into a useful subspace before drifting back into the plateau.
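The identity-block idea can be illustrated with plain matrices. This is a minimal two-qubit sketch, assuming a layer made of RY rotations followed by a CNOT; the layer structure and all function names are illustrative choices, not a prescribed construction. The second half of the block is initialized as the mirrored inverse of the first half, so the whole block starts out as the identity even though the individual angles are random.

```python
import numpy as np


def ry(theta: float) -> np.ndarray:
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])


# CNOT with qubit 0 as control and qubit 1 as target (basis order |q0 q1>).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)


def layer(angles: np.ndarray) -> np.ndarray:
    """One layer: RY on each qubit, then a CNOT."""
    return CNOT @ np.kron(ry(angles[0]), ry(angles[1]))


def inverse_layer(angles: np.ndarray) -> np.ndarray:
    """Inverse of `layer`: CNOT first, then RY with negated angles."""
    return np.kron(ry(-angles[0]), ry(-angles[1])) @ CNOT


def identity_block(angles: np.ndarray) -> np.ndarray:
    """Forward layers followed by their inverses in reverse order."""
    u = np.eye(4)
    for a in angles:          # forward half: random angles
        u = layer(a) @ u
    for a in angles[::-1]:    # backward half: mirrored inverse layers
        u = inverse_layer(a) @ u
    return u


rng = np.random.default_rng(0)
angles = rng.uniform(-np.pi, np.pi, size=(3, 2))
is_identity = np.allclose(identity_block(angles), np.eye(4))
print("block starts as identity:", is_identity)
```

Only the initial values are tied together here. In an actual training run the angles in both halves would presumably be treated as independent parameters once optimization starts, which is what lets the circuit move away from the identity.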

Strategy 3: Layerwise / blockwise training

Don’t train all parameters simultaneously. Train one layer at a time, freezing the others. After each layer is trained, add another and continue. This avoids the high-dimensional Haar-randomness that triggers the plateau.

Layerwise training is slower than full optimization and is not always strictly better, but it has been shown to escape some barren plateaus in practice. Variants include “growth” strategies where the ansatz expands as training progresses (the seed of ADAPT-VQE, covered in Tutorial 38).
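The control flow of layerwise training is simple enough to sketch independently of any quantum backend. The version below makes several assumptions: `cost` is any callable taking a `(n_layers, width)` parameter array, a layer whose angles are all zero acts as the identity (so untrained layers are effectively absent), and a central finite difference stands in for whatever gradient estimator the hardware provides. The toy quadratic cost at the end is only there to exercise the loop.

```python
import numpy as np
from typing import Callable


def train_layerwise(cost: Callable[[np.ndarray], float],
                    n_layers: int,
                    width: int,
                    steps_per_layer: int = 200,
                    lr: float = 0.1,
                    eps: float = 1e-4) -> np.ndarray:
    """Train one layer at a time; earlier layers stay frozen, later ones stay at zero."""
    params = np.zeros((n_layers, width))
    for active in range(n_layers):
        for _ in range(steps_per_layer):
            grad = np.zeros(width)
            for j in range(width):
                shift = np.zeros_like(params)
                shift[active, j] = eps
                # Central finite difference, restricted to the active layer.
                grad[j] = (cost(params + shift) - cost(params - shift)) / (2 * eps)
            params[active] -= lr * grad
    return params


# Toy stand-in for a circuit cost: squared distance to a fixed target.
target = np.array([[0.5, -0.3], [0.1, 0.8]])
trained = train_layerwise(lambda p: float(np.sum((p - target) ** 2)),
                          n_layers=2, width=2)
print(np.round(trained, 4))
```

Each gradient evaluation touches only `width` parameters instead of `n_layers * width`, which is where the per-step saving comes from; the price is that frozen layers cannot compensate for what later layers learn.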

Strategy 4: Geometric and natural-gradient methods

Standard gradient descent ignores the geometry of the parameter space. Quantum natural gradient (Stokes et al. 2020, Tutorial 40) uses the quantum Fisher information matrix (proportional to the Fubini-Study metric tensor) to rescale parameter updates according to the local geometry of the state space, which can help the optimizer move through some flat regions.

This is not a free fix — quantum natural gradient adds quadratic-in-parameter-count overhead to compute the Fisher matrix — but it has been shown to outperform vanilla SGD on some barren-plateau-adjacent landscapes.
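The update rule is easiest to see on a single qubit, where the metric is known in closed form. The sketch below assumes the state $RZ(\phi)\,RY(\theta)|0\rangle$, whose Fubini-Study metric is $\tfrac{1}{4}\,\mathrm{diag}(1, \sin^2\theta)$, and minimizes $\langle X \rangle = \sin\theta\cos\phi$. The starting point, learning rate, and small regularizer are illustrative choices. On real multi-qubit circuits the metric has to be estimated, which is where the overhead mentioned above comes from.

```python
import numpy as np


def cost(params: np.ndarray) -> float:
    """<X> for the single-qubit state RZ(phi) RY(theta) |0>."""
    theta, phi = params
    return float(np.sin(theta) * np.cos(phi))


def grad(params: np.ndarray) -> np.ndarray:
    """Analytic gradient of the cost with respect to (theta, phi)."""
    theta, phi = params
    return np.array([np.cos(theta) * np.cos(phi),
                     -np.sin(theta) * np.sin(phi)])


def fubini_study_metric(params: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """Bloch-sphere metric (1/4) diag(1, sin^2 theta), regularized near the poles."""
    theta, _ = params
    return np.diag([0.25, 0.25 * np.sin(theta) ** 2 + reg])


def qng_step(params: np.ndarray, lr: float = 0.05) -> np.ndarray:
    """One natural-gradient update: params - lr * g^{-1} grad."""
    g = fubini_study_metric(params)
    return params - lr * np.linalg.solve(g, grad(params))


params = np.array([0.3, 2.0])  # assumed starting point near the north pole
for _ in range(200):
    params = qng_step(params)
final_cost = cost(params)
print(f"theta={params[0]:.4f}  phi={params[1]:.4f}  cost={final_cost:.6f}")
```

Near the pole the plain gradient in $\phi$ is suppressed by a factor of $\sin\theta$, while the metric-rescaled step is not; that rescaling is the effect the method relies on.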

A small experiment: see the plateau yourself

The code below illustrates the barren plateau on a simple ansatz. The point is to see how quickly gradients vanish as the qubit count grows.

import numpy as np
import pennylane as qml
from pennylane import numpy as pnp

def variance_of_gradient(n_qubits: int, n_layers: int, n_samples: int = 50) -> float:
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def circuit(weights):
        # Hardware-efficient ansatz: alternating Y rotations and CNOT layers.
        for layer in range(n_layers):
            for q in range(n_qubits):
                qml.RY(weights[layer, q], wires=q)
            for q in range(n_qubits - 1):
                qml.CNOT(wires=[q, q + 1])
        return qml.expval(qml.PauliZ(0))

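    # The weights array is the only argument and is marked trainable, so
    # qml.grad differentiates with respect to it by default.
    grad_fn = qml.grad(circuit)

    grads = []
    for _ in range(n_samples):
        weights = pnp.array(np.random.uniform(-np.pi, np.pi, (n_layers, n_qubits)),
                            requires_grad=True)
        g = grad_fn(weights)
        # Keep only the gradient with respect to the first parameter.
        grads.append(float(g[0, 0]))

    # Variance of that single gradient component across the random draws.
    return float(np.var(grads))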


for n in [4, 6, 8, 10, 12]:
    v = variance_of_gradient(n_qubits=n, n_layers=20)
    print(f"n_qubits={n:2d}  Var(gradient)={v:.3e}  (predicted ~ {2**(-n):.3e})")

Sample output (illustrative; actual numbers will vary by random seed):

n_qubits= 4  Var(gradient)=4.123e-02  (predicted ~ 6.250e-02)
n_qubits= 6  Var(gradient)=1.187e-02  (predicted ~ 1.563e-02)
n_qubits= 8  Var(gradient)=2.514e-03  (predicted ~ 3.906e-03)
n_qubits=10  Var(gradient)=6.892e-04  (predicted ~ 9.766e-04)
n_qubits=12  Var(gradient)=1.736e-04  (predicted ~ 2.441e-04)

The variance roughly halves with each added qubit (so it drops by about a factor of four per added pair), consistent with $1/2^n$ scaling. By $n = 30$ the variance would be $\sim 10^{-9}$, smaller than the shot noise on any reasonable hardware run. The barren plateau is observable in simulation at modest qubit counts, which is one of the best ways to convince yourself the result is real and not a theoretical curiosity.
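One way to quantify the scaling is to fit a line to $\log_2$ of the variance against the qubit count. The sketch below does that for the illustrative sample output listed above, so the fitted number describes those example figures only; rerunning the experiment will shift it.

```python
import numpy as np

# Illustrative sample output copied from the listing above (not fresh measurements).
n_qubits = np.array([4, 6, 8, 10, 12])
variances = np.array([4.123e-02, 1.187e-02, 2.514e-03, 6.892e-04, 1.736e-04])

# Fit log2(Var) = slope * n + intercept; pure 1/2^n scaling predicts slope = -1.
slope, intercept = np.polyfit(n_qubits, np.log2(variances), deg=1)

print(f"fitted slope = {slope:.2f} per qubit (1/2^n predicts -1.00)")
print(f"variance ratio per added qubit = {2 ** slope:.2f}")
```

A slope near $-1$ means each added qubit costs about a factor of two in variance. A noticeably shallower slope on your own run would suggest the circuit is not yet deep enough to behave like a 2-design at that size.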

What this means for VQE and QAOA

The two most prominent variational algorithms — VQE for chemistry and QAOA for combinatorial optimization — sit on different points on the barren-plateau spectrum.

VQE: Problem-tailored ansätze (Hamiltonian variational, UCCSD-derived, ADAPT-VQE) are the trainable choice. Hardware-efficient ansätze are not. As of 2026, no published peer-reviewed VQE result has crossed the 50-qubit-equivalent line on a hardware-efficient ansatz; the results that have crossed it use problem-tailored ansätze, often with classical pre-training.

QAOA: The QAOA ansatz is naturally a Hamiltonian variational ansatz, since its structure encodes the problem Hamiltonian directly. The barren plateau picture is more nuanced here: shallow QAOA (low $p$) is provably trainable; deep QAOA (high $p$) approaches barren-plateau territory. Tutorial 14 covered QAOA depth-quality tradeoffs; barren plateaus are part of why “more depth” doesn’t always help.

For both algorithms, the practical 2026 picture is: train at modest qubit count and modest depth where barren plateaus are manageable; rely on problem structure to bridge from there to larger scales.

Common misconceptions

“Barren plateaus mean variational quantum algorithms don’t work.” Not in general. Barren plateaus rule out random-initialization training of generic deep ansätze. Problem-tailored ansätze with warm starts can train successfully at scales where generic ansätze would fail.

“You can fix barren plateaus by adding more shots.” No. The gradient variance shrinks exponentially in qubit count, so the typical gradient magnitude is about $2^{-n/2}$; resolving it against shot noise of order $1/\sqrt{N}$ takes $N \sim 2^n$ shots, which grows exponentially. There is not enough wall-clock budget on any plausible quantum-classical hybrid system to brute-force a 50-qubit barren plateau with shots.
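A back-of-envelope calculation shows how quickly the shot budget runs away. The sketch below assumes shot noise of $1/\sqrt{N}$ on each gradient estimate and a typical gradient magnitude of $2^{-n/2}$, so resolving one gradient component at signal-to-noise ratio 1 takes about $2^n$ shots. The rate of $10^4$ shots per second is an assumed round number, not a figure for any particular device.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SHOTS_PER_SECOND = 1e4  # assumed hardware throughput, for illustration only


def shots_to_resolve(n_qubits: int, snr: float = 1.0) -> float:
    """Shots N with 1/sqrt(N) = 2^(-n/2) / snr, i.e. N = snr^2 * 2^n."""
    return (snr ** 2) * 2.0 ** n_qubits


def years_per_gradient_component(n_qubits: int,
                                 shots_per_second: float = SHOTS_PER_SECOND) -> float:
    """Wall-clock years to resolve a single gradient component once."""
    return shots_to_resolve(n_qubits) / shots_per_second / SECONDS_PER_YEAR


for n in (10, 20, 30, 40, 50):
    print(f"n={n}  shots={shots_to_resolve(n):.2e}  "
          f"years={years_per_gradient_component(n):.2e}")
```

This is the cost of one component at one optimizer step; a full training run multiplies it by the parameter count and the number of steps.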

“Noise-induced barren plateaus are fixed by error mitigation.” Partially. Error mitigation can effectively reduce the per-gate noise factor, pushing the threshold for noise-induced barren plateaus deeper. But it doesn’t eliminate the mechanism; sufficiently noisy circuits still hit the plateau.

“Barren plateaus are a NISQ-era problem.” They are most acute in the NISQ era because of the combination of noise and hardware-efficient ansatz design. In the fault-tolerant era, the noise-induced plateau goes away, but the entanglement-induced and global-cost plateaus remain. Fault-tolerant variational algorithms are not automatically barren-plateau-free.

“Quantum machine learning solves this.” It does not. Most QML cost functions inherit the barren-plateau problem from the underlying parameterized-circuit framework. Tutorial 17 covered the bigger picture of QML’s mixed track record.

Decision rule

Before committing to a variational quantum algorithm, work through this checklist:

1. What is your ansatz family? Hardware-efficient (generic) → high barren-plateau risk. Problem-tailored (HVA, UCCSD, QAOA-style) → manageable. A specific classically trained warm start → low risk.
2. What is your cost function structure? Local sum-of-Paulis → trainable at logarithmic depth. Global (state overlap, fidelity) → barren plateau regardless. Hybrid → depends on relative weights.
3. What is your target qubit count? Below 20 qubits: barren plateaus rarely matter. 20-50 qubits: matter for some ansätze. Above 50: matter for almost everything; need explicit mitigation.
4. What is your noise level? Low noise + shallow ansatz → fine. High noise + deep ansatz → noise-induced plateau dominates. Most NISQ experiments live in the latter regime; that’s the source of much of the “VQA results don’t reproduce at scale” pattern.
5. What is your training budget? With limited shots, even a marginally-better-than-plateau gradient is noise-drowned. Plan for warm-started, layerwise, or natural-gradient training to maximize the signal-to-noise per training step.

A variational algorithm proposal that survives all five questions is plausibly trainable at scale. Most do not survive. Ask the questions before designing the experiment, not after debugging the failed run.
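The checklist can be written down as a small triage helper, which is handy when screening several proposals at once. Everything in this sketch is hypothetical: the `Proposal` fields, the string labels, and the flag messages are illustrative encodings of the five questions, and the 20 and 50 qubit cutoffs are simply the rough thresholds quoted in the checklist, not sharp boundaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    ansatz: str            # "hardware_efficient", "problem_tailored", or "warm_started"
    cost: str              # "local", "global", or "hybrid"
    n_qubits: int
    log_depth: bool        # depth grows at most logarithmically in n_qubits
    high_noise_deep: bool  # deep circuit on noisy hardware
    shot_limited: bool     # tight shot budget per gradient estimate


def plateau_flags(p: Proposal) -> List[str]:
    """Return the checklist items on which a proposal looks risky."""
    flags = []
    if p.ansatz == "hardware_efficient":
        flags.append("ansatz: generic hardware-efficient family")
    if p.cost == "global":
        flags.append("cost: global cost function")
    elif p.cost == "local" and not p.log_depth:
        flags.append("cost: local cost but depth beyond logarithmic")
    elif p.cost == "hybrid":
        flags.append("cost: hybrid, depends on relative weights")
    if p.n_qubits > 50:
        flags.append("size: above 50 qubits, explicit mitigation needed")
    elif p.n_qubits >= 20:
        flags.append("size: 20-50 qubits, matters for some ansatz families")
    if p.high_noise_deep:
        flags.append("noise: deep noisy circuit")
    if p.shot_limited:
        flags.append("budget: limited shots per gradient")
    return flags


example = Proposal("hardware_efficient", "global", 60, False, True, True)
for flag in plateau_flags(example):
    print(flag)
```

An empty list does not certify trainability; it only means none of the five questions raised an obvious objection.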

Exercises

1. Threshold qubit count

For a hardware-efficient ansatz with a shot-noise floor of $10^{-2}$ per gradient estimate (about $10{,}000$ shots), at what qubit count does the barren plateau make the gradient unmeasurable?

Answer:

Variance of gradient: $\sigma_g^2 \sim 1/2^n$. Standard deviation: $\sigma_g \sim 1/2^{n/2}$. We need $\sigma_g >$ shot noise $= 10^{-2}$, so $2^{n/2} < 100$, i.e. $n < 2\log_2 100 \approx 13.3$. Past 13 qubits, the gradient sits in the noise. At $n = 13$, $\sigma_g \approx 0.011$, which is barely measurable. At $n = 20$, $\sigma_g \approx 10^{-3}$, which is drowned. Without mitigation, hardware-efficient training is feasible only for very small systems.

2. Why local cost functions sometimes help

Cerezo 2021 says local cost functions at logarithmic depth avoid barren plateaus. Why does the locality matter for the gradient variance?

Answer:

A local cost function is a sum of local-observable expectations, each of which depends on only a few qubits’ marginals. At logarithmic depth, the backward light cone of each local observable covers only $O(\log n)$ qubits, so the relevant marginals don’t approach the Haar-random state on the full system. The Haar-random concentration that produces the global plateau acts only inside that light cone, giving a variance of order $2^{-O(\log n)} = 1/\text{poly}(n)$: polynomial, not exponential. Locality + shallow depth = local light cones = trainable. Lose either condition and the plateau returns.

3. Why warm starts work

A warm-started VQE ansatz is initialized to approximately the Hartree-Fock state. Why does this help with barren plateaus, and when does the warm-start advantage disappear during training?

Answer:

The Hartree-Fock state is not a Haar-random state; it is a structured, unentangled reference state (a single Slater determinant). The barren plateau theorem applies to typical (Haar-random) initialization; structured initialization is outside the high-probability barren-plateau region, so initial gradients can be measurable. However, as training progresses and the parameters drift toward more expressive states, the optimizer can wander into the Haar-typical region and gradients can vanish. The warm-start gets you into a useful subspace; it does not protect you forever. Practical warm-started VQE often combines warm starts with layerwise training to keep the optimizer in trainable territory.

4. The expressivity-trainability tension

Propose an explanation for why ansätze that can represent useful quantum-advantage states tend to have barren plateaus, while ansätze that avoid plateaus tend to be classically simulable.

Answer:

The barren plateau is fundamentally a Haar-typicality result. Ansätze that approach Haar-typical (highly entangled, broadly expressive) by design also approach the regime where local-observable expectations concentrate near the Haar mean — flat. By contrast, classically-simulable ansätze (Clifford-only, small-bond-dimension tensor networks) are far from Haar-typical and have measurable gradients precisely because they don’t fully explore the Hilbert space. There is a fundamental tension: being able to represent quantum-advantage states (high expressivity) implies Haar-typicality (flat landscape), and being trainable (non-flat landscape) implies low expressivity. The whole research program of barren-plateau mitigation is about finding “thin” ansatz families that capture the structure of the problem without becoming Haar-typical — squeezing through the narrow window between expressivity and trainability. Whether this window is wide enough for genuine quantum advantage is one of the deepest open questions in variational quantum computing.

Where this goes next

Tutorial 38 covers ADAPT-VQE, the most-cited barren-plateau mitigation strategy in chemistry, which grows the ansatz adaptively from a problem-defined operator pool. Tutorial 39 covers the parameter-shift rule, the standard exact-gradient method for variational circuits. Tutorial 40 covers quantum natural gradient, the geometric-optimization method that helps with some barren-plateau-adjacent landscapes. Together with this one, those three tutorials cover the practical machinery of variational quantum computing as it actually exists in 2026, including its hard limits.
