hardware intermediate · 14 min read · May 1, 2026

Randomized Benchmarking: How to Measure Gate Fidelity Without Tomography

Randomized benchmarking (RB) is the standard protocol for measuring gate fidelities on real quantum hardware. Run random Clifford sequences of varying length, measure how the survival probability decays, fit an exponential, and extract the per-gate error. RB is fast, scalable, and produces a single robust fidelity number that is the standard quoted hardware metric. This tutorial covers the protocol, the math behind why exponential decay happens, the variants (interleaved, simultaneous, mirror), and the limitations.

Prerequisites: Tutorial 25: The Clifford Group, Tutorial 33: Transmon Qubits

When a hardware vendor reports “two-qubit gate error of $3 \times 10^{-3}$ ,” the number almost always comes from randomized benchmarking (RB). RB is the dominant gate-fidelity-measurement protocol in production, used by every major quantum-hardware company (Google, IBM, Quantinuum, IonQ, Rigetti, etc.) and by virtually every academic group publishing hardware results.

The protocol is structurally simple: run random Clifford-circuit sequences of varying length $L$ . After each sequence, apply the inverse to return to the initial state and measure. Because the random Cliffords average noise to a depolarizing channel, the survival probability decays exponentially in $L$ . The decay constant gives the per-gate error.

RB is the workhorse of hardware characterization because:

It’s robust to state-preparation and measurement (SPAM) errors. The exponential fit isolates gate error from measurement error.
It’s scalable. Cost grows polynomially in qubit count — orders of magnitude cheaper than full tomography.
It returns a single robust number. No averaging over many error metrics; one fidelity per gate is the deliverable.
It works on any qubit platform. Same protocol on transmons, ions, atoms, photonic.

This tutorial covers the protocol, the math behind exponential decay, the main variants (interleaved, simultaneous, mirror), and the cases where RB is misleading.

The standard protocol

Randomized benchmarking on $n$ qubits:

Choose a length $L$ (number of random Cliffords).
Sample $L$ random Cliffords $C_1, \ldots, C_L$ uniformly from the Clifford group.
Compute the inverse $C_\text{inv} = (C_L \cdots C_1)^{-1}$ .
Build the circuit: apply $C_1, \ldots, C_L, C_\text{inv}$ .
Measure the qubits. Record the fraction of trials returning the all-zeros state.
Repeat for many random sequences at this length, average the survival probability.
Repeat for many lengths $L$ .
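
The sequence-building steps above can be sketched in a few lines of NumPy. This is a minimal illustration restricted to a single qubit (an assumption made to keep the group small: 24 elements, generated by H and S); the helper names `pauli_action`, `single_qubit_cliffords`, and `rb_sequence` are hypothetical and not taken from any library.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]], dtype=complex)


def pauli_action(U):
    """Integer key describing how U conjugates X, Y, Z (insensitive to global phase)."""
    paulis = [X, Y, Z]
    R = [[np.trace(P @ U @ Q @ U.conj().T).real / 2 for Q in paulis] for P in paulis]
    return tuple(np.round(R).astype(int).ravel().tolist())


def single_qubit_cliffords():
    """Enumerate the single-qubit Cliffords by closing {H, S} under multiplication."""
    found, frontier = {}, [I2]
    while frontier:
        U = frontier.pop()
        key = pauli_action(U)
        if key not in found:
            found[key] = U
            frontier += [H @ U, S @ U]
    return list(found.values())


def rb_sequence(length, cliffords, rng):
    """Sample C_1..C_L uniformly and append C_inv = (C_L ... C_1)^-1."""
    seq = [cliffords[rng.integers(len(cliffords))] for _ in range(length)]
    net = I2
    for C in seq:
        net = C @ net  # later gates multiply from the left
    return seq + [net.conj().T]


rng = np.random.default_rng(seed=0)
cliffords = single_qubit_cliffords()
sequence = rb_sequence(10, cliffords, rng)

# Noiseless check: the full sequence should return |0> to |0>.
total = I2
for gate in sequence:
    total = gate @ total
survival = abs(total[0, 0]) ** 2
print(len(cliffords), round(survival, 6))
```

Without noise the survival probability is 1 for every sequence; on hardware, each Clifford in the sequence contributes a small error and the average over many sequences decays with length.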

The survival probability as a function of $L$ should follow

$$P_\text{survival}(L) \;=\; A p^L + B,$$

where $p$ is the depolarizing (decay) parameter of the average Clifford, $A$ is a constant that absorbs SPAM errors, and $B$ is the asymptote (typically $1/2^n$ for $n$ qubits).

Fit the exponential. The per-gate error rate is

$$r \;=\; \frac{2^n - 1}{2^n} (1 - p).$$

For 2 qubits, $r = \frac{3}{4}(1-p)$. The reported “gate error” is $r$, the average error per Clifford.
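
A minimal sketch of the fitting step, under one simplifying assumption: the asymptote $B$ is fixed at $1/2^n$, which turns the fit into a straight line in $\log(P - B)$. The function name `fit_rb_decay` and the synthetic data are hypothetical. In practice a nonlinear least-squares fit with $B$ left free (for example with `scipy.optimize.curve_fit`) is the more common choice, since SPAM errors can shift the asymptote.

```python
import numpy as np


def fit_rb_decay(lengths, survival, n_qubits):
    """Fit P(L) = A * p**L + B with B fixed at 1/2**n; return (A, p, r)."""
    lengths = np.asarray(lengths, dtype=float)
    survival = np.asarray(survival, dtype=float)
    d = 2 ** n_qubits
    B = 1.0 / d
    # log(P - B) = log(A) + L * log(p) is linear in L (requires P > B everywhere).
    slope, intercept = np.polyfit(lengths, np.log(survival - B), 1)
    p, A = np.exp(slope), np.exp(intercept)
    r = (d - 1) / d * (1 - p)  # error per Clifford
    return A, p, r


# Synthetic, noise-free 2-qubit data generated from known parameters.
lengths = [1, 2, 4, 8, 16, 32, 64]
true_A, true_p = 0.72, 0.99
survival = [true_A * true_p ** L + 0.25 for L in lengths]

A_fit, p_fit, r_fit = fit_rb_decay(lengths, survival, n_qubits=2)
print(f"A = {A_fit:.4f}, p = {p_fit:.4f}, r = {r_fit:.4f}")
```

With $p = 0.99$ on 2 qubits, the formula above gives $r = \frac{3}{4} \times 0.01 = 0.0075$, which the fit recovers from the synthetic data.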

Why exponential decay

The math. A random Clifford $C$ acting on a noisy gate operation followed by the Clifford inverse $C^{-1}$ gives a “twirled” channel:

$$\mathcal{E}_\text{twirled} \;=\; \frac{1}{|\mathcal{C}|} \sum_C C^{-1} \mathcal{E} C.$$

For any trace-preserving noise channel $\mathcal{E}$, the version twirled over the full Clifford group is a depolarizing channel: it has the form $\mathcal{E}_\text{twirled}(\rho) = p \rho + (1-p) I/D$ for some $p$, where $D = 2^n$ is the Hilbert-space dimension.

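Applying $L$ twirled channels in sequence gives $\mathcal{E}_\text{twirled}^L(\rho) = p^L \rho + (1-p^L) I/D$. Starting from $|0\ldots0\rangle$ with ideal preparation and measurement, the survival probability is $p^L + (1-p^L)/D = (1 - 1/D)\,p^L + 1/D$, which is the fit form $A p^L + B$ with $A = 1 - 1/D$ and $B = 1/D$: exponential decay in $L$. SPAM errors change $A$ and $B$ but not $p$.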

The randomization-twirling step is what makes RB robust to specific error types: even if the underlying noise is coherent and complicated, the twirling averages it to a depolarizing channel, which is fully characterized by the single number $p$ .
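
The twirling claim can be checked numerically on one qubit. The sketch below assumes a purely coherent error (a small Z over-rotation by a hypothetical angle `theta`), represents channels as Pauli transfer matrices, and averages over the 24 single-qubit Cliffords; the helper names `ptm` and `clifford_ptms` are illustrative, not from any library.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]
H = (X + Z) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]], dtype=complex)


def ptm(U):
    """Pauli transfer matrix R[i, j] = Tr(P_i U P_j U^dagger) / 2 of a unitary channel."""
    return np.array(
        [[np.trace(P @ U @ Q @ U.conj().T).real / 2 for Q in PAULIS] for P in PAULIS]
    )


def clifford_ptms():
    """PTMs of all single-qubit Cliffords, found by closing {H, S} under multiplication."""
    found, frontier = {}, [I2]
    while frontier:
        U = frontier.pop()
        R = ptm(U)
        key = tuple(np.round(R).astype(int).ravel().tolist())  # signed permutation matrix
        if key not in found:
            found[key] = R
            frontier += [H @ U, S @ U]
    return list(found.values())


theta = 0.2  # hypothetical coherent over-rotation angle (radians)
U_err = np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])
R_err = ptm(U_err)

# Twirl: average C^-1 E C over the group. Clifford PTMs are orthogonal, so the inverse is R_C.T.
group = clifford_ptms()
R_twirled = sum(R_C.T @ R_err @ R_C for R_C in group) / len(group)

p = (np.trace(R_twirled) - 1) / 3
print(np.round(R_twirled, 6))
print(f"p = {p:.6f}")
```

The error's transfer matrix has off-diagonal $\pm\sin\theta$ entries that mix X and Y; after twirling, the matrix is $\mathrm{diag}(1, p, p, p)$ with $p = (1 + 2\cos\theta)/3$, which is exactly a depolarizing channel.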

Interleaved RB: measuring a specific gate

Standard RB measures the average Clifford error. To measure the error of a specific gate (say, CNOT), use interleaved RB (Magesan et al. 2012):

Run standard RB and extract $p_\text{ref}$ .
Run RB with the target gate $G$ inserted between every pair of random Cliffords. Extract $p_G$ .
The gate-specific error is $r_G = \frac{D-1}{D} (1 - p_G / p_\text{ref})$ .

This isolates the target gate’s contribution. Most reported “two-qubit gate error” numbers in production are from interleaved RB on the native two-qubit gate (CNOT, CZ, ZZ, etc.).
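
As a worked example of the interleaved formula, the sketch below plugs in two hypothetical fitted decay parameters for a 2-qubit experiment. The function name and the numbers are made up for illustration.

```python
def interleaved_gate_error(p_ref, p_gate, n_qubits):
    """r_G = (D - 1) / D * (1 - p_G / p_ref), with D = 2**n_qubits."""
    d = 2 ** n_qubits
    return (d - 1) / d * (1 - p_gate / p_ref)


# Hypothetical decay parameters from a reference run and an interleaved run.
p_ref, p_gate = 0.980, 0.970
r_gate = interleaved_gate_error(p_ref, p_gate, n_qubits=2)
print(f"r_G = {r_gate:.5f}")
```

Here the ratio $p_G / p_\text{ref} \approx 0.9898$ gives $r_G \approx 7.7 \times 10^{-3}$. If the interleaved run decays no faster than the reference ($p_G = p_\text{ref}$), the estimated gate error is zero.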

Simultaneous RB: measuring crosstalk

For multi-qubit systems, gates on one qubit can affect other qubits via crosstalk. Simultaneous RB (Gambetta et al. 2012) measures these effects:

Run RB independently on each qubit (or qubit group), measure $p_\text{ind}$ for each.
Run RB simultaneously on all qubits in parallel, measure $p_\text{sim}$ .
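The crosstalk-induced error on each qubit (or group) is the increase in its error rate from the isolated run to the simultaneous run, obtained by converting $p_\text{ind}$ and $p_\text{sim}$ to error rates and subtracting.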

If the gates are completely independent, simultaneous RB gives the same numbers as individual RB. If there’s crosstalk, the simultaneous numbers are worse. Simultaneous RB is the diagnostic for whether your hardware has scalable parallel-gate operation.
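
One simple way to summarize a simultaneous-RB run is the per-qubit increase in error per Clifford. The sketch below assumes single-qubit RB on each qubit and uses made-up decay parameters; the function names and qubit labels are hypothetical.

```python
def error_per_clifford(p, n_qubits=1):
    """Convert a decay parameter p into an error rate r = (D - 1) / D * (1 - p)."""
    d = 2 ** n_qubits
    return (d - 1) / d * (1 - p)


def crosstalk_error(p_ind, p_sim, n_qubits=1):
    """Per-qubit error added when all qubits are driven at the same time."""
    return {
        q: error_per_clifford(p_sim[q], n_qubits) - error_per_clifford(p_ind[q], n_qubits)
        for q in p_ind
    }


# Hypothetical fitted decay parameters for two neighbouring qubits.
p_ind = {"q0": 0.9990, "q1": 0.9985}
p_sim = {"q0": 0.9982, "q1": 0.9984}

added = crosstalk_error(p_ind, p_sim)
for qubit, delta in added.items():
    print(f"{qubit}: added error per Clifford = {delta:.2e}")
```

In this made-up data, q0 nearly doubles its error rate when q1 is driven in parallel, while q1 is almost unaffected, the kind of asymmetry that points to a specific crosstalk path.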

Mirror benchmarks: catching the noise RB misses

Standard RB averages noise to a depolarizing channel. This averaging hides certain types of noise — coherent errors that don’t depolarize cleanly, time-correlated noise that doesn’t average over individual gates, etc.

Mirror benchmarks (Proctor et al. 2022) capture some of this missed noise. Instead of random Cliffords, use circuits that look like real algorithm structures — repeated patterns, structured layers — and measure how their fidelity differs from RB predictions. The gap between RB-extrapolated fidelity and mirror-benchmark fidelity is a measure of “structured noise.”

In production, this is used to compare hardware vendors honestly: a vendor that reports $99.9\%$ from RB but $99\%$ from mirror benchmarks is suffering from structured noise that affects real algorithms.

A small RB demonstration

Concrete code that simulates RB on a 2-qubit system with depolarizing noise:

import numpy as np
import pennylane as qml

n_qubits = 2
dev = qml.device("default.mixed", wires=n_qubits, shots=1000)


def random_clifford():
    """Return a short random Clifford circuit on 2 qubits as a list of basic gates.

    Illustrative only: this does not sample uniformly from the 2-qubit Clifford group.
    """
    n_layers = np.random.randint(2, 4)
    gates = []
    for _ in range(n_layers):
        # Random single-qubit gate on each qubit: H, S, identity, or X
        for q in range(n_qubits):
            choice = np.random.randint(4)
            if choice == 0:
                gates.append(("H", q))
            elif choice == 1:
                gates.append(("S", q))
            elif choice == 2:
                pass  # identity
            else:
                gates.append(("X", q))
        # With probability 1/2, entangle the two qubits with a CNOT
        if np.random.randint(2):
            gates.append(("CNOT", 0, 1))
    return gates


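def invert_gate_list(gate_list):
    """Return the gate list that undoes `gate_list`.

    H, X and CNOT are self-inverse; S is undone by S-dagger ("Sdg").
    """
    swap = {"S": "Sdg", "Sdg": "S"}
    return [(swap.get(g[0], g[0]),) + tuple(g[1:]) for g in reversed(gate_list)]


def apply_gate_list(gate_list, p_depol):
    """Apply a gate list with depolarizing noise after each gate."""
    for g in gate_list:
        name, wires = g[0], list(g[1:])
        if name == "H":
            qml.Hadamard(wires=wires)
        elif name == "S":
            qml.S(wires=wires)
        elif name == "Sdg":
            qml.adjoint(qml.S)(wires=wires)
        elif name == "X":
            qml.PauliX(wires=wires)
        elif name == "CNOT":
            qml.CNOT(wires=wires)
        # Noise model: independent depolarizing channel on every wire the gate touched.
        for w in wires:
            qml.DepolarizingChannel(p_depol, wires=w)


@qml.qnode(dev)
def rb_circuit(L, p_depol, seed):
    """Run RB sequence of length L with given depolarizing noise rate."""
    np.random.seed(seed)
    cliffords = [random_clifford() for _ in range(L)]
    for c in cliffords:
        apply_gate_list(c, p_depol)
    # Inverse: undo the sequence gate by gate (reverse order, each gate inverted).
    # Real RB compiles the inverse into a single Clifford; this uncompiled inverse
    # doubles the number of noisy gates but still maps the ideal state back to |00>.
    flat = [g for c in cliffords for g in c]
    apply_gate_list(invert_gate_list(flat), p_depol)
    return qml.probs(wires=range(n_qubits))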


# Run RB for varying lengths.
p_depol = 0.005  # 0.5% per gate
for L in [4, 8, 16, 32, 64]:
    survival_probs = []
    for seed in range(50):  # 50 random sequences per length
        probs = rb_circuit(L, p_depol, seed)
        survival_probs.append(probs[0])  # |00> probability
    avg_survival = np.mean(survival_probs)
    print(f"Length L={L}: average |00> probability = {avg_survival:.4f}")

This simulation is illustrative only: a production implementation samples uniformly from the Clifford group and compiles the inverse of each sequence into a single Clifford, which gives a cleaner exponential decay. Production RB tools (the randomized-benchmarking routines in Cirq's `cirq.experiments` module, the `StandardRB` and `InterleavedRB` experiments in Qiskit Experiments) handle both steps correctly.

Common misconceptions

“RB is the gold standard for fidelity.” It’s the standard but not necessarily gold. RB averages over Clifford circuits and assumes Markovian noise. Real circuits often have non-Markovian, structured noise that RB doesn’t capture. Mirror benchmarks complement RB.

“RB gives the same number on different platforms.” No — different platforms have different native Clifford implementations, different circuit depths for RB sequences, and different averaging structures. Comparing RB numbers across platforms requires care.

“RB error rate is the per-gate error rate of any specific gate.” It’s the average error rate across the Clifford group. The specific gate (e.g., CNOT) error rate from interleaved RB is more useful for algorithm-level estimates.

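“RB scales to many qubits.” Up to a point. The $n$-qubit Clifford group grows super-exponentially in $n$, and although sampling from it is efficient, each sampled Clifford compiles to on the order of $n^2/\log n$ two-qubit gates. The survival probability then decays too quickly to fit once $n$ exceeds a handful of qubits. Cycle benchmarking and other variants scale better. Modern multi-qubit characterization typically uses simultaneous and mirror benchmarks rather than direct multi-qubit RB.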

Where this goes next

Tutorial 64 covers gate-set tomography — the more detailed (and more expensive) characterization that returns a full description of the gate’s action. Together, RB and GST cover the standard hardware-characterization toolkit. RB is the production fidelity number; GST is the diagnostic that explains where the fidelity loss is coming from.