hardware advanced · 14 min read · May 1, 2026

Gate-Set Tomography: The Detailed-and-Expensive Twin of Randomized Benchmarking

Gate-set tomography (GST) is among the most detailed hardware-characterization protocols in routine use. Unlike randomized benchmarking, which gives one number per gate, GST returns a full description of every gate's action, including coherent errors, incoherent errors, and SPAM errors. The price: a much larger data set (thousands of distinct circuits even for a single qubit) and complex post-processing. This tutorial covers what GST measures, why it differs from RB, and when the extra detail is worth the cost.

Prerequisites: Tutorial 47: Density Matrices and Mixed States, Tutorial 63: Randomized Benchmarking

Randomized benchmarking (tutorial 63) gives you a fidelity number per gate. Gate-set tomography (GST) gives you the full structure of every gate’s action: what unitary the gate is approximating, what coherent error pattern it adds, what incoherent error channels it suffers, and how SPAM (state-preparation and measurement) errors interact with all of this.

The cost: GST requires a much larger data set than RB. A standard 1-qubit GST run uses ~1000-3000 distinct circuits, each repeated ~1000 times for statistics. A 2-qubit GST run needs 10,000-100,000 circuits. The post-processing is a nonlinear self-consistent fit across all the circuits at once, which is computationally expensive.

GST is the right tool when:

Algorithm-level performance falls short of what RB predicts. RB says 99.9% but the real algorithm only gets 95%? GST can tell you why.
You need to debug specific error mechanisms. Coherent miscalibration, leakage, and drift are the kinds of errors GST is designed to expose.
You’re publishing characterization for fault-tolerance research. Detailed error models for QEC studies often need GST-level detail.

This tutorial covers what GST measures, the differences from RB, the open-source pyGSTi tool, and a decision rule for when GST is worth the data-collection cost.

What GST measures

For each gate $G$ in the gate set, GST estimates:

The gate’s process matrix: a $D^2 \times D^2$ complex matrix (for a $D$-dimensional Hilbert space) capturing how the gate transforms density matrices. This is more detailed than a unitary; it captures both unitary and non-unitary actions.
The state-preparation density matrix — the actual $\rho_0$ produced by the state-preparation procedure (which is not exactly $|0\rangle\langle 0|$ on real hardware due to thermal population, miscalibration, etc.).
The measurement POVM — the set of measurement operators actually implemented, accounting for readout errors.

The output is a self-consistent description: the gate process matrices, the state-prep matrix, and the measurement POVM all fit a single observed dataset. GST is “self-consistent tomography” — it does not assume any particular operation is exact.
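As a rough illustration of these three objects, the NumPy sketch below writes down a one-qubit process matrix, a slightly mixed prepared state, and a noisy two-outcome POVM, then predicts the outcome probabilities of a two-gate circuit. The noise values and the `superoperator` helper are illustrative assumptions, not pyGSTi output; pyGSTi itself typically works in a Pauli-product basis rather than the matrix-unit basis used here.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def superoperator(kraus_ops):
    """Return the D^2 x D^2 matrix S with vec(sum_k K rho K^dag) = S @ vec(rho).

    Uses row-major vectorization, for which vec(K rho K^dag) = (K kron conj(K)) vec(rho).
    """
    return sum(np.kron(K, K.conj()) for K in kraus_ops)

# Assumed noisy X(pi/2) gate: 0.03 rad over-rotation followed by 1% depolarizing noise.
theta, p = np.pi / 2 + 0.03, 0.01
U = np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X
kraus = [np.sqrt(1 - 3 * p / 4) * U] + [np.sqrt(p / 4) * P @ U for P in (X, Y, Z)]
gate = superoperator(kraus)  # 4 x 4 complex process matrix for D = 2

# Assumed imperfect state preparation: 2% excited-state population.
rho0 = np.array([[0.98, 0], [0, 0.02]], dtype=complex)

# Assumed imperfect readout: POVM effects for outcomes "0" and "1".
effect0 = np.array([[0.97, 0], [0, 0.04]], dtype=complex)
effect1 = I2 - effect0

# Predicted probabilities for the circuit "prepare, gate, gate, measure".
rho_out = (gate @ gate @ rho0.reshape(-1)).reshape(2, 2)
p0 = np.trace(effect0 @ rho_out).real
p1 = np.trace(effect1 @ rho_out).real
print("process matrix shape:", gate.shape)
print("P(0), P(1):", p0, p1)
```

GST works in the opposite direction: it starts from measured frequencies for many such circuits and fits the gate, state, and POVM parameters that best reproduce all of them.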

Why this is hard: SPAM circularity

Naive process tomography measures a single gate by:

Prepare the qubit in known states.
Apply the gate.
Measure.

The result is a fit for the gate’s process matrix. But this assumes the prepared states are exactly known and the measurement is exactly known. On real hardware, both have errors. If you fit the gate matrix assuming perfect SPAM, you get a biased estimate; the gate’s apparent fidelity reflects SPAM errors as well as gate errors.

GST breaks the circularity by jointly fitting the gates and the SPAM. The data set is large enough (and structured carefully) that all parameters can be self-consistently determined. The price is the larger data set and the nonlinear estimation procedure.
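The bias, and one way around it, can be seen in a small NumPy sketch. It uses the Pauli transfer matrix picture in the basis (I, X, Y, Z) and ideal infinite-shot probabilities, and it applies linear-inversion GST, the simplest GST estimator, rather than the full maximum-likelihood fit. All noise values, the fiducial set {empty, Gx, Gy, GxGx}, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, c, -s], [0, 0, s, c]])

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0], [0, c, 0, s], [0, 0, 1, 0], [0, -s, 0, c]])

def depolarize(lam):
    return np.diag([1.0, lam, lam, lam])

def spam_matrices(gx, gy, rho, effect):
    """Rows of A are <<E|F_i and columns of B are F_j|rho>> for fiducials {I, Gx, Gy, GxGx}."""
    fiducials = [np.eye(4), gx, gy, gx @ gx]
    A = np.array([effect @ F for F in fiducials])
    B = np.array([F @ rho for F in fiducials]).T
    return A, B

# What the experimenter believes: perfect gates, perfect |0> prep, perfect readout.
ideal_gx, ideal_gy = rot_x(np.pi / 2), rot_y(np.pi / 2)
ideal_rho = np.array([1, 0, 0, 1]) / np.sqrt(2)
ideal_effect = np.array([1, 0, 0, 1]) / np.sqrt(2)

# What the (hypothetical) hardware does: noisy gates, 2% prep error, 3% readout error.
true_gx = depolarize(0.99) @ rot_x(np.pi / 2 + 0.05)
true_gy = depolarize(0.99) @ rot_y(np.pi / 2)
true_rho = np.array([1, 0, 0, 0.96]) / np.sqrt(2)
true_effect = np.array([1, 0, 0, 0.94]) / np.sqrt(2)

A_true, B_true = spam_matrices(true_gx, true_gy, true_rho, true_effect)
A_ideal, B_ideal = spam_matrices(ideal_gx, ideal_gy, ideal_rho, ideal_effect)

# Measured probabilities: fiducial-gate-fiducial circuits, and fiducial-fiducial circuits.
P_gate = A_true @ true_gx @ B_true
P_gram = A_true @ B_true

# Naive process tomography inverts the SPAM it *assumes*; SPAM error leaks into the gate.
naive = np.linalg.inv(A_ideal) @ P_gate @ np.linalg.inv(B_ideal)

# Linear-inversion GST uses only measured data: inv(P_gram) @ P_gate = inv(B) @ G @ B.
lgst = np.linalg.inv(P_gram) @ P_gate

print("naive estimate error (Frobenius):", np.linalg.norm(naive - true_gx))
print("true trace, LGST trace:", np.trace(true_gx), np.trace(lgst))
print("true det,   LGST det:  ", np.linalg.det(true_gx), np.linalg.det(lgst))
```

The linear-inversion estimate equals the true gate only up to the similarity transformation by the unknown matrix B, so the sketch compares trace and determinant, which a similarity transformation cannot change.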

Gauge freedom

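A subtle issue: GST cannot determine the gate set uniquely. There is a “gauge” degree of freedom: for any invertible matrix $M$, replacing every gate process matrix $G$ by $M^{-1} G M$, the state-preparation vector $\rho_0$ by $M^{-1} \rho_0$, and each measurement effect $E$ by $E M$ (all in superoperator form) gives exactly the same predicted outcome probabilities. Unitary changes of basis are a special case. GST returns equivalence classes, not unique gate descriptions.

This matters for interpretation. A specific gate’s process matrix from GST is meaningful only modulo gauge. Circuit outcome probabilities and gate eigenvalues are gauge-invariant and reliable; metrics that compare a gate to its target (fidelity, diamond distance) are computed after gauge-fixing and depend somewhat on that choice.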

In practice, this is rarely a problem — gauge ambiguity is a structural fact of quantum tomography, not a bug. The pyGSTi tool reports gauge-fixed results that are visually interpretable, with a note that the specific representation is one of many gauge-equivalent ones.
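A minimal NumPy sketch of gauge freedom, under assumed values: a slightly miscalibrated one-qubit gate set is written as Pauli transfer matrices in the basis (I, X, Y, Z), then transformed by an invertible matrix M (here a 0.1 rad rotation of the Bloch-sphere frame about Z, one simple choice among many). The gate names, angles, and helper functions are hypothetical.

```python
import numpy as np

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, c, -s], [0, 0, s, c]])

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0], [0, c, 0, s], [0, 0, 1, 0], [0, -s, 0, c]])

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]])

def probability(effect, circuit, rho, gates):
    """Outcome probability <<E| G_n ... G_1 |rho>> for a list of gate names."""
    state = rho
    for name in circuit:
        state = gates[name] @ state
    return float(effect @ state)

def process_fidelity(ptm, target):
    """Process fidelity to a unitary target, from one-qubit Pauli transfer matrices."""
    return np.trace(target.T @ ptm) / 4

# Assumed estimated gate set: slightly miscalibrated X(pi/2) and Y(pi/2).
rho = np.array([1, 0, 0, 1]) / np.sqrt(2)
effect = np.array([1, 0, 0, 1]) / np.sqrt(2)
gates = {"Gx": rot_x(np.pi / 2 + 0.05), "Gy": rot_y(np.pi / 2 - 0.03)}

# Gauge transformation: G -> M^-1 G M, rho -> M^-1 rho, E -> E M.
M = rot_z(0.1)
M_inv = np.linalg.inv(M)
gauged_gates = {name: M_inv @ G @ M for name, G in gates.items()}
gauged_rho = M_inv @ rho
gauged_effect = effect @ M

circuits = [["Gx"], ["Gx", "Gy"], ["Gy", "Gx", "Gx"], ["Gx", "Gy", "Gy", "Gx", "Gy"]]
probs = [probability(effect, c, rho, gates) for c in circuits]
gauged_probs = [probability(gauged_effect, c, gauged_rho, gauged_gates) for c in circuits]

target_gx = rot_x(np.pi / 2)
fidelity_before = process_fidelity(gates["Gx"], target_gx)
fidelity_after = process_fidelity(gauged_gates["Gx"], target_gx)

print("probabilities match:", np.allclose(probs, gauged_probs))
print("Gx fidelity to target before/after gauge change:", fidelity_before, fidelity_after)
```

Every circuit probability is unchanged, yet the apparent fidelity of Gx to its target shifts. This is why tools gauge-fix (usually by moving the estimate as close to the target gate set as possible) before reporting per-gate numbers.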

What GST tells you that RB doesn’t

RB returns one fidelity number per gate. GST returns:

Coherent vs incoherent error split. RB lumps these into a single fidelity. GST can identify whether your error is a small unitary mis-rotation (coherent, can be calibrated out) vs decoherence (incoherent, fundamental).
Per-Pauli-channel error rates. GST tells you how much $X$ error vs $Y$ error vs $Z$ error you have on each gate. This is invisible in RB.
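Leakage. Population escaping to non-computational levels breaks the fixed-dimension model that standard GST assumes, so it shows up as a poor model fit, and leakage-aware GST models can quantify it. RB requires special variants to detect leakage.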
SPAM errors quantified. GST returns the actual prepared state and measurement operators, not assumed ones.
Drift. Repeated GST runs over time can detect drift in gate parameters that RB averages out.
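To make the per-Pauli item concrete, here is a small sketch of how Pauli error rates relate to a one-qubit process matrix in the Pauli basis (I, X, Y, Z). For a pure Pauli channel the matrix is diagonal, and the diagonal and the rates are related by a sign table. The example rates and function names are assumptions; for a general GST estimate the diagonal gives only the rates of the Pauli-twirled channel, and the off-diagonal entries carry the coherent part.

```python
import numpy as np

# Sign table in the order (I, X, Y, Z): +1 where two Paulis commute, -1 where they anticommute.
W = np.array([
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1,  1, -1],
    [1, -1, -1,  1],
])

def pauli_channel_ptm(rates):
    """Pauli transfer matrix of a channel applying I, X, Y, Z with the given probabilities."""
    return np.diag(W @ np.asarray(rates, dtype=float))

def pauli_rates_from_ptm(ptm):
    """Recover (p_I, p_X, p_Y, p_Z) from the PTM diagonal, using W @ W = 4 * identity."""
    return W @ np.diag(ptm) / 4

# Hypothetical gate error: 0.5% X errors, no Y errors, 1% Z errors.
assumed_rates = [0.985, 0.005, 0.0, 0.01]
ptm = pauli_channel_ptm(assumed_rates)
recovered = pauli_rates_from_ptm(ptm)

print("PTM diagonal:", np.diag(ptm))
print("recovered rates (I, X, Y, Z):", recovered)
```

An RB decay constant depends on these rates only through their sum, which is why the X/Y/Z breakdown is invisible to it.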

For algorithm-level performance debugging, this detail is what you need. Why is your algorithm getting 95% fidelity when RB predicts 99.9%? GST often identifies a coherent miscalibration or a leakage channel that RB hid.
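One mechanism behind such a gap can be sketched with closed-form one-qubit formulas. Assume two gates with the same 99.9% average fidelity: one whose only error is a small over-rotation (coherent), one whose only error is depolarizing noise (incoherent). When the gate is repeated n times with nothing in between to randomize the error, the coherent error adds in amplitude and the incoherent error adds in probability. The scenario and function names are illustrative assumptions.

```python
import numpy as np

def coherent_infidelity(delta, n):
    """Average gate infidelity of n repeated over-rotations by angle delta (one qubit)."""
    return (2 / 3) * np.sin(n * delta / 2) ** 2

def depolarizing_infidelity(r, n):
    """Average gate infidelity of n repeated depolarizing channels, each with infidelity r."""
    lam = 1 - 2 * r  # Bloch-vector shrink factor of a single channel
    return (1 - lam ** n) / 2

r = 1e-3                                  # 99.9% per-gate fidelity, as RB would report
delta = 2 * np.arcsin(np.sqrt(1.5 * r))   # over-rotation angle with the same infidelity

for n in (1, 5, 10, 20):
    print(n, coherent_infidelity(delta, n), depolarizing_infidelity(r, n))

coherent_20 = coherent_infidelity(delta, 20)
incoherent_20 = depolarizing_infidelity(r, 20)
```

Both columns agree at n = 1, but the coherent column grows roughly as n squared while the depolarizing column grows roughly linearly. GST separates the two cases because it fits the full process matrix rather than a single decay rate.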

pyGSTi: the production tool

The dominant open-source GST tool is pyGSTi (Sandia National Labs, Erik Nielsen et al.). Installation: pip install pygsti. Key features:

Pre-built circuit lists for standard 1-qubit and 2-qubit GST.
Self-consistent estimation via maximum-likelihood fitting.
Visualization of process matrices, error pattern decompositions, and drift analyses.
Integration with Cirq, Qiskit, and other circuit frameworks for running on real hardware.

Production characterization workflow:

Define the gate set to be characterized.
Generate the GST circuit list using pyGSTi’s protocol (roughly 1000-3000 circuits for one qubit, 10,000-100,000 for two).
Run the circuits on hardware, collect results.
Run pyGSTi’s MLE fitter to extract gate process matrices, SPAM, etc.
Analyze with pyGSTi’s reporting tools.

The full characterization for a 1-qubit gate set takes ~1000 circuits × 1000 shots = $10^6$ circuit evaluations, ~1 hour of dedicated hardware time on most platforms. For 2-qubit, multiply by 10-100×.
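The arithmetic behind this estimate is easy to rerun for other budgets. The sketch below uses only the figures quoted above; the function name and the derived time per shot (which lumps together reset, execution, readout, and control overhead) are illustrative assumptions.

```python
def gst_budget(circuits, shots_per_circuit, hardware_hours):
    """Return (total shots, implied milliseconds per shot) for a GST run."""
    total_shots = circuits * shots_per_circuit
    ms_per_shot = hardware_hours * 3600 * 1000 / total_shots
    return total_shots, ms_per_shot

# 1-qubit figures from the text: ~1000 circuits x 1000 shots in ~1 hour.
total_shots, ms_per_shot = gst_budget(1000, 1000, 1.0)
print("1-qubit shots:", total_shots)
print("implied time per shot (ms):", ms_per_shot)

# 2-qubit runs at 10x to 100x the data, assuming the same time per shot.
for factor in (10, 100):
    shots = total_shots * factor
    hours = shots * ms_per_shot / 1000 / 3600
    print(f"{factor}x: {shots} shots, about {hours:.0f} hours")
```

At a fixed time per shot, the 2-qubit range works out to tens of hours of dedicated hardware time, which is why 2-qubit GST is usually reserved for targeted debugging.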

A small GST sketch

GST’s full structure is too complex for a single short Python example, but the sketch below runs the whole workflow on simulated data using pyGSTi’s built-in 1-qubit model pack. The calls follow the pyGSTi 0.9.10+ API; names may differ in other versions, so check them against your installation.

# Conceptual GST workflow with pyGSTi (assumes pyGSTi 0.9.10 or later is installed).
# Simulated counts from a depolarized copy of the target model stand in for hardware.
import pygsti
from pygsti.modelpacks import smq1Q_XYI

# 1. Define the target gate set: idle, X(pi/2), Y(pi/2), ideal state prep and measurement.
target_model = smq1Q_XYI.target_model()

# 2. Generate the circuit list. GST circuits have the form
#    prep fiducial + (germ repeated) + measurement fiducial.
#    Fiducials are short sequences that give an informationally complete set of
#    effective preparations and measurements; germs are short sequences whose
#    repetition amplifies specific gate errors.
experiment_design = smq1Q_XYI.create_gst_experiment_design(max_max_length=64)
circuits = experiment_design.all_circuits_needing_data
print("Number of GST circuits:", len(circuits))

# 3. Run every circuit and record counts. On hardware this is the slow step;
#    here the noise strengths are arbitrary illustrative values.
noisy_model = target_model.depolarize(op_noise=0.01, spam_noise=0.001)
dataset = pygsti.data.simulate_data(noisy_model, circuits, num_samples=1000, seed=2026)

# 4. Run the self-consistent fit for gates and SPAM together.
data = pygsti.protocols.ProtocolData(experiment_design, dataset)
results = pygsti.protocols.StandardGST().run(data)

# 5. List the available estimates and the models stored in each. Key names vary
#    between pyGSTi versions, so inspect them rather than hard-coding one.
for estimate_name, estimate in results.estimates.items():
    print(estimate_name, list(estimate.models.keys()))

In production, the circuit-running step is what takes wall-clock time. The MLE step takes minutes-to-hours on a laptop for 1-qubit GST, hours-to-days for 2-qubit GST.

When GST is worth the cost

GST’s data cost is 100-1000× higher than RB. When is the extra detail worth it?

Hardware development. GST identifies specific error mechanisms; vendors use it during chip bring-up to debug fab and calibration issues.
Algorithm performance debugging. When RB predicts good fidelity but your algorithm doesn’t get it, GST tells you why.
Fault-tolerance research. Detailed error models for QEC simulations require gate-set-tomography-level data.
Drift characterization. Repeated GST over weeks identifies long-term drift in calibration.

When RB is enough:

Publishing a single fidelity number. RB is the standard.
Quick characterization during gate calibration loops. RB is fast and good enough for closed-loop optimization.
Comparing two hardware platforms. RB’s standardization makes apples-to-apples comparisons reasonable.

For most practical work in 2026, RB is the default characterization and GST is the deep dive when debugging is needed. Production hardware vendors run GST during development; users typically run only RB.

Common misconceptions

“GST is replaced by RB.” No — they answer different questions. RB gives a single fidelity number; GST gives the full gate-set description. Production uses both.

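“GST gauge freedom is a bug.” It’s a structural fact of quantum tomography. Predicted circuit probabilities are gauge-invariant. The specific matrix representation, and per-gate metrics such as fidelity to target that are computed from it, depend on the gauge choice, which is why results are reported after gauge-fixing.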

“GST is impossible at multi-qubit scale.” It’s expensive but feasible. 2-qubit GST is routine in research; 3-qubit GST has been demonstrated. For larger systems, simultaneous and parallel GST variants scale better than naive 4+ qubit GST.

“You always need GST for fault-tolerance research.” Not always. Some FT research uses simpler error models (Pauli-twirled or depolarizing approximations) for which RB data suffices. GST is needed when the noise has structure (correlations, coherent components) that simpler models miss.

Decision rule

Use GST when:

You’re debugging an unexplained gap between RB-predicted and algorithm-observed performance.
You’re publishing detailed error models for FT research.
You need to track drift over time.
You need to identify coherent miscalibrations vs decoherence.

Use RB when:

You just need a fidelity number.
You’re doing fast characterization during calibration loops.
You’re comparing platforms or fitting RB-style decay parameters.

GST is the heavy artillery; RB is the routine measurement. Most days, the routine measurement suffices.

Where this goes next

This concludes the final four-tutorial deepening of the hardware track (61-64). The track now has 9 tutorials covering hardware comparison (20), per-platform deep dives (33-36), cryogenic control (61), quantum control theory (62), randomized benchmarking (63), and gate-set tomography (64): the operational toolkit for a hardware engineer or characterization researcher. The hardware track is in good shape for any reader serious about quantum-hardware claims.