Quantum Outpost
algorithms advanced · 19 min read ·

Block Encoding: How Modern Quantum Algorithms Get Arbitrary Matrices Into Quantum Circuits

Block encoding is the most important quantum-algorithm primitive of the past decade. It is the technique that lets you embed an arbitrary matrix A into a unitary U so a quantum computer can act with A on a state — making Hamiltonian simulation, linear-system solvers, and the entire QSVT framework possible. This tutorial defines block encoding precisely, gives the key constructions (LCU, qubitization, sparse-matrix), and shows why this is the abstraction that ate quantum algorithms after 2015.

Prerequisites: Tutorial 11: QFT and Phase Estimation

If you learned quantum algorithms from a 2015 textbook, the canonical examples were Shor’s, Grover’s, and quantum phase estimation — algorithms designed around specific problem structures, each with its own ad-hoc trick. If you learn quantum algorithms in 2026, you instead start with a single primitive — block encoding — and almost every modern algorithm becomes a special case of how you process the block-encoded matrix afterward.

This shift is the single biggest change in the field’s algorithmic landscape since Shor. It is the reason Hamiltonian simulation got polynomial-time-optimal in the dependence on simulation accuracy, the reason linear-system solvers (HHL and successors) became practical to discuss as production tools rather than complexity-theory toys, and the reason quantum singular value transformation (QSVT, tutorial 30) is now the standard framing for quantum algorithm design.

This tutorial is the conceptual prerequisite for all of that. It defines block encoding precisely, gives the three standard constructions, and shows why the abstraction works.

The problem block encoding solves

Quantum computers natively act with unitary operators. The state-vector evolution is ψ=Uψ|\psi'\rangle = U |\psi\rangle for some unitary UU. Many problems we want to solve, however, are most naturally phrased in terms of non-unitary matrices:

  • A Hamiltonian HH is Hermitian but not generally unitary. Hamiltonian simulation needs eiHte^{-iHt}, which is unitary, but the underlying computational object is HH itself.
  • A linear system Ax=bAx = b has a non-unitary AA. To prepare a state proportional to A1bA^{-1} |b\rangle, you need to act with A1A^{-1}, which is also non-unitary.
  • A polynomial p(A)p(A) of a matrix AA is generically non-unitary even if AA is unitary.
  • A least-squares solution A+bA^+ b uses the pseudoinverse A+A^+, also non-unitary.

The classical workaround on a quantum computer is to embed the non-unitary matrix into a larger unitary — using extra ancilla qubits to host the rest of the unitary structure — and arrange things so that the action of the matrix on the data appears as one block of the larger unitary. The data qubits experience the matrix; the ancilla qubits absorb the unitary completion. This embedding is block encoding.

The definition

Let AA be an N×NN \times N matrix with N=2nN = 2^n. A unitary UU acting on n+an + a qubits is an (α,a,ϵ)(\alpha, a, \epsilon)-block encoding of AA if

Aα(0aI)U(0aI)    ϵ.\bigl\| A - \alpha \cdot (\langle 0 |^{\otimes a} \otimes I) \, U \, (|0\rangle^{\otimes a} \otimes I) \bigr\| \;\le\; \epsilon.

Three parameters are doing the work in this definition.

  • α\alpha is the subnormalization factor. Block encoding represents A/αA / \alpha inside the unitary; you have to multiply by α\alpha to recover the actual matrix. The subnormalization must satisfy αA\alpha \ge \|A\| so that A/α1\|A / \alpha\| \le 1, which is necessary for A/αA / \alpha to fit inside a unitary at all.
  • aa is the ancilla count. The block encoding uses aa extra qubits beyond the nn data qubits.
  • ϵ\epsilon is the encoding error. A perfect block encoding has ϵ=0\epsilon = 0; in practice we tolerate some approximation error.

The structural picture: write UU as an (n+a)(n+a)-qubit unitary. Project onto 0a|0\rangle^{\otimes a} on the ancilla register and read off the upper-left N×NN \times N block of UU. That block is approximately A/αA / \alpha.

In matrix form:

U  =  (A/α),U \;=\; \begin{pmatrix} A / \alpha & * \\ * & * \end{pmatrix},

where the * entries fill out the unitary completion and are not required to have any particular structure. As long as the upper-left block matches A/αA / \alpha to error ϵ\epsilon, the unitary is a valid block encoding.

Using a block encoding

Suppose UU is an (α,a,0)(\alpha, a, 0)-block encoding of AA, and you want to apply AA (up to subnormalization) to a state ψ|\psi\rangle. The protocol:

  1. Prepare 0aψ|0\rangle^{\otimes a} \otimes |\psi\rangle.
  2. Apply UU.
  3. Measure the ancilla register. Postselect on the all-zero outcome.

If you observe 0a|0\rangle^{\otimes a} on the ancilla, the data register is now in the state Aψ/(αAψ)A|\psi\rangle / (\alpha \cdot \|A|\psi\rangle\|) — exactly A/αA/\alpha acting on ψ|\psi\rangle, normalized. The probability of the desired ancilla outcome is Aψ2/α2\|A|\psi\rangle\|^2 / \alpha^2, which is 1\le 1 since αA\alpha \ge \|A\|.

Two practical observations:

  • Subnormalization costs success probability. A larger α\alpha means a smaller success probability per attempt. Aggressively reducing α\alpha (making the block encoding “tight”) is one of the main optimization targets in block-encoding constructions.
  • Amplitude amplification fixes most of the success-probability problem. Tutorial 10 showed that Grover-style amplitude amplification turns a probability-pp event into a probability-Θ(1)\Theta(1) event in O(1/p)O(1/\sqrt{p}) rounds. So a block encoding with α=A\alpha = \|A\| and amplitude amplification gives an effective O(α)O(\alpha) overhead — modest, polynomial, and tunable.

Construction 1: Linear combination of unitaries (LCU)

The most general block-encoding construction is linear combination of unitaries (LCU), introduced by Childs and Wiegert in 2012 in the context of Hamiltonian simulation but quickly recognized as a universal block-encoding technique.

Suppose you can write AA as

A  =  j=0L1αjVj,A \;=\; \sum_{j=0}^{L-1} \alpha_j \, V_j,

where each VjV_j is unitary and αj\alpha_j is a real coefficient (positive WLOG). Define α=jαj\alpha = \sum_j \alpha_j.

The LCU block encoding uses two ancilla registers: a log2L\lceil \log_2 L \rceil-qubit “select” register and the data register. Two unitaries do the work:

  • PREP prepares the state prep=jαj/αj|\text{prep}\rangle = \sum_j \sqrt{\alpha_j / \alpha} \, |j\rangle on the select register.
  • SEL is the controlled unitary jjjVj\sum_j |j\rangle\langle j| \otimes V_j — applies VjV_j to the data when the select register is in state j|j\rangle.

The full circuit U=(PREPI)SEL(PREPI)U = (\text{PREP}^\dagger \otimes I) \cdot \text{SEL} \cdot (\text{PREP} \otimes I) is an (α,log2L,0)(\alpha, \lceil \log_2 L \rceil, 0)-block encoding of AA. The subnormalization is α=jαj\alpha = \sum_j \alpha_j; the ancilla cost is logarithmic in the number of terms.

This construction is universal in the sense that any matrix AA with a known decomposition into unitaries (e.g., a Pauli decomposition: A=jαjPjA = \sum_j \alpha_j P_j where each PjP_j is a Pauli string) admits an LCU block encoding immediately. The cost depends on the decomposition: short decompositions (few terms, small α\alpha) give cheap block encodings.

Construction 2: Qubitization

For Hermitian AA, qubitization (Low and Chuang, 2017) gives a particularly elegant block encoding. The idea: rather than building UU from a sum, build it from a single oracle that “walks” on the eigenvalues of AA.

Given an oracle that prepares the state ψλ|\psi_\lambda\rangle — a uniform superposition over eigenvectors of AA weighted by their eigenvalues — qubitization constructs a single unitary WW that acts as a 2D rotation on each eigenvalue-labeled subspace. Specifically, the eigenvalues of WW are e±iarccos(λj/α)e^{\pm i \arccos(\lambda_j / \alpha)} for each eigenvalue λj\lambda_j of AA.

Why this matters: WW is a block encoding of A/αA / \alpha in a stronger sense — it not only embeds AA but does so in a way that all eigenvalue-dependent subblocks are 2D rotations whose angles are simple functions of the eigenvalues. This structure is what enables QSVT (tutorial 30) to apply arbitrary polynomials to AA by composing rotations on these 2D subspaces.

Qubitization typically uses fewer ancillas than LCU — often just 1 — and achieves the optimal subnormalization α=A\alpha = \|A\| for many matrix families. The catch: it requires the matrix to come with a specific oracle structure, which not all problems naturally provide.

Construction 3: Sparse-matrix block encoding

Many physically interesting matrices are sparse: the number of nonzero entries per row is s=O(poly(n))s = O(\text{poly}(n)), much less than the N=2nN = 2^n matrix dimension. A sparse-matrix block encoding takes AA specified by two oracles:

  • OFO_F: given (j,k)(j, k) where jj is a row and kk is a column-index counter, return the column index of the kk-th nonzero entry in row jj.
  • OHO_H: given (j,k)(j, k), return the value Aj,kA_{j, k} (or its compressed-precision representation).

The block encoding combines OFO_F and OHO_H via an LCU-like trick to produce a unitary UU that block-encodes A/(sAmax)A / (s \cdot \|A\|_{\max}), where Amax\|A\|_{\max} is the largest matrix entry by absolute value. Subnormalization: α=sAmax\alpha = s \cdot \|A\|_{\max}. Ancilla cost: O(n+logs)O(n + \log s). Query cost to the oracles: O(1)O(1) per block-encoding application.

This construction is the workhorse for Hamiltonian simulation of physically motivated systems (lattice models, molecular Hamiltonians) where sparsity is natural and the oracles are easy to construct from the problem data. Most of the published Hamiltonian-simulation algorithms in 2026 use a sparse-matrix block encoding under the hood.

A small Python example

Here is the Pauli-decomposition LCU construction made concrete. We block-encode a 4×44 \times 4 Hermitian matrix AA as a sum of Pauli strings, then verify the upper-left block of the resulting unitary equals A/αA / \alpha.

import numpy as np
from itertools import product

# Pauli matrices.
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = {"I": I, "X": X, "Y": Y, "Z": Z}

def pauli_decompose(A: np.ndarray) -> dict:
    """Decompose a 2^n x 2^n Hermitian matrix into Pauli strings."""
    n = int(np.log2(A.shape[0]))
    coeffs = {}
    for labels in product("IXYZ", repeat=n):
        P = PAULIS[labels[0]]
        for label in labels[1:]:
            P = np.kron(P, PAULIS[label])
        coef = np.trace(P @ A) / (2**n)
        if abs(coef) > 1e-10:
            coeffs["".join(labels)] = coef.real
    return coeffs

def lcu_block_encoding(coeffs: dict) -> np.ndarray:
    """Build the LCU block encoding for a Pauli-decomposed Hermitian matrix.
    Returns the unitary U on (a + n) qubits, where 2^a >= number of Pauli terms.
    """
    n = len(next(iter(coeffs)))                       # qubits in data register
    L = len(coeffs)                                   # number of Pauli terms
    a = int(np.ceil(np.log2(max(L, 2))))              # ancilla qubits
    L_padded = 2**a

    alphas = np.array(list(coeffs.values()))
    alphas_padded = np.zeros(L_padded)
    alphas_padded[:L] = np.abs(alphas)
    signs = np.sign(alphas)

    alpha_sum = alphas_padded.sum()
    # PREP: prepares sqrt(alpha_j / alpha_sum) |j>.
    prep_amplitudes = np.sqrt(alphas_padded / alpha_sum)
    prep = np.zeros((L_padded, L_padded), dtype=complex)
    prep[:, 0] = prep_amplitudes
    # Fill remaining columns to make prep unitary (Gram-Schmidt).
    for col in range(1, L_padded):
        v = np.zeros(L_padded); v[col] = 1
        for prev in range(col):
            v = v - (prep[:, prev].conj() @ v) * prep[:, prev]
        nv = np.linalg.norm(v)
        if nv > 1e-10:
            prep[:, col] = v / nv

    # SEL: block diagonal of signed Pauli strings.
    pauli_strings = list(coeffs.keys())
    sel = np.zeros((L_padded * 2**n, L_padded * 2**n), dtype=complex)
    for j in range(L_padded):
        if j < L:
            P = PAULIS[pauli_strings[j][0]]
            for label in pauli_strings[j][1:]:
                P = np.kron(P, PAULIS[label])
            sel[j*2**n:(j+1)*2**n, j*2**n:(j+1)*2**n] = signs[j] * P
        else:
            sel[j*2**n:(j+1)*2**n, j*2**n:(j+1)*2**n] = np.eye(2**n)

    U = np.kron(prep.conj().T, np.eye(2**n)) @ sel @ np.kron(prep, np.eye(2**n))
    return U, alpha_sum, a

# Test: block-encode a random 4x4 Hermitian matrix.
np.random.seed(0)
A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
A = (A + A.conj().T) / 2

coeffs = pauli_decompose(A)
print(f"Pauli decomposition has {len(coeffs)} terms; alpha = {sum(abs(c) for c in coeffs.values()):.3f}")

U, alpha, a = lcu_block_encoding(coeffs)
print(f"Block encoding uses {a} ancilla qubits.")
print(f"Subnormalization alpha = {alpha:.3f}")

# Verify: upper-left 4x4 block of U equals A / alpha.
upper_left = U[:4, :4]
recovered_A = alpha * upper_left
print(f"Reconstruction error ||A - alpha * upper_left||: {np.linalg.norm(A - recovered_A):.2e}")

Sample output:

Pauli decomposition has 16 terms; alpha = 4.387
Block encoding uses 4 ancilla qubits.
Subnormalization alpha = 4.387
Reconstruction error ||A - alpha * upper_left||: 1.57e-15

The construction is exact up to floating-point precision. Four data dimensions plus four ancilla qubits gives a 256-dimensional unitary whose upper-left 4×44 \times 4 block is A/αA / \alpha.

This is the LCU recipe in 50 lines of Python. Production block-encoding implementations in Qiskit or Cirq are more sophisticated (they avoid materializing the full unitary, they use efficient Pauli SELECT circuits, they exploit sparsity), but the structural picture is identical.

Cost summary

The key cost parameters for any block encoding:

ParameterWhat it measuresTypical scaling
α\alpha (subnormalization)Amplitude-amplification overheadO(A1)O(\|A\|_1) for LCU; O(A)O(\|A\|) for qubitization
aa (ancilla count)Qubit overheadO(logL)O(\log L) for LCU; O(1)O(1) for qubitization on 1-sparse oracles
query complexityCost per useO(1)O(1) to oracles; constructions vary
ϵ\epsilon (encoding error)AccuracyTunable; usually ϵ\epsilon is propagated through the rest of the algorithm

The art of designing a quantum algorithm using block encoding is choosing the construction whose cost parameters match the structure of the problem. Pauli-decomposition LCU is great when the Pauli decomposition is short. Qubitization is great when the matrix has natural eigenvalue access. Sparse-matrix block encoding is great when the matrix is sparse and has constructive oracles.

Why this took over the field

Three reasons block encoding became the algorithmic abstraction of choice after 2015:

  1. Compositional algebra. Block encodings of AA and BB can be combined into a block encoding of A+BA + B, ABA \cdot B, AA^*, or any polynomial p(A)p(A) using QSVT. This means once you have block encoded the relevant matrices, you can reason about quantum algorithms purely in terms of matrix arithmetic, with no need to redo the circuit-level work.
  2. Optimal Hamiltonian simulation. Block-encoding-based Hamiltonian simulation (using qubitization or QSVT) achieves the optimal O(αt+log(1/ϵ))O(\alpha t + \log(1/\epsilon)) scaling — strictly better than any product-formula approach on most physical Hamiltonians. This is the reason every modern Hamiltonian-simulation algorithm uses a block encoding under the hood (tutorial 31).
  3. Algorithmic uniformity. A single primitive (block encoding) plus a single transformation (QSVT) generates Hamiltonian simulation, linear-system solvers, eigenvalue estimation, ground-state preparation, and many more algorithms. The pre-2015 quantum-algorithm zoo was a collection of disparate tricks; the post-2015 zoo is largely a list of block-encoding recipes plus polynomial choices in QSVT.

This is why a 2026 quantum-algorithms class spends weeks on block encoding before getting to specific algorithms. The primitive is the algorithm.

Common misconceptions

“Block encoding is just LCU.” LCU is one construction. Qubitization, sparse-matrix oracles, and direct unitary completions are also standard. The umbrella concept is broader; LCU is the easiest entry point.

“Subnormalization is a fundamental quantum-physics overhead.” No. The subnormalization is a consequence of how we choose to embed a non-unitary AA into a unitary UU; tighter encodings have smaller α\alpha, better encodings (qubitization) achieve α=A\alpha = \|A\| which is the minimum allowed. Subnormalization is a design choice, not a physical law.

“Block encoding is a complexity-theory abstraction with no practical impact.” False. Modern Hamiltonian-simulation libraries (in Qiskit, Cirq, PennyLane) use block-encoding-based methods as their default. Quantum machine-learning algorithms post-2018 use block encodings for matrix arithmetic. Most published quantum-algorithm research papers since 2018 cite block encoding centrally. It is the working primitive of the current field.

“Block encoding requires fault tolerance to be useful.” No. Many small-scale block-encoding demonstrations have been published on NISQ hardware. The full power of block encoding (long algorithms, deep QSVT polynomials) does need fault tolerance, but the primitive itself is hardware-agnostic.

Decision rule

When you read a quantum-algorithm paper or proposal, ask:

  1. Is the algorithm framed in terms of block encoding? If yes, the analysis is probably modern and clean. If no, it is either a pre-2015 algorithm being re-presented, or one of the small set of algorithms (Grover, QFT) that does not naturally fit the framework.
  2. What is the subnormalization α\alpha? This is the dominant cost factor. A tight α\alpha is the difference between practical and impractical.
  3. Which construction (LCU, qubitization, sparse oracle)? This determines the ancilla overhead and the access model. The construction has to match the available hardware capabilities.
  4. What polynomial is being applied via QSVT? If the algorithm is post-2017, it almost certainly is a QSVT polynomial. The polynomial choice is the algorithm’s “secret sauce.” Ask what it is and why.

Understanding these four questions is the difference between reading modern quantum-algorithm papers fluently and reading them as a sequence of tricks.

Exercises

1. Subnormalization minimum

Show that any block encoding of a matrix AA must have αA\alpha \ge \|A\| (operator norm). Why is this a fundamental lower bound?

Show answer

The block encoding embeds A/αA / \alpha as an upper-left block of a unitary UU. Any block of a unitary has operator norm at most 1: A/αU=1\|A / \alpha\| \le \|U\| = 1. Therefore A/α1\|A\| / \alpha \le 1, i.e., αA\alpha \ge \|A\|. The minimum is achieved by qubitization for many matrix families. LCU achieves α=A1=jαj\alpha = \|A\|_1 = \sum_j |\alpha_j| which is in general strictly larger than the operator norm — the LCU subnormalization is tight on some matrices but loose on others.

2. Pauli decomposition cost

Decompose the 2-qubit Hamiltonian H=XX+YY+ZZH = X \otimes X + Y \otimes Y + Z \otimes Z into Pauli strings. What is the LCU subnormalization α\alpha? What is the operator norm H\|H\|? Are they equal?

Show answer

The Pauli decomposition is given: 3 terms, each with coefficient 1. So αLCU=jαj=3\alpha_\text{LCU} = \sum_j |\alpha_j| = 3. The operator norm: H=XX+YY+ZZH = X \otimes X + Y \otimes Y + Z \otimes Z is the well-known SWAP-related matrix; its eigenvalues are ±3,±1\pm 3, \pm 1 (it’s 2SWAPI2 \cdot \text{SWAP} - I), so H=3\|H\| = 3. In this case LCU is tight — αLCU=H\alpha_\text{LCU} = \|H\|. The terms align constructively. For more complicated Hamiltonians, LCU is generically loose.

3. When LCU is loose

Construct a Hermitian matrix AA where the LCU subnormalization αLCU\alpha_\text{LCU} exceeds A\|A\| by a factor of 2. Hint: think about cancellations in a Pauli decomposition.

Show answer

Take A=XX=0A = X - X = 0. Trivially zero with LCU coefficients 1+(1)1 + (-1), so αLCU=2\alpha_\text{LCU} = 2 but A=0\|A\| = 0. The ratio is infinite. For a non-trivial example: A=XZ=iYA = X \cdot Z = -iY, which has A=1\|A\| = 1, but if you naively decompose as A=(X+Z)2/2(X2+Z2)/2=A = (X + Z)^2 / 2 - (X^2 + Z^2)/2 = \ldots with multiple cancelling terms, you can get αLCU\alpha_\text{LCU} as large as you like. The general lesson: LCU is loose precisely when the Pauli decomposition has cancellation between same-sign and opposite-sign terms. Qubitization or a smarter LCU regrouping fixes this.

4. Ancilla budget

You have a Hamiltonian on 50 qubits with 1,000 Pauli terms. Estimate the ancilla qubits needed for an LCU block encoding. What about a sparse-matrix block encoding if the Hamiltonian is 4-sparse?

Show answer

LCU: log21000=10\lceil \log_2 1000 \rceil = 10 ancillas, plus the 50 data qubits = 60 qubits total. Sparse-matrix: alog2(50)+log2(4)6+2=8a \approx \log_2(50) + \log_2(4) \approx 6 + 2 = 8 ancilla qubits. Sparse-matrix is somewhat cheaper in ancillas, but its subnormalization is α=sAmax=4Amax\alpha = s \cdot \|A\|_\text{max} = 4 \cdot \|A\|_\text{max}, which can be larger than the LCU subnormalization for matrices where the Pauli coefficients are well-distributed. The “right” choice depends on the structure of the Hamiltonian — most modern Hamiltonian-simulation papers benchmark both.

Where this goes next

Tutorial 30 takes block encoding and applies it via quantum singular value transformation (QSVT) — the technique that turns “I have a block encoding of AA” into “I can compute p(A)p(A) for any low-degree polynomial pp in essentially the same cost as computing one block-encoding round.” This is the abstraction that ties Hamiltonian simulation, linear-system solving, and eigenvalue estimation into a single framework.


Weekly dispatch

Quantum, for people who already code.

One serious tutorial per week, plus the industry moves that actually matter. No hype, no hand-waving.

Free. Unsubscribe anytime. We will never sell your email.