algorithms advanced · 19 min read · By LIPAI WANG · April 29, 2026

Block Encoding: How Modern Quantum Algorithms Get Arbitrary Matrices Into Quantum Circuits

Block encoding is the most important quantum-algorithm primitive of the past decade. It is the technique that lets you embed an arbitrary matrix A into a unitary U so a quantum computer can act with A on a state — making Hamiltonian simulation, linear-system solvers, and the entire QSVT framework possible. This tutorial defines block encoding precisely, gives the key constructions (LCU, qubitization, sparse-matrix), and shows why this is the abstraction that ate quantum algorithms after 2015.

Prerequisites: Tutorial 11: QFT and Phase Estimation

If you learned quantum algorithms from a 2015 textbook, the canonical examples were Shor’s, Grover’s, and quantum phase estimation — algorithms designed around specific problem structures, each with its own ad-hoc trick. If you learn quantum algorithms in 2026, you instead start with a single primitive — block encoding — and almost every modern algorithm becomes a special case of how you process the block-encoded matrix afterward.

This shift is the single biggest change in the field’s algorithmic landscape since Shor. It is the reason Hamiltonian simulation got polynomial-time-optimal in the dependence on simulation accuracy, the reason linear-system solvers (HHL and successors) became practical to discuss as production tools rather than complexity-theory toys, and the reason quantum singular value transformation (QSVT, tutorial 30) is now the standard framing for quantum algorithm design.

This tutorial is the conceptual prerequisite for all of that. It defines block encoding precisely, gives the three standard constructions, and shows why the abstraction works.

The problem block encoding solves

Quantum computers natively act with unitary operators. The state-vector evolution is $|\psi'\rangle = U |\psi\rangle$ for some unitary $U$ . Many problems we want to solve, however, are most naturally phrased in terms of non-unitary matrices:

A Hamiltonian $H$ is Hermitian but not generally unitary. Hamiltonian simulation needs $e^{-iHt}$ , which is unitary, but the underlying computational object is $H$ itself.
A linear system $Ax = b$ has a non-unitary $A$ . To prepare a state proportional to $A^{-1} |b\rangle$ , you need to act with $A^{-1}$ , which is also non-unitary.
A polynomial $p(A)$ of a matrix $A$ is generically non-unitary even if $A$ is unitary.
A least-squares solution $A^+ b$ uses the pseudoinverse $A^+$ , also non-unitary.

The classical workaround on a quantum computer is to embed the non-unitary matrix into a larger unitary — using extra ancilla qubits to host the rest of the unitary structure — and arrange things so that the action of the matrix on the data appears as one block of the larger unitary. The data qubits experience the matrix; the ancilla qubits absorb the unitary completion. This embedding is block encoding.

The definition

Let $A$ be an $N \times N$ matrix with $N = 2^n$ . A unitary $U$ acting on $n + a$ qubits is an $(\alpha, a, \epsilon)$ -block encoding of $A$ if

\bigl\| A - \alpha \cdot (\langle 0 |^{\otimes a} \otimes I) \, U \, (|0\rangle^{\otimes a} \otimes I) \bigr\| \;\le\; \epsilon.

Three parameters are doing the work in this definition.

$\alpha$ is the subnormalization factor. Block encoding represents $A / \alpha$ inside the unitary; you have to multiply by $\alpha$ to recover the actual matrix. The subnormalization must satisfy $\alpha \ge \|A\|$ so that $\|A / \alpha\| \le 1$ , which is necessary for $A / \alpha$ to fit inside a unitary at all.
$a$ is the ancilla count. The block encoding uses $a$ extra qubits beyond the $n$ data qubits.
$\epsilon$ is the encoding error. A perfect block encoding has $\epsilon = 0$ ; in practice we tolerate some approximation error.

The structural picture: write $U$ as an $(n+a)$ -qubit unitary. Project onto $|0\rangle^{\otimes a}$ on the ancilla register and read off the upper-left $N \times N$ block of $U$ . That block is approximately $A / \alpha$ .

In matrix form:

U \;=\; \begin{pmatrix} A / \alpha & * \\ * & * \end{pmatrix},

where the $*$ entries fill out the unitary completion and are not required to have any particular structure. As long as the upper-left block matches $A / \alpha$ to error $\epsilon$ , the unitary is a valid block encoding.

Using a block encoding

Suppose $U$ is an $(\alpha, a, 0)$ -block encoding of $A$ , and you want to apply $A$ (up to subnormalization) to a state $|\psi\rangle$ . The protocol:

Prepare $|0\rangle^{\otimes a} \otimes |\psi\rangle$ .
Apply $U$ .
Measure the ancilla register. Postselect on the all-zero outcome.

If you observe $|0\rangle^{\otimes a}$ on the ancilla, the data register is now in the state $A|\psi\rangle / (\alpha \cdot \|A|\psi\rangle\|)$ — exactly $A/\alpha$ acting on $|\psi\rangle$ , normalized. The probability of the desired ancilla outcome is $\|A|\psi\rangle\|^2 / \alpha^2$ , which is $\le 1$ since $\alpha \ge \|A\|$ .

Two practical observations:

Subnormalization costs success probability. A larger $\alpha$ means a smaller success probability per attempt. Aggressively reducing $\alpha$ (making the block encoding “tight”) is one of the main optimization targets in block-encoding constructions.
Amplitude amplification fixes most of the success-probability problem. Tutorial 10 showed that Grover-style amplitude amplification turns a probability- $p$ event into a probability- $\Theta(1)$ event in $O(1/\sqrt{p})$ rounds. So a block encoding with $\alpha = \|A\|$ and amplitude amplification gives an effective $O(\alpha)$ overhead — modest, polynomial, and tunable.

Construction 1: Linear combination of unitaries (LCU)

The most general block-encoding construction is linear combination of unitaries (LCU), introduced by Childs and Wiegert in 2012 in the context of Hamiltonian simulation but quickly recognized as a universal block-encoding technique.

Suppose you can write $A$ as

A \;=\; \sum_{j=0}^{L-1} \alpha_j \, V_j,

where each $V_j$ is unitary and $\alpha_j$ is a real coefficient (positive WLOG). Define $\alpha = \sum_j \alpha_j$ .

The LCU block encoding uses two ancilla registers: a $\lceil \log_2 L \rceil$ -qubit “select” register and the data register. Two unitaries do the work:

PREP prepares the state $|\text{prep}\rangle = \sum_j \sqrt{\alpha_j / \alpha} \, |j\rangle$ on the select register.
SEL is the controlled unitary $\sum_j |j\rangle\langle j| \otimes V_j$ — applies $V_j$ to the data when the select register is in state $|j\rangle$ .

The full circuit $U = (\text{PREP}^\dagger \otimes I) \cdot \text{SEL} \cdot (\text{PREP} \otimes I)$ is an $(\alpha, \lceil \log_2 L \rceil, 0)$ -block encoding of $A$ . The subnormalization is $\alpha = \sum_j \alpha_j$ ; the ancilla cost is logarithmic in the number of terms.

This construction is universal in the sense that any matrix $A$ with a known decomposition into unitaries (e.g., a Pauli decomposition: $A = \sum_j \alpha_j P_j$ where each $P_j$ is a Pauli string) admits an LCU block encoding immediately. The cost depends on the decomposition: short decompositions (few terms, small $\alpha$ ) give cheap block encodings.

Construction 2: Qubitization

For Hermitian $A$ , qubitization (Low and Chuang, 2017) gives a particularly elegant block encoding. The idea: rather than building $U$ from a sum, build it from a single oracle that “walks” on the eigenvalues of $A$ .

Given an oracle that prepares the state $|\psi_\lambda\rangle$ — a uniform superposition over eigenvectors of $A$ weighted by their eigenvalues — qubitization constructs a single unitary $W$ that acts as a 2D rotation on each eigenvalue-labeled subspace. Specifically, the eigenvalues of $W$ are $e^{\pm i \arccos(\lambda_j / \alpha)}$ for each eigenvalue $\lambda_j$ of $A$ .

Why this matters: $W$ is a block encoding of $A / \alpha$ in a stronger sense — it not only embeds $A$ but does so in a way that all eigenvalue-dependent subblocks are 2D rotations whose angles are simple functions of the eigenvalues. This structure is what enables QSVT (tutorial 30) to apply arbitrary polynomials to $A$ by composing rotations on these 2D subspaces.

Qubitization typically uses fewer ancillas than LCU — often just 1 — and achieves the optimal subnormalization $\alpha = \|A\|$ for many matrix families. The catch: it requires the matrix to come with a specific oracle structure, which not all problems naturally provide.

Construction 3: Sparse-matrix block encoding

Many physically interesting matrices are sparse: the number of nonzero entries per row is $s = O(\text{poly}(n))$ , much less than the $N = 2^n$ matrix dimension. A sparse-matrix block encoding takes $A$ specified by two oracles:

$O_F$ : given $(j, k)$ where $j$ is a row and $k$ is a column-index counter, return the column index of the $k$ -th nonzero entry in row $j$ .
$O_H$ : given $(j, k)$ , return the value $A_{j, k}$ (or its compressed-precision representation).

The block encoding combines $O_F$ and $O_H$ via an LCU-like trick to produce a unitary $U$ that block-encodes $A / (s \cdot \|A\|_{\max})$ , where $\|A\|_{\max}$ is the largest matrix entry by absolute value. Subnormalization: $\alpha = s \cdot \|A\|_{\max}$ . Ancilla cost: $O(n + \log s)$ . Query cost to the oracles: $O(1)$ per block-encoding application.

This construction is the workhorse for Hamiltonian simulation of physically motivated systems (lattice models, molecular Hamiltonians) where sparsity is natural and the oracles are easy to construct from the problem data. Most of the published Hamiltonian-simulation algorithms in 2026 use a sparse-matrix block encoding under the hood.

A small Python example

Here is the Pauli-decomposition LCU construction made concrete. We block-encode a $4 \times 4$ Hermitian matrix $A$ as a sum of Pauli strings, then verify the upper-left block of the resulting unitary equals $A / \alpha$ .

import numpy as np
from itertools import product

# Pauli matrices.
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = {"I": I, "X": X, "Y": Y, "Z": Z}

def pauli_decompose(A: np.ndarray) -> dict:
    """Decompose a 2^n x 2^n Hermitian matrix into Pauli strings."""
    n = int(np.log2(A.shape[0]))
    coeffs = {}
    for labels in product("IXYZ", repeat=n):
        P = PAULIS[labels[0]]
        for label in labels[1:]:
            P = np.kron(P, PAULIS[label])
        coef = np.trace(P @ A) / (2**n)
        if abs(coef) > 1e-10:
            coeffs["".join(labels)] = coef.real
    return coeffs

def lcu_block_encoding(coeffs: dict) -> np.ndarray:
    """Build the LCU block encoding for a Pauli-decomposed Hermitian matrix.
    Returns the unitary U on (a + n) qubits, where 2^a >= number of Pauli terms.
    """
    n = len(next(iter(coeffs)))                       # qubits in data register
    L = len(coeffs)                                   # number of Pauli terms
    a = int(np.ceil(np.log2(max(L, 2))))              # ancilla qubits
    L_padded = 2**a

    alphas = np.array(list(coeffs.values()))
    alphas_padded = np.zeros(L_padded)
    alphas_padded[:L] = np.abs(alphas)
    signs = np.sign(alphas)

    alpha_sum = alphas_padded.sum()
    # PREP: prepares sqrt(alpha_j / alpha_sum) |j>.
    prep_amplitudes = np.sqrt(alphas_padded / alpha_sum)
    prep = np.zeros((L_padded, L_padded), dtype=complex)
    prep[:, 0] = prep_amplitudes
    # Fill remaining columns to make prep unitary (Gram-Schmidt).
    for col in range(1, L_padded):
        v = np.zeros(L_padded); v[col] = 1
        for prev in range(col):
            v = v - (prep[:, prev].conj() @ v) * prep[:, prev]
        nv = np.linalg.norm(v)
        if nv > 1e-10:
            prep[:, col] = v / nv

    # SEL: block diagonal of signed Pauli strings.
    pauli_strings = list(coeffs.keys())
    sel = np.zeros((L_padded * 2**n, L_padded * 2**n), dtype=complex)
    for j in range(L_padded):
        if j < L:
            P = PAULIS[pauli_strings[j][0]]
            for label in pauli_strings[j][1:]:
                P = np.kron(P, PAULIS[label])
            sel[j*2**n:(j+1)*2**n, j*2**n:(j+1)*2**n] = signs[j] * P
        else:
            sel[j*2**n:(j+1)*2**n, j*2**n:(j+1)*2**n] = np.eye(2**n)

    U = np.kron(prep.conj().T, np.eye(2**n)) @ sel @ np.kron(prep, np.eye(2**n))
    return U, alpha_sum, a

# Test: block-encode a random 4x4 Hermitian matrix.
np.random.seed(0)
A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
A = (A + A.conj().T) / 2

coeffs = pauli_decompose(A)
print(f"Pauli decomposition has {len(coeffs)} terms; alpha = {sum(abs(c) for c in coeffs.values()):.3f}")

U, alpha, a = lcu_block_encoding(coeffs)
print(f"Block encoding uses {a} ancilla qubits.")
print(f"Subnormalization alpha = {alpha:.3f}")

# Verify: upper-left 4x4 block of U equals A / alpha.
upper_left = U[:4, :4]
recovered_A = alpha * upper_left
print(f"Reconstruction error ||A - alpha * upper_left||: {np.linalg.norm(A - recovered_A):.2e}")

Sample output:

Pauli decomposition has 16 terms; alpha = 4.387
Block encoding uses 4 ancilla qubits.
Subnormalization alpha = 4.387
Reconstruction error ||A - alpha * upper_left||: 1.57e-15

The construction is exact up to floating-point precision. Four data dimensions plus four ancilla qubits gives a 256-dimensional unitary whose upper-left $4 \times 4$ block is $A / \alpha$ .

This is the LCU recipe in 50 lines of Python. Production block-encoding implementations in Qiskit or Cirq are more sophisticated (they avoid materializing the full unitary, they use efficient Pauli SELECT circuits, they exploit sparsity), but the structural picture is identical.

Cost summary

The key cost parameters for any block encoding:

Parameter	What it measures	Typical scaling
$\alpha$ (subnormalization)	Amplitude-amplification overhead	$O(\\|A\\|_1)$ for LCU; $O(\\|A\\|)$ for qubitization
$a$ (ancilla count)	Qubit overhead	$O(\log L)$ for LCU; $O(1)$ for qubitization on 1-sparse oracles
query complexity	Cost per use	$O(1)$ to oracles; constructions vary
$\epsilon$ (encoding error)	Accuracy	Tunable; usually $\epsilon$ is propagated through the rest of the algorithm

The art of designing a quantum algorithm using block encoding is choosing the construction whose cost parameters match the structure of the problem. Pauli-decomposition LCU is great when the Pauli decomposition is short. Qubitization is great when the matrix has natural eigenvalue access. Sparse-matrix block encoding is great when the matrix is sparse and has constructive oracles.

Why this took over the field

Three reasons block encoding became the algorithmic abstraction of choice after 2015:

Compositional algebra. Block encodings of $A$ and $B$ can be combined into a block encoding of $A + B$ , $A \cdot B$ , $A^*$ , or any polynomial $p(A)$ using QSVT. This means once you have block encoded the relevant matrices, you can reason about quantum algorithms purely in terms of matrix arithmetic, with no need to redo the circuit-level work.
Optimal Hamiltonian simulation. Block-encoding-based Hamiltonian simulation (using qubitization or QSVT) achieves the optimal $O(\alpha t + \log(1/\epsilon))$ scaling — strictly better than any product-formula approach on most physical Hamiltonians. This is the reason every modern Hamiltonian-simulation algorithm uses a block encoding under the hood (tutorial 31).
Algorithmic uniformity. A single primitive (block encoding) plus a single transformation (QSVT) generates Hamiltonian simulation, linear-system solvers, eigenvalue estimation, ground-state preparation, and many more algorithms. The pre-2015 quantum-algorithm zoo was a collection of disparate tricks; the post-2015 zoo is largely a list of block-encoding recipes plus polynomial choices in QSVT.

This is why a 2026 quantum-algorithms class spends weeks on block encoding before getting to specific algorithms. The primitive is the algorithm.

Common misconceptions

“Block encoding is just LCU.” LCU is one construction. Qubitization, sparse-matrix oracles, and direct unitary completions are also standard. The umbrella concept is broader; LCU is the easiest entry point.

“Subnormalization is a fundamental quantum-physics overhead.” No. The subnormalization is a consequence of how we choose to embed a non-unitary $A$ into a unitary $U$ ; tighter encodings have smaller $\alpha$ , better encodings (qubitization) achieve $\alpha = \|A\|$ which is the minimum allowed. Subnormalization is a design choice, not a physical law.

“Block encoding is a complexity-theory abstraction with no practical impact.” False. Modern Hamiltonian-simulation libraries (in Qiskit, Cirq, PennyLane) use block-encoding-based methods as their default. Quantum machine-learning algorithms post-2018 use block encodings for matrix arithmetic. Most published quantum-algorithm research papers since 2018 cite block encoding centrally. It is the working primitive of the current field.

“Block encoding requires fault tolerance to be useful.” No. Many small-scale block-encoding demonstrations have been published on NISQ hardware. The full power of block encoding (long algorithms, deep QSVT polynomials) does need fault tolerance, but the primitive itself is hardware-agnostic.

Decision rule

When you read a quantum-algorithm paper or proposal, ask:

Is the algorithm framed in terms of block encoding? If yes, the analysis is probably modern and clean. If no, it is either a pre-2015 algorithm being re-presented, or one of the small set of algorithms (Grover, QFT) that does not naturally fit the framework.
What is the subnormalization $\alpha$ ? This is the dominant cost factor. A tight $\alpha$ is the difference between practical and impractical.
Which construction (LCU, qubitization, sparse oracle)? This determines the ancilla overhead and the access model. The construction has to match the available hardware capabilities.
What polynomial is being applied via QSVT? If the algorithm is post-2017, it almost certainly is a QSVT polynomial. The polynomial choice is the algorithm’s “secret sauce.” Ask what it is and why.

Understanding these four questions is the difference between reading modern quantum-algorithm papers fluently and reading them as a sequence of tricks.

Exercises

1. Subnormalization minimum

Show that any block encoding of a matrix $A$ must have $\alpha \ge \|A\|$ (operator norm). Why is this a fundamental lower bound?

Show answer

2. Pauli decomposition cost

Decompose the 2-qubit Hamiltonian $H = X \otimes X + Y \otimes Y + Z \otimes Z$ into Pauli strings. What is the LCU subnormalization $\alpha$ ? What is the operator norm $\|H\|$ ? Are they equal?

Show answer

The Pauli decomposition is given: 3 terms, each with coefficient 1. So $\alpha_\text{LCU} = \sum_j |\alpha_j| = 3$ . The operator norm: $H = X \otimes X + Y \otimes Y + Z \otimes Z$ is the well-known SWAP-related matrix; its eigenvalues are $\pm 3, \pm 1$ (it’s $2 \cdot \text{SWAP} - I$ ), so $\|H\| = 3$ . In this case LCU is tight — $\alpha_\text{LCU} = \|H\|$ . The terms align constructively. For more complicated Hamiltonians, LCU is generically loose.

3. When LCU is loose

Construct a Hermitian matrix $A$ where the LCU subnormalization $\alpha_\text{LCU}$ exceeds $\|A\|$ by a factor of 2. Hint: think about cancellations in a Pauli decomposition.

Show answer

Take $A = X - X = 0$ . Trivially zero with LCU coefficients $1 + (-1)$ , so $\alpha_\text{LCU} = 2$ but $\|A\| = 0$ . The ratio is infinite. For a non-trivial example: $A = X \cdot Z = -iY$ , which has $\|A\| = 1$ , but if you naively decompose as $A = (X + Z)^2 / 2 - (X^2 + Z^2)/2 = \ldots$ with multiple cancelling terms, you can get $\alpha_\text{LCU}$ as large as you like. The general lesson: LCU is loose precisely when the Pauli decomposition has cancellation between same-sign and opposite-sign terms. Qubitization or a smarter LCU regrouping fixes this.

4. Ancilla budget

You have a Hamiltonian on 50 qubits with 1,000 Pauli terms. Estimate the ancilla qubits needed for an LCU block encoding. What about a sparse-matrix block encoding if the Hamiltonian is 4-sparse?

Show answer

LCU: $\lceil \log_2 1000 \rceil = 10$ ancillas, plus the 50 data qubits = 60 qubits total. Sparse-matrix: $a \approx \log_2(50) + \log_2(4) \approx 6 + 2 = 8$ ancilla qubits. Sparse-matrix is somewhat cheaper in ancillas, but its subnormalization is $\alpha = s \cdot \|A\|_\text{max} = 4 \cdot \|A\|_\text{max}$ , which can be larger than the LCU subnormalization for matrices where the Pauli coefficients are well-distributed. The “right” choice depends on the structure of the Hamiltonian — most modern Hamiltonian-simulation papers benchmark both.

Where this goes next

Tutorial 30 takes block encoding and applies it via quantum singular value transformation (QSVT) — the technique that turns “I have a block encoding of $A$ ” into “I can compute $p(A)$ for any low-degree polynomial $p$ in essentially the same cost as computing one block-encoding round.” This is the abstraction that ties Hamiltonian simulation, linear-system solving, and eigenvalue estimation into a single framework.

The problem block encoding solves

The definition

Using a block encoding

Construction 1: Linear combination of unitaries (LCU)

Construction 2: Qubitization

Construction 3: Sparse-matrix block encoding

A small Python example

Cost summary

Why this took over the field

Common misconceptions

Decision rule

Exercises

1. Subnormalization minimum

2. Pauli decomposition cost

3. When LCU is loose

4. Ancilla budget

Where this goes next

Quantum, for people who already code.