About two years ago, I switched teams at Google to focus on fully homomorphic encryption (abbreviated FHE, or sometimes HE). Since then I’ve got to work on a lot of interesting projects, learning along the way about post-quantum cryptography, compiler design, and the ins and outs of fully homomorphic encryption.

If you’ve heard about FHE and you’re a software person, you’ve probably heard two things: it lets you run programs directly on encrypted data without ever decrypting it; and it’s still too slow to be useful for anything. But beyond that, there aren’t a lot of resources that provide a survey of the field that’s more accessible than a 40-page research paper (if you want that, here’s one from 2022). This article will provide that overview—technical, but still high level enough that you can get a better sense for how this stuff works and where the field is headed. There is also a nice brief history of FHE research over at FHE.org.

**Disclaimers:** Because I am two years young in this field, I feel obliged to
warn you that I’m not an expert. I haven’t published any original research in
FHE, and there are decently large gaps in my knowledge. However, I’m coming to
this field with an eye toward integrating as many useful techniques as possible
into my current main project: the HEIR compiler toolchain.
As such, I don’t have a bias toward any particular approach to FHE. I’m also
not going to include a timeline of publications or a citation graph of
incremental improvements. Instead, I’m going to focus on the subset of
techniques that, from my perspective, are at the forefront of the field.

Table of contents:

## The highest-level view

Here’s the explanation of homomorphic encryption I give to my fellow software engineers over lunch.

Homomorphic encryption lets you encrypt data in such a way that you can run programs on it without ever decrypting it. This means that the computer running the program has no access to the underlying data while running the program—neither via intermediate computed values, nor even the result. In particular, if a nefarious human had access to the machine’s raw memory, they still could not learn any information about the underlying data (without breaking the cryptography). A user sends the program an encrypted input, and when the program is done, the encrypted result is sent back to the user to decrypt.

Running a program on encrypted data sounds magical. It works by choosing an encryption scheme that is “compatible” with addition and multiplication in the following sense:

- Adding ciphertexts gives you an encryption of the sum of the underlying plaintexts.
- Multiplying two ciphertexts give you an encryption of the product of the underlying plaintexts.

Given this power, you can encrypt your data bit by bit, express your program as
a boolean circuit—an XOR gate is addition and an AND gate is multiplication—and
simulate the circuit. Since XOR and AND form a universal basis for boolean
logic, you can always decompose a circuit this way.^{1}

To hammer my point home about how FHE can—in theory—handle universal computation, both Conway’s Game of Life and a complete processor architecture have been implemented in FHE.

There are four main caveats to this. First, a boolean circuit alone is not enough to do everything computers can do. Circuits always terminate, so without doing something more, you can’t write programs that loop infinitely. Transcendental functions like sine and exponentiation have their own challenges, as emulating them with bit arithmetic is often infeasible for the reason given next.

Second, simulating circuits is slow, especially with the overhead of doing
cryptography for every gate. Generously, a modern boolean-gate FHE
implementation takes on the order of 1 millisecond to evaluate a single gate
without any special hardware acceleration. Compare that to an older CPU running
at 2 GHz and requiring 4 clock cycles to compute a single (32-bit!)
instruction, that’s on the order of 1 nanosecond per operation. If you did a
naive circuit decomposition without anything else, running FHE on CPU is at
least a million times slower than the corresponding unencrypted program.^{2}

Third, the cryptographic guarantee implies limitations on what sorts of optimizations you could use (even if you didn’t use the boolean circuit approach), because realizing that optimization would necessarily break the cryptography. For example, it is not be possible to write an FHE program that uses fewer instructions than the program’s worst-case input. Doing so would reveal information that the input is not the worst-case, which contradicts the security of the cryptography. Instead, an FHE program must eagerly compute all branches of if-statements, and then use a select statement to decide which result should be used. The same reasoning applies to loops: all loops must run through their maximum iteration count, and effectively this means that all loops must have statically-known upper bounds.

Fourth, there is a bandwidth concern. FHE encryption schemes generally increase the size of the data being encrypted, and the user must send the server a special set of encryption keys to enable the computation, which are relatively large as well. The special keys need only be generated and sent once and can be used for all future computations, but they can easily be gigabytes in size. In one example FHE scheme with lightweight keys, a ciphertext encrypting a single integer is on the order of 25 KB, and the special keys are about 0.5 GB. In others, 16,000 or more integers are packed into a single ciphertext of similar size, but the keys can be 10s of GiBs.

I will describe the various ways that FHE schemes work with or around these limitations in later sections.

## Commonalities of FHE schemes

This section will give a mildly-technical overview of how the encryption schemes underlying FHE work. The next section will provide more specific details of specific FHE schemes and their differing capabilities.

### LWE and RLWE

All modern FHE schemes encrypt their data using effectively the same cryptographic foundation: a noisy dot product. The simplest version of this actually used in an FHE scheme is called the Learning With Errors problem, or LWE.

LWE encryption supports encrypting small-bit-width integers by placing those bits in the most significant bits of a machine word—for simplicity, say it’s a 4-bit integer in the top-most bits of a 32-bit integer with all the other bits initialized to zero, and call that whole encoded plaintext $m$. The secret key $s$ is a random binary vector of some fixed length $n$ chosen to achieve a certain security—say $n=800$. Then to encrypt, you sample a random length-$n$ vector of 32-bit integers $a$, take a dot product with the secret key $s$, add the message $m$, and then add some random noise from a carefully-chosen distribution that is intimately tied to the security of the scheme. The encrypted value is both the random samples chosen, as well as the noise-masked result, i.e., $(a, a \cdot s + m + \textup{noise})$. More precisely, in code

```
def encrypt(m, s, a, noise):
# the 28 below would realistically come from a config
clean_product = np.dot(a, s) + (m << 28)
obfuscated_product = clean_product + noise
return np.append(a, obfuscated_product)
```

The noise distribution is the most mysterious part, and I won’t go into much
detail here, but in most cases it’s taken to be a symmetric integer-valued
Gaussian distribution, centered at zero with a tunable variance. Larger variance
implies higher security, but as I’ll discuss more later, it reduces the number
of homomorphic operations one can apply before the underlying message is corrupted,
which in turn hurts performance because expensive cleanup operations are needed
to recover from high noise ciphertexts.^{3}

LWE decryption then reverses this process: re-compute the dot product and subtract it from the output of the encryption. But at the end you need to apply a rounding step to remove the noise that was added to the ciphertext during encryption. Because the message is in the highest-order bits of the message (say, bits 28-31), and because the noise added was not very large, rounding to the nearest multiple of $2^{28}$ removes it.

```
def decrypt(ciphertext, secret_key):
obfuscated_plaintext = ciphertext[-1] - np.dot(ciphertext[:-1], secret_key)
return round_to_power_of_2(obfuscated_plaintext , 28)
```

The cryptographic guarantee is that, given a large sample of encryptions of a message with the same secret key, one cannot learn the secret key or the underlying message with any nontrivial probability above random guessing. This is effectively a hard machine learning problem: try to learn a fixed function $f(a) = a \cdot s + m$ from a family of possible functions parameterized by $s$ and $m$. The learner may sample uniformly random input-output pairs, but the outputs are slightly perturbed from their true values. Without the perturbance, this problem is equivalent to solving a linear system and would be easy. With the perturbance, nobody knows how to solve it in polynomial time. There is vast literature on why LWE (or even simpler problems) should be hard to solve. Here is a survey on the best known attacks.

LWE is the front-facing encryption used for the CGGI cryptosystem, described in more detail later, but most of the other FHE schemes use a generalization of LWE to polynomial arithmetic, called Ring Learning With Errors (RLWE).

In RLWE, the scalar multiplications and additions from LWE are upgraded to polynomial multiplications and additions. Instead of operations being done modulo a machine word size, operations are done modulo a “polynomial modulus,” and that modulus is critical to the security of the scheme.

In more detail, we start by picking a polynomial ring. We pick a polynomial degree as the maximum degree (say, $N = 1024$), the coefficients are always integers modulo some chosen modulus $q$ (it may be a machine word size or some large prime), and finally we pick a polynomial, usually $x^N + 1$, and represent the result of every operation as a remainder when divided by that polynomial. In math notation, this set of polynomials would be denoted $\mathbb{Z}/q\mathbb{Z}[x] / (x^N + 1)$. The requirements of the specification of $q, N,$ and the polynomial modulus differ from scheme to scheme. The modulus $x^N+1$ is standard because it enables efficient polynomial multiplication via the discrete fourier transform (DFT) or the number-theoretic transform (NTT). See my earlier article on how these special polynomial multiplications can be implemented, though I haven’t written about the NTT yet.

To encrypt, one or more small-integer messages are encoded into a polynomial in some scheme-dependent way (I wrote a whole article about this). The secret key is a list of random polynomials with binary coefficients, and the samples are random polynomials with uniformly random mod $q$ coefficients. Then you take a dot product, add the message, and add a similar “noise polynomial” to mask the result.

Because polynomials are “harder” than integers (for more details, see this article), the vector size $n$, which was 800 in my LWE example, is often set to 1 in RLWE. That is, the “dot product” from LWE is replaced by a single polynomial product. If you think about it, a polynomial product is a lot like a convolution, which is a twisty cousin of the simpler dot product. The polynomial degree replaces the dimension of the LWE vector for scaling security. That said, some schemes use two or more polynomials to get different trade-offs of security and performance.

The main advantage to using RLWE over LWE is that you can pack many messages into a single polynomial, and the homomorphic operations you do apply to all the messages. It’s the same sort of advantage as SIMD, but there are more restrictions on what SIMD operations are available on these packed messages that I’ll discuss in later sections.

Finally, many people have observed that an integer is just a polynomial with degree 0. And because RLWE started off with the vector dimension always being set to 1, you can take these two parameters (vector dimension and polynomial degree) to get a single mother-problem where both parameters can be variable. That mother problem is called “Module LWE,” and it’s beyond the scope of this article.

### Basic operations and dealing with noise

In both LWE and RLWE encryptions, the addition of two ciphertexts naturally corresponds to the addition of the underlying messages. Getting a homomorphic multiplication is harder, and the method is often scheme-dependent, but it’s possible. You might have noticed that LWE and RLWE support “small bit width integers” instead of the single bits I mentioned previously. Indeed, modern FHE practitioners take advantage of this and represent their circuits in terms of low-bit-width arithmetic operations, or take advantage of fixed-width representations to support decimal computations. Depending on the computation, this can have a dramatic benefit over representing everything in terms of pure boolean circuits.

In addition to operating on the underlying messages, homomorphic addition also adds the cryptographic noise in the two operands to the result ciphertext. Likewise, a homomorphic product roughly multiplies the noise in the two ciphertexts. This is clearly unsustainable. Once the noise exceeds the some-28 bits of space allocated for it, decryption will fail and the program result effectively becomes random noise.

By way of example, in a recent FHE program I worked on, I had 3-bit messages in the top of a 32-bit LWE plaintext. The cryptographic parameters I set up had my initial program starting with an effective bound of about 12 bits of noise (this is probably not secure enough for real world uses). A single addition adds one more bit of noise, and a multiplication doubles the noise. So we can do at most one multiplication followed by at most five or so additions before we are too noisy to continue.

So nearly all of the complexity in FHE is based around how to avoid noise
growth or how to reduce noise mid-program. The latter is called *bootstrapping*,
which deserves some special attention. The idea of bootstrapping is trippy. Its
technical details are beyond the scope of this article, but I can summarize the
main idea: you can *homomorphically decrypt an FHE ciphertext and re-encrypt
it under a new secret key* without ever seeing the underlying message. In doing so,
the noise of the resulting ciphertext is “reset” to a smaller
value. But to make it work, the
user has to provide the server with a special encryption of the user’s secret
key, which adds an extra security assumption called circular security, or
key-dependent message
security. With bootstrapping,
you can extend an FHE program indefinitely. The downside is that bootstrapping
can be expensive, taking milliseconds in some schemes and seconds to minutes in others.

If you don’t want to do bootstrapping, then you are left with putting a hard
limit on noise growth. This is often called *leveled homomorphic encryption*
because the noise growth is discretized into levels, and programs are
prohibited from exceeding the max level. These two techniques roughly split the
FHE community in half.

One side is colloquially named “boolean FHE,” and it represents a family of FHE implementations that focus on smaller encryptions and efficient bootstrapping. It excels at programs that require random access to particular bits, such as string processing or programs with lots of comparison operators. It does not do well with programs that implement large bit-width multiplications, because (at some level or another) this reduces to simulating a large circuit, bootstrapping at every gate. Boolean FHE is typically compute bound, and bootstrapping methods in this domain tend to require sequential accumulations of many polynomial multiplications.

The other side is colloquially named “arithmetic FHE.” These schemes represent a program as an arithmetic circuit (additions and multiplications that aren’t interpreted as bit operations), and all but forbids bootstrap operations because nobody has figured out how to make them fast. Instead, they increase the parameters so as to make the noise ceiling high enough (think hundreds of bits), and they use tricks to reduce the depth of the circuit so that the noise ceiling is never reached. More specifically, they want to reduce the multiplicative depth of the circuit—the number of multiplication ops on the longest path from input to output. Multiplications incur much more noise than additions, and the noise added by additions can be mostly ignored. Because arithmetic circuits can only formally compute polynomials, arithmetic FHE users (or in my case, compilers) must also contend with how to represent non-polynomial operations as polynomials. A ReLU activation function in a neural network, for example, might be approximated as a degree-3 polynomial. Then you have to consider the error introduced by that approximation on top of the ciphertext noise. But for all that hassle, arithmetic FHE excels at doing large amounts of SIMD arithmetic (packing many messages into one ciphertext), and this tends to work well for machine learning applications. One price to pay for this is that arithmetic FHE has much larger ciphertexts, more “special keys” to manage, and tends to be memory bound in addition to compute bound.

The last thing I’d like to mention in this section is that LWE-like problems often support additional cryptography tricks that become critical implementation details of FHE schemes. The most important one is the ability to switch encryption keys homomorphically. That is, given a special key switching key that encrypts a key $s_1$ using another key $s_2$, you can convert an LWE or RLWE encryption of a message $m$ using $s_1$ to an encryption of $m$ using $s_2$ without needing to decrypt and re-encrypt (and the noise is not reset like in bootstrapping). This requires an additional encryption key from the user. Other building blocks include changing the modulus of the coefficients of the encryption scheme homomorphically, applying automorphisms of the ring underlying an RLWE ciphertext (e.g., rotating the position of the messages encoded in the ciphertext), and packing or unpacking ciphertexts from LWE to or from RLWE.

### Residue number systems

In all of these schemes, one encounters the problem of wanting to use larger numbers than is available within the limits of the cryptosystem. This can be because the messages you’d like to encrypt (e.g., 32-bit integers) are larger than the allowed size of the inputs to the encryption function (say, for the parameters you’ve selected it’s 8-bit integers), or because the coefficients of the encrypted messages don’t fit in a machine word and you don’t want to incur the overhead of multi-precision integer arithmetic. The latter is common in arithmetic FHE, where one blows up the scheme’s parameters to avoid bootstrapping.

In such cases, FHE schemes often turn to the idea of a residue number system. This is based on the Chinese Remainder Theorem (CRT, also known as Sunzi’s theorem), which roughly states that a large number $x$ can be represented as a tuple of numbers $(x_1, \dots, x_k)$, with each $x_i \equiv x \mod n_i$ for some choices of $n_i$ (they must be co-prime, and their product must be larger than $x$). You can reconstruct the original $x$ from the tuple, and a residue number system represents numbers in terms of their “residues” (the modulo tuples). This extends to a similar statement about polynomial quotient rings that factor as a direct product of smaller quotient rings.

The amazing thing about residue number systems is that you can still compute with them. Adding or multiplying the tuples entry-wise (followed by the relevant modular reduction in each entry) is equivalent to adding or multiplying the original number.

In FHE this means we can take a large ciphertext, decompose it in an RNS, and do all subsequent computations in the RNS. This is a good tradeoff when you have lots of SIMD parallelism available in your hardware, while the alternative would involve multi-precision numbers—like the 800 bit numbers mentioned above for arithmetic FHE.

RNS is most prominently used for arithmetic FHE schemes with RLWE, as the arithmetic FHE schemes already impose a SIMD-like computational model, which I’ll discuss in more detail later in this article. However, it is also used for boolean schemes to neatly represent high-precision arithmetic with low-precision underlying messages.

### Double-CRT

Another way that RNS/CRT sneaks into the picture is during encoding.
An addition of two polynomials corresponds naturally to the sum of their coefficients,
but a multiplication of two polynomials corresponds to the *convolution* of their coefficients.
In RLWE encodings described above,
a message can be encoded in the polynomial’s coefficients before encryption.
The most common RLWE encoding, however, encodes the messages
in the *evaluations* of the polynomial at certain points.
This can be thought of in a few equivalent ways
as a DFT/NTT inversion,
as a polynomial interpolation,
or as a CRT decomposition of the underlying polynomial ring.
All of them involve converting from an “evaluation domain” to a “coefficient domain.”

When this trick is applied for arithmetic FHE schemes in conjunction
with “RNS to support large input messages,”
the encoding is often called *double-CRT*.
For more details, see my deep dive on encodings.

### Gadget decompositions

Another general means for managing growth is the so-called
*gadget decomposition* idea (I hate this name because it means nothing).
This comes up often as an implementation detail of particular schemes,
invisible to the user of the scheme,
but critical to ensure FHE building block operations (like key switching)
can be implemented without blowing through the noise budget.
I wrote an article
describing how it works, but the main idea
is that, when you might multiply two numbers $a, b$,
instead you sum the product of $a$ with the bits of $b$,
and add them up and scale them appropriately.
This is useful when the noise growth is asymmetric in the two arguments,
and so basically you make the noise-heavy part as small as possible
and move the powers of 2 to the other side.

## Individual FHE schemes

Now I’ll give a broad overview of the main FHE schemes at the forefront of research. They are named via author initialisms: BFV, BGV, CKKS, and CGGI.

### BFV and BGV (integer/fixed-point arithmetic)

BGV and BFV (two papers B and FV) are two schemes in the “arithmetic FHE” family. In the decade since their initial publication in 2011/2012, BGV and BFV have been tweaked and re-analyzed and identified to be more or less equivalent, with minor differences in their implementations, efficiency, and noise propagation across different parameter choices. Many of the high level tricks and considerations of this scheme are also used in CKKS, so this section will be longer to account for that.

Both schemes use a version of RLWE to encrypt vectors of messages,
and both use the sort of RLWE that involves taking a dot product of only two polynomials
(this will be relevant when I discuss the concept of a key basis next).
That said, their first obvious difference
is that BGV encodes cleartexts as plaintexts by storing the messages in the
*least* significant bits of the polynomial coefficients,
while BFV stores it in the *most* significant bits.
BGV/BFV might typically use 16 bits for the message.
In these 16 bits, one can represent a 16-bit integer,
or various kinds of fixed-point numbers that fit in the same number of bits.

Both BGV and BFV use standard RLWE additions, and while they differ in how they handle multiplication, both of their multiplication routines rely on key switching for the following reason. They view a ciphertext as a “degree 1 polynomial” in the secret key $s$. That is, a ciphertext is a pair $(c_0, c_1)$ for which the first step of decryption is to compute $c_0 \cdot 1 + c_1 \cdot s$. The $c_i$ are the coefficients of this linear polynomial, but the relation between polynomials gets even clearer when we see that multiplication between ciphertexts $(c_0, c_1)$ and $(b_0, b_1)$ is defined as

\[ (c_0, c_1) \cdot (b_0, b_1) = (c_0 b_0, c_0 b_1 + c_1 b_0, c_1 b_1) \]

This is a valid ciphertext, but only if you consider it as a degree-2 polynomial in $s$, i.e., $c_0b_0 \cdot 1 + (c_0 b_1 + c_1 b_0) s + c_1 b_1 s^2$. Another way to put this is that we’re talking not about polynomials in $s$, but that there is a “key basis” that starts as $(1, s)$ and multiplication changes the basis to $(1, s, s^2)$ (then we’re not computing a polynomial, but taking a dot product with the basis).

Note the formula for multiplication above is a simplification. BGV and BFV differ in how they handle ensuring the above stays within the appropriate coefficient modulus, and tweaking these details is the primary goal of a handful of academic papers. I mainly want to highlight the change in key basis, because it creates an obstacle. You could continue to do FHE operations in this degree-two key basis, but it would be more computationally expensive, and applying another multiplication would make the basis degree three or higher.

So the BGV/BFV schemes use a technique called *relinearization*
to convert from the degree-two basis back to the degree-one basis.
This requires key switching as a subroutine,
and an additional encryption of $s^2$ to support it.
Relinearization is the bottleneck of multiplication,
the same subroutine for both BGV and BFV,
and also slightly increases ciphertext noise.

Another relevant aspect of BGV is that is uses a technique called “modulus switching” to reduce noise growth across the evaluation of a circuit. As a reminder, BGV and BFV both take the “arithmetic FHE” approach of avoiding bootstrapping by increasing parameter sizes. This modulus switching technique is the specific method by which having large parameters can allow one to cheat performance death in this way: the act of modulus switching makes parameters smaller while reducing noise, and so the larger the parameters, the more room there is to reduce noise.

Specifically, BGV’s parameters include a specification of a set of numbers $Q_0, Q_1, \dots, Q_L$ called “moduli,” where each $Q_i$ divides $Q_{i+1}$. Usually they are constructed by taking $Q_{i+1} = p_{i+1}Q_i$ for some 32-bit or 64-bit primes $p_i$. These moduli are literally the coefficient modulus of the polynomial ring of RLWE ciphertexts, but abstractly they are considered like “levels.” A ciphertext starts at the top level $Q = Q_L$, and once it reaches $Q_0$ you can’t do any more multiplications with it. After each multiplication, the output ciphertext’s coefficients are rescaled using a modulus switching operation that converts its coefficients from being mod $Q_{i+1}$ to being mod $Q_i$. This involves rescaling the ciphertext’s coefficients by $Q_i / Q_{i+1}$, but requires some careful details to get it right. This has the side effect of rescaling the noise by the same factor. In particular, without this trick, ciphertext noise grows quadratically with the number of multiplication operations, meaning you can only evaluate $\log_2(L)$ multiplications. With the trick, you can do $L$ multiplications. It is not uncommon for $L$ to be as high as 128.

BFV uses modulus switching as well, but it is hidden inside the multiplication operation and maintains “scale invariance” as part of its public API (outside of a multiplication operation), while BGV exposes the moduli and the user must ensure that operations are only applied that combine ciphertexts with the same modulus.

The modulus switching question is relevant
because it impacts how you set up the parameters of the scheme.
For BGV, the parameters depend on the circuit you want to evaluate.
When people talk about “leveled” homomorphic encryption,
this is what they mean: picking larger parameters
so that you have enough levels to avoid bootstrapping.
Got multiplicative depth? You need More Primes.
One FHE engineer mentioned to me that—between
needing them for RNS decompositions and modulus switching—they
actually *ran out* of 32-bit primes
(there are roughly 60 million of them, or 0.25 GiB worth,
based on this paper)
and as a result had to switch their underlying types to 64 bits.^{4}
Meanwhile, BFV’s analogous parameters are independent of the circuit.
In exchange for this complexity, BGV boasts some 20-50% speedups
over BFV for larger moduli $Q$ (i.e., deeper circuits),
though it does depend on what particular subsets of other tricks are applied.

The last thing I wanted to mention is how BGV and BFV sit within a larger computing paradigm. While there is a lot of attention focused on improving microbenchmarks (like relinearization or key switching or noise growth from a single multiplication), there is also a challenge in expressing a program in the BGV/BFV computational model. Because while you can do SIMD operations like adding and multiplying ciphertexts, the other operations common to SIMD architectures like AVX aren’t present. Forget that a ciphertext is a polynomial and consider it just as a vector of the underlying messages. AVX and friends allow you to apply an arbitrary permutation to the vector elements, shuffling them around to align them better for later SIMD operations. In FHE we are significantly more limited. There is a limited subset of supported permutations that correspond to cyclic rotations of a vector. In polynomial land, these are implemented via Frobenius automorphisms of the underlying ring that map $x \mapsto x^k$ for some $k$.

To extend this to an arbitrary permutation requires
constructing a log-depth *permutation network*.
I’m not aware of anyone on the practical side of FHE
who actually does this, and instead they resort to
computations that are well expressed with available rotations,
like simple loops and matrix-vector products.
Still, these automorphisms require their own special key material,
one for each supported rotation shift amount,
and they are expensive relative to what they do.
So naturally one wants to minimize their use,
and this can be a challenge in expressing a program
that is not hand-crafted for FHE to fit the arithmetic FHE model.

Part of the reason rotations are expensive is that they require a key switching operation, which both requires additional encrypted key material and is computationally intensive relative to the other (non-bootstrapping) operations in BGV/BFV. For a deeper, academic discussion of BGV, BFV, and their differences and optimization tricks, see this paper of Kim, Polyakov, and Zucca.

Finally, there has been some recent work on optimizing bootstrapping for BFV/BGV schemes, but to my knowledge the best bootstrapping runtime in these schemes still takes minutes to complete, or seconds per message if you amortize.

### CKKS (approximate arithmetic)

CKKS,
the most recently invented scheme of the bunch,
makes the insightful observation
that the error injected into an FHE ciphertext
can be absorbed into the error inherent to floating point operations.
They take advantage of this both for expanding the allowed space of inputs
and for reducing noise more aggressively during the scheme.
In particular, CKKS uses a *rescaling* operation
that serves a similar purpose to modulus switching in BGV/BFV—reducing
noise growth from multiplications—but in the process
it drops some precision from the least significant bits of the message.

Like BGV and BFV, CKKS uses RLWE to encrypt messages, and excels at SIMD-style programming. It uses the same sorts of residue number system tricks as BGV and BFV to accelerate polynomial multiplication, emphasizes a “leveled” approach to avoiding bootstrapping, and uses relinearization, rotation, and key switching similarly.

The CKKS scheme supports arbitrary real or complex numbers as input messages—with the understanding that some precision will be lost—and it excels at evaluating transcendental functions via their Taylor series polynomial approximations. For this reason, CKKS is considered best suited for SIMD-style computations that permit some limited loss of precision, such as machine learning inference tasks, where some transcendental functions are necessary, but not too frequent. In particular, it can do better than BGV/BFV at finding small-degree polynomials that approximate functions in some local neighborhood of a point. This relies on having a continuous message space: if you set your message space to be the unit interval, your polynomial approximation can assume the inputs are in that range.

However, the added performance of CKKS comes at a usability cost. In addition to keeping track of FHE ciphertext noise growth from homomorphic operations, and in addition to keeping track of the approximation errors introduced by polynomial approximations of non-polynomial functions, in CKKS one must also track precision loss from these other operations like rescaling. Even encrypting and immediately decrypting a message (with no homomorphic operations) does not give back the original message. Two ciphertexts with disagreeing scales cannot be combined, and similarly to BGV, this is exposed in CKKS’s “public” API, but it is more closely related to the correct program output in CKKS, which makes it harder to hide than BGV’s modulus switching.

I personally understand CKKS the least well of all four FHE schemes, but the more I dig in and discuss it with colleagues, the less differences I see between CKKS and BGV/BFV. In particular, many of the features of CKKS seem to get quickly ported to BGV/BFV, and vice versa.

### CGGI (boolean/short integer arithmetic)

The CGGI scheme supports encryption of small bit-width integers (1 to roughly 6
bits) using LWE encryption as its entrypoint.
It uses LWE’s natural addition operation,
but has no obvious multiplication operation like BGV/BFV did.
Instead, CGGI focuses on two key components:
having *fast* bootstrapping (< 10ms per ciphertext),
and doing extra (free) computation on a ciphertext during bootstrapping
via a technique called *programmable bootstrapping*.

Normal bootstrapping takes a ciphertext encrypting a message $m$ as input, and produces another encryption of $m$ with lower noise. Programmable bootstrapping allows you to reduce the noise while simultaneously computing an arbitrary (univariate) function of the underlying message. Since the messages are $m$-bit integers, the “arbitrary function” materializes as a lookup table of $2^m$ entries. For technical reasons about how functional bootstrapping works, as $m$ grows the bootstrapping efficiency degrades quite quickly (you require exponentially larger data size parameters). So you basically hard code all the lookup tables.

With this power, you can implement a squaring function $x \mapsto x^2$, and using that with addition and ciphertext-constant multiplication, you can implement ciphertext-ciphertext multiplication: $x \cdot y = ((x + y)^2 - (x - y)^2) / 4$.

But once you have programmable bootstrapping, multiplication is no longer a core function. You can implement arbitrary boolean gates directly, comparison ops, bitwise ops, max/min, and whatever (very low precision) fixed-point ops you’d like. And then you can optimize your program by finding clever ways to express your logic in terms of a low-precision lookup tables. You can also squeeze bivariate functions in by putting the bits of the two inputs side by side, provided you have enough precision.

A single bootstrap op in CGGI is cheap compared to BGV/BFV/CKKS. The leading CPU implementations can run bootstrap in ~8ms, and run many of them in parallel, with special versions that apply the same lookup table to many ciphertexts at once. This makes bootstrap more palatable to run frequently, to the extent that some implementations do what’s called “gate bootstrapping,” meaning they blindly bootstrap after every non-addition gate in an FHE program. But still, bootstrapping is the primary bottleneck in CGGI, accounting for over 90% of the total computation.

The implementation of programmable bootstrapping can be thought of as a black box, but peeking inside a bit, it is implemented by representing the lookup table in an RLWE ciphertext, and “blindly rotating” it by the encrypted message (homomorphically evaluating an automorphism $x \mapsto x^m$ when $m$ is encrypted). This effectively involves a large number (~800) of sequential matrix-vector polynomial multiplications. Then it must be followed up by an RLWE-to-LWE conversion and a key switch, both of which are minor compared to the blind rotate step.

Some extensions of CGGI (such as this one) incorporate BFV-style multiplication and relinearization, decompose larger functions into trees of lookup tables, and add additional features to CGGI.

Because of this flexibility, CGGI can be considered better for programs that have a lot of branching evaluations and a relatively low-width circuit. Though see the Beyond Schemes section below for more color. But either way, there are fewer moving parts than the arithmetic schemes.

Sometimes you will see CGGI referred to by a few other names,
like “TFHE” (for “Torus” FHE, the name the authors of CGGI gave it),^{5}
“FHEW” for “Fastest HE in the West”
(an homage to its use of the “Fastest Fourier Transform in the West” algorithm;
FHEW was a critical precursor to the CGGI scheme),
or “DM” for the initialisms of two authors of FHEW.

Very recent work has demonstrated functional bootstrapping support in BFV. CKKS also has a bootstrapping operation, though initial versions of it take seconds (amortized per message) to run. Certain specializations of this bootstrapping procedure have been designed for CKKS that operates much faster, but assumes the input ciphertexts are single bits.

For a deeper dive on CGGI, see this implementation article by Daniel Lowengrub, or this paper-length explanation by Marc Joye.

## Beyond Schemes

Earlier in this article I described a sort of wall between the arithmetic schemes and the boolean schemes that divided the research community. One interesting development in recent years is the ability to convert ciphertexts from one FHE scheme to another. This has become known as “scheme switching.” It started with the CHIMERA paper, which demonstrated that switching was possible between CGGI, BFV, and CKKS. A followup called PEGASUS improved the CKKS to CGGI conversion. And a recent research compiler demonstrated that one can statically analyze a program to decide what parts should be implemented in CKKS and which in CGGI, and the resulting hybrid is faster than either scheme can do alone.

Some very recent research (as I have heard, to be published at EuroCrypt 2024) also designs a custom bootstrapping procedure for CKKS, which applies to the special case that CKKS is encrypting single-bit messages. It further claims that CKKS bootstrapping is faster (amortized) than CGGI bootstrapping if you have more than 150 ciphertexts to bootstrap at the same time. So you can combine this with scheme switching to convert a large batch of CGGI ciphertexts to CKKS, bootstrap, and convert back, to have faster bootstrapping.

This underlines the main research trend I’ve seen in FHE in the last two years: the distinction between FHE schemes is becoming less pronounced. CKKS, BGV, and BFV are merging into one ur-scheme whose details differ primarily in plaintext encodings and variations on the individual operations they share in common at a high level. If recent research holds up, programmable bootstrapping is available in BFV and will probably be made to work in CKKS and BGV soon. And scheme-switching implies that one’s initial choice of an FHE scheme may not be so binding.

## Hardware acceleration

One major aspect of FHE research is the work on designing custom hardware to accelerate FHE operations. There are many projects, but let me summarize a few I am more familiar with.

The most prominent group of projects are part of a DARPA program called DPRIVE. In short, DARPA is funding FHE hardware designers with a challenge to perform logistic regression, CNN inference, and CNN training as quickly as possible in FHE. There are currently four participants:

- Intel, whose accelerator is called HERACLES.
- Duality, whose accelerator is called TREBUCHET.
- SRI International, whose accelerator is called CraterLake.
- Niobium Microsystems, whose accelerator is called BASALISC.

All DPRIVE participants are working on ASICs that accelerate arithmetic FHE schemes (BFV/BGV and CKKS), and they are at various points along the path toward fabrication. Their initial performance claims are based on simulation, but to trumpet the horn of the underdog, Niobium, their initial paper claims a 5,000x speedup over CPU, with logistic regression of a 1,024-sample, 10-feature dataset is estimated to take 40 seconds versus 60 hours on CPU. To me this is a lower bound on what’s possible with hardware acceleration. At the core of most of these accelerators are accelerations of number-theoretic transforms (NTTs) and other polynomial operations in the relevant polynomial rings. To my understanding, the hard parts of these accelerators are packing enough RAM into them so that they can store all of the ciphertexts and auxiliary key material and get good memory locality.

Another project I’m familiar with is an FPGA-based approach to accelerating CGGI, which goes by the name FPT, out of the COSIC research lab at KU Leuven. They use an Alveo U280, and they functionally bootstrap pipelined batches of 16 ciphertexts at a time to achieve a throughput of 1 bootstrap / 35 microseconds. I’ve seen their live demo, in which they run Conway’s Game of Life in CGGI, and the animation is effectively in real time. Unlike the NTT-crunching machines from the DPRIVE program, FPT is an FFT-crunching machine. Naturally, this project starts from the TFHE-rs API for CGGI.

Then there are approaches I’m less familiar with. The folks at Intel have a HEXL project that focuses on targeting Intel CPUs using AVX and similar modern CPU fanciness. There are also folks at NVIDIA working on GPU acceleration, and the HEaaN library (CKKS) also supports GPU acceleration. There is also a company called Optalysys that is building an optical computing chip for FHE. The idea there is that, by using interference patterns of light passing through lenses (or rather, nanoscale equivalents), one can compute Fourier transforms “at the speed of light,” and in doing so accelerate bootstrapping.

And finally, I’m working on my own hardware acceleration approach: CGGI on TPUs. This is in an open source library called jaxite (named so because it’s written in JAX). The performance is nothing to write home about yet, but my hope is that if I can get performance to be 10-100x faster than CPU, then I can use the fact that Google already has TPUs deployed at scale to ship some FHE products before more intense hardware acceleration is ready at scale.

For some more details on these and other accelerators I know less about, see this paper, “SoK: Fully Homomorphic Encryption Accelerators”.

## Software

There is a lot of software written to support FHE, from implementations of the core crypto to compilers and higher-level frameworks. For an exhaustive list, see Jonathan Schneider’s Awesome FHE. I will highlight some of my favorites here.

OpenFHE is an implementation of all major FHE schemes, including scheme switching, primarily maintained by Duality. It replaced an earlier PALISADE project, which is no longer maintained. To my knowledge it is the only library that supports scheme switching, and it is the primary API that the DPRIVE participants are using as an entry point for their hardware.

HEaaN is the canonical CKKS implementation, and it was taken closed source when the authors started CryptoLab, a for-profit research lab. The last open source version remains on GitHub.

TFHE-rs is the leading CGGI implementation, written in rust and developed/maintained by the for-profit company Zama. It was based off of the original CGGI implementation (also called TFHE), which was written in C++ and is no longer maintained. Zama also built a variety of higher-level libraries over TFHE-rs, including a compiler called concrete, and a scikit-learn API mirror called concrete-ml.

Lattigo is a pure Go implementation of CKKS, along with BGV/BFV/CGGI implementations, originally by Jean-Philippe Bossuat and now maintained by Tune Insight.

To my knowledge, all other FHE implementation libraries are either not significantly used or no longer actively developed. Two notable examples of the latter category include:

- Microsoft’s SEAL, which supports BGV/BFV and some CKKS support with the associated EVA compiler (also not actively developed).
- HELib, a BGV & CKKS implementation by IBM.

There are a few other FHE tools under active development worth mentioning.

- HELayers is IBM’s FHE compiler for arithmetic FHE, which specializes in finding good packing schemes
- HEaaN.mlir is CryptoLab’s compiler targeting HEaaN, but as far as I can tell it is not open source.
- As mentioned above, Concrete and Concrete-ML are a pair of compiler/front-end for machine learning with CGGI via TFHE-rs.
- HEIR an FHE compiler—and my main project—that aims to have support all major FHE schemes and hardware backends. As of this writing, it supports CGGI exporting to TFHE-rs for CPU and the FPT FPGA, and BGV exporting to OpenFHE (and getting ready for the DPRIVE accelerators).

There are also a variety of other research-level compilers that are not maintained, but innovated many of the ideas used in current FHE compilers.

## Writing FHE programs

After all the cryptography is there and the software is in place, what does it actually look like to write FHE programs? I’m very interested in this topic, and I’d like to write more about it in the future.

However, in brief, writing FHE programs today still highlights the arithmetic/boolean divide. While there are some dedicated libraries like Concrete-ML that do a good job for specific tasks, general FHE programming is still a highly manual process, akin to writing custom graphics kernels where you are acutely aware of the “right” way to implement something in hardware, but in our case the “hardware” is the limited computational model of FHE.

For arithmetic FHE, this means that you have to avoid
operations that aren’t polynomials,
or can’t easily be approximated by low-degree polynomials.
This rules out even simple comparisons, max/min, and branching.
A vast amount of research has been dedicated
just to finding precise polynomial approximations
for the sign function (`1 if x > 0 else -1`

).
And the “right” polynomial approximation to choose
depends heavily on your program’s semantics.
That’s why most of the arithmetic FHE world
has been focusing on neural network inference,
because most of the work of a neural network
is in computing matrix multiplications—and
most of the hard part of doing this in FHE
is handling transcendental activation functions.
A program with too many discontinuous
functions with low error-tolerance
would simply be infeasibly slow in arithmetic FHE
because it would require constant bootstrapping.
For example,
the most impressive program I’ve seen implemented
in arithmetic FHE is ResNet in CKKS,
a deep neural network architecture that stacks convolutional layers
with ReLU activation.
That ReLU requires a high-degree polynomial approximation
that consumes all of the multiplicative depth available,
and so after each convolutional layer one requires a CKKS bootstrap.

To my knowledge, the state of the art implementation
is the paper linked above,
which reports one inference (on a packed ciphertext containing 50 images)
taking about one hour (one minute per image, amortized).^{6}
That paper also implements ResNet-110,
at a latency of 6 hours per inference.
Meanwhile, ResNet-50 CPU inference has
amortized latency of ~1 image /
ms,
which is order of magnitude 100,000x slowdown.

Outside of compiling pre-trained neural networks with specific tricks, to my understanding all practical arithmetic FHE programs are hand-written by experts with manually chosen approximations and batching schemes.

For boolean FHE, non-polynomial operations are
much cheaper, but you run into other problems related
to bitwidth. Specifically, large-bit-width additions and multiplication ops
become quite expensive, because boolean FHE schemes
only support relatively small-bit-width messages
and must effectively simulate large adder/multiplier circuits
via large trees of functional bootstrap operations.
As such, boolean FHE expends a lot of effort
on *quantization*, meaning reducing the bit width
as much as possible.
For neural networks, libraries like Brevitas and TFLite
come in very handy, and Zama has made great use of this
along with other pruning tricks
to implement networks like VGG on
CPU
to achieve ~40s inference per image.
It’s slower than the ResNet example above (and VGG is a much smaller network),
but the benefit of this approach is you can also do
things like decision trees that would be much less practical in arithmetic FHE,
and far less of it is manually hand-crafted,
as exemplified by Zama’s concrete-ml library hiding many of the details.
Another benefit of the boolean FHE model
is that you can throw a lot more off-the-shelf hardware
optimization tooling at the problem,
because boolean FHE circuits look much more like traditional combinational circuits
than arithmetic FHE does.^{7}

## Getting involved

If FHE sounds interesting and you want to get involved, there are a few decent ways to start. The FHE.org group, spearheaded by Zama, runs a weekly seminar series on FHE research, and has an active Discord, (disclosure: I am a moderator, and the HEIR project has a channel). FHE.org also hosts a yearly conference in March.

The OpenFHE project has an active Discourse forum where the maintainers answer questions about OpenFHE and FHE in general.

There is an active ISO standardization effort for FHE, organized by the Homomorphic Encryption Standardization Consortium, which also hosts research workshops.

And finally, our HEIR project has twice-monthly open design meetings for anyone interested in contributing.

## Thanks

Thanks to Seonhong Min, Johannes Mono, Finn Plummer, and Jeffrey Sorensen for providing feedback on this article.

By boolean circuit, here I mean a purely combinational circuit. No clock cycles or flip flops involved. ↩︎

An important caveat here is that a 1,000,000x performance hit is not automatically a deal breaker for some applications. Much of the FHE industry is focused right now on the subset of applications where legal hurdles make the two available options “slow, but compliant” and “fast, but incurs prison time.” ↩︎

A question that puzzled me when I first studied LWE is why we can’t simply ignore the least significant bits of a ciphertext to “bypass” the noise. The reason is that the noise can be negative. Adding negative noise changes the higher order bits of an integer by the arithmetic “wrapping around.” It “tampers” with the highest order bits, but still allows you to remove it by rounding up in the decryption process. I enshrined this mistake in my first broken LWE implementation by masking and shifting during decryption instead of rounding. ↩︎

For performance reasons, not all primes are equal when selecting parameters. Often you need to restrict them to have a certain GCD with the underlying plaintext modulus, and sometimes you want the primes to be adequately spaced apart in some sense that I’m not intimately familiar with, but FHE hardware design experts tell me are important. ↩︎

The “torus” in “TFHE” is a modeling formalism I prefer not to use, because I feel it adds unnecessary complexity to the FHE schemes. In the end, and in every implementation, this “torus” is discretized to the ring of integers with a given bit width, and as far as I can tell having it as a “torus” does not make the theoretical analysis any simpler. ↩︎

The authors report using an AMD Ryzen Threadripper PRO 3995WX at 2.096 GHz (64 cores) with 512 GB RAM. ↩︎

While I’ve been discussing “circuits” loosely in this article, one point of consensus among my colleagues working on the HEIR compiler is that the circuit perspective only gets us so far, and really we want to turn toward thinking of FHE programs as traditional programs. FHE ciphertexts are not the wires that flow between gates, so much as they are data stored in traditional memory, perhaps packed in a specific way analogous to vector registers in modern hardware. ↩︎

Want to respond? Send me an email, post a webmention, or find me elsewhere on the internet.