Summary: A Logic for Expressing Log-Precision Transformers (arxiv.org)
10,830 words - PDF document
One Line
Researchers propose the logic FO(M) as a new way to express the computations performed by transformer models; FO(M) is more powerful than previously used logics and can provide insight into how transformer models perform their computations.
Key Points
- FO(M) logic can express the computations performed by log-precision transformer models and handles a wider range of attention patterns than previously used logics.
- Finite-precision transformers cannot express uniform attention patterns, a core algorithmic primitive of transformers, whereas log-precision transformers can (see the sketch after this list).
- A block mapping construction is presented for building threshold circuits that simulate log-precision transformer models.
- The transformer's column-level components rely on addition, conditional branching, and a finite number of functions computable in time O(log n).
- Affine transformations, layer normalization, and the output classifier head can be computed by log-uniform TC0 circuit families.
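The uniform-attention point can be made concrete with a small numeric sketch. The snippet below is my own illustration, not taken from the paper: with a fixed number of fractional bits, the uniform attention weight 1/n rounds to 0 once n is large, while a budget of O(log n) bits keeps it representable. The quantize helper and the specific bit budgets are assumptions made only for illustration.

```python
# Minimal, self-contained illustration (not from the paper): with a fixed
# number of fractional bits, the uniform attention weight 1/n rounds to 0
# once n is large, while p = O(log n) bits keep it representable.
import math

def quantize(x: float, frac_bits: int) -> float:
    """Round x to a fixed-point value with `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale

for n in [8, 1_000, 1_000_000]:
    fixed = quantize(1 / n, frac_bits=8)                             # finite precision: 8 bits
    log_p = quantize(1 / n, frac_bits=2 * math.ceil(math.log2(n)))   # O(log n) bits
    print(f"n={n:>9}: 1/n with 8 bits -> {fixed}, with O(log n) bits -> {log_p}")
```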
Summaries
225 word summary
This paper presents a logic for log-precision transformer models, analyzing the uniformity of common neural net building blocks within the transformer. Affine transformations, layer normalization, and the output classifier head can be computed by log-uniform circuit families. The article discusses the use of log-precision transformers as neural sequence models, covering their practical limitations and the theoretical limitations of self-attention. A block mapping construction for building threshold circuits that simulate log-precision transformer models is presented. The paper establishes the equivalence of log-uniform circuits and FO(M) and suggests the possibility of translating real transformers to FO(M) sentences and establishing a hierarchy theorem describing the FO(M) quantifier depth needed to simulate a TC0 family of a certain size. The researchers develop a logic called FO(M) that can express the computations performed by transformer models; it is more powerful than the logics used in prior characterizations and handles a wider range of attention patterns. FO(M) allows majority quantifiers and can express any function computed by a log-precision transformer. The paper discusses the limitations of log-precision transformer models and the importance of uniform attention, and raises concerns about the safe deployment, fairness, and accountability of increasingly complex transformer models. FO(M) has the advantage of being mechanistically interpretable, providing insight into how transformer models perform their computations. The researchers believe their work could guide the design of "transformer-complete" programming languages.
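As a toy illustration of the majority quantifiers that FO(M) adds, the snippet below brute-forces a simple majority-quantified sentence over a binary string ("a strict majority of positions carry a 1"). The function names and semantics are my own sketch, not the paper's formal definition of FO(M).

```python
# Illustrative only: a brute-force evaluation of a simple majority-quantified
# sentence over a binary string, in the spirit of FO(M)'s "M i. phi(i)" quantifier
# ("phi holds at more than half of the positions"). Toy semantics sketch,
# not the paper's formal definition of FO(M).
def majority(positions, phi) -> bool:
    """True iff phi(i) holds for more than half of the positions."""
    return sum(1 for i in positions if phi(i)) * 2 > len(positions)

def sentence(x: str) -> bool:
    # "M i. x_i = 1": a strict majority of positions carry the token '1'.
    return majority(range(len(x)), lambda i: x[i] == "1")

print(sentence("11010"))  # True  (three of five positions are 1)
print(sentence("10010"))  # False (two of five positions are 1)
```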
699 word summary
Researchers have developed a logic, called FO(M), that can express the computations performed by transformer models. FO(M) is more powerful than previous logics used for this purpose and can handle a wider range of attention patterns. The authors demonstrate that finite-precision transformers cannot express uniform attention patterns, which are a core algorithmic primitive of transformers. They propose FO(M), which allows majority quantifiers and can express any function computed by a log-precision transformer, providing an upper bound and the first logical characterization of log-precision transformers. The paper raises concerns about the safe deployment, fairness, and accountability of increasingly complex transformer models with hundreds of billions of parameters. FO(M) has the advantage of being mechanistically interpretable, meaning that it can provide insight into how transformer models perform their computations. The researchers believe that their work could guide the design of "transformer-complete" programming languages. The paper discusses the logic behind log-precision transformer models, which are transformers whose weights are fixed and whose forward pass is computed with at most O(log n) precision. The authors propose a logic for such models and give various examples of languages that can be expressed in first-order logic with the addition of certain predicates. The document discusses the limitations of log-precision transformer models and the importance of uniform attention. Theorem 2 states that any computation of a transformer on input x can be reduced to a single integer division or a finite number of Dyck-language queries. Proposition 1 analyzes a model of transformers where the precision depends on the context length. Computation graph families and threshold circuits are also discussed as useful tools for processing inputs of arbitrary length. The main theorem shows that any log-precision transformer can be expressed in FO(M) and can be simulated by a log-uniform TC0 family C whose size is bounded by a function bsize(n) that is both a power of 2 and computable in time O(log n). A block mapping construction for building such circuits is also presented: it defines a circuit family simulating the computation graph family and enforces that the block size of each node matches the size of the circuit for that node. The paper establishes the equivalence of log-uniform circuits and FO(M), proving that any log-precision transformer can be translated to an FO(M) sentence that computes the same function as the transformer. The authors also prove that a transformer can be simulated by a log-uniform TC0 family C that obeys the size and depth properties of Theorem 3. The article suggests the possibility of translating real transformers to FO(M) sentences and establishing a hierarchy theorem describing the FO(M) quantifier depth needed to simulate a TC0 family of a certain size. It also suggests the potential for proving that any transformer can be simulated by an FO(M) sentence of quantifier depth at most 2. Finally, the document discusses the use of log-precision transformers as neural sequence models, touching on the practical computational power of finite-precision systems and the effects of parameter norm growth during transformer training.
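The reduction above mentions Dyck-language membership queries. As a concrete example of such a query (written as an ordinary sequential check, not the paper's TC0/FO(M) construction), the following tests membership in the Dyck-1 language of balanced parentheses.

```python
# Illustration only: membership in the Dyck-1 language of balanced parentheses,
# the kind of query referenced above. A plain sequential check, not the
# paper's TC0/FO(M) construction.
def is_dyck1(s: str) -> bool:
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing bracket with no matching opener
                return False
        else:
            return False           # only '(' and ')' belong to the alphabet
    return depth == 0              # every opener must be closed

print(is_dyck1("(()())"))  # True
print(is_dyck1("(()"))     # False
```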
The article covers the limitations of log-precision transformers and the theoretical limitations of self-attention in neural sequence models, citing related works on layer normalization and transformer architecture. The paper provides definitions and proofs for various concepts related to column uniformity and computation graphs, including the concept of column uniformity in computation graph families. It presents a construction of threshold circuits simulating log-precision transformer models; the transformer's column components rely on addition, conditional branching, and a finite number of functions computable in time O(log n), and the size of the model's parameters is at most poly(n). The paper analyzes the uniformity of common neural net building blocks within the transformer and shows that they are computable by a log-uniform, constant-depth, poly-size threshold circuit family. Affine transformations, a core part of the neural networks used in the transformer, can be computed by a log-uniform, constant-depth threshold circuit family of size polynomial in the size of the weight matrix and bias vector. Layer normalization can be computed by a log-uniform TC0 circuit family. Finally, the output classifier head is computable in log-uniform TC0 if the activation function is log-uniform.
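For concreteness, here are minimal NumPy versions of two of the building blocks discussed above, the affine transformation and layer normalization. These are the standard numerical definitions; the paper's contribution is showing they can be simulated by log-uniform threshold circuits, which this sketch does not attempt. The function names and the eps constant are my own choices.

```python
# Minimal reference versions of two building blocks discussed above, written
# in NumPy for concreteness. The paper's claim is about simulating these with
# log-uniform threshold circuits; this is just the numerical definition.
import numpy as np

def affine(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Affine transformation W x + b, used throughout the transformer."""
    return W @ x + b

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Layer normalization: center and scale x to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
W = np.eye(4) * 0.5
b = np.zeros(4)
print(layer_norm(affine(x, W, b)))
```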
1940 word summary
The paper discusses the logic for log-precision transformer models. It analyzes the uniformity of the common neural net building blocks used within the transformer and shows that they are computable by a log-uniform, constant-depth, poly-size threshold circuit family. It also discusses affine transformations, which are a core part of the neural networks used in various parts of the transformer, and shows that for p = O(log n), any p-precision affine transformation whose weight matrix and bias vector are log-uniform is computable by a log-uniform, constant-depth threshold circuit family of size polynomial in the size of the weight matrix and bias vector. The paper further shows that layer normalization can be computed by a log-uniform TC0 circuit family, and that if the activation function is log-uniform, then the output classifier head is computable in log-uniform TC0. The document presents a construction of threshold circuits simulating log-precision transformer models. The activation block of the model consists of two parts: the first simulates the pooling part of the self-attention sublayer, and the second applies layer norm and simulates the feedforward subnetwork. The self-attention component is computable in log-uniform TC0, with the value function defined as v(h_i) = W_h h_i + b_h. The similarity function computes queries and keys and then takes the scaled dot-product between them. Self-attention, binary multiplication, binary division, and iterated addition are all computable in log-uniform TC0, and the size of the model's parameters is at most poly(n). The transformer embedding function represents token positions with a vector and can be expressed as a constant-size computation graph; the parameter vector for each function is log-uniform. The transformer column components rely on addition, conditional branching, and a finite number of functions computable in time O(log n), and each circuit family in F can be padded to size bsize(n). The article discusses the use of log-uniform TC0 families in transformer models. Lemma 2 shows that transformer components are computable in log-uniform TC0, and Lemma 5 states that each component in F is computable in log-uniform TC0 if T is a log-uniform transformer with log-uniform parameters. The embedding component, self-attention mechanism, activation block, and output classifier head are all computable in log-uniform TC0. Lemma 3 establishes that there exists a function bsize(n) that is a power of 2 and computable in O(log n) time, and Lemma 4 shows that circuit families can be padded to log-time size upper bounds. The article concludes by providing a formula for edge_{G_T}(n, i, j) and discussing causally masked attention. The paper then discusses the concept of column uniformity in computation graph families, specifically in transformer models: any transformer model is a log-column-uniform computation graph family, meaning that its structure can be computed in logarithmic time using a lookup table. The paper provides definitions and proofs for various concepts related to column uniformity and computation graphs. Common notation for computation graph and circuit families is summarized in Table 1, and the paper cites related works on layer normalization and transformer architecture.
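The value function and scaled dot-product similarity described above are the standard single-head self-attention computation. The NumPy sketch below restates that definition; the weight names Wq, Wk, Wh, and bh and the softmax pooling are illustrative assumptions, not the paper's circuit construction.

```python
# Standard (single-head) self-attention in NumPy, matching the description above:
# values v(h_i) = Wh @ h_i + bh, similarities given by the scaled dot-product of
# queries and keys. Weight names (Wq, Wk, Wh, bh) are illustrative.
import numpy as np

def self_attention(H: np.ndarray, Wq, Wk, Wh, bh) -> np.ndarray:
    Q = H @ Wq.T                                     # queries, one per position
    K = H @ Wk.T                                     # keys, one per position
    V = H @ Wh.T + bh                                # values v(h_i) = Wh h_i + bh
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot-product similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # pooled values for each position

n, d = 5, 4
H = np.random.randn(n, d)
Wq = Wk = Wh = np.eye(d)
bh = np.zeros(d)
print(self_attention(H, Wq, Wk, Wh, bh).shape)  # (5, 4)
```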
The document discusses logic for log-precision transformer models, including the practical computational power of finite-precision systems and the effects of parameter norm growth during transformer training. It also covers the limitations of log-precision transformers and the theoretical limitations of self-attention in neural sequence models, with references to work on language recognition, neural information processing, and automata. The article discusses the development of a logic for log-precision transformer models, referencing work that reconciles deep learning with symbolic artificial intelligence and provides theoretical foundations for guiding mechanistic interpretability work. It suggests the possibility of translating real transformers to FO(M) sentences and establishing a hierarchy theorem describing the FO(M) quantifier depth needed to simulate a TC0 family of a certain size, as well as the potential for proving that any transformer can be simulated by an FO(M) sentence of quantifier depth at most 2. The paper presents a method to translate log-precision transformers to a simple logic, providing the tightest known upper bound on such models, and the authors conjecture that it is possible to simulate transformers in an even simpler logic. The results challenge the idea of a rigid division between symbolic and neural models. The paper establishes the equivalence of log-uniform circuits and FO(M), proving that any log-precision transformer can be translated to an FO(M) sentence that computes the same function as the transformer. The authors also prove that a transformer can be simulated by a log-uniform TC0 family C that obeys the size and depth properties of Theorem 3. The paper includes corollaries that establish the applicability of the method to various computation graph families. The results have implications for understanding the inner workings of transformers and for developing frameworks that unify different types of neural models. The document presents a block mapping construction for building threshold circuits that simulate log-precision transformer models. The construction creates contiguous blocks of circuit gates simulating each node in the computation graph and routes inputs and outputs between blocks appropriately. It depends on a block mapping that defines the block node, block start, and block size for each node in the computation graph; it defines a circuit family simulating the computation graph family and enforces that the block size of each node matches the size of the circuit for that node. The construction satisfies three premises that ensure the circuit family and the computation graph family compute the same function, and induction over the circuit gates in topological order shows that the premises hold up to a given gate. A uniform threshold circuit family is defined for each primitive function using log-uniform threshold circuit families. The main theorem shows that any log-precision transformer can be expressed in FO(M) and can be simulated by a log-uniform TC0 family C whose size is bounded by a function bsize(n) that is both a power of 2 and computable in time O(log n). Lemma 4 shows that if F is a log-uniform TC0 family in F, then size_F(n) ≤ bsize(n). Lemma 3 shows that each component in F is computable in log-uniform TC0.
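To make the block mapping concrete, the following toy sketch (my own illustration, not the paper's pseudocode) assigns each computation-graph node a contiguous block of gate indices via prefix sums, so that any gate index can be mapped back to its block node, block start, and offset within the block.

```python
# Illustrative sketch of a block mapping: each node of a computation graph is
# assigned a contiguous block of gate indices whose length is the size of the
# circuit simulating that node. Prefix sums give each block's start; any gate
# index can then be mapped back to (block node, block start, offset).
# A toy version of the idea, not the paper's construction.
from bisect import bisect_right

def build_block_mapping(circuit_sizes):
    """circuit_sizes[i] = number of gates in the circuit simulating node i."""
    starts, total = [], 0
    for size in circuit_sizes:
        starts.append(total)   # block start of node i
        total += size
    return starts, total

def locate_gate(gate_index, starts):
    """Return (block node, block start, offset within block) for a gate index."""
    node = bisect_right(starts, gate_index) - 1
    return node, starts[node], gate_index - starts[node]

starts, total = build_block_mapping([4, 8, 8, 2])
print(locate_gate(5, starts))   # (1, 4, 1): gate 5 lies in node 1's block
print(locate_gate(19, starts))  # (2, 12, 7): gate 19 lies in node 2's block
```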
Lemma 1 shows that a transformer T is a log-uniform computation graph family where F contains embedding, self-attention, feedforward, and output components. The proof of Theorem 2 views transformers as computation graphs and focuses on simulating computation graph families with threshold circuit families. Uniform computation graph families are families where node_G and edge_G can be computed efficiently, i.e., under some constraints on space or time. Computation graph families are a useful tool for processing inputs of arbitrary length: they generalize computation graphs to define functions over unbounded-length strings. Threshold circuits are a special case of computation graphs where D = {0, 1} and each node is labeled by a function f ∈ F that the node computes. The edges represent values in D flowing as output from one node into another node. A computation graph G of arity k parameterizes a function D^k → D in the standard way, with the value of the output node taken as the output of the function. The size, depth, arity, and precision of computation graph families become functions that can grow with n. Theorem 2 can also be extended to apply to log-precision transformers with log-uniform weights, and threshold circuits can simulate neural network components that output multiple bits. Theorem 2 states that any computation of a transformer on input x can be reduced to a single integer division or a finite number of Dyck-language queries; it is the tightest known upper bound for any transformer with up to O(log n) precision. Proposition 1 analyzes a model of transformers where the precision depends on the context length. It shows that fixed-precision transformers are artificially limited because they can only attend over bounded-length windows, making them similar to hard-attention transformers. The document discusses the limitations of log-precision transformer models and proposes a logic for such models, giving various examples of languages, such as parentheses matching, that can be expressed in first-order logic extended with certain predicates and majority quantifiers. The authors argue that finite precision cannot represent uniform attention over long sequences and provide examples of tasks that require higher precision, such as iterated addition and skip-bigram matching. They also discuss the importance of uniform attention and its potential applications in transformer models. The document introduces FO(M) formulas, which allow conditional majority quantifiers, can express counting and threshold quantifiers, and define formal languages. FO(M) formulas are constructed using indices, variables ranging over positions 1 to n, and predicates. The document also defines log-precision transformers as transformers where the weights defining T are fixed and the precision used to compute the forward pass is at most O(log n). The core functions in T are embeddings, self-attention, activation, and the classifier head; the network prediction on an input x ∈ Σ^n is obtained by applying the classifier head to the final hidden state h_n^d. The document then discusses the computation and output of transformer models as a function of the input x: the transformer is defined as a function of context length n and can be expressed as a sentence in FO(M).
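As a toy instance of the computation-graph view described above, the snippet below evaluates a graph whose nodes are threshold gates over D = {0, 1}: each node applies its labeled function to the values flowing in along its edges, and the value of the output node is the graph's output. The graph structure and gate choices are my own; this illustrates the definitions only, not the simulation theorem.

```python
# Toy instance of the computation-graph view above: nodes are threshold gates
# over D = {0, 1}; each node applies its function to the values flowing in
# along its edges, and the value of the output node is the graph's output.
def threshold_gate(threshold):
    """Gate that outputs 1 iff at least `threshold` of its inputs are 1."""
    return lambda *bits: int(sum(bits) >= threshold)

# A tiny graph computing MAJORITY of three inputs via two layers of gates:
# nodes map to (function, list of predecessor ids); ids 0-2 are the graph inputs.
graph = {
    3: (threshold_gate(2), [0, 1, 2]),   # at least two of x0, x1, x2
    4: (threshold_gate(1), [3]),         # identity on node 3 (the output node)
}

def evaluate(graph, inputs):
    values = dict(enumerate(inputs))
    for node in sorted(graph):           # topological order for this toy graph
        fn, preds = graph[node]
        values[node] = fn(*(values[p] for p in preds))
    return values[max(graph)]            # value of the output node

print(evaluate(graph, [1, 0, 1]))  # 1 (a majority of the inputs is 1)
print(evaluate(graph, [1, 0, 0]))  # 0
```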
Fixed-precision transformers can only attend to a fixed number of tokens, while log-precision transformers are capable of attending over contexts of length n. This sheds new light on how to interpret transformer models and their computation; however, an exact logical characterization of transformers remains an open problem. The empirical results of Bhattamishra et al. (2020) are for a harder variant of transformers, and hard attention is weaker than general attention. Researchers have developed a logic, called FO(M), that can be used to express the computations performed by transformer models. FO(M) is more powerful than previous logics used for this purpose and can handle a wider range of attention patterns. The researchers prove that any log-precision transformer model can be expressed using FO(M). They also identify several problems that log-precision transformers cannot solve, such as computing boolean matrix permanents. FO(M) has the advantage of being mechanistically interpretable, meaning that it can provide insight into how transformer models perform their computations. The researchers believe that their work could guide the design of "transformer-complete" programming languages. The paper examines the limitations of finite-precision transformer models and proposes a new logic for log-precision transformer models. The authors demonstrate that finite-precision transformers are fundamentally weak and cannot express uniform attention patterns, which are a core algorithmic primitive of transformers. They also note that any function that cannot be defined in first-order counting logic with modular indexing cannot be expressed by a finite-precision transformer. The authors propose FO(M), which allows majority quantifiers and can express any function computed by a log-precision transformer, providing an upper bound and the first logical characterization of log-precision transformers. The paper raises concerns about the safe deployment, fairness, and accountability of increasingly complex transformer models with hundreds of billions of parameters. One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text; recently, Chiang et al. showed that finite-precision transformers, a weaker variant of transformers, can be equivalently expressed in a generalization of first-order logic. The paper "A Logic for Expressing Log-Precision Transformers" explores this line of work. The authors are William Merrill from New York University and Ashish Sabharwal from the Allen Institute for AI.