Summary of "GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling" (arxiv.org)
One Line
GateLoop is a sequence model that outperforms alternatives by fully exploiting linear recurrence, offering content-aware gating and superior performance.
Key Points
- GateLoop is a foundational sequence model built on fully data-controlled linear recurrence.
- GateLoop outperforms existing models for auto-regressive language modeling and offers both a low-cost recurrent mode and an efficient parallel mode.
- GateLoop applies data-controlled gating to inputs, hidden states, and outputs, giving content-aware control over forgetting and retention (see the recurrence sketch after this list).
- GateLoop achieves lower test perplexity than a broad range of baselines on the WikiText-103 benchmark for autoregressive language modeling.
- GateLoop offers practical benefits: it avoids softmax-attention layers, eliminates the need for tedious initialization, and does not require long implicit convolutions.
- On the synthetic Memory Horizon dataset, designed to validate the advantage of data-controlled state transitions, GateLoop significantly outperforms a model with fixed state transitions in test accuracy.
- GateLoop's learned state transitions exhibit structured patterns, indicating deliberate use of data-controlled gating and forgetting/retention of memories.
- Overall, GateLoop demonstrates the effectiveness of fully data-controlled linear recurrence for sequence modeling, offering improved performance and practical advantages over existing models.
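In equations, the core mechanism can be sketched as follows, using the paper's attention-style notation: q_t, k_t, v_t are data-controlled projections of the input, and, crucially, the state transition a_t is data-controlled as well (in the paper it is complex-valued, with learned amplitude and phase activations). Exact shapes and broadcasting here are assumptions.

```latex
% Recurrent mode: one cheap state update per step.
S_t = a_t \odot S_{t-1} + k_t^{\top} v_t, \qquad y_t = q_t S_t
% Unrolled form: cumulative products of the data-controlled a_j act as
% relative-positional weights on past key-value pairs.
y_t = q_t \sum_{m=1}^{t} \Bigl( \prod_{j=m+1}^{t} a_j \Bigr) k_m^{\top} v_m
```

Roughly speaking, setting a_t to 1 everywhere recovers linear attention, and a fixed, input-independent a_t recovers RetNet-style decay; the data-controlled a_t is what distinguishes GateLoop.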
Summaries
18 word summary
GateLoop maximizes linear recurrence potential for sequence modeling, outperforming other models. It offers content-aware control and superior performance.
83 word summary
GateLoop is a sequence model that maximizes the potential of linear recurrence. It outperforms other models for language modeling and offers a low-cost recurrent mode and an efficient parallel mode. GateLoop incorporates data-controlled gating of inputs, hidden states, and outputs, providing content-aware control over forgetting and retention. It achieves superior performance on the WikiText-103 benchmark and offers practical benefits such as avoiding softmax-attention layers and tedious initialization. The model learns to forget memories input-dependently and exhibits structured state transitions.
145 word summary
GateLoop is a sequence model that uses fully data-controlled linear recurrence to enhance sequence modeling. It fills a gap left by existing models, which have not maximized the potential of linear recurrence. GateLoop outperforms other models for auto-regressive language modeling and offers a low-cost recurrent mode and an efficient parallel mode. It incorporates data-controlled gating of inputs, hidden states, and outputs, allowing content-aware control over forgetting and retention. The model can be trained efficiently using optimized associative scan implementations and can be interpreted as providing data-controlled relative-positional information to attention. GateLoop demonstrates superior performance compared to various models on the WikiText-103 benchmark. It also offers practical benefits such as avoiding softmax-attention layers, eliminating tedious initialization, and not requiring long implicit convolutions. The model learns to forget memories input-dependently, and its state transitions exhibit structured patterns, indicating deliberate use of data-controlled gating and forgetting/retention of memories.
352 word summary
GateLoop is a foundational sequence model that uses fully data-controlled linear recurrence to improve sequence modeling. Existing models have not fully exploited the potential of linear recurrence, and GateLoop aims to fill this gap. The model generalizes linear recurrent models such as S4, S5, LRU, and RetNet by incorporating data-controlled state transitions. GateLoop outperforms existing models for auto-regressive language modeling and offers both a low-cost recurrent mode and an efficient parallel mode. Its findings also carry implications for Transformer and other architectures.
GateLoop operates by applying data-controlled gating to inputs, hidden states, and outputs. It replaces the static state transition with time-varying, data-controlled state transitions, allowing content-aware control over forgetting and retention. The model can be trained efficiently using highly optimized associative scan implementations, and it can also be interpreted as providing data-controlled relative-positional information to attention.
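To make the parallel mode concrete, the following is a minimal sketch of computing a linear recurrence of the form h_t = a_t * h_{t-1} + x_t with jax.lax.associative_scan. It is not the paper's optimized implementation: it assumes a simplified real-valued, elementwise recurrence (the paper uses complex-valued transitions), and the name gateloop_scan is illustrative.

```python
import jax

def gateloop_scan(a, x):
    """All prefix states h_t = a_t * h_{t-1} + x_t (with h_0 = 0), in parallel.

    a: (T, d) data-controlled state transitions (e.g. sigmoid-gated)
    x: (T, d) gated inputs (e.g. x_t = k_t * v_t)
    returns h: (T, d)
    """
    def combine(left, right):
        # Composing the affine maps h -> a1*h + x1, then h -> a2*h + x2,
        # gives h -> (a1*a2)*h + (a2*x1 + x2); this composition is associative.
        a1, x1 = left
        a2, x2 = right
        return a1 * a2, a2 * x1 + x2

    _, h = jax.lax.associative_scan(combine, (a, x))
    return h

# Usage: T=8 steps, d=4 channels; gates near 1 retain, gates near 0 forget.
a = jax.nn.sigmoid(jax.random.normal(jax.random.PRNGKey(0), (8, 4)))
x = jax.random.normal(jax.random.PRNGKey(1), (8, 4))
print(gateloop_scan(a, x).shape)  # (8, 4)
```

Because the combine operator is associative, the scan evaluates all T prefix states in O(log T) parallel depth instead of T sequential steps, which is what makes the efficient parallel training mode possible.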
GateLoop is compared against S4, S4D, LRU, RetNet, Transformer, Hybrid H3, Performer, Reformer, Linear Attention, Transformer-XL, Hyena, and S5-Hyena, and it achieves lower test perplexity than all of them on the WikiText-103 benchmark for autoregressive language modeling.
In addition to its performance advantages, GateLoop offers practical benefits such as avoiding softmax-attention layers, eliminating the need for tedious initialization, and not requiring long implicit convolutions. The model demonstrates the ability to learn to forget memories input-dependently, effectively vacating its hidden state for new relevant information.
The synthetic Memory Horizon dataset is designed to validate the advantage of data-controlled state transitions: models must memorize past input information reaching back to the last reset token. GateLoop with fully data-controlled state transitions significantly outperforms a variant with fixed state transitions in test accuracy, and as the required memory span grows, the fully data-controlled variant maintains its performance over roughly twice the span of the fixed variant. A toy sketch of this kind of task follows below.
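For illustration only, a hypothetical generator in the spirit of such a reset-based memorization task might look like the following; the actual Memory Horizon specification (vocabulary, input format, and targets) is defined in the paper, and every detail below, including the RESET token and the choice of target, is an assumption.

```python
import random

RESET = 0            # assumed id of the reset token
VOCAB = list(range(1, 10))

def make_example(length=32, reset_prob=0.1, seed=None):
    """Generate a sequence; the target is everything seen since the last reset."""
    rng = random.Random(seed)
    seq, since_reset = [], []
    for _ in range(length):
        if rng.random() < reset_prob:
            seq.append(RESET)
            since_reset = []        # memories before a reset become irrelevant
        else:
            tok = rng.choice(VOCAB)
            seq.append(tok)
            since_reset.append(tok)
    return seq, list(since_reset)

seq, target = make_example(seed=0)
print(seq)     # full input sequence, with occasional resets
print(target)  # tokens the model must still be retaining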
GateLoop's state transitions exhibit structured patterns, indicating deliberate use of data-controlled gating and forgetting/retention of memories. Future work can explore different initialization strategies, amplitude and phase activations, and the interpretability of the learned state transitions.
Overall, GateLoop demonstrates the effectiveness of fully data-controlled linear recurrence for sequence modeling, offering improved performance and practical advantages over existing models.